Edgewall Software

Ticket #3446 (assigned defect)

Opened 2 years ago

Last modified 19 months ago

Rate of `database is locked` errors is too high

Reported by: cboos Owned by: cboos
Priority: high Milestone: not applicable
Component: general Version: devel
Severity: major Keywords: database lock pysqlite
Cc: trac@…, carlos@…

Description

I still get way too many database is locked errors. This morning, I simply tried to log in and the server was not loaded (approx. half of the threads were "_" Waiting for connection).

Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 314, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 186, in dispatch
    req.session = Session(self.env, req)
  File "/usr/lib/python2.3/site-packages/trac/web/session.py", line 52, in __init__
    self.promote_session(sid)
  File "/usr/lib/python2.3/site-packages/trac/web/session.py", line 125, in promote_session
    "AND authenticated=0", (sid,))
  File "/usr/lib/python2.3/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.3/site-packages/trac/db/sqlite_backend.py", line 44, in execute
    args or [])
  File "/usr/lib/python2.3/site-packages/trac/db/sqlite_backend.py", line 36, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked

Jonas/Christopher, can one of you attach a recent gzipped dump of the t.e.o Trac log here? I'd like to make some stats about the locks frequency. Ideally I'd like to understand why this still happens so frequently and find ways to improve the situation...

Beside locks, I'm also frequently seeing blank pages when going on the Markup Trac. But probably that's a different issue.

Attachments

Change History

  Changed 2 years ago by cboos

#3536 marked as duplicate.

Quite obviously, this is not only an issue with Trac's Trac, but a more general one due to our database access patterns... which currently limit our concurrency to at most one write access each few seconds (if nobody starts a long read query in the meantime, of course ...).

Beside locks, I'm also frequently seeing blank pages when going on the Markup Trac. But probably that's a different issue.

This has been filed as #3455.

  Changed 2 years ago by cboos

Some random documents about SQLite and concurrent writes, for reference:

Also, don't forget about locks triggered by concurrent writes, which we should now handle correctly (#2170 and #ps126).

  Changed 2 years ago by cboos

... and we also shouldn't write into the db when it's not necessary: see r3616.

It's probably worth to upgrade t.e.o, in order to see how much this improves the situation.

  Changed 2 years ago by cboos

  • owner changed from jonas to cboos

Well, despite r3616 we still have a high rate of "database is locked" error.

See #3503 for a short-term solution (an nice TracError explaining what happens instead of a backtrace).

Long-term, I think we should do the following:

  • Shorten our transactions. In particular, a main reason for the problem seems to be the long periods of time during which SHARED locks are held. The common programming idiom in Trac is to iterate over cursors, performing some operation for each of the returned row. While this may sounds a good approach which lowers memory consumption, it lengthens the period during which a SHARED lock is held and during which not a single write can be done. We should instead read everything in memory first, then process the data. Of course, if there are some very straightforward tasks (like filtering) that can be done while reading the data, it could still be done by iterating over the cursor.
  • Also, if there are situations where not all the rows from a cursor are fetched, the cursor should be closed, so that the SHARED lock can be release as soon as possible.
  • Repeatable transactions: instead of doing writes into the database as soon as we feel like doing so, we could pass a callback to the database layer which will be used for writing all the information. The advantage of this would be to be able to retry transparently the write sequence if a database is locked exception is detected. This should also play well with the journalling approach described in TracDev/Proposals/Journaling.

follow-up: ↓ 6   Changed 2 years ago by wreese@…

Did you guys try setting the busy_timeout on the SQLite connection? It will cause a process to wait for a lock (up to the timeout) before throwing the database is locked exception. The wait is usually only a few milliseconds. It is a super easy way to get your "Repeatable transactions".

I just submitted a patch to Rails (http://dev.rubyonrails.org/ticket/6126/) for the same functionality. :)

-- Will Reese

in reply to: ↑ 5   Changed 2 years ago by anonymous

Replying to wreese@gmail.com:

Did you guys try setting the busy_timeout on the SQLite connection?

The timeout is set, that helps a bit, but it's by no mean a miracle solution to the issue... it can even introduce its own problems, like the dead-locks on concurrent writes. If you wish, you can have a look at the various documents I've given pointers to in comment:2.

  Changed 2 years ago by anonymous

  • cc trac@… added

I've also seeing this quite a bit.

  Changed 2 years ago by anonymous

I have noticed that even this site having the same problem. Seems to be happening more any more.

follow-up: ↓ 10   Changed 2 years ago by cboos

  • milestone set to 2.0

What should perhaps be made clear is that this issue is only seen with the PySqlite DatabaseBackend. If the locks are becoming an issue, an option is to convert the database to use another backend (e.g. th:SqliteToPgScript). If you know beforehand that you will get some high traffic and have an important number of developers that will use your Trac, you can also directly create the Trac environment with one of the alternative database backend.

SQLite has by design been created to allow multiple process to access independantly to the same database file, even in theory by processes running on different machines. This requires the use of OS-level file locking facilities (flock) and basically you can't have more than one thread or process writing to the database at the same time, and a write can happen only when there's no longer any thread or process reading the database.

We are nevertheless interested in getting the situation improved for SQLite. Besides shortening the read and write periods as outlined above, I also plan to check whether we can gain some performance by sharing connections between threads, which is possible with SQLite >= 3.3.1 (see SQLite:MultiThreading). OTOH, the shared cache mode doesn't seem to be able to help here, as this is only about sharing a cache between different connections in the same thread and we don't have more than one connection per thread.

Finally, there's also a hint about making use of temporary table, in the abovementioned wiki page:

When you use temporary tables, the main database is not locked (...) By creating a temporary table containing the results of a large query for processing, rather than processing it directly out of the main database, you greatly reduce lock contentions.

Using in-memory temporary tables should be fast.

in reply to: ↑ 9   Changed 2 years ago by cboos

Replying to myself:

I also plan to check whether we can gain some performance by sharing connections between threads, which is possible with SQLite >= 3.3.1 (see SQLite:MultiThreading).

Implemented in r3830. Not only does it seem to be faster and still reliable, but I've not been able to trigger a single lock so far! More testing welcomed.

  Changed 2 years ago by cboos

  • status changed from new to assigned
  • milestone changed from 2.0 to 0.10.1

I've seen only good things so far after r3830, so I propose to backport this to [0.10.1]. Any objection?

  Changed 2 years ago by anonymous

Please backport this.

  Changed 2 years ago by cboos

r3830 is really a big plus for avoiding locks; given a set of test requests that I launch simultaneously (i.e. tabs in Firefox -> reload all tabs), most of them doing session updates, some others simply doing time consuming queries, I get:

  • repeatably 1 or 2 "Database is Locked" errors without r3830, among all the loaded tabs, at each Reload All action
  • not a single "Database is Locked" errors so far with r3830 (and I tried hard ;)

My current theory about this is that SQLite has a very high likelihood of getting locks while creating connections for the first time. Given the complexity of the hot journal checks (see http://www.sqlite.org/lockingv3.html), that would be not that surprising and I guess there are optimizations in this procedure when the connection is not closed.

Now all seems good with pooling Connection objects, except that it's perhaps the cause of crashes in SQLite itself, see #3790, in particular the last two reports. We'll see...

  Changed 2 years ago by cboos

  • keywords needinfo added
  • milestone changed from 0.10.3 to 0.10.1

Now that the abovementioned fixes are in 0.10.2 (0.10.1 even), I'd be interested to get feedback from people who used to experience the database is locked error, in order to check that this has really been fixed.

follow-up: ↓ 16   Changed 2 years ago by ed.pasma@…

Indeed the database is locked error can no longer be reproduced. I tried this hard too, with the help of a script http://home.wanadoo.nl/italy4ever/trac/r3446.py. It can put any load of requests on the trac server. The findings are exactly as you described. No more locks. Only the IntegrityError?: columns sid, authenticated are not unique. The script reproduces this if you add a sleep time of some seconds between the requests. May be an academic case. Good news is that I did not see any segfaults or errors like that on Windows 2000. Only once, a python message ended in the Apache access.log, which is weird:

Exception <ProgrammingError instance at 0x020E8558> in <bound method PooledConnection.__del__ of <trac.db.pool.PooledConnection object at 0x0194B4B8>> ignored

If you still wish to explain why the locking errors no longer occur, I may have a theory. The main difference in the sql processing is, as I see it, that by reusing PySQLite databasae connections, the PySQLite statement cache is kept intact. Therefore the sqlite prepare phase is skipped most of the time. This is not the case otherwise, as Trac constantly disconnects and reconnects and hence destroys the statement cache.

This has the following effect on the locking phases. First here is the old situation:

- begin statement (by PySQLite) -- still UNLOCKED
- sqlite3_prepare -- SHARED
- sqlite3_step -- RESERVED
- commit -- PENDING/EXCLUSVE/UNLOCKED

If two users happen to be in the prepare phase at the same time, only one can promote his lock from SHARED to RESERVED. The other will get a deadlock type of database is locked error. The chance of two users being in the prepare phase at the same time is small, considering the fraction of a milisecond that it takes. But I suspect there is a dependency.

In the new situation, the sqlite3_prepare step is mostly skipped. So a lock is promoted from UNLOCKED to RESERVED. This seems not vulnarable to deadlock.

in reply to: ↑ 15 ; follow-up: ↓ 17   Changed 2 years ago by anonymous

Replying to ed.pasma@wanadoo.nl:

Indeed the database is locked error can no longer be reproduced.

Glad to hear that!

Exception <ProgrammingError instance at 0x020E8558> in <bound method PooledConnection.__del__ of <trac.db.pool.PooledConnection object at 0x0194B4B8>> ignored

Hm, this probably comes from close being called twice on the same connection, from within the destructor. Which revision were you using?

If you still wish to explain why the locking errors no longer occur, I may have a theory. The main difference in the sql processing is, as I see it, that by reusing PySQLite databasae connections, the PySQLite statement cache is kept intact.

Yes, this may play a big role.

In the new situation, the sqlite3_prepare step is mostly skipped. So a lock is promoted from UNLOCKED to RESERVED. This seems not vulnarable to deadlock.

I think the SHARED lock must always be acquired (see http://www.sqlite.org/lockingv3.html 5.0).

in reply to: ↑ 16   Changed 2 years ago by ed.pasma@…

Exception <ProgrammingError instance at 0x020E8558> in <bound method PooledConnection.__del__ of <trac.db.pool.PooledConnection object at 0x0194B4B8>> ignored

Hm, this probably comes from close being called twice on the same connection, from within the destructor. Which revision were you using?

It was trac version 0.10.2. But may be it is less important. I see from the logs that this occured while stopping the apache server and while it was busy.

In the new situation, the sqlite3_prepare step is mostly skipped. So a lock is promoted from UNLOCKED to RESERVED. This seems not vulnarable to deadlock.

I think the SHARED lock must always be acquired (see http://www.sqlite.org/lockingv3.html 5.0).

You are right here and I wrong. But what about the whole theory?

follow-up: ↓ 19   Changed 2 years ago by cboos

  • priority changed from normal to high
  • milestone changed from 0.10.1 to 0.10.4

Unfortunately, it looks like r3830 is causing crashes on Linux (64-bits, Python 2.4.2, sqlite-3.3.8, PySqlite-2.3.2).

I think I'll disable this feature, except on Windows where it's working great.

OTOH, even without r3830 on Linux, it seems to be still difficult to trigger a database is locked error.

in reply to: ↑ 18   Changed 2 years ago by ed.pasma@…

Hello Christian, If the locking issue might again turn up after the connection-pooling gets disabled, there is no need to worry. I have a very simple patch to pysqlite that definitely fixes it. It is already in the tracker, see http://initd.org/tracker/pysqlite/ticket/184. May be we should raise its priority. Thanks, Edzard Pasma

  Changed 2 years ago by cboos

  • milestone changed from 0.10.4 to none

Thanks for the suggestion, Edzard. I indeed saw pysqlite:ticket:184, but didn't realize that you made your testings with Trac!

I think this is issue is more or less solved on the Trac side. We simply need to wait for the time we can revert r4493, i.e. when pysqlite:ticket:187 gets fixed, or, as you said, if/when your patch in #184 gets integrated.

So I'm now post-poning the ticket, waiting for external resolution.

Finally, while testing 2.3.2 on Linux (with the pooling enabled), I've noticed for the first time that I got stuck at the get_db lock's timeout. I'm not sure this is a bug in the pool code (again...) or a side-effect of the instabilities related to the connection pooling.

follow-up: ↓ 22   Changed 2 years ago by ed.pasma@…

Yes, even with pooled connections you may run into database locks. I would hope that this occured when you run the test right after starting the server. Then it can be explained because there are no pooled connections available yet (whatever the further theory is).

in reply to: ↑ 21   Changed 2 years ago by cboos

Replying to ed.pasma@wanadoo.nl:

... I would hope that this occured when you run the test right after starting the server. Then it can be explained because there are no pooled connections available yet (whatever the further theory is).

Right, this is what is happening, when I do a Reload All Tabs in the web browser after I just started tracd.

  Changed 2 years ago by anonymous

If you wish to try the pysqlite patch for this, i uploaded it to http://home.wanadoo.nl/italy4ever/trac/cursor.2.3.2%23184.c.

  Changed 2 years ago by anonymous

  • cc carlos@… added

I've been able to reliably repeat the database lock error on Trac 0.10.2 and WebAdmin r4240.

I go to the admin page -> Permissions. Whatever alteration I do there results in a traceback similar or equal to the one below.

I don't exactly know what triggered this behaviour, it was working fine before.

2006-12-28 14:51:37,862 Trac[main] ERROR: database is locked
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 238, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/TracWebAdmin-0.1.2dev_r4240-py2.4.egg/webadmin/web_ui.py", line 109, in process_request
    path_info)
  File "/usr/lib/python2.4/site-packages/TracWebAdmin-0.1.2dev_r4240-py2.4.egg/webadmin/perm.py", line 60, in process_admin_request
    perm.revoke_permission(subject, action)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 198, in revoke_permission
    self.store.revoke_permission(username, action)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 152, in revoke_permission
    (username, action))
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 50, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/src/build/539311-i386/install//usr/lib/python2.4/site-packages/sqlite/main.py", line 255, in execute
OperationalError: database is locked

This is another traceback I get less often (this stops on commit, instead of execute):

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 238, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/TracWebAdmin-0.1.2dev_r4240-py2.4.egg/webadmin/web_ui.py", line 109, in process_request
    path_info)
  File "/usr/lib/python2.4/site-packages/TracWebAdmin-0.1.2dev_r4240-py2.4.egg/webadmin/perm.py", line 60, in process_admin_request
    perm.revoke_permission(subject, action)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 198, in revoke_permission
    self.store.revoke_permission(username, action)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 154, in revoke_permission
    db.commit()
  File "/usr/src/build/539311-i386/install//usr/lib/python2.4/site-packages/sqlite/main.py", line 539, in commit
OperationalError: database is locked

I just can't use WebAdmin Permissions management anymore. As a note, I made sure there was no one else using Trac at the time.

Would upgrading/patching pysqlite solve it?

Well, I can't way until SqlAlchemy goes to trunk so I can use Oracle.

  Changed 19 months ago by cboos

  • keywords needinfo removed

Note: this ticket is for discussing long term plans for improving the behavior of Trac using PySqlite in presence of concurrent access, not for reporting once again the OperationalError: database is locked issue...

Add/Change #3446 (Rate of `database is locked` errors is too high)

Author



Change Properties
<Author field>
Action
as assigned
as The resolution will be set. Next status will be 'closed'
to The owner will change from cboos. Next status will be 'new'
 
Note: See TracTickets for help on using tickets.