Edgewall Software

Ticket #4465 (closed defect: fixed)

Opened 2 years ago

Last modified 2 months ago

Timeout reached while waiting for a connection

Reported by: ezra@… Owned by: cboos
Priority: highest Milestone: 0.11.2
Component: general Version: 0.10.4
Severity: critical Keywords: db pool
Cc: osimons

Description (last modified by thatch) (diff)

Python Traceback

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 183, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 263, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 227, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/usr/lib/python2.4/site-packages/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/usr/lib/python2.4/site-packages/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/usr/lib/python2.4/site-packages/trac/db/pool.py", line 107, in get_cnx
    raise TimeoutError('Unable to get database '
TimeoutError: Unable to get database connection within 20 seconds

Attachments

test_pool_timeout.py (2.4 KB) - added by cboos 21 months ago.
Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

Change History

  Changed 2 years ago by thatch

  • keywords needinfo added
  • priority changed from high to normal
  • component changed from wiki to general
  • description modified (diff)

A little more information would be helpful. For starters, what db backend are you using? What server method? Does this happen on the first request or only after sitting a while?

  Changed 2 years ago by ezra@…

  • status changed from new to closed
  • resolution set to worksforme

sorry this was on the edgewall wiki itself. it seems to have cleared itself up though

thank you for the response

  Changed 22 months ago by cboos

  • status changed from closed to reopened
  • severity changed from normal to major
  • priority changed from normal to high
  • milestone set to 0.10.4
  • keywords db pool weird added; needinfo removed
  • resolution worksforme deleted

That error happens now systematically when trying to load a bunch of pages simultaneously. Once a thread gets stuck on the self._available.wait(), all the other threads seem to get stuck as well, and then nothing progresses until the timeout is reached. At that point, things resume normally. Weird.

  Changed 21 months ago by cboos

  • owner changed from jonas to cboos
  • status changed from reopened to new
  • severity changed from major to blocker

Actually, this is the other main problem we have now in Trac, together with #4586.

  Changed 21 months ago by cboos

  • summary changed from TracPermissions wiki down to Timeout reached while waiting for a connection

The problem is not really with the fact that one Request can timeout while waiting for a connection, it's rather the fact that when one thread starts to wait, all activity is frozen. It seems that no other thread gets back the control... So in fact, the waiting thread has no chance to ever get a connection, as no other thread can terminate and release the connection. The waiting thread has no "choice" but wait until the timeout and exit with an error. At this point, the other threads seem to proceed normally.

I've seen this on both Linux and Windows, and I've not yet understood completely the reasons for this behavior.

  Changed 21 months ago by cboos

  • keywords weird removed
  • status changed from new to assigned

Ok, "weirdness" elucidated.

The problem comes from the interference between the connection pool and the repository cache.

Inside the repository cache (i.e. with the repo cache lock held), if we don't already have a repository object available, we try to create a new one. In case this implies a CachedRepository (as with Subversion), this will in turn try to acquire a connection from the pool.

If we don't already have a connection and if the pool is empty (all connections are in use), we wait on a timeout there in hope that the other threads will proceed and release their connection to the pool. However, it's very likely that before they'll do so, they require a repository object for themselves... deadlock.

A simple solution would be to make sure that we already have the db connection before trying to get a repository from the cache and getting hold of the repository cache lock.

Changed 21 months ago by cboos

Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

  Changed 21 months ago by cboos

  • status changed from assigned to closed
  • resolution set to fixed

Fixed in r4886 (trunk) and r4887 (0.10-stable).

  Changed 19 months ago by datallah@…

  • version changed from 0.10.3 to 0.10.4

I'm still running into this after upgrading to 0.10.4.

We're the fastcgi interface with lighttpd and a postgres db on a Debian etch box. I haven't changed any connection pooling settings. The www-data user is allowed 1024 open files.

I'm not sure if this helps to track down the issue, but I currently have version control disabled: trac.versioncontrol.* = disabled

The stack traces are across the board - anything that tries to get a db connection fails.

Please let me know what other info I can provide.

  Changed 19 months ago by eblot

  • status changed from closed to reopened
  • resolution fixed deleted

(re-opening the ticked to match the latest comment)

  Changed 13 months ago by cboos

  • keywords verify added
  • severity changed from blocker to critical
  • milestone changed from 0.10.4 to 0.10.5

#5335 and #6188 report identical backtraces. I think there could be a real problem here... though I guess this will be difficult to reproduce.

  Changed 13 months ago by cboos

... and #5536 as well, though the backtrace is slightly different.

  Changed 13 months ago by anonymous

  • keywords really added

  Changed 12 months ago by cboos

  • milestone changed from 0.10.5 to 0.11.1

Update: r4886 is no longer current (see r6124) and despite of that, I'm unable to reproduce the deadlocks in Trac, the attachment:test_pool_timeout.py sample works fine for any number of threads.

  Changed 12 months ago by cboos

  • keywords verify really removed
  • priority changed from high to highest
  • status changed from reopened to new
  • milestone changed from 0.11.1 to 0.11

#6405 was closed as duplicate. That ticket contained a detailed report showing that this issue is still present with the very latest version of everything, when the server is tracd on Linux.

That prompted me to do some further tests, and indeed I was able to reproduce the same timeout issue, with tracd on Linux. The only difference in my initial tests was that after the timeout, it was possible to resume working without having to restart the server.

I then tried again with r6124 reverted (i.e. reapplying the r4886 fix) and that definitely helped in making the timeout errors much less likely.

However this doesn't prevent the timeout situation completely, in particular when one do a reload all tabs while they're not yet fully loaded. Even worse, when doing so, it seems that the server can really be put in a state where all later requests will fail with the timeout.

  Changed 12 months ago by cboos

  • status changed from new to closed
  • resolution set to fixed

Ok, so in r6260 I've restored the r4886 fix, which is kind of bad as I really wanted to get rid of that. It appears that for some reason it's really needed.

Now, even with that change, I was still able to trigger the "persistent" timeout issue, specifically when triggering lots of "broken connection" errors. This made me think that the env.shutdown(tid) done at the end of request processing was maybe not called every time.

Indeed, when making sure that the env.shutdown(tid) gets always called, with r6261, the persistent timeout issue didn't happen again.

I've not checked yet if there's something to backport to 0.10.5dev, but that appears less critical now that the beta1 is about to be released.

  Changed 2 months ago by osimons

  • cc osimons added
  • status changed from closed to reopened
  • resolution fixed deleted
  • milestone changed from 0.11 to 0.11.2

I still get this TimeoutError at sporadic intervals, and there is something not quite right. Once one TimeoutError is reached there is no helping the site apart from a restart - all projects and all requests hang (Apache Windows single-process setup).

Reading up on threading I see that the pool code calls wait(), but there is no code anywhere that actually calls notify() to wake them up again. With the help of rblank on IRC, this patch was put together that I'm currently testing:

  • trac/db/pool.py

    a b  
    130130                self._pool.append(cnx) 
    131131                self._pool_key.append(key) 
    132132                self._pool_time.append(time.time()) 
     133                self._available.notify() 
    133134            elif num == 1: 
    134135                del self._active[(tid, key)] 
     136                self._available.notify() 
    135137            else: 
    136138                self._active[(tid, key)] = (cnx, num - 1) 
    137139        finally: 

If anyone else sees regular TimeoutError, it may be well worth trying the patch to see if it helps. Further insight and testing much appreciated...

  Changed 2 months ago by cboos

Out of curiosity, what happens when you try attachment:test_pool_timeout.py, are you able to reproduce the freeze when using more threads than the default available connections (10 now, IIRC)? I wasn't, but maybe this program was testing a different condition than the one you're dealing with now.

  Changed 2 months ago by osimons

Tried it, and here is findings using 250 threads:

  • Without patch I get consistently longer periods without output (everything waiting). With patch there is continuous output from the script.
  • Without patch and a timeout setting of 20, i get quite a lot of timeout errors. Adding the patch does not visibly reduce the amount of errors markedly, but quite a few of them also additionally show database is locked + timeout messages - ie. it throws the error for the wrong reason, I suppose.
  • Setting timeout to 0 in trac.ini and using patch gives continuous output with occasionally errors due to database is locked, but all these threads acquire cnx shortly after.
  • I tried changing the pool size, but did not find any particular changes from increasing it - except if it got too high (100) it increased errors. However, with just one project in the test script, this change may not be effective anyway - perhaps this increase will make more sense across a larger multi-project setup with many environments.

Looking at the pool code, I see that without the explicit call to notify() (patch) only 2 attempts at getting a connection will ever be made: First on entry, and if it doesn't get it it will wait for the full timeout period regardless of available connections in the meantime, then it will get woken up and tries one more time. If no connection is available then, it will hit the time-test and fail this due to fractionally more time than timeout will have passed.

I think the patch improves things, and I'm now using it + timeout = 0 but haven't yet changed pool size. We'll see how it all works out.

  Changed 2 months ago by jonas

Good catch osimons. Without calling notify() everytime a pool slot is made available we risk raising TimeoutErrors in some high load situations.

Here's an updated patch with a small restructuring of _return_cnx to make it easier to follow and adding another notify() call to the _shutdown method.

  • trac/db/pool.py

     
    125125        try: 
    126126            assert (tid, key) in self._active 
    127127            cnx, num = self._active[(tid, key)] 
    128             if num == 1 and cnx.poolable and try_rollback(cnx): 
     128            if num == 1: 
    129129                del self._active[(tid, key)] 
    130                 self._pool.append(cnx) 
    131                 self._pool_key.append(key) 
    132                 self._pool_time.append(time.time()) 
    133             elif num == 1: 
    134                 del self._active[(tid, key)] 
     130                self._available.notify()  
     131                if cnx.poolable and try_rollback(cnx): 
     132                    self._pool.append(cnx) 
     133                    self._pool_key.append(key) 
     134                    self._pool_time.append(time.time()) 
    135135            else: 
    136136                self._active[(tid, key)] = (cnx, num - 1) 
    137137        finally: 
     
    149149                self._pool.pop(0) 
    150150                self._pool_key.pop(0) 
    151151                self._pool_time.pop(0) 
     152                self._available.notify()  
    152153        finally: 
    153154            self._available.release() 
    154155 

follow-up: ↓ 21   Changed 2 months ago by jonas

I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.

in reply to: ↑ 20   Changed 2 months ago by osimons

Replying to jonas:

I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.

Right, same thing rblank said when we discussed it on IRC.

  Changed 2 months ago by jonas

  • status changed from reopened to closed
  • resolution set to fixed

I've committed the changes in [7537,7538] so I'm closing this ticket again, thanks guys.

Add/Change #4465 (Timeout reached while waiting for a connection)

Author



Change Properties
<Author field>
Action
as closed
Next status will be 'reopened'
to The owner will change from cboos. Next status will be 'closed'
 
Note: See TracTickets for help on using tickets.