Edgewall Software

Opened 6 months ago

Last modified 6 months ago

#13634 closed defect

cache invalidation for Environment::get_known_users can cause DB temp space exhaustion with MySQL backend — at Initial Version

Reported by: Klaus Espenlaub <klaus.espenlaub@…> Owned by:
Priority: normal Milestone: 1.6.1
Component: database backend Version: 1.4.3
Severity: normal Keywords: mysql temporary table known_users disk full
Cc: klaus.espenlaub@… Branch:
Release Notes:
API Changes:
Internal Changes:

Description

On larger Trac deployments (ours has several backend servers, each with multiple worker processes) with frequent accesses the cache invalidation logic essentially implements a trigger for a "thundering herd": when the generation counter is increased then all worker processes will rather quickly run the query in Environment::_known_users. Looks all normal and harmless on first sight, but:

MySQL creates a temporary table each for such SELECT DISTINCT ... ORDER BY ... queries. What makes it worse is that the query has columns of type mediumtext and this forces the temporary table to go straight to disk. As long as the DB server CPU, IO and memory load is low this doesn't cause noticeable issues. With our approx. 200000 known users results in a temp table of about 8MB (taking about 2 seconds for the query which isn't too annoying because almost always it is cached). If this gets multiplied by ~35 (the number of Trac WSGI workers in our setup for a busy situation) then it still works just fine, just causes some slowdown due to IO. When there is other activity (such as viewing source with 'blame' etc.) then the refreshing of the known users can get to DB connections timing out.

That's the "point of no return": the query wasn't finished yet but the worker will issue another one. This growing avalanche of queries (all limited by IO performance and needing 8MB of temp table space each) will exhaust the disk space of the DB server at the maximum IO performance, using all CPU and memory resources on the way too. No wonder with about 1500 such queries running simultaneously at the end (they build up over 45 to 60 minutes in our case).

Once the disk space is exhausted the MySQL server goes into "offline mode", needing a server restart (until that happens Trac is obviously down). After the restart everything is perfectly working again. Temporary tables are automatically removed, therefore the disk usage is back to normal.

Proposed solution: remove the DISTINCT qualifier from this query. Due to the design of the session and session_attribute tables I can't see how it could ever produce non-unique results.

Reference: MySQL 8.0 documentation about internal use of temporary tables

Change History (0)

Note: See TracTickets for help on using tickets.