Edgewall Software

The problem of cache invalidation

This was the design discussion behind the trac.cache module (trunk, 1.0.x). See also apidoc:api/trac_cache.

Problem

Trac uses various caches at the component level, in order to speed-up costly tasks. Some examples are the recent addition of ticket fields cache (#6436), others are the InterMapTxt cache, the user permission cache, the oldest example being the Wiki page cache.

Those cache are held at the level of Component instances. For a given class, there's one such instance per environment in any given server process. The first thing to take into account here is that those caches must be safely accessed and modified when accessed by concurrent threads (in multi-threaded web front ends, that is). That's not a big deal, and I think it's already handled appropriately.

But due to the always possible concurrent access at the underlying database level by multiple processes, there's also a need to maintain a consistency and up-to-date status of those caches across all processes involved. Otherwise, you might do a change by the way of one request and the next request (even the GET following a redirect after your POST!) might be handled by a different server process which has a different "view" of the application state, and you end up confused at best, filing a bug on t.e.o at worst.

This doesn't even have to imply a multi-process server setup, as all what is needed is e.g. a modification of the database done using trac-admin.

This proposal was previously logged as wiki:TracDev/Proposals/Journaling.

Current Situation

So the current solution to the above problem is to use some kind of global reset mechanism, which will not only invalidate the caches, but simply "throw away" all the Component instances of the environment that has been globally reset. That reset happens by the way of a simulated change on the TracIni file, triggered by a call to self.config.touch() from a component instance. The next time an environment instance is retrieved, the old environment instance is found to be out of date and a new one will be created (see trac.env.open_environment). Consequently, new Component instances will be created as well, and the caches will be repopulated as needed.

Pros:

  • it works well

Cons:

  • it's a bit costly - though I've no numbers on that, it's easy to imagine that if this full reset happens too frequently, then the benefits from the caches will simply disappears. In the past, when the reset rate was abnormally high due to some bug, the performance impact was very perceptible.
  • it's all or nothing - the more we rely on this mechanism for different caches, the more we'll aggravate the above situation. Ideally, invalidating one cache should not force all the other caches to be reset.

Final implementation

The final implementation, committed in [8071], is a refinement of the CacheManager idea. The documentation for this feature (starting with the content of this section) now lives in TracDev/CacheManager.

  • Creating a cached attribute is done by defining a retrieval function and decorating it with the @cached_value decorator. For example, for the wiki page names:
    @cached_value
    def pages(self, db):
        """Return the names of all existing wiki pages."""
        cursor = db.cursor()
        cursor.execute("SELECT DISTINCT name FROM wiki")
        return [name for (name,) in cursor]
    
  • Invalidating a cached attribute is done by deleting the attribute:
    def wiki_page_added(self, page):
        if not self.has_page(page.name):
            del self.pages
    
  • If more control is needed, for example to invalidate an attribute within an existing transaction, the attribute should be decorated with the @cached decorator. Accessing the attribute then yields a proxy object with two methods, get() and invalidate(), taking an optional db argument. For example, this is used in the case of ticket fields to invalidate them in the same transaction as e.g. an enum modification.
  • The cache is consistent within a request. That is, a cached attribute will always have the same value during a given transaction. Obviously, cached values should be treated as immutable.
  • The CacheManager component contains all the logic for data retrieval, caching and invalidation. Cache invalidation across processes is done by incrementing a generation counter for the given attribute in the cache database table. The invalidation granularity is at the attribute level.
  • There are two cache levels:
    • A thread-local (per-request) cache is used to minimize locking and ensure that the cached data is consistent during a request. It is emptied at the beginning of every request.
    • A process cache keeps retrieved data as long as it has not been invalidated.
  • The cache table is read the first time a cached attribute is accessed during a request. This avoids slowing down requests that don't touch cached attributes, like requests for static content for example.

The sections below are kept as documentation of the implementation process.

Idea 1: The CacheManager

This idea introduces a centralized cache manager component that manages cached data, retrieval from the database and invalidation of cached data. The assumption is that it doesn't make sense to retrieve data to be cached from the database more than once per HTTP request.

Every cache is identified by a unique identifier string, for example 'ticket.fields' and 'wiki.InterMapTxt'. It has an associated retrieval function that populates the cache if required, and a generation number that starts at 0 and is incremented at every cache invalidation.

A new table in the database stores the cache identifiers, along with the current generation number and possibly the time of the last invalidation (for timed cache invalidation). The schema would be something like:

Table('cache', key='id')[
    Column('id'),
    Column('generation', type='int'),
    Column('time', time='int'),
]

So how is the cache used?

  • HTTP request: At the beginning of every HTTP request, the complete cache table is read into memory. This provides the CacheManager with the current state of the database data. Timed invalidation could also be done at this point, by dropping cached data that is too old.
  • Retrieval of cached data: The CacheManager can be queried for a reference to a cache. At this point, it checks if the generation number of the cached data matches the number read at the start of the HTTP request. If it does, the cached data is simply returned. Otherwise, the cached data is discarded, the retrieval function is called to populate the cache with fresh data, and the data is returned.
  • Invalidation of cached data: Invalidation of cached data is done explicitly after updating the database by incrementing the generation number for the cache in the cache table, in the same transaction as the data update, and invalidating the currently cached data in the CacheManager.

Pros:

  • Caches are managed in a single place, and the cache logic is implemented once and for all. This should avoid bugs due to re-implementing cache logic for every individual cache.
  • Cached data is consistent for the duration of an HTTP request.
  • Caches can be made fine-grained. For example, it may be possible to use separate caches for the values of every ticket field (not sure we want that, though). Invalidation is fine-grained as well.

Cons:

  • One additional database query per HTTP request. I don't know how much impact this can have, but I would expect this to be negligible, as the cache table should never grow past a few dozen rows.
  • Caches must be invalidated explicitly. The same drawback applies to the current situation, so nothing is lost there.

Open questions:

  • This strategy should work well in a multi-process scenario. In a multi-thread scenario, proper locking must ensure that cached data is not modified during a request. It may be possible to use thread-local storage to ensure that a single request has a consistent view of the cache, even if a second thread invalidates the cache.

Comments and improvements are welcome. If this approach sounds reasonable, I'd like to do a prototype implementation and apply it to a few locations (the wiki page cache and ticket fields).

Prototype implementation

A prototype implementation of this approach is available in cache-manager-r7941.patch. It implements the cache manager and has been applied to the wiki page, InterMapTxt and ticket field caches.

  • Creating a cached attribute is extremely simple: declare a retrieval function with the desired name of the attribute and apply the @cached_value or @cached decorator. For example, for the wiki pages:
    @cached_value('wiki.WikiSystem.pages')
    def pages(self, db):
        """Return the names of all existing wiki pages."""
        cursor = db.cursor()
        cursor.execute("SELECT DISTINCT name FROM wiki")
        return [name for (name,) in cursor]
    
  • Invalidating a cached attribute is done by deleting the attribute:
    def wiki_page_added(self, page):
        if not self.has_page(page.name):
            del self.pages
    
  • If more control is needed, for example to invalidate an attribute within an existing transaction, create the attribute with the @cached decorator. Accessing the attribute then yields a proxy object with two methods, get() and invalidate(), taking an optional db argument. This is used in the case of ticket fields, to invalidate them in the same transaction as e.g. an enum modification.
  • The cache is consistent within a request. That is, a cached attribute will always have the same value during a single transaction. Obviously, cached values should be treated as immutable.
  • The cache table is read the first time a cached attribute is accessed during a request. This avoids slowing down requests that don't touch cached attributes, like requests for static content for example.
  • To test the patch, the table cache must be created by hand in the database, as follows (for SQLite):
    CREATE TABLE cache (
        key TEXT PRIMARY KEY,
        generation INT
    );
    
  • Two new trac-admin commands allow listing the cache table and invalidating one or more caches. They are intended for debugging purposes, and should probably be removed if the proposal is applied to trunk.

Discussion

Comments and testing are very welcome. The implementation is quite complete, except for the missing database upgrade module. I have only tested with several concurrent tracd instances so far.

cboos: feedback

It's really nice, I like it a lot!

When reviewing the code, I think I've detected some possible issues in CacheManager.get.

  • in case there are multiple "out-of-date" threads, each might trigger a retrieval. An improvement would be to check if the CacheManager already has a "newer" generation.
  • in the locked section, if the generation increases after the cached value retrieval and before the fetch of the latest generation, the CacheManager may think it is up to date yet have old data.

Those are admittedly corner cases, I hope I have not missed more important issues while focusing on that ;-) See the first patch attachment:cache-manager_get-corner-cases.patch.

Another little improvement I'd like to propose is to not have to bother with key names and let the decorators figure out the key from the method itself. See attachment:cache-manager-automatic-key.patch. I've also added more documentation to the decorators and changed the order of the definitions in a top-down way (first the decorators, then the descriptors, ending with the proxy class), as I think it's easier to understand that way.

rblank:

Thanks for the improvement ideas, I'll integrate them shortly.

  • I'm not sure the DB race condition you describe can actually happen. At least with SQLite, issuing a SELECT sets the lock state to SHARED, which disallows writes, so it should not be possible to increase the generation between the data retrieval and fetching the generation. I don't know how this behaves with other databases, though. Maybe it's just safer to fetch the generation first.

  • You're right about automating the cache key generation. I didn't want to do it first, because renaming a module or class would have changed the key. But we're not going to rename them, and even if we do, it will only leave an orphaned row in the cache table, so it's no big deal. Your patch proposed {module}.{function} as the key, I'd like to make it {module}.{class}.{function}.
  • If the keys are auto-generated, the decorators don't need any arguments anymore. This allows simplifying them even more by dropping the cached() and cached_value() functions, and calling the descriptor classes cached and cached_value directly.

cboos:

  • SELECT statements don't start a transaction in PySqlite, as opposed to the other DML statements. So in my understanding, each retrieval is "atomic" and I think there can indeed be a race condition between the SELECT(s) done for data retrieval and the SELECT for fetching the generation. As this picked my curiosity, I tried to see how multiple SELECTs could be done within a single transaction, and this is indeed possible, but a bit heavyweight: see e.g. pysqlite:IdRange, look for def get_id_range. So I think it's better to simply cope with the race.
  • simple oversight on my part; sure, make the class name part of the key.
  • great, I didn't know one can do that

cboos: CacheManager used in non-Component classes

When thinking about how to use the CacheManager for getting the youngest_rev information from CachedRepository, I saw two additional problems:

  • we can't use the decorators here, as the CachedRepository is not a Component (and shouldn't be, as we may have many instances per environment)
  • so far we have avoided propagating the env to the CachedRepository. I think we can no longer afford to do this, if we want to access the CacheManager conveniently. Having the env available would also simplify the getdb stuff.

So we need to access the CacheManager directly, using something like:

    @property
    def metadata(self):
        CacheManager(self.env).get('CachedRepository.metadata:'+self.name, self.get_metadata)

    def get_metadata(self, db): 
        # SELECT * FROM repository WHERE id=self.name

Do you see a better way to do it?

rblank:

Yes, by instantiating CacheProxy in the constructor and storing it as an instance attribute. This gives it the same interface as if @cached was used.

    self.metadata = CacheProxy('CachedRepository.metadata:' + self.name, self.get_metadata, env)

This does indeed require env, and changing that will make the CachedRepository unit tests a bit more complicated :-/

rblank: Update with feedback

The attachment:cache-manager-r7989.patch is an updated patch which should take into account all corner cases described above. Cache keys are now auto-generated from the module, class and attribute name. I have also added the database upgrade code, so the db_version is now 22.

Are there any other issues that should be considered? If not, the next step would be to plan the integration into trunk. Are there any special considerations when upgrading the database version? What else (other than committing) must be done?

cboos: feedback

  • Last round of feedback:
    • The API documentation should also mention that cached and cached_value must be used within Component sub-classes, and what the retriever method should look like
    • in CacheManager.invalidate, the SELECT ... , if fetchone UPDATE, else INSERT is not thread-safe (again for the same reason that a SELECT doesn't start a transaction) so we should rather do try INSERT, except UPDATE.

Both points are minor and could be done on trunk.

  • What to do next?
    • maybe send a mail on Trac-dev (in the same thread you started a while ago) saying the topic work is done and ask if anyone has some extra feedback to give
    • after the commit, warn loudly on the milestone:0.12, on the TracDev/ReleaseNotes/0.12 and 0.12/TracUpgrade pages that the DB version has increased. It's not that it's problematic to do the upgrade, it's rather because it's inconvenient to downgrade. As long as we keep the DB version compatible, users can eventually go back and forth between trunk and 0.11-stable. Once they did an upgrade, it's not that convenient anymore (but still relatively easy to do in this specific case, of course).
  • We could also think about adding some tests for this, though that might be more involved.

rblank:

  • Replies to last round of feedback:
    • Will do.
    • That's what I tried first, but the error due to the INSERT rolled back the whole transaction. I'll have to find a way to do this in a single statement.

cboos:

Hm right, that can be problematic. So what about this:

    cursor.execute("SELECT generation FROM cache WHERE key=%s",
                   (key,)) 
    do_update = cursor.fetchone()
    if not do_update:
        try: 
            cursor.execute("INSERT INTO cache VALUES (%s, %s)",
                           (key, 0))
        except Exception:
            do_update = True
    if do_update:
            cursor.execute("UPDATE cache SET generation=generation+1" 
                           "WHERE key=%s", (key,))

If we were in a transaction, then I suppose the SELECT/INSERT sequence can't fail. Conversely, if it fails, then we were not in a transaction, and we can follow-up with an UPDATE to recover from the failed INSERT.

rblank: Alternative for atomic UPSERT

That could work, yes. How about this:

    cursor.execute("UPDATE cache SET generation=generation+1 "
                   "WHERE key=%s", (key,))
    cursor.execute("SELECT generation FROM cache WHERE key=%s", (key,))
    if not cursor.fetchone():
        cursor.execute("INSERT INTO cache VALUES (%s, %s)", (key, 0))

If the row already exists, it is updated, the SELECT returns a row and we're done.

If not, the UPDATE does nothing except starting a transaction (or we may already be in a transaction), the SELECT doesn't return any rows, and we do the INSERT in the same transaction. Doesn't the UPDATE even return the number of altered rows? That would void the need for a separate SELECT. I'm not sure though that the UPDATE starts a transaction if no rows are altered. We may have to use a dummy row that is always updated in addition to the desired row.

cboos:

Looks great! I don't think we can have something any simpler, in particular the UPDATE doesn't seem to return the number of modified rows for all backends (at least, that doesn't seem to be possible with SQLite3 and Pysqlite).

rblank: Updated patch

The attachment:cache-manager-r7992.patch improves the docstrings for the decorators and the proxy, and makes invalidation atomic. I'll now ask for feedback on trac-dev.

Idea 2: Cache control

I'm currently thinking about the following solution.

Each time a cache needs to be invalidated (i.e. in the current situations where we call config.touch()), we would instead call env.cache_invalidate(cache_key), where cache_key is some unique key identifying that cache (e.g. "InterMapTxt" or "repository-reponame" for the MultipleRepositorySupport/Cache). This call will atomically increment some generation value associated to the key, in the db (that might be tricky - select for update for Pgsql, explicit transaction for Pysqlite). A simple create table cachecontrol (key text, generation int) should be enough.

At periodic times, e.g. in open_environment, we would call env.cache_update(). That will do a select * from cachecontrol. The results are stored besides the previously known latest values, therefore we can quickly see which caches need a refresh.

Whenever a Component has to fetch a value from the cache, it will first call env.cache_is_valid(cache_key). If the result is true, it can retrieve values from the cache. If not, the cache has to be updated first. Once the cache is refreshed, the component calls env.cache_validate(cache_key).

Example: InterMapTxt cache

For convenience, if a Component only manages one cache (the common case), it can pass self instead of a string key and its class name will be used.

Only the code changes for trac/env.py and trac/wiki/interwiki.py are roughly implemented. Not tested yet and just to illustrate the above.

See attachment:cache_control-r7933.diff. For testing, manual creation of a cache_control table is needed:

CREATE TABLE cache_control( key text primary key, generation int);

The method names and API has a bit evolved, now I have:

  • env.update_cache_control(), called in open_environment
  • env.invalidate_cache(key), called by the Component in place of config.touch()
  • env.is_cache_valid(key), called by the Component when checking for cache validity
  • env.validate_cache(key) once the cache has been updated

That concludes my initial approach to the problem. Now let's take into account what was proposed in idea 1…

Discussion

While the two approaches are quite close in spirit, there are a few differences.

I initially thought that having the cache control managed at the level of the environment was more natural than having a specialized Component (it's a "service" offered by the environment to all its Components, like providing them with a db connection).

But I see your point in having the cache logic handled "once for all", without the need to re-implement it in various places. If that's doable in a not too complicated way, it may be worth doing it.

I've not yet added time-based invalidation, but if really needed, that can be added as well.

The open problem I see as well is about maintaining a coherent view from the cache during the lifetime of a given request. That might indeed be another argument in favor of a dedicated component with a more advanced cache logic. Anyway, the patch above is at least a first step that seems to work fine in my testing.

Indeed, the basic idea is the same. My goal was to push as much of the logic into the CacheManager as possible, so that cache users would only have two functionalities: get the data (this could even be hidden by using a property-like descriptor) and invalidate the cache. There should be no need for cache users to "remember to first check if the cache is valid, then …": this logic is common to all cache users, and can be integrated into the cache manager.

Last modified 2 years ago Last modified on May 9, 2015, 11:53:37 PM

Attachments (6)

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.