Edgewall Software
Modify

Opened 10 years ago

Closed 10 years ago

Last modified 8 years ago

#9870 closed enhancement (fixed)

Improve the cache subsystem

Reported by: Remy Blank Owned by: Remy Blank
Priority: high Milestone: 1.0
Component: general Version: 0.12dev
Severity: normal Keywords: cache
Cc: Branch:
Release Notes:

Improved behavior and documentation for @cached properties.

API Changes:

It is now safe to inherit from classes having @cached properties.

Internal Changes:

Description

#9866 got me thinking about the cache subsystem, and there are at least two issues that could be improved:

  • CachedProperty currently uses instance.__class__ to determine the module and class names. This is not correct when a component is subclassed, as seems to be the case e.g. in Agilo. This means that a cached property defined in an abstract base component, and inherited by sub-components, will have a different id in every subclass, and hence invalidating in one subclass will not invalidate the cached value in the other subclasses.
  • Using a string identifier as a property id in the database was a pretty silly idea in retrospect. An integer hash of the identifier would be more space and time-efficient. Collisions are not an issue, as the only effect would be to invalidate slightly more than necessary.

The first issue is a bug, so I would like to fix it in 0.12-stable. The solution for the first item is to use the class where the property is defined to determine the module and class names.

The second is an enhancement that requires a DB upgrade, so I will implement it on trunk.

Attachments (3)

9870-int-cache-id-r10344.patch (2.8 KB ) - added by Remy Blank 10 years ago.
Change the cache table primary key to int.
cache_timings.py (2.3 KB ) - added by Christian Boos 10 years ago.
simple timing program for various kind of cache table
9870-int-cache-id-r10551.patch (5.2 KB ) - added by Remy Blank 10 years ago.
Store cache id strings in the database.

Download all attachments as: .zip

Change History (26)

comment:1 by Remy Blank, 10 years ago

The first issue is fixed in [10342] and [10343], by walking the class MRO to find the class where a property is defined.

by Remy Blank, 10 years ago

Change the cache table primary key to int.

comment:2 by Remy Blank, 10 years ago

9870-int-cache-id-r10344.patch modifies the cache subsystem to use a hash of the property id in the cache table. It includes a DB upgrade to version 27.

comment:3 by Christian Boos, 10 years ago

Well, the trouble with integer ids is maybe a lack of "readability" of the database content, at least for debugging.

Which makes me think we could add some debug commands to trac-admin, like here a trac-admin debug cache-ids that would list in clear text the cache identifiers and their generation info.

comment:4 by Remy Blank, 10 years ago

Sure, but the cache table is not meant to be read (at least once the caching subsystem works correctly). Adding a debug command to trac-admin may be a good idea, but due to the id being a hash, there's no direct mapping back to the identifier. We would have to go through all components and find the descriptors. And this still wouldn't give us the cached attributes for non-Component classes (e.g. CachedRepository).

in reply to:  4 ; comment:5 by Christian Boos, 10 years ago

Replying to rblank:

Sure, but the cache table is not meant to be read (at least once the caching subsystem works correctly)

Well, we thought it did work well for a while ;-) And even if it did, there can always be regressions, or plugin or local modifications which interfere, etc. All things that make having a quick look at the cache table useful for troubleshooting, just like we did in #9866.

there's no direct mapping back to the identifier. We would have to go through all components and find the descriptors. And this still wouldn't give us the cached attributes for non-Component classes (e.g. CachedRepository).

Right, I supposed we could somehow get easily a list of all the keys. But as it's not the case, then I think the trade-off between the proposed optimization and the loss of clarity is not in favor of the change. We already have an index for the id column for the performance part, and for the size, well, we'll never have that many different keys to weight as much as one version of a medium sized wiki page ;-)

in reply to:  5 ; comment:6 by Remy Blank, 10 years ago

Replying to cboos:

We already have an index for the id column for the performance part, and for the size, well, we'll never have that many different keys to weight as much as one version of a medium sized wiki page ;-)

I haven't done any performance measurements yet, but I can't imagine a string comparison being as fast as an integer comparison, neither in Python code, nor in the database.

And the cache is all about speed, so I'd be ready to sacrifice a bit of clarity for speed. We do lack unit tests for the cache subsystem, so this should be improved as well.

Right, I supposed we could somehow get easily a list of all the keys. But as it's not the case, then I think the trade-off between the proposed optimization and the loss of clarity is not in favor of the change.

I might have been a bit to fast on that point. I have an idea how we can get the keys from the hashes. Actually, I have even two ideas how to get them. Let me try that.

Last edited 10 years ago by Remy Blank (previous) (diff)

in reply to:  6 comment:7 by Christian Boos, 10 years ago

Replying to rblank:

I haven't done any performance measurements yet, but I can't imagine a string comparison being as fast as an integer comparison, neither in Python code, nor in the database.

And the cache is all about speed, so I'd be ready to sacrifice a bit of clarity for speed.

Ok, I did some timings, using an int is indeed about twice as fast as using text, at least for SQLite.

$ python cache_timings.py -r 1000 -t text
CREATE TABLE cache ( id text , generation int )
0.00494199991226

$ python cache_timings.py -r 1000 -t int
CREATE TABLE cache ( id int , generation int )
0.00275899982452

$ python cache_timings.py -r 1000 -t int --index
CREATE TABLE cache ( id int primary key, generation int )
0.00266400003433

$ python cache_timings.py -r 1000 -t text --index
CREATE TABLE cache ( id text primary key, generation int )
0.00402200007439

The timings above correspond mainly to the time spent executing the statement, if we also do fetch + print, the difference remains significant:

$ python cache_timings.py -r 1000 -t int --index -v > /dev/null
0.0091930000782
$ python cache_timings.py -r 1000 -t int --index -v > /dev/null
0.00841799998283
$ python cache_timings.py -r 1000 -t text --index -v > /dev/null
0.0170399999619
$ python cache_timings.py -r 1000 -t text --index -v > /dev/null
0.0155539999008

by Christian Boos, 10 years ago

Attachment: cache_timings.py added

simple timing program for various kind of cache table

comment:8 by Christian Boos, 10 years ago

What about using an int index as you suggested (the hash of the property id) and store that property id for readability? (re: comment:5)

Not sure it's still high prio, nor needed for 0.13, I let you decide on that.

in reply to:  8 comment:9 by Remy Blank, 10 years ago

Replying to cboos:

What about using an int index as you suggested (the hash of the property id) and store that property id for readability? (re: comment:5)

I already have a patch that does exactly that, but it requires a somewhat "hackish" implementation. I'll give it one more try and post the results here.

by Remy Blank, 10 years ago

Store cache id strings in the database.

comment:10 by Remy Blank, 10 years ago

9870-int-cache-id-r10551.patch stores the cache id strings in the database as well, by keeping a global dict of hash-to-string (that's the hack I mentioned in comment:9). It also adds a trac-admin $ENV debug cache list command to view the content of the cache table.

Good enough? I'm not sure the debug command is actually needed, as a SELECT * FROM cache; also does the trick. Probably not.

in reply to:  10 ; comment:11 by Christian Boos, 10 years ago

Replying to rblank:

9870-int-cache-id-r10551.patch stores the cache id strings in the database as well, by keeping a global dict of hash-to-string (that's the hack I mentioned in comment:9). It also adds a trac-admin $ENV debug cache list command to view the content of the cache table.

What's wrong with the builtin hash? Ok, it returns a signed 32 bits value but if this is really a problem we could do abs(hash(id)).

Good enough? I'm not sure the debug command is actually needed, as a SELECT * FROM cache; also does the trick. Probably not.

Well, the debug command would have been useful in case the SQL content alone wouldn't be self-explained, so I think we don't need it.

But what we should do is detect hash collisions (thanks to your _ids dict) and crash and burn if there's one ;-)

in reply to:  11 ; comment:12 by Remy Blank, 10 years ago

Replying to cboos:

What's wrong with the builtin hash? Ok, it returns a signed 32 bits value but if this is really a problem we could do abs(hash(id)).

The signed value is not a problem. The reason I re-implemented the hash function is that there's no guarantee that the built-in hash() will return the same value for the same string in different versions of Python. The hash function doesn't have to be very fast, as it is only used during startup.

But what we should do is detect hash collisions (thanks to your _ids dict) and crash and burn if there's one ;-)

You mean, using the set_or_raise_if_already_present() method of dict? :)

in reply to:  12 ; comment:13 by Christian Boos, 10 years ago

Replying to rblank:

Replying to cboos:

What's wrong with the builtin hash? Ok, it returns a signed 32 bits value but if this is really a problem we could do abs(hash(id)).

The signed value is not a problem. The reason I re-implemented the hash function is that there's no guarantee that the built-in hash() will return the same value for the same string in different versions of Python. The hash function doesn't have to be very fast, as it is only used during startup.

I see. But the last significant change in that code was done in … 1996 ;-)

But what we should do is detect hash collisions (thanks to your _ids dict) and crash and burn if there's one ;-)

You mean, using the set_or_raise_if_already_present() method of dict? :)

Either when inserting in the dict, or when we do the select after the update, we could also get the name in addition to the generation and check at that point. But I just realized that a collision just means that the invalidation of a key will also invalidate the other… we're not going to miss notifications if I'm right, so this doesn't deserve more than a warning.

in reply to:  13 comment:14 by Remy Blank, 10 years ago

Replying to cboos:

I see. But the last significant change in that code was done in … 1996 ;-)

Yeah, I know, I'm paranoid :) Also, I just found out that hash() returns a 64-bit value on a 64-bit OS, so it's inherently non-portable.

Either when inserting in the dict, or when we do the select after the update, we could also get the name in addition to the generation and check at that point. But I just realized that a collision just means that the invalidation of a key will also invalidate the other… we're not going to miss notifications if I'm right, so this doesn't deserve more than a warning.

Heh, I really thought you were joking. Yes, your analysis is correct.

comment:15 by Remy Blank, 10 years ago

Should I wait for the 0.13beta release to be close to apply this, or should I just go ahead? It requires a DB upgrade.

comment:16 by Christian Boos, 10 years ago

No, please just go ahead, then we'll see if there's any issue with the upgrade prior to making the beta release, which is always a good thing ;-)

Be sure to warn about this upgrade in all the relevant places (i.e. 0.13, TracDev/ReleaseNotes and 0.13/TracUpgrade.

comment:17 by Remy Blank, 10 years ago

Resolution: fixed
Status: newclosed

Refactored patch applied in [10581]. I separated the CachedSingletonProperty and CachedProperty descriptors, as they are used in different contexts. This doesn't change anything for users of the cache subsystem. Also, is is now simpler to specify the property key for non-singleton classes.

The upgrade is now mentioned in the relevant documentation.

in reply to:  17 comment:18 by Christian Boos, 10 years ago

I started to document the module in r10584 (see apidoc:api/trac_cache).

Replying to rblank:

Refactored patch applied in [10581]. I separated the CachedSingletonProperty and CachedProperty descriptors, as they are used in different contexts.

The change is good, however I think the docstrings can be improved. If I have trouble to understand the code which I helped (a bit) to develop, maybe other people would appreciate some extra hints ;-)

I feel a bit uneasy with the term singleton, as we don't use it elsewhere in Trac. I understand why we don't talk of Components here, but I think making clear that we're talking about a "per Environment singleton" would be better.

Suggested doc changes:

  • cache.py

     
    5151
    5252   
    5353class CachedSingletonProperty(CachedPropertyBase):
    54     """Cached property descriptor for singleton classes"""
     54    """Cached property descriptor for classes behaving as singletons
     55    in the scope of one `~trac.env.Environment` instance.
     56
     57    This means there will be no more than one cache to monitor in the
     58    database for this kind of cache. Therefore, using only "static"
     59    information for the key is enough. For the same reason it is also
     60    safe to store the corresponding id as a property of the descriptor
     61    instance.
     62    """
    5563   
    5664    def __get__(self, instance, owner):
    5765        if instance is None:
     
    7179
    7280
    7381class CachedProperty(CachedPropertyBase):
    74     """Cached property descriptor"""
     82    """Cached property descriptor for classes having potentially
     83    multiple instances associated to a single `~trac.env.Environment`
     84    instance.
     85
     86    As we'll have potentiall many different caches to monitor for this
     87    kind of cache, the key needs to be augmented by a string unique to
     88    each instance of the owner class.  As the resulting id will be
     89    different for each instance of the owner class, we can't store it
     90    as a property of the descriptor class, so we store it back in the
     91    attribute used for augmenting the key (``key_attr``).
     92    """
    7593   
    7694    def __init__(self, retriever, key_attr):
    7795        super(CachedProperty, self).__init__(retriever)
     
    103121    `CacheManager`. Invalidating the cache for this value is done by
    104122    ``del``\ eting the attribute.
    105123   
    106     Note that the cache validity is maintained using a table in the
    107     database.  Cache invalidation is performed within a transaction
    108     block, and can be nested within another transaction block.
     124    Note that the cache validity is maintained using the `cache` table
     125    in the database.  Cache invalidation is performed within a
     126    transaction block, and can be nested within another transaction
     127    block.
    109128   
    110     The key used to identify the attribute in the database is
    111     constructed from the names of the containing module, class and
    112     retriever method. If the decorator is used in non-signleton
    113     (typically non-`Component`) objects, an string specifying the name
    114     of an attribute containing a string unique to the instance must be
    115     passed to the decorator. This value will be appended to the key
    116     constructed from module, class and method name::
     129    When the decorator is used in a class for which instances behave
     130    as singletons within the scope of a given `~trac.env.Environment`
     131    (typically `~trac.core.Component` classes), the key used to
     132    identify the attribute in the database is constructed from the
     133    names of the containing module, class and retriever method::
    117134
     135        class SomeComponent(object):
     136            def __init__(self):
     137                self._all_things = None
     138
     139            @cached
     140            def all_things(self):
     141                self._all_things = {}
     142                ...
     143
     144    Otherwise, when the decorator is used in non-"singleton" objects,
     145    a string specifying the name of an attribute containing a string
     146    unique to the instance must be passed to the decorator. This value
     147    will be appended to the key constructed from module, class and
     148    method name::
     149
    118150        class SomeClass(object):
    119151            def __init__(self, env, name):
    120152                self.env = env
     
    125157            def metadata(self):
    126158                ...
    127159
    128     Note that the key attribute is overwritten with a hash of the key on first
    129     access, so it should not be used for any other purpose.
     160    Note that in this case the key attribute is overwritten with a
     161    hash of the key on first access, so it should not be used for any
     162    other purpose.
    130163   
    131     This decorator requires that the object on which it is used has an `env`
    132     attribute containing the application `Environment`.
     164    In either case, this decorator requires that the object on which
     165    it is used has an ``env`` attribute containing the application
     166    `~trac.env.Environment`.
    133167
    134168    .. versionchanged:: 0.13
    135 
    136169        The data retrieval method used to be called with a single
    137170        argument ``db`` containing a reference to a database
    138171        connection.  This is the same connection that can be retrieved
Last edited 10 years ago by Christian Boos (previous) (diff)

comment:19 by Christian Boos, 10 years ago

Plus some further code simplification for the descriptor classes:

  • trac/cache.py

    diff -r e1ae835d3886 trac/cache.py
    a b class CachedPropertyBase(object):  
    4141        self.retriever = retriever
    4242        self.__doc__ = retriever.__doc__
    4343
     44    def __get__(self, instance, owner):
     45        if instance is None:
     46            return self
     47        return CacheManager(instance.env).get(
     48            self.make_id(instance, owner), self.retriever, instance)
     49
     50    def __delete__(self, instance):
     51        CacheManager(instance.env).invalidate(
     52            self.make_id(instance, instance.__class__))
     53
    4454    def make_key(self, cls):
    4555        attr = self.retriever.__name__
    4656        for base in cls.mro():
    class CachedSingletonProperty(CachedProp  
    6171    instance.
    6272    """
    6373
    64     def __get__(self, instance, owner):
    65         if instance is None:
    66             return self
     74    def make_id(self, instance, owner):
    6775        try:
    6876            id = self.id
    6977        except AttributeError:
    7078            id = self.id = key_to_id(self.make_key(owner))
    71         return CacheManager(instance.env).get(id, self.retriever, instance)
     79        return id
    7280
    73     def __delete__(self, instance):
    74         try:
    75             id = self.id
    76         except AttributeError:
    77             id = self.id = key_to_id(self.make_key(instance.__class__))
    78         CacheManager(instance.env).invalidate(id)
    79 
    80 
    8181class CachedProperty(CachedPropertyBase):
    8282    """Cached property descriptor for classes having potentially
    8383    multiple instances associated to a single `~trac.env.Environment`
    class CachedProperty(CachedPropertyBase)  
    9494    def __init__(self, retriever, key_attr):
    9595        super(CachedProperty, self).__init__(retriever)
    9696        self.key_attr = key_attr
    97 
    98     def __get__(self, instance, owner):
    99         if instance is None:
    100             return self
     97
     98    def make_id(self, instance, owner):
    10199        id = getattr(instance, self.key_attr)
    102100        if isinstance(id, str):
    103101            id = key_to_id(self.make_key(owner) + ':' + id)
    104102            setattr(instance, self.key_attr, id)
    105         return CacheManager(instance.env).get(id, self.retriever, instance)
    106 
    107     def __delete__(self, instance):
    108         id = getattr(instance, self.key_attr)
    109         if isinstance(id, str):
    110             id = key_to_id(self.make_key(instance.__class__) + ':' + id)
    111             setattr(instance, self.key_attr, id)
    112         CacheManager(instance.env).invalidate(id)
     103        return id
    113104
    114105
    115106def cached(fn_or_attr=None):

comment:20 by Remy Blank, 10 years ago

The changes look great, except for the example in the singleton case. I wouldn't show an _all_things attribute, as it's unnecessary. You just need to return a value from the method, and it will be cached in the CacheManager. A more useful example could be:

class WikiSystem(Component):
    @cached
    def pages(self):
        return set(name for name, in
                   self.env.db_query("SELECT DISTINCT name FROM wiki"))

I didn't document CachedSingletonProperty and CachedProperty, as they actually aren't part of the public API. Only the cached() function is needed. But your descriptions are fine, so please go ahead.

in reply to:  19 comment:21 by Remy Blank, 10 years ago

Replying to cboos:

Plus some further code simplification for the descriptor classes:

I prefer the current version, even though there is some code duplication. The way I have written it makes the common case as short as possible. That was actually the whole purpose of separating CachedSingletonProperty from CachedProperty, so re-combining __get__() and __delete__() is a step backwards for me.

Then again, the performance impact is probably negligible. If you prefer the simpler version, I can probably make it even simpler by re-combining them into a single CachedProperty.

comment:22 by Christian Boos, 10 years ago

Doc updated in r10585.

No, separating them is good, it makes it immediately obvious we have two kind of descriptors. I was myself not fully convinced about my change seeing that CachedSingletonProperty.make_id() had to carry the superfluous instance parameter. But the kind of duplication I'm after was already present in the original code as well, so let me have another try:

  • trac/cache.py

    diff -r 99070e68632a trac/cache.py
    a b class CachedSingletonProperty(CachedProp  
    6464    def __get__(self, instance, owner):
    6565        if instance is None:
    6666            return self
     67        return CacheManager(instance.env).get(self.make_id(owner),
     68                                              self.retriever, instance)
     69
     70    def __delete__(self, instance):
     71        CacheManager(instance.env).invalidate(self.make_id(instance.__class__))
     72
     73    def make_id(self, owner):
    6774        try:
    6875            id = self.id
    6976        except AttributeError:
    7077            id = self.id = key_to_id(self.make_key(owner))
    71         return CacheManager(instance.env).get(id, self.retriever, instance)
     78        return id
    7279
    73     def __delete__(self, instance):
    74         try:
    75             id = self.id
    76         except AttributeError:
    77             id = self.id = key_to_id(self.make_key(instance.__class__))
    78         CacheManager(instance.env).invalidate(id)
    79 
    80 
    8180class CachedProperty(CachedPropertyBase):
    8281    """Cached property descriptor for classes having potentially
    8382    multiple instances associated to a single `~trac.env.Environment`
    class CachedProperty(CachedPropertyBase)  
    9493    def __init__(self, retriever, key_attr):
    9594        super(CachedProperty, self).__init__(retriever)
    9695        self.key_attr = key_attr
    97 
     96
    9897    def __get__(self, instance, owner):
    9998        if instance is None:
    10099            return self
     100        return CacheManager(instance.env).get(self.make_id(instance, owner),
     101                                              self.retriever, instance)
     102
     103    def __delete__(self, instance):
     104        CacheManager(instance.env).invalidate(self.make_id(instance,
     105                                                           instance.__class__))
     106
     107    def make_id(self, instance, owner):
    101108        id = getattr(instance, self.key_attr)
    102109        if isinstance(id, str):
    103110            id = key_to_id(self.make_key(owner) + ':' + id)
    104111            setattr(instance, self.key_attr, id)
    105         return CacheManager(instance.env).get(id, self.retriever, instance)
    106 
    107     def __delete__(self, instance):
    108         id = getattr(instance, self.key_attr)
    109         if isinstance(id, str):
    110             id = key_to_id(self.make_key(instance.__class__) + ':' + id)
    111             setattr(instance, self.key_attr, id)
    112         CacheManager(instance.env).invalidate(id)
     112        return id
    113113
    114114
    115115def cached(fn_or_attr=None):

It's a neutral change (+18/-18), but the "complex" parts are not duplicated so I find it easier to understand and maintain.

Last edited 10 years ago by Christian Boos (previous) (diff)

comment:23 by Christian Boos, 8 years ago

API Changes: modified (diff)
Release Notes: modified (diff)

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Remy Blank.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Remy Blank to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.