#9870 closed enhancement (fixed)
Improve the cache subsystem
Reported by: | Remy Blank | Owned by: | Remy Blank |
---|---|---|---|
Priority: | high | Milestone: | 1.0 |
Component: | general | Version: | 0.12dev |
Severity: | normal | Keywords: | cache |
Cc: | Branch: | ||
Release Notes: |
Improved behavior and documentation for |
||
API Changes: |
It is now safe to inherit from classes having |
||
Internal Changes: |
Description
#9866 got me thinking about the cache subsystem, and there are at least two issues that could be improved:
CachedProperty
currently usesinstance.__class__
to determine the module and class names. This is not correct when a component is subclassed, as seems to be the case e.g. in Agilo. This means that a cached property defined in an abstract base component, and inherited by sub-components, will have a different id in every subclass, and hence invalidating in one subclass will not invalidate the cached value in the other subclasses.
- Using a string identifier as a property id in the database was a pretty silly idea in retrospect. An integer hash of the identifier would be more space and time-efficient. Collisions are not an issue, as the only effect would be to invalidate slightly more than necessary.
The first issue is a bug, so I would like to fix it in 0.12-stable. The solution for the first item is to use the class where the property is defined to determine the module and class names.
The second is an enhancement that requires a DB upgrade, so I will implement it on trunk.
Attachments (3)
Change History (26)
comment:1 by , 14 years ago
by , 14 years ago
Attachment: | 9870-int-cache-id-r10344.patch added |
---|
Change the cache
table primary key to int
.
comment:2 by , 14 years ago
9870-int-cache-id-r10344.patch modifies the cache subsystem to use a hash of the property id in the cache
table. It includes a DB upgrade to version 27.
comment:3 by , 14 years ago
Well, the trouble with integer ids is maybe a lack of "readability" of the database content, at least for debugging.
Which makes me think we could add some debug commands to trac-admin, like here a trac-admin debug cache-ids
that would list in clear text the cache identifiers and their generation info.
follow-up: 5 comment:4 by , 14 years ago
Sure, but the cache
table is not meant to be read (at least once the caching subsystem works correctly). Adding a debug command to trac-admin
may be a good idea, but due to the id being a hash, there's no direct mapping back to the identifier. We would have to go through all components and find the descriptors. And this still wouldn't give us the cached attributes for non-Component
classes (e.g. CachedRepository
).
follow-up: 6 comment:5 by , 14 years ago
Replying to rblank:
Sure, but the
cache
table is not meant to be read (at least once the caching subsystem works correctly)
Well, we thought it did work well for a while ;-) And even if it did, there can always be regressions, or plugin or local modifications which interfere, etc. All things that make having a quick look at the cache table useful for troubleshooting, just like we did in #9866.
there's no direct mapping back to the identifier. We would have to go through all components and find the descriptors. And this still wouldn't give us the cached attributes for non-
Component
classes (e.g.CachedRepository
).
Right, I supposed we could somehow get easily a list of all the keys. But as it's not the case, then I think the trade-off between the proposed optimization and the loss of clarity is not in favor of the change. We already have an index for the id column for the performance part, and for the size, well, we'll never have that many different keys to weight as much as one version of a medium sized wiki page ;-)
follow-up: 7 comment:6 by , 14 years ago
Replying to cboos:
We already have an index for the id column for the performance part, and for the size, well, we'll never have that many different keys to weight as much as one version of a medium sized wiki page ;-)
I haven't done any performance measurements yet, but I can't imagine a string comparison being as fast as an integer comparison, neither in Python code, nor in the database.
And the cache is all about speed, so I'd be ready to sacrifice a bit of clarity for speed. We do lack unit tests for the cache subsystem, so this should be improved as well.
Right, I supposed we could somehow get easily a list of all the keys. But as it's not the case, then I think the trade-off between the proposed optimization and the loss of clarity is not in favor of the change.
I might have been a bit to fast on that point. I have an idea how we can get the keys from the hashes. Actually, I have even two ideas how to get them. Let me try that.
comment:7 by , 14 years ago
Replying to rblank:
I haven't done any performance measurements yet, but I can't imagine a string comparison being as fast as an integer comparison, neither in Python code, nor in the database.
And the cache is all about speed, so I'd be ready to sacrifice a bit of clarity for speed.
Ok, I did some timings, using an int
is indeed about twice as fast as using text
, at least for SQLite.
$ python cache_timings.py -r 1000 -t text CREATE TABLE cache ( id text , generation int ) 0.00494199991226 $ python cache_timings.py -r 1000 -t int CREATE TABLE cache ( id int , generation int ) 0.00275899982452 $ python cache_timings.py -r 1000 -t int --index CREATE TABLE cache ( id int primary key, generation int ) 0.00266400003433 $ python cache_timings.py -r 1000 -t text --index CREATE TABLE cache ( id text primary key, generation int ) 0.00402200007439
The timings above correspond mainly to the time spent executing the statement, if we also do fetch + print, the difference remains significant:
$ python cache_timings.py -r 1000 -t int --index -v > /dev/null 0.0091930000782 $ python cache_timings.py -r 1000 -t int --index -v > /dev/null 0.00841799998283 $ python cache_timings.py -r 1000 -t text --index -v > /dev/null 0.0170399999619 $ python cache_timings.py -r 1000 -t text --index -v > /dev/null 0.0155539999008
by , 14 years ago
Attachment: | cache_timings.py added |
---|
simple timing program for various kind of cache table
follow-up: 9 comment:8 by , 14 years ago
What about using an int
index as you suggested (the hash of the property id) and store that property id for readability? (re: comment:5)
Not sure it's still high prio, nor needed for 0.13, I let you decide on that.
comment:9 by , 14 years ago
Replying to cboos:
What about using an
int
index as you suggested (the hash of the property id) and store that property id for readability? (re: comment:5)
I already have a patch that does exactly that, but it requires a somewhat "hackish" implementation. I'll give it one more try and post the results here.
by , 14 years ago
Attachment: | 9870-int-cache-id-r10551.patch added |
---|
Store cache id strings in the database.
follow-up: 11 comment:10 by , 14 years ago
9870-int-cache-id-r10551.patch stores the cache id strings in the database as well, by keeping a global dict
of hash-to-string (that's the hack I mentioned in comment:9). It also adds a trac-admin $ENV debug cache list
command to view the content of the cache table.
Good enough? I'm not sure the debug command is actually needed, as a SELECT * FROM cache;
also does the trick. Probably not.
follow-up: 12 comment:11 by , 14 years ago
Replying to rblank:
9870-int-cache-id-r10551.patch stores the cache id strings in the database as well, by keeping a global
dict
of hash-to-string (that's the hack I mentioned in comment:9). It also adds atrac-admin $ENV debug cache list
command to view the content of the cache table.
What's wrong with the builtin hash
? Ok, it returns a signed 32 bits value but if this is really a problem we could do abs(hash(id))
.
Good enough? I'm not sure the debug command is actually needed, as a
SELECT * FROM cache;
also does the trick. Probably not.
Well, the debug command would have been useful in case the SQL content alone wouldn't be self-explained, so I think we don't need it.
But what we should do is detect hash collisions (thanks to your _ids
dict) and crash and burn if there's one ;-)
follow-up: 13 comment:12 by , 14 years ago
Replying to cboos:
What's wrong with the builtin
hash
? Ok, it returns a signed 32 bits value but if this is really a problem we could doabs(hash(id))
.
The signed value is not a problem. The reason I re-implemented the hash function is that there's no guarantee that the built-in hash()
will return the same value for the same string in different versions of Python. The hash function doesn't have to be very fast, as it is only used during startup.
But what we should do is detect hash collisions (thanks to your
_ids
dict) and crash and burn if there's one ;-)
You mean, using the set_or_raise_if_already_present()
method of dict
? :)
follow-up: 14 comment:13 by , 14 years ago
Replying to rblank:
Replying to cboos:
What's wrong with the builtin
hash
? Ok, it returns a signed 32 bits value but if this is really a problem we could doabs(hash(id))
.The signed value is not a problem. The reason I re-implemented the hash function is that there's no guarantee that the built-in
hash()
will return the same value for the same string in different versions of Python. The hash function doesn't have to be very fast, as it is only used during startup.
I see. But the last significant change in that code was done in … 1996 ;-)
But what we should do is detect hash collisions (thanks to your
_ids
dict) and crash and burn if there's one ;-)You mean, using the
set_or_raise_if_already_present()
method ofdict
? :)
Either when inserting in the dict, or when we do the select after the update, we could also get the name in addition to the generation and check at that point. But I just realized that a collision just means that the invalidation of a key will also invalidate the other… we're not going to miss notifications if I'm right, so this doesn't deserve more than a warning.
comment:14 by , 14 years ago
Replying to cboos:
I see. But the last significant change in that code was done in … 1996 ;-)
Yeah, I know, I'm paranoid :) Also, I just found out that hash()
returns a 64-bit value on a 64-bit OS, so it's inherently non-portable.
Either when inserting in the dict, or when we do the select after the update, we could also get the name in addition to the generation and check at that point. But I just realized that a collision just means that the invalidation of a key will also invalidate the other… we're not going to miss notifications if I'm right, so this doesn't deserve more than a warning.
Heh, I really thought you were joking. Yes, your analysis is correct.
comment:15 by , 14 years ago
Should I wait for the 0.13beta release to be close to apply this, or should I just go ahead? It requires a DB upgrade.
comment:16 by , 14 years ago
No, please just go ahead, then we'll see if there's any issue with the upgrade prior to making the beta release, which is always a good thing ;-)
Be sure to warn about this upgrade in all the relevant places (i.e. 0.13, TracDev/ReleaseNotes and 0.13/TracUpgrade.
follow-up: 18 comment:17 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Refactored patch applied in [10581]. I separated the CachedSingletonProperty
and CachedProperty
descriptors, as they are used in different contexts. This doesn't change anything for users of the cache subsystem. Also, is is now simpler to specify the property key for non-singleton classes.
The upgrade is now mentioned in the relevant documentation.
comment:18 by , 14 years ago
I started to document the module in r10584 (see apidoc:api/trac_cache).
Replying to rblank:
Refactored patch applied in [10581]. I separated the
CachedSingletonProperty
andCachedProperty
descriptors, as they are used in different contexts.
The change is good, however I think the docstrings can be improved. If I have trouble to understand the code which I helped (a bit) to develop, maybe other people would appreciate some extra hints ;-)
I feel a bit uneasy with the term singleton
, as we don't use it elsewhere in Trac. I understand why we don't talk of Component
s here, but I think making clear that we're talking about a "per Environment singleton" would be better.
Suggested doc changes:
-
cache.py
51 51 52 52 53 53 class CachedSingletonProperty(CachedPropertyBase): 54 """Cached property descriptor for singleton classes""" 54 """Cached property descriptor for classes behaving as singletons 55 in the scope of one `~trac.env.Environment` instance. 56 57 This means there will be no more than one cache to monitor in the 58 database for this kind of cache. Therefore, using only "static" 59 information for the key is enough. For the same reason it is also 60 safe to store the corresponding id as a property of the descriptor 61 instance. 62 """ 55 63 56 64 def __get__(self, instance, owner): 57 65 if instance is None: … … 71 79 72 80 73 81 class CachedProperty(CachedPropertyBase): 74 """Cached property descriptor""" 82 """Cached property descriptor for classes having potentially 83 multiple instances associated to a single `~trac.env.Environment` 84 instance. 85 86 As we'll have potentiall many different caches to monitor for this 87 kind of cache, the key needs to be augmented by a string unique to 88 each instance of the owner class. As the resulting id will be 89 different for each instance of the owner class, we can't store it 90 as a property of the descriptor class, so we store it back in the 91 attribute used for augmenting the key (``key_attr``). 92 """ 75 93 76 94 def __init__(self, retriever, key_attr): 77 95 super(CachedProperty, self).__init__(retriever) … … 103 121 `CacheManager`. Invalidating the cache for this value is done by 104 122 ``del``\ eting the attribute. 105 123 106 Note that the cache validity is maintained using a table in the 107 database. Cache invalidation is performed within a transaction 108 block, and can be nested within another transaction block. 124 Note that the cache validity is maintained using the `cache` table 125 in the database. Cache invalidation is performed within a 126 transaction block, and can be nested within another transaction 127 block. 109 128 110 The key used to identify the attribute in the database is 111 constructed from the names of the containing module, class and 112 retriever method. If the decorator is used in non-signleton 113 (typically non-`Component`) objects, an string specifying the name 114 of an attribute containing a string unique to the instance must be 115 passed to the decorator. This value will be appended to the key 116 constructed from module, class and method name:: 129 When the decorator is used in a class for which instances behave 130 as singletons within the scope of a given `~trac.env.Environment` 131 (typically `~trac.core.Component` classes), the key used to 132 identify the attribute in the database is constructed from the 133 names of the containing module, class and retriever method:: 117 134 135 class SomeComponent(object): 136 def __init__(self): 137 self._all_things = None 138 139 @cached 140 def all_things(self): 141 self._all_things = {} 142 ... 143 144 Otherwise, when the decorator is used in non-"singleton" objects, 145 a string specifying the name of an attribute containing a string 146 unique to the instance must be passed to the decorator. This value 147 will be appended to the key constructed from module, class and 148 method name:: 149 118 150 class SomeClass(object): 119 151 def __init__(self, env, name): 120 152 self.env = env … … 125 157 def metadata(self): 126 158 ... 127 159 128 Note that the key attribute is overwritten with a hash of the key on first 129 access, so it should not be used for any other purpose. 160 Note that in this case the key attribute is overwritten with a 161 hash of the key on first access, so it should not be used for any 162 other purpose. 130 163 131 This decorator requires that the object on which it is used has an `env` 132 attribute containing the application `Environment`. 164 In either case, this decorator requires that the object on which 165 it is used has an ``env`` attribute containing the application 166 `~trac.env.Environment`. 133 167 134 168 .. versionchanged:: 0.13 135 136 169 The data retrieval method used to be called with a single 137 170 argument ``db`` containing a reference to a database 138 171 connection. This is the same connection that can be retrieved
follow-up: 21 comment:19 by , 14 years ago
Plus some further code simplification for the descriptor classes:
-
trac/cache.py
diff -r e1ae835d3886 trac/cache.py
a b class CachedPropertyBase(object): 41 41 self.retriever = retriever 42 42 self.__doc__ = retriever.__doc__ 43 43 44 def __get__(self, instance, owner): 45 if instance is None: 46 return self 47 return CacheManager(instance.env).get( 48 self.make_id(instance, owner), self.retriever, instance) 49 50 def __delete__(self, instance): 51 CacheManager(instance.env).invalidate( 52 self.make_id(instance, instance.__class__)) 53 44 54 def make_key(self, cls): 45 55 attr = self.retriever.__name__ 46 56 for base in cls.mro(): … … class CachedSingletonProperty(CachedProp 61 71 instance. 62 72 """ 63 73 64 def __get__(self, instance, owner): 65 if instance is None: 66 return self 74 def make_id(self, instance, owner): 67 75 try: 68 76 id = self.id 69 77 except AttributeError: 70 78 id = self.id = key_to_id(self.make_key(owner)) 71 return CacheManager(instance.env).get(id, self.retriever, instance)79 return id 72 80 73 def __delete__(self, instance):74 try:75 id = self.id76 except AttributeError:77 id = self.id = key_to_id(self.make_key(instance.__class__))78 CacheManager(instance.env).invalidate(id)79 80 81 81 class CachedProperty(CachedPropertyBase): 82 82 """Cached property descriptor for classes having potentially 83 83 multiple instances associated to a single `~trac.env.Environment` … … class CachedProperty(CachedPropertyBase) 94 94 def __init__(self, retriever, key_attr): 95 95 super(CachedProperty, self).__init__(retriever) 96 96 self.key_attr = key_attr 97 98 def __get__(self, instance, owner): 99 if instance is None: 100 return self 97 98 def make_id(self, instance, owner): 101 99 id = getattr(instance, self.key_attr) 102 100 if isinstance(id, str): 103 101 id = key_to_id(self.make_key(owner) + ':' + id) 104 102 setattr(instance, self.key_attr, id) 105 return CacheManager(instance.env).get(id, self.retriever, instance) 106 107 def __delete__(self, instance): 108 id = getattr(instance, self.key_attr) 109 if isinstance(id, str): 110 id = key_to_id(self.make_key(instance.__class__) + ':' + id) 111 setattr(instance, self.key_attr, id) 112 CacheManager(instance.env).invalidate(id) 103 return id 113 104 114 105 115 106 def cached(fn_or_attr=None):
comment:20 by , 14 years ago
The changes look great, except for the example in the singleton case. I wouldn't show an _all_things
attribute, as it's unnecessary. You just need to return a value from the method, and it will be cached in the CacheManager
. A more useful example could be:
class WikiSystem(Component): @cached def pages(self): return set(name for name, in self.env.db_query("SELECT DISTINCT name FROM wiki"))
I didn't document CachedSingletonProperty
and CachedProperty
, as they actually aren't part of the public API. Only the cached()
function is needed. But your descriptions are fine, so please go ahead.
comment:21 by , 14 years ago
Replying to cboos:
Plus some further code simplification for the descriptor classes:
I prefer the current version, even though there is some code duplication. The way I have written it makes the common case as short as possible. That was actually the whole purpose of separating CachedSingletonProperty
from CachedProperty
, so re-combining __get__()
and __delete__()
is a step backwards for me.
Then again, the performance impact is probably negligible. If you prefer the simpler version, I can probably make it even simpler by re-combining them into a single CachedProperty
.
comment:22 by , 14 years ago
Doc updated in r10585.
No, separating them is good, it makes it immediately obvious we have two kind of descriptors. I was myself not fully convinced about my change seeing that CachedSingletonProperty.make_id()
had to carry the superfluous instance
parameter. But the kind of duplication I'm after was already present in the original code as well, so let me have another try:
-
trac/cache.py
diff -r 99070e68632a trac/cache.py
a b class CachedSingletonProperty(CachedProp 64 64 def __get__(self, instance, owner): 65 65 if instance is None: 66 66 return self 67 return CacheManager(instance.env).get(self.make_id(owner), 68 self.retriever, instance) 69 70 def __delete__(self, instance): 71 CacheManager(instance.env).invalidate(self.make_id(instance.__class__)) 72 73 def make_id(self, owner): 67 74 try: 68 75 id = self.id 69 76 except AttributeError: 70 77 id = self.id = key_to_id(self.make_key(owner)) 71 return CacheManager(instance.env).get(id, self.retriever, instance)78 return id 72 79 73 def __delete__(self, instance):74 try:75 id = self.id76 except AttributeError:77 id = self.id = key_to_id(self.make_key(instance.__class__))78 CacheManager(instance.env).invalidate(id)79 80 81 80 class CachedProperty(CachedPropertyBase): 82 81 """Cached property descriptor for classes having potentially 83 82 multiple instances associated to a single `~trac.env.Environment` … … class CachedProperty(CachedPropertyBase) 94 93 def __init__(self, retriever, key_attr): 95 94 super(CachedProperty, self).__init__(retriever) 96 95 self.key_attr = key_attr 97 96 98 97 def __get__(self, instance, owner): 99 98 if instance is None: 100 99 return self 100 return CacheManager(instance.env).get(self.make_id(instance, owner), 101 self.retriever, instance) 102 103 def __delete__(self, instance): 104 CacheManager(instance.env).invalidate(self.make_id(instance, 105 instance.__class__)) 106 107 def make_id(self, instance, owner): 101 108 id = getattr(instance, self.key_attr) 102 109 if isinstance(id, str): 103 110 id = key_to_id(self.make_key(owner) + ':' + id) 104 111 setattr(instance, self.key_attr, id) 105 return CacheManager(instance.env).get(id, self.retriever, instance) 106 107 def __delete__(self, instance): 108 id = getattr(instance, self.key_attr) 109 if isinstance(id, str): 110 id = key_to_id(self.make_key(instance.__class__) + ':' + id) 111 setattr(instance, self.key_attr, id) 112 CacheManager(instance.env).invalidate(id) 112 return id 113 113 114 114 115 115 def cached(fn_or_attr=None):
It's a neutral change (+18/-18), but the "complex" parts are not duplicated so I find it easier to understand and maintain.
The first issue is fixed in [10342] and [10343], by walking the class MRO to find the class where a property is defined.