Ideas for the GenericTrac data model
We regroup here the various ideas and questions about what could be the new model. It's nowhere near a proposal, more a scratch pad and is more dated than what's in the GenericTrac page itself.
In the following:
- int stands for the integer column type
- string stands for a short text column type (e.g. varchar(65))
- text stands for a long text column type, of unbounded size
EAV-style model
This was the original idea for implementing a generic data model. Starting from the particular example of the ticket
data model and seeing that we had both a main table and a secondary table for the custom properties,
in order to reduce the overall complexity, the idea was to pick only one approach, instead of having to support both.
The main argument in favor of a change is that currently (for the ticket
resources) we already implement both styles. The point is that by using only the second style (but maybe in a more efficient way), we could also have our "fixed" set of properties, while obviously the opposite is not true, the first style can't provide the flexibility provided by the second style.
And it remains to be seen whether the second approach is really less efficient than the first, as by an appropriate use of indexes we might eventually get better performance than what we have today.
The second style (prop/value) can be implemented in several ways:
- a single
properties
table withprop
andvalue
string columns (this is what our currentticket_custom
,system
andrepository
tables do) - a single
properties
table withprop
and several typed columns (int_value
,string_value
,date_value
, etc.), only the appropriate column being used (see in the discussion, the JBPM's approach) - several
properties
tables with eachprop
andvalue
columns, each table dedicated to a given type (int_properties
,string_properties
, etc.) - … more …
So for example the new model could be simply:
ticket_prop id prop value seq
seq is the sequence number, which is used to support multiple values for the same property.
Notes:
- newer "resources", like
repository
in MultiRepos already have this(id,prop,value)
form, we'd only have to addseq
. - this sequence number could be a globally unique sequence number, which would make it possible to use it as a foreign key in specialized tables, in case properties are needed to describe the resource - property relation itself…
Schema
We could also keep the metadata associated to the properties in the database, instead of being hard-coded and present in the TracIni file, permitting an unification of the representation for fixed fields and custom fields.
Note: the existence of a schema describing the fields doesn't mean that modules can't interpret fields as being special. Quite the opposite, as modules are what provides the real "behavior" of resources. Furthermore, properties not defined in the schema could simply be ignored, so this would allow a great deal of flexibility for plugins when they need to store "special" properties or revision properties.
The Generic Scheme
Actually, for achieving requirement 3. & 4., we need specialized tables, one for each different value column type we want to support:
- {resource}_prop for integer values, identifiers and dates (which are already stored using "bigint")
- ({resource}_prop_float for float values, if really needed)
And we could even differentiate between short and long text values (requirement 4):
- {resource}_prop_string for short text values
- {resource}_prop_text for long text values
See #6986, it will make a difference for MySQL, and I suppose it will matter for Oracle support also.
This set of tables is a "generic scheme" that can be easily created for any kind of "resource". There's also a trade-off here, between using one very generic set of "resource_…" table versus special instances like "ticket_…": having one set of tables presents the risk of introducing too much contention for backends using table locking, and leads to writing queries that will contain lots of difficult to joins. We will already have enough issues with this when writing multicriteria search queries…
For satisfying the requirement 3. (speed), we simply need to have enough indexes:
- on
id
, for retrieving all properties of an instance at once - on
(prop, value)
, for retrieving properties by values (searches and joins)
I don't see an use case yet for indexing on seq
, or to use other combinations.
Identifier
Along the same lines there's also the question of what should be the id: a natural or a surrogate key?
- natural keys
- (id would be 123 for ticket #123, id would be 'milestone1' for milestone1, etc.)
- we have to support different type of keys (text for milestone, int for ticket).
- not a problem for separate tables
- would require resource_int_prop style for resources having an int id … cumbersome
- less compact but easier to "understand"
- renaming is more difficult
- we have to support different type of keys (text for milestone, int for ticket).
- surrogate keys
- (id would be a number in a sequence, never shown as such in the interface)
- only one type of keys (int) - faster, simpler,
- the unique resource_prop table approach is possible
- more compact, not that difficult to read either as there would always be a prop/value pair (
'id'
,<the natural key>
) - renaming is easy (relations are preserved)
This all suggests that using surrogate keys would be preferable.
Relations
Storing the relations between resources is also an interesting topic. We face several alternatives. Let's take the example of ticket and project. This is a simple example because we only have a "has_a / part_of" relation between ticket and project.
- in the
ticket_prop_int
table, via the('project', <project_id>)
prop/value pair(s) (support of multiple projects per ticket is possible as any other multivalued field via multipleseq
) - in a dedicated
ticket_project
relation table(<project_id>, <ticket_id>)
- in a global
project_relation
table(<project_id>, <resource_id>)
- as part of the
project_prop_int
table, via('ticket', <ticket_id>)
prop/value pair(s). This is symmetrical to 1., this time the relations are store on the project side.
We also have the more demanding example of generic ticket vs. ticket relations, as discussed in #31. There we don't want to have to add a new table for each new type of relation, hence a solution of the type 1. / 4. would be more appropriate (3. + a "role" type would essentially be the same as 1. / 4.).
In order to decide between 1. or 4., we need to consider the nature of the relation, whether it's an essential or accidental property.
A ticket doesn't really stand on its own, normally it's part of one project. Remember, in the context of MultiProject, even when there's no project defined, there's the implicit "whole" project. A ticket is eventually part of multiple projects but that's less common, if allowed at all. Even in that case, we can still consider that the project properties are essential properties of the ticket. On the opposite, a project can well exist without any tickets attached to it, so it's pretty clear that the tickets are not essential properties of a project. It's also clear that when we load a project instance, we don't necessarily want to load thousands of ticket properties.
For other relations, it can be less clear cut, we have to take sides. For example, concerning the ticket relations:
- parent/child: a child ticket necessarily has a parent ticket, otherwise it's not a child ticket; the parent ticket may have one or more children tickets, but it can also exist on its own
- blocked-by/is-blocking: a ticket can be blocked-by one or more tickets; the converse relation is-blocking is less strong
- duplicate-of/duplicated-by: a ticket may be qualified to be a duplicate-of of another ticket; that other ticket is duplicated-by one or many tickets, but this is less strong
History
We need to differentiate between the changes to the data, and the metadata about the change. The metadata is about who did the change, when, why the change was made, etc. We can adopt the same flexible strategy as the one for resource properties and store arbitrary key/value pairs of "revision properties".
For implementing a simple db backed VCS, we can even adopt the same generic scheme as described above, but for two derived resource:
- one for snapshotting the resource properties at a given point in time "{resource}_version", e.g.
ticket_version
- one representing the metadata corresponding to the change, describing the act of creating a new version, i.e. revisioning: "{resource}_rev", e.g. a
ticket_rev
(orticket_revision
?)
ticket_version_prop id prop value seq
ticket_rev_prop id prop value seq
The main resource and the derived resources would be related together:
- the main resource would point to the last revision (shortcut, this could be deduced by inspecting the date stored in the corresponding
_rev_prop
table - so in a "normalized" schema we wouldn't need it) - a version resource would point to the corresponding revision which was created at the same time (again, matching with the date would be possible? better use an explicit revision property, though)
Typical example:
ticket_prop 1001 id 1 2021010101 1001 revision 101001 2021010102
ticket_rev_prop 101001 ticket 1001 23232323333 101001 auth 1 23232323334 101001 date 123123211422 23232323335 101001 version 202002 23232323336
ticket_rev_prop_int 202002 author cboos 23232323337 202002 comment random change 23232323338
A given revision instance (here 101001) is usually related to a specific change in one resource, but there could be other situations:
- one change affecting lots of resources (typically #4582 and #5658, ticket batch changes #525)
- changes affecting changes (typically #454); now that we have completed #454, it would be interesting to see what constraints this impose on the new model
Here we see that one change can easily cover multiple tickets, useful for storing metadata related to batch changes, via different sequence numbers for the ticket property, and a corresponding sequence of version property.
The property changes themselves are stored in other tables. We only need the properties that have changed here, no need to store the old/new values for each change, as this can be deduced from the past changes. No need to store the properties which haven't changed either (requirement 4.). Deletions of fields could be represented by setting a field to the NULL value.
See also ticket:6466#comment:10 and follow-ups for a discussion about how ticket changes and in particular ticket change edits, could be handled using this approach (not sure if this still applies, though).
Example for the Complete Model
- surrogate keys for all resources
- int, bigint, short and long text fields
Not absolutely necessary to go that far, this could nevertheless help a lot for the MySQL backend (#6986), possibly also for a future Oracle backend. Don't know about PostgreSQL, but for SQLite this should be indifferent.
bigint
is needed since we use this to store microsecond precision timestamps (#6466).
Schema
The generic scheme for a given resource:
{resource}_prop | |||
---|---|---|---|
id | prop | value | seq |
int | string | biging | bigint |
{resource}_prop_string | |||
id | prop | value | seq |
int | string | string | bigint |
{resource}_prop_text | |||
id | prop | value | seq |
int | string | text | bigint |
The generic scheme for the resource change metadata:
{resource}_rev_prop | |||
---|---|---|---|
id | prop | value | seq |
int | string | bigint | bigint |
{resource}_rev_prop_string | |||
id | prop | value | seq |
int | string | string | bigint |
{resource}_rev_prop_text | |||
id | prop | value | seq |
int | string | text | bigint |
The generic scheme for the resource changes themselves:
{resource}_version_prop | |||
---|---|---|---|
id | prop | value | seq |
int | string | bigint | bigint |
{resource}_version_prop_string | |||
id | prop | value | seq |
int | string | string | bigint |
{resource}_version_prop_text | |||
id | prop | value | seq |
int | string | text | bigint |
Example Dataset
FIXME add example of multivalued property
{resource}_prop | |||
---|---|---|---|
id | prop | value | seq |
0 | id | 130 | 100001 |
{resource}_prop_string | |||
id | prop | value | seq |
0 | reporter | joe | 100002 |
{resource}_prop_text | |||
id | prop | value | seq |
0 | summary | Multiple Project Support | 100003 |
0 | description | One day… | 100004 |
{resource}_rev_prop | |||
---|---|---|---|
id | prop | value | seq |
11 | resource | 0 | 10000010 |
11 | date | (5 years ago) | 10000011 |
12 | resource | 0 | 10000012 |
12 | date | (2 years ago) | 10000013 |
13 | resource | 0 | 10000014 |
13 | date | (1 year ago) | 10000015 |
{resource}_revprop_string | |||
id | prop | value | seq |
11 | author | joe | 10000016 |
12 | author | joe | 10000017 |
13 | author | cboos | 10000018 |
{resource}_revprop_text | |||
id | prop | value | seq |
12 | comment | come on… | 10000019 |
13 | comment | sure… | 10000020 |
{resource}_version | |||
---|---|---|---|
id | prop | value | seq |
21 | resource | 0 | 10000080 |
21 | revision | 11 | 10000081 |
22 | resource | 0 | 10000082 |
22 | revision | 12 | 10000083 |
23 | resource | 0 | 10000084 |
23 | revision | 13 | 10000085 |
{resource}_version_string | |||
id | prop | value | seq |
21 | reporter | joe | 10000086 |
{resource}_version_text | |||
id | prop | value | seq |
21 | summary | Multiple Project Support | 10000087 |
21 | description | Should be easy… | 10000088 |
22 | description | Should be easy… Redmine has it! | 10000089 |
23 | description | One day… | 10000090 |
About the "Schema"
{resource}_schema realm prop metaprop value
Here, possible content for prop could be 'label', 'default', 'order', 'type', etc.
Example:
ticket description type wiki ticket priority type enum ticket priority enum priority ticket priority default normal ticket need_review type checkbox ticket need_review default 0
But we could simply reuse the "generic scheme", instantiated for each resource using "{resource}_schema" resource type. Each instance would correspond to a field specification.
Example:
ticket_schema_prop 10305001 name description 50030303 10305001 type wiki 50030304 10305002 name priority 50030305 10305002 type enum 50030306 10305002 enum priority 50030307 10305002 default normal 50030308 10305003 name need_review 50030309 10305003 type checkbox 50030310 10305003 default 0 50030311
Discussion
- (cklein) Why not implement all of the different
resource_prop*
tables into a single table, where each tuple has multiple attributes, see for example the JBPM datamodel for a working and presumably also fast approach. Here, there exists a process_variable or some similar table that stores all the different value types in single table.
The schema would be like so:table resource_prop { id prop type int_val string_val datetime_val ... } table resource_revprop { change_id prop type int_val string_val ... } table resource_change { id change_id prop type int_val string_val ... }
- (cboos) not sure how you see that as an advantage; each row will waste all the fields but one; there need to be one index for each type, each index having to deal with lots of NULL values, each update will have to rebuild all indexes, etc.). But it could be worth benchmarking anyway…
- (cboos) note that when we discussed a bit GenericTrac with Remy, his natural choice was also towards this approach, so it's certainly worth considering :-)
- (cboos) not sure how you see that as an advantage; each row will waste all the fields but one; there need to be one index for each type, each index having to deal with lots of NULL values, each update will have to rebuild all indexes, etc.). But it could be worth benchmarking anyway…
- And please rename the name field to prop so that it matches the one in the resource_schema table.
- (cboos) done - now I use
prop
consistently to talk about resource property names
- (cboos) done - now I use
- Also I would like to have the resource_schema table extended so that it will support different schemas for, say, different ticket types. That way, users can define their personal ticket type schemas. Of course, derivation would also be nice, but that could be implemented at a later point in time, requiring yet another table. That way we could have both inheritance at the schema level and also multiple different models per realm ;)
table resource_schema { realm urn -- the urn of the ticket schema or wiki page schema to which this belongs to, defaults are for example trac::ticket or trac::milestone or trac::wiki-page and so on prop name type }
The inheritance table for the schemas would then betable resource_schema_inheritance { realm urn base -- the urn of the base schema }
- (cboos) Interesting idea, though for simplicity my initial idea was that one could simply define new realms (bug: vs. enhancement:, both nevertheless managed by the TicketModule)
- inheritance would then provide for also multiple inheritance
- (cboos) much harder ;-)
- PostgreSQL supports array and JSON datatypes that could lead to a very generic, simple and performant design. Maybe the DB abstraction layer can be elevated to allow different backends to use different way to implement the schema, so MySQL and SQLite can emulate the design with the respective appropriate design (EAV?).
Experiments
The usual issue with EAV models is the relative complexity of making queries. If we'd like to support something as flexible as our custom query module (and even more, as we currently don't support multi-valued fields, needed in particular for TracDev/Proposals/TicketLinks), we need to find an efficient way, which doesn't require too many joins.
One idea is to combine two queries, the first to get the ids of the matching objects, the second to retrieve the desired columns.
The two queries could be combined in a single SQL statement, or the set of ids could be first used to see if there is already some corresponding data present in memory that could be reused. Only the remaining ids would then be used for looking up the remaining values (as "full" objects).
Example: looking for the ticket number, summary, priority and blocked-by values for all tickets having a priority lower than 3 and "test" in their summary or as a keyword.
With the single table approach:
CREATE TABLE ticket_prop ( id bigint , prop varchar(16) , ival bigint , tval text , seq bigint ); SELECT p.id, p.prop, p.ival, p.tval, p.seq FROM (SELECT id FROM ticket_prop WHERE prop = 'priority' AND ival < 3 INTERSECT SELECT id FROM (SELECT id FROM ticket_prop WHERE prop = 'summary' AND tval LIKE '%test%' UNION SELECT id FROM ticket_prop WHERE prop = 'keyword' AND tval = 'test' ) ) AS ids LEFT JOIN ticket_prop p ON ids.id = p.id WHERE p.prop in ('id', 'summary', 'priority', 'blocked-by') ORDER BY p.id, p.prop, p.seq;
With the multiple table approach:
CREATE TABLE ticket_prop_int ( id bigint , prop varchar(16) , val bigint , seq bigint ); CREATE TABLE ticket_prop_text ( id bigint , prop varchar(16) , val text , seq bigint ); SELECT p.id as id, p.prop as prop, NULL as ival, p.val as tval, p.seq as seq FROM (SELECT id FROM ticket_prop_int WHERE prop = 'priority' AND val < 3 INTERSECT SELECT id FROM (SELECT id FROM ticket_prop_text WHERE prop = 'summary' AND val LIKE '%test%' UNION SELECT id FROM ticket_prop_text WHERE prop = 'keyword' AND val = 'test' ) ) AS ids LEFT JOIN ticket_prop_text p ON ids.id = p.id WHERE p.prop = 'summary' UNION SELECT p.id as id, p.prop as prop, p.val as ival, NULL as tval, p.seq as seq FROM (SELECT id FROM ticket_prop_int WHERE prop = 'priority' AND val < 3 INTERSECT SELECT id FROM (SELECT id FROM ticket_prop_text WHERE prop = 'summary' AND val LIKE '%test%' UNION SELECT id FROM ticket_prop_text WHERE prop = 'keyword' AND val = 'test' ) ) AS ids LEFT JOIN ticket_prop_int p ON ids.id = p.id WHERE p.prop in ('id', 'priority', 'blocked-by') ORDER BY id, prop, seq;
Ok, maybe there's some clever way to reuse the ids
, but I couldn't find one. Hope the SQL backend is smart enough to detect it's the same subquery ;-)