Edgewall Software

Changes between Version 5 and Version 6 of GenericTrac/Brainstorm


Ignore:
Timestamp:
Nov 25, 2014, 10:59:15 PM (9 years ago)
Author:
Christian Boos
Comment:

salvaging parts of [GenericTrac@40], the #EAV approach

Legend:

Unmodified
Added
Removed
Modified
  • GenericTrac/Brainstorm

    v5 v6  
    11= Ideas for the GenericTrac data model =
    22
    3 We list a few ideas about what could be the new model.
     3We regroup here the various ideas and questions about what could be the new model. It's nowhere near a proposal, more a scratch pad and is more dated than what's in the GenericTrac page itself.
     4
    45
    56In the following:
     
    78 - ''string'' stands for a short text column type (e.g. varchar(65))
    89 - ''text''  stands for a long text column type, of unbounded size
     10
     11
     12== EAV-style model #EAV
     13
     14This was the original idea for implementing a generic data model. Starting from the particular example of the `ticket` data model and seeing that we had both a main table and a secondary table for the custom properties,
     15in order to reduce the overall complexity, the idea was to pick only one approach, instead of having to support both.
     16The main argument in favor of a change is that currently (for the `ticket` resources) we //already// implement both styles. The point is that by using //only// the second style (but maybe in a more efficient way), we could also have our "fixed" set of properties, while obviously the opposite is not true, the first style can't provide the flexibility provided by the second style.
     17
     18And it remains to be seen whether the second approach is really less efficient than the first, as by an appropriate use of indexes we might eventually get ''better'' performance than what we have today.
     19
     20The second style (prop/value) can be implemented in several ways:
     21 - a single `properties` table with `prop` and `value` string columns (this is what our current `ticket_custom`, `system` and `repository` tables do)
     22 - a single `properties` table with `prop` and several typed columns (`int_value`, `string_value`, `date_value`, etc.), only the appropriate column being used (see in the discussion, the [[#JBPM-approach|JBPM's approach]])
     23 - several `properties` tables with each `prop` and `value` columns, each table dedicated to a given type (`int_properties`, `string_properties`, etc.)
     24 - ... more ...
     25
     26
     27So for example the new model could be simply:
     28
     29 ||||||||=     '''ticket_prop''    ||
     30 ||= id ||= prop ||= value ||= seq ||
     31
     32//seq// is the sequence number, which is used to support multiple values for the same property.
     33
     34Notes:
     35 - newer "resources", like `repository` in MultiRepos already have this `(id,prop,value)` form, we'd only have to add `seq`.
     36 - this sequence number could be a globally unique sequence number, which would make it possible to use it as a foreign key in specialized tables, in case properties are needed to describe the resource - property relation itself...
     37
     38==== Schema
     39We could also keep the metadata associated to the properties in the database,
     40instead of being hard-coded and present in the TracIni file, permitting an
     41unification of the representation for fixed fields and custom fields.
     42
     43
     44Note: the existence of a schema describing the fields doesn't mean that modules
     45can't interpret fields as being special. Quite the opposite, as modules are what
     46provides the real "behavior" of resources.
     47Furthermore, properties not defined in the schema could simply be ignored, so this would allow a great deal of flexibility for plugins when they need to store "special" properties or revision properties.
     48
     49==== The Generic Scheme #generic-scheme
     50Actually, for achieving requirement 3. & 4., we need specialized tables,
     51one for each different value column type we want to support:
     52 - '''{resource}_prop''' for integer values, identifiers and dates (which are already stored using "bigint")
     53 - ('''{resource}_prop_float''' for float values, if really needed)
     54And we could even differentiate between short and long text values (requirement 4):
     55 - '''{resource}_prop_string''' for short text values
     56 - '''{resource}_prop_text''' for long text values
     57See #6986, it will make a difference for MySQL, and I suppose it will matter for Oracle support also.
     58
     59This set of tables is a "generic scheme" that can be easily created for any kind of "resource". There's also a trade-off here, between using one very generic set of "resource_..." table versus special instances like "ticket_...": having one set of tables presents the risk of introducing too much contention for backends using table locking, and leads to writing queries that will contain lots of difficult to joins. We will already have enough issues with this when writing multicriteria search queries...
     60
     61For satisfying the requirement 3. (speed), we simply need to have enough indexes:
     62 - on `id`, for retrieving all properties of an instance at once
     63 - on `(prop, value)`, for retrieving properties by values (searches and joins)
     64
     65I don't see an use case yet for indexing on `seq`, or to use other combinations.
     66
     67
     68==== Identifier
     69
     70Along the same lines there's also the question of what should be the ''id'':
     71a natural or a surrogate key?
     72 natural keys:: (''id'' would be 123 for ticket !#123, id would be 'milestone1' for milestone1, etc.)
     73   - we have to support different type of keys (text for milestone, int for ticket).
     74     - not a problem for separate tables
     75     - would require ''resource_int_prop'' style for resources having an ''int''
     76       id ... cumbersome
     77   - less compact but easier to "understand"
     78   - renaming is more difficult
     79 surrogate keys:: (''id'' would be a number in a sequence, never shown as such in the interface)
     80   - only one type of keys (int) - faster, simpler,
     81   - the unique ''resource_prop'' table approach is possible
     82   - more compact, not that difficult to read either as there would always be a prop/value pair (`'id'`,`<the natural key>`)
     83   - renaming is easy (relations are preserved)
     84
     85This all suggests that using surrogate keys would be preferable.
     86
     87==== Relations
     88
     89Storing the relations between resources is also an interesting topic.
     90We face several alternatives. Let's take the example of //ticket// and //project//. This is a simple example because we only have a "has_a / part_of" relation between ticket and project.
     91 1. in the `ticket_prop_int` table, via the `('project', <project_id>)` prop/value pair(s) (support of multiple projects per ticket is possible as any other multivalued field via multiple `seq`)
     92 2. in a dedicated `ticket_project` relation table `(<project_id>,  <ticket_id>)`
     93 3. in a global `project_relation` table `(<project_id>,  <resource_id>)`
     94 4. as part of the `project_prop_int` table, via `('ticket', <ticket_id>)` prop/value pair(s). This is symmetrical to 1., this time the relations are store on the project side.
     95
     96We also have the more demanding example of generic //ticket// vs. //ticket// relations, as discussed in #31.
     97There we don't want to have to add a new table for each new type of relation, hence a solution of the type 1. / 4. would be more appropriate (3. + a "role" type would essentially be the same as 1. / 4.).
     98
     99In order to decide between 1. or 4., we need to consider the nature of the relation, whether it's an essential or accidental property.
     100
     101A ticket doesn't really stand on its own, normally it's part of one project. Remember, in the context of MultiProject, even when there's no project defined, there's the implicit "whole" project. A ticket is eventually part of multiple projects but that's less common, if allowed at all. Even in that case, we can still consider that the //project// properties are essential properties of the ticket.
     102On the opposite, a project can well exist without any tickets attached to it, so it's pretty clear that the tickets are not essential properties of a project. It's also clear that when we load a project instance, we don't necessarily want to load thousands of ticket properties.
     103
     104For other relations, it can be less clear cut, we have to take sides. For example, concerning the ticket relations:
     105 - ''**parent**/child'': a child ticket necessarily has a ''parent'' ticket, otherwise it's not a child ticket; the parent ticket may have one or more children tickets, but it can also exist on its own
     106 - ''**blocked-by**/is-blocking'': a ticket can be ''blocked-by'' one or more tickets; the converse relation ''is-blocking'' is less strong
     107 - ''**duplicate-of**/duplicated-by'': a ticket may be qualified to be a ''duplicate-of'' of another ticket; that other ticket is ''duplicated-by'' one or many tickets, but this is less strong
     108
     109==== History
     110
     111We need to differentiate between the changes to the data, and the metadata about the change. The metadata is about who did the change, when, why the change was made, etc.
     112We can adopt the same flexible strategy as the one for resource properties and
     113store arbitrary key/value pairs of "revision properties".
     114
     115For implementing a simple db backed VCS, we can even adopt the same [#generic-scheme generic scheme] as described above, but for two derived resource:
     116 - one for snapshotting the resource properties at a given point in time "{resource}_version", e.g. `ticket_version`
     117 - one representing the metadata corresponding to the change, describing the act of creating a new version, i.e. revisioning:  "{resource}_rev", e.g. a `ticket_rev` (or `ticket_revision`?)
     118
     119 ||||||||=  '''ticket_version_prop'''  ||
     120 ||= id ||= prop ||= value ||= seq ||
     121
     122 ||||||||=  '''ticket_rev_prop'''  ||
     123 ||= id ||= prop ||= value ||= seq ||
     124
     125The main resource and the derived resources would be related together:
     126 - the main resource would point to the last ''revision'' (shortcut, this could be deduced by inspecting the ''date'' stored in the corresponding `_rev_prop` table - so in a "normalized" schema we wouldn't need it)
     127 - a ''version'' resource would point to the corresponding ''revision'' which was created at the same time (again, matching with the ''date'' would be possible? better use an explicit ''revision'' property, though)
     128
     129Typical example:
     130 ||||||||=  '''ticket_prop'''  ||
     131 || 1001 || id || 1 || 2021010101 ||
     132 || 1001 || revision || 101001 ||  2021010102 ||
     133
     134 ||||||||=  '''ticket_rev_prop'''  ||
     135 || 101001 || ticket  || 1001         ||  23232323333 ||
     136 || 101001 || auth    || 1            ||  23232323334 ||
     137 || 101001 || date    || 123123211422 ||  23232323335 ||
     138 || 101001 || version || 202002       ||  23232323336 ||
     139
     140 ||||||||=  '''ticket_rev_prop_int''  ||
     141 || 202002 || author  || cboos        ||  23232323337 ||
     142 || 202002 || comment || random change ||  23232323338 ||
     143
     144A given revision instance (here 101001) is usually related to a specific change in one resource,
     145but there could be other situations:
     146 - one change affecting lots of resources (typically #4582 and #5658, ticket batch changes #525)
     147 - changes affecting changes (typically #454); now that we have completed #454, it would be interesting to see what constraints this impose on the new model
     148
     149Here we see that one change can easily cover multiple tickets, useful for storing metadata related to batch changes, via different sequence numbers for the ''ticket'' property, and a corresponding sequence of ''version'' property.
     150
     151The property changes themselves are stored in other tables.
     152We only need the properties that have changed here, no need to store the old/new values for each change, as this can be deduced from the past changes. No need to store the properties which haven't changed either (requirement 4.).
     153Deletions of fields could be represented by setting a field to the NULL value.
     154
     155
     156See also ticket:6466#comment:10 and follow-ups for a discussion about how ticket changes and in particular ticket change edits, could be handled using this approach //(not sure if this still applies, though)//.
     157
     158
    9159
    10160