= GenericTrac Data Model = 

This page defines a new data model for Trac that should be suitable
for storing most of the data from resources, along with their change history.

The main benefits expected from the new model are:
 - simplification of the internals of Trac, especially for the ticket model,
   in which the storage of changes is quite cumbersome (see #454, #6466)
 - solve a few design problems with the current data model (like #1890, #4582)
 - allow better code reuse and share of the base features 
   among different kinds of resources (like #695, #1113, etc.
   see [#RelatedTickets] for more)

This stems from the following former proposals:
 - TracObjectModelProposal
 - TracDev/Proposals/DataModel
 - TracDev/Proposals/Journaling

See also [googlegroups:trac-dev:8cf3f5fe0e476ce5 this mail].

As this will be a major redesign of the data model, 
it could also be a good opportunity to take the 
''[TracMultipleProjects multiple project]'' considerations into account (#130).

Somehow related to the generic data model, but not strictly depending on it,
Trac should also make it possible to implement new kinds of plugins that would
perform ''generic'' operations on Trac resources.
This could allow the (re-)implementation of the TracCrossReferences idea
as a plugin, for example.
See #6543 and TracDev/Proposals/TracRelations.


== Design Discussion ==

Requirements for the new model:
 1. it has to be ''simple'';
 2. it must be ''flexible'', in order to accommodate different kinds of resources and   
    allow for dynamic evolution;
 3. it should remain ''fast'', if not faster than what we currently have;
 4. it should lead to a more ''compact'' representation of data

=== Resource Content ===

The ticket model is by far richer data model we would have to support, 
so we could take this as a basis to lay out the foundations of the new model.
For ticket, we currently have a fixed set of properties
(as columns in the `ticket` table)
and a flexible set of properties
(as prop/value columns in a `ticket_custom` table).

Both styles have advantages and disadvantages:
 1. properties as columns:
   - (-) only flexibility is to not use some fields (e.g. severity) 
   - (-) no multiple values per field possible  
   - (+) faster (?)
   - (+) straightforward code (`for field1,field2, ... in cursor: ...`) (?)
 2. properties in prop/value columns
   - (+) highest flexibility, add or remove fields at will
   - (+) allow for multiple values per property, provided we don't use a primary key
     as we currently do for the `ticket_custom` table (#918)
     - (+) a slight extension would allow ordered multiple values (sequences), otherwise we have no control over the order in which those multiple values are retrieved, which might introduce confusion as this order will vary when new values are added;
       while primarily useful for `int` fields, this could also be useful to implement `text` fields for Oracle (working around the limits on varchars and not having to use CLOBs)
   - (-) slower, less memory efficient (?)
   - (-) more complex code (?)
(?) means ''yet to be verified''

In order to reduce the overall complexity, the idea would be to pick only one approach, instead of having to support both.
By using the second style, we could also have our "fixed" set of properties, 
while obviously the first style can't support the second.

It remains to be seen whether the second approach is really less efficient than the first, but this doesn't really matter as we anyway have already to pay the 
price for that flexibility.
Note also that by an appropriate use of indexes, we might eventually get 
''better'' performance compared to what we have today.

So the new model could be simply:

'''ticket'''
||= id ||= prop ||= value ||= seq ||

//seq// is the sequence number in case of multiple entries with the same property name.

or even:

'''resource_prop'''
||= realm ||= id ||= prop ||= value ||= seq ||
(if we use one mega table for all resources)



Note by the way that newer "resources", like `repository` in MultiRepos already have this `(id,prop/value)` form.

We could also keep the metadata associated to the properties in the database,
instead of being hard-coded and present in the TracIni file, permitting an
unification of the representation for fixed fields and custom fields.

'''resource_schema'''
||= realm ||= prop ||= metaprop ||= value ||

Here, possible content for ''prop'' could be 'label', 'default', 'order', 'type', etc.

Example:
|| ticket || description || type || wiki ||
|| ticket || priority || type || enum ||
|| ticket || priority || enum || priority ||
|| ticket || priority || default || normal ||
|| ticket || need_review || type || checkbox ||
|| ticket || need_review || default || 0 ||

Note: the existence of a schema describing the fields doesn't mean that modules
can't interpret fields as being special. Quite the opposite, as modules are what
provides the real "behavior" of resources.
Furthermore, properties not defined in the schema could simply be ignored, so this would allow a great deal of flexibility for plugins when they need to store "special" properties or revision properties.

As a possible refining, it could be possible to have specialized tables, 
one for each different value column type we want to support:
 - '''resource_prop''' for text values
 - '''resource_prop_int''' for integer values
 - ('''resource_prop_float''' for float values, if really needed)
And we could even differentiate between short and long text values (requirement 4):
 - '''resource_prop''' for short text values 
 - '''resource_prop_text''' for long text values
(see #6986).


Along the same lines there's also the question of what should be the ''id'':
a natural or a surrogate key?
 natural keys:: (''id'' would be 123 for ticket !#123, id would be 'milestone1' for milestone1, etc.)
   - we have to support different type of keys (text for milestone, int for ticket).
     - not a problem for separate tables
     - would require ''resource_int_prop'' style for resources having an ''int'' 
       id ... cumbersome
   - less compact but easier to "understand"
   - renaming is more difficult
 surrogate keys:: (''id'' would be a number in a sequence, never shown as such in the interface)
   - only one type of keys (int) - faster, simpler, 
     the unique ''resource_prop'' table approach is possible
   - more compact, not that difficult to read either (there would always be a 
     ''prop=id'', ''value=the natural key'' entry
   - renaming is easy (relations preserved)

This suggests that using surrogate keys would be preferable.


Now if this is the case, the '''resource_prop''' table could as well become:
||= id ||= prop ||= value ||= //seq//
and the ''realm'' information could simply be stored as another prop/value entry.


=== Resource History ===

We need to differentiate between the changes to the data, and the metadata about the change. The metadata is about who did the change, when, why the change was made, etc.
We can adopt the same flexible strategy as the one for resource properties and
store arbitrary key/value pairs of "revision properties".

'''resource_revprop'''
|| ''changeid'' || ''revprop'' || ''value'' ||

Typical example:
|| 101001 || author  || cboos         ||
|| 101001 || auth    || 1             ||
|| 101001 || date    || 1231232114.12 ||
|| 101001 || comment || random change ||

A given ''changeid'' is usually related to a specific change in one resource,
but there could be other situations:
 - one change affecting lots of resources (typically #4582 and #5658, ticket batch changes #525)
 - changes affecting changes (typically #454); now that we have completed #454, it would be interesting to see what constraints this impose on the new model


The property changes themselves are stored in other tables.
We only need the changed properties here, no need to store the old/new values
for each change, as this can be deduced from the past changes.
Deletions of fields should be represented by setting a field to the NULL value.

Several possibilities here:

'''ticket_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||

'''milestone_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||

or:

'''resource_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||
(surrogate key approach)

The latter has the advantage that it would make easy to relate a given ``changeid``
to the resource(s) that were affected by the change, without having to go through
each resource table.

We could also keep all property changes as text values
or have extra `..._int` (`..._float`) tables for more compact 
representation.

See also ticket:6466#comment:10 and follow-ups for a discussion about how ticket changes and in particular ticket change edits, could be handled using this approach.

== The Model ==

To summarize the above discussion, here's what could be the new model.

In the following:
 - ''int'' stands for the integer column type
 - ''string'' stands for a short text column type (e.g. varchar(65))
 - ''text''  stands for a long text column type, of unbounded size

=== Minimal Model ===
 - surrogate keys for all resources
 - only text fields


{{{
#!div style="width: 30%; float: left"

==== Schema ====

'''resource_schema'''
||= realm ||= prop ||= metaprop ||= value ||
|| string || string || string || text ||

'''resource_prop'''
||= id ||= prop ||= value ||= seq ||
|| int || string || text || int ||

'''resource_revprop'''
||= changeid ||= revprop ||= value ||= seq ||
|| int || string || text || int ||

'''resource_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| int || int || string || text || int ||
}}}
{{{
#!div style="width: 60%; border: 2px dotted #ddd; padding: 0 0 1em 3em ; float: left;"

==== Example Dataset ====

'''resource_schema'''
||= realm ||= prop ||= metaprop ||= value ||
|| ticket    || summary     || type     || text ||
|| ticket    || description || type     || wiki ||
|| ticket    || reporter    || type     || text ||

'''resource_prop'''
||= id ||= prop ||= value ||= seq ||
|| 0||   id      ||   130     || 0||
|| 0|| summary   ||Multiple Project Support|| 0||
|| 0||description||One day... || 0||
|| 0|| reporter  ||   joe     || 0||

'''resource_revprop'''
||= changeid ||= revprop ||= value ||= seq ||
|| 1|| author  ||    joe    || 0||
|| 1|| date    ||5 years ago|| 0||
|| 2|| author  ||    joe    || 0||
|| 2|| date    ||2 years ago|| 0||
|| 2|| comment || come on...|| 0||
|| 3|| author  ||   cboos   || 0||
|| 3|| date    ||1 year ago || 0||
|| 3|| comment || sure...   || 0||

'''resource_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| 0|| 1|| id        ||130|| 0||
|| 0|| 1|| summary   ||Multiple Project Support|| 0||
|| 0|| 1||description||Should be easy...|| 0||
|| 0|| 1|| reporter  ||joe|| 0||
|| 0|| 2||description||Should be easy... Redmine has it! || 0||
|| 0|| 3||description||One day...|| 0||
}}}

[[html(<br style="clear: both" />)]]


=== Intermediate Model ===
 - surrogate keys for all resources
 - text and int fields (same as [#CompleteModel Complete Model], without the `*_string` tables)

The [#MinimalModel minimal model] above is handy for showing the essence of the new model,
but it's too simple in practice.

I think we need at the very least to support 'integer' type columns, useful for storing dates efficiently, boolean values, and relations to other resources (as the surrogate ''id''s will be integers).


=== Complete Model ===
 - surrogate keys for all resources
 - int, bigint, short and long text fields

Not absolutely necessary to go that far, this could nevertheless help a lot for the MySQL backend (#6986), possibly also for a future Oracle backend. Don't know about PostgreSQL, but for SQLite this should be indifferent.

`bigint` is needed since we use this to store microsecond precision timestamps (#6466).

{{{
#!div style="width: 30%; float: left"

==== Schema ====

'''resource_schema'''
||= realm ||= prop ||= metaprop ||= value ||
|| string || string || string || text ||


'''resource_prop'''
||= id ||= prop ||= value ||= seq ||
|| int || string || text || int ||

'''resource_revprop'''
||= changeid ||= revprop ||= value ||= seq ||
|| int || string || text || int ||

'''resource_change'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| int || int || string || text || int ||


'''resource_prop_string'''
||= id ||= prop ||= value ||= seq ||
|| int || string || string || int ||

'''resource_revprop_string'''
||= changeid ||= revprop ||= value ||= seq ||
|| int || string || string || int ||

'''resource_change_string'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| int || int || string || string || int ||


'''resource_prop_int'''
||= id ||= prop ||= value ||= seq ||
|| int || string || int || int ||

'''resource_revprop_int'''
||= changeid ||= revprop ||= value ||= seq ||
|| int || string || int || int ||

'''resource_change_int'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| int || int || string || int || int ||

}}}
{{{
#!div style="width: 60%; border: 2px dotted #ddd; padding: 0 0 1em 3em ; float: left;"

==== Example Dataset ====

**FIXME** add example of multivalued property

'''resource_schema'''
||= realm ||= prop ||= metaprop ||= value ||
|| ticket || summary || type || text ||
|| ticket || description || type || wiki ||
|| ticket || reporter || type || string ||

----

'''resource_prop'''
||= id ||= prop ||= value ||= seq ||
|| 0|| summary   ||Multiple Project Support|| 0||
|| 0||description||One day... || 0||

'''resource_revprop'''
||= changeid ||= revprop ||= value ||= seq ||
|| 2|| comment  || come on...|| 0||
|| 3|| comment  || sure...   || 0||

'''resource_change'''
||= changeid ||= revprop ||= value ||= seq ||
|| 0 || 1||  summary  ||Multiple Project Support|| 0||
|| 0 || 1||description||Should be easy...|| 0||
|| 0 || 2||description||Should be easy... Redmine has it!|| 0||
|| 0 || 3||description||One day...|| 0||

----

'''resource_prop_string'''
||= id ||= prop ||= value ||= seq ||
|| 0|| reporter || joe || 0||

'''resource_revprop_string'''
||= changeid ||= revprop ||= value ||= seq ||
|| 1|| author || joe    || 0||
|| 2|| author || joe    || 0||
|| 3|| author || cboos  || 0||

'''resource_change_string'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| 0|| 1|| reporter ||joe|| 0||

----

'''resource_prop_int'''
||= id ||= prop ||= value ||= seq ||
|| 0 || id || 130 || 0 ||

'''resource_revprop_int'''
||= changeid ||= revprop ||= value ||= seq ||
|| 1||   date   ||5 years ago|| 0||
|| 2||   date   ||2 years ago|| 0||
|| 3||   date   ||1 year ago || 0||

'''resource_change_int'''
||= id ||= changeid ||= prop ||= value ||= seq ||
|| 0|| 1|| id ||130||0||

}}}

[[html(<br style="clear: both" />)]]


'''Comments''' (cklein)
 - Why not implement all of the different `resource_prop*` tables into a single table, where each tupel has multiple attributes, see for example the JBPM datamodel for a working and presumably also fast approach. Here, there exists a process_variable or some similar table that stores all the different value types in single table. 
   - (cboos) not sure how you see that as an advantage; each row will waste all the fields but one; there need to be one index for each type, each index having to deal with lots of NULL values, each update will have to rebuild all indexes, etc.). But it could be worth benchmarking anyway...
  The schema would be like so:
{{{
table resource_prop
{
id
prop
type
int_val
string_val
datetime_val
...
}

table resource_revprop
{
change_id
prop
type
int_val
string_val
...
}

table resource_change
{
id
change_id
prop
type
int_val
string_val
...
}
}}}
 - And please rename the ''name'' field to ''prop'' so that it matches the one in the resource_schema table.
   - (cboos) done - now I use `prop` consistently to talk about resource property keys, `revprop` to talk about change property keys and `metaprop` in the schema (as those are properties of properties)
 - Also I would like to have the resource_schema table extended so that it will support different schemas for, say, different ticket types. That way, users can define their personal ticket type schemas. Of course, derivation would also be nice, but that could be implemented at a later point in time, requiring yet another table. That way we could have both inheritance at the schema level and also multiple different models per realm ;)
  {{{
table resource_schema
{
  realm
  urn    -- the urn of the ticket schema or wiki page schema to which this belongs to, defaults are for example trac::ticket or trac::milestone or trac::wiki-page and so on
  prop
  name
  type
}
  }}}
  The inheritance table for the schemas would then be
  {{{
table resource_schema_inheritance
{
  realm
  urn
  base -- the urn of the base schema
}
  }}}
   - (cboos) Interesting idea, though for simplicity my initial idea was that one could simply define new realms (bug: vs. enhancement:, both nevertheless managed by the TicketModule)
 - inheritance would then provide for also multiple inheritance
   - (cboos) much harder ;-)

== Possible Implementation Plan ==

=== Milestone First ===
 - modify the Milestone module so that it uses the new proposed datamodel.
 - experiment new tabbed view for the milestone (''View'', ''Discussion'', ''History''). See TracProject/UiGuidelines.
 - milestone should be able to have attachments, too (#3068)
 - adapt the Roadmap module to the new model
 - adapt the Milestone admin component to the new model

Once this is complete, validate the genericity by promoting the components to be first class resources as well (#1233).

=== Ticket First ===

As the ticket module is by far the most complex, it might be worth to
try out the new model there first:
 - we could verify that we meet the expectations in terms of code simplification,
   solving open issues, etc.
 - we could detect early if there are no regressions or risk of losing current features
 - by redeploying the ticket infrastructure to the other components, 
   we could spread the most benefits of tickets (comments, 
   custom fields, queries, etc.) to other resources (milestone, wiki, component, ...)

=== Wiki First ===

Milestone and components are closer in spirit to a wiki page than to a ticket (we have the long standing #3776 ticket).

We could as well start simply on the wiki model and wiki module, to rework the basics there, and once this is done port the changes to the milestone module, then work on the components.

If we do this, we can also at the same time think about the storage model, in particular how past versions could be stored in a VCS (#1132). We could then think about a serialization model that would play nicely with the genericity of the Trac resources on one hand side, and with the external editors on the other hand.

== Related Tickets ==

 - Data model issues: [[TicketQuery(status=!closed&keywords=~model)]]
 - Resource related: [[TicketQuery(status=!closed&keywords=~tracobject)]]