Edgewall Software

Version 36 (modified by Christian Boos, 14 years ago) ( diff )

#Base: some implementation notes

GenericTrac Data Model

This page defines a new data model for Trac that should be suitable for storing most of the data from resources, along with their change history.

In the longer term, it would be more appropriate to keep the change history in a dedicated system (a version control backend, even a distributed one). In that picture, we could even provide a simple "dbvcs" backend, which would use the database for storing the change history much like we do now.

The main benefits expected from the new model are:

  • simplification of the internals of Trac, especially for the ticket model, in which the storage of changes is quite cumbersome (see #454, #6466)
  • solve a few design problems with the current data model (like #1890, #4582)
  • allow better code reuse and share of the base features among different kinds of resources (like #695, #1113, etc. see #RelatedTickets for more)

Related Proposals

The idea is not new, it has its root in the TracObjectModelProposal, and was refined several times (TracDev/Proposals/DataModel, TracDev/Proposals/Journaling, and this mail).

As this will be a major redesign of the data model, it could also be a good opportunity to take the multiple project considerations into account (#130).

Somehow related to the generic data model, but not strictly depending on it, Trac should also make it possible to implement new kinds of plugins that would perform generic operations on Trac resources. This could allow the (re-)implementation of the TracCrossReferences idea as a plugin, for example. See #31, #886, #6543 and TracDev/Proposals/TracRelations.

Design Discussion

Requirements for the new model:

  1. it has to be simple;
  2. it must be flexible, in order to accommodate different kinds of resources and allow for dynamic evolution, like addition or removal of fields by plugin or via the web admin;
  3. it should remain fast, if not become faster than what we currently have;
  4. it should lead to a more compact representation of data
  5. all the existing constraints about maintaining resource history and associated metadata should be taken into account

Note that the persistence constraints imposed by the Trac data model are not necessarily only (or even best) approached using the RelationalModel, one could imagine that a future version could use more document-oriented persistence layers (e.g. MongoDB), or object-oriented databases (e.g. ZODB). Also, as said above, the versioning of resources should be delegated to a version control backend, with a default, simple, in-database VCS backend.

Resource Content

The ticket model is by far richer data model we would have to support, so we could use it as a basis to lay out the foundations of the new model.

For ticket, we currently have a fixed set of properties (as columns in the ticket table) and a flexible set of properties (as prop/value columns in a ticket_custom table).

Note that we might soon get relations (or links) as well, see #31.

Properties

Let's examine the advantages and disadvantages of each different kind of storage, for the properties:

  1. properties stored in columns of a main table:
    • (-) only flexibility is to not use some fields (e.g. 'severity', 'url')
    • (-) no multiple values per field possible
    • (+) faster especially for multicriteria queries (?)
    • (+) straightforward code (for field1,field2, ... in cursor: ...) (?)
  2. properties stored in prop/value columns of a generic table
    • (+) highest flexibility, add or remove fields at will
    • (+) allow for multiple values per property, provided we don't use a primary key as we currently do for the ticket_custom table (#918)
      • (+) a slight extension would allow ordered multiple values (sequences), otherwise we have no control over the order in which those multiple values are retrieved, which might introduce confusion as this order will vary when new values are added; while primarily useful for int fields, this could also be useful to implement text fields for Oracle (working around the limits on varchars and not having to use CLOBs)
    • (-) slower, less memory efficient, especially for multicriteria queries(?)
    • (-) more complex code (?)

(?) means yet to be verified

In order to reduce the overall complexity, the idea is to pick only one approach, instead of having to support both. By using the second style, we could also have our "fixed" set of properties, while obviously the first style can't support the second.

It remains to be seen whether the second approach is really less efficient than the first, but this doesn't really matter as we anyway have already to pay the price for that flexibility. Note also that by an appropriate use of indexes, we might eventually get better performance compared to what we have today.

The second style (prop/value) can be implemented in several ways:

  • a single properties table with prop and value string columns (this is what our current ticket_custom, system and repository tables do)
  • a single properties table with prop and several typed columns (int_value, string_value, date_value, etc.), only the appropriate column being used (see in the discussion, the JBPM's approach)
  • several properties tables with each prop and value columns, each table dedicated to a given type (int_properties, string_properties, etc.)
  • … more …

So for example the new model could be simply:

ticket_prop
id prop value seq

seq is the sequence number, which is used to support multiple values for the same property.

Notes:

  • newer "resources", like repository in MultiRepos already have this (id,prop,value) form, we'd only have to add seq.
  • this sequence number could be a globally unique sequence number, which would make it possible to use it as a foreign key in specialized tables, in case properties are needed to describe the resource - property relation itself…

Schema

We could also keep the metadata associated to the properties in the database, instead of being hard-coded and present in the TracIni file, permitting an unification of the representation for fixed fields and custom fields.

Note: the existence of a schema describing the fields doesn't mean that modules can't interpret fields as being special. Quite the opposite, as modules are what provides the real "behavior" of resources. Furthermore, properties not defined in the schema could simply be ignored, so this would allow a great deal of flexibility for plugins when they need to store "special" properties or revision properties.

The Generic Scheme

Actually, for achieving requirement 3. & 4., we need specialized tables, one for each different value column type we want to support:

  • {resource}_prop for integer values, identifiers and dates (which are already stored using "bigint")
  • ({resource}_prop_float for float values, if really needed)

And we could even differentiate between short and long text values (requirement 4):

  • {resource}_prop_string for short text values
  • {resource}_prop_text for long text values

See #6986, it will make a difference for MySQL, and I suppose it will matter for Oracle support also.

This set of tables is a "generic scheme" that can be easily created for any kind of "resource". There's also a trade-off here, between using one very generic set of "resource_…" table versus special instances like "ticket_…": having one set of tables presents the risk of introducing too much contention for backends using table locking, and leads to writing queries that will contain lots of difficult to joins. We will already have enough issues with this when writing multicriteria search queries…

For satisfying the requirement 3. (speed), we simply need to have enough indexes:

  • on id, for retrieving all properties of an instance at once
  • on (prop, value), for retrieving properties by values (searches and joins)

I don't see an use case yet for indexing on seq, or to use other combinations.

Identifier

Along the same lines there's also the question of what should be the id: a natural or a surrogate key?

natural keys
(id would be 123 for ticket #123, id would be 'milestone1' for milestone1, etc.)
  • we have to support different type of keys (text for milestone, int for ticket).
    • not a problem for separate tables
    • would require resource_int_prop style for resources having an int id … cumbersome
  • less compact but easier to "understand"
  • renaming is more difficult
surrogate keys
(id would be a number in a sequence, never shown as such in the interface)
  • only one type of keys (int) - faster, simpler,
  • the unique resource_prop table approach is possible
  • more compact, not that difficult to read either as there would always be a prop/value pair ('id',<the natural key>)
  • renaming is easy (relations are preserved)

This all suggests that using surrogate keys would be preferable.

Relations

Storing the relations between resources is also an interesting topic. We face several alternatives. Let's take the example of ticket and project. This is a simple example because we only have a "has_a / part_of" relation between ticket and project.

  1. in the ticket_prop_int table, via the ('project', <project_id>) prop/value pair(s) (support of multiple projects per ticket is possible as any other multivalued field via multiple seq)
  2. in a dedicated ticket_project relation table (<project_id>, <ticket_id>)
  3. in a global project_relation table (<project_id>, <resource_id>)
  4. as part of the project_prop_int table, via ('ticket', <ticket_id>) prop/value pair(s). This is symmetrical to 1., this time the relations are store on the project side.

We also have the more demanding example of generic ticket vs. ticket relations, as discussed in #31. There we don't want to have to add a new table for each new type of relation, hence a solution of the type 1. / 4. would be more appropriate (3. + a "role" type would essentially be the same as 1. / 4.).

In order to decide between 1. or 4., we need to consider the nature of the relation, whether it's an essential or accidental property.

A ticket doesn't really stand on its own, normally it's part of one project. Remember, in the context of MultiProject, even when there's no project defined, there's the implicit "whole" project. A ticket is eventually part of multiple projects but that's less common, if allowed at all. Even in that case, we can still consider that the project properties are essential properties of the ticket. On the opposite, a project can well exist without any tickets attached to it, so it's pretty clear that the tickets are not essential properties of a project. It's also clear that when we load a project instance, we don't necessarily want to load thousands of ticket properties.

For other relations, it can be less clear cut, we have to take sides. For example, concerning the ticket relations:

  • parent/child: a child ticket necessarily has a parent ticket, otherwise it's not a child ticket ; the parent ticket may have one or more children tickets, but it can also exist on its own
  • blocked-by/is-blocking: a ticket can be blocked-by one or more tickets ; the converse relation is-blocking is less strong
  • duplicate-of/duplicated-by: a ticket may be qualified to be a duplicate-of of another ticket ; that other ticket is duplicated-by one or many tickets, but this is less strong

History

We need to differentiate between the changes to the data, and the metadata about the change. The metadata is about who did the change, when, why the change was made, etc. We can adopt the same flexible strategy as the one for resource properties and store arbitrary key/value pairs of "revision properties".

For implementing a simple db backed VCS, we can even adopt the same generic scheme as described above, but for two derived resource:

  • one for snapshotting the resource properties at a given point in time "{resource}_version", e.g. ticket_version
  • one representing the metadata corresponding to the change, describing the act of creating a new version, i.e. revisioning: "{resource}_rev", e.g. a ticket_rev (or ticket_revision?)
ticket_version_prop
id prop value seq
ticket_rev_prop
id prop value seq

The main resource and the derived resources would be related together:

  • the main resource would point to the last revision (shortcut, this could be deduced by inspecting the date stored in the corresponding _rev_prop table - so in a "normalized" schema we wouldn't need it)
  • a version resource would point to the corresponding revision which was created at the same time (again, matching with the date would be possible? better use an explicit revision property, though)

Typical example:

ticket_prop
1001 id 1 2021010101
1001 lastchange 101001
ticket_rev_prop
101001 ticket 1001 23232323333
101001 auth 1 23232323334
101001 date 123123211422 23232323335
101001 version 202002 23232323336
ticket_rev_prop_int
202002 author cboos 23232323337
202002 comment random change 23232323338

A given revision instance (here 101001) is usually related to a specific change in one resource, but there could be other situations:

  • one change affecting lots of resources (typically #4582 and #5658, ticket batch changes #525)
  • changes affecting changes (typically #454); now that we have completed #454, it would be interesting to see what constraints this impose on the new model

Here we see that one change can easily cover multiple tickets, useful for storing metadata related to batch changes, via different sequence numbers for the ticket property, and a corresponding sequence of version property.

The property changes themselves are stored in other tables. We only need the properties that have changed here, no need to store the old/new values for each change, as this can be deduced from the past changes. No need to store the properties which haven't changed either (requirement 4.). Deletions of fields could be represented by setting a field to the NULL value.

See also ticket:6466#comment:10 and follow-ups for a discussion about how ticket changes and in particular ticket change edits, could be handled using this approach (not sure if this still applies, though).

Possible Implementation Plan

Base

The model is relatively simple to implement.

There will be a few generic utility functions for creating a set tables for the "generic scheme", and one for versioned resource:

  def create_resource_tables(env, resource):
     """Create resource_prop, _prop_string and _prop_text tables."""
     ...

  def create_versioned_resource_tables(env, resource):
     """Create a generic scheme for resource plus related ancillary
     versioning resource (_rev and _version)"""
         with env.db_transaction():
             for r in (resource, 
                       resource + "_rev", 
                       resource + "_version"):
                 create_generic_scheme(r)

On the model level, we could create a few utility classes:

  • a "view" (read-only dictionary), which will contain the current values for an instance (i.e. all the records within {resource}_prop... tables with the same id)
  • an "abstractmodel", object wrapping a view (for keeping around the "old" values), but mutable
  • "model", concrete subclass of the above, which implements loading by fetching the data from a set of tables following the generic scheme

Note that views could eventually be cached, somehow (need to think carefully about how to do cross-process invalidation, though; there's certainly a way on top of the CacheManager).

Applied on Milestone module

  • modify the Milestone module so that it uses the new proposed datamodel.
  • experiment new tabbed view for the milestone (View, Discussion, History). See TracProject/UiGuidelines.
  • milestone should be able to have attachments, too (#3068)
  • adapt the Roadmap module to the new model
  • adapt the Milestone admin component to the new model

Once this is complete, validate the genericity by promoting the components to be first class resources as well (#1233).

Applied on Ticket module

As the ticket module is by far the most complex, it might be worth to try out the new model there first:

  • we could verify that we meet the expectations in terms of code simplification, solving open issues, etc.
  • we could detect early if there are no regressions or risk of losing current features
  • by redeploying the ticket infrastructure to the other components, we could spread the most benefits of tickets (comments, custom fields, queries, etc.) to other resources (milestone, wiki, component, …)

Applied on Wiki module

Milestone and components are closer in spirit to a wiki page than to a ticket (we have the long standing #3776 ticket).

We could as well start simply on the wiki model and wiki module, to rework the basics there, and once this is done port the changes to the milestone module, then work on the components.

If we do this, we can also at the same time think about the storage model, in particular how past versions could be stored in a VCS (#1132). We could then think about a serialization model that would play nicely with the genericity of the Trac resources on one hand side, and with the external editors on the other hand.

Applied on Component module

An other possibility would be to start on fresh ground, implementing not yet existing resource typse in a generic way. That would be the less disrupting approach, as everything else should just continue to work unchanged (or mostly unchanged, except for things like #5211 and places where we deal with the old component table).

Related Tickets

  • Data model issues:
    #4130
    Ticket Component Interface Fails to Properly Check for Existing Component
    #11301
    intermittent failure with notification test
  • Resource related:
    #150
    User-centric storage. Last-viewed information and 'intelligent' followup
    #221
    Creating TR for multiple components
    #695
    Keywords for wiki pages
    #787
    Change attachment description
    #918
    [patch] Custom Ticket Fields should support the multiple selection type
    #1113
    Show milestone changes in timeline
    #1386
    Adding properties to Wiki Pages
    #1395
    Text box for duplicate when a bug is a duplicate
    #1678
    Show component view in timeline view for checkins
    #1835
    Add a discussion tab for Trac resources
    #2035
    Changeset commenting
    #2344
    sub milestone
    #2464
    Conditional fields in tickets
    #2465
    Add "custom query" functionality for changesets
    #2467
    Link user name in reports to custom query showing that user's open tickets
    #2662
    assign tickets to multiple users
    #2961
    custom comment fields
    #3003
    milestone could be a ticket
    #3080
    Custom field sorts only as text
    #3718
    Trac should use HTTP 301 Moved when milestones get renamed
    #3911
    implement an object system (xwiki, roundup)
    #4588
    User Page: trac links to these pages
    #5211
    Ticket - Wiki Integration enhancement/suggestion
    #7871
    Add start date to a milestone
    #8335
    Create People page with user profiles and status of who's doing what.
    #9263
    if SVN is used, milestone should manage/allowtoassign links for branch and tag

See also: Brainstorm for examples and "test" of the ideas, and discussion.

Note: See TracWiki for help on using the wiki.