Edgewall Software

Version 45 (modified by Christian Boos, 9 years ago) ( diff )

expand on the role of metadata (custom fields, sub-types)

GenericTrac Data Model

This page defines a new data model for Trac that should be suitable for manipulating and storing most of the data from resources, along with their change history.

There are effectively two parts to this proposal:

  • the API level, how resource data is presented to the program
  • the storage level, how the resource data are stored on persistent memory

The high level API is intended to unify and take the best aspects of the different model used so far (WikiPage, Ticket, Milestone, etc.). Actually, only the Ticket is really flexible and nice, the others are just simpler and much more limited models, so it's rather an "embrace and extend" move from the Ticket model to the others…

While the high level API tends to unify things, the low level API will be there to introduce some diversity in the storage, at best transparent to the "user" of the higher level API. Those different storage models could be different table layouts in the database, but also in the longer term, different or complementary storage backend. It would be more appropriate to keep the change history in a dedicated system, like a version control backend, even a distributed one).

The main benefits expected from the new API are:

  • simplification of the internals of Trac, especially for the ticket model, in which the storage of changes is quite cumbersome (see #454, #6466)
  • solve a few design problems with the current data model (like #1890, #4582)
  • allow better code reuse and sharing of the base features among different kinds of resources (like #695, #1113, and a lot more, see #RelatedTickets below)

Related Proposals

The idea is not new, it has its root in the TracObjectModelProposal, and was refined several times (TracDev/Proposals/DataModel, TracDev/Proposals/Journaling, and this mail).

As this will be a major redesign of the data model, it could also be a good opportunity to take the MultipleProjectSupport considerations into account (#11025).

Somehow related to the generic data model, but not strictly depending on it, Trac should also make it possible to implement new kinds of plugins that would perform generic operations on Trac resources. This could allow the (re-)implementation of the TracCrossReferences idea as a plugin, for example. See #31, #886, #6543 and TracDev/Proposals/TracRelations.

Design Discussion

Requirements for the new model:

  1. it has to be simple (easy to understand, easy to code with, when looking at the raw data in the database one should be able to intuitively understand what it means)
  2. it must be flexible ((accommodate different kinds of resources, allow for dynamic evolution like addition or removal of fields by plugin or via the web admin)
  3. it should remain fast, if not become faster than what we currently have;
  4. it should lead to a more compact representation of data
  5. all the existing constraints about maintaining resource history and associated metadata should be taken into account

Note that the persistence constraints imposed by the Trac data model are not necessarily only (or even best) approached using the RelationalModel, one could imagine that a future version could use more document-oriented persistence layers (e.g. MongoDB), or object-oriented databases (e.g. ZODB). Also, as said above, the versioning of resources should be delegated to a version control backend, with a default, simple, in-database VCS backend.

The ticket data model

The ticket model is by far richer data model we would have to support, so we could use it as a basis to lay out the foundations of the new model.

Trac ≤ 1.1.3 model

For ticket, we currently have a fixed set of properties (as columns in the ticket table) and a flexible set of properties (as prop/value columns in a ticket_custom table).

Note that we'd like to get relations (or links) as well, see #31.

Let's examine the advantages and disadvantages of each different styles of properties storage:

  1. properties stored in columns of a main table:
    • (-) only flexibility is to not use some fields (e.g. 'severity', 'url')
    • (-) no multiple values per field possible
    • (+) faster especially for multicriteria queries (?)
    • (+) straightforward code (for field1,field2, ... in cursor: ...) (?)
  2. properties stored in prop/value columns of a generic table
    • (+) highest flexibility, add or remove fields at will
    • (+) allow for multiple values per property, provided we don't use a primary key as we currently do for the ticket_custom table (#918)
      • (+) a slight extension would allow ordered multiple values (sequences), otherwise we have no control over the order in which those multiple values are retrieved, which might introduce confusion as this order will vary when new values are added; while primarily useful for int fields, this could also be useful to implement text fields for Oracle (working around the limits on varchars and not having to use CLOBs)
    • (-) slower, less memory efficient, especially for multicriteria queries(?)
    • (-) more complex code (?)
    • (-) cannot use DB features to assure data integrity (not NULL constraints, foreign key constraints, etc.)
    • (-) very difficult to integrate against the database (e.g. import/export, analysis)

(?) means yet to be verified

The current ticket tables look like this (.schema ticket and .schema ticket_custom in SQLite):

CREATE TABLE ticket (
        id              integer PRIMARY KEY,
        type            text,           -- the nature of the ticket
        time            integer,        -- the time it was created
        changetime      integer,
        component       text,
        severity        text,
        priority        text,
        owner           text,           -- who is this ticket assigned to
        reporter        text,
        cc              text,           -- email addresses to notify
        version         text,           --
        milestone       text,           --
        status          text,
        resolution      text,
        summary         text,           -- one-line summary
        description     text,           -- problem description (long)
        keywords        text
);
CREATE INDEX ticket_status_idx ON ticket (status);
CREATE INDEX ticket_time_idx ON ticket (time);

CREATE TABLE ticket_custom (
       ticket               integer,
       name             text,
       value            text,
       UNIQUE(ticket,name)
);

Such a flat table makes it hard to get 1-n associations, or this has to be faked somehow (cf. the cc and keywords).

And for the "versioning" of ticket changes, we have (.schema ticket_change):

CREATE TABLE ticket_change (
    ticket integer,
    time integer,
    author text,
    field text,
    oldvalue text,
    newvalue text,
    UNIQUE (ticket,time,field)
);
CREATE INDEX ticket_change_ticket_idx ON ticket_change (ticket);
CREATE INDEX ticket_change_time_idx ON ticket_change (time);

Squeezed in the above are the ticket comments, complete with "reply-to" information and their own versioning scheme ;-)

Refactored model

The new model is heavily relational, as opposed to the earlier EAV-like approach (see Brainstorm).

We keep a primary table for the resource, for all 1-to-1 properties and relations (table named Ticket here).

We now make systematic use of surrogate keys, to allow for efficient relations between resources. That key will however remain hidden, an implementation detail. The visible ID for a ticket either stays the same as before or can even become more flexible (#9624, sub-tickets with dotted notation).

All the properties that may contain multiple values are located in secondary tables (<Resource>_<[Ff]ield>) with three columns: <resource>, <field> and <seq> for keeping the multiple values ordered. Note that <[Ff]ield> might be either a property (<field>), the secondary table <field> column then contains directly a value, or a relation (<Field>), in which case the secondary table <field> column contains the surrogate key of the related resource.

There's only one multi-valued property left, for the keywords, and all the others 1-m fields become relations:

  • to components,
  • to versions (#221),
  • to milestones,
  • to users for the CC (#454 done right):

This of course will impose to have actual resource tables for all these related resource realms.

Most 1-to-1 fields will also be indexed, in order to better support queries.

A note about the cardinality: in case resources have a n-m relationship, the side expected to have the less relations is the one that will come first in the relation table. For example, a project will have many (n) tickets, but a ticket will usually be associated with 1 or perhaps a few (m) projects, and we have n >> m. In this case, the relation table will be Ticket_Project. The idea is that when we retrieve information about tickets, we'll want to get the information about their projects, but not necessarily the converse. Another factor to take into account is the order. We put an emphasis on keeping track of the order in which the relations are laid out. Sometimes it's not that that important (CC:), sometimes we could take advantage of this information (keywords), special importance for first listed component or version… But it's true that there are situations where the distinction is less clear cut, like for Ticket_Milestone, based on the current interface we could say that we seldom consider a milestone without its associated tickets, and we might want to order tickets in a milestone (Scrum-like). Anyway, we'll use the hint given by the "m" cardinality for a start.

CREATE TABLE Ticket (
        uid             integer PRIMARY KEY,
        id              text UNIQUE,
        type            text,           -- the nature of the ticket
        created         integer,        -- the time it was created
        modified        integer,
        severity        text,
        priority        text,
        _owner          integer REFERENCES User (uid),
        _reporter       integer REFERENCES User (uid),         
        status          text,
        resolution      text,
        summary         text,           -- one-line summary
        description     text            -- problem description (long)
);
CREATE INDEX Ticket_created_idx ON ticket (created);
CREATE INDEX Ticket_modified_idx ON ticket (modified);
CREATE INDEX Ticket_severity_idx ON ticket (severity);
CREATE INDEX Ticket_priority_idx ON ticket (priority);
CREATE INDEX Ticket_owner_idx ON ticket (_owner);
CREATE INDEX Ticket_reporter_idx ON ticket (_reporter);
CREATE INDEX Ticket_status_idx ON ticket (status);
CREATE INDEX Ticket_resolution_idx ON ticket (resolution);
-- summary and description should be indexed for FTS


CREATE TABLE Ticket_Component (
       ticket           integer REFERENCES Ticket (uid),
       component        integer REFERENCES Component (uid),
       seq              integer,
       UNIQUE(ticket,component)
);
CREATE INDEX Ticket_Component_ticket_idx ON Ticket_Component (ticket);
CREATE INDEX Ticket_Component_component_idx ON Ticket_Component (component);


CREATE TABLE Ticket_Version (
       ticket           integer REFERENCES Ticket (uid),
       version          integer REFERENCES Version (uid),
       seq              integer,
       UNIQUE(ticket,version)
);
CREATE INDEX Ticket_Version_ticket_idx ON Ticket_Version (ticket);
CREATE INDEX Ticket_Version_version_idx ON Ticket_Version (version);


CREATE TABLE Ticket_Milestone (
       ticket           integer REFERENCES Ticket (uid),
       milestone        integer REFERENCES Milestone (uid),
       seq              integer,
       UNIQUE(ticket,milestone)
);
CREATE INDEX Ticket_Milestone_ticket_idx ON Ticket_Milestone (ticket);
CREATE INDEX Ticket_Milestone_milestone_idx ON Ticket_Milestone (milestone);


CREATE TABLE Ticket_Cc (
       ticket           integer REFERENCES Ticket (uid),
       cc               integer REFERENCES User (uid),
       seq              integer,
       UNIQUE(ticket,cc)
);
CREATE INDEX Ticket_Cc_ticket_idx ON Ticket_Cc (ticket);
CREATE INDEX Ticket_Cc_cc_idx ON Ticket_Cc (cc);


CREATE TABLE Ticket_keyword (
       ticket           integer REFERENCES ticket(uid),
       keyword          text,
       seq              integer,
       UNIQUE(ticket,keyword)
);
CREATE INDEX Ticket_keyword_ticket_idx ON Ticket_keyword (ticket);
CREATE INDEX Ticket_keyword_keyword_idx ON Ticket_keyword (keyword);

Adding the Project to mix becomes a simple matter of yet another relation table:

CREATE TABLE Ticket_Project (
       ticket           integer REFERENCES Ticket (uid),
       project          integer REFERENCES Project (uid),
       seq              integer,
       UNIQUE(ticket,project,seq)
);
CREATE INDEX Ticket_Project_ticket_idx ON Ticket_Project (ticket);
CREATE INDEX Ticket_Project_project_idx ON Ticket_Project (project);

It starts to become interesting when we start to consider the versioning aspect of the above. We're going to add ancillary tables for storing the information related to the change itself (the event) and the recording of the changes.

CREATE TABLE TicketEvent (
       uid              integer PRIMARY KEY,
       _ticket          integer REFERENCES Ticket (uid),
       _author          integer REFERENCES User (uid),
       instant          integer
);
CREATE INDEX TicketEvent_ticket_idx ON TicketEvent (_ticket);
CREATE INDEX TicketEvent_author_idx ON TicketEvent (_author);
CREATE INDEX TicketEvent_instant_idx ON TicketEvent (instant);


CREATE TABLE TicketChange (
       uid              integer PRIMARY KEY,
       _tevent          integer REFERENCES TicketEvent (uid),
       field            text,
       oldval           text,
       newval           text,
       UNIQUE(_tevent, field)
); 
CREATE INDEX TicketChange_tevent_idx ON TicketChange (_tevent);

The <Resource>Change table is clearly inspired of the existing ticket_change: we don't want to replicate all the complexity of the ticket relational model in dedicated versioning tables, because it would be clearly overkill. We're not interested in querying these values anyway, only in storing them for displaying them whenever we're going to show the history of the resource.

The comments are a different story, we're not going to "hide" them in the above, they'll rather behave as secondary (ancillary) resources related to the events themselves.

CREATE TABLE TicketEventComment (
       uid              integer PRIMARY KEY,
       _tevent          integer REFERENCES TicketEvent (uid),
       inreplyto        integer REFERENCES TicketEventComment (uid),
       comment          text
);
CREATE INDEX TicketEventComment_tevent_idx ON TicketChange (_tevent);
-- comment should be FTS indexed

It's easy to see that if we'd like to extend the ticket event comments by addinng "custom" fields, we would just have to add them in the above table if they're of the 1-to-1 type, or using property/relation secondary tables if they're of the 1-to-m type (#2961).

And of course, the comment events are themselves versioned!

CREATE TABLE TicketEventCommentEvent (
       uid              integer PRIMARY KEY,
       _TEComment       integer REFERENCES TicketEventComment (uid),
       _Author          integer REFERENCES User (uid),
       instant          integer
);
CREATE INDEX TicketEventCommentEvent_ticket_idx ON TicketEvent (_ticket);
CREATE INDEX TicketEventCommentEvent_author_idx ON TicketEvent (_author);
CREATE INDEX TicketEventCommentEvent_instant_idx ON TicketEvent (instant);


CREATE TABLE TicketEventCommentChange (
       uid              integer PRIMARY KEY,
       _TECEvent        integer REFERENCES TicketEventCommentEvent (uid),
       field            text,
       oldval           text,
       newval           text,
       UNIQUE(_TECEvent, field)
); 
CREATE INDEX TicketEventCommentChange_event_idx ON TicketEventCommentChange (_TECEvent);

At first sight, this seems far more complex than the old model was, but in fact it's above all more explicit, with no "hidden" structure (except for the oldval/newval). And the regularities in the model means we're going to be able to provide a convenient high-level API for dealing with it.

In addition, even though the relational information given by the foreign keys can be enlightening for understanding the data model, we'll need an additional descriptive layer, metadata describing the precise relationships between the tables (e.g. list of fields for a given resource/type?).

This metadata is a natural continuation of the configuration of fields one can find in [ticket-custom] section of TracIni. Of course this time it is generalized to the "standard" resources and fields, and it is stored in the DB. Speaking of custom fields, one can see how custom fields fit in the above approach: they will simply correspond to additional secondary tables. The presence of a field in the primary table or in one secondary table becomes an implementation detail abstracted away by the metadata description of the resource. This will be also used to ease the transition to the new model by first describing the "current" state (all 1-to-1 fields in the primary table) and allowing for progressive upgrade steps.

Finally, for the code that would handle the resources in a generic way, instead of dealing with the same tables for every resources (the EAV approach), they would deal with the same API and use the metadata to know how to handle each particular kind of resources.

By the way, we could also easily graft the idea of sub-types (cf. TicketTypes) that would use the same primary table as their main type but perhaps a subset (or a slightly different set) of secondary tables (e.g. no 'Version' for tasks).

Implementation Plan

The idea is to introduce the "right" model at the occasion of the implementation of the MultipleProjectSupport. So the new project resource will use this generic model, but as we're primarily concerned with tickets in association with the projects, we'll have to address the ticket module early as well.

Applied on Ticket module

As the ticket module is by far the most complex, it might be worth to try out the new model there first:

  • we could verify that we meet the expectations in terms of code simplification, solving open issues, etc.
  • we could detect early if there are no regressions or risk of losing current features
  • by redeploying the ticket infrastructure to the other components, we could spread the most benefits of tickets (comments, custom fields, queries, etc.) to other resources (milestone, wiki, component, …)

Applied on new Component and Version resources

The not yet existing resource types like 'component' and 'version' will be implemented using this generic approach (#1233). Initially the corresponding fields would remain in the primary Ticket table, they will be replaced from being properties (component, version) to relations (Component, Version). In a second step we could move these fields from the primary Ticket table to new secondary relation tables (Ticket_Version) and (Ticket_Component), still keeping the 1-to-1 relationship. The last step would be to allow them to be used as multi-valued relations (1 ticket to m versions and components).

Applied on Milestone module

  • modify the Milestone module so that it uses the new proposed datamodel.
  • adapt the Roadmap module to the new model
  • adapt the Milestone admin component to the new model
  • for a first approach, the generic scheme would be enough, no need for versioning as we currently don't support that

Later:

  • experiment new tabbed view for the milestone (View, Discussion, History). See TracProject/UiGuidelines.
  • migrate to generic versioned scheme

Applied on Wiki module

Milestone and components are closer in spirit to a wiki page than to a ticket (we have the long standing #3776 ticket).

We could as well start simply on the wiki model and wiki module, to rework the basics there, and once this is done port the changes to the milestone module, then work on the components.

If we do this, we can also at the same time think about the storage model, in particular how past versions could be stored in a VCS (#1132). We could then think about a serialization model that would play nicely with the genericity of the Trac resources on one hand side, and with the external editors on the other hand.

See how a VCS backend could be written for the versioned generic scheme…

Related Tickets

  • Data model issues:
    #4130
    Ticket Component Interface Fails to Properly Check for Existing Component
    #11301
    intermittent failure with notification test
  • Resource related:
    #150
    User-centric storage. Last-viewed information and 'intelligent' followup
    #221
    Creating TR for multiple components
    #695
    Keywords for wiki pages
    #787
    Change attachment description
    #918
    [patch] Custom Ticket Fields should support the multiple selection type
    #1113
    Show milestone changes in timeline
    #1386
    Adding properties to Wiki Pages
    #1395
    Text box for duplicate when a bug is a duplicate
    #1678
    Show component view in timeline view for checkins
    #1835
    Add a discussion tab for Trac resources
    #2035
    Changeset commenting
    #2344
    sub milestone
    #2464
    Conditional fields in tickets
    #2465
    Add "custom query" functionality for changesets
    #2467
    Link user name in reports to custom query showing that user's open tickets
    #2662
    assign tickets to multiple users
    #2961
    custom comment fields
    #3003
    milestone could be a ticket
    #3080
    Custom field sorts only as text
    #3718
    Trac should use HTTP 301 Moved when milestones get renamed
    #3911
    implement an object system (xwiki, roundup)
    #4588
    User Page: trac links to these pages
    #5211
    Ticket - Wiki Integration enhancement/suggestion
    #7871
    Add start date to a milestone
    #8335
    Create People page with user profiles and status of who's doing what.
    #9263
    if SVN is used, milestone should manage/allowtoassign links for branch and tag

See also: Brainstorm for older iterations of the idea and discussion.

Note: See TracWiki for help on using the wiki.