Edgewall Software

Version 26 (modified by anonymous, 6 years ago): Mention BM25 / TF-IDF scoring functions

Advanced Search

One of the features expected for 1.0 is a much improved search system.

But what exactly should be improved? This is the place to discuss it and make proposals.

Note that there's currently no development branch dedicated to this topic, but once there is one, this page can be used to discuss the corresponding implementation details.

As usual with Trac, the challenge is that we're not only searching Wiki pages, but other kinds of Trac resources as well: tickets, changesets, etc. Therefore, the results shown should also be adapted to the kind of object retrieved (see e.g. #2859).

A related question is how the TracSearch and the TracQuery should interact, see Trac-Dev:333, #1329, #2644.

Weighting

Right now, the results are returned in reverse chronological order (i.e. most recent first). All matches are considered equal. It was suggested that we could use some simple weighting techniques to return the results in a more useful order. For example, a term found in a ticket summary could carry more weight than one found in a ticket comment. Likewise, the number of times the term appears in a given result could be taken into account, etc.

It should be possible to do a first version of this improvement independently of the rest, by modifying the return type of ISearch.get_search_results to return a list of SearchResult objects (much like the ITimelineEventProvider change).
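Such a SearchResult object might look like the following minimal sketch. The class name comes from the proposal above, but every field and the ranking rule (score first, recency as tie-breaker) are assumptions for illustration, not the actual Trac API:

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    """Hypothetical result object carrying a relevance score (field names assumed)."""
    href: str
    title: str
    date: float          # timestamp, used as a tie-breaker
    score: float = 0.0   # higher means more relevant

def rank(results):
    """Order results by descending score, then most recent first."""
    return sorted(results, key=lambda r: (-r.score, -r.date))

results = [
    SearchResult('/ticket/1', 'match in comment', date=100.0, score=1.0),
    SearchResult('/ticket/2', 'match in summary', date=50.0, score=3.0),
]
print([r.href for r in rank(results)])  # summary match ranks first despite being older
```

With all scores left at the default 0.0, the sort degrades gracefully to the current reverse-chronological behaviour.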

In information retrieval systems, it is standard to use TF-IDF (term frequency–inverse document frequency) scoring functions, which normalize weights based on document length and on how common the different search terms are in general. One commonly used scoring function is Okapi BM25 or its variant BM25F; the latter supports weighting multiple fields differently.
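For reference, here is a minimal sketch of the standard Okapi BM25 formula for a single term (the function name and parameter names are mine; k1=1.2 and b=0.75 are conventional defaults):

```python
import math

def bm25_score(tf, doc_len, avg_len, n_docs, df, k1=1.2, b=0.75):
    """Okapi BM25 score of one term in one document.

    tf: term frequency in the document; doc_len: document length;
    avg_len: average document length in the collection;
    n_docs: total documents; df: number of documents containing the term.
    """
    # Inverse document frequency: rare terms score higher
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Term-frequency component, saturated by k1 and normalized by document length via b
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

# A rare term in a short document outscores a common term in a long one:
rare = bm25_score(tf=2, doc_len=50, avg_len=100, n_docs=1000, df=5)
common = bm25_score(tf=2, doc_len=200, avg_len=100, n_docs=1000, df=500)
```

BM25F would additionally compute a weighted tf across fields (e.g. summary vs. comment) before applying the same saturation, which matches the field-weighting idea discussed above.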

Indexing

It would probably be a good idea if objects were indexed as they are created or updated. This would greatly improve search performance, since a search would no longer effectively require a full scan of the entire database. This could be optional, I guess.

A generic search system would provide components with a means to index content, query it in a standard way (i.e. with a decent query language) and refer to it at a later date (e.g. ticket hits would display the ticket in a useful way, with links to specific comments, etc.).

Alec's Stream of Consciousness

If indexing on creation/update, we would need hooks for each resource in Trac (à la IWikiChangeListener) to update the indexer. The potential downside is that indexing on the fly could slow down Trac's responsiveness, the price paid for faster search. This could be mitigated by running the indexer in a thread. I like this solution.
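The threaded variant could be sketched roughly as below. The class and its API are entirely hypothetical; the point is only that change-listener callbacks enqueue work and return immediately, while a single worker thread applies updates to the index:

```python
import queue
import threading

class BackgroundIndexer:
    """Queue index updates from change listeners; apply them off the request thread.

    `apply_fn` stands in for whatever actually updates the index
    (e.g. the hypothetical SearchSystem.add from the proposal below).
    """

    def __init__(self, apply_fn):
        self._queue = queue.Queue()
        self._apply = apply_fn
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, resource_id, content):
        # Called from a change listener such as wiki_page_changed; returns immediately.
        self._queue.put((resource_id, content))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:          # sentinel: shut the worker down
                break
            self._apply(*item)
            self._queue.task_done()

    def close(self):
        self._queue.join()            # wait for pending updates to be applied
        self._queue.put(None)
        self._worker.join()
```

A real implementation would also need to handle indexing errors and process restarts (re-indexing anything queued but not yet applied), which this sketch ignores.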

For indexing itself, there seem to be two solutions: use a generalised indexing engine (Hyperestraier, Lucene, etc.) or an in-database indexer. A generalised indexing engine has the advantage that one interface could be used for all resources (wiki, ticket, source, attachment, …). I am personally a fan of this option, and in particular pyndexter (bias!), which provides an abstraction layer over a number of indexers. It also includes a generic query language (similar to Google's) which is transformed into the query language particular to each backend.

So, here is a completely unthought-out proposal:

# trac.wiki.search
from trac.core import Component, implements
from trac.wiki.api import IWikiChangeListener

from trac.search import SearchSystem  # hypothetical component, does not exist yet

class WikiIndexer(Component):
    implements(IWikiChangeListener)

    def _update(self, page, *args):
        # *args absorbs the extra arguments some listener methods receive
        # (e.g. wiki_page_changed also gets version, t, comment, author).
        # WikiPage exposes `name` and `text` rather than `id` and `content`.
        SearchSystem(self.env).add('wiki:%s' % page.name, content=page.text)

    wiki_page_added = _update
    wiki_page_changed = _update
    wiki_page_version_deleted = _update

    def wiki_page_deleted(self, page):
        SearchSystem(self.env).remove('wiki:%s' % page.name)

This kind of system could be implemented entirely as a plugin, assuming appropriate ChangeListener style interfaces existed for all resources (currently only the versioncontrol module is missing this functionality).

Search Engines

Several search engines could be good candidates for handling search requests, but this should probably be done in a pluggable way, so that different search engines can be supported, including a fallback engine (i.e. the current SQL-based search in the database), which would require no extra package.
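A pluggable design might look like the following sketch. Both the interface and the fallback class are invented names for illustration; the fallback deliberately mimics the current behaviour of the SQL LIKE-based search (no index, substring matching, all terms must match):

```python
class ISearchBackend:
    """Hypothetical interface a search-engine plugin would implement."""

    def add(self, doc_id, content):
        raise NotImplementedError

    def remove(self, doc_id):
        raise NotImplementedError

    def query(self, terms):
        """Return matching doc_ids, best first."""
        raise NotImplementedError

class FallbackBackend(ISearchBackend):
    """Stand-in for the current SQL-based search: no index, linear substring scan."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, content):
        self._docs[doc_id] = content.lower()

    def remove(self, doc_id):
        self._docs.pop(doc_id, None)

    def query(self, terms):
        terms = [t.lower() for t in terms]
        return [doc_id for doc_id, text in self._docs.items()
                if all(t in text for t in terms)]
```

An external engine (Lucene, Xapian, …) would implement the same three methods, so the rest of Trac would not need to know which backend is installed.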

Among the possible candidates:

Not an engine, but it might be a source of inspiration nonetheless: Haystack, modular search for Django, which supports Solr, Whoosh and Xapian.

Optional Search Targets

On the search page, there are checkboxes to search tickets, changesets, and the wiki. It would be helpful to expand these: for example, specific checkboxes for ticket summary, ticket description, ticket comments, wiki page title, wiki page content, etc. I'm sure the Trac development team could come up with a user-friendly way to build an advanced search with many options.

On a similar note, it would be nice to add a checkbox for searching (or for explicitly excluding) the Trac documentation. When I am searching, I am usually either looking something up in the Trac help or in my project data; rarely do I not know which of the two I want :-).

Searching Down Links and Attachments to a Specified Depth

It would be exceptionally useful to be able to extend the search to look inside attachments and links, and to control the depth of the search. By this I mean that the search would allow for

  • a link (depth 1);
  • a link within a link (depth 2);
  • a link within a link within a link (depth 3);

and so on

to be indexed within the search.

For example, I make a link to a page outside my Trac setup, such as a Federal Reserve page, whose links include a PDF file: http://www.federalreserve.gov/pubs/bulletin/2010/pdf/legalq409.pdf. I could then search for "FDIC" and it would turn up in this paper.

If the external search depth were 2, the search function would follow external links down two levels and so include the text "FDIC" within the PDF document shown above. The Unix program lynx is able to recursively locate links to a specified depth. The hack would have to do something similar, then index the pages and files (of allowed types) to be included in the search. It might be most efficient to recognise whether a link has changed (either by date or by some hash of the data) and only re-index it if it is new or changed. These could be indexed periodically, and/or after the page containing the links has changed.
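The depth-limited crawl with hash-based change detection could be sketched as follows. Everything here is hypothetical: `fetch` and `extract_links` stand in for real HTTP retrieval and HTML/PDF link extraction, and `seen_hashes` is the persistent record of what has already been indexed:

```python
import hashlib

def crawl(start_url, fetch, extract_links, depth, seen_hashes=None):
    """Depth-limited crawl: collect a page and the pages its links reach,
    down `depth` levels below the start page.

    fetch(url) -> bytes and extract_links(content) -> list of urls are
    supplied by the caller.  seen_hashes maps url -> content hash, so
    unchanged pages are skipped on re-index (the change-detection idea above).
    Returns {url: content} for pages that are new or changed.
    """
    if seen_hashes is None:
        seen_hashes = {}
    to_index = {}
    frontier = [(start_url, 0)]
    visited = set()
    while frontier:
        url, level = frontier.pop()
        if url in visited or level > depth:
            continue
        visited.add(url)
        content = fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        if seen_hashes.get(url) != digest:   # new or changed: (re)index it
            seen_hashes[url] = digest
            to_index[url] = content
        for link in extract_links(content):
            frontier.append((link, level + 1))
    return to_index
```

Re-running the crawl with the same `seen_hashes` returns nothing until some page's content actually changes, which is what makes periodic re-indexing cheap.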

The beauty of this is that it extends the search beyond the Trac site itself to the nearby nodes of the network, allowing information on adjacent sites reachable by links to be searched.

The same search could be extended to attached files that already exist within the Trac framework. The file types to search could be specified (pdf, txt, doc, etc.).

I put this into request-a-hack on the trac-hacks site as #6918, but it might be better as part of the core Trac project.

markm: I think searching sites outside of Trac is beyond the scope of Trac, as it would involve a web crawler to follow those links. I think it is perfectly acceptable that Trac does not index links outside of Trac itself. Indexing attachments may be quite interesting, but I think it belongs in an optional plugin, as it probably needs to be quite platform-dependent and may involve technologies completely orthogonal to Trac's core functionality.

