Edgewall Software

Changes between Version 26 and Version 27 of AdvancedSearch


Timestamp:
Jul 31, 2018, 9:39:45 AM (6 years ago)
Author:
figaro
Comment:

Further cosmetic changes, update/removal of dead links

= Advanced Search

One of the features for [milestone:1.0] is a much improved search system. This is the place to discuss further improvements and make proposals.

Note that there is currently no development branch dedicated to this topic, but when one exists, this page can be used to discuss the corresponding implementation details.

As usual with Trac, the challenge is that we are not only searching wiki pages, but other Trac resources as well, such as tickets, changesets, etc. Therefore, the results shown should also be adapted to the kind of object retrieved (see #2859).

A related question is how the TracSearch and the TracQuery should interact, see Trac-Dev:333, #1329, #2644.

== Weighting

Right now the results are returned in reverse chronological order, ie most recent first, and all matches have equal weighting. It was suggested that we could use some simple weighting techniques to return the results in a more useful order. For example, a term found in a ticket summary could weigh more than one found in a ticket comment. Likewise, the number of times the term is found for a given result could also be taken into account.

It should be possible to do a first version of this improvement independently of the rest, by modifying the return type of `ISearch.get_search_results` to return a list of `SearchResult` objects, much like the [wiki:"TracDev/ApiChanges/0.11#ITimelineEventProvider" ITimelineEventProvider] change.
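
A rough sketch of what such a `SearchResult` object could look like; the attribute names and the `score` field are illustrative assumptions, not an existing API:

{{{#!python
# Hypothetical sketch only: the fields and the relevance score are assumptions.
class SearchResult(object):
    def __init__(self, href, title, date, author, excerpt, score=1.0):
        self.href = href        # link to the matching resource
        self.title = title      # e.g. ticket summary or wiki page name
        self.date = date        # last modification, useful for tie-breaking
        self.author = author
        self.excerpt = excerpt  # snippet of the matching text
        self.score = score      # relevance weight used for ordering

# A search source would then yield SearchResult instances instead of plain
# tuples, and the search module would sort on (score, date) before rendering.
}}}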

In [wikipedia:"Information retrieval"] systems it is standard to use [wikipedia:"Tf–idf" TF-IDF] (term frequency–inverse document frequency) scoring functions. These normalize weights based on document length and on how common the different search terms are in general. One commonly used such function is [wikipedia:"Okapi BM25"] or its variant BM25F; the latter supports weighting multiple fields differently.
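
A minimal sketch of the BM25 scoring function, assuming term frequencies, document lengths and document frequencies are already available from an index:

{{{#!python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freq,
               k1=1.2, b=0.75):
    """doc_tf: term -> frequency in this document;
    doc_freq: term -> number of documents containing the term."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if not tf:
            continue
        # Inverse document frequency: rare terms weigh more than common ones.
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # Term frequency saturates via k1; b normalizes for document length.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * float(doc_len) / avg_doc_len))
    return score
}}}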

== Indexing

It would probably be a good idea if objects were indexed as they are created/updated. This would obviously improve search performance greatly, and would no longer effectively require a full retrieval of the entire database. This could be optional, I guess.

A generic search system would provide components with a means to index content, query the content in a standard way (ie. a decent query language) and refer to this content at a later date (eg. ticket hits would display the ticket in a useful way, with links to specific comments, etc).
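
Purely as an illustration of that idea, such a generic interface might look roughly like this; every name here is an assumption, not an existing Trac API:

{{{#!python
# Hypothetical interface sketch; the names are illustrative only.
class ISearchIndexer(object):
    """Contract a generic search system could offer to other components."""

    def index(self, realm, id, content, **fields):
        """Add or update a document, e.g. realm='ticket', id='123'."""

    def unindex(self, realm, id):
        """Remove a document from the index, e.g. when a ticket is deleted."""

    def query(self, query_string):
        """Evaluate a query in a common query language and yield
        (realm, id, score) tuples for the matching documents."""
}}}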

== Alec's Stream of Consciousness

If indexing on creation/update, we would need hooks for each resource in Trac (such as `IWikiChangeListener`) to update the indexer. The potential downside is that indexing on the fly could slow down Trac's responsiveness at the cost of faster search. This could be mitigated by running the indexer in a thread. I like this solution.
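
A minimal sketch of that threaded approach, assuming an `indexer` object with an `add(resource)` method (an assumption for illustration, not an existing API); the change listener callbacks would only enqueue work, so the request cycle is not slowed down:

{{{#!python
# Hypothetical background indexer; the queue/thread pattern is the point here.
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

class BackgroundIndexer(object):
    def __init__(self, indexer):
        self.indexer = indexer       # assumed to expose add(resource)
        self.pending = queue.Queue()
        worker = threading.Thread(target=self._run)
        worker.daemon = True         # do not block interpreter shutdown
        worker.start()

    def _run(self):
        while True:
            resource = self.pending.get()
            self.indexer.add(resource)   # indexing happens off the request thread
            self.pending.task_done()

    def schedule(self, resource):
        # Called from change listener callbacks, e.g. wiki page added/changed.
        self.pending.put(resource)
}}}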

For indexing itself there seem to be two solutions: use a generalised indexing engine (Hyperestraier, Lucene, etc.) or in-database indexers. A generalised indexing engine has the advantage that one interface could be used for all resources (wiki, ticket, source, attachment). I am personally a fan of this option, and in particular [pypi:pyndexter] (bias!), which provides an abstraction layer for a number of indexers. It also includes a generic query language similar to Google's, which is transformed into the query language particular to each backend.

Proposal:

{{{
}}}

This kind of system could be implemented entirely as a plugin, assuming appropriate ''!ChangeListener'' style interfaces existed for all resources. Currently only the versioncontrol module is missing this functionality.

== Search Engines

Several search engines could be good candidates for handling the search requests, but this should probably be done in a pluggable way, so that different search engines could be supported, including a ''fallback'' engine (i.e. the current SQL-based search in the database), which would require no extra package.
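
A rough sketch of how such a fallback could be wired up; the backend objects and the `available()` check are assumptions for illustration:

{{{#!python
# Illustrative only: pick the configured backend if it is usable,
# otherwise fall back to the SQL-based search that needs no extra package.
def select_backend(backends, preferred_name):
    by_name = dict((b.name, b) for b in backends)
    preferred = by_name.get(preferred_name)
    if preferred is not None and preferred.available():
        return preferred
    return by_name['sql']   # the built-in fallback engine
}}}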

Among the possible candidates:
 * [http://www.xapian.org Xapian]. See also the discussion about using Xapian in MoinMoin: MoinMoin:FeatureRequests/AdvancedXapianSearch
 * [http://lucene.apache.org/pylucene/ PyLucene]
 * [http://lucene.apache.org/solr/ Solr]
 * [http://fallabs.com/hyperestraier/intro-en.html Hyper Estraier]
 * [pypi:Whoosh], read Chris Mulligan on trac-dev: [googlegroups:trac-dev:d4025a6f2feef8d7 Whoosh for search?]
 * There have been some efforts to provide a neutral API for some of the above search engines:
   - [pypi:pyndexter] [[BR]] The Hyperestraier adapter works well, Xapian is coming along nicely and the pure Python indexer is based on that used by the [th:RepoSearchPlugin] (ie. works, but has issues). I have yet to write the !PyLucene adapter, but it doesn't look too difficult.
   - [http://blog.case.edu/bmb12/2006/08/merquery_summer_of_code_results merquery], which was also a Django-specific Google SoC project in 2006.
   - [http://www.opensearchserver.com/ OpenSearch], of which the source code is hosted on [https://github.com/jaeksoft/opensearchserver github]
 * Database backends may also have their own way to implement full-text search (see the sketch after this list):
   - SQLite 3.3.8+ comes with a full-text search module, [https://www.sqlite.org/fts3.html FTS3 and FTS4]. [[BR]] This seems to be a long way from useful in its current state, unfortunately. There are no decent build instructions for fts3. More critically, FTS support is also not shipped with any major distribution I'm aware of.
   - PostgreSQL 8.3+ has [http://www.postgresql.org/docs/8.3/static/textsearch.html integrated full text search] as core functionality.
     - [http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch2 Tsearch2] and an [http://www.devx.com/opensource/Article/21674 article about tsearch2]
   - [http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html MySQL fulltext indexing] is not an option, because it is only available for MyISAM tables, which don't have transaction support
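
For example, the SQLite variant can be exercised from Python; this is a minimal sketch, assuming the linked SQLite library was built with the FTS3/FTS4 extension enabled:

{{{#!python
import sqlite3

con = sqlite3.connect(":memory:")
# A virtual FTS4 table stores the text to be searched and builds the index.
con.execute("CREATE VIRTUAL TABLE search_index USING fts4(realm, id, content)")
con.executemany(
    "INSERT INTO search_index (realm, id, content) VALUES (?, ?, ?)",
    [("ticket", "42", "Weight matches in the ticket summary higher"),
     ("wiki", "AdvancedSearch", "Notes on indexing and search engines")])
# MATCH runs the full-text query against the indexed content column.
for realm, id_ in con.execute(
        "SELECT realm, id FROM search_index WHERE content MATCH ?", ("indexing",)):
    print(realm, id_)
}}}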

Not an engine, but might be a source of inspiration nonetheless: [http://haystacksearch.org/ Haystack], modular search for Django, supports Solr, Whoosh and Xapian.

== Optional Search Targets

On the search page, there are checkboxes to search tickets, changesets, and the wiki. It would be helpful to expand these, for example with specific checkboxes for ticket summary, ticket description, ticket comments, wiki page title, wiki page content, etc. I'm sure the Trac development team could come up with a way to make a user-friendly advanced search with many options.

On a similar note, it would be nice to add a checkbox for searching (or to explicitly exclude searching) the Trac documentation. When I am searching for something, I am either looking it up in the Trac help or in my project data; rarely do I not know which of the two it is in.

== Searching Down Links and Attachments to a Specified Depth

It would be exceptionally useful to be able to extend the search to look inside attachments and links and to control the depth of the search. By this I mean that the search would allow for

to be indexed within the search.

For example, I make a link to a page outside my Trac setup, and on this page a PDF file is linked, like [http://www.federalreserve.gov/pubs/bulletin/2010/default.htm#legal Federal Reserve], which contains links including a PDF file [http://www.federalreserve.gov/pubs/bulletin/2010/pdf/legalq409.pdf]. I could then search for "FDIC" and it would turn up in this paper.

If the external search depth were 2, then the search function would follow external links down two levels and include the text FDIC within the PDF document as shown. The UNIX program lynx is able to recursively locate links to a specified depth. The hack would have to do something similar and then index the pages and files (of allowed types) to be included within the search. It might be most efficient to recognise whether the link has changed (either by date or by some hash based upon the data) and only re-index it if it is new or changed. These could be indexed periodically, and/or after the page with the links has been changed.
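
A very rough sketch of such a depth-limited fetch; everything here (the function names, the lack of robots.txt handling, file-type filtering and change detection by hash) is an illustrative assumption:

{{{#!python
# Hypothetical depth-limited link follower; the index() callable is assumed
# to hand the fetched content to whatever indexer is in use.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href' and v)

def crawl(url, depth, index):
    if depth < 0:
        return
    html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    index(url, html)                      # index this page's content
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:             # follow links one level deeper
        crawl(urllib.parse.urljoin(url, link), depth - 1, index)
}}}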

The beauty of this is that it extends the search from just the Trac website to the local nodes of the network and would allow information on adjacent sites specified by links to be searched.

The same search could be extended to attached files that already exist within the Trac framework. The file types to search could be specified, such as pdf, txt, doc etc.

This depth-linking is captured in [th:#6918], but it might be better as part of the core Trac project.

   '''markm:''' I think searching sites outside of Trac is beyond the scope of Trac. It would involve a web crawler to follow those links. I think it is perfectly acceptable that Trac does not index links outside of Trac itself. I think that indexing attachments may be quite interesting. But I think it may belong in an optional plugin, as it probably needs to be quite platform dependent and may involve technologies completely oblique to Trac's core functionality.