Edgewall Software
Modify

Opened 9 years ago

Last modified 5 years ago

#11937 new enhancement

Improve suggestion of similarly named pages

Reported by: Ryan J Ollos Owned by:
Priority: normal Milestone: next-major-releases
Component: wiki system Version:
Severity: normal Keywords: bitesized
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Trac suggests similarly named pages when navigating to a non-existent page with a CamelCase name, but the suggestion is naive. Trac will suggest CamelCase if CamelCas is entered, but not if CameCase is entered. The code that generates suggestions is here. The aim of this ticket will be to improve the suggestions.

The levenshtein_distance function used by AdminCommandManager may be useful in implementing this ticket.

Attachments (3)

wiki_suggestion_patch.diff (3.4 KB ) - added by walty 9 years ago.
for wiki suggestion logic
wiki_suggestion_patch.v2.diff (3.0 KB ) - added by walty 9 years ago.
uses levenshtein_distance for scoring
wiki_suggestion_patch.v3.diff (5.3 KB ) - added by walty <walty8@…> 9 years ago.
same logic as before, just added unit tests

Download all attachments as: .zip

Change History (20)

comment:1 by Jun Omae, 9 years ago

Sounds good.

Also, Trac currently suggests the pages for only users with WIKI_CREATE permission. I think we should suggest with or without the permission.

comment:2 by Ryan J Ollos, 9 years ago

Milestone: next-major-releasesnext-dev-1.1.x

comment:3 by Ryan J Ollos, 9 years ago

Keywords: bitesized added
Type: defectenhancement

by walty, 9 years ago

Attachment: wiki_suggestion_patch.diff added

for wiki suggestion logic

comment:4 by walty <walty8@…>, 9 years ago

I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:

  1. it's much more efficient, only need to calculate the possible mutations once.
  2. less arbitrary, no need to set up threshold of score.

I have attached the patch to the ticket. The functional unit test is passed, and I have added one more unit test.

comment:5 by anonymous, 9 years ago

Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.

in reply to:  5 comment:6 by walty <walty8@…>, 9 years ago

Replying to anonymous:

Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.

OK, but actually they are two logics, initially the suggestion is based on logic of 'in string' (i.e. suggested wiki name contains the user input name), and now the suggestion is based on editing distance. May be I just make the union of both for the list of suggestions?

in reply to:  5 comment:7 by Ryan J Ollos, 9 years ago

Replying to anonymous:

Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.

It seems that anonymous always has concerns about us removing functionality, but never bothers to inspect the proposed changes, test the proposed changes or review the regression tests we have in place.

Forgive me if I've overlooked something and you've actually taken an action that could be helpful to us, but I fail to see it so far. We could certainly use help in reviewing and testing changes, but just shouting out things without taking the time to consider the proposed changes doesn't qualify.

in reply to:  4 ; comment:8 by Ryan J Ollos, 9 years ago

Replying to walty <walty8@…>:

Thank you for the patch. Two suggestions:

  • Use a list comprehension rather than an if conditional nested in a for loop. List comprehensions are generally preferred, especially when creating lists.

I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:

  1. it's much more efficient, only need to calculate the possible mutations once.

The way you've written the patch doesn't avoid recomputing the possible mutations on every request that is processed by the handler. Anyway I don't think that is likely required, but we could do some performance testing to be sure.

  1. less arbitrary, no need to set up threshold of score.

Unit test coverage is needed for get_similar_words . We should consider that a function, get_similar_pages, that is similar to get_similar_commands may be preferred because it would reuse an existing algorithm that we've observed to show good behavior in suggesting similar commands through TracAdmin. In general we should try to avoid adding more algorithms if an existing algorithm is sufficient.

in reply to:  8 comment:9 by walty <walty8@…>, 9 years ago

Replying to rjollos:

Replying to walty <walty8@…>:

Thank you for the patch. Two suggestions:

  • Use a list comprehension rather than an if conditional nested in a for loop. List comprehensions are generally preferred, especially when creating lists.

I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:

  1. it's much more efficient, only need to calculate the possible mutations once.

The way you've written the patch doesn't avoid recomputing the possible mutations on every request that is processed by the handler. Anyway I don't think that is likely required, but we could do some performance testing to be sure.

Sorry for the confusion, and you are right that it could not avoid computation on every request.

However, if use levenshtein_distance, that implies for each comparison of searched keyword and wiki page name, one round of (small) mutation is required (because of the implementation inside levenshtein_distance).

If use the approach in the patch file wiki_suggestion_patch.diff, only one (big) round of mutation is needed. The algorithm would try to get all possible mutations of the searched keyword first (e.g. camecase1, camecas1e, cameca1se, … etc), and use all the mutated words to match against all wiki page names (simple string match).

I did not make an accurate calculation of order of complexity, just my gut feeling is that the 2nd way should be faster.

  1. less arbitrary, no need to set up threshold of score.

Unit test coverage is needed for get_similar_words . We should consider that a function, get_similar_pages, that is similar to get_similar_commands may be preferred because it would reuse an existing algorithm that we've observed to show good behavior in suggesting similar commands through TracAdmin. In general we should try to avoid adding more algorithms if an existing algorithm is sufficient.

I think consistency could be a good point.

So if I am going to implement the function of get_similar_pages, should I just copy and paste the code and change a little bit (which seems not a good habit), or I should try to refactor out the common logic between get_similar_pages and get_similar_commands , and then call the new common function from both functions?

comment:10 by Jun Omae, 9 years ago

The get_similar_pages works only with alphabet and digit characters in ascii. However, wiki page names allow to use unicode characters, e.g. latin, CJK and emoji characters. I think we should suggest with all of unicode characters.

Also, I consider this issue is not bitesized. I've modified to use levenshtein_distance in [eb98b3f19/jomae.git], but it doesn't work well.

by walty, 9 years ago

uses levenshtein_distance for scoring

comment:11 by walty, 9 years ago

I have tried the code of jomae, and found that the performance is actually OK. But it got some problem with the matching correctness. For example, for wiki entry of tracbowser, it would give a long list of suggestions, includingTracBrowser and TracImport. The latter seems odd.

And when I further checked the code, as well as the code in AdminCommandManager, I think the problem may be because of the denominator in the score calculation of levenshtein_distance, the old code uses len(search) + len(name) as denominator. But stackoverflow suggests that we should actually use max(len(search), len(name)).

I referenced the code from both jomae and AdminCommandManager, and set up a new patch, which contains a method of setup_similar_pages now (But I am not sure where the method should be located inside the file). I have updated the functional test as well.

And my wild guess is, the code in AdminCommandManager might also need to be updated as well.

comment:12 by Ryan J Ollos, 9 years ago

Our general strategy is to use functional tests to test code in the rendered templates, but to opt for unit tests for all other Python code. The functional test you've written is good to confirm that the suggested pages are rendered on the page. Functional tests are slow though, so we wouldn't want to add more functional tests to test additional behavior. Instead, it would be good to add unit tests to provide more rigorous evaluation of the suggested pages. Then, if someone says that the algorithm is not providing good results, a unit test case can be added to evaluate the scenario and the algorithm can be improved.

When jomae suggested that the algorithm may not be good, he may have been referring to suggestions that include non-ascii page names. Either way, having unit tests will allow us to add tests for unicode page names, and any other scenarios that arise.

in reply to:  12 ; comment:13 by walty, 9 years ago

Replying to rjollos:

Our general strategy is to use functional tests to test code in the rendered templates, but to opt for unit tests for all other Python code. The functional test you've written is good to confirm that the suggested pages are rendered on the page. Functional tests are slow though, so we wouldn't want to add more functional tests to test additional behavior. Instead, it would be good to add unit tests to provide more rigorous evaluation of the suggested pages. Then, if someone says that the algorithm is not providing good results, a unit test case can be added to evaluate the scenario and the algorithm can be improved.

When jomae suggested that the algorithm may not be good, he may have been referring to suggestions that include non-ascii page names. Either way, having unit tests will allow us to add tests for unicode page names, and any other scenarios that arise.

No problem, I would add some unit tests by today. I have briefly tried the new logic, and it should support unicode as well now.

So, should I remove the functional test then?

It just appears to me that even if the function get_similar_page is correct, I cannot guarantee the final displayed result is fine without the functional test (though it's very likely to be correct).

in reply to:  13 comment:14 by Ryan J Ollos, 9 years ago

Replying to walty:

So, should I remove the functional test then?

It just appears to me that even if the function get_similar_page is correct, I cannot guarantee the final displayed result is fine without the functional test (though it's very likely to be correct).

The functional test probably has value for the very reason you mention. I think we'll want to keep it. I'll try to test out the patch this weekend.

by walty <walty8@…>, 9 years ago

same logic as before, just added unit tests

comment:15 by Ryan J Ollos, 9 years ago

Milestone: next-dev-1.1.xnext-dev-1.3.x

Narrowing focus for milestone:1.2. Please move ticket to milestone:1.2 if you intend to fix it.

comment:16 by Ryan J Ollos, 5 years ago

Milestone: next-dev-1.3.xnext-dev-1.5.x

Milestone renamed

comment:17 by Ryan J Ollos, 5 years ago

Milestone: next-dev-1.5.xnext-major-releases

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned.
as The resolution will be set. Next status will be 'closed'.
The owner will be changed from (none) to anonymous. Next status will be 'assigned'.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.