Opened 10 years ago
Last modified 5 years ago
#11937 new enhancement
Improve suggestion of similarly named pages
Reported by: | Ryan J Ollos | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | next-major-releases |
Component: | wiki system | Version: | |
Severity: | normal | Keywords: | bitesized |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
Trac suggests similarly named pages when navigating to a non-existent page with a CamelCase name, but the suggestion is naive. Trac will suggest CamelCase if CamelCas is entered, but not if CameCase is entered. The code that generates suggestions is here. The aim of this ticket will be to improve the suggestions.
The levenshtein_distance function used by AdminCommandManager may be useful in implementing this ticket.
Attachments (3)
Change History (20)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Milestone: | next-major-releases → next-dev-1.1.x |
---|
comment:3 by , 9 years ago
Keywords: | bitesized added |
---|---|
Type: | defect → enhancement |
follow-up: 8 comment:4 by , 9 years ago
I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:
- it's much more efficient, only need to calculate the possible mutations once.
- less arbitrary, no need to set up threshold of score.
I have attached the patch to the ticket. The functional unit test is passed, and I have added one more unit test.
follow-ups: 6 7 comment:5 by , 9 years ago
Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.
comment:6 by , 9 years ago
Replying to anonymous:
Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.
OK, but actually they are two logics, initially the suggestion is based on logic of 'in string' (i.e. suggested wiki name contains the user input name), and now the suggestion is based on editing distance. May be I just make the union of both for the list of suggestions?
comment:7 by , 9 years ago
Replying to anonymous:
Please make sure similar pages elsewhere in the hierarchy are still show. For example CookBook/UnitTest should still show TracDev/WritingUnitTests.
It seems that anonymous always has concerns about us removing functionality, but never bothers to inspect the proposed changes, test the proposed changes or review the regression tests we have in place.
Forgive me if I've overlooked something and you've actually taken an action that could be helpful to us, but I fail to see it so far. We could certainly use help in reviewing and testing changes, but just shouting out things without taking the time to consider the proposed changes doesn't qualify.
follow-up: 9 comment:8 by , 9 years ago
Replying to walty <walty8@…>:
Thank you for the patch. Two suggestions:
- Use a list comprehension rather than an
if
conditional nested in afor
loop. List comprehensions are generally preferred, especially when creating lists.
I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:
- it's much more efficient, only need to calculate the possible mutations once.
The way you've written the patch doesn't avoid recomputing the possible mutations on every request that is processed by the handler. Anyway I don't think that is likely required, but we could do some performance testing to be sure.
- less arbitrary, no need to set up threshold of score.
Unit test coverage is needed for get_similar_words
. We should consider that a function, get_similar_pages
, that is similar to get_similar_commands may be preferred because it would reuse an existing algorithm that we've observed to show good behavior in suggesting similar commands through TracAdmin. In general we should try to avoid adding more algorithms if an existing algorithm is sufficient.
comment:9 by , 9 years ago
Replying to rjollos:
Replying to walty <walty8@…>:
Thank you for the patch. Two suggestions:
- Use a list comprehension rather than an
if
conditional nested in afor
loop. List comprehensions are generally preferred, especially when creating lists.I checked the suggestion logic used by command-not-found handler in linux, and found the snippet here, so I decided to use the similar logic, because:
- it's much more efficient, only need to calculate the possible mutations once.
The way you've written the patch doesn't avoid recomputing the possible mutations on every request that is processed by the handler. Anyway I don't think that is likely required, but we could do some performance testing to be sure.
Sorry for the confusion, and you are right that it could not avoid computation on every request.
However, if use levenshtein_distance
, that implies for each comparison of searched keyword and wiki page name, one round of (small) mutation is required (because of the implementation inside levenshtein_distance
).
If use the approach in the patch file wiki_suggestion_patch.diff
, only one (big) round of mutation is needed. The algorithm would try to get all possible mutations of the searched keyword first (e.g. camecase1, camecas1e, cameca1se, … etc), and use all the mutated words to match against all wiki page names (simple string match).
I did not make an accurate calculation of order of complexity, just my gut feeling is that the 2nd way should be faster.
- less arbitrary, no need to set up threshold of score.
Unit test coverage is needed for
get_similar_words
. We should consider that a function,get_similar_pages
, that is similar to get_similar_commands may be preferred because it would reuse an existing algorithm that we've observed to show good behavior in suggesting similar commands through TracAdmin. In general we should try to avoid adding more algorithms if an existing algorithm is sufficient.
I think consistency could be a good point.
So if I am going to implement the function of get_similar_pages
, should I just copy and paste the code and change a little bit (which seems not a good habit), or I should try to refactor out the common logic between get_similar_pages
and get_similar_commands
, and then call the new common function from both functions?
comment:10 by , 9 years ago
The get_similar_pages
works only with alphabet and digit characters in ascii. However, wiki page names allow to use unicode characters, e.g. latin, CJK and emoji characters. I think we should suggest with all of unicode characters.
Also, I consider this issue is not bitesized. I've modified to use levenshtein_distance
in [eb98b3f19/jomae.git], but it doesn't work well.
by , 9 years ago
Attachment: | wiki_suggestion_patch.v2.diff added |
---|
uses levenshtein_distance for scoring
comment:11 by , 9 years ago
I have tried the code of jomae, and found that the performance is actually OK. But it got some problem with the matching correctness. For example, for wiki entry of tracbowser
, it would give a long list of suggestions, includingTracBrowser
and TracImport
. The latter seems odd.
And when I further checked the code, as well as the code in AdminCommandManager, I think the problem may be because of the denominator in the score calculation of levenshtein_distance
, the old code uses len(search) + len(name)
as denominator. But stackoverflow suggests that we should actually use max(len(search), len(name))
.
I referenced the code from both jomae and AdminCommandManager
, and set up a new patch, which contains a method of setup_similar_pages
now (But I am not sure where the method should be located inside the file). I have updated the functional test as well.
And my wild guess is, the code in AdminCommandManager
might also need to be updated as well.
follow-up: 13 comment:12 by , 9 years ago
Our general strategy is to use functional tests to test code in the rendered templates, but to opt for unit tests for all other Python code. The functional test you've written is good to confirm that the suggested pages are rendered on the page. Functional tests are slow though, so we wouldn't want to add more functional tests to test additional behavior. Instead, it would be good to add unit tests to provide more rigorous evaluation of the suggested pages. Then, if someone says that the algorithm is not providing good results, a unit test case can be added to evaluate the scenario and the algorithm can be improved.
When jomae suggested that the algorithm may not be good, he may have been referring to suggestions that include non-ascii page names. Either way, having unit tests will allow us to add tests for unicode page names, and any other scenarios that arise.
follow-up: 14 comment:13 by , 9 years ago
Replying to rjollos:
Our general strategy is to use functional tests to test code in the rendered templates, but to opt for unit tests for all other Python code. The functional test you've written is good to confirm that the suggested pages are rendered on the page. Functional tests are slow though, so we wouldn't want to add more functional tests to test additional behavior. Instead, it would be good to add unit tests to provide more rigorous evaluation of the suggested pages. Then, if someone says that the algorithm is not providing good results, a unit test case can be added to evaluate the scenario and the algorithm can be improved.
When jomae suggested that the algorithm may not be good, he may have been referring to suggestions that include non-ascii page names. Either way, having unit tests will allow us to add tests for unicode page names, and any other scenarios that arise.
No problem, I would add some unit tests by today. I have briefly tried the new logic, and it should support unicode as well now.
So, should I remove the functional test then?
It just appears to me that even if the function get_similar_page
is correct, I cannot guarantee the final displayed result is fine without the functional test (though it's very likely to be correct).
comment:14 by , 9 years ago
Replying to walty:
So, should I remove the functional test then?
It just appears to me that even if the function
get_similar_page
is correct, I cannot guarantee the final displayed result is fine without the functional test (though it's very likely to be correct).
The functional test probably has value for the very reason you mention. I think we'll want to keep it. I'll try to test out the patch this weekend.
by , 9 years ago
Attachment: | wiki_suggestion_patch.v3.diff added |
---|
same logic as before, just added unit tests
comment:15 by , 9 years ago
Milestone: | next-dev-1.1.x → next-dev-1.3.x |
---|
Narrowing focus for milestone:1.2. Please move ticket to milestone:1.2 if you intend to fix it.
comment:17 by , 5 years ago
Milestone: | next-dev-1.5.x → next-major-releases |
---|
Sounds good.
Also, Trac currently suggests the pages for only users with
WIKI_CREATE
permission. I think we should suggest with or without the permission.