Opened 15 years ago
Last modified 4 years ago
#8519 new defect
internal SQL string matches are always case insensitive
Reported by: | Christian Boos | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | next-major-releases |
Component: | database backend | Version: | none |
Severity: | normal | Keywords: | sql like ilike |
Cc: | franz.mayer@…, Jun Omae | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
Our like()
helper actually translate into doing an ILIKE
or similar. There are several places where this is not desired, we would actually need a case sensitive search.
The idea is to rename the current like()
to ilike()
and add a real like()
helper.
See #8424 for the original ticket and implementation hints for SQLite which doesn't have a case-sensitive LIKE.
Attachments (3)
Change History (39)
by , 15 years ago
Attachment: | sqlite_like.py added |
---|
comment:1 by , 15 years ago
attachment:sqlite_like.py can be installed as a custom function (see ticket:8424#comment:3).
by , 15 years ago
Attachment: | 8519-ilike-r8958.patch added |
---|
Add ilike()
to the DB connection, and implement a correct like()
for PostgreSQL and MySQL. Misuse the LIKE
operator on SQLite.
follow-up: 4 comment:3 by , 15 years ago
The patch above adds an ilike()
method to the database connections, and changes like()
to be case-sensitive on PostgreSQL and MySQL. For SQLite, the proposed function is installed for the 2-argument LIKE
operator, and the simplistic test cases pass.
However, the SQLite solution is a hack, and it works only in the most simple cases:
- It only allows
%
wildcards at the beginning or end of the pattern. - It doesn't support the
_
wildcard. - It doesn't support escaping of wildcards.
Implementing this correctly would make the function painfully slow. My suggestion would be to keep like()
case-insensitive on SQLite, and wait for it to grow a proper ILIKE
operator. I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make LIKE
case-sensitive via the PRAGMA
.
Thoughts?
follow-up: 5 comment:4 by , 15 years ago
Replying to rblank:
However, the SQLite solution is a hack, and it works only in the most simple cases:
I think that's all we need in Trac proper. Then, of course, plugins might have higher expectations.
- It only allows
%
wildcards at the beginning or end of the pattern.- It doesn't support the
_
wildcard.- It doesn't support escaping of wildcards.
Implementing this correctly would make the function painfully slow.
Maybe, but then string operations are fast in Python, things like find
. We could have a try on improving the above and still have something reasonably efficient.
My suggestion would be to keep
like()
case-insensitive on SQLite, and wait for it to grow a properILIKE
operator.
But don't we use mainly ilike()
now? For which case do we still use like()
and need "native" performance?
I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make
LIKE
case-sensitive via thePRAGMA
.
Maybe the best long term solution …
comment:5 by , 15 years ago
Replying to cboos:
But don't we use mainly
ilike()
now? For which case do we still uselike()
and need "native" performance?
The SVN cache, where we use it to match paths (see cache.py). But maybe there's a way to reformulate the query that doesn't require LIKE. I read a trick yesterday where somebody replaced LIKE '...%'
with BETWEEN, like so:
path LIKE 'trunk/trac/versioncontrol/%'
would be changed to:
path BETWEEN 'trunk/trac/versioncontrol/' AND 'trunk/trac/versioncontrol0'
(where 0 is the character immediately following / in the current encoding).
comment:6 by , 15 years ago
Very interesting. So I did some experiments:
import sqlite3 import random c = sqlite3.connect('testpaths') c.execute('create table test ( path text primary key)') def fragment(letters, r): return ''.join(random.choice(letters) for i in range(0, random.choice(r))) def path(): return '/'.join(['trunk', fragment('abcdef', range(4, 8)), fragment('defghijkl', range(4, 10)), fragment('pqrstuvwx', range(4, 6))]) c.executemany('insert into test values (?)', ((path(),) for x in range(0, 100000))) def like3(): return c.execute("select count(*) from test " "where path like 'trunk//beef//%' escape '/'").fetchall() def like2(): return c.execute("select count(*) from test " "where path like 'trunk/beef/%'").fetchall() def between(): return c.execute("select count(*) from test where " "path between 'trunk/beef/' and 'trunk/beef0'").fetchall()
timeit:
>>> import timeit >>> timeit.timeit('like3()', 'from __main__ import like3', number=100) 4.868307215023151 >>> timeit.timeit('like2()', 'from __main__ import like2', number=100) 4.4357460235867974
With python sqlite_like
from attachment:sqlite_like.py:
>>> c.create_function('like', 2, sqlite_like) >>> timeit.timeit('like2()', 'from __main__ import like2', number=100) 20.509961112376018
And finally:
>>> timeit.timeit('between()', 'from __main__ import between', number=100000) 1.814714859011417
No, there's no typo in the number ;-) Nice one ;-)
follow-up: 18 comment:7 by , 15 years ago
Interesting! But I made a small mistake. You can't use BETWEEN, because both ends of the range are included. You have to use:
path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'
follow-up: 9 comment:8 by , 15 years ago
Well, I expect the speed should be in the same league with the explicit comparison than with the "between", as it will also be able to make good use of the index. What the above shows IMO is that the LIKE 'xxx%...
doesn't benefit from the index, with SQLite.
follow-up: 12 comment:9 by , 15 years ago
Replying to cboos:
… So my next suggestion would be to drop the idea of separate
like()
andilike()
altogether, and keeplike()
case-insensitive.
Ha! Lazyness attack! But I have to agree, let's defer making the like
/ilike
distinction to when it becomes really needed. If not already done, we should also make sure the docstring for like()
correctly highlights its case-insensitiveness.
comment:11 by , 15 years ago
Replying to cboos:
He, interesting case of ticket edit collision :-)
Wait, was that simply me doing a Modify instead of a Reply?
Right, that's what happened, which also explains the Replying to … (kept from the original comment:9).
[OT] suggesting a visual clue when we're in editing mode, e.g. a light yellow background.
comment:12 by , 15 years ago
Replying to cboos:
If not already done, we should also make sure the docstring for
like()
correctly highlights its case-insensitiveness.
Done already in my sandbox. Complete patch coming shortly.
by , 15 years ago
Attachment: | 8519-replace-like-r8958.patch added |
---|
Replace case-sensitive LIKE
clauses with range clauses.
follow-up: 14 comment:13 by , 15 years ago
The patch above replaces all LIKE
clauses that were supposed to be case-sensitive with range tests as suggested in comment:7 (version control cache and ticket comment editing). There will be no speed boost at this time for the cache, as we still use a CAST(rev AS int)
on the revision column (until #6654 is fixed).
If the patch is ok, I'll apply it and call this a "wontfix" :)
follow-up: 15 comment:14 by , 15 years ago
Component: | general → database backend |
---|---|
Keywords: | ilike added |
Replying to rblank:
If the patch is ok, I'll apply it and call this a "wontfix" :)
The patch is fine except for the comment:%
selection:
sqlite> create table test ( comment text ); sqlite> insert into test values ('comment:1'); sqlite> insert into test values ('comment:2'); sqlite> select * from test where comment >= 'comment0'; comment:1 comment:2 sqlite> select * from test where comment < 'comment:'; sqlite> select * from test where comment >= 'comment0' and comment < 'comment:'; sqlite> select * from test where comment >= 'comment:' and comment < 'comment;'; comment:1 comment:2 sqlite>
If we want to use the same trick as for "/0" we need to use ":;".
Also, fixed here seems perfectly valid to me.
follow-up: 16 comment:15 by , 15 years ago
Replying to cboos:
The patch is fine except for the
comment:%
selection:
Well, no, I want to match the comment edits, i.e. "_comment%d"
. So the condition is:
field >= '_comment0' AND field < '_comment:'
because the next character after 9
is :
.
(OT: I also just clicked "Edit" instead of "Reply"… Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")
comment:16 by , 15 years ago
Replying to rblank:
Replying to cboos:
The patch is fine except for the
comment:%
selection:Well, no, I want to match the comment edits, i.e.
"_comment%d"
.
Oh, my bad. Perfect then.
(OT: I also just clicked "Edit" instead of "Reply"… Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")
Ok, I'll do that together with the .css change.
comment:17 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Patch applied in [8973], and db.like()
is left case-insensitive.
follow-up: 19 comment:18 by , 15 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Replying to rblank:
path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'
We figured out this doesn't work as expected on some PostgreSQL database.
After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones…
Expected:
www-data@lynx:~$ cat <<EOF | LC_COLLATE=C sort > a/ > a/a > a0 > EOF a/ a/a a0
But with UTF8:
www-data@lynx:~$ cat <<EOF | LC_COLLATE=en_US.utf8 sort > a/ > a/a > a0 > EOF a/ a0 a/a
See e.g. http://archives.postgresql.org/pgsql-bugs/2006-03/msg00067.php
Unfortunately, this setting can't be changed:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
in http://www.postgresql.org/docs/8.3/interactive/locale.html
So we need to see if we can do something similar in UTF8 and adapt the query to the collation type.
comment:19 by , 15 years ago
Replying to cboos:
After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones…
What I don't understand is that we both have "en_US.UTF-8" as LC_COLLATE
, but t.e.o has the issue and my server hasn't. I suppose "show LC_COLLATE" must be wrong on my server, as it gives "en_US.UTF-8", but "locale" returns "C".
Why one would want the second sort order in comment:18 (in any locale) is beyond me…
comment:20 by , 15 years ago
I find that really… weird:
$ echo -e "a/\na/a\na01234zzz/a\n" | LC_COLLATE="en_US.UTF-8" sort a/ a01234zzz/a a/a
Oh well, time to revert a few changesets, then.
follow-up: 23 comment:22 by , 15 years ago
#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.
comment:23 by , 15 years ago
Replying to rblank:
#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.
I'm not sure it's a good idea. The point here would be to find a way to make an efficient use of indexes, as this can make a tremendous difference, at least for SQLite as comment:6 shows.
For PostgreSQL, it would be interesting to see if LIKE '<prefix>/%'
is able to make use of the index or not. Otherwise we should really write some install notes advising to do an initdb --locale=C
when creating a Trac database.
To summarize, I think we should refactor this into a match_prefix(p)
connection helper method that will generate the sql and the arguments best adapted to each backend (i.e. the (... = ... OR (... >= ... AND ... < ...))
for SQLite and MySQL and (... = ... OR ... LIKE ...)
for PostgreSQL in case the locale was not the expected one (a simple one-time dynamic check of the result of SELECT 'a/a' < 'a0'
could be used to detect the mode when immediately after the connection creation).
comment:24 by , 15 years ago
A match_prefix()
sounds like a good idea. But I'm not sure how we can get the automatic detection right. In comment:19, I was wondering why I got a different result from yours, with the same LC_COLLATE=en_US.UTF-8
setting. It seems that collation is not all too clearly defined across platforms.
On Linux (Gentoo):
$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort a/ a0 a/a
On OS X 10.5.6:
$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort a/ a/a a0
So it seems that the Mac gets it right. Cygwin on XP agrees with the Mac, but this could just as well be due to the absence of locale support. I'm not sure how to test that on native Windows. Maybe directly in PostgreSQL?
follow-up: 26 comment:25 by , 15 years ago
I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.
comment:26 by , 15 years ago
Milestone: | 0.12 → next-minor-0.12.x |
---|
Replying to cboos:
I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.
The issue is not really fixed: we still match path prefixes as case-insensitive. But I agree that we can re-schedule this ticket post-0.12.
comment:27 by , 15 years ago
This seems to be a bug, see also http://wiki.freebsd.org/KonradJankowski/Collation.
And, if the specified locale does not exist, the fallback will be the standard 'C' locale (cf. 8519#comment:24)
comment:28 by , 14 years ago
Milestone: | next-minor-0.12.x → next-major-0.1X |
---|
Still no better idea, moving out.
comment:29 by , 14 years ago
I think ilike() should still be there as an alias for like(), so we can decide which one to use according to our needs - so if a real LIKE and ILIKE is implemented in the future, code won't break
(I have some code that needs case-insensitive matching, and I'm using like() which feels kinda weird - if this gets fixed in the future, my code will break…)
follow-up: 31 comment:30 by , 13 years ago
Cc: | added |
---|
Are there any news? Either in SQLite or Postgre?
I recognized that SQLite is out in version 3.7.10 and Trac "only" requires 2.0.7. Maybe requiring a higher version of SQLite or Postgre would enable this feature?
A few users already requested this feature for our Trac instance.
comment:31 by , 13 years ago
Replying to framay <franz.mayer@…>:
[…] Trac "only" requires 2.0.7
I just saw, that Trac requires SQLite version 3.3.3 … but still maybe it could be increased a bit in favor of having nice feature(s)?
follow-up: 34 comment:32 by , 13 years ago
Cc: | added |
---|
SQLite 2.8+ has GLOB
operator which is case-sensitive. We can use it? However, GLOB
uses wildcards like shell (e.g. abc*xyz
). See http://www.sqlite.org/lang_expr.html#glob.
-
trac/db/sqlite_backend.py
26 26 from trac.util.translation import _ 27 27 28 28 _like_escape_re = re.compile(r'([/_%])') 29 _match_prefix_re = re.compile(r'[*?[]') 29 30 30 31 try: 31 32 import pysqlite2.dbapi2 as sqlite … … 310 311 else: 311 312 return text 312 313 314 def match_prefix(self, prefix): 315 """Return a prefix matching condition.""" 316 return "GLOB '%s*'" % \ 317 _match_prefix_re.sub(r'[\g<0>]', prefix).replace("'", "''") 318 313 319 def quote(self, identifier): 314 320 """Return the quoted identifier.""" 315 321 return "`%s`" % identifier.replace('`', '``')
comment:33 by , 13 years ago
I just thought "how have other bug-tracker solved the case-sensitive problem"? But I haven't seen any case-sensitive checkbox on redmine, bugzilla, or Jira.
On Jira Knowledge Base it is written explicitly: Note: All query terms in JIRA are case insensitive.
follow-up: 35 comment:34 by , 13 years ago
Replying to jomae:
SQLite 2.8+ has
GLOB
operator which is case-sensitive. We can use it? However,GLOB
uses wildcards like shell (e.g.abc*xyz
). See http://www.sqlite.org/lang_expr.html#glob.
Well, the idea of a match_prefix()
was to give us the possibility to do a fast range search when it's "well behaved" according to our needs (comment:23).
But I'm not sure anymore we really need such a match_prefix()
, as it seems that recent SQLite versions are able to use the index while doing queries like LIKE '<somepath>/%'
(or like GLOB '<somepath>/*'
). This is according to EXPLAIN QUERY PLAN
at least, I will do some real experiments while working on #8813.
But the proposed GLOB
looks like a perfect match for implementing a real case-sensitive like()
for SQLite. Why didn't we think about that before, is GLOB
a recent addition?
comment:35 by , 13 years ago
Priority: | high → normal |
---|
Replying to cboos:
But I'm not sure anymore we really need such a
match_prefix()
, as it seems that recent SQLite versions are able to use the index while doing queries likeLIKE '<somepath>/%'
(or likeGLOB '<somepath>/*'
). This is according toEXPLAIN QUERY PLAN
at least, I will do some real experiments while working on #8813.
I was not so successful, and the results from comment:6 remain current. And worse, it seems that SQLite has even trouble using an index on path with range queries like the one in discussed in comment:5.
So back to the specifics of this ticket, like()
vs. ilike()
. We could rework 8519-ilike-r8958.patch to use GLOB for SQLite, but then we'd need to also wrap the argument passed to executed, to do the necessary adaptations (s/%/*/
, s/_/?/
) to the pattern. Not so convenient.
So we could indeed go for the simpler match_prefix
fix proposed by Jun, as it's less invasive and does improve the correctness of the prev/next rev query, the one place where it really matters.
comment:36 by , 10 years ago
Owner: | removed |
---|---|
Status: | reopened → new |
function to be used as
s LIKE p
operator in SQLite statements.