Edgewall Software
Modify

Opened 15 years ago

Last modified 3 years ago

#8519 new defect

internal SQL string matches are always case insensitive

Reported by: Christian Boos Owned by:
Priority: normal Milestone: next-major-releases
Component: database backend Version: none
Severity: normal Keywords: sql like ilike
Cc: franz.mayer@…, Jun Omae Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Our like() helper actually translate into doing an ILIKE or similar. There are several places where this is not desired, we would actually need a case sensitive search.

The idea is to rename the current like() to ilike() and add a real like() helper.

See #8424 for the original ticket and implementation hints for SQLite which doesn't have a case-sensitive LIKE.

Attachments (3)

sqlite_like.py (997 bytes ) - added by Christian Boos 14 years ago.
function to be used as s LIKE p operator in SQLite statements.
8519-ilike-r8958.patch (11.4 KB ) - added by Remy Blank 14 years ago.
Add ilike() to the DB connection, and implement a correct like() for PostgreSQL and MySQL. Misuse the LIKE operator on SQLite.
8519-replace-like-r8958.patch (9.3 KB ) - added by Remy Blank 14 years ago.
Replace case-sensitive LIKE clauses with range clauses.

Download all attachments as: .zip

Change History (39)

by Christian Boos, 14 years ago

Attachment: sqlite_like.py added

function to be used as s LIKE p operator in SQLite statements.

comment:1 by Christian Boos, 14 years ago

attachment:sqlite_like.py can installed as a custom function (see ticket:8424#comment:3).

Version 0, edited 14 years ago by Christian Boos (next)

comment:2 by Remy Blank, 14 years ago

Owner: changed from Christian Boos to Remy Blank

I'll implement the DB API change.

by Remy Blank, 14 years ago

Attachment: 8519-ilike-r8958.patch added

Add ilike() to the DB connection, and implement a correct like() for PostgreSQL and MySQL. Misuse the LIKE operator on SQLite.

comment:3 by Remy Blank, 14 years ago

The patch above adds an ilike() method to the database connections, and changes like() to be case-sensitive on PostgreSQL and MySQL. For SQLite, the proposed function is installed for the 2-argument LIKE operator, and the simplistic test cases pass.

However, the SQLite solution is a hack, and it works only in the most simple cases:

  • It only allows % wildcards at the beginning or end of the pattern.
  • It doesn't support the _ wildcard.
  • It doesn't support escaping of wildcards.

Implementing this correctly would make the function painfully slow. My suggestion would be to keep like() case-insensitive on SQLite, and wait for it to grow a proper ILIKE operator. I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make LIKE case-sensitive via the PRAGMA.

Thoughts?

in reply to:  3 ; comment:4 by Christian Boos, 14 years ago

Replying to rblank:

However, the SQLite solution is a hack, and it works only in the most simple cases:

I think that's all we need in Trac proper. Then, of course, plugins might have higher expectations.

  • It only allows % wildcards at the beginning or end of the pattern.
  • It doesn't support the _ wildcard.
  • It doesn't support escaping of wildcards.

Implementing this correctly would make the function painfully slow.

Maybe, but then string operations are fast in Python, things like find. We could have a try on improving the above and still have something reasonably efficient.

My suggestion would be to keep like() case-insensitive on SQLite, and wait for it to grow a proper ILIKE operator.

But don't we use mainly ilike() now? For which case do we still use like() and need "native" performance?

I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make LIKE case-sensitive via the PRAGMA.

Maybe the best long term solution …

in reply to:  4 comment:5 by Remy Blank, 14 years ago

Replying to cboos:

But don't we use mainly ilike() now? For which case do we still use like() and need "native" performance?

The SVN cache, where we use it to match paths (see cache.py). But maybe there's a way to reformulate the query that doesn't require LIKE. I read a trick yesterday where somebody replaced LIKE '...%' with BETWEEN, like so:

path LIKE 'trunk/trac/versioncontrol/%'

would be changed to:

path BETWEEN 'trunk/trac/versioncontrol/' AND 'trunk/trac/versioncontrol0'

(where 0 is the character immediately following / in the current encoding).

comment:6 by Christian Boos, 14 years ago

Very interesting. So I did some experiments:

import sqlite3
import random

c = sqlite3.connect('testpaths')
c.execute('create table test ( path text primary key)')

def fragment(letters, r):
    return ''.join(random.choice(letters) for i in range(0, random.choice(r)))

def path():
    return '/'.join(['trunk', 
                     fragment('abcdef', range(4, 8)),
                     fragment('defghijkl', range(4, 10)),
                     fragment('pqrstuvwx', range(4, 6))])

c.executemany('insert into test values (?)', ((path(),) for x in range(0, 100000)))

def like3():
  return c.execute("select count(*) from test "
                   "where path like 'trunk//beef//%' escape '/'").fetchall()

def like2():
  return c.execute("select count(*) from test "
                   "where path like 'trunk/beef/%'").fetchall()

def between():
    return c.execute("select count(*) from test where "
                     "path between 'trunk/beef/' and 'trunk/beef0'").fetchall()

timeit:

>>> import timeit
>>> timeit.timeit('like3()', 'from __main__ import like3', number=100)
4.868307215023151
>>> timeit.timeit('like2()', 'from __main__ import like2', number=100)
4.4357460235867974

With python sqlite_like from attachment:sqlite_like.py:

>>> c.create_function('like', 2, sqlite_like)
>>> timeit.timeit('like2()', 'from __main__ import like2', number=100)
20.509961112376018

And finally:

>>> timeit.timeit('between()', 'from __main__ import between', number=100000)
1.814714859011417

No, there's no typo in the number ;-) Nice one ;-)

Last edited 14 years ago by Christian Boos (previous) (diff)

comment:7 by Remy Blank, 14 years ago

Interesting! But I made a small mistake. You can't use BETWEEN, because both ends of the range are included. You have to use:

path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'

comment:8 by Christian Boos, 14 years ago

Well, I expect the speed should be in the same league with the explicit comparison than with the "between", as it will also be able to make good use of the index. What the above shows IMO is that the LIKE 'xxx%... doesn't benefit from the index, with SQLite.

in reply to:  8 ; comment:9 by Remy Blank, 14 years ago

Replying to cboos:

… So my next suggestion would be to drop the idea of separate like() and ilike() altogether, and keep like() case-insensitive.

Ha! Lazyness attack! But I have to agree, let's defer making the like/ilike distinction to when it becomes really needed. If not already done, we should also make sure the docstring for like() correctly highlights its case-insensitiveness.

Last edited 14 years ago by Christian Boos (previous) (diff)

comment:10 by Christian Boos, 14 years ago

He, interesting case of ticket edit collision :-)

in reply to:  10 comment:11 by Christian Boos, 14 years ago

Replying to cboos:

He, interesting case of ticket edit collision :-)

Wait, was that simply me doing a Modify instead of a Reply?

Right, that's what happened, which also explains the Replying to … (kept from the original comment:9).

[OT] suggesting a visual clue when we're in editing mode, e.g. a light yellow background.

Last edited 14 years ago by Christian Boos (previous) (diff)

in reply to:  9 comment:12 by Remy Blank, 14 years ago

Replying to cboos:

If not already done, we should also make sure the docstring for like() correctly highlights its case-insensitiveness.

Done already in my sandbox. Complete patch coming shortly.

by Remy Blank, 14 years ago

Replace case-sensitive LIKE clauses with range clauses.

comment:13 by Remy Blank, 14 years ago

The patch above replaces all LIKE clauses that were supposed to be case-sensitive with range tests as suggested in comment:7 (version control cache and ticket comment editing). There will be no speed boost at this time for the cache, as we still use a CAST(rev AS int) on the revision column (until #6654 is fixed).

If the patch is ok, I'll apply it and call this a "wontfix" :)

in reply to:  13 ; comment:14 by Christian Boos, 14 years ago

Component: generaldatabase backend
Keywords: ilike added

Replying to rblank:

If the patch is ok, I'll apply it and call this a "wontfix" :)

The patch is fine except for the comment:% selection:

sqlite> create table test ( comment text );
sqlite> insert into test values ('comment:1');
sqlite> insert into test values ('comment:2');
sqlite> select * from test where comment >= 'comment0';
comment:1
comment:2
sqlite> select * from test where comment < 'comment:';
sqlite> select * from test where comment >= 'comment0' and comment < 'comment:';
sqlite> select * from test where comment >= 'comment:' and comment < 'comment;';
comment:1
comment:2
sqlite>

If we want to use the same trick as for "/0" we need to use ":;".

Also, fixed here seems perfectly valid to me.

in reply to:  14 ; comment:15 by Remy Blank, 14 years ago

Replying to cboos:

The patch is fine except for the comment:% selection:

Well, no, I want to match the comment edits, i.e. "_comment%d". So the condition is:

field >= '_comment0' AND field < '_comment:'

because the next character after 9 is :.

(OT: I also just clicked "Edit" instead of "Reply"… Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")

in reply to:  15 comment:16 by Christian Boos, 14 years ago

Replying to rblank:

Replying to cboos:

The patch is fine except for the comment:% selection:

Well, no, I want to match the comment edits, i.e. "_comment%d".

Oh, my bad. Perfect then.

(OT: I also just clicked "Edit" instead of "Reply"… Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")

Ok, I'll do that together with the .css change.

comment:17 by Remy Blank, 14 years ago

Resolution: fixed
Status: newclosed

Patch applied in [8973], and db.like() is left case-insensitive.

in reply to:  7 ; comment:18 by Christian Boos, 14 years ago

Resolution: fixed
Status: closedreopened

Replying to rblank:

path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'

We figured out this doesn't work as expected on some PostgreSQL database.

After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones…

Expected:

www-data@lynx:~$ cat <<EOF | LC_COLLATE=C sort
> a/
> a/a
> a0
> EOF
a/
a/a
a0

But with UTF8:

www-data@lynx:~$ cat <<EOF | LC_COLLATE=en_US.utf8 sort
> a/
> a/a
> a0
> EOF
a/
a0
a/a

See e.g. http://archives.postgresql.org/pgsql-bugs/2006-03/msg00067.php

Unfortunately, this setting can't be changed:

The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
in http://www.postgresql.org/docs/8.3/interactive/locale.html

So we need to see if we can do something similar in UTF8 and adapt the query to the collation type.

in reply to:  18 comment:19 by Remy Blank, 14 years ago

Replying to cboos:

After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones…

What I don't understand is that we both have "en_US.UTF-8" as LC_COLLATE, but t.e.o has the issue and my server hasn't. I suppose "show LC_COLLATE" must be wrong on my server, as it gives "en_US.UTF-8", but "locale" returns "C".

Why one would want the second sort order in comment:18 (in any locale) is beyond me…

comment:20 by Remy Blank, 14 years ago

I find that really… weird:

$ echo -e "a/\na/a\na01234zzz/a\n" | LC_COLLATE="en_US.UTF-8" sort

a/
a01234zzz/a
a/a

Oh well, time to revert a few changesets, then.

comment:21 by Remy Blank, 14 years ago

Change reverted in [8987]. Back to square one…

comment:22 by Remy Blank, 14 years ago

#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.

in reply to:  22 comment:23 by Christian Boos, 14 years ago

Replying to rblank:

#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.

I'm not sure it's a good idea. The point here would be to find a way to make an efficient use of indexes, as this can make a tremendous difference, at least for SQLite as comment:6 shows.

For PostgreSQL, it would be interesting to see if LIKE '<prefix>/%' is able to make use of the index or not. Otherwise we should really write some install notes advising to do an initdb --locale=C when creating a Trac database.

To summarize, I think we should refactor this into a match_prefix(p) connection helper method that will generate the sql and the arguments best adapted to each backend (i.e. the (... = ... OR (... >= ... AND ... < ...)) for SQLite and MySQL and (... = ... OR ... LIKE ...) for PostgreSQL in case the locale was not the expected one (a simple one-time dynamic check of the result of SELECT 'a/a' < 'a0' could be used to detect the mode when immediately after the connection creation).

comment:24 by Remy Blank, 14 years ago

A match_prefix() sounds like a good idea. But I'm not sure how we can get the automatic detection right. In comment:19, I was wondering why I got a different result from yours, with the same LC_COLLATE=en_US.UTF-8 setting. It seems that collation is not all too clearly defined across platforms.

On Linux (Gentoo):

$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort
a/
a0
a/a

On OS X 10.5.6:

$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort
a/
a/a
a0

So it seems that the Mac gets it right. Cygwin on XP agrees with the Mac, but this could just as well be due to the absence of locale support. I'm not sure how to test that on native Windows. Maybe directly in PostgreSQL?

comment:25 by Christian Boos, 14 years ago

I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.

in reply to:  25 comment:26 by Remy Blank, 14 years ago

Milestone: 0.12next-minor-0.12.x

Replying to cboos:

I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.

The issue is not really fixed: we still match path prefixes as case-insensitive. But I agree that we can re-schedule this ticket post-0.12.

comment:27 by Carsten Klein <carsten.klein@…>, 14 years ago

This seems to be a bug, see also http://wiki.freebsd.org/KonradJankowski/Collation.

And, if the specified locale does not exist, the fallback will be the standard 'C' locale (cf. 8519#comment:24)

comment:28 by Remy Blank, 14 years ago

Milestone: next-minor-0.12.xnext-major-0.1X

Still no better idea, moving out.

comment:29 by shesek, 14 years ago

I think ilike() should still be there as an alias for like(), so we can decide which one to use according to our needs - so if a real LIKE and ILIKE is implemented in the future, code won't break

(I have some code that needs case-insensitive matching, and I'm using like() which feels kinda weird - if this gets fixed in the future, my code will break…)

comment:30 by framay <franz.mayer@…>, 12 years ago

Cc: franz.mayer@… added

Are there any news? Either in SQLite or Postgre?

I recognized that SQLite is out in version 3.7.10 and Trac "only" requires 2.0.7. Maybe requiring a higher version of SQLite or Postgre would enable this feature?

A few users already requested this feature for our Trac instance.

in reply to:  30 comment:31 by framay <franz.mayer@…>, 12 years ago

Replying to framay <franz.mayer@…>:

[…] Trac "only" requires 2.0.7

I just saw, that Trac requires SQLite version 3.3.3 … but still maybe it could be increased a bit in favor of having nice feature(s)?

comment:32 by Jun Omae, 12 years ago

Cc: Jun Omae added

SQLite 2.8+ has GLOB operator which is case-sensitive. We can use it? However, GLOB uses wildcards like shell (e.g. abc*xyz). See http://www.sqlite.org/lang_expr.html#glob.

  • trac/db/sqlite_backend.py

     
    2626from trac.util.translation import _
    2727
    2828_like_escape_re = re.compile(r'([/_%])')
     29_match_prefix_re = re.compile(r'[*?[]')
    2930
    3031try:
    3132    import pysqlite2.dbapi2 as sqlite
     
    310311        else:
    311312            return text
    312313
     314    def match_prefix(self, prefix):
     315        """Return a prefix matching condition."""
     316        return "GLOB '%s*'" % \
     317               _match_prefix_re.sub(r'[\g<0>]', prefix).replace("'", "''")
     318
    313319    def quote(self, identifier):
    314320        """Return the quoted identifier."""
    315321        return "`%s`" % identifier.replace('`', '``')

comment:33 by framay <franz.mayer@…>, 12 years ago

I just thought "how have other bug-tracker solved the case-sensitive problem"? But I haven't seen any case-sensitive checkbox on redmine, bugzilla, or Jira.

On Jira Knowledge Base it is written explicitly: Note: All query terms in JIRA are case insensitive.

in reply to:  32 ; comment:34 by Christian Boos, 12 years ago

Replying to jomae:

SQLite 2.8+ has GLOB operator which is case-sensitive. We can use it? However, GLOB uses wildcards like shell (e.g. abc*xyz). See http://www.sqlite.org/lang_expr.html#glob.

Well, the idea of a match_prefix() was to give us the possibility to do a fast range search when it's "well behaved" according to our needs (comment:23).

But I'm not sure anymore we really need such a match_prefix(), as it seems that recent SQLite versions are able to use the index while doing queries like LIKE '<somepath>/%' (or like GLOB '<somepath>/*'). This is according to EXPLAIN QUERY PLAN at least, I will do some real experiments while working on #8813.

But the proposed GLOB looks like a perfect match for implementing a real case-sensitive like() for SQLite. Why didn't we think about that before, is GLOB a recent addition?

in reply to:  34 comment:35 by Christian Boos, 12 years ago

Priority: highnormal

Replying to cboos:

But I'm not sure anymore we really need such a match_prefix(), as it seems that recent SQLite versions are able to use the index while doing queries like LIKE '<somepath>/%' (or like GLOB '<somepath>/*'). This is according to EXPLAIN QUERY PLAN at least, I will do some real experiments while working on #8813.

I was not so successful, and the results from comment:6 remain current. And worse, it seems that SQLite has even trouble using an index on path with range queries like the one in discussed in comment:5.

So back to the specifics of this ticket, like() vs. ilike(). We could rework 8519-ilike-r8958.patch to use GLOB for SQLite, but then we'd need to also wrap the argument passed to executed, to do the necessary adaptations (s/%/*/, s/_/?/) to the pattern. Not so convenient.

So we could indeed go for the simpler match_prefix fix proposed by Jun, as it's less invasive and does improve the correctness of the prev/next rev query, the one place where it really matters.

comment:36 by Ryan J Ollos, 9 years ago

Owner: Remy Blank removed
Status: reopenednew

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned.
as The resolution will be set. Next status will be 'closed'.
The owner will be changed from (none) to anonymous. Next status will be 'assigned'.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.