Edgewall Software
Modify

Ticket #8519 (reopened defect)

Opened 3 years ago

Last modified 12 days ago

internal SQL string matches are always case insensitive

Reported by: cboos Owned by: rblank
Priority: normal Milestone: next-major-0.1X
Component: database backend Version: none
Severity: normal Keywords: sql like ilike
Cc: franz.mayer@…, jomae
Release Notes:
API Changes:

Description

Our like() helper actually translate into doing an ILIKE or similar. There are several places where this is not desired, we would actually need a case sensitive search.

The idea is to rename the current like() to ilike() and add a real like() helper.

See #8424 for the original ticket and implementation hints for SQLite which doesn't have a case-sensitive LIKE.

Attachments

sqlite_like.py (997 bytes) - added by cboos 2 years ago.
function to be used as s LIKE p operator in SQLite statements.
8519-ilike-r8958.patch (11.4 KB) - added by rblank 2 years ago.
Add ilike() to the DB connection, and implement a correct like() for PostgreSQL and MySQL. Misuse the LIKE operator on SQLite.
8519-replace-like-r8958.patch (9.3 KB) - added by rblank 2 years ago.
Replace case-sensitive LIKE clauses with range clauses.
t8519-case-sensitive-like-r10949.diff (9.2 KB) - added by jomae 11 days ago.
Supports case-sensitive db.like() and adds db.like_pattern() helper.

Download all attachments as: .zip

Change History

Changed 2 years ago by cboos

function to be used as s LIKE p operator in SQLite statements.

comment:1 Changed 2 years ago by cboos

attachment:sqlite_like.py can be installed as a custom function (see ticket:8424#comment:3).

Last edited 2 years ago by cboos (previous) (diff)

comment:2 Changed 2 years ago by rblank

  • Owner changed from cboos to rblank

I'll implement the DB API change.

Changed 2 years ago by rblank

Add ilike() to the DB connection, and implement a correct like() for PostgreSQL and MySQL. Misuse the LIKE operator on SQLite.

comment:3 follow-up: Changed 2 years ago by rblank

The patch above adds an ilike() method to the database connections, and changes like() to be case-sensitive on PostgreSQL and MySQL. For SQLite, the proposed function is installed for the 2-argument LIKE operator, and the simplistic test cases pass.

However, the SQLite solution is a hack, and it works only in the most simple cases:

  • It only allows % wildcards at the beginning or end of the pattern.
  • It doesn't support the _ wildcard.
  • It doesn't support escaping of wildcards.

Implementing this correctly would make the function painfully slow. My suggestion would be to keep like() case-insensitive on SQLite, and wait for it to grow a proper ILIKE operator. I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make LIKE case-sensitive via the PRAGMA.

Thoughts?

comment:4 in reply to: ↑ 3 ; follow-up: Changed 2 years ago by cboos

Replying to rblank:

However, the SQLite solution is a hack, and it works only in the most simple cases:

I think that's all we need in Trac proper. Then, of course, plugins might have higher expectations.

  • It only allows % wildcards at the beginning or end of the pattern.
  • It doesn't support the _ wildcard.
  • It doesn't support escaping of wildcards.

Implementing this correctly would make the function painfully slow.

Maybe, but then string operations are fast in Python, things like find. We could have a try on improving the above and still have something reasonably efficient.

My suggestion would be to keep like() case-insensitive on SQLite, and wait for it to grow a proper ILIKE operator.

But don't we use mainly ilike() now? For which case do we still use like() and need "native" performance?

I'm even considering implementing it myself and submitting a patch to SQLite. Then we could make LIKE case-sensitive via the PRAGMA.

Maybe the best long term solution ...

comment:5 in reply to: ↑ 4 Changed 2 years ago by rblank

Replying to cboos:

But don't we use mainly ilike() now? For which case do we still use like() and need "native" performance?

The SVN cache, where we use it to match paths (see cache.py). But maybe there's a way to reformulate the query that doesn't require LIKE. I read a trick yesterday where somebody replaced LIKE '...%' with BETWEEN, like so:

path LIKE 'trunk/trac/versioncontrol/%'

would be changed to:

path BETWEEN 'trunk/trac/versioncontrol/' AND 'trunk/trac/versioncontrol0'

(where 0 is the character immediately following / in the current encoding).

comment:6 Changed 2 years ago by cboos

Very interesting. So I did some experiments:

import sqlite3
import random

c = sqlite3.connect('testpaths')
c.execute('create table test ( path text primary key)')

def fragment(letters, r):
    return ''.join(random.choice(letters) for i in range(0, random.choice(r)))

def path():
    return '/'.join(['trunk', 
                     fragment('abcdef', range(4, 8)),
                     fragment('defghijkl', range(4, 10)),
                     fragment('pqrstuvwx', range(4, 6))])

c.executemany('insert into test values (?)', ((path(),) for x in range(0, 100000)))

def like3():
  return c.execute("select count(*) from test "
                   "where path like 'trunk//beef//%' escape '/'").fetchall()

def like2():
  return c.execute("select count(*) from test "
                   "where path like 'trunk/beef/%'").fetchall()

def between():
    return c.execute("select count(*) from test where "
                     "path between 'trunk/beef/' and 'trunk/beef0'").fetchall()

timeit:

>>> import timeit
>>> timeit.timeit('like3()', 'from __main__ import like3', number=100)
4.868307215023151
>>> timeit.timeit('like2()', 'from __main__ import like2', number=100)
4.4357460235867974

With python sqlite_like from attachment:sqlite_like.py:

>>> c.create_function('like', 2, sqlite_like)
>>> timeit.timeit('like2()', 'from __main__ import like2', number=100)
20.509961112376018

And finally:

>>> timeit.timeit('between()', 'from __main__ import between', number=100000)
1.814714859011417

No, there's no typo in the number ;-)
Nice one ;-)

Last edited 2 years ago by cboos (previous) (diff)

comment:7 follow-up: Changed 2 years ago by rblank

Interesting! But I made a small mistake. You can't use BETWEEN, because both ends of the range are included. You have to use:

path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'

comment:8 follow-up: Changed 2 years ago by cboos

Well, I expect the speed should be in the same league with the explicit comparison than with the "between", as it will also be able to make good use of the index. What the above shows IMO is that the LIKE 'xxx%... doesn't benefit from the index, with SQLite.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 2 years ago by rblank

Replying to cboos:

... So my next suggestion would be to drop the idea of separate like() and ilike() altogether, and keep like() case-insensitive.

Ha! Lazyness attack! But I have to agree, let's defer making the like/ilike distinction to when it becomes really needed. If not already done, we should also make sure the docstring for like() correctly highlights its case-insensitiveness.

Last edited 2 years ago by cboos (previous) (diff)

comment:10 follow-up: Changed 2 years ago by cboos

He, interesting case of ticket edit collision :-)

comment:11 in reply to: ↑ 10 Changed 2 years ago by cboos

Replying to cboos:

He, interesting case of ticket edit collision :-)

Wait, was that simply me doing a Modify instead of a Reply?

Right, that's what happened, which also explains the Replying to ... (kept from the original comment:9).

[OT] suggesting a visual clue when we're in editing mode, e.g. a light yellow background.

Last edited 2 years ago by cboos (previous) (diff)

comment:12 in reply to: ↑ 9 Changed 2 years ago by rblank

Replying to cboos:

If not already done, we should also make sure the docstring for like() correctly highlights its case-insensitiveness.

Done already in my sandbox. Complete patch coming shortly.

Changed 2 years ago by rblank

Replace case-sensitive LIKE clauses with range clauses.

comment:13 follow-up: Changed 2 years ago by rblank

The patch above replaces all LIKE clauses that were supposed to be case-sensitive with range tests as suggested in comment:7 (version control cache and ticket comment editing). There will be no speed boost at this time for the cache, as we still use a CAST(rev AS int) on the revision column (until #6654 is fixed).

If the patch is ok, I'll apply it and call this a "wontfix" :)

comment:14 in reply to: ↑ 13 ; follow-up: Changed 2 years ago by cboos

  • Component changed from general to database backend
  • Keywords ilike added

Replying to rblank:

If the patch is ok, I'll apply it and call this a "wontfix" :)

The patch is fine except for the comment:% selection:

sqlite> create table test ( comment text );
sqlite> insert into test values ('comment:1');
sqlite> insert into test values ('comment:2');
sqlite> select * from test where comment >= 'comment0';
comment:1
comment:2
sqlite> select * from test where comment < 'comment:';
sqlite> select * from test where comment >= 'comment0' and comment < 'comment:';
sqlite> select * from test where comment >= 'comment:' and comment < 'comment;';
comment:1
comment:2
sqlite>

If we want to use the same trick as for "/0" we need to use ":;".

Also, fixed here seems perfectly valid to me.

comment:15 in reply to: ↑ 14 ; follow-up: Changed 2 years ago by rblank

Replying to cboos:

The patch is fine except for the comment:% selection:

Well, no, I want to match the comment edits, i.e. "_comment%d". So the condition is:

field >= '_comment0' AND field < '_comment:'

because the next character after 9 is :.

(OT: I also just clicked "Edit" instead of "Reply"... Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")

comment:16 in reply to: ↑ 15 Changed 2 years ago by cboos

Replying to rblank:

Replying to cboos:

The patch is fine except for the comment:% selection:

Well, no, I want to match the comment edits, i.e. "_comment%d".

Oh, my bad. Perfect then.

(OT: I also just clicked "Edit" instead of "Reply"... Maybe we should swap both buttons? It would fit nicely with the delete plugin: "Reply", "Edit", "Delete")

Ok, I'll do that together with the .css change.

comment:17 Changed 2 years ago by rblank

  • Resolution set to fixed
  • Status changed from new to closed

Patch applied in [8973], and db.like() is left case-insensitive.

comment:18 in reply to: ↑ 7 ; follow-up: Changed 2 years ago by cboos

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to rblank:

path >= 'trunk/trac/versioncontrol/' AND path < 'trunk/trac/versioncontrol0'

We figured out this doesn't work as expected on some PostgreSQL database.

After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones...

Expected:

www-data@lynx:~$ cat <<EOF | LC_COLLATE=C sort
> a/
> a/a
> a0
> EOF
a/
a/a
a0

But with UTF8:

www-data@lynx:~$ cat <<EOF | LC_COLLATE=en_US.utf8 sort
> a/
> a/a
> a0
> EOF
a/
a0
a/a

See e.g. http://archives.postgresql.org/pgsql-bugs/2006-03/msg00067.php

Unfortunately, this setting can't be changed:

The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
in http://www.postgresql.org/docs/8.3/interactive/locale.html

So we need to see if we can do something similar in UTF8 and adapt the query to the collation type.

comment:19 in reply to: ↑ 18 Changed 2 years ago by rblank

Replying to cboos:

After some research, it seems this is indeed due to the UTF8 collating rules, which are different from the ASCII ones...

What I don't understand is that we both have "en_US.UTF-8" as LC_COLLATE, but t.e.o has the issue and my server hasn't. I suppose "show LC_COLLATE" must be wrong on my server, as it gives "en_US.UTF-8", but "locale" returns "C".

Why one would want the second sort order in comment:18 (in any locale) is beyond me...

comment:20 Changed 2 years ago by rblank

I find that really... weird:

$ echo -e "a/\na/a\na01234zzz/a\n" | LC_COLLATE="en_US.UTF-8" sort

a/
a01234zzz/a
a/a

Oh well, time to revert a few changesets, then.

comment:21 Changed 2 years ago by rblank

Change reverted in [8987]. Back to square one...

comment:22 follow-up: Changed 2 years ago by rblank

#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.

comment:23 in reply to: ↑ 22 Changed 2 years ago by cboos

Replying to rblank:

#8914 suggests adding an interface to issue REGEXP queries. This could be used instead of LIKE for case-sensitive matching.

I'm not sure it's a good idea. The point here would be to find a way to make an efficient use of indexes, as this can make a tremendous difference, at least for SQLite as comment:6 shows.

For PostgreSQL, it would be interesting to see if LIKE '<prefix>/%' is able to make use of the index or not. Otherwise we should really write some install notes advising to do an initdb --locale=C when creating a Trac database.

To summarize, I think we should refactor this into a match_prefix(p) connection helper method that will generate the sql and the arguments best adapted to each backend (i.e. the (... = ... OR (... >= ... AND ... < ...)) for SQLite and MySQL and (... = ... OR ... LIKE ...) for PostgreSQL in case the locale was not the expected one (a simple one-time dynamic check of the result of SELECT 'a/a' < 'a0' could be used to detect the mode when immediately after the connection creation).

comment:24 Changed 2 years ago by rblank

A match_prefix() sounds like a good idea. But I'm not sure how we can get the automatic detection right. In comment:19, I was wondering why I got a different result from yours, with the same LC_COLLATE=en_US.UTF-8 setting. It seems that collation is not all too clearly defined across platforms.

On Linux (Gentoo):

$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort
a/
a0
a/a

On OS X 10.5.6:

$ echo -e "a/\na/a\na0" | LC_COLLATE="en_US.UTF-8" sort
a/
a/a
a0

So it seems that the Mac gets it right. Cygwin on XP agrees with the Mac, but this could just as well be due to the absence of locale support. I'm not sure how to test that on native Windows. Maybe directly in PostgreSQL?

comment:25 follow-up: Changed 2 years ago by cboos

I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.

comment:26 in reply to: ↑ 25 Changed 2 years ago by rblank

  • Milestone changed from 0.12 to next-minor-0.12.x

Replying to cboos:

I suggest we close this now (as the initial problem is fixed) and reopen a new one later for addressing the match_prefix optimization.

The issue is not really fixed: we still match path prefixes as case-insensitive. But I agree that we can re-schedule this ticket post-0.12.

comment:27 Changed 2 years ago by Carsten Klein <carsten.klein@…>

This seems to be a bug, see also http://wiki.freebsd.org/KonradJankowski/Collation.

And, if the specified locale does not exist, the fallback will be the standard 'C' locale (cf. 8519#comment:24)

comment:28 Changed 17 months ago by rblank

  • Milestone changed from next-minor-0.12.x to next-major-0.1X

Still no better idea, moving out.

comment:29 Changed 16 months ago by shesek

I think ilike() should still be there as an alias for like(), so we can decide which one to use according to our needs - so if a real LIKE and ILIKE is implemented in the future, code won't break

(I have some code that needs case-insensitive matching, and I'm using like() which feels kinda weird - if this gets fixed in the future, my code will break...)

comment:30 follow-up: Changed 3 weeks ago by framay <franz.mayer@…>

  • Cc franz.mayer@… added

Are there any news? Either in SQLite or Postgre?

I recognized that SQLite is out in version 3.7.10 and Trac "only" requires 2.0.7. Maybe requiring a higher version of SQLite or Postgre would enable this feature?

A few users already requested this feature for our Trac instance.

comment:31 in reply to: ↑ 30 Changed 3 weeks ago by framay <franz.mayer@…>

Replying to framay <franz.mayer@…>:

[...] Trac "only" requires 2.0.7

I just saw, that Trac requires SQLite version 3.3.3 ... but still maybe it could be increased a bit in favor of having nice feature(s)?

comment:32 follow-up: Changed 3 weeks ago by jomae

  • Cc jomae added

SQLite 2.8+ has GLOB operator which is case-sensitive. We can use it? However, GLOB uses wildcards like shell (e.g. abc*xyz). See http://www.sqlite.org/lang_expr.html#glob.

  • trac/db/sqlite_backend.py

     
    2626from trac.util.translation import _ 
    2727 
    2828_like_escape_re = re.compile(r'([/_%])') 
     29_match_prefix_re = re.compile(r'[*?[]') 
    2930 
    3031try: 
    3132    import pysqlite2.dbapi2 as sqlite 
     
    310311        else: 
    311312            return text 
    312313 
     314    def match_prefix(self, prefix): 
     315        """Return a prefix matching condition.""" 
     316        return "GLOB '%s*'" % \ 
     317               _match_prefix_re.sub(r'[\g<0>]', prefix).replace("'", "''") 
     318 
    313319    def quote(self, identifier): 
    314320        """Return the quoted identifier.""" 
    315321        return "`%s`" % identifier.replace('`', '``') 

comment:33 Changed 3 weeks ago by framay <franz.mayer@…>

I just thought "how have other bug-tracker solved the case-sensitive problem"? But I haven't seen any case-sensitive checkbox on redmine, bugzilla, or Jira.

On Jira Knowledge Base it is written explicitly:
Note: All query terms in JIRA are case insensitive.

comment:34 in reply to: ↑ 32 ; follow-up: Changed 2 weeks ago by cboos

Replying to jomae:

SQLite 2.8+ has GLOB operator which is case-sensitive. We can use it? However, GLOB uses wildcards like shell (e.g. abc*xyz). See http://www.sqlite.org/lang_expr.html#glob.

Well, the idea of a match_prefix() was to give us the possibility to do a fast range search when it's "well behaved" according to our needs (comment:23).

But I'm not sure anymore we really need such a match_prefix(), as it seems that recent SQLite versions are able to use the index while doing queries like LIKE '<somepath>/%' (or like GLOB '<somepath>/*'). This is according to EXPLAIN QUERY PLAN at least, I will do some real experiments while working on #8813.

But the proposed GLOB looks like a perfect match for implementing a real case-sensitive like() for SQLite. Why didn't we think about that before, is GLOB a recent addition?

comment:35 in reply to: ↑ 34 Changed 12 days ago by cboos

  • Priority changed from high to normal

Replying to cboos:

But I'm not sure anymore we really need such a match_prefix(), as it seems that recent SQLite versions are able to use the index while doing queries like LIKE '<somepath>/%' (or like GLOB '<somepath>/*'). This is according to EXPLAIN QUERY PLAN at least, I will do some real experiments while working on #8813.

I was not so successful, and the results from comment:6 remain current. And worse, it seems that SQLite has even trouble using an index on path with range queries like the one in discussed in comment:5.

So back to the specifics of this ticket, like() vs. ilike(). We could rework 8519-ilike-r8958.patch to use GLOB for SQLite, but then we'd need to also wrap the argument passed to executed, to do the necessary adaptations (s/%/*/, s/_/?/) to the pattern. Not so convenient.

So we could indeed go for the simpler match_prefix fix proposed by Jun, as it's less invasive and does improve the correctness of the prev/next rev query, the one place where it really matters.

Changed 11 days ago by jomae

Supports case-sensitive db.like() and adds db.like_pattern() helper.

View

Add a comment

Modify Ticket

Change Properties
<Author field>
Action
as reopened
as The resolution will be set. Next status will be 'closed'
to The owner will be changed from rblank. Next status will be 'new'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.