Opened 18 years ago
Closed 16 years ago
#4990 closed enhancement (duplicate)
release 0.10.4 with MySQL utf8 enforcement
Reported by: |  | Owned by: | Jonas Borgström
---|---|---|---
Priority: | normal | Milestone: |
Component: | general | Version: |
Severity: | normal | Keywords: | mysql helpwanted consider
Cc: |  | Branch: |
Release Notes: |  |  |
API Changes: |  |  |
Internal Changes: |  |  |
Description
Major problems Trac has with MySQL arise because Trac uses utf8 as its internal encoding while MySQL tables are created with different collations. To narrow the scope of the outstanding MySQL bugs, we should force Trac to use only the utf8_general_ci collation for all Trac tables and columns.
So, we should release 0.10.4 with extra checks that are made using:
USE trac_database;
SHOW VARIABLES LIKE "character_set_database";
SHOW VARIABLES LIKE "collation_database";
Then, for each table, parse the result of the following command:
SHOW FULL COLUMNS FROM trac_tbl_name;
as described at http://dev.mysql.com/doc/refman/5.0/en/charset-show.html
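Equivalently, on MySQL 5.0+ the per-column collations can be pulled from information_schema in a single query instead of parsing SHOW FULL COLUMNS per table. A minimal sketch, where trac_database and the enforced collation utf8_general_ci are placeholders for the actual settings:

```sql
-- Sketch: list every Trac column whose collation differs from the enforced one.
-- "trac_database" and "utf8_general_ci" are placeholders, not fixed names.
SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = 'trac_database'
   AND COLLATION_NAME IS NOT NULL          -- numeric/date columns have no collation
   AND COLLATION_NAME <> 'utf8_general_ci';
```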
Attachments (0)
Change History (12)
comment:1 by , 18 years ago
Milestone: | → 0.10.4
---|---
comment:2 by , 18 years ago
comment:4 by , 18 years ago
Keywords: | mysql helpwanted added
---|---
Patch appreciated, as I'm not sure what could/should be done here.
comment:5 by , 18 years ago
I am a Traditional Chinese user who mostly uses big5, but I prefer utf8 over big5 in the database.
I tested trac-trunk on Debian with python2.4-2.4.4-3 / python-mysqldb 1.2.1-p2-4 / mysql-server-5.0 5.0.32-7etch1, with both big5 and utf8 in the database, and both work.
The way I tested the different encodings was to set the character set when creating the database and then let Trac create the db tables automatically, as in the sketch below.
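For reference, the only difference between my two setups was the character set given at database creation time; roughly (the database name trac is illustrative):

```sql
-- The two test setups differ only in the default charset of the database;
-- Trac's tables inherit it when they are created. "trac" is an illustrative name.
CREATE DATABASE trac CHARACTER SET big5;   -- big5 run
-- or:
CREATE DATABASE trac CHARACTER SET utf8;   -- utf8 run
```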
I understand the reason techtonik suggests limiting the db to utf8: it might narrow down some bugs in Trac's mysql-backend support, and I too prefer utf8 in the db. But in some situations using big5 (or another local charset) as the internal db encoding might be the right choice. So if Trac works fine with any backend encoding, I think not forcing utf8 might be the better choice.
As I mentioned, I tried the mysql-backend on mysql 5.x, not <= 4.x, so more testing results might be helpful.
comment:6 by , 18 years ago
It should not matter which encoding the backend uses as long as the frontend can convert it to the desired one. The user doesn't need to know anything about the DB design. Can you be more specific about why using big5 on the backend is preferable to utf-8?
comment:7 by , 18 years ago
Most Traditional Chinese characters are 2 bytes in big5 but 3 bytes in utf-8. I do agree that big5 (or another local charset) would be chosen over utf-8 in only a few situations.
As for me, I might never use big5 as the database encoding, but I still think a warning when the tables are created, or a note in the Trac documentation, might be enough.
comment:8 by , 18 years ago
It doesn't matter how many bytes each symbol takes, because internally it is still the same symbol (and the same string length). Trac users don't need to know which encoding Trac uses if it handles it gracefully. Using UTF-8 guarantees that there won't be any extra weak spots, processing or encoding glitches, on the way from Trac to the MySQL database.
comment:9 by , 18 years ago
A Unicode string in memory has the same length either way, but its size on disk in bytes differs.
One way to check this is via another issue with the Trac MySQL backend: MySQL limits index length to 1000 bytes.
I created indexes on text fields under different collations; the maximum index length in my tests was:
collation of field | max index length (chars) | index length in bytes
---|---|---
big5_chinese_ci | 500 | 1000
utf8_unicode_ci | 333 | 999
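The statements behind these numbers look roughly like the following (table and index names are illustrative, and the 1000-byte limit assumes the default MyISAM key length in MySQL 5.0):

```sql
-- Illustrative reproduction of the per-collation index prefix limit.
CREATE TABLE t_big5 (txt TEXT) CHARACTER SET big5 COLLATE big5_chinese_ci;
CREATE INDEX idx_big5 ON t_big5 (txt(500));  -- 500 chars * 2 bytes = 1000 bytes: OK

CREATE TABLE t_utf8 (txt TEXT) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
CREATE INDEX idx_utf8 ON t_utf8 (txt(333));  -- 333 chars * 3 bytes = 999 bytes: OK
-- txt(334) would need 1002 bytes and fail against the 1000-byte key limit.
```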
Why is that? See http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
On that page you can see that for MySQL 5.0, "Currently, MySQL support for UTF-8 does not include four-byte sequences.", and that Korean, Chinese, and Japanese ideographs use three-byte sequences in UTF-8. It also mentions that the most recent RFC allows 4-byte utf-8 sequences, which MySQL doesn't support for now (an older RFC even allowed 6-byte utf-8 sequences).
So in memory they are the same length, but on disk they are not. (In fact, Python can be built to use 2 bytes or 4 bytes in memory to represent a unicode character; see http://www.python.org/doc/2.2.3/whatsnew/node8.html for reference.)
Well, these are details that may or may not interest you.
comment:10 by , 18 years ago
Well, thanks for the explanation, even though I knew these details. =) In any case, the extra 200 bytes will not help work around the index limitation in MySQL unless indexes are built on hashes.
As for the unsupported 4-byte utf-8 sequences, you say that no natural language currently uses them, so that is not an issue either.
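One common shape for such a hash-based index, sketched here for illustration (not taken from any Trac patch; names are hypothetical):

```sql
-- Store a fixed-width hash of the text and index that instead of the text itself.
-- Table and column names are illustrative; the hash must be maintained on write.
CREATE TABLE doc (
  txt      TEXT,
  txt_hash CHAR(32)          -- MD5 hex digest of txt
) CHARACTER SET utf8;
CREATE INDEX idx_txt_hash ON doc (txt_hash);

-- Equality lookups then go through the hash first:
-- SELECT * FROM doc WHERE txt_hash = MD5('needle') AND txt = 'needle';
```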
comment:11 by , 17 years ago
Keywords: | consider added
---|---
Milestone: | 0.10.5 → 0.12
Eventually for 0.12, but ideally this should be contributed as a patch.
comment:12 by , 16 years ago
Milestone: | 0.13
---|---
Resolution: | → duplicate
Status: | new → closed
Superseded by #8089.
Other variables that may be useful for debugging purposes are character_set_client and character_set_connection (from http://dev.mysql.com/doc/refman/5.1/en/show-variables.html).
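For convenience, all of these can be inspected at once with a wildcard pattern:

```sql
-- Show every character-set and collation variable in one go.
SHOW VARIABLES LIKE "character_set%";
SHOW VARIABLES LIKE "collation%";
```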