Opened 17 years ago

Closed 15 years ago

#4990 closed enhancement (duplicate)

release 0.10.4 with MySQL utf8 enforcement

Reported by: techtonik <techtonik@…>
Owned by: Jonas Borgström
Priority: normal
Milestone:
Component: general
Version:
Severity: normal
Keywords: mysql helpwanted consider
Cc:
Branch:
Release Notes:
API Changes:
Internal Changes:

Description

The major problems Trac has with MySQL arise because Trac uses utf8 as its internal encoding while MySQL tables are created with different collations. To narrow the scope of outstanding MySQL bugs, we should force it to use only a utf8 collation (for example utf8_general_ci) for all Trac tables and columns.

So we should release 0.10.4 with extra checks, made using:

USE trac_database;
SHOW VARIABLES LIKE "character_set_database";
SHOW VARIABLES LIKE "collation_database";

Then, for each table, parse the result of the following command:

SHOW FULL COLUMNS FROM trac_tbl_name;

as described here http://dev.mysql.com/doc/refman/5.0/en/charset-show.html
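The per-table check described above could be sketched roughly as follows. This is only an illustration, not Trac code: the function name non_utf8_columns and the sample rows are hypothetical, and the rows are assumed to arrive as tuples in the column order that SHOW FULL COLUMNS uses in MySQL 5.0 (Field, Type, Collation, Null, Key, Default, Extra, Privileges, Comment).

```python
# Hypothetical helper: flags columns whose collation is not a utf8 one.
# Assumes each row follows the SHOW FULL COLUMNS column order in MySQL 5.0:
# Field, Type, Collation, Null, Key, Default, Extra, Privileges, Comment
def non_utf8_columns(rows):
    """Return (field, collation) pairs whose collation is set but not utf8."""
    offenders = []
    for row in rows:
        field, collation = row[0], row[2]
        # Non-text columns report NULL for Collation, which arrives as None.
        if collation is not None and not collation.startswith("utf8_"):
            offenders.append((field, collation))
    return offenders


# Made-up sample rows standing in for a real query result.
sample_rows = [
    ("id", "int(11)", None, "NO", "PRI", None, "auto_increment", "select", ""),
    ("summary", "text", "utf8_general_ci", "YES", "", None, "", "select", ""),
    ("description", "text", "latin1_swedish_ci", "YES", "", None, "", "select", ""),
]

print(non_utf8_columns(sample_rows))
```

In a real check the rows would come from a database cursor after running the SHOW FULL COLUMNS statement for each Trac table.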

Attachments (0)

Change History (12)

comment:1 by anonymous, 17 years ago

Milestone: 0.10.4

comment:2 by techtonik <techtonik@…>, 17 years ago

Other variables which may be useful for debugging purposes are: character_set_client and character_set_connection

From http://dev.mysql.com/doc/refman/5.1/en/show-variables.html
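A rough sketch of how these variables might be inspected once fetched. The helper name non_utf8_variables and the sample values are hypothetical; the rows are assumed to be (Variable_name, Value) pairs as returned by SHOW VARIABLES.

```python
# Hypothetical helper: reports character set variables that are not utf8.
# Assumes rows are (Variable_name, Value) pairs from SHOW VARIABLES.
CHARSET_VARIABLES = (
    "character_set_database",
    "character_set_client",
    "character_set_connection",
)


def non_utf8_variables(rows, names=CHARSET_VARIABLES):
    """Return {name: value} for every listed variable whose value is not 'utf8'."""
    values = dict(rows)
    # A variable missing from the result is reported with the value None.
    return {name: values.get(name) for name in names if values.get(name) != "utf8"}


# Made-up sample standing in for a real query result.
sample_variables = [
    ("character_set_database", "utf8"),
    ("character_set_client", "latin1"),
    ("character_set_connection", "utf8"),
]

print(non_utf8_variables(sample_variables))
```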

comment:3 by Christian Boos, 17 years ago

Milestone: 0.10.4 → 0.10.5

Not for 0.10.4.

comment:4 by Christian Boos, 17 years ago

Keywords: mysql helpwanted added

Patch appreciated, as I'm not sure what could/should be done here.

comment:5 by mail@…, 17 years ago

I am a Traditional Chinese user who mostly uses big5 but prefers utf8 over big5 in the database.

I tested trac-trunk on Debian with python2.4-2.4.4-3 / python-mysqldb 1.2.1-p2-4 / mysql-server-5.0 5.0.32-7etch1, with both big5 and utf8 as the database character set, and both work.

The way I tested the different encodings: set the character set when creating the database, then let Trac create the tables automatically.

I understand techtonik's reasoning that limiting the database to utf8 might narrow down some bugs in Trac's MySQL backend support, and I prefer utf8 in the database myself. But in certain cases big5 as the database encoding might be the right choice. So if Trac works fine with any backend encoding, I think not forcing utf8 is the better choice.

As mentioned, I tried the MySQL backend on MySQL 5.x, not <= 4.x, so more test results would be helpful.

comment:6 by techtonik <techtonik@…>, 17 years ago

It should not matter which encoding the backend uses as long as the frontend is able to convert it to the desired one. Users don't need to know anything about the DB design. Can you be more specific about why using big5 in the backend is preferable to utf-8?

comment:7 by mail@…, 17 years ago

Most Traditional Chinese characters take 2 bytes in big5 but 3 bytes in utf-8. I do agree that there are probably only a few situations in which big5 (or another local charset) would be chosen over utf-8.

As for me, although I might never use big5 as the database encoding, I still think a warning when the tables are created, or a note in the Trac documentation, might be enough.

comment:8 by techtonik <techtonik@…>, 17 years ago

It doesn't matter how many bytes each symbol takes, because internally it is still the same symbol (and string length). Users of Trac don't need to know which encoding Trac uses if it can handle it gracefully. Using UTF-8 guarantees that there won't be any extra weak spots, processing or encoding glitches on the way from Trac to the MySQL database.

comment:9 by anonymous, 17 years ago

A Unicode string has the same length in memory, but its size on disk in bytes differs.
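A small illustration of that difference, assuming Python 3 and its standard big5 and utf-8 codecs. The helper name encoded_sizes and the sample string are chosen here for illustration only.

```python
def encoded_sizes(text, encodings=("big5", "utf-8")):
    """Return the encoded size in bytes of `text` for each encoding."""
    return {encoding: len(text.encode(encoding)) for encoding in encodings}


sample = "中文"  # two Traditional Chinese characters

print(len(sample))            # length in characters
print(encoded_sizes(sample))  # length in bytes per encoding
```

The character count is 2 in both cases, while the encoded size is 4 bytes in big5 and 6 bytes in utf-8.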

One way to check this is through another issue with the Trac MySQL backend: MySQL limits index length to 1000 bytes.

I created an index on a text field using different collations; the maximum index lengths in the test were:

Collation of field   Max index length (chars)   Index length (bytes)
big5_chinese_ci      500                        1000
utf8_unicode_ci      333                        999
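The figures in this table follow from dividing the 1000-byte limit by the maximum bytes per character of each character set. A minimal sketch of that arithmetic, with the constant and function names chosen here for illustration:

```python
MAX_KEY_BYTES = 1000  # index length limit reported in this comment


def max_index_chars(bytes_per_char, limit=MAX_KEY_BYTES):
    """Largest number of characters whose worst-case size fits in the limit."""
    return limit // bytes_per_char


# Maximum bytes per character: 2 for big5, 3 for utf8 in MySQL 5.0.
for collation, width in (("big5_chinese_ci", 2), ("utf8_unicode_ci", 3)):
    chars = max_index_chars(width)
    print(collation, chars, chars * width)
```

This gives 500 characters (1000 bytes) for big5 and 333 characters (999 bytes) for utf8, matching the test result.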

Why is that? See: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

On that page you can see that for MySQL 5.0: "Currently, MySQL support for UTF-8 does not include four-byte sequences.", and that Korean, Chinese, and Japanese ideographs use three-byte sequences in UTF-8. It also mentions that the most recent RFC defines 4-byte UTF-8 sequences, which MySQL does not support for now. (An older RFC even allowed 6-byte UTF-8 sequences.)
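To make the sequence lengths concrete, here is a short Python 3 sketch. The helper name utf8_length and the sample characters are picked for illustration. Code points up to U+FFFF encode to at most three bytes in UTF-8, while code points above U+FFFF, such as U+20000 in CJK Unified Ideographs Extension B, need four.

```python
def utf8_length(char):
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(char.encode("utf-8"))


samples = ["A", "é", "中", "\U00020000"]

for char in samples:
    print("U+%04X" % ord(char), utf8_length(char))
```

The four samples encode to 1, 2, 3 and 4 bytes respectively.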

So in memory it is the same, but on disk it is not. (In fact, Python can be configured to use 2 or 4 bytes in memory to represent a Unicode character. See http://www.python.org/doc/2.2.3/whatsnew/node8.html for reference.)

Well, these are details that might interest you or not.

comment:10 by techtonik <techtonik@…>, 17 years ago

Well, thanks for the explanation, even though I knew these details. =) In any case, the extra 200 bytes will not help to work around the index limitation in MySQL until indexes are built upon hashes.

As for the unsupported 4-byte UTF-8 sequences, you say that no natural language currently uses them, so this is not an issue either.

comment:11 by Christian Boos, 16 years ago

Keywords: consider added
Milestone: 0.10.5 → 0.12

Eventually for 0.12, but ideally should be contributed as a patch.

comment:12 by Christian Boos, 15 years ago

Milestone: 0.13
Resolution: duplicate
Status: new → closed

Superseded by #8089.
