The Bug Genie team blog

What's cooking behind the scenes of The Bug Genie

More database encoding (or: how not to have fun with UTF-8)


zegenie posted a short while ago about encoding trouble with The Bug Genie, which especially affects those of you who use the Greek or Russian alphabets. The good news is we have made some progress on this topic!

The root of the problem lies in how The Bug Genie connects to the database. Each component involved can have a different encoding:

  • Database: whatever you created it as (probably latin1)
  • Tables and columns: UTF-8 (we create them as that)
  • The connection itself: could be anything

The last of these is the crux of the problem. On my system (as I am from Britain), the connection is latin1. This is fine for me, as I don’t need any Unicode characters. However, for those of you who do, this means that Unicode data is inserted into a Unicode table, but is converted as though it were something else (such as latin1) on the way in.

The net result is that the text will look right on the screen, but in the database it is garbage. This is fine up to the point where you start trying to do anything with it: JSON and other Unicode-specific things can run into problems, because what is in the database is not proper Unicode (it is in fact latin1 or whatever the connection used)!
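To make that a bit more concrete, here is a rough Python sketch of what I believe is going on. It is not code from The Bug Genie: the function names are made up for illustration, and Python's "latin-1" codec is only a stand-in for MySQL's latin1 (which is closer to cp1252).

```python
# Hypothetical model of a latin1 connection talking to a UTF-8 column.
# Assumption: Python's "latin-1" codec approximates MySQL's latin1 charset.

def store_via_latin1_connection(client_bytes: bytes) -> bytes:
    """The server assumes the client sent latin1 and converts it to UTF-8 for the column."""
    return client_bytes.decode("latin-1").encode("utf-8")


def read_via_latin1_connection(stored_bytes: bytes) -> bytes:
    """The server converts the UTF-8 column back to latin1 before sending it out."""
    return stored_bytes.decode("utf-8").encode("latin-1")


title = "Привет"
sent = title.encode("utf-8")                # what the application actually sends
stored = store_via_latin1_connection(sent)  # what ends up in the table

# Over the same latin1 connection the bytes come back unchanged,
# which is why the page looks right.
assert read_via_latin1_connection(stored) == sent

# The column itself no longer holds the original text, though:
# every non-ASCII byte has been encoded a second time.
assert stored != sent
print(len(sent), len(stored))  # 12 24
```

The two conversions cancel out as long as every read and write goes through the same latin1 connection, which is why the problem stays hidden until something looks at the stored data directly.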

To see an example of this, copy some Russian text into both the Title and Issue Description fields. It will appear correctly in the Issue Description but not in the Title (which will claim to be empty).

This is easily fixed by setting the connection to use UTF-8. Place this call on line 539 of core/B2DB/classes/B2DB.class.php:

self::getDBLink()->query('SET NAMES UTF8');

Now all Unicode data will be stored correctly in the database. However, this results in another problem!

Any Unicode data that was mangled while the connection was latin1 will now be output exactly as it is stored, because no conversion takes place any more. This means the mangled text in the database will be rendered to the screen. Luckily this is easily resolved by dumping the database over a latin1 connection and then reimporting it over a UTF-8 connection (the latin1 export gives you the nice Unicode output, not the mangled text you would get from a UTF-8 export). The following commands will do this, but don’t run them unless you have made the code change above:

mysqldump -h localhost --user=root -p --default-character-set=latin1 -c --insert-ignore --skip-set-charset -r dump.sql thebuggenie
mysql --user=root -p --execute="DROP DATABASE thebuggenie; CREATE DATABASE thebuggenie CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=root --max_allowed_packet=16M -p --default-character-set=utf8 thebuggenie < dump.sql
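If you are wondering why the latin1 export is the one that comes out clean, the rough Python sketch below models it. Again, this is only an illustration under assumptions: the function names are hypothetical, and Python's "latin-1" codec stands in for MySQL's latin1.

```python
# Hypothetical model of the dump-and-reimport repair.
# Assumption: Python's "latin-1" codec approximates MySQL's latin1 charset.

def dump_via_latin1_connection(stored_bytes: bytes) -> bytes:
    """Exporting over latin1 undoes the extra conversion applied on insert."""
    return stored_bytes.decode("utf-8").encode("latin-1")


def import_via_utf8_connection(dump_bytes: bytes) -> bytes:
    """Over a UTF-8 connection the bytes are stored as they are (after validation)."""
    return dump_bytes.decode("utf-8").encode("utf-8")


original = "Привет"

# How the text ended up in the table while the connection was latin1.
mangled = original.encode("utf-8").decode("latin-1").encode("utf-8")

# Dump over latin1, then reimport over UTF-8.
dump = dump_via_latin1_connection(mangled)
repaired = import_via_utf8_connection(dump)
assert repaired.decode("utf-8") == original

# A dump taken over a UTF-8 connection would carry the mangling along instead.
assert mangled.decode("utf-8") != original
```

In this model the latin1 dump reverses exactly the conversion that was applied when the data went in, so the dump file already contains proper UTF-8 and can be loaded without any further conversion.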

We will be continuing to work on these encoding issues for The Bug Genie 3.2.

Written by lsproc

July 10, 2011 at 13:31


