MediaWiki: mysteries of charsets

In the past, whenever I had to install a new wiki, I always wondered which charset option to choose.

I thought it would be best to let the db know about the charset of the stored data, so that a charset conversion is always possible. Hence I chose “MySQL 4.1/5.0 UTF-8” – and for new (empty) wikis there was never a problem.

The problems arose when I had to import db dumps from older wikis. Mainly the page titles were often garbled when they contained “special chars” like ü, ö, ä, ß. At first it seemed simple: I checked the table defs from the old wiki, which were latin1 (or ISO-8859-1 to be ISO-conform). Then I exported using “--default-character-set=latin1” and imported with “--default-character-set=utf8”. But in vain: page titles scrambled.
Usually I exported and re-imported back and forth with various charset settings until it worked – and when it finally worked, I almost never had a clue why.

I connect from a Windows client through PuTTY to a MySQL db on a Linux server. And every element along the line has its own charset definition – MySQL even has several (too many):

Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1
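
This listing, by the way, is part of what the mysql client’s built-in “status” command prints. The same settings can also be queried directly (standard MySQL statements, shown here for reference):

 SHOW VARIABLES LIKE 'character_set%';
 SHOW VARIABLES LIKE 'collation%';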

And every MySQL table can even have a different charset for each of its columns. What a mess! I had virtually no chance of telling the REAL charset by looking into the dump file – because everything between my eyes and the dump could conceal it with its own settings.
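
At least the declared charsets of a table and its columns can be asked from the server directly (a quick sketch – “page” is just an example table name from the MediaWiki schema):

 SHOW CREATE TABLE page;
 SHOW FULL COLUMNS FROM page;

But of course the declared charset says nothing about what is really stored in the columns – which is exactly the problem.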

By accident I discovered that the Linux command “less” shows special, non-ASCII chars in a dump file as hex codes. This was the point when the fog cleared. Now I was able to tell which charset the data was encoded in by looking up the hex codes of known characters in code tables for latin1 and UTF-8. And guess what: it turned out that the supposed latin1 tables actually contained UTF-8-encoded data!
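
Take “ü” as an example: in latin1 it is the single byte FC, in UTF-8 it is the two-byte sequence C3 BC. So in “less” (which marks bytes it cannot display in the current locale like “<FC>” – the exact rendering depends on your locale settings) a hypothetical title “München” shows up in one of two ways:

 M<FC>nchen      (latin1-encoded)
 M<C3><BC>nchen  (utf-8-encoded)

A hex viewer like “xxd” or “hexdump -C” works just as well for this kind of inspection.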

There are two ways to get this data uncorrupted into the new db:

1.) The latin1-way:

Export the data with “--default-character-set=latin1” (utf8 is the default for mysqldump). The db believes the data is already latin1 and won’t convert it – so effectively it performs a UTF-8 export. To be safe it is best to export with “--default-character-set=binary”, as this always dumps without any conversion:

 mysqldump -u USERNAME -pPWD --default-character-set=binary dbname > dbname.dump

Knowing the data is UTF-8, we import using “--default-character-set=utf8”:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname.dump

As “latin1” is the charset given in the CREATE statements of the tables in the dump file, the tables will be created as latin1. This forces the db to do an automatic conversion from UTF-8 to latin1 during the import. The result is latin1 data in latin1 tables. (Keep in mind that any character without a latin1 equivalent cannot survive this conversion.)
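
For illustration: a table definition in the dump carries the charset at the end of its CREATE statement, roughly like this (a simplified sketch – table and column names taken from the MediaWiki schema, engine and options will vary):

 CREATE TABLE `page` (
   `page_title` varchar(255) NOT NULL
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

It is this “DEFAULT CHARSET=latin1” that triggers the conversion from the utf8 connection charset to the latin1 columns.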

2.) The utf-8-way:

The export (and the import as well) is now a two-step affair.

First we export only the metadata (the CREATE statements) of the db using “--no-data”. The additional “--skip-set-charset” keeps mysqldump from writing a “SET NAMES” statement into the dump, which would otherwise override our charset settings at import time:

 mysqldump -u USERNAME -pPWD --no-data --skip-set-charset dbname > dbname_meta.dump 

In order to change the charset in each of the db’s CREATE statements we replace every occurrence of “latin1” in the dump file with “utf8”:

 sed -i 's/latin1/utf8/g' dbname_meta.dump 

But beware that this is a blind shot: the sed command replaces EVERY occurrence, everywhere in the dump. We do this under the assumption that it will hit the right places. And as we have no real row data in the dump that could be modified as well, it should be alright. A more targeted variant is sketched below.
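
If you want to aim a little better, restrict the replacement to the charset clauses mysqldump actually writes – “DEFAULT CHARSET=…” at table level and “CHARACTER SET …” at column level (an untested sketch; check your dump for the exact spelling first):

 sed -i 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g; s/CHARACTER SET latin1/CHARACTER SET utf8/g' dbname_meta.dump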

Second we export just the data – using “--default-character-set=binary” to prevent a “double UTF-8 conversion” and “--no-create-info” to export only the row data:

 mysqldump -u USERNAME -pPWD --default-character-set=binary --no-create-info --skip-set-charset dbname > dbname_data.dump 

We now have a metadata file with utf8-enabled table CREATEs and a data dump in UTF-8 encoding. So the import should be obvious:

We create the tables by importing the meta dump into an empty db:

 mysql -u USERNAME -pPWD dbname < dbname_meta.dump 

…and then import the data:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname_data.dump 

And finally we have utf8 tables with UTF-8 data in them.
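
To double-check the result, look at the new table definitions and spot-check a few titles over a utf8 connection (assuming a MediaWiki “page” table here – very old wikis used “cur”/“cur_title” instead):

 mysql -u USERNAME -pPWD --default-character-set=utf8 -e 'SHOW CREATE TABLE page' dbname
 mysql -u USERNAME -pPWD --default-character-set=utf8 -e 'SELECT page_title FROM page LIMIT 10' dbname

In a UTF-8 terminal the umlauts should now display correctly.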
