Bad filenames after “ZIPing” files from Linux to Windows

I need to copy the whole images directory of our Mediawiki from a Linux box to a Windows machine. But no matter whether I tar or zip the files, filenames with non-ASCII characters (ü, ö, ä, …) are messed up after unzipping on Windows. All those non-ASCII chars are shown as squares or other obscure characters under Windows. Not only do they look ugly this way, the Mediawiki also won’t find these files anymore, as their names no longer match the entries in the wiki’s database.

Again it smells like an encoding problem. How nice it would be if the whole IT world just used Unicode.

Our Linux is a RedHat 5 where the command “locale” shows this:

LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

I don’t know what charset this “C” is meant to be. Edit 2012-10-18: “C” is the plain POSIX/ANSI-C locale, which effectively means US-ASCII (locale charmap reports ANSI_X3.4-1968). I thought all Linuxes use UTF-8 by default, but at least for ours this is not true. We found two solutions:

1.) Use WinSCP to copy all files over from Linux to Windows. This way the filenames get converted to Windows’ own charset, Windows-1252.

2.) Change the charset of the Linux console explicitly to UTF-8 before zipping the files. Under Bash I do this:

export LANG=de_DE.UTF-8

Afterwards I zip the files using 7-Zip (in 7z format!). When I unzip them under Windows, also with 7-Zip, everything is fine. Using the normal zip command to compress under Linux still messed up my filenames.
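
For reference, the whole sequence looks roughly like this (a sketch; the 7za binary from the p7zip package, the archive name and the path to the images directory are assumptions, not taken from our actual setup):

 # switch the shell to a UTF-8 locale, then pack with 7-Zip in 7z format
 export LANG=de_DE.UTF-8
 7za a -t7z images.7z /var/www/mediawiki/images/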

Mediawiki: mysteries of charsets

In the past, whenever I had to install a new wiki, I always wondered which charset option to choose.

I thought it would be best to let the db know about the charset of the stored data so that a charset-conversion is always possible. Hence I chose “MySQL 4.1/5.0 UTF-8” – and for new (empty) wikis there was never a problem.

The problems arose when I had to import db dumps from older wikis. Mainly the pagetitles were often erroneous when they contained “special chars” like ü, ö, ä, ß. At first it seemed simple: I checked the table defs of the old wiki, which were latin1 (or ISO-8859-1 to be ISO-conform). Then I exported using “--default-character-set=latin1” and imported with “--default-character-set=utf8”. But in vain: pagetitles scrambled.
Usually I exported and imported back and forth with various charset settings until it worked – and usually WHEN it worked I almost never had a clue why.

I connect from a Windows client through PuTTY to a MySQL db on a Linux server. And every element along the line has its own charset setting – MySQL even has several (too many):

Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1
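
This listing is what the mysql client’s “status” (or “\s”) command prints. The underlying character_set_* server variables can also be queried directly (a sketch, using the same USERNAME/PWD placeholders as in the commands further below):

 mysql -u USERNAME -pPWD -e "SHOW VARIABLES LIKE 'character_set_%';"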

And every MySQL table could even have a different charset for each of its columns. What a mess! I virtually had no chance to tell the REAL charset by looking into the dump file – because everything between my eyes and the dump could conceal it with its own settings.

By accident I discovered that the Linux command “less” shows any special, non-ASCII chars in a dump file as hex codes. This was the point when the fog cleared. Now I was able to tell in which charset the data was encoded by looking up the hex codes of known characters in the code tables for latin1 and utf-8. And guess what! It turned out that the supposed latin1 tables actually contained utf-8-encoded data!
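
A concrete example: the letter “ü” is the single byte FC in latin1, but the two bytes C3 BC in UTF-8. With the C locale from above, less renders such bytes as <FC> or <C3><BC>, so the encoding is obvious at a glance (the dump filename is the one used in the commands below; hexdump works just as well):

 less dbname.dump                  # "ü" shows up as <FC> (latin1) or <C3><BC> (utf-8)
 hexdump -C dbname.dump | less     # alternatively, inspect the raw bytes directly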

There are two ways to get this data uncorrupted into the new db:

1.) The latin1-way:

Export the data with “--default-character-set=latin1” (utf8 is the default for mysqldump). The db believes the data already is latin1 and won’t change it – actually doing a utf-8 export. To be safe it is best to export with “--default-character-set=binary”, as this always dumps without any conversion.

 mysqldump -u USERNAME -pPWD --default-character-set=binary dbname > dbname.dump

Knowing the data is utf-8 we import using “--default-character-set=utf8”:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname.dump

As “latin1” is the charset given in the CREATE statements of the tables in the dump file, the tables will be created as latin1. This forces the db to do an automatic conversion from utf-8 to latin1 during the import. The result is latin1 data in latin1 tables.

2.) The utf-8-way:

The export (and the import as well) is now a two-step affair.

First we will export only the metadata (the CREATE-statements) of the db using “--no-data”:

 mysqldump -u USERNAME -pPWD --no-data --skip-set-charset dbname > dbname_meta.dump 

In order to change the charset in each of the db’s CREATE-statements we replace every occurrence of “latin1” in the dumpfile with “utf8”:

 sed -i 's/latin1/utf8/g' dbname_meta.dump 

But beware that this is a blind shot: the sed command replaces EVERY occurrence – everywhere in the dump. We do this under the assumption that it will hit the right places. And as we have no real data in the dump that could be modified as well, it should be alright.
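
If you want to be a bit more careful, the replacement can be limited to the charset clauses themselves (a sketch; it assumes the dump uses the usual “DEFAULT CHARSET=latin1” and “CHARACTER SET latin1” spellings):

 sed -i -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g' \
        -e 's/CHARACTER SET latin1/CHARACTER SET utf8/g' dbname_meta.dump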

Second we export just the data – using “--default-character-set=binary” to prevent a “double utf-8 conversion” and “--no-create-info” to export only the row data:

 mysqldump -u USERNAME -pPWD --default-character-set=binary --no-create-info --skip-set-charset dbname > dbname_data.dump 

We now have a metadata file with utf8-enabled table CREATEs and a data dump in utf-8 encoding. So the import should be obvious:

We create the tables by importing the metadump in an empty db:

 mysql -u USERNAME -pPWD dbname < dbname_meta.dump 

…and then import the data:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname_data.dump 

And finally we have utf8 tables with utf-8 data in them.
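
To double-check the result, the table definitions and a sample of the data can be inspected directly (a sketch; it assumes the wiki’s standard page table without a table prefix):

 # the CREATE statement should now say DEFAULT CHARSET=utf8
 mysql -u USERNAME -pPWD dbname -e "SHOW CREATE TABLE page\G"
 # and pagetitles with umlauts should come back intact
 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname -e "SELECT page_title FROM page WHERE page_title LIKE '%ü%' LIMIT 5;"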

AJAX-SearchSuggest in Mediawiki

In a v1.15.1 Mediawiki I had the following problem:

The searchsuggest extension didn’t function properly. The suggestions only showed up for pagetitles whose beginning was written entirely in capital letters. In another wiki, whose MW installation was an exact copy of this one, the extension worked like a charm. Quite curious.

To narrow down the problem we made the wiki show us the exact SQL query it uses to get the search suggestions. To do so we edited the file /mediawiki/includes/db/Database.php. After line 1025, in the function “select”, we temporarily inserted an “echo $sql”:

function select( $table, $vars, $conds='', $fname = 'Database::select', $options = array(), $join_conds = array() )
{
	$sql = $this->selectSQLText( $table, $vars, $conds, $fname, $options, $join_conds );
	echo $sql; // temporarily inserted debug output - remove again after testing
	return $this->query( $sql, $fname );
}

After this the query showed up in the suggestbox and we got this:

SELECT  page_title  FROM `page`
WHERE page_namespace = '0' AND (UPPER(page_title) LIKE '%CPU%')
LIMIT 8

After some playing around with this query directly on the database it turned out that the “UPPER(page_title)” didn’t work. The query only returned results when the pagetitle was already in capital letters – at least the part that matched the search string.

Comparing this db to the db of the properly working wiki, we found that the working db had been set up with “DEFAULT CHARSET=utf8” while the malfunctioning one had “DEFAULT CHARSET=binary”. So the pagetitles in the latter were stored as plain binary data with no information about the encoding charset. Without this info the db is not able to map a lowercase letter to its uppercase counterpart.
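
This is easy to reproduce on the command line: MySQL’s UPPER() simply returns binary strings unchanged, because without a character set there is no case mapping (a minimal sketch with a made-up string):

 # UPPER() on a binary string returns it unchanged ...
 mysql -u USERNAME -pPWD -e "SELECT UPPER(CAST('cpu_fan' AS BINARY));"
 # ... but once converted to a real character set, UPPER() works as expected
 mysql -u USERNAME -pPWD -e "SELECT UPPER(CONVERT(CAST('cpu_fan' AS BINARY) USING utf8));"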

Knowing this I wondered why Mediawiki has been using the “binary” setting as its default for some time now – because this makes all case-related string functions in the db useless. Earlier versions of MW didn’t seem to make a fuss about charset encodings and just used the MySQL default, which was, and still is, “latin1”. So on older versions of MW, and on newer ones with an explicitly set charset, the searchsuggest should work without any problems.

But what about the ones with “binary-encoding” enabled?

On the extension’s webpage you read this:

This extension is obsolete!
It has been replaced by core functionality in the MediaWiki software
(which was added in version 1.13.0).
See Manual:$wgEnableMWSuggest for the new core feature.

So you don’t need the extension anymore. Just put the parameter “$wgEnableMWSuggest = TRUE;” into your LocalSettings.php and off you go.

I think this new suggestbox is even more ergonomic to use than the old one and also doesn’t blow up the layout like the old SearchSuggest did. But it has one drawback: the search is case-sensitive. To get around this you have to install the TitleKey extension, which adds another table to your db. This additional table holds all page_titles in uppercase letters to circumvent the problem of the non-working UPPER/LOWER db functions with the “binary” charset. TitleKey turns the search string into uppercase via PHP and compares it against this table – and this way makes the search case-insensitive.

Doing a “contains”-search

The normal behaviour of the searchsuggest is to match the search string against the beginning of the pagetitles – returning all pages that start with the search string.
To get all pages that contain the search string somewhere in the title, one can easily modify the TitleKey extension. To do that, open the file “/extensions/TitleKey/TitleKey_body.php” and in the “static function prefixSearch” (line 188 or so) change the line at ~202 like this:

old: 'tk_key LIKE \'' . $dbr->escapeLike( $key ) . '%\'',

new: 'tk_key LIKE \'%' . $dbr->escapeLike( $key ) . '%\'',