Mediawiki: mysteries of charsets

In the past whenever I had to install a new wiki I always wondered which charset-option to choose:

I thought it would be best to let the db know about the charset of the stored data so that a charset-conversion is always possible. Hence I chose “MySQL 4.1/5.0 UTF-8” – and for new (empty) wikis there was never a problem.

The problems arose when I had to import db-dumps from older wikis. Mainly the pagetitles where often erroneous when they had “specialchars” like ü,ö,ä,ß in it. At first it seemed simple: I checked the table-defs from the old wiki, which where latin1 (or ISO-8859-1 to be ISO-conform). Then exported using “–default-character-set=latin1” and imported with “–default-character-set=utf-8”. But in vain: pagetitles scrambled.
Usually I ex-imported back and forth with various charset-settings until it worked – and usually WHEN it worked I almost never had a clue why.

I connect from a Windows-client through Putty to a MySQL-DB on a linux-server. And every element along the line has it’s own charset-def – mysql even a few (too much):

Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1

And every mysql-table could even have a different charset for every of it’s columns. What a mess! I virtually had no chance to tell the REAL charset by looking in the dump-file – cause every thing between my eyes and the dump could conceal it by it’s own settings.

Accidentally I discovered, that the linux-command “less” shows any special-, non-ASCII-chars in a dumpfile as hexcodes. This was the point when the fog cleared. Now I was able to tell in which charset the data was encoded by looking up the hexcodes of known characters in code-tables for latin1 and utf-8. And guess what! It turned out that the supposed latin1-tables actually contained utf-8-encoded data!

There are two ways to get this data uncorrupted into the new db:

1.) The latin1-way:

Export the data with “–default-character-set=latin1” (utf-8 is default for mysqldump). The db believes the data is already latin1 and won’t change it – actually doing an utf-8-export. To be save it should be best to export with “–default-character-set=binary” as this dumps always with no conversion.

 mysqldump -u USERNAME -pPWD --default-character-set=binary dbname > dbname.dump

Knowing the data is utf-8 we import using “–default-character-set=utf8”:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname.dump

As “latin1” is the given charset in the CREATE-statements of the tables in the dumpfile the tables will be created so. This forces the db to do an automatic conversion from utf-8 to latin1 when doing the import. The result is latin1-data in latin1-tables.

2.) The utf-8-way:

The export (and import as well) is now a two step thing.

First we will export only the metadata (the CREATE-statements) of the db using “–no-data”:

 mysqldump -u USERNAME -pPWD --no-data --skip-set-charset dbname > dbname_meta.dump 

In order to change the charset in each of the db’s CREATE-statements we replace every occurrence of “latin1” in the dumpfile with “utf8”:

 sed -i 's/latin1/utf8/g' dbname_meta.dump 

But beware that this is “a blind shot”: The sed-command replaces EVERY occurrence – everywhere in the dump. We do this under the assumption, that it will hit the right places. And as we have no real data in the dump, which would be modified as well, it should be alright.

Second we export just the data – using “–default-character-set=binary” to prevent a “double-utf-8-conversion” and “–no-create-info” to just export the row-data:

 mysqldump -u USERNAME -pPWD --default-character-set=binary --no-create-info --skip-set-charset dbname > dbname_data.dump 

We now have a metadata-file with utf8-enabled table-CREATEs and a datadump in utf-8-encoding. So the import should be obvious:

We create the tables by importing the metadump in an empty db:

 mysql -u USERNAME -pPWD dbname < dbname_meta.dump 

…and then import the data:

 mysql -u USERNAME -pPWD --default-character-set=utf8 dbname < dbname_data.dump 

And finally we have utf8-tables with utf-8-data in it.


MySQL: how to export query-results to flatfile

To save the result of a query in mysql to a file you can do this:

  1. save the query-string in an inputfile (e.g.: “SELECT * FROM t1;“)
  2. execute the query via mysql-console:
    mysql -u username -pPWD dbname < inputfilename > outputfilename

You now have the output of the query tab-delimited with columnnames in the given outputfile.

You could also pass the query as a parameter to mysql:

mysql -u username -pPWD dbname -B -e “SELECT * FROM t1;” > outputfilename

The -B tells mysql to execute the query in batchmode and terminate afterwards. As obviouse the -e stands for “execute the following query”.

This way is easier than the first one but now you have to take care of masking the query-string if needed.

Alternatively you can use the “SELECT … INTO OUTFILE …”-directive in mysql. This also gives you the possibility to define another field- or line-terminator as shown in this example copied from the mysql-manual:

SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt'
  FROM test_table;

To use this your mysql-user-account needs the FILE-privilege. Also the file can only be created on the mysql-server-host and there only in a directory where your mysql-process has write-access on.

To get the same result without the given restrictions you can do some sed-operations on the output as shown here.

AJAX-SearchSuggest in Mediawiki

In a v1.15.1-mediawiki I had the following problem:

The searchsuggest-extension didn’t function properly. The suggestions only showed up for pagetitles with only capital letters at the beginning of the title. In another wiki whose MW-installation was an exact copy of this wiki the extension worked like a charme. quite curious.

To narrow the problem we made the wiki to show us the exact sql-query it uses to get the searchsuggestions. To do so we edited the file /mediawiki/includes/db/Database.php. After line 1025 in the function “select” we temporarily inserted an “echo $sql“:

function select( $table, $vars, $conds='', $fname = 'Database::select', $options = array(), $join_conds = array() )
		$sql = $this->selectSQLText( $table, $vars, $conds, $fname, $options, $join_conds );
		echo $sql;
		return $this->query( $sql, $fname );

After this the query showed up in the suggestbox and we got this:

SELECT  page_title  FROM `page`
WHERE page_namespace = '0' AND (UPPER(page_title) LIKE '%CPU%')

After some playing around with this query directly on the database it turned out, that the “UPPER(page_title)” didn’t work. The query only gave results when the pagetitle was already in capital letters – at least the part which matched the searchstring.

Comparing this db to the db of the properly working wiki, we found that the working db had been set up with “DEFAULT CHARSET=utf8” while the malfunctioning one had “DEFAULT CHARSET=binary“. So the pagetitle in the latter one was stored just as plain binary-code with no information about the encoding charset. Without this info the db is not able to find a matching uppercase letter to a given lower one.

Knowing this I wondered why Mediawiki uses the “binary”-setting as the default since some time now – because this makes all string-functions in the db useless. In earlier versions MW didn’t seem to make a fuss about charset-encodings and just used the mysql-default which was, and still is, “latin1”. So on older versions of MW, and newer ones with an explicitly set charset, the “searchsuggest” should work without any problems.

But what about the ones with “binary-encoding” enabled?

On the extension’s webpage you read this:

This extension is obsolete!
It has been replaced by core functionality in the MediaWiki software
(which was added in version 1.13.0).
See Manual:$wgEnableMWSuggest for the new core feature.

So you don’t need the extension anymore. Just put the named parameter “$wgEnableMWSuggest = TRUE;” in your LocalSettings.php and off you go.

I think this new suggestbox is even more ergonomical to use than the older one and also doesn’t blow up the layout like old SearchSuggest. But it has one discomfort: The search is casesensitive. To get around this you have to install the TitleKey-extension, which adds another table to your db. This additional table holds all the page_titles in uppercase letters to circumvent the problem with non-working UPPER/LOWER-db-functions when using “binary”-charset. TitleKey turns the searchstring into uppercase via PHP and compares it against this table – and this way makes the search case-insensitive.

Doing a “contains”-search

The normal behaviour of the searchsuggest is to match the searchstring against the beginning of the pagetitles – getting all pages that start with “searchstring”.
To get all pages that contain the searchstring somewhere in the title one could easily modify the TitleKey-Extension. To do that open the file “/extensions/TitleKey/TitleKey_body.php” and in the “static function prefixSearch” (line 188 or so) change the line ~202 like this:

old: 'tk_key LIKE \'' . $dbr->escapeLike( $key ) . '%\'',

new: 'tk_key LIKE \'%' . $dbr->escapeLike( $key ) . '%\'',

Oracle: Create PFILE from SPFILE

I just ran into an error while trying to bounce a long running Oracle-DB:

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORA-16032: parameter LOG_ARCHIVE_DEST destination string cannot be translated
ORA-07286: sksagdi: cannot obtain device information.
SVR4 Error: 2: No such file or directory

So there must be something wrong with that LOG_ARCHIVE_DEST. But where the heck is it pointing to?!

To get a parameterfile (PFILE), which is human readable as it’s in plain ASCII, one can use “CREATE PFILE FROM SPFILE” in sqlplus. That command will result in a pfile in $ORACLE_HOME/dbs where also the spfile resides. Until now I thought that I had to bring the instance into NOMOUNT-state with “STARTUP NOMOUNT” for beeing able to execute the given command, because the SPFILE is read just then. But…

SQL> startup nomount
ORA-16032: parameter LOG_ARCHIVE_DEST destination string cannot be translated
ORA-07286: sksagdi: cannot obtain device information.
SVR4 Error: 2: No such file or directory

So, no way to get the value of the incriminated parameter?

When the environmentvariable  “ORACLE_SID=” is set to the desired DB the instance seems to read the spfile from the default-location ($ORACLE_HOME/dbs) in the default-filenameformat (spfileSID.ora) – no matter if the DB is up or not. So one can connect to an idle instance and create a PFILE:

sqlplus '/ as sysdba';
SQL*Plus: Release - Production on Mo Apr 12 17:05:39 2010
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
Connected to an idle instance.
SQL> create pfile from spfile;
File created.

As the process of creating a PFILE is just a translation of SPFILE to PFILE one even can connect to no instance at all (using only SQL-Plus) and create a PFILE from any given SPFILE while stating the absolute paths of each file:

sqlplus /nolog
SQL*Plus: Release - Production on Mo Apr 12 17:41:29 2010
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
SQL> create pfile='/usr/people/oracle/initORGDB.ora' from spfile='/opt/oracle/orahome1/dbs/spfileORGDB.ora';
File created.

After getting the PFILE one can examine it, fix the error and startup the instance with the corrected PFILE – or create a SPFILE from that corrected PFILE right away and startup with the SPFILE.

Edit 2014-08-04:
At least with version it is no longer possible to generate a pfile from spfile with “sqlplus /nolog”. To get the pfile we must get on the instance.

[oracle@testsrv ~]$ sqlplus /nolog

SQL*Plus: Release Production on Mo Aug 4 14:12:42 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

SQL> create pfile='/app/oracle/temp/inittestdb.ora' from spfile='/app/oracle/product/11.2.0/dbhome_1/dbs/spfiletestdb.ora';
SP2-0640: Not connected
SQL> exit

[oracle@testsrv ~]$ export ORACLE_SID=testdb
[oracle@testsrv ~]$ sqlplus / as sysdba

SQL*Plus: Release Production on Mon Aug 4 14:22:48 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> create pfile='/app/oracle/temp/testdb.ora' from spfile;
File created.

Another way of getting the parameters is using the “strings”-command at the shell on the spfile (credits go to Frits Hoogland):

[oracle@testsrv ~]$ strings /app/oracle/product/11.2.0/dbhome_1/dbs/spfiletestdb.ora
testdb.__oracle_base='/app/oracle'#ORACLE_BASE set from environment
[oracle@testsrv ~]$

This command extracts all strings from the SPFILE. The result is almost what would be in a PFILE. We just have to care for occasional newlines that would break certain parameters (here: control_files, undo_tablespace) in two lines. When we fix this line-breaks we would end up with the exact PFILE-Content.

Installing MathSupport in Mediawiki on RHEL5

Yesterday I wanted to get that Math-Thing to work in a Wiki to give users the possibility to create nice looking formulas.

Red Hat Enterprise Linux 5 – 64bit
Mediawiki 1.15.1

In my default MW-install there was already a MATH subfolder in the wiki-root – so no need to install anything here. I just had to compile the stuff in the Math-folder to get the texvc-executable.
The Math-README states the need to have OCaml installed before compiling. OCaml is no default on RHEL5. So I grabbed the next best RHEL5-compatible RPM (ocaml-3.10.1-1.el5.i386.rpm) from here.
The “make” in the Math-Folder faild with something like this:

ocamlopt -c
/tmp/camlasm747801.s: Assembler messages:
/tmp/camlasm747801.s:62: Error: suffix or operands invalid for `call'
/tmp/camlasm747801.s:80: Error: suffix or operands invalid for `call'
/tmp/camlasm747801.s:128: Error: suffix or operands invalid for `call'
/tmp/camlasm747801.s:164: Error: suffix or operands invalid for `push'
/tmp/camlasm747801.s:181: Error: suffix or operands invalid for `pop'
/tmp/camlasm747801.s:199: Error: suffix or operands invalid for `push'
/tmp/camlasm747801.s:206: Error: suffix or operands invalid for `push'
/tmp/camlasm747801.s:212: Error: suffix or operands invalid for `push'
/tmp/camlasm747801.s:219: Error: suffix or operands invalid for `push'
Assembler error, input left in file /tmp/camlasm747801.s
make: *** [util.cmx] Error 2

After some unsuccessfull googling it came to my mind, that I had a similar problem when unintentionally trying to install a 32bit-version of Oracle-Database on a 64bit-RHEL.
The OCaml I installed was 32bit. So I replaced it with this 64bit-version and lo and behold the Math-“make” succeeded. I set the $wgUseTeX in the LocalSettings to “true”, created a page with some formulas and all was well.

In another migrated wiki which came from another server I did the same but just got  red errormessages instead of formulas when I opend a preexisting page with formulas. The errors said something like “no read/write access to math-directory”. In the LocalSettings I found this:

$wgMathPath         =  “{$wgUploadPath}/math”
$wgMathDirectory    =  “{$wgUploadDirectory}/math”
$wgTmpDirectory     =  “{$wgUploadDirectory}/tmp”

These variables are identical to the ones in the DefaultSettings.php. So there’s no need to use them here, even if I don’t see any problem if so. I deleted them but still got “red alerts”. I checked the directory-permissions at the filesystem but all was okay. Finally I opened the wiki-page with the errors in edit-mode and saved it unchanged to the wiki to get a suspected old page-version out of the wiki-cache. Success!