Mediawiki: How to export a subset of pages including images

Assume that we want to export a subset of pages (e.g. a whole category) from a remote source-wiki into our target-wiki, without shell-access to the source’s images-directory.

The wiki-articles consist mostly of three things: the article-text itself, some templates and some images used in the article. We will have to make sure that all of this gets imported into our target-wiki.

To export the articles and templates we go to the special page “Special:Export”. There we can enter all the page-titles or just the name of the category we want to export, check the checkbox “Include templates” and hit the “Export”-button. The result should be an XML-file that we store locally on our machine.
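
For a single page the export can also be fetched non-interactively; a minimal sketch (server and page-title are placeholders, and whether the page history is included depends on the wiki’s export settings). For whole categories plus templates the web form above remains the easier way:

# export one page as XML via Special:Export (page-title is hypothetical)
curl -s "http://sourceserver/sourcewiki/index.php?title=Special:Export/Some_Article" > Some_Article.xml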

To import the XML-file, we go to the page “Special:Import” in our target-wiki, select the XML-file and import it. Easy.

Now comes the tricky part: The images.

When we go to the page “Special:WantedFiles”, we’ll see a list of all files that are referenced somewhere in the articles but are not yet uploaded to the wiki. These are the files we want to get from our source-wiki.

Uploaded files in a Mediawiki are usually stored in the directory “./wikiname/images/”. The files in there are spread over a subdirectory-structure which is derived from the MD5-hash of the filename. When a file is uploaded, the wiki calculates the md5sum of the filename, takes the first character of that md5sum as the first-level-directory and the first two characters as the second-level-directory, and stores the file there.
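
As a small sketch (the filename is made up), the storage path of a file can be derived like this:

# derive the storage path for a (hypothetical) file "Beispiel_Bild.png"
NAME="Beispiel_Bild.png"           # name as stored in the wiki: underscores instead of spaces, UTF-8
MD5=$(echo -n "$NAME" | md5sum)    # -n: hash the name only, without a trailing LF
echo "images/${MD5:0:1}/${MD5:0:2}/$NAME"
# -> images/<first md5-char>/<first two md5-chars>/Beispiel_Bild.png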

So we have to get a list of our missing files and calculate the corresponding links to be able to download them. Here we have to take care of the charset in use. Mediawiki usually uses UTF-8, so we have to make sure that our file-list is UTF-8 too; otherwise the md5sums, and hence the file-links, would differ. Also we must not include the LF (line feed) at the end of each line of our file-list in the md5-calculation, as that of course would alter the md5sum too.

To get our missing-files-list we send a query to our target-database after importing the XML-file and store the result in a file.

SELECT distinct il_to FROM imagelinks;

The table “imagelinks” holds all references to images/files that are used in articles. In our case I assume a new, empty wiki that just contains our imported XML-dump, so the SQL above is sufficient.

In a case where we already have uploaded files in our wiki, we should check the imagelinks-table against the image-table, which holds all the uploads. So we have to get a list of all image-links that have no corresponding upload in the image-table:

SELECT distinct il_to FROM imagelinks where not exists (select 1 from image where img_name=il_to);
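
This query can be piped into mysql just like the simpler one in the script below; a sketch using the same (hypothetical) variables:

# list only those image-links that have no corresponding upload yet
echo "SET NAMES utf8;
SELECT distinct il_to FROM imagelinks
WHERE NOT EXISTS (SELECT 1 FROM image WHERE img_name = il_to);" \
	| mysql -u root -p"$DBROOTPWD" "$TARGETDBNAME" > ./imagelist.txt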

Next we’ll calculate the links using the Linux-tool “md5sum” and download the files using wget.

All put together looks like this:

#!/bin/bash
# Mediawiki-image-exporter
# 2012-02-17 Marc Tempel (https://logbuffer.wordpress.com/)

if [ $# -ne 2 ]
then
	echo "usage: $0 <TARGETDBNAME> <DBROOTPWD>"
	exit 1
fi

FILE=imagelist.txt
BASEURL="http://sourceserver/sourcewiki/images"
TARGETDBNAME=$1
DBROOTPWD=$2

echo "SET NAMES utf8; SELECT distinct il_to FROM imagelinks;" | mysql -u root -p"$DBROOTPWD" "$TARGETDBNAME" > ./"$FILE"

sed -i '/^il_to$/d' ./"$FILE"

for a in `cat ./"$FILE"`
do
	MD5=`echo -n $a | md5sum`
	FIRST=`echo ${MD5:0:1}`
	SECOND=`echo ${MD5:0:2}`
	TARGET="$BASEURL"/"$FIRST"/"$SECOND"/"$a"
	echo $TARGET
	wget -a ./wget.log --restrict-file-names=nocontrol -P ./images/ $TARGET
#	read -p "hit ENTER to continue..."
done
#EOF

The “BASEURL” is the link to the source-images-directory.

As DB-user I use “root” here (“-u root “), but you could use any other suitable user that has access to the wiki-db. The SQL to get the filenames is piped into the mysql-executable and the result is stored in the file “./imagelist.txt”. The “SET NAMES utf8;” is necessary to get the query-result UTF-8-encoded.

The parameter “-n” at the “echo”-command omits the LF to calculate a correct md5sum.

The “--restrict-file-names=nocontrol” option keeps wget from escaping “control characters” in UTF-8-filenames. Without it, wget would possibly change the filenames in the links and so make it impossible to get those files. For more info on that topic see the wget documentation on --restrict-file-names.

The images are stored in the directory “./images” (parameter “-P” at wget) and a wget-logfile is stored as “./wget.log”.

If the script has successfully downloaded all the files (check the wget.log for errors) we can import all files in a batch using the appropriate maintenance-script:

php /path/to/target_wiki_folder/maintenance/importImages.php ./images/

The command inserts all files in “./images” into the target-wiki. To do a dry-run first we could use the parameter “--dry” on the upload-script.

If we now check “Special:WantedFiles” again, we should see an empty list if all went fine. If we still see missing files listed, we should check the LocalSettings.php for the parameter “$wgEnableUploads = TRUE” and especially “$wgFileExtensions”. We have to make sure that all intended upload-file-extensions are listed in “$wgFileExtensions”, otherwise they are blocked for upload. If we then want to upload some previously not allowed extensions, we can use the parameter “--extensions=” on importImages.php (see “importImages.php --help” for more).
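
For example (the exact option syntax may differ between MediaWiki versions, so check importImages.php --help first):

# dry-run: show what would be imported without writing anything
php /path/to/target_wiki_folder/maintenance/importImages.php --dry ./images/

# allow additional extensions for this import run (list syntax per --help)
php /path/to/target_wiki_folder/maintenance/importImages.php --extensions=pdf,svg ./images/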

When downloading from Wikipedia it could be that the images are hosted on different servers. In that case we must check the servers on the images’ description pages in Wikipedia (look at the direct file-link there) and run our script against each server.
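
One way to find out where a file is really hosted is the imageinfo query of the MediaWiki web API; a small sketch against Wikipedia (the file-title is a placeholder):

# ask the API for the real URL of a file; the "url" field shows the hosting server
curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=File:Example.jpg&prop=imageinfo&iiprop=url&format=json"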

If the articles in the target-wiki still look messy, there are probably some extensions or CSS-settings missing which exist on the source-wiki.

Get files from SVN via wget

If one has to manually download everything from a given SVN-branch, it can be a very tedious task to right-click every single file and save it to local disk. Luckily this can be automated via wget, which comes with most Linux distributions (there are Windows-ports out there too).

If for example I want to download all the stuff from http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/SemanticForms/ I “wget” it this way:

wget -e robots=off --wait 1 -r -I /svnroot/mediawiki/trunk/extensions/SemanticForms/ http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/SemanticForms/

“-e robots=off” ignores the robots.txt
“--wait 1” waits 1 second between the downloads
“-r” scans the server recursively
“-I /svnroot/mediawiki/trunk/extensions/SemanticForms/” includes the given path and below (excluding all the rest)

The latter one (“-I …”) is important! If omitted, wget would scan the whole server up and down – not limited to the “SemanticForms”-subdir.

Edit 2012-05-22:
There is an easier way of limiting the wget-operation to the given branch: the “-np” option. This switch keeps wget from ascending to the parent directory. So you could omit the -I mentioned above and use -np instead to get the same result with less typing. (Thank you Joe for the advice!)
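
With -np the example from above would then look like this:

wget -e robots=off --wait 1 -r -np http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/SemanticForms/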

Mediawiki: Problem with Extension MultiUpload / TagCloud

We had an error while using the MultiUpload-Extension along with WikiCategoryTagCloud. After every successful upload we got this for every uploaded file:

Warning: Illegal offset type in isset or empty in (...)/includes/Title.php on line 117
Warning: trim() expects parameter 1 to be string, array given in (...)/includes/Title.php on line 2286

We found out that there is a minor bug in the code of the function “invalidateCache” of WikiCategoryTagCloud. In the line

$titles[0] = explode( "\n", wfMsg( 'tagcloudpages' ) );

the whole “exploded” list of tagcloudpages ends up as a nested array in the first slot $titles[0]. So later on in the code there is no single string to be handled, but a whole array (“expects parameter 1 to be string, array given”). To correct the error we changed the above line to this:

$titles = explode( "\n", wfMsg( 'tagcloudpages' ) );

This way all lines (page-titles) from the article MediaWiki:Tagcloudpages directly build up the array $titles and can be referenced correctly later on in the function’s code.

System: MediaWiki 1.16.5, PHP 5.3.3, MySQL 5.0.77

This post is a copy of my two cents I put in at mediawiki.org.

Bad filenames after “ZIPing” files from Linux to Windows

I need to copy the whole images-directory of our Mediawiki from a Linux-box to a Windows-machine. But no matter whether I tar or zip the files, the filenames with non-ASCII-characters (ü, ö, ä, …) are messed up after being unzipped on Windows. All those non-ASCII-chars are shown as squares or other obscure characters under Windows. Not only do they look ugly this way, the Mediawiki also won’t find these files anymore, as their names no longer match the entries in the wiki’s database.

Again it smells like an encoding-problem. How nice it would be if the whole IT-world would just use Unicode.

Our Linux is a RedHat 5 where the command “locale” shows this:

LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

I don’t know what charset this “C” is meant to be. Edit 2012-10-18: “C” turns out to be the ANSI-C/POSIX default locale; “locale charmap” reports “ANSI_X3.4-1968” for it, i.e. plain ASCII. I thought all Linuxes use UTF-8 per default, but at least for ours this is not true. We found two solutions:

1.) Use WinSCP to copy all files over from Linux to Windows. This way the filenames get converted to Windows’ own charset (Windows-1252).

2.) Change the charset of the Linux-console explicitly to UTF-8 prior to zipping the files. Under Bash I do this:

export LANG=de_DE.UTF-8

Afterwards I zip the files using 7-Zip (in 7z format!). When I unzip them under Windows, also with 7-Zip, all is fine. Using the normal zip-command to compress under Linux still messes up my filenames.
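
Put together, the second solution looks roughly like this (assuming the p7zip package; depending on the distribution the binary is called 7z or 7za):

# make the shell use UTF-8 so 7-Zip stores UTF-8 filenames
export LANG=de_DE.UTF-8

# pack the images-directory in 7z format (not plain ZIP)
7za a -t7z images.7z ./images/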

Mediawiki 1.16: Bad performance with IE 6.0

Checking out the latest Mediawiki version 1.16, we encountered really bad performance while accessing the wiki with “Internet Explorer 6.0”. We had an average response-time of 3 seconds, on some older machines even more than 20 seconds. With Firefox 3.6 all was fine: instant response.

It turned out that the cause of this performance-problem lies in the new default-skin “Vector”. In the file “skins/Vector.php” there is a “public function initPage” with this snippet:

$out->addScript(
	'<!--[if lt IE 7]><style type="text/css">body{behavior:url("' .
		$wgStylePath .
		'/vector/csshover.htc")}</style><![endif]-->'
);

This adds a version-check: “[if lt IE 7]”. So for clients with an IE lower than version 7 this loads the file “/vector/csshover.htc”, which supplies some functionality that IE 6.0 and lower lack. After we commented out this part of the code, the performance was fine in IE 6.0 too. So far we saw no drawbacks from this, so we left it disabled.

Edit 2011-07-06:
Ok, it should have been obvious (“csshover.htc”): There is a lack of functionality after disabling it. In IE6 the hover-effect, e.g. on the arrow-tab right of the “version-history”-tab, doesn’t work. So the corresponding menu with “delete”, “move”, etc. will not show up.

Here is a hack to resolve the problem:

In file skins/Vector.php (~line 39) replace this

$out->addScript(
	'<!--[if lt IE 7]><style type="text/css">body{behavior:url("' .
		$wgStylePath .
		'/vector/csshover.htc")}</style><![endif]-->'
);

with this

$out->addScript(
	'<!--[if lt IE 7]><script type="text/javascript">
		jQuery(function(){
			jQuery("#p-cactions").mouseenter(function(){
				jQuery("#p-cactions div.menu").css("display","block");
			});
			jQuery("#p-cactions").mouseleave(function(){
				jQuery("#p-cactions div.menu").css("display","none");
			});
		});
	</script><![endif]-->'
);

Template-hassle with MediaWiki

One of our departments makes extensive use of including templates in wiki-articles. They have one article, 56 KB in size, that hosts some larger tables, and every row of those tables includes at least one template. A total of 24 different templates are in use in the article.

At the top of that page, which no longer loads flawlessly, we now see this:

"Kategorie:Seiten, in denen die maximale Größe eingebundener Vorlagen überschritten ist"
 
or in English: "Category:Pages where template include size is exceeded"

Examining the HTML-source of the page shows this warning here and there:

<!-- WARNING: template omitted, post-expand include size too large -->

Near the end of the code we find this block of information:

<!-- 
NewPP limit report
Preprocessor node count: 25885/1000000
Post-expand include size: 2097152/2097152 bytes
Template argument size: 263610/2097152 bytes
Expensive parser function count: 3/500
ExtLoops count: 148/200
-->

As you can see, the “post-expand include size” hit its upper limit of 2048 KB. Remember: the core-text of the article was only 56 KB in size. But as the MediaWiki documentation on template limits explains, every expansion of (almost) every template adds the size of that template to the total size of the page while the page is being preprocessed.

A proper way would be to split the article into smaller subpages and link them together on a main page. But the department was under time pressure and urged me to raise the limit.

Unfortunately I found no parameter to set in the LocalSettings.php that increases the “post-expand include size” directly. But running through the MediaWiki configuration-settings page and scanning every parameter with a “Max” in its name, I came across “$wgMaxArticleSize”, whose default value coincidentally equaled the 2048 KB of the “Post-expand include size” limit. And since the sum of all template- and article-code is what actually makes up the article size, I gave it a try and set “$wgMaxArticleSize = 4096” in the LocalSettings.
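
Whether the raised limit is actually picked up can be seen in the NewPP limit report, which can be pulled straight out of the delivered HTML (the URL is a placeholder):

# fetch the article and show the parser's limit report comment
curl -s "http://ourserver/wiki/index.php/Some_Article" | grep -A 6 "NewPP limit report"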

The page now loads without any problem and no limit is hit:

<!-- 
NewPP limit report
Preprocessor node count: 25938/1000000
Post-expand include size: 2672362/4194304 bytes
Template argument size: 277317/4194304 bytes
Expensive parser function count: 3/500
ExtLoops count: 152/200
-->

According to “Google” there really doesn’t seem to be a dedicated parameter for the “Post-expand include size”, but as my new limits now again equal the $wgMaxArticleSize, this seems to be the way to do it. At least it works…

Locked rows in mediawiki-db

Recently I had the problem a few times that I got a timeout from the wiki-db when requesting a certain page:
 
“Lock wait timeout exceeded; try restarting the transaction”

Querying the mysql-db for running processes with “SHOW PROCESSLIST” showed one or more sessions waiting for an exclusive lock on a data-row while trying to run an “UPDATE” or “SELECT…FOR UPDATE” on the “page”-table.
The lock is required because the page_counter-column is updated every time a page is requested.
Normally this update should be very quick, but here it seemed as if another session still held a lock on the data-row of my requested page and wouldn’t release it.

Blogger Venu Anuganti wrote about an odd locking-problem with MySQL’s InnoDB-Engine which I verified on a test-db:

When a transaction which holds a lock is stopped by an error (e.g. “ERROR 1062 (23000): Duplicate entry ’10’ for key ‘PRIMARY’”), the lock is not released instantly but held indefinitely until a commit or rollback is issued. This causes other sessions that request the same lock to wait until the lock is released or a timeout is hit.

According to the docs this is intended behavior:

“A duplicate-key error rolls back the SQL statement, if you have not specified the IGNORE option in your statement.”

Note: This states that not the entire transaction is rolled back, but just the failed statement. The handling of such an error is left to the application.

Also interesting in this context (same link as quote above):

“A lock wait timeout causes InnoDB to roll back only the single statement that was waiting for the lock and encountered the timeout. (Until MySQL 5.0.13 InnoDB rolled back the entire transaction if a lock wait timeout happened. You can restore this behavior by starting the server with the --innodb_rollback_on_timeout option, available as of MySQL 5.0.32.)”

As I’m using InnoDB for the wiki-tables I think it’s likely that this is the cause of my errors.

Until now I don’t know which piece of code is responsible for the error, nor how to automatically detect and fix such errors. All I could do to fix it was to gradually kill the oldest inactive db-sessions (“SHOW PROCESSLIST”, “KILL session_id”) of the wiki-db-user until my test-statement (“SELECT * FROM page WHERE page_id=108 FOR UPDATE”) ran through.
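
The manual cleanup can at least be scripted a bit; a rough sketch (credentials, db-name and session-id are placeholders):

# list all current db-sessions and look for old, sleeping ones of the wiki-db-user
mysql -u root -p"$DBROOTPWD" -e "SHOW PROCESSLIST;"

# kill a suspicious session by the Id shown in the processlist
mysql -u root -p"$DBROOTPWD" -e "KILL 12345;"

# re-run the test-statement; with autocommit the lock is released again right away
mysql -u root -p"$DBROOTPWD" "$WIKIDB" -e "SELECT * FROM page WHERE page_id=108 FOR UPDATE;"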