MediaWiki: How to export a subset of pages including images

Let's assume that we want to export a subset of pages (e.g. a whole category) from a remote source-wiki into our target-wiki, and that we have no shell-access to the source’s images-directory.

The wiki-articles consist mostly of three things: the article-text itself, some templates and some images used in the article. We have to make sure that all three are imported into our target-wiki.

To export the articles and templates we go to the special-page “Special:Export”. There we can enter all the page titles, or just the name of the category we want to export, check the checkbox “Include templates” and hit the “Export”-button. The result should be an XML-file that we store locally on our machine.

To import the XML-file, we go to the page “Special:Import” in our target-wiki, select the XML-file and import it. Easy.

Now comes the tricky part: The images.

When we go to page “Special:WantedFiles”, we’ll see a list of all files that are referenced somewhere in the articles but are not uploaded yet to the wiki. These are the files we want to get from our source-wiki.

Uploaded files in a MediaWiki are usually stored in the directory “./wikiname/images/”. The files in there are spread over a subdirectory-structure that is derived from the MD5-hash of each file's name (not of its content). When a file is uploaded, the wiki calculates the md5sum of the filename (in its stored form, with underscores instead of spaces), takes the first character of the md5sum as the first-level-directory and the first two characters as the second-level-directory, and stores the file there.
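The directory calculation can be tried out on its own. The following is a minimal sketch, assuming a Linux machine with “md5sum” available; the function name “hash_path” and the test string “abc” are made up for illustration and are not part of MediaWiki.

```shell
# hash_path NAME: print the relative storage path "x/xy/NAME" for a filename.
# NAME is expected in its stored form (UTF-8, underscores instead of spaces).
hash_path() {
    # printf '%s' writes the name without a trailing line feed
    md5=$(printf '%s' "$1" | md5sum | cut -c1-32)
    first=$(printf '%s' "$md5" | cut -c1)      # first-level directory
    second=$(printf '%s' "$md5" | cut -c1-2)   # second-level directory
    printf '%s/%s/%s\n' "$first" "$second" "$1"
}

hash_path "abc"    # prints 9/90/abc (the md5sum of "abc" starts with 90)
```

Appending the printed path to the URL of the source-images-directory gives the download link for that file.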

So we have to get a list of our missing files and calculate the corresponding links to be able to download them. Here we have to take care of the charset in use. MediaWiki usually uses UTF-8, so we have to make sure that our file-list is UTF-8 too. Otherwise the md5sums, and hence the file-links, would differ. We also must not include the LF (line feed) at the end of each line of our file-list in the md5-calculation, as this would alter the md5sum too.
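Both pitfalls are easy to reproduce. The sketch below hashes the same visible name in different byte forms; the helper “md5_hex” and the sample name “ä.png” are assumptions made for this illustration. The bytes are written as octal escapes so the result does not depend on the terminal's own charset.

```shell
# md5_hex: hash standard input and print only the 32 hex digits.
md5_hex() {
    md5sum | cut -c1-32
}

# The name "a", with and without a trailing line feed
with_lf=$(printf 'a\n' | md5_hex)      # 60b725f10c9c85c70d97880dfe8191b3
without_lf=$(printf 'a' | md5_hex)     # 0cc175b9c0f1b6a831c399e269772661

# The name "ä.png", once as UTF-8 bytes (C3 A4) and once as Latin-1 (E4)
utf8=$(printf '\303\244.png' | md5_hex)
latin1=$(printf '\344.png' | md5_hex)

echo "with LF:    $with_lf"
echo "without LF: $without_lf"
echo "UTF-8:      $utf8"
echo "Latin-1:    $latin1"
```

All four values differ, so a wrong charset or a stray LF sends us to the wrong subdirectory.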

To get our missing-files-list we send a query to our target-database after importing the XML-file and store the result in a file.

SELECT DISTINCT il_to FROM imagelinks;

The table “imagelinks” holds all references to images/files that are used in articles. In our case I assume a new, empty wiki that just contains our imported XML-dump, so the SQL above suffices.

In a case where we already have uploaded files in our wiki, we should check the imagelinks-table against the image-table. The image-table holds all the uploads. So we have to get a list of all image-links that have no corresponding upload in the image-table:

SELECT DISTINCT il_to FROM imagelinks WHERE NOT EXISTS (SELECT 1 FROM image WHERE img_name = il_to);
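To switch between the two queries without editing SQL by hand, they can be wrapped in small shell functions. This is only a sketch: the function names and the database name “targetwiki” are made up, and the DB-user “root” is just the one used elsewhere in this post.

```shell
# missing_files_query MODE: print the SQL for the file-list.
#   all      every referenced file (new, empty target-wiki)
#   missing  only referenced files without a row in the image-table
missing_files_query() {
    case "$1" in
        missing)
            echo "SELECT DISTINCT il_to FROM imagelinks WHERE NOT EXISTS (SELECT 1 FROM image WHERE img_name = il_to);" ;;
        *)
            echo "SELECT DISTINCT il_to FROM imagelinks;" ;;
    esac
}

# fetch_file_list DBNAME MODE: run the query, print one filename per line.
# -N drops the column header, -B selects plain batch output,
# -p without a value makes mysql prompt for the password.
fetch_file_list() {
    mysql -u root -p -N -B --default-character-set=utf8 \
        -e "$(missing_files_query "$2")" "$1"
}

# Example (hypothetical database name):
# fetch_file_list targetwiki missing > imagelist.txt
```

With “-N” the header line “il_to” never reaches the output, so it does not have to be removed afterwards.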

Next we’ll calculate the links using the Linux-tool “md5sum” and download the files using wget.

All put together looks like this:

#!/bin/bash
# Mediawiki-image-exporter
# 2012-02-17 Marc Tempel (https://logbuffer.wordpress.com/)

# Both arguments are required; print the usage to stderr and fail otherwise
if [ $# -ne 2 ]
then
	echo "usage: $0 <TARGETDBNAME> <DBROOTPWD>" >&2
	exit 1
fi

FILE=imagelist.txt
BASEURL="http://sourceserver/sourcewiki/images"
TARGETDBNAME=$1
DBROOTPWD=$2

echo "SET NAMES utf8; SELECT distinct il_to FROM imagelinks;" | mysql -u root -p"$DBROOTPWD" "$TARGETDBNAME" > ./"$FILE"

sed -i '/^il_to$/d' ./"$FILE"

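# Read one filename per line; IFS= and -r keep each name exactly as stored
while IFS= read -r a
do
	# echo -n omits the trailing LF, the quotes keep the name in one piece
	MD5=$(echo -n "$a" | md5sum)
	FIRST=${MD5:0:1}
	SECOND=${MD5:0:2}
	TARGET="$BASEURL/$FIRST/$SECOND/$a"
	echo "$TARGET"
	wget -a ./wget.log --restrict-file-names=nocontrol -P ./images/ "$TARGET"
#	read -p "hit ENTER to continue..." < /dev/tty
done < ./"$FILE"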
#EOF

The “BASEURL” is the link to the source-images-directory.

As DB-user I use “root” here (“-u root”), but you could use any other suitable user that has access to the wiki-db. The SQL to get the filenames is piped into the mysql-executable and the result is stored in the file “./imagelist.txt”. The “SET NAMES utf8;” is necessary to get the query-result UTF-8-encoded.

The parameter “-n” at the “echo”-command omits the LF to calculate a correct md5sum.

The “--restrict-file-names=nocontrol” at the wget keeps wget from escaping “control-characters” in UTF-8-filenames. Otherwise wget might change the filenames in the links and so make it impossible to get those files. The wget-manual describes this option in more detail.

The images are stored in the directory “./images” (parameter “-P” at wget) and a wget-logfile is stored as “./wget.log”.
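Before importing, it can help to compare the file-list with what actually arrived. A minimal sketch, assuming the filenames used by the script above; the function name “list_not_downloaded” is made up for this example.

```shell
# list_not_downloaded LISTFILE DIR: print every name from LISTFILE
# that has no matching file in DIR.
list_not_downloaded() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue                  # skip empty lines
        [ -f "$2/$name" ] || printf '%s\n' "$name"  # report missing files
    done < "$1"
}

# Example with the names used by the script:
# list_not_downloaded ./imagelist.txt ./images
# grep -c 'ERROR' ./wget.log    # counts lines such as "ERROR 404: Not Found."
```

An empty result means every listed file has a counterpart in “./images”.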

If the script has successfully downloaded all the files (check the wget.log for errors) we can import all files in a batch using the appropriate maintenance-script:

php /path/to/target_wiki_folder/maintenance/importImages.php ./images/

The command inserts all files in “./images” into the target-wiki. To take a dry-run first we could use parameter “--dry” on the import-script.

If we now check “Special:WantedFiles” again, we should see a blank list if all went fine. If we still see missing files listed, we should check the LocalSettings.php for parameter “$wgEnableUploads = TRUE” and especially “$wgFileExtensions”. We have to make sure that all designated upload-file-extensions are listed in “$wgFileExtensions”, otherwise they are blocked for upload. If we then want to import files with extensions that were not allowed before, we can use parameter “--extensions=” on importImages.php (see “importImages.php --help” for more).
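To see which extensions our download actually contains, we can list them and compare the result with “$wgFileExtensions”. The following is a sketch under the assumption that the files sit directly in “./images”; both function names are made up for this example.

```shell
# list_extensions DIR: print the distinct, lower-cased file-extensions in DIR.
list_extensions() {
    for f in "$1"/*.*; do
        [ -f "$f" ] || continue          # also skips an unmatched pattern
        printf '%s\n' "${f##*.}"         # text after the last dot
    done | tr '[:upper:]' '[:lower:]' | sort -u
}

# extensions_arg DIR: join them with commas, the form "--extensions=" takes.
extensions_arg() {
    list_extensions "$1" | paste -sd, -
}

# Example:
# php /path/to/target_wiki_folder/maintenance/importImages.php \
#     --extensions="$(extensions_arg ./images)" ./images/
```

Allowing every downloaded extension this way bypasses the whitelist, so it is worth reading the list before passing it on.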

When downloading from Wikipedia it could be that the images are hosted on different servers. In that case we must check the servers on the image’s pages in Wikipedia (look at the deep-link there) and execute our script against each server.
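Instead of running the whole script once per server, each file can be tried against several image-roots in turn. This is only a sketch: the two roots below are examples for the English Wikipedia and are an assumption, so check them against the deep-links on the image’s pages; the function names are made up as well.

```shell
# Candidate image-roots, tried in this order (assumed examples, adjust as needed).
BASEURLS="https://upload.wikimedia.org/wikipedia/en
https://upload.wikimedia.org/wikipedia/commons"

# candidate_urls NAME: print one download-link per candidate root.
candidate_urls() {
    md5=$(printf '%s' "$1" | md5sum | cut -c1-32)
    first=$(printf '%s' "$md5" | cut -c1)
    second=$(printf '%s' "$md5" | cut -c1-2)
    for base in $BASEURLS; do
        printf '%s/%s/%s/%s\n' "$base" "$first" "$second" "$1"
    done
}

# fetch_first NAME: stop at the first root that delivers the file.
# The braces run in a subshell; its exit status becomes the function's.
fetch_first() {
    candidate_urls "$1" | {
        while IFS= read -r url; do
            wget -a ./wget.log --restrict-file-names=nocontrol \
                -P ./images/ "$url" && exit 0
        done
        exit 1
    }
}

# Example: while IFS= read -r a; do fetch_first "$a"; done < ./imagelist.txt
```

Failed attempts against the first root still show up as errors in “wget.log”, so the log has to be read with that in mind.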

If the articles in the target-wiki still look messy, there are probably some extensions or CSS-settings missing which exist on the source-wiki.
