Locked rows in mediawiki-db

Recently a few times I had the problem, that I got a timeout from the wiki-db when requesting a certain page:
“Lock wait timeout exceeded; try restarting the transaction”

Querying the mysql-db for running processes with “SHOW PROCESSLIST” showed one or more sessions waiting for an exclusive lock on a datarow while trying to run an “UPDATE” or “SELECT…FOR UPDATE” on the “page”-table.
The lock is required because of an update of the page_counter-column every time a page is requested.
Normally the update should be very quick but here it seemed as if another session still held a lock on the datarow of my requested page and won’t release it.

Blogger Venu Anuganti wrote about an odd locking-problem with MySQL’s InnoDB-Engine which I verified on a test-db:

When a transaction, which holds a lock, is stopped by an error (e.g. “ERROR 1062 (23000): Duplicate entry ’10’ for key ‘PRIMARY‘”), the lock is not instantly released but held indefinitely until a commit or rollback is set. This causes other sessions, that requests the same lock, to wait until the lock is released or a timeout is hit.

According to the docs this is intended behavior:

“A duplicate-key error rolls back the SQL statement, if you have not specified the IGNORE option in your statement.”

Note: This states that not the entire transaction is rolled back, but just the failed statement. The handling of such an error is left to the application.

Also interesting in this context (same link as quote above):

“A lock wait timeout causes InnoDB to roll back only the single statement that was waiting for the lock and encountered the timeout. (Until MySQL 5.0.13 InnoDB rolled back the entire transaction if a lock wait timeout happened. You can restore this behavior by starting the server with the –innodb_rollback_on_timeout option, available as of MySQL 5.0.32.)”

As I’m using InnoDB for the wiki-tables I think it’s likely that this is the cause of my errors.

Until now I don’t know which piece of code is responsible for the error nor how to automatically detect and fix such errors. All I could do to fix it was to gradually kill the oldest inactive db-sessions (“SHOW PROCESSLIST“, “KILL session_id“) of the wiki-db-user until my test-statement (“SELECT * FROM PAGE WHERE PAGE_ID=108 FOR UPDATE“) run through.