Crash-testing the innodb transaction log!

So, back when we released our first beta in September, one of the many responses was this

The comments about the reliability / durability of the log definitely struck me as testing we needed.

It’s taken a while (we had this GA thing we were working on…), but we finally have crash and recover testing of the innodb transaction log and the slave plugin.

Here is what happens for the innodb-based log:

  • Set up the test servers and start the randgen with the trx_log grammars.  I’ll point you to my superhuman colleague, Andrew Hutchings for a summary of what they do.
  • Some time into the test (after several rounds of queries have run), `kill -9 $pid` is issued against the master server
  • The master server is restarted
  • The transaction_reader utility is called to generate SQL from the contents of the log
  • A validator server is populated with the log’s SQL
  • Drizzledump is called against the master and validation servers
  • A diff is taken of the dump files – if all is well, they should match

For the slave plugin, everything is basically the same except that we wait and make sure the slave and master are synched, then dumpfiles are compared.

With this testing we can say that:
* The innodb-based rpl log will provide an accurate representation of the database’s state even after a crash.
* The slave plugin will provide an accurate representation of the master server even after a crash and restart.

Many iterations of these tests have been run so far, using the standard randgen data and queries as well as making use of –seed=time.  When we do this, we randomize the data and queries generated so it can cover more ground than simply running the same 1000 transactions over and over.  As it is a well designed tool, any runs can easily be repeated as the same seed *always* produces the same data and queries…repeatability is one of a qa engineer’s favorite words : )

So without further ado, here is some output from the tests.  They are located in the innodb_trx_log and slave_plugin suites for the randgen, executable via dbqp:
To run them:
./dbqp –mode=randgen –randgen-path=/path/to/randgen –suite=innodb_trx_log,slave_plugin multi_thread1_crash_recover

NOTE that the output doesn’t normally include the ps output, just putting it in here to show off the magic ; )


<snip>
# 2011-03-24T17:00:03 Query:  SELECT * FROM `C` AS X WHERE X . `col_bigint_key` BETWEEN 211 AND 2673999872 LIMIT 5 FOR UPDATE /*Generated by THREAD_ID 1*/  failed: 1213 Deadlock found when trying to get lock; try restarting transaction
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 S  1000 13229     1 99  80   0 - 164210 -     pts/6    00:01:31 lt-drizzled
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13343 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Sending kill -9 to server pid 13229 in order to force a recovery.
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 Z  1000 13229     1 99  80   0 -     0 ?      pts/6    00:01:31 lt-drizzled <defunct>
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13344 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Killing child process with pid 13300...
# 2011-03-24T17:00:03 Killing child process with pid 13306...
# 2011-03-24T17:00:03 Killing child process with pid 13302...
# 2011-03-24T17:00:03 Killing child process with pid 13304...
# 2011-03-24T17:00:03 Kill GenTest::ErrorFilter(13298)
# 2011-03-24T17:00:03 Attempting database recovery using the server ...
# 2011-03-24T17:00:03 Executing drizzle/drizzled/drizzled --no-defaults --core-file --datadir="drizzle/tests/workdir/bot0/s0/var/master-data" --basedir="drizzle" --plugin-add=shutdown_function --mysql-protocol.port=9306 2>&1 .
# 2011-03-24T17:00:03 13345
# 2011-03-24T17:00:08 transaction_log output file:  /tmp//translog_13291_.sql
# 2011-03-24T17:00:08 drizzle/plugin/transaction_log/utilities/transaction_reader -uroot --use-innodb-replication-log -p 9306 --ignore-events > /tmp//translog_13291_.sql
# 2011-03-24T17:00:09 Replicating from transaction_log output...
# 2011-03-24T17:00:09 drizzle/client/drizzle --host=127.0.0.1 --port=9311 --user=root test <  /tmp//translog_13291_.sql
# 2011-03-24T17:00:16 Validating replication via dumpfile compare...
# 2011-03-24T17:00:16 /tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:16 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9306 --user=root test >/tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:17 /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9311 --user=root test >/tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Executing diff --unified /tmp//translog_rpl_dump_13291_9306.sql /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Cleaning up validation server...
# 2011-03-24T17:00:17 Resetting validation server...
# 2011-03-24T17:00:18 0
# 2011-03-24T17:00:18 Test completed successfully.

The randgen code is in lp:randgen and the updated dbqp tests should be merged to trunk very soon.  Docs on running the randgen are here, dbqp are here.

At this point, I’d also like to shine the spotlight on David Shrewsbury for all of his hard work on the replication system.  He’s shepherded this code from a file-based log with limited testing all the way to a functional (and highly flexible) replication solution.  It was Dave who helped me with the often frustrating task of rooting out early bugs so that subsequent code could have a good foundation.  Big props should also go to Jay Pipes (master of the fu!) for his design work on the initial transaction log code.  Good design, good coding, and lots of love = some pretty cool stuff.  Of course, plenty of other people have helped…I just wanted to personally thank Dave for not trying to kill me when I was bombarding him with painful bugs early on ; )

As always, we hope to hear from you guys via IRC, emails and launchpad.  Also Drizzle Developer Day !  Sign up for it ; )  Hope to see you guys at the 2011 MySQL User’s Conference

2 thoughts on “Crash-testing the innodb transaction log!

  1. Pingback: Patrick Crews: Crash-testing the innodb transaction log! | Weez.com

  2. Pingback: Patrick Crews: Testing Drizzle’s multi-master replication, part I | Weez.com

Comments are closed.