Testing Drizzle’s multi-master replication, part I

So, the amazing Mr. Shrewsbury, master of all things Drizzle-replication, has created a beta tree with multi-master replication!

As his blog explains, it is still very early days, but as it currently stands, it can provide some long-hoped-for functionality for people with particular replication needs.  This feature will definitely be made more robust in the future, but it is still quite useful in its present form.

As I think this would be cool to see in trunk, I have started testing the tree.  No progress has been made on the multi-master side of things…yet, but I am happy to report that all of our existing replication tests (including crash and restore) are passing just fine.

What this means is that the code could make it into trunk without breaking anything for anyone and also providing access to a very cool new feature.  As time permits, we will be expanding the test cases for this feature, but in the interim, I hope that you will enjoy testing Dave’s awesome work ; )

Crash-testing the innodb transaction log!

So, back when we released our first beta in September, one of the many responses was this

The comments about the reliability / durability of the log definitely struck me as testing we needed.

It’s taken a while (we had this GA thing we were working on…), but we finally have crash and recover testing of the innodb transaction log and the slave plugin.

Here is what happens for the innodb-based log:

  • Set up the test servers and start the randgen with the trx_log grammars.  I’ll point you to my superhuman colleague, Andrew Hutchings for a summary of what they do.
  • Some time into the test (after several rounds of queries have run), `kill -9 $pid` is issued against the master server
  • The master server is restarted
  • The transaction_reader utility is called to generate SQL from the contents of the log
  • A validator server is populated with the log’s SQL
  • Drizzledump is called against the master and validation servers
  • A diff is taken of the dump files – if all is well, they should match

For the slave plugin, everything is basically the same except that we wait and make sure the slave and master are synched, then dumpfiles are compared.

With this testing we can say that:
* The innodb-based rpl log will provide an accurate representation of the database’s state even after a crash.
* The slave plugin will provide an accurate representation of the master server even after a crash and restart.

Many iterations of these tests have been run so far, using the standard randgen data and queries as well as making use of –seed=time.  When we do this, we randomize the data and queries generated so it can cover more ground than simply running the same 1000 transactions over and over.  As it is a well designed tool, any runs can easily be repeated as the same seed *always* produces the same data and queries…repeatability is one of a qa engineer’s favorite words : )

So without further ado, here is some output from the tests.  They are located in the innodb_trx_log and slave_plugin suites for the randgen, executable via dbqp:
To run them:
./dbqp –mode=randgen –randgen-path=/path/to/randgen –suite=innodb_trx_log,slave_plugin multi_thread1_crash_recover

NOTE that the output doesn’t normally include the ps output, just putting it in here to show off the magic ; )


<snip>
# 2011-03-24T17:00:03 Query:  SELECT * FROM `C` AS X WHERE X . `col_bigint_key` BETWEEN 211 AND 2673999872 LIMIT 5 FOR UPDATE /*Generated by THREAD_ID 1*/  failed: 1213 Deadlock found when trying to get lock; try restarting transaction
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 S  1000 13229     1 99  80   0 - 164210 -     pts/6    00:01:31 lt-drizzled
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13343 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Sending kill -9 to server pid 13229 in order to force a recovery.
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 Z  1000 13229     1 99  80   0 -     0 ?      pts/6    00:01:31 lt-drizzled <defunct>
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13344 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Killing child process with pid 13300...
# 2011-03-24T17:00:03 Killing child process with pid 13306...
# 2011-03-24T17:00:03 Killing child process with pid 13302...
# 2011-03-24T17:00:03 Killing child process with pid 13304...
# 2011-03-24T17:00:03 Kill GenTest::ErrorFilter(13298)
# 2011-03-24T17:00:03 Attempting database recovery using the server ...
# 2011-03-24T17:00:03 Executing drizzle/drizzled/drizzled --no-defaults --core-file --datadir="drizzle/tests/workdir/bot0/s0/var/master-data" --basedir="drizzle" --plugin-add=shutdown_function --mysql-protocol.port=9306 2>&1 .
# 2011-03-24T17:00:03 13345
# 2011-03-24T17:00:08 transaction_log output file:  /tmp//translog_13291_.sql
# 2011-03-24T17:00:08 drizzle/plugin/transaction_log/utilities/transaction_reader -uroot --use-innodb-replication-log -p 9306 --ignore-events > /tmp//translog_13291_.sql
# 2011-03-24T17:00:09 Replicating from transaction_log output...
# 2011-03-24T17:00:09 drizzle/client/drizzle --host=127.0.0.1 --port=9311 --user=root test <  /tmp//translog_13291_.sql
# 2011-03-24T17:00:16 Validating replication via dumpfile compare...
# 2011-03-24T17:00:16 /tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:16 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9306 --user=root test >/tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:17 /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9311 --user=root test >/tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Executing diff --unified /tmp//translog_rpl_dump_13291_9306.sql /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Cleaning up validation server...
# 2011-03-24T17:00:17 Resetting validation server...
# 2011-03-24T17:00:18 0
# 2011-03-24T17:00:18 Test completed successfully.

The randgen code is in lp:randgen and the updated dbqp tests should be merged to trunk very soon.  Docs on running the randgen are here, dbqp are here.

At this point, I’d also like to shine the spotlight on David Shrewsbury for all of his hard work on the replication system.  He’s shepherded this code from a file-based log with limited testing all the way to a functional (and highly flexible) replication solution.  It was Dave who helped me with the often frustrating task of rooting out early bugs so that subsequent code could have a good foundation.  Big props should also go to Jay Pipes (master of the fu!) for his design work on the initial transaction log code.  Good design, good coding, and lots of love = some pretty cool stuff.  Of course, plenty of other people have helped…I just wanted to personally thank Dave for not trying to kill me when I was bombarding him with painful bugs early on ; )

As always, we hope to hear from you guys via IRC, emails and launchpad.  Also Drizzle Developer Day !  Sign up for it ; )  Hope to see you guys at the 2011 MySQL User’s Conference

Drizzle7 GA – testing summary

With the HUGE news of Drizzle’s first GA, I thought it appropriate that I spend some time discussing the testing that has gone into this release.

I tend to agree with Stewart’s assessment of our quality – it is solid and I think that you will find yourself pleasantly surprised and not at all angry when using it, but it is always in the eye of the user ; )  With that said, I did want to highlight some areas of interest.

To begin with, as we are a fork of MySQL, the bulk of our test suite comes directly from there as well.  Most of the standard mysql-test-run tests that are used to validate MySQL are also used for Drizzle.  All of the basics like creating tables, inserting data, etc – all tested just the same : )  Of course, we have also added significant numbers of new tests as development has progressed.

On top of this, we utilize a variety of other testing and benchmarking tools.  It seems appropriate to describe the life of a Drizzle patch really quickly:

Once a patch is approved for trunk, we merge it into our build branch.  This kicks off a set of Jenkins jobs on a variety of platforms that make sure the code will build and pass the main test suite.  If it fails, we usually ask the person who submitted the patch to sort things out.  We also provide for param-builds so that people can test their branches directly before proposing a merge – it is an invaluable tool.

Once the patch passes build, we move it into staging.  Staging consists of deeper tests and benchmarks.  We take lcov numbers, we run sysbench and dbt2, we utilize a variety of randgen tests to try to crash the server and to validate functionality like drizzledump and the transaction log.  In short, we really put our code through the wringer before we ever allow it into trunk.

For this release, my efforts were focused on a few particular areas:

Additional testing was provided by the community (many thanks to everyone who helped us out with feedback!) and by the individual developers.  We generally create test cases for our main suite.

We have a lot more interesting things planned for the future.  There are several intriguing new features on the horizon and we are also doing some cool things with our testing tools as well.

dbqp / randgen docs are up

As I mentioned earlier, dbqp now has a randgen mode.  This allows us to run test cases for the randgen in much the same way as we execute drizzletest cases.  At present, all of our randgen tests are defined as dbqp cases and live in drizzle/tests/randgen_tests.  They are organized into suites (much like drizzletest cases).  You can also use it as a quick means of creating an ad hoc replication setup!

 

Anyway, we now have proper documentation up.  Instructions and prereqs for the randgen are included in the docs.  If anyone runs into any issues, please write a bug and I’ll get on it.  Thanks and have fun testing! : )

 

We have lots more cool stuff planned for dbqp and drizzle.  Ultimately, the goal is to make it fast, easy and efficient for anyone to make sure the server is behaving as they expect it to – we want happy users!  With that said –   GA release tomorrow!  It is a huge deal for us and lots of blood, sweat and tears have gone into the software.  We are really excited and hope that everyone else will wonder how they lived without Drizzle for so long (I know I have) : )