Drizzle multi-master testing!

So, it has been a while since I’ve blogged.  As some of you may have read, I have a new job and Stewart and I have been busy planning all kinds of testing goodness for Percona >: ) (I’ve also been recovering from trying to keep up with Stewart!)

Rest assured, gentle readers, that I have not forgotten everyone’s favorite modular, community-driven database ; )  Not by a long shot.  I have some major improvements to dbqp getting ready for a merge (think randgen in-tree / additional testing modes / multiple basedirs of multiple types).  Additionally, I’ve been cooking up some code to test the mighty Mr. Shrews’ multi-master code (mwa ha ha!)

What I’ve done is allow for a new option to be used with a test’s .cnf file (this is a dbqp thing, won’t work with standard drizzle-test-run).  If the runner sees this request, it will generate a multi-master config file from the specified servers’ individual slave.cnf files. 

Here is a sample config:

[test_servers]
servers = [[--innodb.replication-log],[--innodb.replication-log],[--plugin-add=slave --slave.config-file=$MASTER_SERVER_SLAVE_CONFIG]]

[s2]
# we tell the system that we want
# to generate a multi-master cnf file
# for the 3rd server to use, that
# has the first two servers as masters
# the final file is written to the first
# server's general slave.cnf file
gen_multi_master_cnf= 0,1

A good rundown of the file’s contents can be found on Shrews’ blog here, but the end result looks like this:

ignore-errors

[master1]
master-host=127.0.0.1
master-port=9306
master-user=root
master-pass=''

[master2]
master-host=127.0.0.1
master-port=9312
master-user=root
master-pass=''
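
For anyone curious what "generate a multi-master config file" boils down to, here is a minimal sketch in Python.  It is NOT the actual dbqp code; the function name build_multi_master_cnf and its parameters are made up for illustration, and it simply reproduces the layout of the file shown above from a list of master ports:

```python
# Hypothetical helper, not dbqp's real implementation: it only mirrors
# the layout of the generated file shown in the post.
def build_multi_master_cnf(master_ports, host="127.0.0.1", user="root", password="''"):
    """Return multi-master slave config text for the given master ports."""
    lines = ["ignore-errors", ""]
    for index, port in enumerate(master_ports, start=1):
        lines.extend([
            "[master%d]" % index,       # one section per master, numbered from 1
            "master-host=%s" % host,
            "master-port=%d" % port,
            "master-user=%s" % user,
            "master-pass=%s" % password,
            "",                          # blank line between sections
        ])
    return "\n".join(lines)


if __name__ == "__main__":
    # Ports taken from the sample file above.
    print(build_multi_master_cnf([9306, 9312]))
```

The point of generating the file rather than hard-coding it is that the runner hands out ports at startup, so a test cannot know them ahead of time.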

I tried cooking up a basic test case where we spin up 3 servers – 2 masters and one slave.  On master 1, we create table t1:


CREATE TABLE t1 (a int not null auto_increment, primary key(a));

On master 2, table t2:


CREATE TABLE t2 (a int not null auto_increment, primary key(a));

We insert some records into both tables, then check that our slave has everything! Sounds simple, right?

Sigh. If only. It seems that we are running into some issues when we try to record the test – you can read the bug here

We see some interesting output in the slave’s logs before it crashes:

$ cat workdir/bot0/s2/var/log/s2.err
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: 127 rollback segment(s) active.
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
(SQLSTATE 00000) Duplicate entry '772-1' for key 'PRIMARY'
Failure while executing:
INSERT INTO `sys_replication`.`queue` (`master_id`, `trx_id`, `seg_id`, `commit_order`, `originating_server_uuid`, `originating_commit_id`, `msg`) VALUES (2, 772, 1, 1, 'ac9c8ac0-8f10-474b-9bbd-b61d2cdb2b93' , 1, 'transaction_context {
server_id: 1
transaction_id: 772
start_timestamp: 1317760732106016
end_timestamp: 1317760732106017
}
event {
type: STARTUP
}
segment_id: 1
end_segment: true
')

Replication slave: Unable to insert into queue.
Replication slave: drizzle_state_read:lost connection to server (EOF)
Lost connection to master. Reconnecting.
Replication slave: drizzle_state_connect:could not connect
111004 16:39:05 InnoDB: Starting shutdown...

Additionally, you can just try the setup with --start-and-exit:

$ ./dbqp --suite=slave --start-and-exit multi_master_basic
20111004-170033 INFO Using Drizzle source tree:

20111004-170033 INFO Taking clean db snapshot...
20111004-170033 INFO Taking clean db snapshot...
20111004-170033 INFO Taking clean db snapshot...
20111004-170035 INFO bot0 server:
20111004-170035 INFO NAME: s0
20111004-170035 INFO MASTER_PORT: 9306
20111004-170035 INFO DRIZZLE_TCP_PORT: 9307
20111004-170035 INFO MC_PORT: 9308
20111004-170035 INFO PBMS_PORT: 9309
20111004-170035 INFO RABBITMQ_NODE_PORT: 9310
20111004-170035 INFO VARDIR: /drizzle_mm_test/tests/workdir/bot0/s0/var
20111004-170035 INFO STATUS: 1
20111004-170035 INFO bot0 server:
20111004-170035 INFO NAME: s1
20111004-170035 INFO MASTER_PORT: 9312
20111004-170035 INFO DRIZZLE_TCP_PORT: 9313
20111004-170035 INFO MC_PORT: 9314
20111004-170035 INFO PBMS_PORT: 9315
20111004-170035 INFO RABBITMQ_NODE_PORT: 9316
20111004-170035 INFO VARDIR: /drizzle_mm_test/tests/workdir/bot0/s1/var
20111004-170035 INFO STATUS: 1
20111004-170035 INFO bot0 server:
20111004-170035 INFO NAME: s2
20111004-170035 INFO MASTER_PORT: 9318
20111004-170035 INFO DRIZZLE_TCP_PORT: 9319
20111004-170035 INFO MC_PORT: 9320
20111004-170035 INFO PBMS_PORT: 9321
20111004-170035 INFO RABBITMQ_NODE_PORT: 9322
20111004-170035 INFO VARDIR: /drizzle_mm_test/tests/workdir/bot0/s2/var
20111004-170035 INFO STATUS: 1
20111004-170035 INFO User specified --start-and-exit. dbqp.py exiting and leaving servers running...
pcrews@mister:/drizzle_mm_test/tests$ ps -al
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 S 1000 18652 1 2 80 0 - 112094 poll_s pts/2 00:00:00 lt-drizzled
0 S 1000 18688 1 3 80 0 - 112096 poll_s pts/2 00:00:00 lt-drizzled
0 S 1000 18721 1 3 80 0 - 156326 poll_s pts/2 00:00:00 lt-drizzled
0 R 1000 18780 15985 0 80 0 - 3375 - pts/2 00:00:00 ps
0 S 1000 32463 30047 0 80 0 - 11272 poll_s pts/1 00:00:01 ssh

From here, we can connect to the slave and check out sys_replication.applier_state:

$ drizzle -uroot -p9318 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the Drizzle client.. Commands end with ; or \g.
Your Drizzle connection id is 216
Connection protocol: mysql
Server version: 2011.09.26.2427 Source distribution (drizzle_mm_test)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use sys_replication;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Schema changed
drizzle> show tables;
+---------------------------+
| Tables_in_sys_replication |
+---------------------------+
| applier_state |
| io_state |
| queue |
+---------------------------+
3 rows in set (0.001641 sec)

drizzle> select * from applier_state;
+-----------+------------------------+--------------------------------------+-----------------------+---------+-----------+
| master_id | last_applied_commit_id | originating_server_uuid | originating_commit_id | status | error_msg |
+-----------+------------------------+--------------------------------------+-----------------------+---------+-----------+
| 1 | 0 | f716781f-8c00-4b81-82c6-62039136d616 | 0 | RUNNING | |
| 2 | 3 | df7f2f6e-dba4-43ea-a674-fa4a3709865b | 3 | RUNNING | |
+-----------+------------------------+--------------------------------------+-----------------------+---------+-----------+
2 rows in set (0.000928 sec)

drizzle> select * from io_state;
+-----------+---------+-----------+
| master_id | status | error_msg |
+-----------+---------+-----------+
| 1 | STOPPED | |
| 2 | RUNNING | |
+-----------+---------+-----------+
2 rows in set (0.000839 sec)

drizzle>

So, it looks like the slave knows about both masters, but for some reason the I/O thread for master 1 is stopped (io_state shows STOPPED for it, even though applier_state reports RUNNING for both) : (
At any rate, there is a bug open on this and it could be something in my config(?) It’s been a while since I’ve played with replication and I know there has been some tinkering under the hood since then : )
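
Eyeballing two tables is fine for a one-off, but a test would want to check this automatically.  Below is a rough Python sketch of such a check.  The helper name stopped_masters is hypothetical, and the rows are simply typed in from the output above rather than fetched from a live server:

```python
def stopped_masters(rows):
    """Return the master_ids whose status is anything other than RUNNING."""
    return [row["master_id"] for row in rows if row["status"] != "RUNNING"]


# Rows copied by hand from the sys_replication output shown above.
io_state = [
    {"master_id": 1, "status": "STOPPED", "error_msg": ""},
    {"master_id": 2, "status": "RUNNING", "error_msg": ""},
]
applier_state = [
    {"master_id": 1, "status": "RUNNING", "error_msg": ""},
    {"master_id": 2, "status": "RUNNING", "error_msg": ""},
]

print("io threads not running:", stopped_masters(io_state))
print("appliers not running:", stopped_masters(applier_state))
```

In a real test the rows would come from querying the slave; that part is left out here because it depends on whichever client library the runner uses.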

The branch with the test code can be found here:
lp:~patrick-crews/drizzle/dbqp_multi_master_test

At the very least, we can now create tests that use this feature, which will help ensure that it stays on the path of solid code in the future! How about anyone out there? Has anyone been using multi-master? If so, can you share any setups / tests? Extra information would be most appreciated : )

Crash-testing the innodb transaction log!

So, back when we released our first beta in September, one of the many responses was this

The comments about the reliability / durability of the log definitely struck me as testing we needed.

It’s taken a while (we had this GA thing we were working on…), but we finally have crash and recover testing of the innodb transaction log and the slave plugin.

Here is what happens for the innodb-based log:

  • Set up the test servers and start the randgen with the trx_log grammars.  I’ll point you to my superhuman colleague, Andrew Hutchings for a summary of what they do.
  • Some time into the test (after several rounds of queries have run), `kill -9 $pid` is issued against the master server
  • The master server is restarted
  • The transaction_reader utility is called to generate SQL from the contents of the log
  • A validator server is populated with the log’s SQL
  • Drizzledump is called against the master and validation servers
  • A diff is taken of the dump files – if all is well, they should match

For the slave plugin, everything is basically the same except that we wait and make sure the slave and master are synched, then dumpfiles are compared.
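
The final validation step is just a diff of two dump files.  Here is a small Python sketch of that idea using the standard difflib module.  The real tests shell out to diff --unified (you can see it in the output further down), so treat this as an illustration of the check, not the code we run; dumps_match is a made-up name:

```python
import difflib


def dumps_match(master_dump, validator_dump):
    """Compare two dump strings; return (matched, unified_diff_text)."""
    diff = list(difflib.unified_diff(
        master_dump.splitlines(),
        validator_dump.splitlines(),
        fromfile="master",
        tofile="validator",
        lineterm="",
    ))
    # An empty diff means the two servers hold identical data.
    return (not diff, "\n".join(diff))


if __name__ == "__main__":
    same, text = dumps_match("INSERT INTO t1 VALUES (1);", "INSERT INTO t1 VALUES (1);")
    print("match" if same else text)
```

Dumping with one row per INSERT (the tests pass --skip-extended-insert to drizzledump) keeps any resulting diff readable, since a single bad row shows up as a single differing line.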

With this testing we can say that:
* The innodb-based rpl log will provide an accurate representation of the database’s state even after a crash.
* The slave plugin will provide an accurate representation of the master server even after a crash and restart.

Many iterations of these tests have been run so far, using the standard randgen data and queries as well as making use of --seed=time.  When we do this, we randomize the data and queries generated so the tests can cover more ground than simply running the same 1000 transactions over and over.  As the randgen is a well designed tool, any run can easily be repeated, as the same seed *always* produces the same data and queries…repeatability is one of a QA engineer’s favorite words : )
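
If the seed idea is new to you, here is a tiny Python illustration.  The randgen itself is written in Perl and works differently under the hood; this only shows the general principle that a seeded generator is repeatable, and the function name generate_values is invented for the example:

```python
import random
import time


def generate_values(seed, count=5):
    """Produce a pseudo-random sequence that depends only on the seed."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [rng.randint(0, 1000) for _ in range(count)]


if __name__ == "__main__":
    # A time-based seed gives fresh data on every run...
    seed = int(time.time())
    print("seed:", seed)
    first = generate_values(seed)
    # ...but re-using the recorded seed reproduces the run exactly.
    assert generate_values(seed) == first
    print(first)
```

The practical upshot: as long as the seed of a failing run is recorded, the same data and queries can be regenerated for debugging.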

So without further ado, here is some output from the tests.  They are located in the innodb_trx_log and slave_plugin suites for the randgen and are executable via dbqp.  To run them:
./dbqp --mode=randgen --randgen-path=/path/to/randgen --suite=innodb_trx_log,slave_plugin multi_thread1_crash_recover

NOTE that the output doesn’t normally include the ps output; I’m just putting it in here to show off the magic ; )


<snip>
# 2011-03-24T17:00:03 Query:  SELECT * FROM `C` AS X WHERE X . `col_bigint_key` BETWEEN 211 AND 2673999872 LIMIT 5 FOR UPDATE /*Generated by THREAD_ID 1*/  failed: 1213 Deadlock found when trying to get lock; try restarting transaction
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
# 2011-03-24T17:00:03 Query:  ROLLBACK TO SAVEPOINT A /*Generated by THREAD_ID 1*/  failed: 1305 SAVEPOINT %s does not exist
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 S  1000 13229     1 99  80   0 - 164210 -     pts/6    00:01:31 lt-drizzled
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13343 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Sending kill -9 to server pid 13229 in order to force a recovery.
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000 13090   491  0  80   0 -  9317 -      pts/6    00:00:00 python
0 Z  1000 13229     1 99  80   0 -     0 ?      pts/6    00:01:31 lt-drizzled <defunct>
0 S  1000 13260     1  0  80   0 - 104483 -     pts/6    00:00:00 lt-drizzled
0 S  1000 13290 13090  0  80   0 -  1038 -      pts/6    00:00:00 sh
0 S  1000 13291 13290  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13298 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13299 13291  0  80   0 - 18502 -      pts/6    00:00:00 gentest.pl
1 S  1000 13300 13291  5  80   0 - 20378 -      pts/6    00:00:02 gentest.pl
1 S  1000 13302 13291  5  80   0 - 20355 -      pts/6    00:00:02 gentest.pl
1 S  1000 13304 13291  5  80   0 - 20324 -      pts/6    00:00:02 gentest.pl
1 S  1000 13306 13291  5  80   0 - 20406 -      pts/6    00:00:02 gentest.pl
0 R  1000 13344 13299  0  80   0 -  1651 -      pts/6    00:00:00 ps
# 2011-03-24T17:00:03 0
# 2011-03-24T17:00:03 Killing child process with pid 13300...
# 2011-03-24T17:00:03 Killing child process with pid 13306...
# 2011-03-24T17:00:03 Killing child process with pid 13302...
# 2011-03-24T17:00:03 Killing child process with pid 13304...
# 2011-03-24T17:00:03 Kill GenTest::ErrorFilter(13298)
# 2011-03-24T17:00:03 Attempting database recovery using the server ...
# 2011-03-24T17:00:03 Executing drizzle/drizzled/drizzled --no-defaults --core-file --datadir="drizzle/tests/workdir/bot0/s0/var/master-data" --basedir="drizzle" --plugin-add=shutdown_function --mysql-protocol.port=9306 2>&1 .
# 2011-03-24T17:00:03 13345
# 2011-03-24T17:00:08 transaction_log output file:  /tmp//translog_13291_.sql
# 2011-03-24T17:00:08 drizzle/plugin/transaction_log/utilities/transaction_reader -uroot --use-innodb-replication-log -p 9306 --ignore-events > /tmp//translog_13291_.sql
# 2011-03-24T17:00:09 Replicating from transaction_log output...
# 2011-03-24T17:00:09 drizzle/client/drizzle --host=127.0.0.1 --port=9311 --user=root test <  /tmp//translog_13291_.sql
# 2011-03-24T17:00:16 Validating replication via dumpfile compare...
# 2011-03-24T17:00:16 /tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:16 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9306 --user=root test >/tmp//translog_rpl_dump_13291_9306.sql
# 2011-03-24T17:00:17 /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 drizzle/client/drizzledump --compact --skip-extended-insert --host=127.0.0.1 --port=9311 --user=root test >/tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Executing diff --unified /tmp//translog_rpl_dump_13291_9306.sql /tmp//translog_rpl_dump_13291_9311.sql
# 2011-03-24T17:00:17 Cleaning up validation server...
# 2011-03-24T17:00:17 Resetting validation server...
# 2011-03-24T17:00:18 0
# 2011-03-24T17:00:18 Test completed successfully.

The randgen code is in lp:randgen and the updated dbqp tests should be merged to trunk very soon.  Docs on running the randgen are here, dbqp are here.

At this point, I’d also like to shine the spotlight on David Shrewsbury for all of his hard work on the replication system.  He’s shepherded this code from a file-based log with limited testing all the way to a functional (and highly flexible) replication solution.  It was Dave who helped me with the often frustrating task of rooting out early bugs so that subsequent code could have a good foundation.  Big props should also go to Jay Pipes (master of the fu!) for his design work on the initial transaction log code.  Good design, good coding, and lots of love = some pretty cool stuff.  Of course, plenty of other people have helped…I just wanted to personally thank Dave for not trying to kill me when I was bombarding him with painful bugs early on ; )

As always, we hope to hear from you guys via IRC, emails and launchpad.  Also Drizzle Developer Day !  Sign up for it ; )  Hope to see you guys at the 2011 MySQL User’s Conference

Drizzle’s slave plugin is working!

These are exciting times for the Drizzle team.  We just released our first RC and things are finally coming together into some awesome new features.  I’m excited to bring you the latest news from the replication front:

Where to begin?  Well, many moons ago, Brian sent David Shrewsbury and myself out on the task of making the transaction_log plugin rock solid.  This plugin provides a file-based log that captures the server’s state via protobuf messages.  After much blood, sweat, and tears (and *many* bugs), we accomplished our task with *plenty* of help from everyone on the team.  With this task accomplished, we could say that any replication solution using the log would have an accurate representation of the server’s state.  However, this was a long way from actually replicating server-server.

During this time, we were also working on storing the transaction log in an innodb table rather than a file.  Initial sysbench tests show a significant performance gain using this code versus the file-based log.  Thanks to the initial work on the file-based log, getting this code up and running wasn’t too painful and it also passes all of our tests with flying colors.  Special mention should be given to Joe Daly, Stewart Smith, and Andrew Hutchings for hacking on this.

Having the transaction log in a database table provides other advantages beyond speed (such as easy, standardized access by other servers), but I’ll leave that to the hackers to discuss (I’d really recommend catching Dave’s UC talk if you are interested!).  The gist is that it has allowed the amazing Mr. Shrewsbury to cook up the slave plugin!

This is a plugin that allows a server to replicate from another server that is using the innodb-trx log.  It is amazingly simple:
master options:  --innodb.replication-log=true
slave options: --plugin-add=slave --slave.config-file=XYZ

The config file may contain the following options in option=value format:
master-host – hostname/ip of the master host
master-port – port used by the master server
master-user – username
master-pass – password
max-reconnects – max # of reconnect attempts if the master disappears
seconds-between-reconnects – how long to wait between reconnect attempts
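
To make the option=value format concrete, here is a short Python sketch that reads such a file into a dict.  This is not how the plugin parses its config (that lives in the plugin's own code); parse_slave_config is a hypothetical helper, and the reconnect values in the sample are invented placeholders:

```python
def parse_slave_config(text):
    """Parse option=value lines into a dict of strings."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected option=value, got: %r" % line)
        options[key.strip()] = value.strip()
    return options


# Host and port match the demo later in this post; the two reconnect
# settings are made-up example values, not recommended defaults.
SAMPLE = """\
master-host=127.0.0.1
master-port=9316
master-user=root
max-reconnects=10
seconds-between-reconnects=5
"""

if __name__ == "__main__":
    for name, value in sorted(parse_slave_config(SAMPLE).items()):
        print(name, "=", value)
```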

The code hasn’t yet been merged to trunk, but can be checked out from lp:~dshrews/drizzle/slave
Currently, the plugin is able to replicate *anything* we throw at the master, which is HUGE!

Additionally, our experimental test-runner, dbqp, is sporting randgen integration!  I’ll write more about this in an upcoming post, but I mention it here as you can use the new randgen-mode + --start-and-exit to have yourself a handy-dandy replication setup for ad-hoc testing.  Observe:

./dbqp --mode=randgen --start-and-exit --suite=slave_plugin
Setting --no-secure-file-priv=True for randgen mode...
<snip>
24 Feb 2011 12:25:11 INFO: Using testing mode: randgen
24 Feb 2011 12:25:11 INFO: Processing test suites...
24 Feb 2011 12:25:11 INFO: Found 1 test(s) for execution
24 Feb 2011 12:25:11 INFO: Creating 1 testbot(s)
24 Feb 2011 12:25:11 INFO: Taking clean db snapshot...
24 Feb 2011 12:25:12 INFO: Taking clean db snapshot...
24 Feb 2011 12:25:13 INFO: testbot0 server:
24 Feb 2011 12:25:13 INFO: NAME: server0
24 Feb 2011 12:25:13 INFO: MASTER_PORT: 9316
24 Feb 2011 12:25:13 INFO: DRIZZLE_TCP_PORT: 9317
24 Feb 2011 12:25:13 INFO: MC_PORT: 9318
24 Feb 2011 12:25:13 INFO: PBMS_PORT: 9319
24 Feb 2011 12:25:13 INFO: RABBITMQ_NODE_PORT: 9320
24 Feb 2011 12:25:13 INFO: VARDIR: drizzle/tests/workdir/testbot0/server0/var
24 Feb 2011 12:25:13 INFO: STATUS: 1
24 Feb 2011 12:25:13 INFO: testbot0 server:
24 Feb 2011 12:25:13 INFO: NAME: server1
24 Feb 2011 12:25:13 INFO: MASTER_PORT: 9321
24 Feb 2011 12:25:13 INFO: DRIZZLE_TCP_PORT: 9322
24 Feb 2011 12:25:13 INFO: MC_PORT: 9323
24 Feb 2011 12:25:13 INFO: PBMS_PORT: 9324
24 Feb 2011 12:25:13 INFO: RABBITMQ_NODE_PORT: 9325
24 Feb 2011 12:25:13 INFO: VARDIR: drizzle/tests/workdir/testbot0/server1/var
24 Feb 2011 12:25:13 INFO: STATUS: 1
24 Feb 2011 12:25:13 INFO: User specified --start-and-exit.  dbqp.py exiting and leaving servers running...

We now have two servers running, a master on port 9316 and a slave on port 9321

user@mahmachine:~drizzle/tests$ ../client/drizzle -uroot -p9316 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 4
Connection protocol: mysql
Server version: 2011.02.2197 Source distribution (drizzle)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> create table t1 (a int auto_increment not null, primary key(a));
Query OK, 0 rows affected (0.001652 sec)

drizzle> insert into t1 values (),(),();
Query OK, 3 rows affected (0.001182 sec)
Records: 3  Duplicates: 0  Warnings: 0

drizzle> select * from t1;
+---+
| a |
+---+
| 1 |
| 2 |
| 3 |
+---+
3 rows in set (0.000538 sec)

drizzle> exit
Bye
user@mahmachine:~drizzle/tests$ ../client/drizzle -uroot -p9321 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 82
Connection protocol: mysql
Server version: 2011.02.2197 Source distribution (drizzle)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> show tables;
+----------------+
| Tables_in_test |
+----------------+
| t1             |
+----------------+
1 row in set (0.000983 sec)

drizzle> select * from t1;
+---+
| a |
+---+
| 1 |
| 2 |
| 3 |
+---+
3 rows in set (0.00058 sec)

You will likely need to merge with trunk to use this feature in the slave branch, but the actual plugin should be merged soon.

When you are done playing, you can take advantage of a new quality of life feature – cleanup mode.
A pet peeve of mine is needing to clean up / shutdown any servers I may have started during testing.  With dbqp --mode=cleanup, the tool will now kill any server pids it detects in its workdir.  Nothing fancy, but useful, quick, and easy:

./dbqp --mode=cleanup
Setting --start-dirty=True for cleanup mode...
24 Feb 2011 12:32:55 INFO: Using Drizzle source tree:
24 Feb 2011 12:32:55 INFO: basedir: drizzle
24 Feb 2011 12:32:55 INFO: clientbindir: drizzle/client
24 Feb 2011 12:32:55 INFO: testdir: drizzle/tests
24 Feb 2011 12:32:55 INFO: server_version: 2011.02.2197
24 Feb 2011 12:32:55 INFO: server_compile_os: unknown-linux-gnu
24 Feb 2011 12:32:55 INFO: server_platform: x86_64
24 Feb 2011 12:32:55 INFO: server_comment: (Source distribution (drizzle))
24 Feb 2011 12:32:55 INFO: Using --start-dirty, not attempting to touch directories
24 Feb 2011 12:32:55 INFO: Using default-storage-engine: innodb
24 Feb 2011 12:32:55 INFO: Using testing mode: cleanup
24 Feb 2011 12:32:55 INFO: Killing pid 24416 from drizzle/tests/workdir/testbot0/server1/var/run/server1.pid
24 Feb 2011 12:32:55 INFO: Killing pid 24385 from drizzle/tests/workdir/testbot0/server0/var/run/server0.pid
24 Feb 2011 12:32:55 INFO: Stopping all running servers...
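
For the curious, the idea behind cleanup mode is simple enough to sketch in a few lines of Python.  This is a rough approximation and not the dbqp source: the function names are hypothetical, and the choice of SIGTERM is my assumption for the example, since the output above does not say which signal is sent:

```python
import os
import signal


def find_server_pids(workdir):
    """Walk workdir and return (pid, pidfile_path) for every *.pid file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(workdir):
        for name in sorted(filenames):
            if not name.endswith(".pid"):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as handle:
                content = handle.read().strip()
            if content.isdigit():  # ignore empty or corrupt pid files
                found.append((int(content), path))
    return found


def kill_servers(workdir, sig=signal.SIGTERM):
    """Signal every server pid found under workdir (signal is an assumption)."""
    for pid, path in find_server_pids(workdir):
        print("Killing pid %d from %s" % (pid, path))
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the server already exited; nothing to clean up
```

Stale pid files are the main gotcha with this approach: a pid left behind by a dead server may since have been reused by an unrelated process, so a more careful version would confirm the process really is a drizzled before signalling it.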

Please give it a whirl and help us make the code better by filing any bugs you detect!

Drizzle’s transaction log is passing all tests!

In case you missed it here, we are very proud to announce that Drizzle’s transaction log is passing all of our tests.  For quite some time, David Shrewsbury, Stewart Smith, and Joe Daly have been putting a lot of love into the log code.  Please don’t be fooled by Dave’s praise of QA now that the storm has passed…you should have heard the names he called me and the things he plotted when we were rooting these bugs out ; )  However, now that there is a permanent record of his words, I’ll be reminding him about this post the next time my testing becomes a pain in his posterior and I feel him giving me the stink-eye in IRC (heheh)

With that said, we really have been putting tons of effort into making the log rock-solid.  This code will serve as the foundation for Drizzle replication and we can now be assured that any replication solutions will have a reliable log that will reflect the state of the server.  We have been beating up the code with the random query generator.  We have concocted several grammars that throw a variety of queries, in transactions and standalone, at the server.  After we have made the master do some work (we use a variety of connections and per-connection query counts), we produce SQL from the log, populate a validation server with the SQL, then compare drizzledump files to ensure they match.  You should really check out the transaction_reader utility in drizzle/plugin/transaction_log/utilities – it allows a user to view the raw trx log contents, produce SQL from the log contents, and a few other neat tricks.  I’ll be blogging a bit more about how we used this for testing and troubleshooting very soon.

The transaction log worked well in most cases; the majority of our problems were in rooting out strange behavior around deadlocks and rolled-back transactions.  I’d like to once again thank Stewart Smith, of the spork most flaming, for his valued assistance in finding these annoying bugs : )  If anyone wants to take a look at the bugs we’ve killed – you can check them out here.

We are far from being done – some of our next tasks include testing RabbitMQ, tweaking randgen tests to make certain we are totally crash-safe, and a few other things.  In the meantime, our randgen trx log tests run against every branch we intend to merge into trunk, so we’re keeping a close eye on making sure it stays solid.  We’ll keep you posted as our replication testing moves along.  Please keep trying Drizzle and helping us to improve it with your bug reports and feedback.