Running sysbench tests against a Galera cluster
So, vacation is over and I was in luck: Already during first week I had ample time to finally put Galera replication to the test. It was a great experience: I learned a lot, and eventually got the great results I was hoping to see.
Again I've started by just running the standard Sysbench oltp read-write test. Since this is a commonly used benchmark, it produces numbers that are comparable with others running the same benchmarks. Including, as it happens, Galera developers themselves.
These tests were run on an 8 core server with 32 GB of RAM and the disk on some EMC device with a 2,5GB write cache.
I was using the Galera demo distribution provided by Codership guys. It's a recent MySQL 5.1, where I also enabl the InnoDB plugin. I then changed the following configuration parameters:
# Caveat: A couple of these occur again later in this file, must comment out for these to have effect.
As you can see, Galera allows me to use multiple parallel threads on the slave side, to combat slave lag. And it's a very general solution too: Any work on applying transactions can be somewhat parallelized, even if the commit order is still preserved. You can also choose to allow slave to commit out of order, but in this test that would not have helped anything. I took that option into use as I was troubleshooting some poor performance, but in the end the problem was not on Galera side at all. So in a normal situation most of those threads just show up as idle in SHOW PROCESSLIST. However, it's a nice feeling to know that replication is not a bottleneck anymore.
The sysbench command used was
# Run this manually to prepare some tables:
# sysbench --test=sysbench/tests/db/oltp.lua --mysql-host=host1 \
# --mysql-user=root --mysql-password=rootpass --mysql-table-engine=InnoDB \
# --mysql-engine-trx=yes --oltp-table-size=2000000 --max-time=300 \
# --max-requests=0 --num-threads=1 --oltp-tables-count=10 \
# --oltp-auto-inc=off prepare \
# The above creates a dataset roughly
# 189,8*2000000*10/1024/1024 = 3620M = 3,5G
NUM_THREADS="1 2 4 8 16 32 64 128 256 512 1023"
for N in $NUM_THREADS
#while test 1
# --oltp-auto-inc=off is undocumented, but important when running galera.
# sysbench assumes that autoinc columns will be a continuos sequence 1,2,3...
# but with galera they're not (auto_increment_increment)
sysbench/sysbench --test=sysbench/tests/db/oltp.lua \
--mysql-user=root --mysql-password=rootpass \
--mysql-table-engine=InnoDB --mysql-engine-trx=yes \
--oltp-table-size=2000000 --max-time=60 \
--max-requests=0 --num-threads=$N --oltp-tables-count=10 \
# sleep 5
I'm using the development version of sysbench 0.5, which allows me to spread the test over more than one table. In the end there's no difference from just doing the test agains one large table instead, since I'm still using a single tablespace. With 10x2M rows, the dataset fits well in memory on this server.
In the graphs below I'm using transactions/second. One transaction consists of 14 selects, 2 updates, 1 delete and 1 insert. If you wanted to measure queries per second, you would multiply these numbers by 18.
Benchmarking single node behavior
Ok, here we go. To get a baseline which to compare against, I first started a single mysql node and ran the benchmark against that. I first set sync_binlog and innodb_trx_flush_at_trx_commit and innodb_doublewrite all to 1 (blue). This is the setting you'd run on a normal mysql server to guarantee data durability. (It's also the default, in case you wonder.)
As is famously known, InnoDB supports group commit, which is an important performance optimization, but enabling the binary log and setting sync_binlog=1 breaks this functionality so that InnoDB and the binary log will both fsync to disk after each single transaction. For this reason, many people run in prodution with sync_binlog=0, to make group commit work and instead sacrifice some reliability on the replication side. Since this is a commonly used setting, I also benchmarked that (red).1
This is a surprising result. The red and blue lines are essentially the same, even if the red one should show better performance.
iostat 1 reports my disk is doing roughly 2000 io per second, and since I know MySQL has to do 3 fsyncs per transaction, then 600 transactions/second sounds about right. This is expected for the blue line, but I was expecting more from the red.
After a while of thinking I realize this MySQL binary now also has another replication engine, galera, included. Even if it's not replicating anywhere yet, maybe it has similar effects in breaking InnoDB group commit. So I try the last test again with galera disabled (brown):
(In fact, in the demo package this is set by the custom startup script on the command line, and you have to comment out that row in the shell script. The my.cnf setting does nothing.)
That improves performance to something like I was expecting it to be. With group commit working and using a write back cache on the disk controller, this should be close to the max performance I expect to get.
Finally I also run a test with both sync_binlog and innodb_... variables set to 0 (yellow with galera module loaded, green with galera unloaded). This is not something you'd normally do on your MySQL server. But it is a safe optimization when using synchronous replication like Galera. The idea is that transactions are made durable by being copied to the other node(s) rather than written to disk (which they now are not immediately after commit) and they will thus survive the crash of one node. They wouldn't survive simultaneous crash of all nodes, just like commits to disk won't survive a fire that melts down your disks - both of these tend to be quite rare events MySQL Cluster follows the same principle. The tag line for this kind of thinking is: "Writing to remote RAM is faster than local disk."
Both of those tests also show good results. It's no surprise that the setting with galera unloaded and no syncs anywhere is slightly better than anything else. On the other hand, the brown line with InnoDB group commit is essentially just as good. This is thanks to the write cache. If I had just a normal disk, the difference would be bigger.
First try with Galera cluster
Having set a baseline we can compare against, it's now time to fire up a 3 node Galera cluster. Galera automatically provisions the new nodes with the data from the node that is already running. This is very convenient, I don't have to copy backups around manually, or re-run the sysbench prepare stage. Currently this is done by mysqldump, but Vadim already has a bzr repository where xtrabackup is included as a donor method.
In the following graph, I kept the graphs with sync_binlog=1 (blue) and =0 (red) from the previous graph.
I then do tests where I
- write against 1 master node only, and 2 nodes remain as purely passive slaves. (yellow)
- write against 2 masters, with 1 node remaining as a slave (green)
- write against all 3 masters concurrently (brown)
Note that the concept of master and slave are here purely an application side decision. There's no change in the configuration of the Galera cluster, it's just a matter of deciding which node to connect to.
In fact, sysbench makes this very easy as it allows me to give a list of nodes separated by commas, such as
--mysql-host=host1,host2,host3. This is in the same spirit I outlined in my recent post The Ultimate MySQL High-Availability Solution.
The results of my first run were very poor, the Galera cluster basically didn't scale beyond 2 client threads and writing to several nodes didn't help the situation.
I got a lot of help from Alex and Seppo on the codership mailing list in troubleshooting this situation, and learned how to read Galera related status values in the process.
The problem is seen in a SHOW PROCESSLIST taken from the slave side:
host2 >show processlist;
| Id | User | Host | db | Command | Time | State | Info |
| 1 | system user | | NULL | Sleep | 2 | post RBR applying (5610209) | NULL |
| 2 | system user | | NULL | Sleep | 2 | post RBR applying (5610197) | NULL |
| 3 | system user | | NULL | Sleep | 2 | post RBR applying (5610224) | NULL |
| 4 | system user | | NULL | Sleep | 6058 | wsrep aborter idle | NULL |
| 5 | system user | | NULL | Sleep | 2 | post RBR applying (5610203) | NULL |
| 6 | system user | | NULL | Sleep | 2 | post RBR applying (5610200) | NULL |
| 7 | system user | | NULL | Sleep | 2 | post RBR applying (5610226) | NULL |
| 8 | system user | | NULL | Sleep | 2 | post RBR applying (5610206) | NULL |
| 9 | system user | | NULL | Sleep | 2 | post RBR applying (5610213) | NULL |
| 10 | system user | | NULL | Sleep | 2 | post RBR applying (5610215) | NULL |
| 11 | system user | | NULL | Sleep | 2 | Delete_rows_log_event::ha_delete_row(5610196) | NULL |
| 12 | system user | | NULL | Sleep | 2 | post RBR applying (5610204) | NULL |
| 13 | system user | | NULL | Sleep | 2 | post RBR applying (5610220) | NULL |
| 14 | system user | | NULL | Sleep | 2 | post RBR applying (5610217) | NULL |
| 15 | system user | | NULL | Sleep | 2 | post RBR applying (5610210) | NULL |
| 16 | system user | | NULL | Sleep | 2 | post RBR applying (5610198) | NULL |
| 17 | system user | | NULL | Sleep | 2 | post RBR applying (5610207) | NULL |
| 18 | system user | | NULL | Sleep | 2 | post RBR applying (5610227) | NULL |
| 19 | system user | | NULL | Sleep | 2 | post RBR applying (5610222) | NULL |
| 20 | system user | | NULL | Sleep | 2 | post RBR applying (5610211) | NULL |
| 21 | system user | | NULL | Sleep | 2 | post RBR applying (5610223) | NULL |
| 22 | system user | | NULL | Sleep | 2 | post RBR applying (5610208) | NULL |
| 23 | system user | | NULL | Sleep | 2 | post RBR applying (5610212) | NULL |
| 24 | system user | | NULL | Sleep | 2 | post RBR applying (5610202) | NULL |
| 25 | system user | | NULL | Sleep | 2 | post RBR applying (5610201) | NULL |
| 26 | system user | | NULL | Sleep | 2 | post RBR applying (5610221) | NULL |
| 27 | system user | | NULL | Sleep | 2 | post RBR applying (5610216) | NULL |
| 28 | system user | | NULL | Sleep | 2 | post RBR applying (5610205) | NULL |
| 29 | system user | | NULL | Sleep | 2 | post RBR applying (5610199) | NULL |
| 30 | system user | | NULL | Sleep | 2 | post RBR applying (5610219) | NULL |
| 31 | system user | | NULL | Sleep | 2 | post RBR applying (5610214) | NULL |
| 33 | system user | | NULL | Sleep | 2 | post RBR applying (5610225) | NULL |
| 34 | system user | | NULL | Sleep | 2 | post RBR applying (5610218) | NULL |
| 1187 | root | localhost:59314 | NULL | Query | 0 | sleeping | show processlist |
34 rows in set (0.00 sec)
The following wsrep status values are interesting to monitor:
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.262727 |
| wsrep_flow_control_paused | 0.609370 |
What is seen happening above is that 1 of the wsrep_slave threads is trying to delete a row in InnoDB (it could also be insert or update) and all the other ones are waiting for their turn. The numbers you see are the sequence part of the global transaction id (I think) and you can see that the event that is stuck has the lowest number, and all others must wait for it to complete before they can continue to commit.
The wsrep_flow_control_paused variable shows the fraction of time replication was paused due to threads waiting for their turn - as in the above processlist. It is usually zero or very small. Now it was more than 0.5, even 0.9.
From this Alex was able to conclude that the problem really is on InnoDB side, it's InnoDB everyone is waiting for here. Since I didn't have any problems on single node workload, I first didn't believe him, but as it happens, increasing innodb_log_file_size from 100M to 2x1.5G made the problem go away! I know keeping it at 100M is kind of small, but like I said, it didn't cause any problems in the single node run. Why InnoDB behaves differently when exactly the same transactions are applied from a slave thread instead of a sysbench client connection is currently an unknown, but it does. Perhaps it gets confused by the fact that a slave workload is write-only, whereas on the master you also have selects.
Time to re-run the tests. Again the red and blue curves are copied from the previous graph to give a baseline to compare against. I will also include latencies now.
What we see here is that first of all, even just writing to 1 Galera master you get similar performance as a standalone server with innodb_flush_log_at_trx_commit=1. So there is no overhead at all from the replication, but you now have a true highly available synchronously replicated cluster. This is even confirmed with the latency graph, there simply isn't any penalty from the synchronous replication here.
Of course, if I didn't have battery backed up write cache on my servers, then Galera would be a huge win here! This is again similar to MySQL Cluster architecture: better software architecture provides great performance with lesser hardware.
These results are already great in the sense that I get equal or better performance by deploying a Galera cluster compared to standalone MySQL node. But I can already see this is not the maximum I could get out of this cluster. For one thing I don't see almost any benefit from connecting to all 3 nodes compared to 2 nodes. Also when looking at top during the benchmark, in the single node test I can see MySQL consuming more than 700% CPU (where 800% is theorethical maximum for a 8 core server), but in the tests with a Galera cluster each node only reaches about 600%. In other words, something is keeping it down.
A nice property of Galera's behavior here is also that even if it maxes out at some point, the graphs don't go back down as they often do in single node MySQL benchmarks, but performance stays flat even when you increase the amount of client threads. Again, I don't know exactly why this is, but it's very nice.
I did try to peek under the hood to get an idea of what is limiting me. For instance I suspected I could already be limited by network interrupts on the client side. To test this I ran 512+512 and 1023+1023 sysbench clients from two different machines, however I didn't get better results.
From looking at show processlist, show status like "%wsrep%" and top, I see similar behavior as I did originally in the tests with poor results. I therefore suspect I have again reached a limit where purging the InnoDB log file is blocking everything else. Also iostat shows a pattern of io activity going up and down in waves.
Since I didn't want this to turn into an excercise in InnoDB tuning, I didn't go further, but I am convinced that I can get even better numbers if I needed to.
Next up, same test with larger tables to see how a disk-bound workload behaves.
- 1. FYI, the bug where group commit breaks has been fixed in MariaDB 5.3.0 beta, will be fixed in Percona Server (or is already? need to check...), and Oracle is previewing a different fix in MySQL 5.6 preview releases.