More Galera lessons: parallel slave, out of order commits and deadlocks
2 concepts I've been an active advocate of during the past few years are both supported by Galera: Multi-threaded (aka parallel) slave, and allowing out-of-order commits on such a parallel slave. In trying to optimize Galera settings for the disk bound workload I just reported on, I also came to test these alternatives.
Single threaded vs Multi threaded slave
All of my previously reported tests have been run with
wsrep_slave_threads=32. For the memory-bound workload there was no difference using one thread or more, but I left it at 32 "just in case". For the disk bound workload there is a clear benefit in having a multi-threaded slave:
The green, brown and cyan lines were run with a single threaded slave and show a clear decrease in performance. They are also all equal, because they are all limited by the same bottleneck: single-threaded slave. By using more threads on the slave side (blue, red, yellow), we get 2x more performance. This is when writing to only one master, which is the best case for this workload, but I've heard of even 4x improvements due to using multi-threaded slave on other workloads which allow you to also benefit from scale-out to more than one node.
It seems to me Galera team could change the default here to something higher than 1. (Probably 4 or 8.)
Out of order commits
Galera's implementation of multi-threaded slaves is imho the preferred since it allows you to use it on any workload. What happens is that the slave threads can do some work in parallel (most notably, waiting for the disk seeks), but the actual commits are still serialized and happen in the same order as the transactions originated.
So a natural question to ask is, whether it would be beneficial to allow the commits to happen out-of-order, ie just allow each thread to commit independently and as fast as possible?
Before we get to the results, let's ask ourselves, when is this even safe to do? In the general case it is not safe, because you can't have an older transaction writing data in a record that already contains data from a newer transaction. You'd then have old data in that record, and you would also have a slave out of sync with your master. But allowing out of order commits can be safe in some situations:
- If you could also use a conflict resolution function, then a workload that does just single row updates is a good example of a safe workload. In this case the row could contain a column with a timestamp value, and the replication slave is told to only apply transactions where the timestamp is greater than what is already in the database. MySQL Cluster provides this kind of conflict resolution, but Galera does not.
- Even without conflict resolution, some workloads are inherently safe. For instance if you only do reads and inserts, but no updates or deletes.
- Tungsten and MySQL 5.6 allow you to parallelize the replication stream on a per-schema basis, preserving commit order within each schema but not between them. The assumption here is that transactions to different schemas will always be safe, which is in practice true in many cases. AFAICS Galera does not provide this option.
In the sysbench test it is not really safe to use OOOC, but since it is just garbage data anyway, it's worth a try just to see the results:
To read these results, you need to compare them one-to-one, ie blue vs green, red vs brown, yellow vs cyan. It makes sense to focus on the best case, which is writing to single master, ie blue and green lines. The others are burdened with other bottlenecks anyway. We see that using OOOC we can indeed increase performance a little. The highest increase is with a low amount of sysbench threads, ie before we saturate the InnoDB/disk layer. However, after we've saturated the disk, out of order commits don't really help. As I explained in the previous post, once you have enough uncomitted events in the replication stream, Galera will make incoming transactions wait anyway, so using OOOC doesn't really help that much, since all of the 32 slaves are just stuck anyway.
Even if there was a small benefit here, it seems there are more important things to focus on than using OOOC, which is only safe in some specific cases. I was surprised to see that just the multi-threaded slave while preserving commit order is so efficient. As it is also a very general solution, it seems like that's the recommended configuration.
Deadlocks happen when two transactions need to write to two or more of the same rows, and both write to one row first and then both wait for the other transaction to release the lock it's holding on the other row. The only solution in such a case is that one of the transactions is aborted and rolled back, so at least one transaction can continue. This is done by the database and is generally seen as an error (such as Java exception) in the application layer. However, it is a "normal" error, a correctly written application should catch the error and retry the transaction - chances are good it will just succeed the second time.
InnoDB's implementation of MVCC and row-level locks is very nice and user friendly in this regard, so most MySQL users might not have seen a deadlock in their entire life. Other databases may behave differently - for instance I've seen a customer migrating from InnoDB to MySQL Cluster that got into deadlock situation (they were using Hibernate which liked to lock a lot of rows for a long time...) and completely freaked out and blamed MySQL Cluster for it.
Sysbench also reports the amount of transactions that end up deadlocked. Below I've converted these to a percentage of total transactions:
This table is for the smaller, in-memory dataset. With the larger data set transactions are spread out over so many rows that there's at most one odd deadlock per each 2.5 minute run.
So, like I said, for a single MySQL node deadlocks simply don't happen. Why do they happen with Galera then?
The way Galera replication works is that a transaction is first executed against the single MySQL node, and committed. When it is committed all write operations are replicated - synchronously, while the commit is blocked - to other nodes in the cluster. At this point Galera now checks if the transaction can also be committed without there being any conflicting locks on those nodes. Usually it can, but it is possible that there is another transaction on that node just running, which is also modifying the same rows. In this case one of the transactions needs to be rolled back.
The difference to a single MySQL node is that there MySQL takes these locks as the transaction proceeds (pessimistic locking), but in Galera this is only the case for the local node, and on remote nodes all locks are taken at commit time (optimistic locking). Indeed, you can see that in the test where we run a Galera cluster but all clients connect to the same master, there are also no deadlocks. They only happen when you write concurrently to several nodes.
So what can you do to overcome this problem? Really, deadlocks are not a problem. Here we see that even under severe stress only 1 transaction in 500 will result in a deadlock, and under normal high concurrency situation it is only 1 in ten thousand - from a performance point of view it is completely negligible. So the only thing you need to do is to diligently catch those Java exceptions and retry deadlocked transactions, as you should. (But I know you don't, so now is a good time to start doing it!)
There could also be some patological workloads where this really becomes an issues. For instance if you had an application that always increments a counter:
UPDATE c=c+1 FROM ... WHERE id=1; On a single MySQL node this would result in bad performance, but probably not any deadlocks. However in a Galera cluster I'm sure such an application could cause interesting behavior.
In the worst case, should you find a workload that has problems due to deadlocks, it seems you can always solve it by just writing to a single master.