Understanding the Galera commit sequence, and innodb_doublewrite back to 1
People often assume that I talk at conferences because I like to teach them about Galera, MySQL and other cool open source things. That's of course true to some extent. But for me personally, another reason is even more important: almost every time, the audience teaches me something. Public speaking has increasingly become my way of picking up new insights, the kind of low-level details that aren't easy to learn by just RTFM.
At my Galera talk at FrOSCon, I was graced by the presence of Monty Widenius himself, as well as Oli and Erkan, two Galera gurus of Europe. Two things in the talk provoked discussion. These are the conclusions, confirmed by Alex and Seppo from Codership:
I have advocated setting innodb_doublewrite=0 when using Galera (slide 16). This was based on the fact that if the mysqld process crashes, Galera will discard the database anyway and take a full snapshot from a healthy node. So who cares if data is corrupted on a crash?
As of Galera 2.1 this isn't true anymore. Galera will now recover its state after a crash too, and may very well do only an incremental state transfer (i.e. keep the data and catch up on the missing transactions). For this reason it is now important to keep innodb_doublewrite=1, which is the MySQL default.
Note that we still advise setting innodb_flush_log_at_trx_commit and sync_binlog to zero. It doesn't matter if some transactions are lost on a crash, because Galera can get those back from the other nodes. What matters is just that whatever is written to disk is written in a consistent state.
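To make the recommendation above concrete, here is a minimal Python sketch that compares a node's current settings with the three values discussed in this post. The helper names (RECOMMENDED, normalize, settings_to_change) are hypothetical and not part of Galera or MySQL; the input is assumed to be a plain dict of variable names and values, for example copied from the output of SHOW VARIABLES.

```python
# Values recommended in this post for a Galera node (Galera 2.1 or later).
# The helper below is a hypothetical illustration, not a Galera or MySQL tool.
RECOMMENDED = {
    "innodb_doublewrite": "1",              # keep pages consistent across a crash
    "innodb_flush_log_at_trx_commit": "0",  # lost transactions come back from other nodes
    "sync_binlog": "0",
}

# SHOW VARIABLES reports some booleans as ON/OFF, so map those to 1/0.
_ALIASES = {"on": "1", "off": "0"}


def normalize(value):
    """Return a lowercase string form of a setting, mapping ON/OFF to 1/0."""
    text = str(value).strip().lower()
    return _ALIASES.get(text, text)


def settings_to_change(current):
    """Return {name: (current_value, recommended_value)} for settings that differ.

    `current` is a dict of variable name -> value. Missing variables are
    reported with a current value of None.
    """
    diffs = {}
    for name, wanted in RECOMMENDED.items():
        have = current.get(name)
        if have is None or normalize(have) != wanted:
            diffs[name] = (have, wanted)
    return diffs


if __name__ == "__main__":
    old_advice = {
        "innodb_doublewrite": "OFF",
        "innodb_flush_log_at_trx_commit": "0",
        "sync_binlog": "0",
    }
    for name, (have, wanted) in settings_to_change(old_advice).items():
        print(f"{name}: currently {have}, recommended {wanted}")
```

With the settings I used to recommend (doublewrite off), the only variable reported is innodb_doublewrite, which is exactly the change this post is about.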
Thanks to Monty for spotting this and convincing us about it!
Galera commit sequence
On slide 23 you will find a diagram of the commit sequence in Galera. It is wrong! (I copied this pretty much from Vadim, so Percona guys should pay attention too!)
A correct diagram is found in the Galera documentation:
The only communication between nodes is to send the transaction (the "write set") to the "group". (This means it is sent to all nodes, but the correct way to think about it is to treat "the group" as a single entity. Hence, group communication.) "The group" then responds by giving the transaction its unique Global Transaction ID. After this point, there is no more communication between nodes. Therefore, the main function of group communication, in addition to the actual replication of write sets from one node to another, is to impose a global (cluster-wide) ordering on all transactions.
The next step is called certification. This means checking whether the transaction can be committed or not. This is a purely deterministic operation, hence nodes do not "agree" or otherwise communicate with each other anymore.
If certification passes, the transaction is now committed. This is the point of logical commit in Galera.
Finally, the transaction is committed to InnoDB and also to the binary log, if that is enabled. On the slave, it is first applied to InnoDB, then committed.
When the master's InnoDB commit returns, control is returned to the client.
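The sequence above can be sketched as a toy model in Python. This is a simplified illustration under my own assumptions, not Galera's actual implementation: the class names (WriteSet, Group, Node) are hypothetical, the Global Transaction ID is reduced to a plain sequence number, and the certification rule used here (fail if a write set that was ordered after the transaction's snapshot touched the same keys) is a simplification. What it does show is that once the group has fixed the order, every node reaches the same verdict on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    origin: str       # node that executed the transaction
    last_seen: int    # highest seqno the transaction had seen when it ran
    keys: frozenset   # rows (keys) the transaction modified


class Group:
    """Stand-in for group communication: its only job here is global ordering."""

    def __init__(self):
        self.next_seqno = 1
        self.log = []  # (seqno, write set), identical for every node

    def broadcast(self, ws):
        seqno = self.next_seqno
        self.next_seqno += 1
        self.log.append((seqno, ws))
        return seqno


class Node:
    def __init__(self, name):
        self.name = name
        self.certified = []  # (seqno, keys) of write sets that passed certification
        self.committed = []  # seqnos in local commit order

    def certify(self, seqno, ws):
        # Deterministic: depends only on the ordered stream, never on other nodes.
        for other_seqno, other_keys in self.certified:
            if other_seqno > ws.last_seen and other_keys & ws.keys:
                return False
        self.certified.append((seqno, ws.keys))
        return True

    def deliver(self, seqno, ws):
        if self.certify(seqno, ws):
            self.committed.append(seqno)  # stands in for the InnoDB/binlog commit


def run_demo():
    group = Group()
    nodes = [Node("a"), Node("b"), Node("c")]
    pending = [
        WriteSet("a", 0, frozenset({"row1"})),
        WriteSet("b", 0, frozenset({"row1"})),  # conflicts with seqno 1
        WriteSet("c", 0, frozenset({"row2"})),
        WriteSet("b", 1, frozenset({"row1"})),  # already saw seqno 1, no conflict
    ]
    for ws in pending:
        group.broadcast(ws)
    # Each node processes the same ordered log independently.
    for node in nodes:
        for seqno, ws in group.log:
            node.deliver(seqno, ws)
    return [node.committed for node in nodes]


if __name__ == "__main__":
    print(run_demo())
```

In this model the second write set fails certification on all three nodes and the others commit in the same order everywhere, without the nodes exchanging a single message after the broadcast.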
An important consequence of how all of this works is that even though the writes to InnoDB and the binary log are not synchronous between nodes, they are forced to happen in the same order on every node. This means that if you combine two Galera clusters with an asynchronous MySQL replication link, it is possible to do so-called "channel failover": if a node in the asynchronous replication fails, you can safely continue replicating via another node as long as you figure out the right binlog position to start from. It also guarantees that point-in-time recovery from the binlog will work.
Thanks to Oli of FromDual for relentlessly pushing us to understand this behavior.