The different ways of doing HA in MySQL
A week ago Baron wrote a blog post which can only be described as the final nail in the coffin for MMM. At MySQL AB we never used or recommended MMM as a High Availability solution. I never really asked for the details of why, but surely one reason was that it is based on MySQL replication. At MySQL/Sun we recommended against asynchronous replication as an HA solution, so that was the end of it as far as MMM was concerned. Instead we recommended DRBD, shared-disk or MySQL Cluster based solutions. Of course, to replicate across continents (geographical redundancy) you will mostly just use asynchronous replication; even MySQL Cluster uses the standard MySQL replication for that purpose.
So with the advent of semi-sync replication I had actually thought of MMM again. If the replication itself is now ok for HA purposes, would MMM be ok too, or is it also cr*p in itself? It seems I will now take Baron's advice and forget about MMM. It's not just that I really respect Baron's expertise here: one of the first comments on that post is from the creator of MMM, agreeing! And if that doesn't convince you, others like Shlomi Noach chime in with reports of 4 MMM installations recently failing on him and causing downtime. There is a wise saying that you shouldn't take anecdotes as statistical evidence. But in some cases I think it is safe to rule out a solution after hearing too many anecdotes: if something that claims to be a high availability solution is constantly reported as causing downtime instead of preventing it, then yes, that is reason enough to look elsewhere.
So Heartbeat is commonly used with DRBD to monitor the state of your Primary/Secondary MySQL server pair. Nowadays Pacemaker is recommended over the "legacy" Heartbeat. I commonly hear of MySQL users who have turned off Heartbeat's automatic failover functionality. They may or may not still use the monitoring part to send alerts, but they fail over manually. I often hear about these setups when people come to me asking for advice, half apologizing for what they've had to do, because they had problems with Heartbeat failing over on false positives. I then tell them that I've heard of others doing exactly the same thing and running happily for years with manual failover procedures.
The rationale for manual failovers is the following: Heartbeat, Pacemaker and other similar tools can produce false positives. Your MySQL instances are really fine, but due to some network overload or whatever, a few heartbeat signals are missed and Heartbeat decides to start the failover procedure. The problem is that, especially with DRBD or shared-disk setups, MySQL failover is expensive: the recovery procedures that need to run on the standby node being started up can take some minutes. This leads to the question: how often do you really need automatic failover? How often do you really experience a disk crash, server crash, operating system crash or MySQL crash? Hardware failures happen maybe once every 3 to 5 years. If you have 24/7 staff in your data center anyway, it will not take many minutes until someone gets an alarm and can manually fail over to the standby server (where the failover itself will then take an additional minute or more). On the other hand, if your Heartbeat setup itself is causing failovers even just once a year, it may actually be causing more downtime than it prevents, since each failover in itself takes a minute or more. Add to this the grief of the sysadmins who have to spend time analyzing and guessing why the failover happened, and restoring the old primary server as a standby.
I've recently mentioned the combination of the NoSQL store Voldemort with a MySQL backend that we use at Nokia. Twitter has a similar combo called Gizzard. Voldemort and Gizzard are systems built on the Amazon Dynamo design, featuring eventual consistency / CAP-theorem style redundancy. It's a best-of-both-worlds solution: MySQL for storing the data, with its good single-node performance, monitoring and backups, and a NoSQL layer on top that replaces MySQL replication (providing both scale-out and high availability) and handles client connections.
So the other day I was talking about this general approach, and realized these systems also don't use Heartbeat. Not because they've had problems with it; rather, the whole reason to use Heartbeat never arises! So let's look at this approach a bit more closely, while remembering how much Baron hates MMM and how many MySQL users turn off Heartbeat:
The systems based on Amazon Dynamo use a combined approach to redundancy: clients write to and read from more than one replica of the database, but not all replicas, so some asynchronous replication also happens between database nodes. The asynchronous replication guarantees the eventual consistency of the complete system: if the system sees different versions of the same record, it must be able to figure out the most recent state and choose it for all nodes, leading to a consistent end state.
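That "figure out the most recent state" step can be illustrated with a tiny sketch. Real Dynamo-style systems use vector clocks to order conflicting versions; a single monotonically increasing version counter (my simplification here, not the actual algorithm) is enough to show the idea:

```python
def reconcile(versions):
    """Given (version, value) pairs seen for the same record on
    different nodes, pick the most recent one as the end state that
    should be propagated to all nodes. Dynamo-style systems use vector
    clocks rather than a single counter, but the principle is the same."""
    return max(versions, key=lambda v: v[0])

# Node A last saw version 4 of the record, node B a stale version 3:
reconcile([(4, "new@example.com"), (3, "old@example.com")])
# -> (4, "new@example.com")
```

The winning version is then written back to the stale replicas, which is what drives the cluster toward the consistent end state.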
The formula often cited for these systems is:
R + W > N
...where R = replicas read, W = replicas written, N = total number of replicas.
In a typical implementation you would have:
- 3 replicas (copies of the same data). Having less than 3 is possible within the theory, but doesn't make sense in practice.
- Each time the client writes data (insert or update), the transaction has to succeed on 2 nodes. (majority)
- Each time the client reads data, he must get an identical reply from 2 nodes (a majority). So if the first two reads match, that's it and you can ignore the third. If there's a mismatch, you also have to read the third replica to know which data is current and which is outdated.
- I.e.: R = 2, W = 2, N = 3.
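The rules above can be made concrete with a minimal, purely illustrative Python sketch. The "replicas" are just in-memory dictionaries and the names are my own invention, not any real client library, but the quorum logic is the one described:

```python
N = 3   # total replicas
W = 2   # a write must succeed on W replicas
R = 2   # a read must get matching answers from R replicas (R + W > N)

# Each replica maps key -> (version, value).
replicas = [dict() for _ in range(N)]

def quorum_write(key, value, version):
    """Write until W replicas have acknowledged; the remaining replica
    would be updated asynchronously in a real system."""
    acks = 0
    for node in replicas:
        node[key] = (version, value)
        acks += 1
        if acks == W:
            break
    return acks >= W

def quorum_read(key):
    """Read R replicas; if they disagree (or one is missing the key),
    widen the read to all replicas and let the highest version win."""
    answers = [node.get(key) for node in replicas[:R]]
    if len(set(answers)) > 1:
        answers = [node.get(key) for node in replicas]
    # (version, value) tuples: max() picks the most recent write
    return max(a for a in answers if a is not None)[1]

quorum_write("user:1", "Alice", version=1)
quorum_read("user:1")   # -> "Alice"
```

Note that the write deliberately returns after only 2 of the 3 replicas have the new value: the third replica is momentarily stale, which is exactly the internal inconsistency that the read-side majority check papers over.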
Eventual consistency is often talked about as a radical approach, where applications have to deal with the fact that the database cluster is not in a consistent state, all in the name of the buzzword "web scale". In fact the opposite is true! The above approach actually guarantees that your application will always see a consistent view of the data, despite the fact that the cluster as a whole may internally be in an inconsistent state! Meanwhile, many old school web applications using classic MySQL master-slave replication have to live with the fact that slaves lag behind. With the eventual consistency approach above, applications don't have to design around that: you are guaranteed to get consistent data.
Originally I thought the above approach was used to minimize latency: You get a kind of synchronous replication, without having to wait for the replication between the nodes to happen (such as happens with semi-sync replication or MySQL Cluster, both of which add latency). But as it turns out, the high availability properties of this system are quite interesting too! So let's now see what happens when one node fails:
That was short, but this is an important insight, so in case you weren't listening, I'll repeat:
What happens when one node in this cluster fails?
In the above system, if 1 node fails, 2 writes will still succeed, and that's all we need. When you read data, you'll be able to read back your data from the 2 working nodes - again, that's all we need.
The cluster is symmetric in the sense that there are no master or slave nodes, or primary or secondary nodes. When one node fails, you can continue working with the ones that are still ok, regardless of which ones are working and which one is dead. There is no failover to do. There is no heartbeat interval to wait for. The application is working at full speed even if one node in the database cluster is down.
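A small self-contained sketch (again with purely illustrative, in-memory "replicas") shows why nothing needs to happen on failure: with N = 3, W = 2 and R = 2, losing one node still leaves both quorums satisfiable by the two survivors:

```python
N, W, R = 3, 2, 2
alive = [True, True, False]          # the third replica has just died
stores = [dict(), dict(), dict()]

def write(key, value):
    """Write to every live replica; succeed if W of them acknowledged."""
    acks = 0
    for up, store in zip(alive, stores):
        if up:
            store[key] = value
            acks += 1
    return acks >= W

def read(key):
    """Succeed if R live replicas return the same answer."""
    answers = [s[key] for up, s in zip(alive, stores) if up and key in s]
    if len(answers) >= R and len(set(answers)) == 1:
        return answers[0]
    return None

write("k", "v")   # succeeds: 2 acks from the survivors
read("k")         # returns "v": no failover, no heartbeat wait
```

The application never waited for a failover step; the dead node simply stopped contributing to quorums.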
Of course, you still want to be alerted of the fact that one node is missing, so that you can fix the problem and restore the normal state of having 3 replicas online. After all, when you are down to 2 nodes, that's it: if you lose another one, you are down. (Actually, you could then relax your rules and continue running with a single node, even if the CAP theorem no longer backs you up :-) But in any case, the sysadmin will of course tend to the dead node and fix whatever problem made it go down. This can be done as a non-critical background task that doesn't affect the online status of your application.
With the many anecdotes from field experience of using MMM, Heartbeat and other similar systems, I'm beginning to think we should rethink the whole approach those systems are based on. Possibly the solution is to copy into MySQL (pick your favorite fork...) the functionality of the eventual consistency / Amazon Dynamo based systems I've described here. To do that, we need 1) global transaction IDs, 2) a new kind of eventually consistent replication (that would utilize the global transaction ID, and use similar R + W > N majority rules to know what to replicate and where) and 3) client libraries that can do the above kind of R + W > N writes and reads transparently to the application.
I now expect Stewart to implement all of this by Thursday in a prototype Drizzle branch.