Failover is evil
In the movie The Matrix there is a scene where the heroes visit a spiritual counselor, and among the people in her waiting room they see a little boy, dressed like a Buddhist monk, who can bend a spoon just by looking at it. When they ask him what he does to bend the spoon, the boy's answer is: "There is no spoon." And if you watch the movie to the end, you will see that he is right. (In that spirit, if this post is too long for you to read, just skip to the last paragraph for the answer.)
The title of this blog post is of course inspired by Baron's "Is automated failover the root of all evil?", which is a commentary on GitHub's detailed explanation of their recent Pacemaker-induced downtime. Baron asks a good question, but the answer is deeper than the question suggests. The problem is not the automation, the problem is the failover.
Many on Planet MySQL have joined the discussion. I have all of them open in browser tabs. Baron's and Robert's posts still come to the conclusion that automated failovers are perhaps hard and weird, but not evil - Robert, of course, is the developer of one such solution. Yoshinori is also known for an automated failover solution, but at the end of his post he makes a surprising admission: "actually I have used MHA many more times for manual failover than automated failover."
After this, Peter, Jeremy and Daniel make a strong case for manual failovers. Robert comes back agreeing with Peter. Ronald plugs Tungsten for planned failovers (i.e. maintenance); I don't know if that counts in either column.
So manual failover wins by what? At least 5 to 1? It is the post by Peter that really makes the case for why this is: statistics. MySQL systems really fail less than once a year. (Even then it is not really MySQL, but disk crashes and such.) We also know that our beloved clustering suites tend to produce false positives, or more severe problems (both of which happened at GitHub), all too often. When that happens they either create a big mess (if you use MySQL replication), or in the best case just induce unnecessary downtime because the failover takes too much time (if you use DRBD). In other words, the very system you put in place to prevent downtime is the main cause of downtime! When I joined MySQL in 2008 I learned this from Kristian Köhntopp, immediately understood the logic, and have advocated manual failovers ever since.
Yet I do sympathize with all of those who keep trying. The whole point of software engineering is to automate manual tasks. Saying that manual failover is the best you can get is simply not satisfying.
I learned this at Nokia a year ago. I had evaluated Pacemaker for a week and gave a very strong recommendation against it. (I wish I had been born as diplomatic as Robert, but quite frankly, nothing of what happened at GitHub was a surprise to me. And I could give you a list of more things that can go wrong with Pacemaker that didn't yet happen at GitHub, but that would need another blog post...) But sure enough, there was a project that insisted on using Pacemaker despite my recommendations. Why? Because their project requirements included that they must use a clustering solution.
Faced with such invincible arguments, I changed tactics and offered to help them set up Pacemaker. My help consisted of taking the Pacemaker agents designed to start and stop mysqld processes and replacing them with dummy scripts. It sounds silly, but I worked long nights to do this favor for them :-) Now, when Pacemaker wanted to start a MySQL node, the agent would do nothing and simply return success or failure depending on some checks. The agent to stop a MySQL node would do absolutely nothing, leave mysqld running, yet always return success. All code touching MySQL replication was removed. The project took my "enhanced" agents and runs in production with them today. Project management was satisfied that another requirement was delivered.
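To give a flavor of how little such a dummy agent has to do, here is a minimal sketch of the idea (my own illustration, not the actual scripts from that project, which were never published). A Pacemaker OCF resource agent is just an executable that receives an action like start, stop or monitor as its argument and answers with an exit code:

```python
#!/usr/bin/env python
# Minimal sketch of a "do nothing, report success" resource agent
# (hypothetical illustration, not the actual scripts from the story).
# Pacemaker calls the agent with an action and reads the exit code:
# 0 = OCF_SUCCESS, 1 = OCF_ERR_GENERIC, 7 = OCF_NOT_RUNNING.
import socket
import sys

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

def mysqld_reachable(host="127.0.0.1", port=3306):
    """Stand-in health check: is anything listening on the MySQL port?"""
    try:
        s = socket.create_connection((host, port), timeout=2)
        s.close()
        return True
    except socket.error:
        return False

action = sys.argv[1] if len(sys.argv) > 1 else "monitor"

if action == "start":
    # Don't actually start mysqld; report success or failure based on a check.
    sys.exit(OCF_SUCCESS if mysqld_reachable() else OCF_ERR_GENERIC)
elif action == "stop":
    # Do absolutely nothing, leave mysqld running, always report success.
    sys.exit(OCF_SUCCESS)
elif action == "monitor":
    sys.exit(OCF_SUCCESS if mysqld_reachable() else OCF_NOT_RUNNING)
else:
    # All actions touching MySQL replication were simply removed.
    sys.exit(OCF_SUCCESS)
```

The point being: Pacemaker was perfectly happy supervising a cluster it had no actual power over.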
Out of everyone who blogged on this topic this week, Peter was really the only one to strongly point to the real answer. Which is: There is no spoon.
I came to realize this after my encounter with Pacemaker, and this is also the takeaway from GitHub's experience. You see, we used to think that all the problems with MMM were only because it was written in Perl, by a kid who even himself regrets it now. Those who used Heartbeat - such as Kristian Köhntopp's customers - also knew all too well the problems with false positives. But fear not: Pacemaker is designed to be the replacement for Heartbeat. It was created by Ericsson engineers from the telecom world and (at least once) endorsed by Red Hat. It has to be good! But it gets better: if you find that Pacemaker is hard to twist around MySQL replication (which it is; it wasn't designed for MySQL), then there is this Pacemaker-based solution for you: Percona Replication Manager. Its design was conceived over a long dinner in London by Florian Haas, the number one Pacemaker guru in the world (at least in the MySQL world), and Yves Trudeau, one of the top MySQL high-availability experts in the world.
My point is this: if there is an automated failover system that you can expect to work, then it is Percona Replication Manager. It just doesn't get better than this.[1] Yet this is the system that failed - spectacularly - at GitHub. It didn't just fail at handling a failure. It was the cause of failure when everything else was fine. And it failed twice on the same day. And it compromised data integrity.
This should be a wake up call to those who still want to design a better automated failover system. If you try to bend the spoon, it won't work.
Luckily, there is an answer to the problem: NoSQL.
Just kidding! But for real, Amazon Dynamo is a very exciting design. It is this design that gave NoSQL systems the reputation of having better availability than relational databases could ever have. It really has nothing to do with the CAP theorem; it's just that the way this system does writes, reads and replication is really clever. The end result, if you are a user of such a system, is that one thing you don't have to worry about is failover. You just write to and read from some nodes that are there, and if they are not there you write to and read from some other nodes. A single node failing is a complete non-issue; it's kind of expected to happen all the time. Hence: There is no failover!
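For those who haven't read the paper, the clever part is quorum replication: with N replicas, a write must be acknowledged by W of them and a read consults R of them, and as long as R + W > N every read set overlaps every write set, so you always see the latest write even with nodes down. Here is a toy sketch of the idea (my own illustration, not code from the Dynamo paper; it leaves out consistent hashing, hinted handoff, vector clocks and everything else):

```python
# Toy sketch of Dynamo-style quorum replication (my own illustration,
# not code from the Dynamo paper).
N, W, R = 3, 2, 2      # 3 replicas, 2 write acks, 2 read answers
assert R + W > N       # read and write sets are guaranteed to overlap

class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)
        self.up = True  # nodes are expected to die all the time

replicas = [Replica() for _ in range(N)]

def put(key, version, value):
    """Write to every replica that answers; succeed once W have acked."""
    acks = 0
    for r in replicas:
        if r.up:
            r.data[key] = (version, value)
            acks += 1
    return acks >= W

def get(key):
    """Collect answers from R replicas and keep the newest version seen."""
    answers = []
    for r in replicas:
        if r.up and key in r.data:
            answers.append(r.data[key])
        if len(answers) == R:
            break
    return max(answers) if answers else None

put("cart", 1, "book")
replicas[0].up = False      # a node dies: a complete non-issue
put("cart", 2, "book+dvd")  # still succeeds with 2 of 3 acks
print(get("cart"))          # -> (2, 'book+dvd')
```

Note what is missing: there is no step anywhere that promotes a node or redirects traffic. A dead replica is simply skipped.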
I had been aware of the Dynamo paper for some years. After giving up on Pacemaker I thought about it again. I realized that all these problems are really about the pain of managing the failover process. What if there was no failover?
But the quorum consistency of the Dynamo design is only possible to implement for key-value stores, not for a general-purpose RDBMS (...at least I'm pretty sure; if you want proof, ask a mathematician). Still, I came to realize that we are lucky in the MySQL world to have not one but two systems that can do the same: MySQL NDB Cluster and Galera Cluster. What these two have in common is that they are synchronous active-active clusters.
Note that MySQL semi-synchronous replication doesn't count, because it does not give you an active-active cluster. Oracle RAC is the only other active-active system I know about, and it too will guarantee your data integrity in failures (no split brain, since there is only one copy of the database), but RAC failovers tend to take a minute or so, so from an HA point of view it is really no better than MySQL with DRBD. (I know Sybase is good at replication, but I don't know enough to say whether it could match Galera or NDB.)
And if you think about it, synchronous replication is equivalent to an Amazon Dynamo system where you write to all nodes and read from one node. So in a way NDB and Galera are a special case of Dynamo, but with SQL.
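In the quorum notation from above, this is just the degenerate configuration where every write goes everywhere and a read needs only one node, which trivially satisfies the overlap condition:

```latex
W = N,\ R = 1 \quad\Rightarrow\quad R + W = N + 1 > N
```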
When you use NDB or Galera, the recommended setup is to use them together with a load balancer, preferably the load balancer that comes with your JDBC or PHP (mysqlnd) drivers. (...because they can catch errors immediately as they happen in the MySQL client protocol; a proxy load balancer with polling is just not the same ...it's not Dynamo.) With this kind of architecture you can stop worrying, just write to any node that is available, and now... say it with me: There is no failover!
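Seen from the application side, the whole architecture boils down to something like this (a hypothetical sketch of what the driver-side load balancers do internally; not actual JDBC or mysqlnd code):

```python
# Hypothetical sketch of driver-side load balancing against a synchronous
# active-active cluster (Galera or NDB) - not actual JDBC/mysqlnd code.
# Every node accepts reads and writes, so a dead node is simply skipped.
import random

NODES = ["node1:3306", "node2:3306", "node3:3306"]  # hypothetical node list

def execute(query, connect):
    """Try the nodes in random order; the first one that answers wins."""
    last_error = None
    for node in random.sample(NODES, len(NODES)):
        try:
            conn = connect(node)        # raises on a dead node
            return conn.execute(query)  # any node works: data is synchronous
        except ConnectionError as error:
            last_error = error          # no failover, just try the next node
    raise last_error                    # only if the whole cluster is gone
```

There is no master to promote and no state to move; "handling" a node failure is one loop iteration.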
PS: As it happens, I will be speaking on a very relevant topic at MySQL Connect in San Francisco a week from now: Evaluating MySQL High-Availability Alternatives. It won't go as deep as I have today, but previous audiences have appreciated my detailed explanation of the trade-offs you make with each alternative. Unfortunately I won't be able to attend Percona Live New York, because on Monday I have this day job where they need me, but I am proud to be talking about XtraBackup Manager at Percona Live London later this year.
[1] OK, to be perfectly honest, if I were forced to use one of these, I would trust MHA and Tungsten more than anything related to Pacemaker. I have never evaluated those products; this is based purely on knowing the people who wrote them.