Failover is evil

In the Matrix movie there is a scene where the heroes visit a spiritual counselor, and amongst the people in her waiting room they see a little boy, dressed like a Buddhist monk, who can bend a spoon just by looking at it. When they ask him what he does to bend the spoon, the boy's answer is: "There is no spoon". And if you watch the movie to the end, you will see that he is right. (In that spirit, if this post is too long for you to read, just skip to the last paragraph for the answer.)

The title for this blog post is of course inspired by Baron's "Is automated failover the root of all evil?", which is a commentary on GitHub's detailed explanation of their recent Pacemaker-induced downtime. Baron asks a good question, but the answer is deeper than the question suggests. The problem is not the automation, the problem is the failover.

Many on Planet MySQL have joined the discussion. I have all of them open in browser tabs. Baron's and Robert's posts still come to the conclusion that automated failovers are perhaps hard and weird, but not evil - Robert of course is the developer of one such solution. Yoshinori is also known for an automated failover solution, but at the end of his post he makes a surprising admission: "actually I have used MHA many more times for manual failover than automated failover."

After this, Peter, Jeremy and Daniel make a strong case for manual failovers. Robert comes back agreeing with Peter. Ronald plugs Tungsten for planned failovers (i.e. maintenance); I don't know if that counts in any column.

So manual failover wins by what? At least 5 to 1? It is the post by Peter that really makes the case for why this is: statistics. MySQL systems really fail less than once a year. (Even then it is not really MySQL, but disk crashes and such.) We also know that our beloved clustering suites tend to have false positives, or more severe problems (both of which happened at GitHub), all too often. When that happens they either create a big mess (when you use MySQL replication), or in the best case just induce unnecessary downtime due to the failover taking too much time (if you use DRBD). In other words, the very system you put in place to prevent downtime is the main cause of downtime! When I joined MySQL in 2008 I learned this from Kristian Köhntopp, immediately understood the logic, and have advocated manual failovers ever since.

Yet I do sympathise with all of those who keep trying. The whole point of software engineering is to automate manual tasks. Saying that manual failover is the best you can get is simply not satisfying.

I learned this at Nokia a year ago. I had evaluated Pacemaker for a week and gave a very strong recommendation against it. (I wish I had been born as diplomatic as Robert is, but quite frankly, nothing of what happened at GitHub was a surprise to me. And I can give you a list of more things that can go wrong with Pacemaker that didn't yet happen at GitHub, but that would need another blog post...) But sure enough, there was a project that insisted on using Pacemaker despite my recommendations. Why? Because their project requirements included that they must use a clustering solution.

Faced with such invincible arguments, I changed tactics and offered to help them set up Pacemaker. My help consisted of replacing the Pacemaker agents designed to start and stop mysqld processes with dummy scripts. It sounds silly but I worked long nights to do this favor for them :-) Now, when Pacemaker wanted to start a MySQL node, the agent would do nothing and simply return success or failure depending on some checks. The agent to stop a MySQL node would do absolutely nothing, leave mysqld running, yet always return success. All code touching MySQL replication was removed. The project took my "enhanced" agents and runs in production with them today. Project management was satisfied that another requirement was delivered.
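
For illustration only, here is a minimal sketch of what such a neutered agent could look like. This is my own reconstruction in Python rather than the actual scripts (which were ordinary OCF shell agents); the exit codes are the standard OCF ones, everything else is hypothetical.

```python
#!/usr/bin/env python
# Hypothetical sketch of a "do nothing" Pacemaker resource agent.
import socket
import sys

OCF_SUCCESS = 0        # standard OCF exit codes
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

def mysqld_reachable(host="127.0.0.1", port=3306):
    """Cheap liveness check: can we open a TCP connection to mysqld?"""
    try:
        socket.create_connection((host, port), timeout=2).close()
        return True
    except OSError:
        return False

def main():
    action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
    if action == "start":
        # Don't actually start anything; just report whether mysqld happens to be up.
        return OCF_SUCCESS if mysqld_reachable() else OCF_ERR_GENERIC
    if action == "stop":
        # Leave mysqld running and always claim success.
        return OCF_SUCCESS
    if action == "monitor":
        return OCF_SUCCESS if mysqld_reachable() else OCF_NOT_RUNNING
    # Everything else, including anything touching replication, is a no-op.
    return OCF_SUCCESS

if __name__ == "__main__":
    sys.exit(main())
```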

Out of everyone who blogged on this topic this week, Peter was really the only one to strongly point to the real answer. Which is: There is no spoon.

I came to realize this after my encounter with Pacemaker, and this is also the takeaway from GitHub's experience. You see, we used to think that all the problems with MMM were only because it was written in Perl, by a kid, who even himself regrets it now. Those who used Heartbeat also knew all too well the problems with false positives (such as Kristian Köhntopp's customers). But fear not: Pacemaker is designed to be the replacement for Heartbeat. It was created by Ericsson engineers from the telecom world and (at least once) endorsed by Red Hat. It has to be good! But it gets better: If you find that Pacemaker is hard to twist around MySQL replication (which it is; it wasn't designed for MySQL) then there is this Pacemaker-based solution for you: Percona Replication Manager. Its design was conceived during a long dinner in London by Florian Haas, the number one Pacemaker guru in the world (at least in the MySQL world), and Yves Trudeau, one of the top MySQL high-availability experts in the world.

My point is this: If there is an automated failover system that you can expect to work, then it is Percona Replication Manager. It just doesn't get better than this.[1] Yet this is the system that failed - spectacularly - at GitHub. It didn't just fail in handling a failure. It was the cause of failure when everything else was fine. And it failed twice on the same day. And it compromised data integrity.

This should be a wake up call to those who still want to design a better automated failover system. If you try to bend the spoon, it won't work.

Luckily, there is an answer to the problem: NoSQL.

Just kidding! But for real, Amazon Dynamo is a very exciting design. It is this design that gave NoSQL systems the reputation of having better availability than relational databases could ever have. It really has nothing to do with the CAP theorem; it's just that the way this system does writes, reads and replication is really clever. The end result, if you are a user of such a system, is that one thing you don't have to worry about is failover. You just write to and read from whichever nodes are there, and if they are not there, you write to and read from some other nodes. A single node failing is a complete non-issue; it's kind of expected to happen all the time. Hence: There is no failover!
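
To make the read/write mechanics concrete, here is a toy, in-memory sketch of Dynamo-style quorum writes and reads. It is my own illustration, not code from Dynamo or any NoSQL product; the node names, versioning and replica selection are simplified stand-ins for the real thing.

```python
class QuorumStore:
    def __init__(self, nodes, n=3, w=2, r=2):
        assert r + w > n, "overlapping quorums require R + W > N"
        self.nodes = {name: {} for name in nodes}   # node -> {key: (version, value)}
        self.n, self.w, self.r = n, w, r

    def _replicas(self, key):
        # Pick the N preferred replicas for this key (real Dynamo walks a hash ring).
        return sorted(self.nodes, key=lambda name: hash((key, name)))[: self.n]

    def put(self, key, value, version):
        acks = 0
        for name in self._replicas(key):
            self.nodes[name][key] = (version, value)   # a real replica might be down
            acks += 1
        return acks >= self.w          # the write succeeds once W replicas acknowledge

    def get(self, key):
        # Ask an R-sized read quorum; because R + W > N it overlaps every write quorum.
        answers = [self.nodes[name][key]
                   for name in self._replicas(key)[: self.r]
                   if key in self.nodes[name]]
        return max(answers)[1] if answers else None    # newest version wins


store = QuorumStore(["a", "b", "c", "d", "e"], n=3, w=2, r=2)
store.put("user:42", "hello", version=1)
print(store.get("user:42"))   # prints "hello"
```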

I had been aware of the Dynamo paper for some years. After giving up on Pacemaker I thought about it again. I realized that all these problems are really about the pain of managing the failover process. What if there was no failover?

But as such, the quorum consistency of the Dynamo design is only possible to implement for key-value stores and not for a general-purpose RDBMS. (...at least I'm pretty sure; if you want proof, ask a mathematician.) Still, I came to realize we are lucky in the MySQL world to have not one but two systems that can do the same: MySQL NDB Cluster and Galera Cluster. What these two have in common is that they are synchronous active-active clusters.

Note that MySQL semi-synchronous replication doesn't count, because it does not give you an active-active cluster. Oracle RAC is the only other active-active system I know about, and it too will guarantee your data integrity in failures (no split brain, since there is only one copy of the database), but RAC failovers tend to take like a minute, so from an HA point of view it is really no better than MySQL with DRBD. (I know Sybase is good at replication, but I don't know enough to say whether they could match Galera or NDB.)

And if you think about it, then synchronous replication is equivalent to an Amazon Dynamo system where you write to all nodes and read from one node. So in a way NDB and Galera are a special case of Dynamo, but with SQL.
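
As a quick sanity check of that equivalence (this is my own back-of-the-envelope framing, not something quoted from the Dynamo paper), the usual quorum overlap condition holds trivially in this special case:

$$ W = N,\quad R = 1 \;\;\Rightarrow\;\; R + W = N + 1 > N $$

so whichever single node you read from is guaranteed to have seen every committed write.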

When you use NDB or Galera, the recommended setup is to use them together with a load balancer, preferably the load balancing that comes with your JDBC or PHP mysqlnd drivers. (...because they can catch errors immediately as they happen in the MySQL client protocol; a proxy load balancer with polling is just not the same ...it's not Dynamo.) With this kind of architecture you can stop worrying, just write to any node that is available, and now... say it with me: There is no failover!
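
Here is a minimal sketch of the "write to any node that is available" idea expressed as a driver-level retry loop. The connect() callable and the node list are hypothetical placeholders; in practice you would rely on the load balancing built into Connector/J or the PHP mysqlnd plugins, as described above.

```python
import random

NODES = ["galera1:3306", "galera2:3306", "galera3:3306"]

class NoNodeAvailable(Exception):
    pass

def run_on_any_node(statement, connect):
    """Try the cluster nodes in random order until one accepts the statement."""
    for node in random.sample(NODES, len(NODES)):
        try:
            conn = connect(node)              # hypothetical driver call
            try:
                return conn.execute(statement)
            finally:
                conn.close()
        except ConnectionError:
            continue                          # node is gone: no failover, just move on
    raise NoNodeAvailable("no cluster node accepted the connection")
```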

PS: As it happens, I will be speaking on a very relevant topic at MySQL Connect, San Francisco a week from now: Evaluating MySQL High-Availability Alternatives. It won't be as deep as I've been today, but previous audiences have appreciated my detailed explanation of the trade-offs you make with each alternative. Unfortunately I won't be able to attend Percona Live New York, because on Monday I have this day job where they need me, but I am proud to be talking about XtraBackup Manager at Percona Live London later this year.

[1] OK, to be perfectly honest, if I were forced to use one of these, I would trust MHA and Tungsten more than anything related to Pacemaker. I have never evaluated those products; this is just based on knowing the people who wrote them.

I have a conundrum for you, Henrik. If automated failover does not work, why do you and Peter Z both feel that Galera is a good solution? Galera and Dynamo both use automated failover, though it is usually called maintaining quorum. Instead of promoting a new master it kicks one or more nodes out to ensure the remaining group is consistent. The effect is the same as for master/slave--to preserve overall availability and consistency you make a decision to reconfigure the cluster and shut down one or more hosts.

Failure handling is a (the?) classic problem of distributed computing. The basic mechanisms of groups, leader election, quorums, witness hosts, fencing, as well as algorithms like Paxos, that solve distributed consensus in industrial systems have been around for decades. They are hard to implement correctly but function well provided you do not ask them to do too much. Lest anyone be in doubt, multi-master systems of all types (including Dynamo and its descendants) run into terrible problems when they mess this up.

In summary I don't get the argument that multi-master clusters can automate failure handling while master/slave clusters can't. They are both solving the same underlying problem. A lot of the discussion on this topic is confusing bad implementations (of which there are unfortunately many in the MySQL world) with the actual state of the art.

To some degree yes, but in general I don't agree. In the above I of course talk about external clustering systems, which try to automate the failover between a MySQL master and slave - or any other similar master-slave system. In this approach I don't think it is about the implementations being poor; I agree with Peter: for most users the statistics are against you. Such systems will create more trouble than they prevent. The most useful application of these is to automate the steps that are executed, while the execution itself is still triggered manually.

First, any HA system is of course designed to survive failures. But I don't think it is true that every method of surviving a failure can be called failover. Both in Dynamo and in Galera all nodes are equal; the situation is completely symmetric. So you can't say that something has failed *over* from somewhere to somewhere else. In the Dynamo case a node failing is literally a non-event. There is absolutely nothing that needs to be done when a node is gone. (...other than trying to add it back or replace it with a new node.) So clearly it is not appropriate to talk about failover in the case of Dynamo. The protocol is designed such that a single node failure can be survived without having to take any action at all.

In the case of Galera it is almost the same. Every node is equal. There is nothing to fail over from somewhere to somewhere else. Unlike in Dynamo, there actually is a small timeout before the surviving nodes agree to continue without the failed node. But like I said, this is the same as what would have to happen if you used Dynamo with W=N and R=1. Even if there is a small action that kind of happens here, there isn't really a risk of split brain or diverging nodes. Everything that happens is well defined and completely deterministic. Most importantly, from the outside you don't have any risk of writing to a node that should have been read-only, or whatever. The system doesn't allow you to shoot yourself in the foot like that.

Now, in the case of NDB you kind of have a point. NDB data nodes actually are a master-slave system. Each partition has one primary replica and the others are completely passive replicas. When a node fails, a passive partition replica must become the primary. But also here the steps that are taken to do this internal failover are part of the NDB wire protocol. The sequence of steps that are taken is well defined, and all nodes are aware of what is going on. There is not a single commit that could slip by somewhere, on the wrong node, because the failover and the commits are part of the same protocol. And most importantly, just like in Galera, for the external user it is an active-active system. Clients don't need to be informed about the fact that they must now connect somewhere else and absolutely cannot connect to some other place. To the outside world, the failover that has happened internally in the NDB data nodes is again a complete non-event.

When you talk about the state of the art - this is it, and it doesn't involve failover. There is no way you can achieve the same by taking standalone, non-clustered database nodes and trying to bring them together with some external middleware solution. You can never be sure that somewhere, on some node, some commits didn't slip by when you weren't there to prevent it. Similarly, no matter how much polling you do, you can never know whether some commits somewhere failed between your polls; you'll never even know it happened. It's just not comparable.

Since you are the creator of Tungsten, it is worth pointing out that I think Tungsten has a great architecture, which allows it to be used for many interesting use cases. It is of course perfect for things like heterogeneous replication, and business-wise you are lucky that this is increasingly a relevant use case. Also, as has been said, when you replicate across many data centers, I don't see that there necessarily needs to be any automated data-center failover in place. When the challenge is just to move data around in complex ways, Tungsten again seems to be a great alternative. And so on. But if we are talking purely about HA, then I don't think it can ever compete in robustness with a solution where the clustering is tightly built into the database, and in particular I'm talking about active-active systems.

Hi Henrik,

Here's a point-by-point rebuttal. For anybody reading this please understand that I am not criticizing Galera, which promises to be a great solution because it is (a) theoretically sound and (b) written by people who understand that theory.

1.) My point about failover is that in each case deciding to reconfigure the system requires distributed consensus. The weak point in implementing systems like Galera tends to be group communications. If they work and are fast you are good. However, anyone who works with these systems knows that it is devilishly hard to get to that point, because networks at a low level are violent and surprising. Projects like Postgres-R did not get it right and failed. I myself have worked with JGroups for years. The bugs that you get at this level are extremely hard to diagnose and fix; they can cause split brain or complete system failure. Hit one and you are down for the count. (Disclaimer: We use group communications in Tungsten hence are subject to the same problem at least for management decisions.)

2.) To solve master/slave failover you need to add connectivity. That's lacking in most HA solutions and is a primary reason why they fail. We added this in Tungsten right from the start. It's one of the real strengths of the architecture. We don't use VIPs, because they cannot be trusted--the connectivity is controlled by our management layer, which in turn is based on group communications. In MM systems you still need to ensure apps don't try to write to a read-only or otherwise failed node. It is a lesser problem if the DBMS locks itself, of course. (I like that feature!)

3.) You are right that external failover management mechanisms have to deal with a lot of housekeeping issues that you can avoid if protocols are baked into the DBMS. The delayed commit problem is a good example. I wrote the code in Tungsten that deals with that by killing off *all* non-Tungsten logins before failover. Before that...yes, some commits got through.

4.) The biggest problem that Galera promises to solve is synchronous replication. That's very attractive going forward as semi-sync replication never really lived up to its promise in the mainline builds.

What this comes down to is a trade-off: we (mostly me) felt that we could help users build robust systems using off-the-shelf DBMS. Sure there are problems. But on the flip side you have plain old InnoDB which is darned reliable and rarely loses data. It also has all the benefits of fully functional SQL, which MySQL Cluster and NoSQL systems like Dynamo lack. InnoDB has limitations but it's the devil most people know, so those problems are not a surprise.

Personally I have never seen database failure as the only or even the most significant problem for building applications on MySQL. That was the real point of my article about maintenance, which you cite (http://scale-out-blog.blogspot.com/2012/09/database-failure-is-not-bigg…). It's easy to see that planned maintenance, not failover, is responsible for 90% of downtime in many systems. Taking things up yet another level the biggest problem I see for MySQL going forward is lack of baked-in sharding a la MongoDB. I would rather take the big system view and work on those problems. The solutions benefit more people.

In the medium to long term I predict that our work on Tungsten will merge with the benefits of Galera. The fact that we work outside enables us to delegate to the DBMS things that it handles well once they are accessible and reliable enough. While waiting for new capabilities to emerge we have implemented a boatload of successful applications.

Robert, as always, I really appreciate your comments, so you have nothing to worry about. I wonder if you might be the most senior of everyone around here when it comes to HA and/or replication solutions?

1) I think what has been amazing with Galera is how well it performs. In all of this we have mainly focused on reliability, avoiding downtime and avoiding data integrity issues. Clearly, a synchronous replication solution can solve those problems, but to actually have one without a huge performance penalty, so that it's practical to actually use... that's really awesome.
Other than that, implementing NDB, Galera or any such system is of course difficult, and an implementation could have bugs that lead to data integrity problems even if in theory the system does not suffer from integrity issues.

2) I remember you have some kind of connection proxy (well, Ronald highlights it too in his recent post) but I don't know enough about it to say anything intelligent. But yes, my gut feeling is you are better off this way. (You might even get relatively close to a built-in solution for all I know.)

"5") For practical purposes it seems many have liked Tungsten for the approach you took. They are conservative DBA's and like the fact they can deploy some external middleware around their trusted MySQL and InnoDB. In terms of going to market (in particular the DBA market) it seems like a wise path to take - despite criticism from people like me :-)

"6") I agree with your vision of what we should copy from NoSQL next. Sharding, but also a HTTP JSON api (see Ulf Wendel on that one).

Henrik, allow me to rectify a few things since you mention me and Yves by name in the context of PRM. I recall only one long dinner I've ever had with Yves in London, at last year's PLUK, and PRM was definitely not "conceived" there. Yves (and possibly others) had been working on it for quite some time, and one of the things we discussed there was how we could integrate their work into the upstream resource-agents repository. We also chatted about dogs, our families, the French language and many other things, if you care to know. I also recall very vividly that no-one else was at our table, and unless you are in possession of a cloaking device that renders you invisible, I guess you weren't present either.

Also:

Pacemaker is designed to be the replacement for Heartbeat.

No it's not. And you know that.

It was created by Ericsson engineers from the telecom world and (at least once) endorsed by Red Hat.

Where did you come up with Ericsson? To the best of my knowledge Ericsson has never made any significant contributions to Pacemaker.

I'll respond in a little more detail in a separate blog post in a couple of days.

Hi Florian

I recall only one long dinner I've ever had with Yves in London, at last year's PLUK...

That is the dinner I'm referring to. I was sitting at the table next to yours, and the following morning I asked you what the conclusion was. To me your conversation with Yves was significant, because you were able to explain to him (and then to me) the Pacemaker hooks and states one needs to use to correctly collect the binary log and relay log positions on all nodes, so that one can then safely fail over to the correct node (or not, if the cluster state is such that a safe failover isn't possible at all). The things you need to do to make it correct are quite complex, and until you explained all of these hooks I didn't even realize some of them existed and could certainly not by myself have understood how to combine them to get the necessary failover sequence. Until that morning I was convinced that Pacemaker simply could not be used to handle MySQL replication, but you explained that it was (at least in theory) possible. And yes, this was the direct result of your conversation with Yves.

If there was a version of PRM before that, it had certainly not been announced (on mysqlperformanceblog) and also it would not have been fit for purpose without the information Yves got from you that evening. This is also true for all other Pacemaker agents for MySQL replication that I've seen out in the wild: they don't do everything that is needed to safely fail over a MySQL replication cluster. So this is why I still claim that you are the only person in the world who understands Pacemaker well enough to use it for a MySQL replication cluster.

No it's not. And you know that.

It is if you read http://www.linux-ha.org/wiki/Main_Page

Where did you come up with Ericsson?

The heart of a Pacemaker cluster is Corosync. Which is derived from OpenAIS. Which is a product of the SA Forum. Which was founded primarily by Ericsson together with other telco companies.

And before you come to educate me that Pacemaker and Corosync are different pieces of software, that's yet another reason I consider this whole monster way too complex for an average DBA to survive. It's a quiltwork of historical baggage, which could have been done much simpler.

It is if you read http://www.linux-ha.org/wiki/Main_Page

Did you actually bother to read that page? There is nothing in there that says Pacemaker is or was a replacement for Heartbeat.

The heart of a Pacemaker cluster is Corosync. Which is derived from OpenAIS. Which is a product of the SA Forum. Which was founded primarily by Ericsson together with other telco companies.

Pacemaker doesn't even have a common codebase with Corosync; it just uses Corosync for cluster communications. Are you honestly saying Unity comes from MIT, only because X was originally written there?

And before you come to educate me that Pacemaker and Corosync are different pieces of software, that's yet another reason I consider this whole monster way too complex for an average DBA to survive.

That's effectively the equivalent of saying something lunatic, and then plugging your ears and singing "la la la I'm not listening" instead of facing a rebuttal.

If I may say so, this discussion here is just getting silly. I've written up my own take on the GitHub issue, to get back to a more technical discussion.

Pacemaker actually began life as part of the Heartbeat codebase, so it's much more than theoretically possible to run Pacemaker+Heartbeat.
Support for OpenAIS/Corosync was added about 4 years after Pacemaker development started.

The reason you don't hear about it much these days is that Heartbeat development basically stopped and Corosync came to be seen as the future (if for no other reason than that it continues to be actively maintained).

Thanks for this great blog post; nice analogy with the fabulous Matrix movie wrapped in here. I'm sure The Matrix can easily inspire more bloggers in the future too. In a sense, we are all building our own Matrix here.

I must correct some confusion from Robert in his comments above. First, the division between failover and non-failover replication systems is not about master-slave vs multi-master replication. Robert himself advertises the commercial Tungsten as a "real multi-master" solution, and yet he has to struggle with failover. The key to the spoonless world is synchronous replication, and partially also managing the cluster's internal consensus. As a result, each node in the cluster is a full representative of the database at all times, hence there is no need for failover of any kind.

A Group Communication System (GCS) is an effective abstraction for maintaining consensus in a DBMS replication group. A large number of distributed computing systems use a GCS of some sort; the latest exciting member to join this league is Google Spanner. It is a strong asset of Galera that it utilizes a GCS, and not only one: Galera has a GCS framework which allows different GCS implementations to be used as plugins.

Just to be clear the issue I'm pointing to is distributed consensus, which is common to all systems that maintain consistent copies of data across nodes. Tungsten does not use the same algorithm as Galera but we do use group communications. Both systems get in trouble if they mess it up. I have made that argument enough times that it's not useful to repeat it again. For people who disagree I'm content to wait until they encounter the next bug that brings their system down. Then we can talk.

Speaking of Tungsten, our clusters are master/slave and proud of it. We can configure standalone async multi-master replication for cross site operation, which I have pointed out elsewhere is a better approach than synchronous multi-master when operating over a WAN (http://scale-out-blog.blogspot.com/2012/08/is-synchronous-data-replicat…). Async multi-master is spoonless in another, arguably better, way since it can tolerate prolonged network outages, which are common between sites. There are major opportunities for innovation here and we plan to take advantage of them in Tungsten.

There are also, as you say, a number of systems like Spanner and Calvin that are working to provide what amounts to synchronous replication, even over long distances. I have yet to see one of these systems work for relational databases; it's a very difficult problem when you have high network latency and transactions, something that Jim Gray pointed out close to 20 years ago. I'm skeptical that Galera will be able to do better. However, there is a lot of clever research that is re-opening these issues in ways that did not seem possible a few years back and I know that you Codership guys are very focused on this problem.

Meanwhile, I'm happy to see that Galera is getting traction on synchronous multi-master replication for local clusters. This has the potential to be a major contribution to MySQL that will allow it to compete very well against other store types. If you nail it we will all toast your success.
