How Galera does Rolling Schema Upgrade, really
This post is about a fairly technical detail of how Galera works. I'm writing it down in preparation for testing this feature so that I can agree with Alex whether to file a bug or not. I'm sharing it on my blog just in case someone else might benefit from learning this.
Galera 2.0 introduces rolling schema upgrades. This is a new way to do non-blocking schema changes in MySQL.
As the name suggests, it is done as a rolling upgrade. Having seen clusters doing rolling upgrades before, I assumed this is what happens:
- Execute alter table on Node 1.
- Node 1 is removed from the cluster and stops processing transactions.
- Node 1 completes alter table.
- Node 1 re-joins cluster and catches up so that it is in sync.
- Execute alter table on Node 2.
- Node 2 is removed from the cluster and stops processing transactions.
- Node 2 initiates and completes alter table.
The result is that at any single point in time, only one node is unavailable but the rest of the cluster is working just fine while you are executing the alter table.
When testing this feature with our team, we found that while Node 1 was processing the alter table, we were still able to select from it. Normally this should not be possible if a node is disconnected from the cluster. Galera by default will prevent you from reading stale data. This is a very good feature of Galera. So we filed bug 966679.
Discussing this with Alex at the Galera BoF in Santa Clara, I learned that the above is not at all a correct understanding of what Galera is doing. Possibly our test is therefore flawed and there might not be a bug after all. This is now my current understanding of what Galera is doing:
- Execute alter table on Node 1 for table T.
- Node 1 continues to be part of the cluster and apply other updates too, as long as possible. Basically, anything that isn't updating table T can be applied.
- If other nodes update table T (which they can, since there is no alter table ongoing on those) then this event cannot be replicated and replication will halt.
- Node 1 eventually completes alter table.
- If replication was stopped, Node 1 will now catch up.
- Execute alter table on Node 2 for table T.
So once again the Galera team has gone to great lengths in trying to provide a really nice user experience and do something far beyond my expectations. (And my expectations are not low.)
So, with this knowledge, we have to test RSU again and be much more detailed in how the test is setup. Something like this:
- Setup Galera cluster with 3 or more nodes. Configure all nodes to use RSU (non-default).
- Create tables t1 and t2. Insert N records into both. (N is large enough so the alter table will last a while.)
- Alter table t2. For the remaining steps, assume that the alter table is executing on Node 1, and we do inserts on Node 2.
- Launch a for loop doing inserts into t1, on Node 2.
- Select from both t1 and t2 on Node 1. This should succeed. (Selects on Node 2 and 3 should also succeed.)
- Launch a for loop doing inserts into t2, on Node 2.
- Select from both t1 and t2 on Node 1. Should this succeed or fail? See discussion below. *
- Stop the for loops.
- Wait until alter table is finished on Node 1.
- Select from both t1 and t2 on Node 1. This should succeed. Selects on Node 2 and 3 should also succeed. All nodes should again be in sync (such as SELECT COUNT(*) returning the same value)
- Repeat above by executing alter table on Nodes 2 and 3
So what should Node 1 return at the point I marked with '*'? In my view, it should return Unknown error since replication is not happening anymore. Alternatively, it may continue to return a normal (but stale) result. This I consider a bug. We will see.