Uninterrupted DBMS operation — an almost-achievable goal
I’m hearing more and more stories about uninterrupted DBMS operation. There are no iron-clad assurances of zero downtime; if nothing else, you could crash your whole system yourself via some kind of application bug. Even so, it’s a worthy ideal, and near-zero downtime is a practical goal.
Uninterrupted database operations can have a lot of different aspects. The two most basic are probably:
- High availability/fail-over. If a system goes down, another one in the same data center is operational almost immediately.
- Disaster recovery. Same story, but not in the same data center, and hence not quite as immediate.
These work with single-server or scale-out systems alike. However, scale-out and the replication commonly associated with it raise additional issues in continuous database operation:
- Eventual consistency. Scale-out and replication create multiple potential new points of failure, server and network alike. Eventual consistency ensures that a single such failure doesn’t take any part of the database down.
- The use of replicas to avoid planned downtime. If you do rolling maintenance, then you can keep a set of servers with the full database up at all times.
Finally, if you really care about uninterrupted operation, you might also want to examine:
- Administrative tools and utilities. The better your tools, the better your chances of keeping your system up. That applies to anything from administrative dashboards to parallel backup functionality.
- Fencing of in-database analytic processes. If you’re going to do in-database analytics, fenced/out-of-process ones are a lot safer than the alternative.
- Online schema changes. Changing a schema in a relational DBMS doesn’t necessarily entail taking the database offline.
Let’s discuss some of those points below.
The core idea of database high availability is straightforward — data gets sent to two or more servers around the same time, which are meant to have identical copies of the database. If one server goes down, the other one takes over. This has been available in productized form for over three decades — think of Tandem Computers. These days, the common form is master-slave — an update goes to one server, which applies it, and also communicates it to slaves. The main distinctions are:
- How is the replication done?
- Are the slaves available to be read, for maximum performance?
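To make that concrete, here is a minimal master-slave sketch in Python against a hypothetical in-memory key-value store; the Master and Replica classes are invented for illustration, and real products differ greatly in how the update stream is shipped and applied.

```python
# Minimal master-slave replication sketch (hypothetical in-memory store).
# The master applies each update locally, then forwards it to every slave.

class Replica:
    """A slave: applies updates it receives and can optionally serve reads."""
    def __init__(self, name, readable=True):
        self.name = name
        self.readable = readable      # whether reads are offloaded here
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        self.data[key] = value        # apply locally first
        for slave in self.slaves:     # then communicate the update to the slaves
            slave.apply(key, value)

    def read(self, key):
        # If a readable slave exists, offload the read for extra throughput.
        for slave in self.slaves:
            if slave.readable:
                return slave.data.get(key)
        return self.data.get(key)

# Usage: one master, two slaves; if the master dies, a slave can take over.
slaves = [Replica("slave-1"), Replica("slave-2", readable=False)]
master = Master(slaves)
master.write("account:42", 100)
assert master.read("account:42") == 100
```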
Exactly the same is true of database disaster recovery. The big differences are:
- The slave databases need to be geographically remote from the master, so that they aren’t taken down by the same disaster.
- Consequently, the replication technique must be sensitive to the latency and unreliability of a wide-area network.
In these scenarios, three main kinds of replication technique come into play:
- Trigger-based/logical. An update event on the master triggers a message to the slave to apply the same update. This is commonly slow.
- Synchronous log-based. The update log is streamed from the master to the slave, which applies the updates accordingly. This is the state of the art for local high availability. Those who claim that log-based replication can’t be synchronous haven’t paid attention to product lines such as Schooner or dbShards.
- Asynchronous log-based. Same thing, but with a different tradeoff in reliability vs. latency; the master doesn’t wait for an acknowledgement from the slave. This is the state of the art for remote disaster recovery.
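To illustrate the synchronous/asynchronous distinction, here is a toy sketch in which ship() is a made-up stand-in for the network; the only point is whether the master waits for the slave’s acknowledgement before declaring the commit done.

```python
# Sketch of synchronous vs. asynchronous log shipping.
# ship() stands in for sending a log record over the network.

import queue, threading

log_channel = queue.Queue()

def ship(record):
    """Pretend network send; a slave thread drains the queue and applies."""
    log_channel.put(record)

def commit_synchronous(record, ack_timeout=1.0):
    # Master waits for the slave's acknowledgement before confirming.
    ack = threading.Event()
    ship((record, ack))
    if not ack.wait(ack_timeout):
        raise RuntimeError("slave did not acknowledge; commit blocked")
    return "committed on master and slave"

def commit_asynchronous(record):
    # Master confirms immediately; the slave applies the record later.
    # Lower latency over a WAN, but the slave can lag behind the master.
    ship((record, None))
    return "committed on master; slave will catch up"

def slave_loop(database):
    while True:
        record, ack = log_channel.get()
        database[record["key"]] = record["value"]   # apply the update
        if ack is not None:
            ack.set()                               # acknowledge to master

slave_db = {}
threading.Thread(target=slave_loop, args=(slave_db,), daemon=True).start()
print(commit_synchronous({"key": "k1", "value": 1}))
print(commit_asynchronous({"key": "k2", "value": 2}))
```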
An alternative to master-slave replication is active-active, aka multi-master. Active-active replication is much like master-slave, except that two (or more) databases can each accept updates and then try to update each other. Ultimately, in such a scheme, there needs to be programmer and/or administrator intervention to resolve conflicts that may arise between dueling updates. Active-active is good for uptime insofar as there’s always a copy of the system in the process of taking updates. But it’s bad for uptime insofar as it adds complexity and consequent risk of failure.
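As a rough illustration of the conflict problem, here is a toy last-writer-wins scheme; real active-active products may instead use vector clocks, application-defined merge rules, or manual resolution, and all names below are invented.

```python
# Sketch of active-active conflict resolution via last-writer-wins.
# Each master accepts writes and propagates them to its peer; when the same
# key is updated on both sides, the later version wins on every master.

import itertools

_clock = itertools.count(1)   # stand-in for a real timestamp/version source

class ActiveMaster:
    def __init__(self, name):
        self.name = name
        self.data = {}                   # key -> (version, value)
        self.peers = []

    def write(self, key, value):
        stamped = (next(_clock), value)
        self._merge(key, stamped)
        for peer in self.peers:          # propagate to the other master(s)
            peer._merge(key, stamped)

    def _merge(self, key, stamped):
        current = self.data.get(key)
        if current is None or stamped[0] > current[0]:
            self.data[key] = stamped     # last writer wins

a, b = ActiveMaster("site-A"), ActiveMaster("site-B")
a.peers, b.peers = [b], [a]
a.write("price", 10)
b.write("price", 12)                     # the later write wins on both sides
assert a.data["price"][1] == b.data["price"][1] == 12
```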
When it is impossible or overly expensive to manage data on a single server, data stores can take a scale-out (more or less aka shared-nothing) approach. Examples may be found at least in the worlds of OLTP SQL, analytic SQL, NoSQL, and Hadoop. In most cases, the data in scale-out stores has a replication factor; that is, data written to one server is immediately copied to one or more other servers as well. On a primary data management cluster, the replication factor is usually equal to 2 or 3. But when scale-out replication is integrated with the replication used for wide-area networking, disaster recovery, and/or extreme high availability, the total replication factor can be 4-6, or even more.
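Here is a small sketch of what a replication factor means operationally, assuming invented node names and naive hash placement; real systems typically add consistent hashing, rack awareness, and so on.

```python
# Sketch of replica placement with a configurable replication factor.
# A key is hashed to a position in a list of servers, then copied to the
# next (replication_factor - 1) servers as well.

import hashlib

SERVERS = ["node-0", "node-1", "node-2", "node-3", "node-4"]
REPLICATION_FACTOR = 3

def replicas_for(key, servers=SERVERS, rf=REPLICATION_FACTOR):
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(rf)]

def write(key, value, storage):
    for node in replicas_for(key):            # fan the write out to rf nodes
        storage.setdefault(node, {})[key] = value

cluster = {}
write("user:17", {"name": "Ada"}, cluster)
print(replicas_for("user:17"))                # e.g. three distinct node names
assert sum("user:17" in d for d in cluster.values()) == REPLICATION_FACTOR
```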
Replication factors — or more precisely the replicas themselves — can both hurt and help database reliability. As noted in my recent post on immediate consistency, the classic two-phase commit protocol turns what would be one write on a single server into about 3 network hops and 2 writes, creating all sorts of possibilities for blocking and outages. Eventual consistency — for example of the RYW type discussed in the same post and its comment thread — averts most of that risk, but then you have to deal with an increased chance of data inaccuracy.
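For concreteness, a toy quorum sketch of the read-your-writes idea: with N replicas, acknowledging writes at W of them and reading from R of them such that R + W > N lets a read see the latest acknowledged write without a full two-phase commit. The class and numbers below are invented for illustration.

```python
# Sketch of read-your-writes (RYW) style quorum access: with N replicas,
# writing to W and reading from R such that R + W > N guarantees a read
# sees the latest acknowledged write even if some replicas lag.

N, W, R = 3, 2, 2                      # R + W > N

class QuorumStore:
    def __init__(self, n=N):
        self.replicas = [dict() for _ in range(n)]
        self.version = 0

    def write(self, key, value, w=W):
        self.version += 1
        acked = 0
        for replica in self.replicas:
            replica[key] = (self.version, value)
            acked += 1
            if acked >= w:             # stop once a write quorum has acked;
                break                  # remaining replicas catch up later

    def read(self, key, r=R):
        answers = [rep.get(key) for rep in self.replicas[:r]]
        answers = [a for a in answers if a is not None]
        # Return the freshest version among the read quorum.
        return max(answers)[1] if answers else None

store = QuorumStore()
store.write("balance", 100)
assert store.read("balance") == 100    # RYW holds because R + W > N
```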
In other ways, however, replicas are pure uptime goodness. Coolest, in my opinion, is the potential for rolling maintenance and upgrades, in which you always keep up a subset of servers that hold the complete database. In the case of Metamarkets, for example:
One thing Metamarkets does that’s pretty much a best practice these days is roll out new code, mid-day if they like, without ever taking their system down. Why is this possible? Because the data is replicated across nodes, so you can do a rolling deployment of a node at a time without making any data unavailable. Notes on that include:
- Performance could be affected, as the read load is generally balanced across all the data replicas.
- Data locking is not an issue — Metamarkets doesn’t have any read locks, as Druid is an MVCC (Multi-Version Concurrency Control) system.
At least one dbShards customer has a similar story, and I’m sure I’ve heard the same idea from other folks as well.
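A sketch of the rolling-deployment logic, assuming an invented shard map; the key invariant is that a node is drained only when every shard it holds is also served by some other node.

```python
# Sketch of a rolling upgrade: take nodes out one at a time, but only when
# every shard on that node remains available on at least one other replica,
# so the full database stays online throughout.

cluster = {
    "node-0": {"shard-A", "shard-B"},
    "node-1": {"shard-B", "shard-C"},
    "node-2": {"shard-C", "shard-A"},
}

def safe_to_drain(node, cluster):
    """A node may be drained only if every shard it holds lives elsewhere too."""
    others = set().union(*(s for n, s in cluster.items() if n != node))
    return cluster[node] <= others

def rolling_upgrade(cluster, upgrade):
    for node in list(cluster):
        assert safe_to_drain(node, cluster), f"{node} holds the only copy"
        shards = cluster.pop(node)      # drain: node leaves the serving set
        upgrade(node)                   # deploy new code on the drained node
        cluster[node] = shards          # rejoin before touching the next node

rolling_upgrade(cluster, upgrade=lambda n: print(f"upgraded {n}"))
```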
And finally, I’ll quote myself on another subject as well, namely online schema evolution:
Ever more RDBMS vendors, for example Clustrix, allow schema changes without taking the database down. So does Vertica. Notes on that include:
- It relies on the fact that Vertica uses MVCC (Multi-Version Concurrency Control) rather than doing updates in place. MVCC is pretty much a best practice now, at least for disk storage.
- In particular, a Vertica query locates all the data it needs when it starts running. Database changes made while the query is running don’t affect it. That’s true of ordinary updates and of schema changes alike.
- Schema changes really boil down to dropping and adding columns. Or you can make very simple column changes, such as widening one.
- If you widen a column, that has no effect on the data already in place. (Columnar storage, variable length records — what would there be to change?)
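To illustrate why such changes can be online, here is a toy model of a catalog-plus-columns store with MVCC-style snapshots; the classes are invented and are not Vertica’s or Clustrix’s actual mechanics.

```python
# Sketch of why widening or adding a column can be an online, catalog-only
# operation in an MVCC columnar store: existing column data is untouched, and
# a running query keeps using the schema/data snapshot it started with.

import copy

class Table:
    def __init__(self, columns):
        self.schema = {name: width for name, width in columns}   # catalog only
        self.columns = {name: [] for name, _ in columns}

    def snapshot(self):
        """What a query sees for its whole lifetime (MVCC-style)."""
        return copy.deepcopy(self.columns), dict(self.schema)

    def widen_column(self, name, new_width):
        # Metadata change only: variable-length columnar data needs no rewrite.
        self.schema[name] = new_width

    def add_column(self, name, width, default=None):
        self.schema[name] = width
        n_rows = len(next(iter(self.columns.values()), []))
        self.columns[name] = [default] * n_rows

t = Table([("name", 32), ("city", 16)])
t.columns["name"].append("Ada"); t.columns["city"].append("London")

running_query_view, _ = t.snapshot()   # query starts before the schema change
t.widen_column("city", 64)             # online: no data movement
t.add_column("country", 16)            # online: existing rows get a default
assert "country" not in running_query_view   # in-flight query is unaffected
```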
Bottom line: Uninterrupted database operation is a nearly attainable goal.
That’s a lot, but it’s hardly complete. What other points should I have included, had space permitted?
Comments
I would actually argue that active-active replication does not necessitate conflict resolution. See case (1) from my replication break-down post: http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html
Dan,
Please dumb that down for me. I’m not locating the argument.
I guess I would say that there are (at least) two ways you can have an active-active system without needing a conflict resolution mechanism:
(1) You have a protocol for the active masters to chat with each other about updates (e.g. a distributed lock manager). [This is not recommended due to the major performance issues this often causes.]
(2) You send all updates to all (available) active masters. Each master then performs the update in a deterministic fashion so that it will independently yield the same final result as the other masters. [This only works if you have a good implementation of a deterministic database system, which is an active area of research in my group at Yale.]
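To sketch option (2): if every master receives the same globally ordered batch of updates and applies it with deterministic logic, the masters converge with no cross-talk at all. The toy below illustrates only the principle; it is not Calvin or any particular product.

```python
# Toy illustration of deterministic active-active replication (option 2):
# every replica receives the same globally ordered batch of transactions and
# applies them with deterministic logic, so replicas converge without any
# conflict-resolution traffic between them.

def apply_batch(state, ordered_txns):
    for txn in ordered_txns:          # same order on every replica
        txn(state)                    # each txn must be deterministic
    return state

def deposit(acct, amount):
    return lambda s: s.__setitem__(acct, s.get(acct, 0) + amount)

batch = [deposit("alice", 100), deposit("bob", 50), deposit("alice", -30)]

replica_1 = apply_batch({}, batch)    # each master gets the identical batch
replica_2 = apply_batch({}, batch)
assert replica_1 == replica_2 == {"alice": 70, "bob": 50}
```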
Two ways, neither of them practical at this time. Got it. 🙂
Thanks!
Sending all updates to both systems/masters in a ‘dual active’ setup in the hope that they will ‘independently yield the same final result’ is great in theory.
In practice, this approach necessitates constant checking across the systems to ensure they are in sync. This is the only way the query results can be taken as likely to be accurate, irrespective of which system was accessed by the end user.
The other big issue with dual active is that the ‘time to fix’ window is essentially doubled when issues arise. Fixes need to be fully implemented on the first system before they can be applied to the second system.
Then there’s the cost…
We’ve actually built a system that can run 500,000 transactions a second and is completely deterministic, so there is no need for checking to make sure replicas are in sync. For more, see: http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf
Curt,
I’m interested in how many copy/dup systems people really have to deploy when you consider not just production and backup, but development and QA/test as well, especially performance/stress test, and testing of the primary-backup failover.
E.g., suppose you are a SaaS system, or alternatively the system provides always-on services to a large organization, such that you have to have not only disaster recovery but also the ability to bring a new version online without disrupting the old one, to conduct real performance tests on real data (synchronized how often?) without disrupting the online system, etc. I can imagine a hot production system, a remote production replica, an identical such pair for stress testing/performance, new versions, etc. It all starts to sound pretty expensive. One might play fast and loose and take the dup system offline during less critical periods (over a holiday) for stress testing of performance, and then resynchronize it, but then you are clearly not going to be testing how failover works under heavy load…
More exact copies (a copy of production, and a copy of the replica) seem like the simplest approach from a personnel standpoint (fewer variables).
I’ve always tried to come up with architectures where you could get some economies by using the duplicated system for some QA role as well as for backup of production, but I’ve wondered if I’m trading off complexity the right way.
Can you comment on what you’ve seen?
Mike,
I’m not seeing great sophistication yet. But then, I’m also talking in the greatest detail to the “We really have to get off of Oracle” crowd.
One complication is that national laws about data leaving the country force a wide-area structure on various global companies. But putting aside that increasingly important factor, which is itself a big driver for integrated wide-area replication, I’d say:
Also — if you like avoiding duplication from test, dev, etc. databases, that used to be the Delphix story.
@Daniel – I don’t doubt the technology exists for dual-active transaction processing.
Running complex ETL with external dependencies for key generation etc into two systems is quite another matter.
Also, no matter what assurances the technologists might offer, business users can and will insist that we perform cross-system checking.
As ever, ‘bad’ data silently exists until someone finds it.
The implications of a bank wrongly reporting to the financial authorities are just one example of why this kind of thinking is unlikely to ever go away.
Well done, Curt.
The only thing I would add, for possible color, is one nugget that some readers may infer:
“local datacenter = high availability”
vs
“multi geo datacenter = disaster recovery”
While that is certainly true as a general rule, I just want to put it on the table that many of our customers have an architectural need for:
“multi geo datacenter = high availability”
because they often have “read/write anywhere” types of demands. They start designing straightaway with eventual consistency in mind.
All the best,
Billy
Hi Curt,
Great post of all the different angles to consider.
Just to add to your list, TokuDB, as a plugin for MySQL or MariaDB, also offers Hot Schema Changes (Hot Indexing, Hot Column Addition/Deletion, etc…). See http://goo.gl/Q1InP and http://goo.gl/PXtz5 for more details.
We started offering this last year in response to customers who were happy with MySQL, but started considering schema-less NoSQL solutions as their add-columns on large databases were taking a day or more to complete. We’ve also seen demand on this from folks who expect to have this functionality based on their prior experience (in one case, migrating from DB2).
Cheers,
–Lawrence.