Are there any remaining reasons to put new OLTP applications on disk?
Once again, I’m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include:
- 100s of gigabytes of data at first, growing to >1 terabyte over time.
- High peak loads.
- Public cloud portability (but they have private data centers they can use today).
- Simple database design — not a lot of tables, not a lot of columns, not a lot of joins, and everything can be distributed on the same customer_ID key (see the sketch after this list).
- Stream the data to a data warehouse, which will grow to a few terabytes. (Keeping only one year of OLTP data online actually makes sense in this application, but of course everything should go into the DW.)
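To illustrate what single-key distribution buys, here is a hypothetical sketch of essentially all the routing logic a sharding layer needs when every table carries customer_ID. The class and names are mine for illustration, not dbShards' or anyone else's.

```java
import java.util.List;

// Hypothetical shard router, assuming every table in the schema carries
// customer_ID: hashing that one key is enough to route any statement.
public class ShardRouter {
    private final List<String> shardJdbcUrls; // one JDBC URL per MySQL shard

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    /** Map a customer to a shard; floorMod keeps the index non-negative. */
    public String shardFor(long customerId) {
        int shard = Math.floorMod(Long.hashCode(customerId), shardJdbcUrls.size());
        return shardJdbcUrls.get(shard);
    }
}
```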
So I’m leaning toward saying:
- They should go with a scalable, MySQL-based solution.
- Lots of third-party software works with MySQL, in case that’s helpful.
- Yes, any one vendor is small and not yet firmly established, but there are numerous vendors around with interesting MySQL scaling stories.
- In a vendor emergency, just going with Oracle’s MySQL stuff would probably work …
- … especially because there are these lovely things in the world called solid-state drives.
- There’s also good escapability if one wants to move away from MySQL, because everybody knows how to handle MySQL data.
- The first product to look at is dbShards, because it meets all the topology needs:
- Local scale-out (transparent sharding).
- Local high availability.
- Remote disaster recovery (work on the details is underway).
- The first analytic DBMS to look at is Infobright.
- Yes, I know Infobright is focused more on machine-generated data these days, but this client’s analytic needs are so straightforward Infobright should pass with flying colors.
- The MySQL-to-MySQL aspect should make ETL dead simple.
- Again, there’s escapability.
Mainly, this is all fine. But I’m getting pushback on the solid-state aspect, for fear that it will compromise public cloud portability.
Am I missing something here? As far as I’m concerned, if you’re planning an OLTP system with a many-year lifespan today, of course you should assume solid-state storage. Maybe you scale out just as far as you would with disk, striping indexes or entire databases across the RAM of multiple servers. In that case, having solid-state backing reduces the risk of bottlenecks. Maybe you don’t scale out as far as you would with disk. In that case, solid-state backing saves you money.
As for public-cloud support for solid-state storage, that’s coming fast, right? (Actually, I have data points in support of that theory, but they’re a bit tenuous.) A large fraction of web businesses with private data centers seem to be using solid-state storage — from Facebook on down — or so the NoSQL/NewSQL/short-request DBMS guys tell me. Surely a number of public cloud vendors are close behind.
13 Responses to “Are there any remaining reasons to put new OLTP applications on disk?”
Curt,
The cost of ETL is usually not a function of whether the source and destination databases are the same platform. It’s usually a cost of the logical transformation needed from the source schema to the destination schema. If your schemas are identical, ETL is free (or nearly so), independent of source and destination database types.
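For example, with identical schemas the "T" drops out and the load is a straight column-for-column copy. A hypothetical JDBC sketch (connection setup omitted; table and column names are made up):

```java
import java.sql.*;

// Minimal sketch of "identical schemas make ETL nearly free": the load is
// a plain column-for-column copy with no transformation logic at all.
public class IdenticalSchemaCopy {
    public static void copyOrders(Connection oltp, Connection dw) throws SQLException {
        try (Statement read = oltp.createStatement();
             ResultSet rs = read.executeQuery(
                 "SELECT customer_id, order_id, total FROM orders");
             PreparedStatement write = dw.prepareStatement(
                 "INSERT INTO orders (customer_id, order_id, total) VALUES (?, ?, ?)")) {
            while (rs.next()) {
                write.setLong(1, rs.getLong(1));
                write.setLong(2, rs.getLong(2));
                write.setBigDecimal(3, rs.getBigDecimal(3));
                write.addBatch();
            }
            write.executeBatch(); // one batched load, no transformation step
        }
    }
}
```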
As for sharding things by customer_ID: do all their large tables have this column? (Something like a product catalog, for example, may not.)
Finally, on disk vs. flash: I would really love to understand more. If SSDs were not an option, how would you change the above architecture?
@Curt
You’ve described a good use case for VoltDB.
Thanks.
Fred
Fred,
I don’t think VoltDB’s programming model would work well in this situation.
Anon,
On the ETL side, you are in essence right. But what I didn’t say is that there’s some doubt as to exactly what workload should go against the OLTP database and what should go against the analytic one. Using the “same” DBMS for both simplifies changing one’s mind.
We’ll see. As of yesterday, I’m not wholly convinced a true analytic DBMS has to be in the mix at all.
As for the SSD part — right now the app runs on a single SMP box, running a name-brand DBMS. That, we’re assuming, is unsustainable, and certainly doesn’t transition well to the public cloud. So scale-out is the obvious way to go.
But MySQL scale-out vendors are small — small enough to create non-trivial vendor risk. It’s already easier for me to recommend them because there are a number of them, so if one disappoints, you can port to another. But using vanilla MySQL with SSDs is another nice option to have.
Conversely, while the “scale-out” vendors have advantages even without splitting your database — at least dbShards and Schooner do — if you can get away from splitting and just use better storage, that might be more cost-effective. And, as always, cost matters.
As for the data distribution on a single customer_ID key — off the top of my head, I haven’t thought of ANY join that would be done other than on that key. If I’m missing something, I’m pretty confident it would involve tiny tables, easily replicated. Indeed, the SQL vs. NoSQL choice is not wholly obvious here — even if I am pretty sure that SQL is actually the way to go.
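For the record, here is the kind of query shape I mean, under the assumption that the large tables are all distributed on customer_ID and the tiny tables are replicated to every shard (names are hypothetical):

```java
// Hypothetical query shapes. With the shard key in the join and the
// predicate, the whole query resolves on a single shard.
String coLocatedJoin =
    "SELECT o.order_id, o.total, c.name " +
    "  FROM orders o JOIN customers c ON o.customer_id = c.customer_id " +
    " WHERE o.customer_id = ?"; // shard key present => single-shard query

// A tiny reference table is simply copied to every shard, so joining
// against it is also local.
String replicatedJoin =
    "SELECT o.order_id, s.label " +
    "  FROM orders o JOIN status_codes s ON o.status = s.code " +
    " WHERE o.customer_id = ?";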
Sounds like a good fit for ScaleDB. My sources say SSDs will soon be widely available at public clouds.
Drizzle (a fork of MySQL) would be a good option to look at, if your OLTP database doesn’t require Stored Procedures, Triggers or Views. It has advantages over MySQL in the area of concurrency.
Drizzle
@Curt
FYI, we have recently added JDBC support, so developers can now choose to use that or our Java stored procedure interface (or both in combination).
That said, my prior comment was more about the use case you described than the specific client opportunity. Apologies if that was unclear.
Fred
Zman,
I haven’t heard much lately about Drizzle, other than grumbling as to its momentum (technical, adoption, whatever).
I really would like there to be healthy MySQL front-end clones, given how I sometimes encourage people to use specialized MySQL engines and the like. But I haven’t heard much favorable rumbling from Drizzle or the Monty Project for a while.
Fred,
Is VoltDB’s performance the same (or very similar) using both interfaces?
Fred,
More to the point, have you relaxed the essential limitations of the VoltDB programming model, such as perfect commutativity? If I understand correctly, that’s what the Java-stored-procedure-only programming model was supposed to enforce.
@Curt
While there is some interesting academic research on commutativity and H-Store style systems, VoltDB has never made commutativity a requirement. Our transactions can perform an arbitrary mix of SQL and computation with few restrictions.
Fundamentally, we require transactions to have a single round trip between the client app and the database. This enables our serialized, no-waits execution model. It requires deterministic transactions and necessitates stored procedures. We leverage this determinism extensively to achieve the same state in parallel on all database nodes, as well as to allow for deterministic replay of transactions in the event of a global failure.
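To make that concrete, here is a minimal sketch of the shape of a VoltDB Java stored procedure: one client round trip, with all SQL queued and executed inside the transaction. The table and procedure names are hypothetical, not from any shipped schema.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical single-round-trip transaction: the client makes one call,
// and all SQL runs deterministically inside the procedure.
public class GetCustomerOrders extends VoltProcedure {

    // Statements are declared up front so they can be planned ahead of time.
    public final SQLStmt getOrders = new SQLStmt(
        "SELECT order_id, total FROM orders " +
        "WHERE customer_id = ? ORDER BY order_id;");

    public VoltTable[] run(long customerId) throws VoltAbortException {
        voltQueueSQL(getOrders, customerId);
        return voltExecuteSQL(true); // final batch; results return to the client
    }
}
```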
To your original question, VoltDB’s core programming model has not changed in 2.0. While you can do ad hoc database operations via JDBC, the way to achieve uber-high throughput and scale is by executing Java stored procedures via your chosen client API (JDBC or a number of other available language interfaces). So, to clarify my prior comment, we added a JDBC interface in 2.0 to lower the bar for developers who are familiar with that API, as well as to support third party monitoring, reporting and developer tools.
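And a minimal sketch of invoking that procedure from the Java client API (host and procedure names are again hypothetical):

```java
import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class CallExample {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("localhost"); // hypothetical host

        // One synchronous round trip: the whole transaction runs server-side.
        ClientResponse resp = client.callProcedure("GetCustomerOrders", 42L);
        VoltTable orders = resp.getResults()[0];
        System.out.println(orders.toString());

        client.close();
    }
}
```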
I hope this information is helpful.
Fred
Thanks, Fred. That clears up a lot!