Notes on memory-centric data management
I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as
DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)
By way of contrast:
Hybrid memory-centric DBMS is our term for a DBMS that has two modes:
- In-memory.
- Querying and updating (or loading into) persistent storage.
These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to become more purely in-memory. (But if it has, what happened to all the previous disk-based users??)
Two other sources of confusion are:
- The broad variety of memory-centric data management approaches.
- The over-enthusiastic marketing of SAP HANA.
With all that said, here’s a little update on in-memory data management and related subjects.
- I maintain my opinion that traditional databases will eventually wind up in RAM.
- At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
- Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
- It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
- I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
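The strategy amounts to trading disk durability for replication: a write counts as committed once enough independent nodes hold it in RAM, on the theory that all of them crashing before any one persists the data is unlikely. A hedged sketch of that commit rule, with invented class and function names:

```python
class Replica:
    """A node that holds data only in RAM."""

    def __init__(self, name):
        self.name = name
        self.memory = {}

    def apply(self, key, value):
        self.memory[key] = value
        return True     # acknowledgment: the write is in this node's RAM


def replicated_write(replicas, key, value, ack_quorum=2):
    """Commit once at least `ack_quorum` replicas hold the write in RAM.

    Durability rests on independent failure: losing the data requires
    every acknowledging node to crash before any of them persists it.
    """
    acks = sum(1 for r in replicas if r.apply(key, value))
    return acks >= ack_quorum   # True means the write is committed
```

Real systems layer failure detection, ordering, and eventual persistence on top of this, but the commit decision itself is just the quorum count shown here.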
And finally,
- I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
Comments
There’s no mention of Shark in Mike Olson’s blog post on Spark. (Or the previous one on SQL, Impala and Hive.)
Do you know more about Cloudera’s intentions for SQL-on-Spark than are described in those posts?
Julian,
I was being sloppy, and mixed up Spark and Shark. I’ll edit my post accordingly. Thanks for the catch!
Thanks for catching and making that correction, guys. You’re right — we are big on Spark, but our SQL efforts are concentrated in a direction other than Shark.
To elaborate a little bit: We are convinced that SQL is such an important way to get at data that it makes sense to create a dedicated engine aimed exclusively at SQL query execution. That’s Impala.
Spark allows you to run a wide variety of processing and analytic workloads. The Shark code basically parses SQL queries and creates Spark jobs to execute them, much as Hive parses SQL queries and creates MapReduce jobs to execute them.
That implementation works for sure. We just believe that Impala, specialized for SQL execution, gets to make design decisions that a general-purpose engine can’t, so that it will win the benchmarks over the long term.
I think that, over the long term, the winning solution will be the one that provides:
1. Tight integration with (mostly in-memory) data sources.
2. Full Hive support: UDF, UDAF, UDTF and MapReduce inside the SQL for complex analytics.
Impala does not look like a winner here.
Hi,
I’m a bit surprised, to say the least, by your statement on HANA vs Hadoop. Can you elaborate more on this without giving away any confidential information?
Does it have any relation with the (very recent) partnership between SAP and Hortonworks (http://hortonworks.com/blog/sap-hana-hadoop-a-perfect-match/; looks much like its other partnership with Teradata)? Or could it be a mere coincidence?
Thanks 🙂
Thomas,
Absolutely coincidence.
The thought was triggered by one particular — confidential — survey in which the numbers for Hadoop and HANA happened to come out identical. Beyond that, I have a sense of production Hadoop figures based on various companies’ information. And something resembling HANA numbers gets published.
Great Post as always Curt!!
Can you please comment on Splicemachine?
Thanks, John.
However, I can’t comment on Splice Machine. I haven’t bothered punching past the annoyance of multiple ignored emails.
There can be many flavors of IMDB, and many interpretations of the term.
1) Is eliminating physical I/O the sole goal of an IMDB? If I have 1 TB of data and 1 TB or more of cache, so that all the data is resident in memory, can this be classified as one flavor of IMDB?
2) Are reduced durability (i.e., removing expensive log writes) and removing expensive async I/O mandatory requirements for classifying a system as a true IMDB?
- With reduced durability, there will be challenges with persistence.
- Some form of async I/O is needed for persistence.
I think that with RHEL 6.x, addressable memory can go up to 3 TB, and it costs well under six figures. I would say this covers 90% or more of the database population in the world, effectively turning most databases into some form of IMDB.
Thanks for the details. It would have been surprising to have a real correlation between the figures, at least here in Europe (France), where we don’t hear much about HANA — yet.