Notes on memory-centric data management
I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as
DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)
By way of contrast:
Hybrid memory-centric DBMS is our term for a DBMS that has two modes:
- In-memory.
- Querying and updating (or loading into) persistent storage.
These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to become more purely in-memory. (But if it has, what happened to all the previous disk-based users??)
Two other sources of confusion are:
- The broad variety of memory-centric data management approaches.
- The over-enthusiastic marketing of SAP HANA.
With all that said, here’s a little update on in-memory data management and related subjects.
- I maintain my opinion that traditional databases will eventually wind up in RAM.
- At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
- Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
- It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
- I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
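The strategy amounts to trading disk durability for replication: a write counts as committed once enough independent nodes hold it in RAM, on the theory that all of them crashing before any one persists the data is unlikely. A hedged sketch of that commit rule, with invented class and function names:

```python
class Replica:
    """A node that holds data only in RAM."""

    def __init__(self, name):
        self.name = name
        self.memory = {}

    def apply(self, key, value):
        self.memory[key] = value
        return True     # acknowledgment: the write is in this node's RAM


def replicated_write(replicas, key, value, ack_quorum=2):
    """Commit once at least `ack_quorum` replicas hold the write in RAM.

    Durability rests on independent failure: losing the data requires
    every acknowledging node to crash before any of them persists it.
    """
    acks = sum(1 for r in replicas if r.apply(key, value))
    return acks >= ack_quorum   # True means the write is committed
```

Real systems layer failure detection, ordering, and eventual persistence on top of this, but the commit decision itself is just the quorum count shown here.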
And finally,
- I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
Comments
There’s no mention of Shark in Mike Olson’s blog post on Spark. (Or the previous one on SQL, Impala and Hive.)
Do you know more about Cloudera’s intentions for SQL-on-Spark than are described in those posts?
Julian,
I was being sloppy, and mixed up Spark and Shark. I’ll edit my post accordingly. Thanks for the catch!
Thanks for catching and making that correction, guys. You’re right — we are big on Spark, but our SQL efforts are concentrated in a direction other than Shark.
To elaborate a little bit: We are convinced that SQL is such an important way to get at data that it makes sense to create a dedicated engine aimed exclusively at SQL query execution. That’s Impala.
Spark allows you to run a wide variety of processing and analytic workloads. The Shark code basically parses SQL queries and creates Spark jobs to execute them, much as Hive parses SQL queries and creates MapReduce jobs to execute them.
That implementation works for sure. We just believe that Impala, specialized for SQL execution, gets to make design decisions that a general-purpose engine can’t, so that it will win the benchmarks over the long term.
I think that, over the long term, the winning solution will be the one that provides:
1. Tight integration with (mostly in-memory) data sources.
2. Full Hive support: UDF, UDAF, UDTF and MapReduce inside the SQL for complex analytics.
Impala does not look like a winner here.
Hi,
I’m a bit surprised, to say the least, by your statement on HANA vs Hadoop. Can you elaborate more on this without giving away any confidential information?
Does it have any relation with the (very recent) partnership between SAP and Hortonworks (http://hortonworks.com/blog/sap-hana-hadoop-a-perfect-match/; looks much like its other partnership with Teradata)? Or could it be a mere coincidence?
Thanks 🙂
Thomas,
Absolutely coincidence.
The thought was triggered by one particular — confidential — survey in which the numbers for Hadoop and HANA happened to come out identical. Beyond that, I have a sense of production Hadoop figures based on various companies’ information. And something resembling HANA numbers gets published.
Great Post as always Curt!!
Can you please comment on Splicemachine?
Thanks, John.
However, I can’t comment on Splice Machine. I haven’t bothered punching past the annoyance of multiple ignored emails.
There can be many flavors of IMDB, and many interpretations of the term.
1) Is eliminating physical I/O the sole goal of an IMDB? If I have 1 TB of data and 1 TB or more of cache, so that all the data is resident in memory, can this be classified as one flavor of IMDB?
2) Are reduced durability (i.e., removing expensive log writes) and removing expensive async I/O mandatory requirements for classifying a system as a true IMDB?
- With reduced durability, there will be challenges with persistence.
- Some form of async I/O is needed for persistence.
I think that with RHEL 6.x, addressable memory can go up to 3 TB, and it costs well under six figures. I would say this covers 90% or more of the database population in the world, effectively turning most databases into some form of IMDB.
Thanks for the details. It would have been surprising to have a real correlation between the figures, at least here in Europe (France), where we don’t hear much about HANA — yet.