When should analytics be in-memory?
I was asked today for rules or guidance regarding “analytical problems, situations, or techniques better suited for in-database versus in-memory processing”. There are actually two kinds of distinction to be drawn:
- Some workloads, in principle, should run on data to which there’s very fast and unfettered access — so fast and unfettered that you’d love the whole data set to be in RAM. For others, there is little such need.
- Some products, in practice, are coupled to specific in-memory data stores or to specific DBMS, even though other similar products don’t make the same storage assumptions.
Let’s focus on the first part of that — what work, in principle, should be done in memory?
Please note that:
- (Almost) anything you can do in-memory can also be done without the whole data set being in RAM. It’s all a matter of performance.
- If all your data fits into RAM, that’s great, and you can leave it there.
- A lot depends on how you manage data in memory.
Thus, the choice whether or not to do something entirely in memory usually isn’t a simple one; even in theory, it depends on metrics such as database size, hardware capacity and the like, as well as on the specific approaches used to manage data in the in-memory and on-disk alternatives.
The two biggest exceptions that come to mind are:
- Some algorithms rely on fairly random access to the data. Those are typically best started by putting the whole data set into memory. In particular, approaches to relationship analytics and graph processing tend to be fundamentally in-memory, for example in the case of Yarcdata. (See the sketch after this list.)
- Some workloads need such low latency there’s no time to write the data to disk before analyzing it. This is the core use case for complex event/stream processing.
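To make the random-access point concrete, here is a minimal Python sketch (the graph and node names are invented; this is not a description of how Yarcdata or any particular product works). A breadth-first traversal visits neighbors in an unpredictable order, so every edge lookup is effectively a random access; with the adjacency structure held in an in-memory dict, each hop is a cheap lookup, whereas the same traversal against data on disk would pay a seek per edge.

```python
from collections import deque

# Toy adjacency list held entirely in RAM; the graph is invented.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def reachable_within(graph, start, max_hops):
    """Breadth-first search: each hop touches neighbors in an unpredictable
    order, so every edge lookup is effectively a random access."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:          # random access into the structure
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen

print(reachable_within(graph, "alice", 2))    # {'alice', 'bob', 'carol', 'dave', 'erin'}
```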
To be more specific, let’s look at everybody’s two favorite kinds of analytics — business intelligence (BI) and predictive modeling — which haven’t yet been well-integrated. Two of the best reasons for putting BI in memory are probably:
- You want to keep drilling down on the result set of your original query. That makes a lot of sense; you shouldn’t have to go to disk each time.
- You want to keep trying new visualizations of substantially the same data set. Ditto. (A small sketch of both patterns follows this list.)
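Here is a small pandas sketch of both patterns (the DataFrame and column names are invented): the original result set is cached in RAM, and each drill-down or alternative view is just another aggregation of the same in-memory data, with no further trips to the database.

```python
import pandas as pd

# Hypothetical result set of the original query, kept in RAM as a DataFrame.
result = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100.0, 80.0, 120.0, 60.0, 90.0],
})

# First view: revenue by region (no trip back to the database).
by_region = result.groupby("region")["revenue"].sum()

# Drill-down: revenue by region and product, computed from the same in-memory data.
by_region_product = result.groupby(["region", "product"])["revenue"].sum()

# A different view of substantially the same data: top products overall.
top_products = result.groupby("product")["revenue"].sum().sort_values(ascending=False)

print(by_region, by_region_product, top_products, sep="\n\n")
```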
Some BI vendors, especially the visualization-intensive ones, address these needs via proprietary in-memory data stores. Others just create temporary in-memory databases from reports and other query result sets. Either can work.
Two other reasons to do BI in-memory are:
- You really want low latency. So far this is a fairly niche use case, but it’s one that could well grow.
- You just don’t think your DBMS is fast enough. That one has led to considerable marketing hype; while Oracle may not be fast enough, at least at an affordable cost, in many cases an analytic RDBMS alternative would do a great job.
Reasons such as “Prebuilding aggregates is annoying; in-memory lets you use the raw data” are often just disguised forms of the performance argument.
Finally, in the case of predictive modeling, I find it hard to separate the in-memory question from parallelism. The default way of doing predictive modeling (sketched in code after this list) is:
- Run a big query.
- Put the result set in RAM.
- Model on it.
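A minimal sketch of that default workflow, assuming a hypothetical SQLite file warehouse.db with a customers table (the schema, column names, and choice of scikit-learn are all illustrative):

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 1: run a big query. The database file, table, and columns are invented.
conn = sqlite3.connect("warehouse.db")
df = pd.read_sql_query("SELECT age, income, churned FROM customers", conn)
conn.close()

# Step 2: the result set now lives entirely in RAM as a DataFrame.
X = df[["age", "income"]]
y = df["churned"]

# Step 3: model on it.
model = LogisticRegression().fit(X, y)
print(model.coef_)
```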
The biggest problems with that plan seem to occur when:
- The data is on many scale-out nodes.
- The extracts don’t really fit well into RAM. (One chunked workaround is sketched below.)
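When the extract doesn’t fit comfortably in RAM, one workaround (not the only one) is to stream it in chunks and train incrementally rather than loading everything at once. A sketch against the same hypothetical warehouse.db schema, using scikit-learn’s out-of-core SGDClassifier:

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import SGDClassifier

conn = sqlite3.connect("warehouse.db")          # hypothetical warehouse file
model = SGDClassifier(loss="log_loss")          # logistic regression, trained incrementally

# Stream the extract in chunks so the full result set never has to fit in RAM.
chunks = pd.read_sql_query(
    "SELECT age, income, churned FROM customers", conn, chunksize=100_000
)
for chunk in chunks:
    X = chunk[["age", "income"]]
    y = chunk["churned"]
    model.partial_fit(X, y, classes=[0, 1])     # incremental (out-of-core) learning
conn.close()
```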
That said, I don’t currently have much to add to what I wrote about the in-memory/in-database trade-off for parallel predictive modeling back in April, 2011.
Comments
All analytic computations run in memory. The distinction between traditional analytics, in-database analytics and the new category of “in-memory analytics” is the size of the memory.
Take SAS, for example. Legacy SAS software runs in memory; when the size of the data set exceeds memory, SAS swaps back and forth between memory and disk, which causes a performance hit. SAS’ new in-memory product (HPA) runs on a box with a large enough memory to avoid swapping on large problems.
Of course, since the HPA box isn’t large enough to hold the entire data warehouse, you’re still moving data around. (Teradata’s largest Model 700 maxes out at 40TB uncompressed).
By eliminating data movement, in-database analytics deployed in an MPP architecture always trump “in-memory” analytics when two conditions are true:
(1) The data is already in the database (because it’s your warehouse)
(2) The analytics problem is embarrassingly parallel. This is always true for scoring and is true for the most commonly used analytics. Case-Based Reasoning, on the other hand, is not embarrassingly parallel, and is an appropriate application for pure in-memory analytics. (A toy sketch of parallel scoring follows.)
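As a toy illustration of why scoring is embarrassingly parallel (the coefficients, column names, and partitions are invented, and multiprocessing here merely stands in for MPP nodes): each row is scored independently, so partitions can be processed with no communication between them.

```python
import math
from multiprocessing import Pool

# Illustrative coefficients from a previously trained logistic-regression model.
COEF = {"intercept": -2.0, "age": 0.03, "income": 0.00001}

def score_row(row):
    """Score one row; scoring touches no other rows."""
    z = COEF["intercept"] + COEF["age"] * row["age"] + COEF["income"] * row["income"]
    return 1.0 / (1.0 + math.exp(-z))

def score_partition(rows):
    # Each partition is scored independently: the problem is embarrassingly parallel.
    return [score_row(r) for r in rows]

if __name__ == "__main__":
    # Stand-ins for rows held by separate database nodes or partitions.
    partitions = [
        [{"age": 34, "income": 52_000}, {"age": 51, "income": 87_000}],
        [{"age": 29, "income": 43_000}],
    ]
    with Pool() as pool:
        print(pool.map(score_partition, partitions))
```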
SAS has produced no evidence to date that in-memory analytics outperform in-database analytics on comparable problems. The claimed performance benefits (100x) are similar to those reported by users of in-database analytics.
For me it comes down to flexible interactive ad hoc querying. This is the sweet spot of analytics. You can’t beat the combination of human intelligence and experience working with an apparatus that can answer virtually any question in less than a second. This is when the magic happens.
But it’s not just about memory. It’s about overall latency. Namely, never leaving the bus to begin with, not even to go to the network. In other words, scale-out doesn’t work here. This is why they still build supercomputers, and why the Cray was always designed to be cylindrical: for minimum latency.
Associative tools like QlikView and PowerPivot simply will not work when there is latency.
On a related note, I believe we reached an inflection point in 2009 when Windows 7 was released. That was the first time Microsoft really got the 64-bit OS right, and 64-bit became the de facto standard, leading to abundant, cheap memory. For example, I can now buy a server with 2 TB of RAM for < $200k.
Once you've been working with post-OLAP technologies like QlikView and PowerPivot, it becomes pretty obvious that OLAP is turning into a niche technology (possibly suitable for EPM, but that's about it).
We've seen this play out before; those who recall the slow but steady transition from pre-relational hierarchical (e.g. IMS) and network (e.g. IDS) databases will recognize the pattern. Unfortunately, in the B2B world, these changes are glacial.