When should analytics be in-memory?
I was asked today for rules or guidance regarding “analytical problems, situations, or techniques better suited for in-database versus in-memory processing”. There are actually two kinds of distinction to be drawn:
- Some workloads, in principle, should run on data to which there’s very fast and unfettered access — so fast and unfettered that you’d love the whole data set to be in RAM. For others, there is little such need.
- Some products, in practice, are coupled to specific in-memory data stores or to specific DBMS, even though other similar products don’t make the same storage assumptions.
Let’s focus on the first part of that — what work, in principle, should be done in memory?
Please note that:
- (Almost) anything you can do in-memory can also be done without the whole data set being in RAM. It’s all a matter of performance.
- If all your data fits into RAM, that’s great, and you can leave it there.
- A lot depends on how you manage data in memory.
Thus, the choice whether or not to do something entirely in memory usually isn’t a simple one; even in theory, it depends on metrics such as database size, hardware capacity and the like, as well as on the specific approaches used to manage data in the in-memory and on-disk alternatives.
The two biggest exceptions that come to mind are:
- Some algorithms rely on fairly random access to the data. Those are typically best started by putting the whole data set into memory. In particular, approaches to relationship analytics and graph processing tend to be fundamentally in-memory, for example in the case of Yarcdata. (See the sketch after this list.)
- Some workloads need such low latency there’s no time to write the data to disk before analyzing it. This is the core use case for complex event/stream processing.
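To make the random-access point concrete, here is a minimal Python sketch (the graph and node names are invented; this is not a description of how Yarcdata or any particular product works). A breadth-first traversal visits neighbors in an unpredictable order, so every edge lookup is effectively a random access; with the adjacency structure held in an in-memory dict, each hop is a cheap lookup, whereas the same traversal against data on disk would pay a seek per edge.

```python
from collections import deque

# Toy adjacency list held entirely in RAM; the graph is invented.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def reachable_within(graph, start, max_hops):
    """Breadth-first search: each hop touches neighbors in an unpredictable
    order, so every edge lookup is effectively a random access."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:          # random access into the structure
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen

print(reachable_within(graph, "alice", 2))    # {'alice', 'bob', 'carol', 'dave', 'erin'}
```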
To be more specific, let’s look at everybody’s two favorite kinds of analytics — business intelligence (BI) and predictive modeling — which haven’t yet been well-integrated. Two of the best reasons for putting BI in memory are probably:
- You want to keep drilling down on the result set of your original query. That makes a lot of sense; you shouldn’t have to go to disk each time.
- You want to keep trying new visualizations of substantially the same data set. Ditto. (A small sketch of both patterns follows this list.)
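Here is a small pandas sketch of both patterns (the DataFrame and column names are invented): the original result set is cached in RAM, and each drill-down or alternative view is just another aggregation of the same in-memory data, with no further trips to the database.

```python
import pandas as pd

# Hypothetical result set of the original query, kept in RAM as a DataFrame.
result = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100.0, 80.0, 120.0, 60.0, 90.0],
})

# First view: revenue by region (no trip back to the database).
by_region = result.groupby("region")["revenue"].sum()

# Drill-down: revenue by region and product, computed from the same in-memory data.
by_region_product = result.groupby(["region", "product"])["revenue"].sum()

# A different view of substantially the same data: top products overall.
top_products = result.groupby("product")["revenue"].sum().sort_values(ascending=False)

print(by_region, by_region_product, top_products, sep="\n\n")
```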
Some BI vendors, especially the visualization-intensive ones, address these needs via proprietary in-memory data stores. Others just create temporary in-memory databases from reports and other query result sets. Either can work.
Two other reasons to do BI in-memory are:
- You really want low latency. So far this is a fairly niche use case, but it’s one that could well grow.
- You just don’t think your DBMS is fast enough. That one has led to considerable marketing hype; while Oracle may not be fast enough, at least at an affordable cost, in many cases an analytic RDBMS alternative would do a great job.
Reasons such as “Prebuilding aggregates is annoying; in-memory lets you use the raw data” are often just disguised forms of the performance argument.
Finally, in the case of predictive modeling, I find it hard to separate the in-memory question from parallelism. The default way of doing predictive modeling (sketched in code after this list) is:
- Run a big query.
- Put the result set in RAM.
- Model on it.
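A minimal sketch of that default workflow, assuming a hypothetical SQLite file warehouse.db with a customers table (the schema, column names, and choice of scikit-learn are all illustrative):

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 1: run a big query. The database file, table, and columns are invented.
conn = sqlite3.connect("warehouse.db")
df = pd.read_sql_query("SELECT age, income, churned FROM customers", conn)
conn.close()

# Step 2: the result set now lives entirely in RAM as a DataFrame.
X = df[["age", "income"]]
y = df["churned"]

# Step 3: model on it.
model = LogisticRegression().fit(X, y)
print(model.coef_)
```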
The biggest problems with that plan seem to occur when:
- The data is on many scale-out nodes.
- The extracts don’t really fit well into RAM. (One chunked workaround is sketched below.)
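When the extract doesn’t fit comfortably in RAM, one workaround (not the only one) is to stream it in chunks and train incrementally rather than loading everything at once. A sketch against the same hypothetical warehouse.db schema, using scikit-learn’s out-of-core SGDClassifier:

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import SGDClassifier

conn = sqlite3.connect("warehouse.db")          # hypothetical warehouse file
model = SGDClassifier(loss="log_loss")          # logistic regression, trained incrementally

# Stream the extract in chunks so the full result set never has to fit in RAM.
chunks = pd.read_sql_query(
    "SELECT age, income, churned FROM customers", conn, chunksize=100_000
)
for chunk in chunks:
    X = chunk[["age", "income"]]
    y = chunk["churned"]
    model.partial_fit(X, y, classes=[0, 1])     # incremental (out-of-core) learning
conn.close()
```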
That said, I don’t currently have much to add to what I wrote about the in-memory/in-database trade-off for parallel predictive modeling back in April, 2011.
Comments
All analytic computations run in memory. The distinction between traditional analytics, in-database analytics and the new category of “in-memory analytics” is the size of the memory.
Take SAS, for example. Legacy SAS software runs in memory; when the size of the data set exceeds memory, SAS swaps back and forth between memory and disk, which causes a performance hit. SAS’ new in-memory product (HPA) runs on a box with a large enough memory to avoid swapping on large problems.
Of course, since the HPA box isn’t large enough to hold the entire data warehouse, you’re still moving data around. (Teradata’s largest Model 700 maxes out at 40TB uncompressed).
By eliminating data movement, in-database analytics deployed in an MPP architecture always trump “in-memory” analytics when two conditions are true:
(1) The data is already in the database (because it’s your warehouse)
(2) The analytics problem is embarrassingly parallel. This is always true for scoring and is true for the most commonly used analytics. Case-Based Reasoning, on the other hand, is not embarrassingly parallel, and is an appropriate application for pure in-memory analytics. (A toy sketch of parallel scoring follows.)
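As a toy illustration of why scoring is embarrassingly parallel (the coefficients, column names, and partitions are invented, and multiprocessing here merely stands in for MPP nodes): each row is scored independently, so partitions can be processed with no communication between them.

```python
import math
from multiprocessing import Pool

# Illustrative coefficients from a previously trained logistic-regression model.
COEF = {"intercept": -2.0, "age": 0.03, "income": 0.00001}

def score_row(row):
    """Score one row; scoring touches no other rows."""
    z = COEF["intercept"] + COEF["age"] * row["age"] + COEF["income"] * row["income"]
    return 1.0 / (1.0 + math.exp(-z))

def score_partition(rows):
    # Each partition is scored independently: the problem is embarrassingly parallel.
    return [score_row(r) for r in rows]

if __name__ == "__main__":
    # Stand-ins for rows held by separate database nodes or partitions.
    partitions = [
        [{"age": 34, "income": 52_000}, {"age": 51, "income": 87_000}],
        [{"age": 29, "income": 43_000}],
    ]
    with Pool() as pool:
        print(pool.map(score_partition, partitions))
```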
SAS has produced no evidence to date that in-memory analytics outperform in-database analytics on comparable problems. The claimed performance benefits (100x) are similar to those reported by users of in-database analytics.
For me it comes down to flexible interactive ad hoc querying. This is the sweet spot of analytics. You can’t beat the combination of human intelligence and experience working with an apparatus that can answer virtually any question in less than a second. This is when the magic happens.
But it’s not just about memory. It’s about overall latency. Namely, never leaving the bus to begin with, not even to go to the network. In other words, scale-out doesn’t work here. This is why they still build supercomputers, and why the Cray was always designed to be cylindrical: for minimum latency.
Associative tools like QlikView and PowerPivot simply will not work when there is latency.
On a related note, I believe we reached an inflection point in 2009 when Windows 7 was released. That was the first time Microsoft really got the 64-bit OS right, and 64-bit became the de facto standard, leading to abundant, cheap memory. For example, I can now buy a server with 2 TB of RAM for < $200k.
Once you've been working with post-OLAP technologies like QlikView and PowerPivot, it becomes pretty obvious that OLAP is turning into a niche technology (possibly suitable for EPM, but that's about it).
We've seen this play out before; those who recall the slow but steady transition from pre-relational hierarchical (e.g. IMS) and network (e.g. IDS) databases will recognize the pattern. Unfortunately, in the B2B world, these changes are glacial.