What kinds of data warehouse load latency are practical?
I took advantage of my recent conversations with Netezza and IBM to discuss what kinds of data warehouse load latency were practical. In both cases I got the impression:
- Subsecond load latency is substantially impossible. Doing that amounts to OLTP.
- 5 seconds or so is doable with aggressive investment and tuning.
- Several minute load latency is pretty easy.
- 10-15 minute latency or longer is now very routine.
There’s generally a throughput/latency tradeoff, so if you want very low latency with good throughput, you may have to throw a lot of hardware at the problem.
I’d expect to hear similar things from any other vendor with reasonably mature analytic DBMS technology. Low-latency load is a problem for columnar systems, but both Vertica and ParAccel designed in workarounds from the get-go. Aster Data probably didn’t meet these criteria until Version 4.0, its old “frontline” positioning notwithstanding, but I think it does now.
Related link
- Just what is your need for speed anyway?
Comments
4 Responses to “What kinds of data warehouse load latency are practical?”
Although there is a throughput/latency tradeoff, most databases are far from the tradeoff curve. There are two ways to look at this: in theory and in practice.

In theory: B-trees achieve O(log N / log B) disk-head movements per insertion and O(log N / log B) per point query. There are known data structures that achieve O((log N) / sqrt(B)) insertion performance while matching B-tree point-query and range-query performance. Note that sqrt(B) >> log B for most interesting block sizes, making these data structures much better at insertions than B-trees. For example, O(log N / log B) is more than one, but O(log N / sqrt(B)) is something like 1/1000. There is a tradeoff curve, but B-trees are not on it.
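To make the asymptotic contrast in the comment above concrete, here is a small numerical sketch. The values of N and B are illustrative assumptions (a billion keys and a 1 MB block), not figures from the comment, and constant factors are ignored:

```python
import math

# Illustrative arithmetic for the asymptotics above (constants ignored).
# Assumed, hypothetical values: N = 10**9 keys, block size B = 2**20.
N = 10**9
B = 2**20

btree_insert = math.log(N) / math.log(B)  # O(log N / log B): B-tree moves per insertion
wo_insert = math.log(N) / math.sqrt(B)    # O(log N / sqrt(B)): write-optimized structure

print(f"B-tree:          ~{btree_insert:.2f} disk-head movements per insertion")
print(f"Write-optimized: ~{wo_insert:.4f} disk-head movements per insertion")
print(f"Advantage:       ~{btree_insert / wo_insert:.0f}x")
```

With these numbers the B-tree bound is above one movement per insertion while the write-optimized bound is a small fraction of one, matching the "more than one" versus "much less than one" contrast drawn in the comment.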
In practice: Tokutek is selling a storage engine for MySQL that maintains indexes at tens of thousands of insertions per second, even when the indexes are much larger than main memory, and even for worst-case data sets (such as when keys look essentially random). Since Fractal Tree index performance dominates B-tree performance, I expect Fractal Tree indexes to displace essentially all uses of B-tree indexes over time.

This means there is no real reason to divide the world into OLTP and OLAP. One can have OLAP-like throughput with OLTP-like latency.
A key piece of the technology needed to deliver high-throughput, low-latency loading is offloading and parallelizing the load. Serialized loading through a single chokepoint, along with contention between loading and queries, are two of the factors that hinder load throughput and latency. I believe your reference to Aster Data has in mind that parallel, offloaded loading leveraging any number of commodity server nodes is part of Aster’s offering.
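The parallel-loading idea described above can be sketched in a few lines. This is a toy illustration, not Aster's actual implementation: `load_partition` is a hypothetical stand-in for a per-node bulk loader, and rows are hash-partitioned by key so that no single chokepoint serializes the load:

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(rows):
    # Hypothetical stand-in for a per-node bulk load; returns rows loaded.
    return len(rows)

def parallel_load(rows, n_workers=4):
    """Hash-partition rows by key, then load all partitions concurrently."""
    partitions = [[] for _ in range(n_workers)]
    for row in rows:
        partitions[hash(row[0]) % n_workers].append(row)  # row[0] is the key
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        loaded = pool.map(load_partition, partitions)
    return sum(loaded)

print(parallel_load([(i, "payload") for i in range(1000)]))  # → 1000
```

In a real MPP loader the workers would be separate server nodes rather than threads, but the structure is the same: partition first, then load each partition independently so throughput scales with the number of workers.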
@Jon,
Actually, I was referring to the fact that single-row inserts were really slow in Aster Data until Version 4. 😉
AFAIK, Vertica takes care of this by delivering results from its in-memory write cache. I think they offer real-time updates.
Another similar technology is Lucene. As of version 2.9 it incorporates near-real-time updates, again by applying queries to the IndexWriter’s cache. It is not an analytic DBMS, but it serves similar purposes. Attivio even builds SQL access, including aggregate functions, on top of Lucene. This makes it possible to provide DW-like functionality on top of semi-structured data.