Is data warehousing now all about sequential access?
A lot of evidence points to a major paradigm shift in data warehouse RDBMS design, along the lines of:
Old way: Assume I/O is random; lower total execution time by improving selectivity and thus lowering the amount of I/O.
New way: Drive the amount of random I/O to near zero, and do as much sequential I/O as necessary to achieve this goal.
Examples include:
- Data warehouse appliances (see especially this discussion of DATallegro’s architecture)
- Columnar systems (see Nathan Myer’s first comment in this discussion of the much-hyped Required Technologies prototype)
- Memory-centric systems, notably SAP’s BI Accelerator
The hardware logic is compelling, as long as we rely on hard disks rather than, say, flash memory. Rotation speed has only gone up 12.5-fold in the entire 50-year history of the hard drive, and currently maxes out at 15,000 RPM. At that speed one rotation takes 4 ms, so the average wait for the right sector to come around is half a rotation, which puts a floor of 2 ms on average random access time. But streaming data on and off disk gets exponentially faster, in line with increases in disk density and semiconductor performance. Hence sequential data access gets ever faster, while random access does not.
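As a rough check on that arithmetic, here is a minimal Python sketch. The 15,000 RPM and 12.5-fold figures come from the post (together they imply a starting point of 1,200 RPM). The table size and streaming rate are made-up assumptions for illustration, not measurements, and seek time is ignored, so the random-access estimate is an optimistic lower bound.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for a sector to rotate under the head: half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def sequential_scan_time_s(table_mb: float, stream_mb_per_s: float) -> float:
    """Time to stream an entire table off disk at a steady rate."""
    return table_mb / stream_mb_per_s


def break_even_random_reads(table_mb: float, stream_mb_per_s: float, rpm: float) -> float:
    """Number of random reads that cost as much as one full sequential scan.

    Seek time is ignored, so real drives break even at fewer reads than this.
    """
    scan_s = sequential_scan_time_s(table_mb, stream_mb_per_s)
    return (scan_s * 1000.0) / avg_rotational_latency_ms(rpm)


# Figures taken from the post.
print(avg_rotational_latency_ms(15_000))  # 2.0 ms floor at 15,000 RPM
print(15_000 / 1_200)                     # 12.5-fold increase from 1,200 RPM

# Hypothetical workload: a 100 GB table streamed at 50 MB/s (assumed numbers).
print(sequential_scan_time_s(100_000, 50))           # seconds for a full scan
print(break_even_random_reads(100_000, 50, 15_000))  # random reads costing the same
```

Under these assumed numbers, once a query needs more than about a million random reads, scanning the whole table sequentially is no slower. As streaming rates rise and rotational latency stays put, that break-even point keeps dropping.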
What I don’t 100% understand yet, however, is the full array of techniques used by the traditional leaders to co-opt or combat this trend. I’m looking into that; in particular, I have a call scheduled with Oracle.
I hope to write about this issue in my October Computerworld column. (My columns are typically submitted on the first Monday or Tuesday morning of the month, to appear in the following week’s edition.) Or if it slips from October, then soon thereafter. Any thoughts in the interim would be most welcome.
Comments
4 Responses to “Is data warehousing now all about sequential access?”
[…] I talked with Teradata today, and they called me on my use of the term “sequential.” Basically, if there’s any head movement for disk seeks, some computer science researchers wouldn’t call it “sequential.” I didn’t know that; I was just familiar with the less precise usage of the term in some vendors’ marketing and discussions.* OK, I’ll make up a new, more precise term instead. How about “coarse-grained”? […]
I’m absolutely behind anything that will suppress disk head latency as a factor in data warehouse performance. In fact, I wrote something on the subject a little over a year ago: http://oraclesponge.wordpress.com/2005/07/25/time-slicing-of-disk-io/
I suppose that the vendors are still having trouble grasping how inherently different data warehouse workloads are from the small-and-random I/O model that OLTP generates.
[…] This issue popped back into my head after being directed through Log Buffer #11 at Mark Rittman’s site to an article by Curt Monash titled “Is data warehousing all about sequential access?” and which matched my thoughts very well. […]