Yet another data warehouse database and appliance overview
For a recent project, it seemed best to recapitulate my thoughts on the overall data warehouse specialty DBMS and appliance marketplace. While what resulted is highly redundant with what I’ve posted in this blog before, I’m sharing it anyway, in case somebody finds this integrated presentation more useful. The original is excerpted to remove confidential parts.
… This is a crowded market, with a lot of subsegments, and blurry, shifting borders among the subsegments.
… Everybody starts out selling consumer marketing and telecom call-detail-record apps. …
Oracle and similar products are optimized for updates above everything else. That is, short rows of data are banged into tables. The main indexing scheme is the “b-tree,” which is optimized for finding specific rows of data as needed, and also for being updated quickly in lockstep with updates to the data itself.
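To make that access pattern concrete, here is a toy Python sketch. It is not a real b-tree (the bisect module stands in for one); it just illustrates what such an index is good at: pinpoint lookups, plus cheap per-row maintenance as data changes.

```python
import bisect

# A toy stand-in for a b-tree index: a sorted list of (key, row_id) pairs.
# Real b-trees are balanced, page-oriented, disk-friendly structures; the point
# here is just the access pattern: pinpoint lookups plus cheap per-row updates.
index = []

def index_insert(key, row_id):
    bisect.insort(index, (key, row_id))       # keep the index sorted as each row arrives

def index_lookup(key):
    i = bisect.bisect_left(index, (key,))     # binary search straight to the key
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None

index_insert(42, "row-0007")
index_insert(17, "row-0001")
print(index_lookup(42))   # -> row-0007
```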
By way of contrast, an analytic DBMS is optimized for some or all of:
- Small numbers of bulk updates, not large numbers of single-row updates.
- Queries that may involve examining or returning lots of data, rather than finding single records on a pinpoint basis.
- Doing calculations on the data (commonly simple arithmetic, sorts, etc.).
Database and/or DBMS design techniques that have been applied to analytic uses include:
- “Denormalizing” the database, by pre-joining tables. This makes queries cheaper, but updates more costly. It’s implicit in single-fact-table designs. (A small sketch of this, and of materialized views, appears after this list.)
- “Star indexes”, which capture the benefits of denormalization, but are large and costly to update.
- “Materialized views”, which precompute query results (joins and/or aggregations). These obviously accelerate queries that use those results, but you have to pay the cost of continually updating them as data changes.
- “Range partitioning”, in which data in (say) certain date ranges is clustered together on disk for more efficient processing.
- “Hypercubes”, aka “MOLAP” (Multi-Dimensional OnLine Analytic Processing). The costs and benefits are extreme forms of those I’ve already cited. At least, the costs are; the benefits don’t seem so extreme any more, which is why the technology is increasingly outmoded.
- “Bit-mapped indexes.” This is another approach to indexing that is fast on queries, at the cost of making updates slower. In its pure form, it’s well-suited for columns with low “cardinality”, i.e., a small number of values (colors, sizes, etc.). But it can be extended to cover higher-cardinality cases. (A toy bitmap index appears after this list.)
- Database administration tools to help with the complex choices involved in writing SQL, selecting indexes, etc.
- Recommended hardware configurations, because the right mix of disks, processors, etc. might otherwise be non-obvious.
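To make the first and third items above concrete, here is a small sqlite3 sketch (all table and column names are invented for the example). SQLite has no true materialized views, so the precomputed summary is just a table that has to be refreshed by hand, which is exactly the update cost described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales  (sale_id INTEGER, store_id INTEGER, amount REAL);
    CREATE TABLE stores (store_id INTEGER, region TEXT);
    INSERT INTO stores VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales  VALUES (10, 1, 9.99), (11, 2, 5.00), (12, 1, 3.50);

    -- Denormalization: pre-join fact and dimension into one wide table,
    -- so region-level queries no longer pay for the join at query time.
    CREATE TABLE sales_denorm AS
        SELECT s.sale_id, s.amount, st.region
        FROM sales s JOIN stores st USING (store_id);

    -- A hand-rolled "materialized view": a precomputed aggregate that must be
    -- refreshed whenever sales changes, which is the update cost noted above.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total
        FROM sales_denorm GROUP BY region;
""")
print(conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# e.g. [('East', 13.49), ('West', 5.0)]
```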
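And here is a toy version of the bitmap-index idea, with Python integers standing in for the packed (and, in a real DBMS, compressed) bit-vectors:

```python
from collections import defaultdict

# Bitmap index on a low-cardinality column: one bit-vector per distinct value.
colors = ["red", "blue", "red", "green", "blue", "red"]   # one value per row

bitmaps = defaultdict(int)
for row_id, value in enumerate(colors):
    bitmaps[value] |= 1 << row_id             # set the bit for this row

def rows_matching(bitmap):
    return [i for i in range(len(colors)) if bitmap & (1 << i)]

# Queries become cheap bitwise operations over the bit-vectors:
print(rows_matching(bitmaps["red"]))                    # rows where color = 'red'
print(rows_matching(bitmaps["red"] | bitmaps["blue"]))  # color IN ('red', 'blue')
```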
That’s pretty much the list of techniques used in general-purpose DBMS products such as Oracle and Microsoft SQL Server. But if you put them all together, you’re still left with these problems:
- The techniques that greatly accelerate queries also greatly slow down updates.
- You use a lot of extra disk space for all those indexes.
- There’s a tremendous amount of labor involved in getting it all right.
- Because of these drawbacks, you’re likely to optimize only for certain subsets of the queries you’d really like to run. Indeed, you may not make all of your data available for analytic querying.
Specialty analytic DBMS can do a lot better than general-purpose DBMS because:
- They can run on “shared-nothing” MPP (Massively Parallel Processing) architectures. Most vendors make this choice, because:
  - Using larger numbers of smaller parts is fundamentally cheaper, if you don’t have a lot of MPP overhead. Most of the vendors have figured out clever ways to avoid that overhead.
  - For larger databases, I/O becomes an absolute bottleneck. But in a shared-nothing DBMS, you can do I/O truly in parallel. (A toy shared-nothing scan appears after this list.)
- If you simplify your software sufficiently, you may be able to get great compression, which has myriad benefits: most obviously to disk costs and I/O, but it can go further than that. Most contenders post-Netezza are good to great at compression. Netezza is playing catch-up. Teradata isn’t really better than Oracle, et al. (Compression is illustrated in the columnar sketch after this list.)
- Disks spin slowly. The fastest disk drive you can buy runs at 15,000 RPM, vs. the 1,200 RPM hard disk technology was introduced with in 1956. (Most systems use 7,200 or 10,000 RPM drives.) So random-access disk reads have become the single greatest bottleneck to analytic processing. One solution is to optimize your DBMS for table scans or other sequential reads, i.e., read more bytes of data, but at a much higher per-byte rate. To varying degrees, the analytic DBMS with row-based architectures are optimized for sequential reads. (Some back-of-the-envelope arithmetic on this appears after this list.) I published two white papers focusing on this point in 2007, sponsored by DATAllegro: http://www.monash.com/whitepapers.html
- You also can break rows apart, and organize data by columns. Columnar architectures have tremendous advantages if you only ever want to retrieve a small fraction of a row. They also can help with compression and general query speed. They are hard to update, however. Vertica has some very clever techniques to beat the update speed problem. ParAccel argues that this cleverness isn’t needed, and that more straightforward techniques suffice. (A toy columnar layout appears after this list.)
- You can have specialized hardware designs or optimizations, even beyond shared-nothing MPP. Netezza has an FPGA, which is almost a custom chip. Kickfire has some kind of custom chip. Calpont keeps trying and failing with a custom chip. Teradata’s hardware is a lot like standard hardware, but they have their own switching system. DATAllegro and other vendors do use standard hardware, but rely on more inter-node communication than might otherwise be there. Columnar vendors, however, tend to be fairly hardware-agnostic.
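To illustrate the shared-nothing point, here is a toy Python sketch (all numbers and names are invented): rows are hash-partitioned across a few worker processes, each worker scans and aggregates only its own partition, and the small partial results are merged at the end.

```python
from multiprocessing import Pool

def scan_partition(amounts):
    # Each "node" does its local work independently: no shared disk, no shared memory.
    return sum(amounts)

if __name__ == "__main__":
    N_NODES = 4
    rows = [(i, float(i % 100)) for i in range(100_000)]   # made-up (key, amount) rows

    # Shared-nothing layout: hash-partition the rows so each node owns one slice.
    partitions = [[] for _ in range(N_NODES)]
    for key, amount in rows:
        partitions[hash(key) % N_NODES].append(amount)

    with Pool(N_NODES) as pool:
        partials = pool.map(scan_partition, partitions)    # local scans run in parallel
    print(sum(partials))                                   # coordinator merges tiny partial results
```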
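The disk-speed point comes down to simple arithmetic. The figures below are illustrative assumptions, not measurements, but they show why per-byte throughput is so much higher for sequential scans than for random page reads:

```python
# Back-of-the-envelope numbers (illustrative assumptions, not measurements):
# a 10,000 RPM drive averages half a rotation of latency plus a seek per
# random read, while a sequential scan streams at the drive's transfer rate.
rotational_latency_s = 0.5 * 60 / 10_000     # ~3 ms average rotational latency
seek_time_s          = 0.004                 # assumed 4 ms average seek
random_read_kb       = 8                     # one 8 KB page per random read
sequential_mb_per_s  = 100                   # assumed streaming transfer rate

random_mb_per_s = (random_read_kb / 1024) / (rotational_latency_s + seek_time_s)
print(f"random reads:     ~{random_mb_per_s:.1f} MB/s")   # ~1 MB/s
print(f"sequential scans: ~{sequential_mb_per_s} MB/s")   # roughly two orders of magnitude faster
```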
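And here is a toy illustration of the columnar and compression points: the same made-up table laid out by column, so a query touching one column never reads the bytes of the others, plus a simple run-length encoding of a low-cardinality column.

```python
# Row-store vs. column-store layout of the same three-column table, plus a
# simple run-length encoding of one column. All data here is made up.
rows = [("east", "red",  10), ("east", "red",  20),
        ("east", "blue", 30), ("west", "blue", 40)]

# Columnar layout: one array per column; a query that touches only `amount`
# never reads `region` or `color`.
region, color, amount = map(list, zip(*rows))
print(sum(amount))                       # scans just one column

def rle(column):
    # Run-length encoding: adjacent repeats collapse to (value, count) pairs.
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

print(rle(region))   # [['east', 3], ['west', 1]] -- shorter than the raw column
```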
…
Beyond raw database size, characteristics of the database and workload that affect which analytic DBMS works best include:
- Do you have to do any significant volume of low-latency updates at all? If so, how low? (15-minute latency is a common but still minority data warehousing requirement, both in cases where there’s a legitimate business benefit and in cases where there is not. Most products meet that requirement, some more gracefully than others.)
- Are your result sets likely to be huge (e.g., inputs into SAS data mining software)? Fairly large? Single-row? Columnar systems are bad at single-row result sets.
- How many queries are likely to be running at once? The ability to handle concurrency well is a function of product maturity even more than basic architecture. Each time Netezza or DATAllegro has a major release, they tell me that now their concurrency is great, and confess it wasn’t so hot in the prior version. Very high concurrency means serving a call center or feeding a website’s personalization. Medium concurrency means reporting and dashboards for a large but not huge enterprise. Low concurrency means serving a few specialized data analysts in a department.
- What absolute response time do you need? (Are you serving a call center? A personalized web site? A user who doesn’t mind tapping her fingers for a few minutes, but doesn’t want to wait a few hours? A user who wants a response within a few seconds?) Different DBMS are optimized a bit differently. But frankly, if a system has great price/performance, it usually will be good in any scenario.
- How much are you doing in the way of arithmetic calculations? An application very light on data volume and heavy on arithmetic is sometimes a genuine excuse for using MOLAP. Otherwise, it’s nice to have good flexibility via a feature called “user-defined functions” (UDFs). (A small UDF sketch appears after this list.)
- When you bring back a row, do you typically want the whole row, or are many of the columns of that row just wasted I/O? If it’s the latter, columnar systems shine. This is particularly common in consumer marketing/targeting types of applications, where you may start with 1,000 or more columns of data.
- Are you basically querying a single large “fact table” across many “dimensions”, or a small group of closely-related fact tables? Or is the database schema significantly more complicated than that? Vertica only allows one fact table. At the other extreme, Teradata has for decades been optimized for any kind of schema. Most systems let you use any kind of schema you want, but that doesn’t mean they perform well in all scenarios.
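As a small illustration of the user-defined-function idea mentioned above, here is a sketch using Python’s sqlite3 module. It is only a stand-in (each analytic DBMS has its own UDF mechanism), and the table, column, and function names are invented.

```python
import sqlite3, math

# A user-defined function pushes a custom calculation into the query engine,
# so it runs next to the data. sqlite3 is just a stand-in here; each analytic
# DBMS has its own UDF mechanism. Table, column, and function names are invented.
def decibels(value):
    return 10 * math.log10(value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(1, 10.0), (1, 1000.0), (2, 0.1)])

conn.create_function("decibels", 1, decibels)   # register the Python function for use in SQL

for row in conn.execute("SELECT sensor_id, decibels(value) FROM readings"):
    print(row)   # (1, 10.0), (1, 30.0), (2, -10.0)
```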