Database management system choices – relational data warehouse
This is the third of a five-part series on database management system choices; links to the full series appear at the end of this post.
High-end OLTP relational database management system vendors try to offer one-stop shopping for almost all data management needs. But as I noted in my prior post, their product category is facing two major competitive threats. One comes from specialty data warehouse database management system products. I’ve covered those extensively in this blog, with key takeaways including:
- Specialty data warehouse products offer huge cost advantages versus less-targeted DBMS. This applies to purchase/maintenance and administrative costs alike. And it’s true even when the general-purpose DBMS boast data warehousing features such as star indexes, bitmap indexes, or sophisticated optimizers.
- The larger the database, the bigger the difference. It’s almost inconceivable to use Oracle for a 100+ terabyte data warehouse. But if you only have 5 terabytes, Oracle is a perfectly viable – albeit annoying and costly – alternative.
- Most specialty data warehouse products have a shared-nothing architecture. Smaller parts are cheaper per unit of capacity. Hence shared nothing/grid architectures are inherently cheaper, at least in theory. In data warehousing, that theoretical possibility has long been made practical.
- Specialty data warehouse products with row-based architectures are commonly sold in appliance formats. In particular, this is true of Teradata, Netezza, DATAllegro, and Greenplum. One reason is that they’re optimized to stream data off of disk fairly sequentially, as opposed to relying on random seeks.
- Specialty data warehouse products with columnar architectures are commonly available in software-only formats. Even so, Vertica and ParAccel also boast appliance deals, with HP and Sun respectively.
- There is tremendous technical diversity and differentiation in the specialty data warehouse system market.
Let me expand on that last point. Different features may or may not be important to you, depending on whether your precise application needs include:
- Absolute scalability. Teradata, DATAllegro, and SAS have customers with multi-hundred terabyte data warehouses (user data). Netezza is headed for that range soon. Many other vendors effectively top out in the tens of terabytes right now.
- Pre-projection. Columnar data warehouse products only retrieve the columns needed for a particular query (see the sketch after this list). Depending on how wide your rows are, that can be a huge advantage. Through materialized views – and in Netezza’s case a hardware assist – row-based vendors have varying degrees of effectiveness at accomplishing similar things.
- Pinpoint queries. For some data warehouse applications (e.g., in call centers), it is important to look up specific records at high speed. Conventional-but-much-faster row-based systems like Teradata, Greenplum, or DATAllegro can usually meet these needs. Not all columnar vendors would do as well.
- Data scoring. Similarly, data scoring – e.g., in applications such as on-the-fly call center offer optimization – tends to involve entire rows of data. Row-based systems may outperform columnar ones for that kind of use.
- Partitioning. A major virtue of most specialty data warehouse products is that they give pretty good performance without a lot of pre-tuning. Even so, range partitioning can ensure that data with particular field values is concentrated together, meaning that less total data needs to come off disk to satisfy certain queries (see the partitioning example after this list). A major use of this capability is in applications where data with recent dates is retrieved or analyzed more intensely than older information – and there are a lot of those.
- Load speed. Row-based data warehouse managers usually have fast load speeds, for bulk loads and trickle feeds alike. Columnar vendors differ in how fast they are at which kinds of loads.
- Concurrency. If you need your data warehouse to support a large and diverse set of concurrent reports and queries, Teradata will definitely do the job for you. Netezza and DATAllegro have also cleaned up bottlenecks through a few revisions each. Newer products might or might not do as well.
- Schema support. Some data warehouse products are optimized for pure star schemas with single fact tables, or schemas only a little more complex than that. Others are more broadly applicable. For many data warehouses, the simple schemas suffice.
- Compressibility. Every serious data warehouse product offers at least 2-5X compression. (Columnar systems are somewhat ahead of row-based ones.) Some spike up into double digits, especially for certain kinds of columns. Depending on the nature of your data, compression may or may not be a huge differentiating factor.
- Transparency/compatibility. If you need your data warehouse to run exactly the SQL that SQL Server does, you might want to talk to ParAccel. If you want to be compatible with MySQL, Infobright Brighthouse is the ticket.
- Alternative query languages, such as MDX or “Model by”, instead of standard SQL.
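To make the pre-projection point concrete, here’s a minimal sketch. The table and column names are made up; the point is that on a very wide table, a query touching only a couple of columns lets a columnar engine read a small fraction of the bytes a row store would.

```sql
-- Hypothetical wide fact table: imagine a couple hundred columns of
-- call-detail attributes. A columnar engine can satisfy this query by
-- reading only the referenced columns (caller_region, call_seconds,
-- call_date), while a row store generally pulls whole rows off disk.
SELECT caller_region, SUM(call_seconds) AS total_seconds
FROM   call_detail_fact
WHERE  call_date >= DATE '2007-10-01'
GROUP  BY caller_region;
```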
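And here’s the kind of date-based range partitioning described in the partitioning bullet, sketched in Oracle-style DDL purely for illustration; every product has its own partitioning syntax, and the table, column, and partition names are hypothetical.

```sql
-- Illustrative Oracle-style range partitioning by date.
CREATE TABLE call_detail_fact (
    call_date     DATE,
    caller_region VARCHAR2(32),
    call_seconds  NUMBER
)
PARTITION BY RANGE (call_date) (
    PARTITION p_2007_h1 VALUES LESS THAN (DATE '2007-07-01'),
    PARTITION p_2007_h2 VALUES LESS THAN (DATE '2008-01-01'),
    PARTITION p_current VALUES LESS THAN (MAXVALUE)
);
-- A query restricted to recent dates touches only the newest partition(s),
-- so far less data has to come off disk.
```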
One more complication: There also are a number of products that accelerate data analysis outside the DBMS, usually in a memory-centric way. Ones I’ve written about in the past include QlikView, SAP BI Accelerator, Applix, and the whole category of complex event/stream processing.
The complete series
- Part 1: Database management system choices – overview
- Part 2: Database management system choices – 4 categories of relational
- Part 3: Database management system choices – relational data warehouse
- Part 4: Database management system choices – mid-range-relational
- Part 5: Database management system choices – beyond relational
Comments
Do you have much idea how much benefit Netezza gets from its hardware assist? Do they explain it technically anywhere that the general public can access?
I wonder about this because over the years, I often hear about products with specialized hardware assists, and in the long run it turns out that they aren’t really valuable enough to justify their cost and the fact that you can’t use stock hardware. (Please omit comments about the irony of a Symbolics co-founder saying something like this; I am referring to more recent history than that, anyway.) For example, one of my friends co-founded a company to make network file servers that would use hardware to speed up the “inner loops” (performance critical parts). After a lot of careful design work, they finally concluded that there wasn’t actually any opportunity for specialized hardware that made engineering and business sense (and they had to lay off a lot of hardware guys, sadly, but the company eventually did well). So I just have a bit of a tendency to be skeptical about specialized hardware.
On the other hand, where it does work, it’s often quite cool…
Dan,
So far, they make a compelling case.
The really short version is that they offload a lot of work from the CPU to a cheaper and cooler FPGA. But beyond that, the FPGA handles records as they stream, rather than taking them into cache and processing them afterwards, so the FPGA does a lot less work itself than it’s saving the CPU.
CAM
Curt,
For what technical reasons do you feel it is “almost inconceivable to use Oracle for a 100+ terabyte data warehouse”?
It has been almost 2.5 years, but in the 2005 WinterCorp TopTen Survey Yahoo! had a 100TB Oracle data warehouse [1] and I’m sure it has grown since then. Also, as of Q2 2007, AT&T has two separate 224TB Oracle data warehouses [2]. So why do you not mention Oracle as part of your “Absolute scalability” bullet point given that there are several multi-hundred terabyte Oracle data warehouses?
[1] http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005TopTenWinners.pdf
[2] http://www.lgr.biz/index.php?option=com_content&view=article&id=67&Itemid=54
Every time I check out an Oracle warehouse in the tens of terabytes range or higher, it turns out to involve very unnatural acts.
As for your examples, the second one refers to AT&T data warehouses supposedly running on Oracle — but the details are on a broken link. Your first link (correctly, IMO) lists only AT&T warehouses running on Daytona, not Oracle.
Leaving AT&T out, your links suggest one Oracle data warehouse over 25 terabytes (I actually know that there are a few others). And it’s at Yahoo, which thinks nothing of — well, of unnatural acts to get database management systems to run in loosely coupled clusters via custom code.
No doubt there are a few genuine cases of Oracle data warehouses with tens of terabytes of user data running reasonably smoothly. But I do mean a “few”, except perhaps in the low end of that range.
And don’t be at all sure that databases mentioned a couple of years ago have grown *on their original platforms*. For example, Yahoo has bought data warehouse appliances.
Finally, please note that it is hard to find a Winter survey later than 2005. And please further note that the world’s largest data warehouses are definitely not in the survey.
CAM
You didn’t really answer my first question. I’m interested in the technicals. What Oracle warehouses have you checked out and can you define “unnatural acts”?
The second link works (though the case study link is broken) and the brief clearly states: “The CDRlive system at AT&T currently comprises two separate 224TB Oracle databases”, so where does *supposedly* come into play?
I do recognize that the survey is from 2005 (and there isn’t a newer one) and that not every customer or vendor participated in it, but that shouldn’t discount the facts (there was equal opportunity to participate). Just to note, there is another Oracle database at 222TB in that report, the one at the Max-Planck-Institute for Meteorology. Also to clarify, the AT&T Daytona warehouse and the 224TB AT&T Oracle DB are different. The latter is from Cingular.
Just because you say Yahoo bought a data warehouse appliance doesn’t mean that the database (from 2005) doesn’t run on Oracle today.
Just to be clear, I’m not claiming any db is better or worse, I’d just like to see supporting evidence of some of your claims, but perhaps this info is secret and we’ll just have to take your word.
Greg,
The fact that something appears in a Winter Corp survey is only very weak evidence of its accuracy.
Dick Winter et al. are good guys. But they publish that survey even though they know it’s largely false. The big limitation is that, for whatever reasons of methodology, they mainly find sites that vendors tell them about and that customers consent to have mentioned. Hence no Teradata sites, for instance.
And I frankly find the Yahoo claim not-so-credible, until somebody proves to me that Yahoo actually disclosed it, and it’s really a data warehouse as we understand the term.
By the way — do you know whether those Winter Corp claims are about user data? Or total disk? Or what? I imagine that aspect of their methodology IS disclosed, but I haven’t checked.
CAM
The Winter Corp number is user data; you can find how it is calculated from the provided scripts. I don’t really have interest in debating the credibility of the Winter Corp survey. It is what it is.
The survey details are documented here:
http://www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTen_Survey_FAQ.htm
http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005_TTSurvey_Validation_Preview.htm
Even if you have your doubts about the Yahoo claim, it has been made public. Can you provide the public references for your Teradata, DATAllegro, and SAS claims of 100+ TB warehouses? I’d be very interested in knowing more about these.
Telling me — unless under NDA — constitutes making a claim public.
As for “unnatural acts” — can you write the SQL one would think you can write, or do you have to union over the result sets from a partitioned set of tables to get your results?
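For concreteness, a hedged sketch of the contrast, with made-up table and column names:

```sql
-- The query one would like to write:
SELECT customer_id, SUM(revenue) AS total_revenue
FROM   sales_fact
WHERE  sale_date >= DATE '2007-01-01'
GROUP  BY customer_id;

-- The "unnatural" version, when the warehouse has been hand-split into
-- per-period tables and the application has to union them itself
-- (the date filter is now implicit in which tables are included):
SELECT customer_id, SUM(revenue) AS total_revenue
FROM (
    SELECT customer_id, revenue FROM sales_fact_2007_q1
    UNION ALL
    SELECT customer_id, revenue FROM sales_fact_2007_q2
    UNION ALL
    SELECT customer_id, revenue FROM sales_fact_2007_q3
    UNION ALL
    SELECT customer_id, revenue FROM sales_fact_2007_q4
) all_quarters
GROUP BY customer_id;
```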
And looking at your latest Winter Corp links — what I and most other people call “user data” seems to be what they’re calling “normalized data”. In Oracle data warehouses, that’s commonly 1/5 – 1/10 as much as what Winter Corp calls “user data”.
CAM
When there are documented references to your 100+ TB db claims, please share them. I’m interested in the big data arena as I’m sure other readers are.
WRT “unnatural acts”: I’ve never had to do this, nor heard of anyone who has, and would not see why it would need to be done. Seems to me like a case of poor SQL writing skills rather than an Oracle product limitation. If you have a test case I’d gladly look at it offline.
Can you give the definition that you use for “user data” and explain the difference between the Winter Corp definitions?
Why is this number 1/5 – 1/10 as much in Oracle data warehouses? Is this ratio also applied to other data warehouse platforms?
User data = data. Not work space. Not indexes. Not aggregate tables. Not disks used for mirroring. Not the disk space savings from compression. Just raw data.
http://www.dbms2.com/2006/09/28/expansion-ratio-data-warehouse-explosion/
As for naming the big databases — I’m sorry. Most of the info I get about particular users is under possible NDA, and I often don’t bother checking what has or hasn’t made it into press releases and the like. However, I do know DATAllegro has a press release about a multihundred terabyte site. Dataupia put out a release about a big user very recently. (If memory serves, they called it 150 terabytes, but upon query from me said it was 120 Tb of user data. I haven’t posted about it yet.) eBay, which I just posted about, is not — I’m pretty sure — running its main warehouses on Oracle or SQL Server. If I get clarity for public consumption, I’ll update accordingly. Walmart has huge data warehouses, and is a long-time Teradata customer. Ditto K-Mart pre-Sears-merger. I posted about a 35 Tb Sun/Greenplum site where Oracle had been thrown out. And so on.
If you really want to learn and not just argue, there are a lot of pointers out there. If nothing else, just take the obvious suspects for the largest warehouses, leave out whoever you think will be bewildering due to decentralized IT decision making or secrecy (hint: start that list with Amazon and the intelligence community), and start seeing who’s known to be their vendor.
Have you even read my prior posts on these subjects? There’s a reasonably decent search capability on my sites.
CAM
Thanks for info. I apologize if you think I want to argue – not the case at all. The engineer in me just likes to see supporting evidence and some more details surrounding some of your statements.
Cheers.
Greg,
I’ve been an analyst since 1981, and I’ve always dealt with huge amounts of semi-confidential and only partly-verified info. (My stock analyst years were at the beginning of that period, long before Reg FD.) A big part of an analyst’s skill lies in figuring out what’s actually for real.
I’m usually right. Not always, to be sure. But usually. At least about reading the current state of the market and technology. Predictions are sometimes a different matter …
Best,
CAM
Any reason you didn’t mention Dataupia in the transparency/compatibility bullet? That is their whole play, isn’t it?
Tom,
I haven’t talked enough with Dataupia to make many en passant comments about them.
E.g., I just caught an error in the bullet you commented on — I had implied ParAccel’s Oracle compatibility was already available, when in fact that’s a “roadmap” item, with a vague delivery timeframe target that they freely admit isn’t set in stone. I knew enough to edit that, without hesitation. For Dataupia I wouldn’t be as sure about what’s real and what isn’t.
CAM
Hi Curt,
One nice feature that Teradata now offer is the ability to mix “hot” (frequently accessed) and “cold” (rarely accessed) data on the same disc spindles.
This gets around the limitation of disk read speeds versus ever-denser discs.
If, say, only 20% of the data on a spindle is hot, then it stands to reason that the hot data can be read from many spindles a bit quicker than if all of it had to be read from fewer spindles.
Maybe other products have similar functionality…
BTW I currently work with SQL Server so I am not trying to promote TD ! Too expensive IMHO.
John
John,
I’d missed that Teradata feature. Thanks!
DATAllegro does “multi-temperature” in another way. In essence, they let you put two different brands of disk in the same installation, one of which is cheaper/bigger/slower than the other. Then they manage it all as a fairly seamless logical whole, as part of their grid strategy.
CAM