Data warehouse appliances
Analysis of data warehouse appliances – i.e., of hardware/software bundles optimized for fast query and analysis of large volumes of (usually) relational data. Related subjects include:
- Data warehousing
- Parallelization
- Netezza
- DATAllegro
- Teradata
- Kickfire
- (in The Monash Report) Computing appliances in multiple domains
DATAllegro discloses a few numbers
Privately held DATAllegro just announced a few tidbits about financial results and suchlike for the fiscal year ended June, 2007. I sent over a few clarifying questions yesterday. Responses included:
- Yes, the company experienced 330% year-over-year annual revenue growth.
- The majority of DATAllegro customers have bought systems in the 25-100 terabyte range.
- One system over 250 terabytes has been in production for months (surely the one I previously wrote about); a second is being installed.
- DATAllegro has “about 100” employees. By way of comparison, Netezza reported 225 full-time employees for the year ended January, 2007 – which probably means as of January 31, 2007.
All told, it sounds as if DATAllegro is more than 1/3 the size of Netezza, although given its higher system size and price points I’d guess it has well under 1/3 as many customers.
Here’s a link. I’ll likely edit that to something more permanent-seeming later, and generally spruce this up when I’m not so rushed.
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, DATAllegro | 8 Comments |
One Greenplum customer — 35 terabytes and growing fast
I was at the Business Objects conference this week, and as usual went to very few sessions. But one I did stroll into was on “Managing Rapid Growth With the Right BI Strategy.” This was by Reliance Telecommunications, an outfit in India that is adding telecom subscribers very quickly, and consequently banging 100-150 gigs of data per day into a 35 terabyte warehouse. At that rate, the warehouse would roughly double in size within a year.
The beginning of the talk astonished me, as the presenter seemed to be saying they were doing all this on Oracle. Hah. Oracle is what they moved away from; instead, they got Greenplum. I couldn’t get details; indeed, as a BI guy he was far enough removed from the DBMS side that he misspoke and said Greenplum had been brought in by “HP,” quickly correcting himself when prompted. Read more
Categories: Analytic technologies, Business Objects, Data warehouse appliances, Data warehousing, Greenplum, Oracle, Specific users, Telecommunications | Leave a Comment |
Gartner 2007 Magic Quadrant for Data Warehouse Database Management Systems
February, 2011 edit: I’ve now commented on Gartner’s 2010 Data Warehouse Database Management System Magic Quadrant as well.
It’s early autumn, the leaves are turning in New England, and Gartner has issued another Magic Quadrant for data warehouse DBMS. (Edit: As of January, 2009, that link is dead but this one works.) The big winners vs. last year are Greenplum and, secondarily, Sybase. Teradata continues to lead. Oracle has also leapfrogged IBM, and there are various other minor adjustments as well, among repeat mentionees Netezza, DATAllegro, Sand, Kognitio, and MySQL. HP isn’t on the radar yet; ditto Vertica. Read more
Three ways Oracle or Microsoft could go MPP
I’ve been arguing for a while that Oracle and Microsoft are screwed in high-end data warehousing. The reason is that they’re stuck with SMP (Symmetric Multi-Processing) architectures, while Teradata, Netezza, DATAllegro, and many others enjoy the benefits of MPP (Massively Parallel Processing). Thus, Teradata and DATAllegro boast installations in the hundreds of terabytes each, while Oracle and Microsoft users usually have to perform unnatural acts of hard-coded partitioning even to reach the 10 terabyte level.
That said, there are at least three ways Oracle and/or Microsoft could get out of this technical box:
1. They could buy or just partner with MPP vendors such as Dataupia, who offer plug-compatibility with their respective main DBMS.
2. They could buy whoever they want, plug-compatibility be damned. Presumably, they’d quickly add a light-weight data federation front-end to give the appearance of integration, then merge the products more closely over time.
3. They could develop or buy technology like DATAllegro’s, which essentially federates instances of an ordinary SMP DBMS across nodes of an MPP grid (Greenplum does something similar); a minimal sketch of the idea follows this list. I imagine that, for example, ripping Ingres out of DATAllegro and slotting in Oracle instead would be a pretty straightforward exercise; even without dramatic change to any of the optimizations, the resulting port would be something that ran pretty quickly on Day 1.
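To make that federation idea concrete, here is a minimal Python sketch of the scatter-gather pattern such a layer implements: rows are hash-distributed across several ordinary single-node DBMS instances, a query is pushed down to every node, and the partial results are merged centrally. SQLite stands in for “an ordinary SMP DBMS,” and the node count, table, and helper names are my own illustrative assumptions, not anything from DATAllegro or Greenplum.

```python
# Minimal scatter-gather sketch of MPP-style federation over ordinary
# single-node DBMS instances. SQLite stands in for the per-node SMP DBMS;
# everything here is illustrative, not vendor code.
import sqlite3

NODES = [sqlite3.connect(":memory:") for _ in range(4)]  # four "nodes" of the grid

# Create the same table on every node (shared-nothing: each node holds a slice).
for node in NODES:
    node.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")

def load_row(customer_id, amount):
    """Hash-distribute each row to exactly one node."""
    node = NODES[hash(customer_id) % len(NODES)]
    node.execute("INSERT INTO sales VALUES (?, ?)", (customer_id, amount))

def federated_total():
    """Push the aggregate down to every node, then merge the partial results."""
    partials = [
        node.execute("SELECT SUM(amount) FROM sales").fetchone()[0] or 0.0
        for node in NODES
    ]
    return sum(partials)

for cid in range(1000):
    load_row(cid, 10.0)
print(federated_total())  # 10000.0
```

The hard part in a real product is of course the optimizer, i.e., deciding which joins and aggregations can be pushed down to the nodes and which require redistributing data, which is why a Day 1 port could run quickly while full performance tuning would take much longer.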
Bottom line: Oracle and Microsoft are hemorrhaging at the data warehouse high end now. But there are ways they could stanch the bleeding.
SAS goes MPP on Teradata first
After a hurried discussion with SAS CTO Keith Collins and a followup with Teradata CTO Todd Walter, I think I’ve figured out the essence of the SAS port to Teradata. (Subtle nuances, however, have to await further research.) Here’s what I think is going on:
1. SAS is porting or creating two different products or modules, with two different names (and I don’t know exactly what those names are). The two different things they are porting amount to modeling (i.e., analysis) and scoring (i.e., using the results of the model for automated decision-making).
2. Both products are slated for delivery at or near the time of SAS 9.2, which is slated for GA at or near the middle of next year. (Maybe somebody from SAS could send me the official word, as well as product names and so on?)
3. The essence of the modeling port is a library of static UDFs (User Defined Functions).
4. The essence of the SAS scoring port is the ability to easily generate a single “dynamic” UDF to score according to a particular model (a sketch of the idea follows this list). This would seem to leverage Teradata scoring-related enhancements much more than it would compete or conflict with them.
5. There are two different kinds of benefits SAS gets from integrating with an MPP (Massively Parallel Processing) DBMS. One is actual parallel processing of operations, shortening absolute calculation time dramatically, and also leveraging Moore’s Law without painful SMP (Symmetric MultiProcessing) overhead. The other is a radical reduction in data movement costs for the handoff between the database and the SAS software. Interestingly, SAS reports huge performance gains even from putting its software on a single node inside the Teradata grid. That is, changing how data movement is done is already a huge win, even when there’s no reduction in the overall amount moved. But of course, in the complete implementation, where database and SAS processing are done on the same nodes, there’s also a huge reduction in actual data movement effort required.
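To illustrate what a generated scoring UDF amounts to (purely a sketch; the model, column names, and emitted SQL are my assumptions, not SAS’s or Teradata’s actual interfaces), the idea is to compile a fitted model’s coefficients into an expression the DBMS can evaluate on every row in parallel, so scoring happens where the data lives.

```python
# Hypothetical sketch: turn a fitted model's coefficients into SQL that a
# parallel DBMS could run as (or inline like) a scoring UDF on every row.
# The model, column names, and emitted SQL are illustrative assumptions.
import math

def scoring_expression(intercept, coefficients):
    """Emit a SQL expression computing a logistic-regression score."""
    linear = " + ".join(
        [f"({intercept})"] + [f"({w} * {col})" for col, w in coefficients.items()]
    )
    return f"1.0 / (1.0 + EXP(-({linear})))"

model = {"tenure_months": -0.03, "monthly_spend": 0.002, "support_calls": 0.4}
expr = scoring_expression(intercept=-1.2, coefficients=model)

# The expression is then shipped to the database, e.g. as the body of a UDF
# or inlined in a query, so all rows are scored in parallel with no data
# movement back to the analytic client:
print(f"SELECT customer_id, {expr} AS churn_score FROM customers;")

# For comparison, the same score computed client-side for one row:
row = {"tenure_months": 24, "monthly_spend": 80.0, "support_calls": 1}
z = -1.2 + sum(w * row[c] for c, w in model.items())
print(1.0 / (1.0 + math.exp(-z)))
```

The win is that the DBMS applies the expression to every row in parallel, instead of dragging all the rows out to the analytic server.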
One obvious question would be: How hard would it be for SAS to replicate this work on other MPP DBMS? Well, at its core this work involves implementing a variety of elementary arithmetic and data manipulation functions. So a first guess is that a fairly efficient port would be easy (given that this one has already been performed), but that the last 20% or whatever of the performance optimizations would require a lot more work. As to whether or not this is more than a theoretical question — well, both SAS and SPSS are disclosed members of the Netezza Developers Network. As for SMP DBMS — well, some of the work certainly could be replicated, but other important parts don’t even make sense on Oracle or Microsoft the way they do on Teradata, Netezza, DATAllegro, et al. Read more
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, SAS Institute, Teradata | 7 Comments |
Marketing versus reality on the one-petabyte barrier
Usually, I don’t engage in the kind of high-speed quick-response blogging I have over the past couple of days from the Teradata Partners conference (and more generally have for the past week or so). And I’m not sure it’s working out so well.
For example, the claim that Teradata has surpassed the one-petabyte mark comes as quite a surprise to a variety of Teradata folks, not to mention at least one reliable outside anonymous correspondent. That claim may indeed be true about raw disk space on systems sold. But the real current upper limit, according to CTO Todd Walter,* is 500-700 terabytes of user data. He thinks half a dozen or so customers are in that range. I’d guess quite strongly that three of those are Wal-Mart, eBay, and an unspecified US intelligence agency.
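A back-of-envelope illustration of why raw disk and user data differ by a large factor; all the overhead figures below are assumptions made up for the example, not Teradata configuration numbers:

```python
# Back-of-envelope illustration of why "a petabyte of disk" is not
# "a petabyte of user data". All overhead figures are assumptions.
raw_disk_tb = 1000          # 1 petabyte of raw disk, as sold
mirroring_factor = 2        # assume RAID-1 style mirroring halves usable space
spool_and_temp_share = 0.3  # assume ~30% reserved for spool/temp/work space
index_and_overhead = 0.15   # assume ~15% more for indexes and system overhead

usable = raw_disk_tb / mirroring_factor
user_data_tb = usable * (1 - spool_and_temp_share) * (1 - index_and_overhead)
print(round(user_data_tb))  # roughly 300 TB of user data under these assumptions
```

The exact ratio depends on the mirroring or RAID scheme, compression, and how much spool a given workload needs, but the direction is always the same: quoted disk capacity overstates user data considerably.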
*Teradata seems to have quite a few CTOs. But I’ve seen things much sillier than that in the titles department, and accordingly shan’t scoff further — at least on that particular subject. 😉
On the other hand, if anybody did want to buy a 10 petabyte system, Teradata could ship them one. And by the way, the Teradata people insist Sybase’s claims in the petabyte area are quite bogus. Teradata claims to have had bigger internal systems tested earlier than the one Sybase writes about.
Categories: Data warehouse appliances, Data warehousing, eBay, Petabyte-scale data management, Specific users, Sybase, Teradata | 3 Comments |
Yet more on petabyte-scale Teradata databases
I managed to buttonhole Teradata’s Darryl MacDonald again, to follow up on yesterday’s brief chat. He confirmed that there is more than one petabyte+ Teradata database out there, of which at least one is commercial rather than government/classified. Without saying who any of them were, he dropped a hint suggestive of Wal-Mart. That makes sense, given that a 423 terabyte figure for Wal-Mart is now three years old, and Wal-Mart is in the news for its 4 petabyte future. Yes, that news has recently tended to mention HP NeoView more than Teradata. But it seems very implausible that a NeoView replacement of Teradata has already happened, even if such a thing is a possibility for the future. So however much data Wal-Mart has right now on its path from 423 terabytes to 4 petabytes and beyond, it probably sits mainly on Teradata machines.
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, HP and Neoview, Petabyte-scale data management, Teradata | 1 Comment |
Hot buzzword — multidimensional partitioning
Teradata finally announced multidimensional range partitioning in Version 12, not that they kept their plans in that regard a big secret. DATAllegro has also shipped multidimensional partitioning to at least one customer. Other vendors — well, I’ll stop there, given my ongoing attitude problems about vendors’ self-defeating NDAs.
Whether or not multidimensional partitioning is a big improvement over single-dimensional will of course depend a great deal on the details of a particular database. Teradata used a figure of 30% performance improvement, but that’s surely just an example. Certainly in some extreme cases one could have a rather large reduction in the amount of data retrieved, and correspondingly a many-fold improvement in the performance of certain important queries. Read more
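Here is a quick illustration of where the bigger wins could come from; the partition counts and the query’s selectivity are made-up numbers for the example, not anything Teradata or DATAllegro reported:

```python
# Illustration of partition pruning with one vs. two partitioning dimensions.
# Partition counts and selectivities are made-up numbers for the example.
date_ranges = 36      # e.g. three years of monthly ranges
region_ranges = 10    # e.g. ten sales regions

# Query constrained to one month and one region:
single_dim_scan = 1 / date_ranges                    # prune on date only
multi_dim_scan = 1 / (date_ranges * region_ranges)   # prune on date AND region

print(f"single-dimension partitioning scans ~{single_dim_scan:.1%} of the table")
print(f"multidimensional partitioning scans ~{multi_dim_scan:.2%} of the table")
# Whether that translates into a 30% or a many-fold query speedup depends on how
# much of the query's cost is the scan itself versus joins, aggregation, etc.
```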
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, DATAllegro, Teradata | Leave a Comment |
Teradata apparently has crossed the petabyte barrier
According to a hurried conversation I had with Chief Marketing Officer Darryl MacDonald, Teradata has customers with over 1 petabyte of user data in a single instance. He wouldn’t disclose any names, but I’d guess one is eBay, which he did confirm is a customer. The intelligence area is another one where I’d speculate there are Very Large Databases.
However, since Darryl mentioned testing systems internally up to 4 petabytes, I’d guess the upper limit of Teradata deployments is in the 1-2 petabyte range.
EDIT: I’m now guessing that Teradata’s largest classified database — which previously was the largest overall — isn’t much over a petabyte in size. And there’s a strong chance this is larger than any unclassified one.
Update: That wasn’t really 1+ petabyte of user data.
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Specific users, Teradata | Leave a Comment |
SAS gets close to the database
One of the big announcements at the Teradata user conference this week (confusingly named “Partners”) is SAS integration. Now, SAS is integrating with other MPP data warehouse appliance vendors as well, but it’s likely that the Teradata integration is indeed the most advanced. For example, one customer proof point offered was an insurer that used this capability to reevaluate its risk profile at high speed after Hurricane Katrina. I doubt any of the other SAS/DBMS integrations I know of were in customer hands a year ago.
Three still-open questions I hope to address over the next couple of days are: Read more