Data warehouse appliances
Analysis of data warehouse appliances – i.e., of hardware/software bundles optimized for fast query and analysis of large volumes of (usually) relational data. Related subjects include:
- Data warehousing
- Parallelization
- Netezza
- DATAllegro
- Teradata
- Kickfire
- (in The Monash Report) Computing appliances in multiple domains
Netezza, enterprise data warehouses, and the 100 terabyte mark
Phil Francisco of Netezza checked in tonight with some news that will be embargoed for a few hours. While I had him on the phone anyway, I asked him about large databases and/or enterprise data warehouses. Highlights included:
- Netezza has one customer with 200 TB of user data. The name is confidential (but he told me who it was).
- Netezza has sold 15 or so of its NPS 10-800s, which are rated at 100 TB capacity.
- The second-largest database in production on Netezza is probably 80 TB or so at Catalina Marketing, which has been a Netezza early adopter all along.
- Netezza’s biggest users typically have a handful (literally — off the top of his head, Phil said “4 to 6”) of applications, each with its own primary set of fact tables.
- Each application-specific set of fact tables in such big-honking-data-mart installations usually consists of either a single fact table or else a small group of fact tables sharing a common hash (i.e., distribution) key. (A sketch of why that shared key matters follows this list.)
- Phil insists Netezza isn’t exaggerating when it claims to have true enterprise data warehouse installations. What he means by an EDW is something that is an enterprise’s primary data warehouse, is used by lots of departments, draws data from lots of sources, has loads going on at various points during the day, and has 100s if not 1000s of total users.
- Netezza’s biggest EDW has about 30 TB of user data. Phil wouldn’t tell me the name of that customer.
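Here's a minimal sketch of why fact tables sharing a common hash (distribution) key matter on a shared-nothing machine: rows with equal key values land on the same node, so those tables can be joined node-locally, with no data shuffled between nodes. The node count and hash function below are illustrative assumptions, not Netezza specifics.

```python
# Illustrative only: co-location of two fact tables distributed on the same key.
import zlib

NUM_NODES = 4

def node_for(key: str) -> int:
    """Map a distribution-key value to the node that stores its rows."""
    return zlib.crc32(key.encode()) % NUM_NODES

orders  = [("cust42", 100.0), ("cust7", 55.0)]
returns = [("cust42", -20.0)]

# Both fact tables are distributed on customer id, so cust42's order rows and
# return rows are guaranteed to sit on the same node and can be joined there.
for customer_id, amount in orders + returns:
    print(customer_id, "-> node", node_for(customer_id))
```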
ParAccel unveils its EMC-related appliance strategy
Embargoes are getting ever more stupid these days, wasting analysts’ and bloggers’ time in doomed attempts to micromanage the news flow. ParAccel is no exception to the rule. An announcement that’s actually been public knowledge for a couple of months was finally made official a few minutes ago. It’s an appliance, or at least an attempt to gain customers for an appliance. The core ideas include:
- ParAccel’s usual shared-nothing configuration is hooked up to SAN-based EMC storage at the back end.
- Around half of the total data is on internal (i.e., node-specific) disks, mirrored on the storage device. The rest of the data lives only on the EMC device. Logically, all this data is integrated. So hopefully you’ll be able to process more data per unit of time than you could on a standard ParAccel configuration.
- Also, different parts of the EMC device are dedicated to different ParAccel nodes. So, while this isn't a shared-nothing architecture, at least it's shared-not-very-much. (DATAllegro does something similar, although without the mirroring on direct-attached storage.) A sketch of this layout follows this list.
- Backup, snapshotting, and so on are inherited from EMC. Administration will increasingly be integrated with EMC’s.
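Here's a minimal sketch of that "shared-not-very-much" layout: each node owns its own internal disks plus a dedicated slice of the SAN, and roughly half of its data is mirrored on both. The split, naming, and placement policy below are my assumptions for illustration, not ParAccel's actual design.

```python
# Illustrative only: per-node data placement across local disk and a dedicated SAN slice.
NODES = ["node1", "node2", "node3"]

def place(partitions):
    """Assign each partition to one node; mirror about half locally as well as on the SAN."""
    layout = {node: {"local_and_san": [], "san_only": []} for node in NODES}
    for i, part in enumerate(partitions):
        node = NODES[i % len(NODES)]                       # round-robin across nodes
        tier = "local_and_san" if i % 2 == 0 else "san_only"
        layout[node][tier].append(part)
        # Either way, the partition is reachable only through its owning node,
        # because that node's slice of the SAN is dedicated to it.
    return layout

print(place([f"partition_{n}" for n in range(6)]))
```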
Yet another data warehouse database and appliance overview
For a recent project, it seemed best to recapitulate my thoughts on the overall data warehouse specialty DBMS and appliance marketplace. While what resulted is highly redundant with what I’ve posted in this blog before, I’m sharing it anyway, in case somebody finds this integrated presentation more useful. The original is excerpted here, with the confidential parts removed.
… This is a crowded market, with a lot of subsegments, and blurry, shifting borders among the subsegments.
… Everybody starts out selling consumer marketing and telecom call-detail-record apps. …
Oracle and similar products are optimized for updates above everything else. That is, short rows of data are banged into tables. The main indexing scheme is the “b-tree,” which is optimized for finding specific rows of data as needed, and also for being updated quickly in lockstep with updates to the data itself.
By way of contrast, an analytic DBMS is optimized for some or all of:
- Small numbers of bulk updates, not large numbers of single-row updates.
- Queries that may involve examining or returning lots of data, rather than finding single records on a pinpoint basis.
- Doing calculations on the data – commonly simple arithmetic, sorts, etc. (A toy illustration of the contrast with OLTP follows this list.)
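To make that contrast concrete, here is a toy sketch of the two access patterns: an OLTP-style point lookup via an index versus an analytic query that scans and aggregates a whole table. It is purely illustrative and not any particular DBMS's implementation.

```python
# Illustrative only: point lookup vs. scan-and-aggregate.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.0},
    {"order_id": 3, "region": "EMEA", "amount": 45.0},
]

# OLTP-style: a (b-tree-like) index finds one short row by key.
index_on_order_id = {row["order_id"]: row for row in rows}
print(index_on_order_id[2])

# Analytic-style: examine every row, returning an aggregate rather than a single record.
emea_total = sum(r["amount"] for r in rows if r["region"] == "EMEA")
print(emea_total)  # 165.0
```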
Database and/or DBMS design techniques that have been applied to analytic uses include: Read more
DATAllegro finally has a blog
It took a lot of patient nagging, but DATAllegro finally has a blog. Based on the first post, I predict:
- DATAllegro’s blog will live up to CEO Stuart Frost’s talent for clear, interesting writing.
- Like a number of other vendor blogs — e.g., Netezza’s — DATAllegro’s will have infrequent but usually long posts.
The crunchiest part of the first post is probably this:
Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area – even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk. The end result is that most large DW installations have very large arrays of expensive, high-speed disks behind them – and still suffer from poor performance.
I’ve pounded the table about sequential reads multiple times — including in a (DATAllegro-sponsored) white paper — but the point about misleading management tools is new to me.
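To see why chopped-up reads hurt so much, here's a back-of-the-envelope illustration: the same 1 GB read is fast if it stays sequential, but slow if the storage stack breaks it into small random reads. The disk numbers are assumed, typical-for-the-era figures, not measurements of any particular array.

```python
# Illustrative arithmetic only; all hardware figures below are assumptions.
GB = 1024 ** 3
SEQUENTIAL_MBPS = 100          # assumed sustained sequential transfer rate
SEEK_SECONDS = 0.005           # assumed seek + rotational latency per random I/O
RANDOM_READ_BYTES = 64 * 1024  # assumed size the reads get chopped into

sequential_time = GB / (SEQUENTIAL_MBPS * 1024 ** 2)
num_random_reads = GB / RANDOM_READ_BYTES
random_time = num_random_reads * (SEEK_SECONDS + RANDOM_READ_BYTES / (SEQUENTIAL_MBPS * 1024 ** 2))

print(f"Sequential read of 1 GB: ~{sequential_time:.0f} s")                 # ~10 s
print(f"Same 1 GB as 64 KB random reads: ~{random_time:.0f} s")             # ~92 s
```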
Now if I could just get a production DATAllegro reference, I’d be completely happy …
Netezza pricing
In connection with the announcement of the Teradata 2500, I asked some Teradata competitors about pricing. Netezza’s response amounted to “We don’t disclose list pricing, but our cheapest system handles about 3 1/4 TB and sells for under $200K.” That works out to roughly $60K per terabyte of user data, so Netezza’s actual pricing is well below the Teradata 2500’s $125K/TB list price.
Teradata introduces lower-cost appliances
After months of leaks, Teradata has unveiled its new lines of data warehouse appliances, raising the total number either from 1 to 3 (my view) or 0 to 2 (what you believe if you think Teradata wasn’t previously an appliance vendor). Most significant is the new Teradata 2500 series, meant to compete directly with the smaller data warehouse specialists. Highlights include:
- An oddly precise estimated capacity of “6.12 terabytes”/node (user data). This estimate is based on 30% compression, which is low by industry standards, and surely explains part of the price umbrella the Teradata 2500 is offering other vendors.
- $125K/TB of user data. Obviously, list pricing and actual pricing aren’t the same thing, and many vendors don’t even bother to disclose official price lists. But the Teradata 2500 seems more expensive than most smaller-vendor alternatives; a back-of-the-envelope comparison follows this list.
- Scalability up to 24 nodes (>140 TB).
- Full Teradata application-facing functionality. Some of Teradata’s rivals are still working on getting all of their certifications with tier-1 and tier-2 business intelligence tools. Teradata has a rich application ecosystem.
- Performance that will be controversial until customer-benchmark trends clearly emerge.
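Here is a quick back-of-the-envelope check of the figures above. All inputs are the numbers quoted in this post and in the “Netezza pricing” note earlier on this page; nothing here is an additional measurement.

```python
# Sanity-check the quoted Teradata 2500 and Netezza figures.
TB_PER_NODE = 6.12               # Teradata's stated user-data capacity per node
MAX_NODES = 24
TERADATA_LIST_PER_TB = 125_000   # $/TB of user data, list

print(f"Max configuration: {TB_PER_NODE * MAX_NODES:.1f} TB")   # ~146.9 TB, i.e. ">140 TB"

# Netezza's entry-level system: "about 3 1/4 TB ... under $200K"
netezza_per_tb = 200_000 / 3.25
print(f"Netezza entry system: ~${netezza_per_tb:,.0f}/TB vs Teradata list ${TERADATA_LIST_PER_TB:,}/TB")
```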
Kickfire kicks off
I chatted with Raj Cherabuddi and others on the Kickfire (formerly C2) team for over an hour on Monday, and now have a better sense of their story. There are some very basic questions I still don’t have answers to; I’ll fill those in when I can.
Highlights of what I have and haven’t figured out so far include:
- Kickfire’s technology has two main parts: a SQL co-processor chip and a MySQL storage engine.
- Kickfire makes a Type 0 appliance. If I understood correctly, it contains the chip, a couple of standard CPU cores, and 64 gigs of RAM. Or else it contains just the chip, and is meant to be hooked up to a 2U box with 64 gigs of RAM. I’m confused.
- The Kickfire box can handle up to 3 terabytes of user data. The disk required for that is 4-5 terabytes without redundancy, 2X with. Based on that formulation and other clues, I’m guessing Kickfire – unlike other appliance vendors – doesn’t build in storage itself.
- I don’t know whether the Kickfire chip is true custom silicon or an FPGA emulation.
- The essential idea of the chip is dataflow programming for SQL, with pipelining between operations. This eliminates the overhead of registers and context switching. I don’t know what the trade-offs are, if any.
- Kickfire’s database software is columnar, operating on compressed data even in RAM. In that, Kickfire’s story is most similar to Vertica’s, although I’m guessing Exasol may do something similar as well. Like Vertica, Kickfire uses multiple compression methods (they’re reluctant to give detail, but agreed it would be fair to say they use both something like dictionary/token and something like delta compression). A sketch of those two compression styles appears below, after this list.
- Kickfire’s software is ACID-compliant. You can do incremental loads or trickle feeds. Bulk load speed is 100 GB/hour. Kickfire’s solution for the traditional problem of updating column stores is called “snapshots.” Without giving details, they position that as similar to the Vertica solution.
- Like other MySQL storage engines, Kickfire’s inherits whatever data connectivity, stored procedure capabilities, user-defined function support, etc., MySQL has.
- Kickfire has no paying customers, but does have a slide showing many logos of “prospects and beta customers.”
- Kickfire has no MPP capabilities at this time, but says adding those is “on the roadmap” and will be “easy.”
- Kickfire submitted a 100 GB TPC-H result, in which it beat the previous leaders – Exasol, ParAccel, and Microsoft – on price/performance, and lagged only Exasol and ParAccel on absolute performance. Kickfire is extremely proud of this. Indeed, I don’t recall another vendor ascribing that much weight to TPC results in the entire history of TPCs.* Kickfire seems unfazed by the fact that its result is for a system listed with a ship date 6 months in the future (I’m guessing that’s the latest the TPC will allow), while the other results are for systems available today.
*Somebody – perhaps adman extraordinaire Rick Bennett? — may want to check my memory on this, but I think Oracle’s famed “Gentlemen, start your snails” ad in the early 1990s was about PC World tests, not TPCs. Oracle also had an ad about WW1-style planes nosediving, but I don’t think those referenced TPCs either.
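As promised above, here is a minimal sketch of the two generic column-store compression styles mentioned in the bullet list (dictionary/token and delta encoding). It is illustrative only and not Kickfire's actual implementation, whose details the company declined to give.

```python
# Illustrative only: generic dictionary/token and delta compression for columns.

def dictionary_encode(column):
    """Replace each value with a small integer token plus a lookup dictionary."""
    dictionary = {}
    tokens = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        tokens.append(dictionary[value])
    return dictionary, tokens

def delta_encode(column):
    """Store the first value, then only the difference from the previous value."""
    deltas = [column[0]]
    for prev, curr in zip(column, column[1:]):
        deltas.append(curr - prev)
    return deltas

# Example: a low-cardinality string column and a mostly-increasing numeric column.
states = ["CA", "CA", "NY", "CA", "TX", "NY"]
order_ids = [10001, 10002, 10003, 10007, 10011]

print(dictionary_encode(states))   # ({'CA': 0, 'NY': 1, 'TX': 2}, [0, 0, 1, 0, 2, 1])
print(delta_encode(order_ids))     # [10001, 1, 1, 4, 4]
```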
Kickfire is de-cloaking
Kickfire, the renamed C2, is doing one of those buzz-building rollouts in which the first public word comes from golly-gee-whizzing people on its payroll. You can see those posts at Xarpb and Diamond Notes, as well as in a forthcoming article in MySQL magazine. Farhan Mashraqi also appears to be involved. Kickfire is also sponsoring the MySQL user conference next week.
I plan to write more after I get some substance, but a few things seem clear:
1. Kickfire’s product is an appliance that functions as a MySQL storage engine.
2. There’s a custom chip involved.
3. Kickfire plans to throw around the “stream processing” buzzphrase a lot.
Now, “stream processing” means a lot of different things to different people. E.g., Netezza uses the phrase just because their FPGA throws away a lot of data before ever routing it to more conventional SQL processing. But pending a briefing, I’m guessing that Kickfire’s sense is similar to what underlies the case for using CEP in BI.
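For what it's worth, here is a minimal sketch of that "filter early, then process" sense of the phrase: the sort of thing Netezza's FPGA does in hardware, discarding non-matching rows before they ever reach the conventional SQL engine. It is purely illustrative; the names and data are made up.

```python
# Illustrative only: drop non-matching rows as they stream off storage,
# so only qualifying rows reach the downstream processing step.

def scan(rows, predicate):
    """Stream rows from storage, discarding non-matching ones immediately."""
    for row in rows:
        if predicate(row):
            yield row

rows = [
    {"store": 1, "amount": 25.0},
    {"store": 2, "amount": 17.5},
    {"store": 1, "amount": 40.0},
]

# Only rows for store 1 ever reach the aggregation step.
total = sum(row["amount"] for row in scan(rows, lambda r: r["store"] == 1))
print(total)  # 65.0
```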
Edit: Here’s an update after an actual Kickfire briefing.
Positioning the data warehouse appliances and specialty DBMS
There now are four hardware vendors that each offer or seem about to announce two different tiers of data warehouse appliances: Sun, HP, EMC, and Teradata. Specifically:
- Sun partners with both Greenplum and ParAccel.
- HP sells Neoview, and also is partnered with Vertica.
- EMC (together with Dell in North America and Bull in Europe) sells DATAllegro. Now EMC is also entering a partnership with ParAccel.
- Teradata is pretty far down the road toward releasing a low-end product.
EMC is partnering with ParAccel
A talk about a ParAccel/EMC partnership has been promised for a forthcoming EMC user conference. Otherwise, ParAccel is disclosing no useful information on the matter.*
*So what else is new?
The talk is called Highly Scalable Analytic Appliance Powered by EMC and ParAccel, and the abstract says: Read more