Data warehouse appliances
Analysis of data warehouse appliances – i.e., of hardware/software bundles optimized for fast query and analysis of large volumes of (usually) relational data. Related subjects include:
- Data warehousing
- Parallelization
- Netezza
- DATAllegro
- Teradata
- Kickfire
- (in The Monash Report) Computing appliances in multiple domains
A quick survey of data warehouse management technology
There are at least 16 different vendors offering appliances and/or software that do database management primarily for analytic purposes.* That’s a lot to keep up with. So I’ve thrown together a little overview of the analytic data management landscape, liberally salted with links to information about specific vendors, products, or technical issues. In some ways, this is a companion piece to my prior post about data warehouse appliance myths and realities.
*And that’s just the tabular/alphanumeric guys. Add in text search and you run the total a lot higher.
Numerous data warehouse specialists offer traditional row-based relational DBMS architectures, but optimize them for analytic workloads. These include Teradata, Netezza, DATAllegro, Greenplum, Dataupia, and SAS. All of those except SAS are wholly or primarily vendors of MPP/shared-nothing data warehouse appliances. EDIT: See the comment thread for a correction re Kognitio.
Numerous data warehouse specialists offer column-based relational DBMS architectures. These include Sybase (with the Sybase IQ product, originally from Expressway), Vertica, ParAccel, Infobright, Kognitio (formerly White Cross), and Sand.
Netezza rolls out its compression story
The proximate cause for today’s flurry of Netezza-related posts is that the company has finally rolled out its compression story. In a nutshell, Netezza has developed its own version of columnar delta compression, slated to ship in May 2008. It compresses 2-5X, with the factor sometimes going up into double digits. Netezza estimates this produces a 2-3X improvement in overall performance, with the core marketing claim being that performance will “double” from compression alone.
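To make the idea concrete, here is a minimal sketch of delta encoding, the core trick behind columnar delta compression. This is illustrative only, not Netezza’s actual (unpublished) algorithm: instead of storing each value in a column outright, you store the difference from the previous value, and on sorted or slowly-changing columns those differences are small enough to pack into far fewer bytes.

```python
# Illustrative delta encoding for one column of values.
# NOT Netezza's actual algorithm -- just the general technique.

def delta_encode(column):
    """Return the first value followed by successive differences."""
    if not column:
        return []
    deltas = [column[0]]
    for prev, cur in zip(column, column[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    column, total = [], 0
    for d in deltas:
        total += d
        column.append(total)
    return column

# Timestamps seconds apart: wide absolute values, tiny deltas.
ts = [1_200_000_000, 1_200_000_003, 1_200_000_004, 1_200_000_009]
enc = delta_encode(ts)
assert delta_decode(enc) == ts
print(enc)  # [1200000000, 3, 1, 5] -- only the first entry is wide
```

A real system would follow this with a variable-width or bit-packed encoding of the small deltas, which is where the 2-5X space savings comes from.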
ANALYTIC is the antonym of TRANSACTIONAL
In 1993, Ted Codd introduced the term OLAP (OnLine Analytic Processing) to describe data management that wasn’t optimized for OLTP (OnLine Transaction Processing). Later in the 1990s, Henry Morris of IDC introduced the term analytic applications to describe apps that weren’t transactional. Since then, no better word than “analytic” has emerged to cover the broad class of IT apps and technologies that aren’t focused on transactional processing.
In the latest incarnation, analytic appliances are coming to the fore.
Netezza is finally opening the kimono
I’ve bashed Netezza repeatedly for secrecy and obscurity about its technology and technical plans. Well, they’re getting a lot better. The latest post in a Netezza company blog, by marketing exec Phil Francisco, lays out their story clearly and concisely. And it’s backed up by a white paper that does more of the same. In particular, Page 11 of that white paper spells out possible future directions for enhancement, such as better compression, encryption, join filtering, and Netezza Developer Network stuff.
The Netezza strategy for data shipping
I talked with Netezza today, and finally understand better why they don’t have node-to-node data shipping problems with only 1-gigabit (gigE) interconnects:
- Netezza boxes have lots of relatively small nodes, so all else being equal, each individual node has less communicating to do than, say, a DATAllegro node does.
- It’s not just 1-gigabit. There’s a hierarchical communications architecture, and at one level in the hierarchy switches are talking to each other through 32 parallel 1-gigabit channels at a time.
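The arithmetic behind that second point is simple, and worth spelling out. A quick back-of-envelope calculation (illustrative only; real-world throughput depends on framing overhead and topology) shows why 32 bonded 1-gigabit channels behave very differently from a single gigE link:

```python
# Back-of-envelope aggregate bandwidth for 32 parallel 1-gigabit links.
# Illustrative arithmetic; actual switch throughput is lower due to
# Ethernet framing overhead and contention.

LINKS = 32
GBIT_PER_LINK = 1               # gigabits per second per channel

aggregate_gbit = LINKS * GBIT_PER_LINK   # 32 Gbit/s in aggregate
aggregate_gbyte = aggregate_gbit / 8     # roughly 4 gigabytes/s

print(aggregate_gbit, "Gbit/s =", aggregate_gbyte, "GB/s")
```

So at the switch-to-switch level of the hierarchy, the effective interconnect is closer to 32 Gbit/s than to the 1 Gbit/s the headline "gigE" figure suggests.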
Data warehouse appliances – fact and fiction
Borrowing the “Fact or fiction?” meme from the sports world:
- Data warehouse appliances have to have specialized hardware. Fiction. Indeed, most contenders except Teradata and Netezza — for example, DATAllegro, Vertica, ParAccel, Greenplum, and Infobright — offer Type 2 appliances. (Dataupia is another exception.)
- Specialized hardware is a dead-end for data warehouse appliances. Fiction. If it were easy for Teradata to replace its specialized switch technology, it would have done so a decade ago. And Netezza’s strategy has a lot of appeal.
- Data warehouse appliances are nothing new, and failed long ago. Fiction, but only because of Teradata. 1980s appliance pioneer Britton-Lee didn’t do so well (it was actually bought by Teradata). IBM and ICL (Britain’s national-champion hardware company) had content-addressable data store technology that went nowhere.
- Since data warehouse appliances failed long ago, they’ll fail now too. Fiction. Shared-nothing MPP is a fundamental advantage of appliances. So are various index-light strategies. Data warehouse appliances are here to stay.
- Data warehouse appliances only make sense if your main database management system can’t handle the job. Fiction. There are dozens of data warehouse appliances managing under 5 terabytes of user data, if not under 1 terabyte. True, some of them are legacy installations, dating back to when Oracle couldn’t handle that much data well itself. But new ones are still going in. Even if Oracle or Microsoft SQL Server can do the job, a data warehouse appliance is often a far superior — cheaper, easier to deploy and keep running, and/or better performing — alternative.
- Data warehouse appliances are just for data marts. For your full enterprise data warehouse, use a conventional DBMS. Part fact, part fiction. It depends on the appliance, and on the complexity of your needs. Teradata systems can do pretty much everything. Netezza and DATAllegro, two of the oldest data warehouse appliance startups, have worked hard on their concurrency issues and now can support fairly large user or reporting loads. They also can handle reasonable volumes of transactional or trickle-feed updates, and probably can support full EDW requirements for decent-sized organizations. Even so, there are some warehouse use cases for which they’re ill-suited. Newer appliance vendors are more limited yet.
- Analytic appliances are just renamed data warehouse appliances. Fact, even if misleading. Netezza is using the term “analytic appliance” to highlight additional things one can do on its boxes beyond answering queries. But those are still operations on a data mart or data warehouse.
- Teradata is the leading data warehouse appliance vendor. More fact than fiction. Some observers say that Teradata systems aren’t data warehouse appliances. But I think they are. Competitors may be superior to Teradata in one or the other characteristic trait of appliances – e.g., speed of installation – but it’s hard to define “appliances” in an objective way that excludes Teradata.
If you liked this post, you might also like one on text mining fact and fiction.
Netezza has another big October quarter
Netezza reported a big October quarter, ahead of expectations. And official guidance for next quarter is essentially flat quarter-over-quarter, suggesting Q3 was indeed surprisingly big. However, Netezza’s year-over-year growth for Q3 was a little under 50%, suggesting the quarter wasn’t so remarkable after all. (Netezza has a January fiscal year.)
Tentative conclusion: Netezza just tends to have big October quarters, perhaps by timing sales cycles to finish soon after the late September user conference. If Netezza’s user conference ever moves to later in the fall, expect Q3 to be weak that year.
Netezza reported 18 new customers, double last year’s figure.
Vertica update – HP appliance deal, customer information, and more
Vertica quietly announced an appliance bundling deal with HP and Red Hat today. That got me quickly onto the phone with Vertica’s Andy Ellicott, to discuss a few different subjects. Most interesting was the part about Vertica’s customer base, highlights of which included:
- Vertica’s claim to have “50” customers includes a bunch of unpaid licenses, many of them in academia.
- Vertica has about 15 paying customers.
- Based on conversations with mutual prospects, Vertica believes that’s more customers than DATAllegro has. (Of course, each DATAllegro sale is bigger than one of Vertica’s. Even so, I hope Vertica is wrong in its estimate, since DATAllegro told me its customer count was “double digit” quite a while ago.)
- Most Vertica customers manage over 1 terabyte of user data. A couple have bought licenses showing they intend to manage 20 terabytes or so.
- Vertica’s biggest customer/application category – existing customers and sales pipelines alike – is call detail records for telecommunications companies. (Other data warehouse specialists also have activity in the CDR area.) Major applications are billing assurance (getting the inter-carrier charges right) and marketing analysis. Call center uses are still in the future.
- Vertica’s other big market to date is investment research/tick history. Surely not coincidentally, this is a big area of focus for Mike Stonebraker, evidently at both companies for which he’s CTO. (The other, of course, is StreamBase.)
- Runners-up in market activity are clickstream analysis and general consumer analytics. These seem to be present in Vertica’s pipeline more than in the actual customer base.
Netezza cites three warehouses over 50 terabytes
Netezza is finally making it clear that they run some largish warehouses. Their latest press release cites Catalina Marketing, Epsilon, and NYSE Euronext as having 50+ terabytes each. I checked with Netezza’s Marketing VP Ellen Rubin, and she confirmed that those are clean figures — user data, single warehouses, etc. Ellen further tells me that Netezza’s total count of warehouses that big is “significantly more” than the 3 named in the release.
Of course, this makes sense, given that Netezza’s largest box, the NPS 10800, runs 100 terabytes. And Catalina was named as having bought a 10800 in a press release back in December 2006.
ParAccel opens the kimono slightly
Please do not rely on the parts of this post that draw a distinction between in-memory and disk-based operation. See our February 18, 2008 post about ParAccel instead. It turns out that communication with ParAccel was yet worse than I had realized.
Officially launched today at the TDWI conference, ParAccel is out to compete with Netezza. Right out of the chute, ParAccel may have surpassed Netezza in at least one area: pointlessly annoying secrecy. (In other regards I love them dearly, but that paranoia can be a real pain.) As best I can remember, here are some things about ParAccel that I both am allowed to say and find interesting:
- ParAccel offers a columnar, MPP data warehouse DBMS, called the ParAccel Analytic Database.
- ParAccel’s product runs in two main modes. “Maverick” is normal, stand-alone mode. “Amigo” mode amounts to a plug-compatible accelerator for Oracle or Microsoft SQL Server. Early sales and marketing were concentrated on SQL Server Amigo mode.
- ParAccel’s product also runs in another pair of modes – in-memory and disk-based. Early sales and marketing were concentrated on in-memory mode. Hybrid memory-centric processing sounds like something for a future release.
- Sun has a reseller partnership with ParAccel, focused on in-memory mode.
- Sun and ParAccel published record-shattering 100 gigabyte, 300 gigabyte, and 1 terabyte TPC-H benchmarks today, based on in-memory mode. (If you’d like to throw 13 terabytes of disk at 1 terabyte of user data, running simple and repetitive queries, that benchmark might be a useful guide to your own experience. But hey – that’s a big improvement on the prior champion, who used 40 terabytes of disk. To ParAccel’s credit, they’re not pretending that this is a bigger deal than it is.)
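For perspective on that parenthetical, the disk-to-user-data ratios are easy to compute from the figures cited above (simple arithmetic on the numbers as reported in the post):

```python
# Disk-to-user-data ratios for the 1 TB TPC-H results cited above.
# Figures are as reported in the post; this is just arithmetic.

user_data_tb = 1
paraccel_disk_tb = 13
prior_champion_disk_tb = 40

paraccel_ratio = paraccel_disk_tb / user_data_tb         # 13:1
prior_ratio = prior_champion_disk_tb / user_data_tb      # 40:1

print(f"ParAccel: {paraccel_ratio}:1 vs prior champion: {prior_ratio}:1")
```

A 13:1 ratio is indeed a large improvement over 40:1, but both illustrate how far benchmark configurations can sit from typical production hardware budgets.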