Greenplum
Analysis of data warehouse DBMS vendor Greenplum and its successor, EMC’s Data Computing division. Related subjects include:
- EMC, which bought Greenplum in 2010
- Data warehousing
- Data warehouse appliances
- PostgreSQL
Greenplum’s strategy
I talked with Greenplum honchos Bill Cook and Scott Yara yesterday. Bill is the new CEO, formerly head of Sun’s field operations. Scott is president, and in effect the marketing-guy co-founder. I still don’t know whether I really believe their technical story. But I do think I have a feel for what they’re trying to do. Key aspects of the Greenplum strategy include:
- Greenplum rewrote a lot of PostgreSQL to parallelize it, in the correct belief that MPP is the best way to go for high-end data warehousing.
- Indeed, Greenplum claims to have a general solution to DBMS parallelization. Unlike Netezza, DATAllegro, Vertica, and Kognitio, Greenplum offers a row-oriented data store with a fairly full set of indexing techniques. You want star indices or bitmaps? They have them. (They even claimed to be used for some text management when last we talked, although that was for O’Reilly and Mark Logic seems to be O’Reilly’s main text-indexing vendor.)
- Greenplum’s main sales strategy is to be part of Sun’s product line, bundled into Thumper boxes as single-part-number Sun offerings. They certainly could add other hardware OEMs, just as Check Point sells firewalls through multiple appliance vendors. But at least for now it’s all about Sun.
Categories: Data warehouse appliances, Data warehousing, Greenplum, Open source, PostgreSQL | 5 Comments |
Really big databases
Business Intelligence Lowdown has a well-dugg post listing what it claims are the 10 largest databases in the world. The accuracy leaves much to be desired, as is illustrated by the fact that #10 on the list is only 20 terabytes, while entirely unmentioned is eBay’s 2-petabyte database (mentioned here, and also here).
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Greenplum, IBM and DB2, Netezza, Oracle, SAS Institute, Teradata, Theory and architecture | 4 Comments |
Data warehouse appliance hardware strategies
Recently, I’ve done extensive research into the hardware strategies of computing appliance vendors, across multiple functional areas. Data warehousing, firewall/unified threat management, antispam, data integration – you name it, I talked to them. Of course, each vendor has a unique twist. But some architectural groupings definitely emerged.
The most common approaches seem to be:
Type 1: Custom assembly from off-the-shelf parts. In this model, the only unusual (but still off-the-shelf) parts are usually in the area of network acceleration (or occasionally encryption). Also, the box may be balanced differently than standard systems, in terms of compute power and/or reliability.
Type 2 (Virtual): We don’t need no stinkin’ custom hardware. In this model, the only “appliancy” features are in the areas of easy deployment, custom operating systems, and/or preconfigured hardware.
And of course there are also appliances of Type 0: Custom hardware including proprietary ASICs or FPGAs.
Different markets had different emphases; e.g., firewall appliances are typically Type 1, while antispam devices cluster in Type 2. But the data warehouse appliance market is highly diverse, which maybe shouldn’t be a surprise. After all, the revenue market leader is non-appliance software vendor Oracle, while noisy upstart Netezza is famous for its FPGA.
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Greenplum, IBM and DB2, Kognitio, Netezza, Teradata | 8 Comments |
Introduction to Kognitio WX-2
Kognitio called me for a briefing this morning on their WX-2 product. Technical highlights included:
- Their core technology is MPP/shared-nothing data warehousing.
- Unlike most other vendors (but like Greenplum), they are available software-only.
- Like DATAllegro and Netezza, they have no global indexing.
- Unlike the other MPP players, they don’t hash partition the data and lead with hash joins. Rather, they have local compressed bitmap indices on every node.
- Similarly, they distribute data utterly randomly and have no concept of range partitioning whatsoever.
- Probably for that reason, WX-2 reads data in small 32K blocks. This forfeits the benefit of sequential reads, unless David Aldridge is correct that Linux can take care of that on its own.
- They seem more chip-heavy than DATAllegro and Netezza. A dual-core Opteron blade with 16 or 32 gigabytes of RAM talks to 144, 288, or in some cases 600 gigabytes of disk (before mirroring).
- They position themselves somewhat as being a memory-centric product supplier. While I suspect this is exaggerated, it probably indicates that they’ve put some work into managing RAM as well as disk.
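To make the hash-versus-random distinction in the bullets above concrete, here is a minimal Python sketch. It is not Kognitio's or any other vendor's code; the node count, the row layout, and every function name are hypothetical, chosen only for illustration. It shows why hash-distributing on a join key keeps all rows for a key on one node, while random distribution scatters them.

```python
import random
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, not taken from any vendor


def hash_node(key: str, num_nodes: int = NUM_NODES) -> int:
    """Deterministic placement: the same key always maps to the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute_hashed(rows, key_index=0, num_nodes=NUM_NODES):
    """Place each row on the node chosen by hashing its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash_node(row[key_index], num_nodes)].append(row)
    return nodes


def distribute_random(rows, num_nodes=NUM_NODES, seed=0):
    """Place each row on a randomly chosen node, ignoring its contents."""
    rng = random.Random(seed)
    nodes = defaultdict(list)
    for row in rows:
        nodes[rng.randrange(num_nodes)].append(row)
    return nodes


def nodes_holding_key(nodes, key, key_index=0):
    """Return the set of node ids that hold at least one row with this key."""
    return {n for n, rows in nodes.items()
            if any(r[key_index] == key for r in rows)}


# Hypothetical fact table: 200 order rows spread over 10 customer keys.
orders = [(f"cust{i % 10}", i) for i in range(200)]
hashed = distribute_hashed(orders)
randomized = distribute_random(orders)

print("nodes holding cust3 (hashed):", len(nodes_holding_key(hashed, "cust3")))
print("nodes holding cust3 (random):", len(nodes_holding_key(randomized, "cust3")))
```

Under hash distribution, a join on the distribution key can run node-locally, which is why most MPP vendors lead with hash joins. Under random distribution the rows for a given key will generally sit on several nodes, so the system needs some other way to find them, such as the local bitmap indices described above.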
Much like the other “new” MPP data warehouse vendors, Kognitio claims to never have knowingly been outbenchmarked, whether on performance or on TCO factors such as ease of installation.
Categories: Data warehouse appliances, Data warehousing, Greenplum, Kognitio, Memory-centric data management | 11 Comments |
Competitive issues in data warehouse ease of administration
The last person I spoke with at the Netezza conference on Tuesday was a customer/presenter that the company had picked out for me. One thing he said baffled me: he claimed that Netezza was a real appliance vendor, but DATAllegro wasn’t, presumably due to administrability issues. Now, it wasn’t clear to me that he’d ever evaluated DATAllegro, so I didn’t take this too seriously, but still the exchange brought into focus the great differences between data warehouse products in the area of administration. For example:
- Netezza has no indices at all. And no caches. And the hardware is preconfigured. This all makes administration pretty simple.
- DATAllegro has almost no indices, and also has preconfigured hardware. But it has some partitioning, optionally.
- Teradata also has preconfigured hardware. It does have indices, but rather simple ones. Plus it has join indices. And it has a few more configuration options in other areas (e.g., block size) than the other appliance vendors. (Yes, I count Teradata among the appliances.)
- If you go through all the fuss of installing SAP’s applications and BI technology anyway, the incremental administration of just SAP BI Accelerator is pretty light.
- Oracle and IBM have mammothly complex indexing options, but have put large amounts of work into tools to lessen the resulting administrative burden.
- IBM offers preconfigured hardware units to simplify some installation issues.
- Come to think of it, I don’t really know how hard it is to administer columnar systems (e.g., Sybase IQ).
Categories: Data warehouse appliances, Data warehousing, DATAllegro, Greenplum, IBM and DB2, Netezza, Oracle, SAP AG, Teradata | 3 Comments |
Introduction to Greenplum and some compare/contrast
Netezza relies on FPGAs. DATAllegro essentially uses standard components, but those include InfiniBand cards (and there’s a little FPGA action when they do encryption). Greenplum, however, claims to offer a highly competitive data warehouse solution that’s so software-only you can download it from their web site. That said, their main sales mode seems to also be through appliances, specifically ones branded and sold by Sun, combining Greenplum and open source software on a “Thumper” box. And the whole thing supposedly scales even higher than DATAllegro and Netezza, because you can manage over a petabyte if you chain together a dozen of the 100-terabyte racks.
Categories: Actian and Ingres, Data warehouse appliances, DATAllegro, Greenplum, Netezza, Open source, PostgreSQL | 4 Comments |