Introduction to Greenplum and some compare/contrast
Netezza relies on FPGAs. DATallegro essentially uses standard components, but those include Infiniband cards (and there’s a little FPGA action when they do encryption). Greenplum, however, claims to offer a highly competitive data warehouse solution that’s so software-only you can download it from their web site. That said, their main sales mode seems to also be through appliances, specifically ones branded and sold by Sun, combining Greenplum and open source software on a “Thumper” box. And the whole thing supposedly scales even higher than DATallegro and Netezza, because you can manage over a petabyte if you chain together a dozen of the 100 terabyte racks.
As often happens in introductory calls, I came away from my first Greenplum conversation with a much better sense of what they're claiming than of why it's believable, how they've achieved what they appear to have, or where the "gotchas" are. Anyhow, here are some highlights of their story so far:
- They offer a proprietary, extended version of PostgreSQL, called Bizgres. As is common in open source-based projects, there are a lot of wrinkles as to what’s closed source, what they’ve created themselves and donated to the open source community, etc., etc.
- Bizgres comes in two flavors. Generic Bizgres is free to download and use to manage up to a few hundred gigabytes of data. It runs on a single processor. Bizgres MPP is free to download and develop with, but costs money to deploy.
- The company has had a somewhat checkered history. In its current setup it’s located in San Mateo, has 30 employees, has a partner with 8 engineers providing Tier 1 and Tier 2 support, and has closed 11 customers in the past 5 months.
- They added some basic data warehousing capabilities to PostgreSQL, such as range partitioning, and bitmap indexes that work for cardinalities up to 10,000 or so (see the sketch after this list for why cardinality matters). Note that these probably are not used by Netezza, which built its system on an older version of PostgreSQL, although as usual I'm not sure about anything technical at Netezza, due to their lack of interest in having their technology analyzed. (DATallegro is built on Ingres.)
- Like DATallegro, they previously had an architecture in which queries that couldn’t be executed in one partition sent partial results to a “fat head” node that did the rest of the work – but subsequently have adopted a more sophisticated parallelization strategy. However, they talk of “query shipping,” while DATallegro stresses “repartitioning” of the database, so I suspect their approach is somewhat different, although one way or the other lots of data has to be shipped from node to node. But I’m not clear on the details of how this works in the Greenplum case.
- Like Netezza and unlike DATallegro, they think gigabit Ethernet is just dandy for the internode data transport. DATallegro, however, prefers Infiniband, because it creates almost no processor load (reportedly zero direct load, and no more than a 1% effective processor slowdown), while gigabit Ethernet can slow processors by a factor of two in the worst case.
- The Sun appliance comes in 10, 40, and 100 terabyte sizes. (That’s actual warehouse size. Disk space is more than twice as much, of course.) One customer is seriously evaluating a 200 terabyte configuration. I think the scalability past that is largely theoretical at this point.
- List prices, if they recall correctly, are $370K, $700K, and $1.8 million for the 10, 40, and 100 terabyte versions. I didn't get a sense of performance. (DATallegro, of course, offers very different data capacities for the same or similar prices, at different performance points.) Uh, I forgot to ask how much of those 100 terabytes is typically data and how much is index, which I should have done because:
- Unlike Netezza and DATallegro, Greenplum thinks indexes are fine and dandy. As an example, they point out that using an index for a time series in no way interferes with sequential data access. In general, they think it’s important that they merely extend the underlying DBMS rather than build on top of it, as that makes it easier to use all the DBMS’ functionality. (I guess in principle DATallegro could use Ingres’s transactional capabilities, albeit only in low doses since performance would be very unoptimized. However, I have no idea whether they actually built the system that way.)
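As an aside on the bitmap-index point above: the usual reason bitmap indexes are recommended only for low-to-moderate cardinality is that, conceptually, the index keeps one bitmap per distinct value, so its raw size grows with the number of distinct values. Here is a rough, purely illustrative Python sketch of that sizing argument; it is not Greenplum's implementation, and real systems compress these bitmaps heavily, but the trend is what matters:

```python
# Back-of-the-envelope sizing for a naive, uncompressed bitmap index:
# one bitmap per distinct value, one bit per row in each bitmap.
# Real bitmap indexes use run-length or similar compression, but the raw-size
# trend still shows why low cardinality is the sweet spot.

def naive_bitmap_index_bytes(num_rows: int, cardinality: int) -> int:
    """Uncompressed size: `cardinality` bitmaps, each `num_rows` bits long."""
    return (num_rows * cardinality) // 8

ROWS = 1_000_000_000  # a billion-row fact table, as discussed in the comments below

for card in (100, 10_000, 100_000_000):
    size_gb = naive_bitmap_index_bytes(ROWS, card) / 1e9
    print(f"cardinality {card:>11,}: ~{size_gb:,.1f} GB uncompressed")

# cardinality         100: ~12.5 GB uncompressed
# cardinality      10,000: ~1,250.0 GB uncompressed
# cardinality 100,000,000: ~12,500,000.0 GB uncompressed
```

Even at a cardinality of 10,000, compression is already doing the heavy lifting; at tens of millions of distinct values (the scenario raised in the comments below), the structure stops making sense without a very different design.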
Comments
[…] I previously noted that Attensity seemed to be putting a lot of emphasis on a partnership with Business Objects and Teradata, although due to vacations I've still failed to get anybody from Business Objects to give me their view of the relationship's importance. Now Greenplum tells me that O'Reilly is using their system to support text mining (apparently via homegrown technology), although I wasn't too clear on the details. I also got the sense Greenplum is doing more in text mining, but the details of that completely escaped me. […]
Thanks Curt, this is a good introduction. It’s nice to have someone dig into the technology and find the differences.
WRT "query shipping", I was actually referring to a simpler approach used by others, not ours. My admittedly subtle but, I think, important point was that to support arbitrary DBMS work you have to get inside the database engine and implement optimization at the "execution plan" level, not the "query plan" level. Rather than "repartitioning on the fly", we pipeline rows through the interconnect among execution plan fragments in real time. We do this because of the performance, generality, and ease of adding future capabilities that come with a DBMS-internal architecture. I think this is critical, and you should expect us to continue to have advantages like supporting a rich assortment of native indexing and highly optimized aggregations that can't be done easily without our architecture.
Let’s see if this sparks some comments!
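To make the pipelining idea above concrete, here is a minimal, purely illustrative Python sketch. It is not Greenplum's (or anyone's) actual code; the fragment functions and the toy "interconnect" are invented for the example. It contrasts finishing partial results and shipping them to a head node with streaming rows through chained plan fragments:

```python
# Toy contrast between two ways a distributed query engine can move data.
# Both compute SUM(x) over rows scattered across several "segment" nodes.

from typing import Iterable, Iterator

SEGMENTS = [
    [1, 2, 3, 4],
    [10, 20, 30],
    [100, 200],
]  # stand-in for rows stored on three worker nodes

# Style 1: "fat head" / materialize-then-ship.
# Each segment finishes its whole partial result, then ships it to one node
# that does the rest of the work.
def materialize_then_ship(segments: list[list[int]]) -> int:
    partials = [sum(rows) for rows in segments]  # each node completes first
    return sum(partials)                         # head node finishes the job

# Style 2: pipelined execution-plan fragments.
# Each fragment is a generator; rows stream through the "interconnect"
# (here just generator chaining) as soon as they are produced, so no
# intermediate result has to be fully materialized anywhere.
def scan_fragment(rows: Iterable[int]) -> Iterator[int]:
    for row in rows:
        yield row                                # emit rows one at a time

def motion_fragment(streams: list[Iterator[int]]) -> Iterator[int]:
    for stream in streams:                       # stand-in for the interconnect
        yield from stream

def aggregate_fragment(stream: Iterator[int]) -> int:
    total = 0
    for row in stream:                           # consume rows as they arrive
        total += row
    return total

def pipelined(segments: list[list[int]]) -> int:
    scans = [scan_fragment(rows) for rows in segments]
    return aggregate_fragment(motion_fragment(scans))

assert materialize_then_ship(SEGMENTS) == pipelined(SEGMENTS) == 370
```

In a real engine the motion step is network transport between processes, which is exactly where the block-versus-row granularity question in the next comment comes in.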
Since we haven’t yet seen Greenplum in any competitive situations, it’s hard for me to comment in any detail. However, there are a few things I don’t understand:
1. Shipping rows around between execution plan fragments sounds OK for small amounts of data, but with large volumes, it’s far more efficient to move data around in large blocks to avoid the overhead of many small movements (especially on GigE). We’ve been able to handle all queries put before us, so I don’t see any inherent advantages in terms of functionality.
2. The last time I looked, Postgres was multi-process and not multi-threaded and I don’t think that’s changed. My guess is that there’s a lot of time spent waiting for rows with this kind of approach.
3. I don’t understand why the approach described above leads to the conclusion that ‘native indexing and highly optimized aggregations’ can’t be easily done with different architectures. We certainly manage it.
4. In any real-world DW, what use are bit-mapped indexes with cardinality of up to 10,000? We generally deal with tables of billions of rows and cardinality in the tens or hundreds of millions.
5. The network throughput numbers quoted by Sun don't make any sense. How do I get 1 GBps through four GigE links? Each link will max out at 80 MBps, so that gives only 320 MBps. Also, even with a TOE, you'd see a lot of CPU load with that kind of data movement. With such light CPU power relative to the number of disks, how does the system scale under concurrency?
6. There's also a claim of one TB per minute table scans from ten servers floating around. I presume that's calculated by just multiplying Sun's claim of 2 GBps disk throughput x 10 x 60. Seems unlikely that Postgres could get anywhere near that in practice with just two dual-core CPUs. Even if a simple table scan could run at that speed (which I doubt), our experience with Postgres is that it's MUCH slower than Ingres when running actual queries.
Stuart
DATAllegro
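For reference, the arithmetic behind points 5 and 6 above is easy to restate. This is just a back-of-the-envelope sketch of the figures quoted in that comment, not a measurement; the 80 MBps per-link ceiling and the 2 GBps per-server disk throughput are the numbers assumed there:

```python
# Point 5: claimed ~1 GBps over four gigabit Ethernet links.
LINK_PRACTICAL_MBPS = 80        # assumed real-world ceiling per GigE link
links = 4
aggregate_mbps = links * LINK_PRACTICAL_MBPS
print(f"4 x GigE at {LINK_PRACTICAL_MBPS} MBps each = {aggregate_mbps} MBps "
      f"(the claim was ~1000 MBps)")     # 320 MBps vs. the 1 GBps claim

# Point 6: "one terabyte per minute" table scans from ten servers.
DISK_GBPS_PER_SERVER = 2        # Sun's quoted per-server disk throughput
servers = 10
seconds = 60
gb_per_minute = DISK_GBPS_PER_SERVER * servers * seconds
print(f"{DISK_GBPS_PER_SERVER} GBps x {servers} servers x {seconds} s "
      f"= {gb_per_minute} GB/min (~{gb_per_minute / 1000:.1f} TB/min), "
      f"a raw-hardware ceiling rather than a measured query speed")
```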
[…] Greenplum has been Type 2 all along. No doubt they would happily sell you just a software license, but I don't know of anyone who has wanted to buy one. […]