Greenplum is going hybrid columnar as well
Over the past summer, Vertica, VectorWise, and Oracle all announced flavors of hybrid row/columnar storage. Now it’s Greenplum’s turn. Greenplum is actually offering true columnar storage, as opposed to Oracle’s PAX-like scheme — and also as opposed to the kind of Frankencolumn storage Daniel Abadi decries. For example, you don’t have to do a join to retrieve multiple columns; you just ask for them and there they are. Similarly, Greenplum doesn’t maintain explicit row IDs – whether in row-oriented or column-oriented append-only storage – relying instead on block-level header information.
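To make the "no join, no explicit row IDs" point concrete, here is a minimal sketch of positional column storage. It is an illustration only, not Greenplum's actual on-disk format: the `ColumnBlock` class, its `first_row` header field, and the helper names are all hypothetical. The idea is that each column is stored in blocks whose header says which row positions the block covers, so values from different columns line up by position.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple


@dataclass
class ColumnBlock:
    """One block of one column. The header layout here is an assumption."""
    first_row: int     # header: position of the first value in this block
    values: List[Any]  # values stored contiguously, in row order


def to_column_blocks(rows: Sequence[Tuple[Any, ...]],
                     columns: Sequence[str],
                     block_size: int) -> Dict[str, List[ColumnBlock]]:
    """Split row tuples into per-column blocks of at most block_size values."""
    store: Dict[str, List[ColumnBlock]] = {name: [] for name in columns}
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        for idx, name in enumerate(columns):
            store[name].append(ColumnBlock(start, [r[idx] for r in chunk]))
    return store


def fetch(store: Dict[str, List[ColumnBlock]],
          wanted: Sequence[str],
          row: int) -> Tuple[Any, ...]:
    """Return the wanted columns for one row position.

    No per-value row ID is stored; the block header alone locates the value.
    """
    out = []
    for name in wanted:
        for block in store[name]:
            if block.first_row <= row < block.first_row + len(block.values):
                out.append(block.values[row - block.first_row])
                break
        else:
            raise IndexError(f"row {row} not found in column {name!r}")
    return tuple(out)


# Made-up sample data, purely for illustration.
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
store = to_column_blocks(rows, ["id", "name", "amount"], block_size=2)
name_and_amount = fetch(store, ["name", "amount"], row=2)
```

Asking for two columns is just two positional lookups, which is the sense in which no join is needed to stitch columns back together.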
Highlights include:
- Column orientation is a special case of what Greenplum is calling Polymorphic Data Storage.*
- As per product management chief Ben Werther’s blog post, what Greenplum’s polymorphic data storage boils down to is that you can store different tables in different storage paradigms. This is transparent to the SQL or any other API; it’s just a performance choice.
- Indeed, Greenplum lets you store different partitions of the same table in different storage and/or compression schemes. So Greenplum now has a kind of ILM (Information Lifecycle Management) story, although it doesn’t offer the faster vs. cheaper storage media differentiation options of Sybase IQ or Vertica.
- Greenplum now has, depending on how one counts, three or four main types of table:
  - Traditional PostgreSQL, which has been available since Day One
  - Row-oriented append-only (compressible and scan-optimized), available since Greenplum 3.2 (July, 2008)
  - Columnar append-only (new in Greenplum 3.3.4, shipping now)
  - External, in which Greenplum treats something external – in a relational DBMS or otherwise – as if it were a Greenplum table
- Greenplum offers multiple versions of LZ (Lempel-Ziv) and gzip compression, any of which you can choose on a table-by-table or partition-by-partition basis.
- Greenplum offers the same compression algorithms for both row-oriented and column-oriented tables.
- Greenplum says that compression is typically at least 50% better (i.e., to 2/3 as much space) in columnar vs. row storage, for the same algorithm.
- Just as it doesn’t offer columnar-specific compression algorithms, Greenplum also doesn’t sport other columnar features Daniel loves, such as in-memory compression or late materialization. (But then, VectorWise doesn’t do in-memory compression either, and Daniel likes VectorWise.)
- All the Greenplum choices I’ve mentioned have to be made manually by DBAs.
- Similarly, I doubt Greenplum can match Vertica’s engineering for getting updates and trickle feeds quickly into a column store – a traditional columnar Achilles heel that Vertica has invested a lot of effort to circumvent.
*The term “polymorphic” is somewhat, shall we say, overloaded these days.
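The "50% better, i.e., to 2/3 as much space" arithmetic in the compression bullet can be checked with a short sketch. The function name and the 300 GB figure are made up for illustration; the only input taken from the post is the 50% figure, read as an improvement in compression ratio.

```python
def relative_space(improvement: float) -> float:
    """Fraction of the row store's compressed size that the column store needs,
    if its compression ratio is better by `improvement` (0.5 means 50%)."""
    if improvement <= -1.0:
        raise ValueError("improvement must be greater than -1")
    return 1.0 / (1.0 + improvement)


# Hypothetical example: a table that compresses to 300 GB in row storage.
row_store_gb = 300.0
column_store_gb = row_store_gb * relative_space(0.5)
```

A 1.5x better ratio means the same data occupies 1/1.5 of the space, which is two thirds, so the hypothetical 300 GB table would need about 200 GB in columnar form.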
Comments
12 Responses to “Greenplum is going hybrid columnar as well”
Interesting — the append-only compressed row store sounds kind of like a compressed MySQL/MyISAM table though. I’m curious how they’ve approached indexing in the column store mechanism. Have you found any data on that?
Very helpful write-up!
Nice write-up, although it sounds like Greenplum loves to copy technology rather than innovate. Does this mean that they cannot perform as well as columnar DBMSs? Are they losing business to columnar vendors?
DW Consultant —
– You’d have to agree that every vendor is building from a largely shared pool of ideas. Most of everything that every vendor does is covered in academic literature going back decades. Our goal isn’t being novel in everything we do — it is delivering value to customers.
– That being said, I think a little credit is due here. We’ve built a flexible enough storage infrastructure to allow us to (1) easily add a very efficient implementation of column-oriented tables, and (2) allow both row- and column-orientation to be used not just in the same database but in different partitions of the same table.
So why did we add this feature? It is about customer choice. For most analytical queries and mixed workloads – particularly with high-rate continuous microbatched loads – our row processing wins out over columnar approaches. (That is, there are good reasons why the pure columnar guys aren’t winning mixed-workload EDW deals against Teradata like we are.) But there are a lot of cases where columnar processing does great and does have an edge over row processing. Customers wanted the choice, so now we do both.
Curt,
You pretty much predicted everything I was going to say, but nonetheless, my reactions can be found at:
http://dbmsmusings.blogspot.com/2009/10/greenplum-announces-column-oriented.html
Well done to Greenplum for offering more choice say I. A hybrid column/row capability is pretty cool.
We downloaded the new release a few days ago after Luke mentioned during a call that the new column stuff had been made available.
It’ll be interesting to see how it works once folks start beating on it.
[…] Data, and the like with significant innovations in in-memory processing, exploiting parallelism, columnar storage options, and more. We already starting to see hybrid approaches between the Hadoop players and […]
[…] Data, and the like with significant innovations in in-memory processing, exploiting parallelism, columnar storage options, and more. Additionally, significant opportunities to push application processing into […]
[…] Aster Data has now joined Greenplum/EMC among row-based analytic DBMS vendors with hybrid row-column stores. Oracle will join them some […]
[…] that truly offer some form of hybrid row/column storage include Vertica, EMC/Greenplum, and Aster Data. Oracle Exadata, in my opinion, does not, but I can see why people might get […]
[…] neglects to praise Greenplum for true hybrid row/columnar data management, a feature shared by Teradata and Vertica, among others, but not by Oracle, DB2, or […]