Greenplum is in the big leagues
After a March 2007 call, I didn’t talk with Greenplum again until earlier this month. That changed fast. I flew out to see Greenplum last week and spent over a day with president/co-founder Scott Yara, CTO/co-founder Luke Lonergan, marketing VP Paul Salazar, and product management/marketing director Ben Werther. Highlights – besides some really great sushi at Sakae in Burlingame – start with an eye-opening set of customer proof points, such as:
- 50 total paying Greenplum customers, over half of whom are already in production.
- 6 Greenplum users in production with >100 terabytes of user data. Among SQL data warehouse specialist vendors, that may beat anybody except Teradata.
- 2 Greenplum customers expected to be in production within 60 days with >1 petabyte of user data. That may beat even Teradata. Anyhow, it looks as if Greenplum and Teradata will be 1-2 in some order crossing the 1-petabyte line. (Edit: Here’s more detail on >1 petabyte Greenplum users.)
- 5 Greenplum customers with “multiple 100s of users.” That’s not much by the standards of more mature vendors, but it suffices to show that Greenplum has some kind of a handle on concurrency.
- 3 Greenplum customers with 1000s of tables. That suffices to show that Greenplum’s claims to schema agnosticity are more than academic, even if it’s not enough to show that many enterprises care.
- Greenplum customers using tools from the following list, and I quote: SAS, Unica, Datastage, Information Builders, Informatica, Oracle BI, Microstrategy, Microsoft SSIS and SSRS, Business Objects / BODI, SAP, Talend, Pentaho
- (Again I quote) “Tier 1” customers in the following verticals:
- Retail
- Pharma
- Telco
- Internet
- Retail Banking
- Insurance
- Health Care
- Commercial Banking
- Transportation
- Service Providers
- Media
- Manufacturing
Even though the bulk of Greenplum’s revenue comes from the Sun appliance relationship, 20 paying customers run Greenplum on Linux. Another interesting demographic is that 25-40% of Greenplum’s revenue tends to come from Asia (obviously, the figure fluctuates greatly from quarter to quarter). Perhaps not coincidentally, one of Greenplum’s three salespeople last year was based in Asia. (The current total is 15, and growing fast.)
Technical highlights include:
- Greenplum is row-based, shared-nothing, MPP. It runs on standard hardware and operating systems. (But fortunately for its key partnership, Greenplum evidently does run best, at least for now, on the recommended standard Sun appliance configurations.)
- Most or all of the PostgreSQL data access methods are left intact. The big changes to PostgreSQL lie in the areas of query optimization, planning, and execution. I.e., Greenplum has its own way of breaking up a query into pieces – and of course of seeing that data gets shipped among nodes – but the low-level operators for storage and access are from PostgreSQL. (The first sketch after this list illustrates the general pattern.)
- Greenplum nodes are just connected to a group of standard switches, via standard 1 gigabit Ethernet. Greenplum insists that interconnect bandwidth is not a problem.
- Currently, there’s a boss node, with all the other nodes being peers. But by now (unlike in an early Greenplum prototype), intermediate results are shipped peer-to-peer rather than back up to the boss node. In the future, compute and storage nodes will (optionally) be split out from each other.
- Compression is being introduced in the next point release, with big numbers (at least by row-based standards) out of the gate. It will initially be just for append-only tables, but that limitation will be lifted later on. (The second sketch after this list suggests why append-only storage is the natural starting point.)
- Also in that release, Greenplum is introducing embedded parallel mathematical packages, such as linear algebra and statistics (specifically, R).
- Greenplum has no current in-the-cloud offering, but one is in the works.
- Greenplum offers an ever-growing variety of administration tools.
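Since several of the points above are architectural, two toy sketches may help. First, the query-splitting pattern. In a generic shared-nothing MPP design – this is an illustration of the general idea, not Greenplum’s actual code – rows are hash-distributed across segment nodes by a distribution key, each segment computes a partial result over its own slice, and the partials are then merged:

```python
from collections import defaultdict

# Toy sketch of the generic shared-nothing MPP pattern -- an illustration
# of the idea, not Greenplum's implementation.

NUM_SEGMENTS = 4

def distribute(rows, key):
    """Hash-distribute rows across segments by a distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[hash(row[key]) % NUM_SEGMENTS].append(row)
    return segments

def partial_aggregate(rows, group_key, value_key):
    """One segment's slice of SELECT group_key, SUM(value_key) ... GROUP BY."""
    partials = defaultdict(float)
    for row in rows:
        partials[row[group_key]] += row[value_key]
    return partials

def merge(per_segment_partials):
    """Final step: merge per-segment partials into the global answer."""
    totals = defaultdict(float)
    for partials in per_segment_partials:
        for group, subtotal in partials.items():
            totals[group] += subtotal
    return dict(totals)

rows = [{"region": r, "sales": s}
        for r, s in [("east", 10.0), ("west", 5.0), ("east", 7.5), ("west", 2.5)]]
segments = distribute(rows, "region")
print(merge(partial_aggregate(seg, "region", "sales") for seg in segments.values()))
# e.g. {'east': 17.5, 'west': 7.5}
```

Because rows sharing a distribution key land on the same segment, this particular group-by needs no inter-node traffic; grouping or joining on some other key is what triggers the peer-to-peer redistribution of intermediate results mentioned above. On the interconnect point, a rough back-of-envelope: 1 gigabit Ethernet is on the order of 120 megabytes/second per node, and since redistribution traffic is spread all-to-all, aggregate bandwidth grows with node count – presumably part of why Greenplum insists the interconnect isn’t a problem.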
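Second, compression and append-only tables. My gloss on why append-only comes first – an assumption on my part, not something Greenplum spelled out – is that blocks written once and never updated in place can be compressed as immutable units, with no re-compression on update to worry about. A toy sketch with zlib:

```python
import zlib

# Toy sketch: an append-only table stored as immutable compressed blocks.
# Illustrative only -- not Greenplum's storage format.

class AppendOnlyTable:
    def __init__(self, rows_per_block=1000):
        self.rows_per_block = rows_per_block
        self.blocks = []   # compressed, immutable blocks
        self.buffer = []   # rows not yet flushed

    def append(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.rows_per_block:
            self._flush()

    def _flush(self):
        raw = "\n".join(self.buffer).encode()
        self.blocks.append(zlib.compress(raw))   # written once, never updated
        self.buffer = []

    def scan(self):
        for block in self.blocks:
            yield from zlib.decompress(block).decode().split("\n")
        yield from self.buffer

table = AppendOnlyTable(rows_per_block=4)
for i in range(8):
    table.append(f"row {i},some repetitive payload")
print(sum(len(b) for b in table.blocks), "compressed bytes on 'disk'")
```

Updatable tables would need block rewrites or versioning on top of this, which is presumably why that support comes later.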
Comments
19 Responses to “Greenplum is in the big leagues”
[…] offered a bit of clarification regarding the usage figures I posted last night. Everything on the list is in production, except […]
BI Futurewatch…
This page tracks trends in BI which are not likely to be implemented at PayCycle in the near term…
Dead link on “Confluence: …”
[…] claims 50 paying customers, all within the past year. Greenplum also claims 50 paying customers, almost all within the past […]
[…] is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet […]
[…] just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor […]
So how do you use R for large-scale analytics when it has to hold its data in memory?
Hi Phil,
That’s one problem with R in general: it holds its results in RAM.
With Greenplum, we enable you to run R programs as stored procedures, which lets you reuse R’s math routines to some extent – specifically, to calculate intermediate results as part of WINDOW functions or other OLAP use cases.
We have also re-implemented some of the routines that R provides as native parallel functions within Greenplum, including multi-variable linear regression, a Naive Bayes classifier, and some others.
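To picture the pattern Luke describes – reusing an existing math routine as a function applied per partition while rows stream through – here is a minimal Python sketch, with the statistics module standing in for R. The names and interface are hypothetical, not Greenplum’s actual UDF API:

```python
import statistics
from itertools import groupby
from operator import itemgetter

# Minimal sketch of the pattern described above: reuse an existing math
# routine (here Python's statistics module standing in for R) as a
# per-partition function while rows stream through the executor.
# Hypothetical illustration, not Greenplum's actual UDF interface.

rows = [
    {"customer": "a", "spend": 10.0},
    {"customer": "a", "spend": 14.0},
    {"customer": "b", "spend": 3.0},
    {"customer": "b", "spend": 5.0},
]

def windowed(rows, partition_key, value_key, routine):
    """Apply `routine` to each partition, like an OLAP window aggregate."""
    rows = sorted(rows, key=itemgetter(partition_key))
    for key, group in groupby(rows, key=itemgetter(partition_key)):
        values = [r[value_key] for r in group]
        yield key, routine(values)

for customer, mean_spend in windowed(rows, "customer", "spend", statistics.mean):
    print(customer, mean_spend)   # a 12.0 / b 4.0
```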
Luke,
It might be helpful if you listed a few ways R results might wind up on disk — if indeed there are a few different ways. 🙂
Thanks,
CAM
I see – it’s actually still an in-memory proposition in Greenplum within the R functions themselves, but we can stream data through the R functions and the output may end up spooling to disk if our optimizer thinks it has to.
An example use case where we’ve used R as a UDF: doing various forms of linear regression required a matrix pseudo-inverse routine to solve the eigenvalue problem. Instead of writing our own pseudo-inverse routine, we used the one that comes with R to evaluate different approaches. The matrix solve part is actually pretty small, so we were able to do it in memory as the final stage of processing, and the R routine was a good fit.
In the end, we implemented our own pseudo-inverse routine, now available as ‘pinv()’ from within Greenplum. It’s written in C internally and is blazingly fast.
So – the embedded R UDF capability within Greenplum is useful, but it’s often good to re-write a routine for performance when moving to production. We provide many of these kinds of functions to our customers in the form of libraries. Note that we also provide a large array of built-in matrix manipulation routines.
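For readers wondering what pinv() buys you: the Moore-Penrose pseudo-inverse turns multi-variable linear regression into a one-line least-squares solve. A minimal numpy sketch of the math (numpy standing in for Greenplum’s C implementation; the data is made up):

```python
import numpy as np

# Least squares via the Moore-Penrose pseudo-inverse: beta = pinv(X) @ y.
# numpy's pinv stands in for Greenplum's internal pinv(); data is synthetic.

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.pinv(X) @ y   # the pseudo-inverse solve
print(beta_hat)                    # close to [1.0, 2.0, -3.0]
```

Under the hood, a pseudo-inverse is typically computed via an SVD, which is the sense in which it connects to eigenvalue-style decompositions.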
Interesting indeed. I’ve been writing some software for WPS as well as SAS so that users can have access to R routines and R graphics. Of course, part of the problem is the memory constraint issue. I’ve been playing with executing R where the user can determine which R routines/programs they want to run in parallel and have WPS or SAS collect the output and write it back into the appropriate windows. This actually works but I’m not satisfied with what I have done.
Since I don’t see R going 64-bit on Windows anytime soon, I’m starting to spec out a system where R runs on 64-bit Linux and has access to a lot more memory to solve statistical problems. Currently, the idea is to make the Linux system a VM that is easily installed and comes with quite a few of the R libraries already installed.
All I need is time!
[…] believe that both of the previously mentioned petabyte+ databases on Greenplum will feature clickstream […]
[…] promise to wrap MapReduce into the newest version of its data solutions. The announcement from the data warehousing and analytics supplier comes to a fast-changing landscape, given last week’s HP-Oracle Exadata […]
[…] get all or almost all of this back. For example, for a customer whose storage volume is several p… and which is now loading its systems with data […]
[…] there’s no need for me to do that here – Curt Monash does an excellent job on this post from 2008, and he recently talked with Ebay about their use of Greenplum on a massive scale in this article. […]
[…] Greenplum had about 65 paying customers at the end of Q1. I’ve forgotten how that jibes with a figure of 50 customers last August. […]
[…] As of the past quarter or two, <10% of Greenplum’s sales activity is on Sun, which works out to maybe one sale per quarter and at most a small number of sales cycles. (That’s down from 50%+ not that long ago.) […]
[…] customers, including Fox/MySpace, eBay, Sears, and T-Mobile. While Fox/MySpace never got to the predicted 1-petabyte level of user data, T-Mobile is loosely projected to indeed get there. The same […]