Greenplum is in the big leagues
After a March 2007 call, I didn’t talk with Greenplum again until earlier this month. That changed fast. I flew out to see Greenplum last week and spent over a day with president/co-founder Scott Yara, CTO/co-founder Luke Lonergan, marketing VP Paul Salazar, and product management/marketing director Ben Werther. Highlights – besides some really great sushi at Sakae in Burlingame – start with an eye-opening set of customer proof points, such as:
- 50 total paying Greenplum customers, over half of whom are already in production.
- 6 Greenplum users in production with >100 terabytes of user data. Among SQL data warehouse specialist vendors, that may beat anybody except Teradata.
- 2 Greenplum customers expected to be in production within 60 days with >1 petabyte of user data. That may beat even Teradata. Anyhow, it looks as if Greenplum and Teradata will be 1-2 in some order crossing the 1-petabyte line. (Edit: Here’s more detail on >1 petabyte Greenplum users.)
- 5 Greenplum customers with “multiple 100s of users.” That’s not much by the standards of more mature vendors, but it suffices to show that Greenplum has some kind of a handle on concurrency.
- 3 Greenplum customers with 1000s of tables. That suffices to show that Greenplum’s claims to schema agnosticity are more than academic, even if it’s not enough to show that many enterprises care.
- Greenplum customers using tools from the following list, and I quote: SAS, Unica, Datastage, Information Builders, Informatica, Oracle BI, Microstrategy, Microsoft SSIS and SSRS, Business Objects / BODI, SAP, Talend, Pentaho
- (Again I quote) “Tier 1” customers in the following verticals:
- Retail
- Pharma
- Telco
- Internet
- Retail Banking
- Insurance
- Health Care
- Commercial Banking
- Transportation
- Service Providers
- Media
- Manufacturing
Even though the bulk of Greenplum’s revenue comes from the Sun appliance relationship, 20 paying customers run Greenplum on Linux. Another interesting demographic is that 25-40% of Greenplum’s revenue tends to come from Asia (obviously, the figure fluctuates greatly from quarter to quarter). Perhaps not coincidentally, one of Greenplum’s three salespeople last year was based in Asia. (The current total is 15, and growing fast.)
Technical highlights include:
- Greenplum is row-based, shared-nothing, MPP. It runs on standard hardware and operating systems. (But fortunately for its key partnership, Greenplum evidently does run best, at least for now, on the recommended standard Sun appliance configurations.)
- Most or all of the PostgreSQL data access methods are left intact. The big changes to PostgreSQL lie in the areas of query optimization, planning, and execution. I.e., Greenplum has its own way of breaking up a query into pieces – and of course of seeing that data gets shipped among nodes – but the low-level operators for storage and access are from PostgreSQL. (The first sketch after this list illustrates the general pattern.)
- Greenplum nodes are just connected to a group of standard switches, via standard 1 gigabit Ethernet. Greenplum insists that interconnect bandwidth is not a problem.
- Currently, there’s a boss node, with all the other nodes being peers. But by now (unlike in an early Greenplum prototype), intermediate results are shipped peer-to-peer rather than back up to the boss node. In the future, compute and storage nodes will (optionally) be split out from each other.
- Compression is being introduced in the next point release, with big numbers (at least by row-based standards) out of the gate. It will initially be just for append-only tables, but that limitation will be lifted later on. (The second sketch after this list suggests why append-only storage is the natural starting point.)
- Also in that release, Greenplum is introducing embedded parallel mathematical packages, such as linear algebra and statistics (specifically, R).
- Greenplum has no current in-the-cloud offering, but one is in the works.
- Greenplum offers an ever-growing variety of administration tools.
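Since several of the points above are architectural, two toy sketches may help. First, the query-splitting pattern. In a generic shared-nothing MPP design – this is an illustration of the general idea, not Greenplum’s actual code – rows are hash-distributed across segment nodes by a distribution key, each segment computes a partial result over its own slice, and the partials are then merged:

```python
from collections import defaultdict

# Toy sketch of the generic shared-nothing MPP pattern -- an illustration
# of the idea, not Greenplum's implementation.

NUM_SEGMENTS = 4

def distribute(rows, key):
    """Hash-distribute rows across segments by a distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[hash(row[key]) % NUM_SEGMENTS].append(row)
    return segments

def partial_aggregate(rows, group_key, value_key):
    """One segment's slice of SELECT group_key, SUM(value_key) ... GROUP BY."""
    partials = defaultdict(float)
    for row in rows:
        partials[row[group_key]] += row[value_key]
    return partials

def merge(per_segment_partials):
    """Final step: merge per-segment partials into the global answer."""
    totals = defaultdict(float)
    for partials in per_segment_partials:
        for group, subtotal in partials.items():
            totals[group] += subtotal
    return dict(totals)

rows = [{"region": r, "sales": s}
        for r, s in [("east", 10.0), ("west", 5.0), ("east", 7.5), ("west", 2.5)]]
segments = distribute(rows, "region")
print(merge(partial_aggregate(seg, "region", "sales") for seg in segments.values()))
# e.g. {'east': 17.5, 'west': 7.5}
```

Because rows sharing a distribution key land on the same segment, this particular group-by needs no inter-node traffic; grouping or joining on some other key is what triggers the peer-to-peer redistribution of intermediate results mentioned above. On the interconnect point, a rough back-of-envelope: 1 gigabit Ethernet is on the order of 120 megabytes/second per node, and since redistribution traffic is spread all-to-all, aggregate bandwidth grows with node count – presumably part of why Greenplum insists the interconnect isn’t a problem.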
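Second, compression and append-only tables. My gloss on why append-only comes first – an assumption on my part, not something Greenplum spelled out – is that blocks written once and never updated in place can be compressed as immutable units, with no re-compression on update to worry about. A toy sketch with zlib:

```python
import zlib

# Toy sketch: an append-only table stored as immutable compressed blocks.
# Illustrative only -- not Greenplum's storage format.

class AppendOnlyTable:
    def __init__(self, rows_per_block=1000):
        self.rows_per_block = rows_per_block
        self.blocks = []   # compressed, immutable blocks
        self.buffer = []   # rows not yet flushed

    def append(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.rows_per_block:
            self._flush()

    def _flush(self):
        raw = "\n".join(self.buffer).encode()
        self.blocks.append(zlib.compress(raw))   # written once, never updated
        self.buffer = []

    def scan(self):
        for block in self.blocks:
            yield from zlib.decompress(block).decode().split("\n")
        yield from self.buffer

table = AppendOnlyTable(rows_per_block=4)
for i in range(8):
    table.append(f"row {i},some repetitive payload")
print(sum(len(b) for b in table.blocks), "compressed bytes on 'disk'")
```

Updatable tables would need block rewrites or versioning on top of this, which is presumably why that support comes later.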
Comments
19 Responses to “Greenplum is in the big leagues”
[…] offered a bit of clarification regarding the usage figures I posted last night. Everything on the list is in production, except […]
BI Futurewatch…
This page tracks trends in BI which are not likely to be implemented at PayCycle in the near term…
Dead link on “Confluence: …”
[…] claims 50 paying customers, all within the past year. Greenplum also claims 50 paying customers, almost all within the past […]
[…] is a big challenge there. Among the very-large-scale MPP data warehouse software vendors, Greenplum is unusual in that its interconnect of choice is (sufficiently many) cheap 1 gigabit Ethernet […]
[…] just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor […]
So how do you use R for large-scale analytics when it has to hold its data in memory?
Hi Phil,
That’s one problem with R in general: it holds its results in RAM.
With Greenplum, we enable you to run R programs as stored procedures, which lets you reuse R’s math routines to some extent – specifically, to calculate intermediate results as part of WINDOW functions or other OLAP use cases.
We have also re-implemented some of the routines that R provides as native parallel functions within Greenplum, including multi-variable linear regression, a Naive Bayes classifier, and some others.
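To picture the pattern Luke describes – reusing an existing math routine as a function applied per partition while rows stream through – here is a minimal Python sketch, with the statistics module standing in for R. The names and interface are hypothetical, not Greenplum’s actual UDF API:

```python
import statistics
from itertools import groupby
from operator import itemgetter

# Minimal sketch of the pattern described above: reuse an existing math
# routine (here Python's statistics module standing in for R) as a
# per-partition function while rows stream through the executor.
# Hypothetical illustration, not Greenplum's actual UDF interface.

rows = [
    {"customer": "a", "spend": 10.0},
    {"customer": "a", "spend": 14.0},
    {"customer": "b", "spend": 3.0},
    {"customer": "b", "spend": 5.0},
]

def windowed(rows, partition_key, value_key, routine):
    """Apply `routine` to each partition, like an OLAP window aggregate."""
    rows = sorted(rows, key=itemgetter(partition_key))
    for key, group in groupby(rows, key=itemgetter(partition_key)):
        values = [r[value_key] for r in group]
        yield key, routine(values)

for customer, mean_spend in windowed(rows, "customer", "spend", statistics.mean):
    print(customer, mean_spend)   # a 12.0 / b 4.0
```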
Luke,
It might be helpful if you listed a few ways R results might wind up on disk — if indeed there are a few different ways. 🙂
Thanks,
CAM
I see – it’s actually still an in-memory proposition in Greenplum within the R functions themselves, but we can stream data through the R functions and the output may end up spooling to disk if our optimizer thinks it has to.
An example use case where we’ve used R as a UDF: doing various forms of linear regression required a matrix pseudo-inverse routine to solve the eigenvalue problem. Instead of writing our own pseudo-inverse routine, we used the one that comes with R to evaluate different approaches. The matrix solve part is actually pretty small, so we were able to do it in memory as the final stage of processing, and the R routine was a good fit.
In the end, we implemented our own pseudo-inverse routine, now available as ‘pinv()’ from within Greenplum. It’s written in C internally and is blazingly fast.
So – the embedded R UDF capability within Greenplum is useful, but it’s often good to re-write a routine for performance when moving to production. We provide many of these kinds of functions to our customers in the form of libraries. Note that we also provide a large array of built-in matrix manipulation routines.
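For readers wondering what pinv() buys you: the Moore-Penrose pseudo-inverse turns multi-variable linear regression into a one-line least-squares solve. A minimal numpy sketch of the math (numpy standing in for Greenplum’s C implementation; the data is made up):

```python
import numpy as np

# Least squares via the Moore-Penrose pseudo-inverse: beta = pinv(X) @ y.
# numpy's pinv stands in for Greenplum's internal pinv(); data is synthetic.

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.pinv(X) @ y   # the pseudo-inverse solve
print(beta_hat)                    # close to [1.0, 2.0, -3.0]
```

Under the hood, a pseudo-inverse is typically computed via an SVD, which is the sense in which it connects to eigenvalue-style decompositions.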
Interesting indeed. I’ve been writing some software for WPS as well as SAS so that users can have access to R routines and R graphics. Of course, part of the problem is the memory constraint issue. I’ve been playing with executing R where the user can determine which R routines/programs they want to run in parallel and have WPS or SAS collect the output and write it back into the appropriate windows. This actually works but I’m not satisfied with what I have done.
Since I don’t see R going 64-bit on Windows anytime soon, I’m starting to spec out a system where R runs on 64-bit Linux and has access to a lot more memory to solve statistical problems. Currently, the idea is to make the Linux system a VM that is easily installed and comes with quite a few of the R libraries already installed.
All I need is time!
[…] believe that both of the previously mentioned petabyte+ databases on Greenplum will feature clickstream […]
[…] promise to wrap MapReduce into the newest version of its data solutions. The announcement from the data warehousing and analytics supplier comes to a fast-changing landscape, given last week’s HP-Oracle Exadata […]
[…] get all or almost all of this back. For example, for a customer whose storage volume is several p… and which is now loading its systems with data […]
[…] there’s no need for me to do that here – Curt Monash does an excellent job on this post from 2008, and he recently talked with Ebay about their use of Greenplum on a massive scale in this article. […]
[…] Greenplum had about 65 paying customers at the end of Q1. I’ve forgotten how that jibes with a figure of 50 customers last August. […]
[…] As of the past quarter or two, <10% of Greenplum’s sales activity is on Sun, which works out to maybe one sale per quarter and at most a small number of sales cycles. (That’s down from 50%+ not that long ago.) […]
[…] customers, including Fox/MySpace, eBay, Sears, and T-Mobile. While Fox/MySpace never got to the predicted 1-petabyte level of user data, T-Mobile is loosely projected to indeed get there. The same […]