October 6, 2010
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included:
- eBay has thrown out Greenplum. (Edit: As per the comments below, eBay wouldn’t endorse that wording itself.) eBay’s 6 ½ petabyte Greenplum database has turned into a >10 petabyte Teradata database, which will grow 2 ½x further in size soon.
- Specifically, Oliver told me there are 8 petabytes of spinning disk, with 80% compression. So that’s 40 petabytes before you multiply by a reducing factor to cover mirroring, temp space, and so on. My low end for that factor would be 25-28%; my high end would be 35-40%; either way, we’re talking about >10 petabytes of true user data.
- The 8 petabytes of spinning disk are headed to 20 petabytes next year.
- Oliver gave the impression that Greenplum got thrown out more for reliability reasons than performance. (While eBay saw a major performance difference between Teradata and Greenplum, Oliver previously indicated he was inclined to attribute this more to specific Sun Thumper hardware/storage choices than to software.)
- That database, called “Singularity,” has some interesting aspects. Notably, alongside dozens of conventional relational columns, each row has a character field that is simply a string of name-value pairs, over which you can define views and the like to get virtual tables. (A toy sketch of the idea appears after this list.)
- The system ingests log data in the form of lots and lots of name-value pairs.
- The most commonly found ones go into columns in the usual way.
- The rest are strung together into, well, a character string.
- Teradata has developed some features for eBay that make it easier to index, query, etc. on that character string of name-value pairs.
- eBay’s more EDW-like (Enterprise Data Warehouse) multi-petabyte Teradata database continues to grow, with the main system apparently up to 4 ½ petabytes from the previous 2 ½.
- I took the opportunity to ask what kinds of data marts (virtual or otherwise) were spun out in practice.
- In Oliver’s ranking,
- #1 was derived data based on other data already in the data warehouse.
- #2 was other data within eBay that had never been put into the data warehouse in the first place.
- #3 was data truly from outside sources.
- Todd Walter chimed in to point out that at other Teradata customers who perhaps didn’t have as fully fleshed out an EDW, #1 and #2 could be reversed.
- eBay sees Hadoop as an interesting tool for certain special purposes.
- eBay likes Hadoop for certain tasks such as image analysis. (Edit: And analysis of search results.)
- eBay doesn’t like Hadoop for anything that requires data movement, such as a join.
- Similarly, eBay doesn’t like HBase.
- eBay is enamored of the idea of doing “social networking around analytics.”
- This is something that has been built but not rolled out yet.
- It seems more focused on actual business intelligence than on the underlying data, unlike Greenplum Chorus, which seems more focused on the databases themselves.
- Since it hasn’t been rolled out yet, we don’t know which (if any) of activity streams, forums, or whatever will actually get significant adoption.
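To make the name-value-pair layout above a bit more concrete, here is a toy Python sketch of the idea. The separators, field names, and helper functions are my own inventions for illustration; they are not eBay’s actual log format or Teradata’s implementation.

```python
# Toy sketch: a row keeps a few conventional columns, while everything else
# rides along in one name-value-pair (NVP) string. A "virtual column" is then
# just a lookup that falls back to parsing that string.

from typing import Optional


def parse_nvp(nvp: str, pair_sep: str = ";", kv_sep: str = "=") -> dict:
    """Turn 'browser=FF;country=DE;latency_ms=42' into a dict."""
    pairs = {}
    for chunk in nvp.split(pair_sep):
        if kv_sep in chunk:
            key, value = chunk.split(kv_sep, 1)
            pairs[key.strip()] = value.strip()
    return pairs


def virtual_column(row: dict, name: str) -> Optional[str]:
    """Resolve a 'column' that may live in the fixed schema or in the NVP blob."""
    if name in row:
        return row[name]
    return parse_nvp(row.get("nvp", "")).get(name)


log_row = {
    "event_id": "12345",
    "user_id": "u-987",
    "nvp": "browser=FF;country=DE;latency_ms=42",
}
print(virtual_column(log_row, "user_id"))   # fixed column -> u-987
print(virtual_column(log_row, "country"))   # NVP-backed virtual column -> DE
print(virtual_column(log_row, "missing"))   # absent on both sides -> None
```

The appeal is that rarely used attributes don’t each need their own column, while views over the NVP field can still present them as if they did.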
Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata
Comments
30 Responses to “eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more”
[…] quicky: eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more. Interesting to see that the impression is that Greenplum got thrown out more for reliability […]
Many sites try to keep Teradata on their toes through either the threat of, or actual deployment of, a competing technology.
Along with the credibility gained from being presented as the ‘Sun Data Warehouse’ (Greenplum on Thumper), I saw this as the main driver behind Greenplum’s adoption at eBay. This kind of competitive pressure has been happening increasingly since Netezza came on the scene in ~2003.
Now that Sun has gone to Oracle and Greenplum to EMC (ironically, one of Teradata’s disk sub-system suppliers), the roadmap in support of eBay’s continued use of Sun/Greenplum must be non-existent.
Big companies expect multi-year technology roadmaps, and some expect to be able to bend the roadmap significantly to meet their own requirements. This must have been a factor, surely?
The name-value pair database is interesting given that Nokia presented with Teradata at last year’s Teradata Partners event in Washington and basically said the POC they carried out couldn’t make the data easy to use, with multiple self-joins required for even the simplest query.
Does eBay have new Teradata features in support of this approach, I wonder?
Yes, there are new features.
No, I don’t know what they are. 🙂
Greenplum has been bought by EMC, and the platform has been bought by a database company…. Does this mean the product has become lackluster now?
Curt, the fundamental thrust of this post, that GP was thrown out, is simply not true and implies that GP is not viable in the MPP space, which is also not true.
eBay and GP did a research project, and they both learned a great deal. Sun, to its credit, stepped up and put a lot of time and energy into the 45xx platform.
The issue with NVP is the need for branch and loop control in expressions. If your database cannot branch/loop in an expression, you cannot process NVPs…
There was a lot of technology developed by all three companies, so when eBay goes and does the next big and unexpected thing, don’t throw Teradata under the bus like you just did GP.
In the interest of fair disclosure, I led the Singularity project and the research relationships between eBay and GP, and between eBay and TD. Two months ago, I became the Chief Architect for User Data and Analytics at Yahoo!.
Michael McIntire
Michael,
I neither implied nor believe that GP is not viable in the MPP space — that would be pretty silly.
Umm — NVP?
Thanks,
CAM
Curt – NVP = Name Value Pair. -Michael
I have worked on both technologies and respect both. Honestly, I think I smell marketing. For people who make decisions based on comments, I’d encourage you all to do an evaluation before you go with any solution.
BTW – the simple calculation of usable space works like this….
8PB Raw = 4PB mirrored.
4PB Less File System + Overhead (30%) = 2.8PB.
Compression:
60% avg = 2.8PB * 2.5 = 7PB Usable
70% avg = 2.8PB * 3.3 = 9.3PB Usable
Ergo, a 20PB box would be:
60% avg = 7PB * 2.5 = 17.5PB Usable
70% avg = 7PB * 3.3 = 23.1PB Usable
The basic rule of thumb is:
With generalized compression, Raw Disk Size = Usable Disk Size.
So using Oliver’s 80% figure, we’re talking 14 PB?
Or is 80% more like a peak than an average number?
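For what it’s worth, here is that arithmetic as a small Python sketch, with Oliver’s 80% figure plugged in alongside Michael’s 60% and 70% cases. The 50% mirroring and 30% overhead factors are taken straight from the comment above, and whether 80% is an average or a peak is exactly the open question:

```python
# Sanity check of the usable-capacity arithmetic above. Assumptions, taken from
# the comment (not official figures): mirroring halves raw disk, ~30% of what
# remains goes to file system and other overhead, and compression at rate r
# multiplies the rest by 1 / (1 - r) -- e.g. 60% -> 2.5x, 70% -> ~3.3x, 80% -> 5x.

def usable_pb(raw_pb: float, compression: float,
              mirroring: float = 0.5, overhead: float = 0.30) -> float:
    after_mirroring = raw_pb * mirroring
    after_overhead = after_mirroring * (1 - overhead)
    return after_overhead / (1 - compression)


for raw in (8, 20):
    for rate in (0.60, 0.70, 0.80):
        print(f"{raw} PB raw @ {rate:.0%} compression -> "
              f"{usable_pb(raw, rate):.1f} PB usable")

# 8 PB raw comes out to 7.0 / 9.3 / 14.0 PB usable at 60% / 70% / 80%,
# and 20 PB raw to 17.5 / 23.3 / 35.0 PB -- so yes, by this rule of thumb
# an 80% average on today's 8 PB would be roughly 14 PB usable.
```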
Curt,
As I stated previously, the “thrown out” part of your statement could not be further from the truth.
The casual question over lunch was: “Do you still use vendor XYZ?” And my response was a simple “No.”
For various reasons that I will not go into in this forum, we have simply selected a different vendor for V2 of our Singularity project. The same is true for many areas of our business. That said, we treat our vendors, current or past, with respect. The guys at Greenplum have gone above and beyond during the time we worked with them on a next-generation prototype.
We value and respect their entire team and, as Michael previously stated, have simply selected a different vendor for the next-generation system implementation.
I realize that provocative statements drive traffic to your blog, but I would appreciate it if you could remove any exaggerations from what reads like a quote from myself.
I am sorry, but I cannot support your statements in this blog post.
Oliver,
Thank you (and ditto Michael) for correcting any connotations you feel people may have wrongly inferred from what I wrote.
Usually that’s something I have to do myself.
Best,
CAM
[…] Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn’t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are exceptions.) […]
[…] of Teradata also sat in on the latter part of the conversation. Things I learned included… Lire l’article Article liésIBM va-t-il s’emparer de Netezza ?La Revue de Presse de l’été […]
[…] That eBay comment was particularly interesting. […]
[…] (ditto), Luke Lonergan (ditto), Todd Walter (almost unrecognizable without his usual cowboy gear), Oliver Ratzesberger, and a bunch of actual science […]
Can anyone tell me what the job of an ETL developer at eBay might be?
Thanks in advance.
What are the scalability limits of existing data warehouse products?…
Those aren’t all the same thing. Oracle RAC isn’t shared-nothing, although Exadata gets some of the shared-nothing benefits. Microsoft’s shared-nothing offering is very immature, as it was based on the troubled DATAllegro acquisition. Anyhow, Terada…
[…] I got into a flap with EMC Greenplum. I blindsided them on a story; they retaliated for the story by, among other things, screwing me over business-wise. Why did I […]
[…] this time that is indeed the phrase that was […]
[…] ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into […]
[…] an important concept on the monitoring-oriented side of business intelligence and — if Oliver Ratzesberger is to be believed — in investigative analytics as well. But the operational side may actually […]
[…] the first issue, which is size, let me point out that eBay have two data warehouse with many petabytes running Teradata. Obviously, Teradata is far from cutting edge new stuff. I didn’t heard of a […]
[…] says eBay’s traffic volumes produce huge data, not just big data. In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes — equivalent to about 266 years […]
[…] explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form […]