October 6, 2010
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included:
- eBay has thrown out Greenplum. (Edit: As per the comments below, eBay wouldn’t endorse that wording itself.) eBay’s 6 ½ petabyte Greenplum database has turned into a >10 petabyte Teradata database, which will grow 2 ½x further in size soon.
- Specifically, Oliver told me there are 8 petabytes of spinning disk, with 80% compression. So that’s 40 petabytes before you multiply by a reducing factor to cover mirroring, temp space, and so on. My low end for that factor would be 25-28%; my high end would be 35-40%; either way, we’re talking about >10 petabytes of true user data.
- The 8 petabytes of spinning disk are headed to 20 petabytes next year.
- Oliver gave the impression that Greenplum got thrown out more for reliability reasons than performance. (While eBay saw a major performance difference between Teradata and Greenplum, Oliver previously indicated he was inclined to attribute this more to specific Sun Thumper hardware/storage choices than to software.)
- That database, called “Singularity,” has some interesting aspects. Notably, alongside dozens of conventional relational columns, each row has a character field that is simply a string of name-value pairs, over which you can define views and the like to get virtual tables. (A toy sketch of the idea appears after this list.)
- The system ingests log data in the form of lots and lots of name-value pairs.
- The most commonly found ones go into columns in the usual way.
- The rest are strung together into, well, a character string.
- Teradata has developed some features for eBay that make it easier to index, query, etc. on that character string of name-value pairs.
- eBay’s more EDW-like (Enterprise Data Warehouse) multi-petabyte Teradata database continues to grow, with the main system apparently up to 4 ½ petabytes from the previous 2 ½.
- I took the opportunity to ask what kinds of data marts (virtual or otherwise) were spun out in practice.
- In Oliver’s ranking,
- #1 was derived data based on other data already in the data warehouse.
- #2 was other data within eBay that had never been put into the data warehouse in the first place.
- #3 was data truly from outside sources.
- Todd Walter chimed in to point out that at other Teradata customers who perhaps didn’t have as fully fleshed out an EDW, #1 and #2 could be reversed.
- eBay sees Hadoop as an interesting tool for certain special purposes.
- eBay likes Hadoop for certain tasks such as image analysis. (Edit: And analysis of search results.)
- eBay doesn’t like Hadoop for anything that requires data movement, such as a join.
- Similarly, eBay doesn’t like HBase.
- eBay is enamored of the idea of doing “social networking around analytics.”
- This is something that has been built but not rolled out yet.
- It seems more focused on actual business intelligence than on the underlying data, unlike Greenplum Chorus, which seems more focused on the databases themselves.
- Since it hasn’t been rolled out yet, we don’t know which (if any) of activity streams, forums, or whatever will actually get significant adoption.
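To make the name-value-pair layout above a bit more concrete, here is a toy Python sketch of the idea. The separators, field names, and helper functions are my own inventions for illustration; they are not eBay’s actual log format or Teradata’s implementation.

```python
# Toy sketch: a row keeps a few conventional columns, while everything else
# rides along in one name-value-pair (NVP) string. A "virtual column" is then
# just a lookup that falls back to parsing that string.

from typing import Optional


def parse_nvp(nvp: str, pair_sep: str = ";", kv_sep: str = "=") -> dict:
    """Turn 'browser=FF;country=DE;latency_ms=42' into a dict."""
    pairs = {}
    for chunk in nvp.split(pair_sep):
        if kv_sep in chunk:
            key, value = chunk.split(kv_sep, 1)
            pairs[key.strip()] = value.strip()
    return pairs


def virtual_column(row: dict, name: str) -> Optional[str]:
    """Resolve a 'column' that may live in the fixed schema or in the NVP blob."""
    if name in row:
        return row[name]
    return parse_nvp(row.get("nvp", "")).get(name)


log_row = {
    "event_id": "12345",
    "user_id": "u-987",
    "nvp": "browser=FF;country=DE;latency_ms=42",
}
print(virtual_column(log_row, "user_id"))   # fixed column -> u-987
print(virtual_column(log_row, "country"))   # NVP-backed virtual column -> DE
print(virtual_column(log_row, "missing"))   # absent on both sides -> None
```

The appeal is that rarely used attributes don’t each need their own column, while views over the NVP field can still present them as if they did.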
Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata
Comments
30 Responses to “eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more”
[…] quicky: eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more. Interesting to see that the impression is that Greenplum got thrown out more for reliability […]
Many sites try to keep Teradata on their toes through either the threat of, or actual deployment of, a competing technology.
Along with the credibility gained from being presented as the ‘Sun Data Warehouse’ (Greenplum on Thumper), I saw this as the main driver behind Greenplum’s adoption at eBay. This kind of competitive pressure has been happening increasingly since Netezza came on the scene in ~2003.
Now that Sun has gone to Oracle and Greenplum to EMC (ironically, one of Teradata’s disk sub-system suppliers), the roadmap in support of eBay’s continued use of Sun/Greenplum must be non-existent.
Big companies expect multi-year technology roadmaps, and some expect to be able to bend the roadmap significantly to meet their own requirements. This must have been a factor, surely?
The name-value pair database is interesting given that Nokia presented with Teradata at last year’s Teradata Partners event in Washington and basically said the POC they carried out couldn’t make the data easy to use, with multiple self-joins required for even the simplest query.
Does eBay have new Teradata features in support of this approach, I wonder?
Yes, there are new features.
No, I don’t know what they are. 🙂
Greenplum has been bought by EMC, and the platform has been bought by a database company…. Does this mean the product has become lackluster now?
Curt, the fundamental thrust of this post, that GP was thrown out, is simply not true and implies that GP is not viable in the MPP space, which is also not true.
eBay and GP did a research project, and they both learned a great deal. Sun, to its credit, stepped up and put a lot of time and energy into the 45xx platform.
The issue with NVP is the need for branch and loop control in expressions. If your database cannot branch/loop in an expression, you cannot process NVPs…
There was a lot of technology developed by all three companies, so when eBay goes and does the next big and unexpected thing, don’t throw Teradata under the bus like you just did GP.
In the interest of fair disclosure, I led the Singularity project and the research relationships between eBay and GP, and between eBay and TD. Two months ago, I became the Chief Architect for User Data and Analytics at Yahoo!.
Michael McIntire
Michael,
I neither implied nor believe that GP is not viable in the MPP space — that would be pretty silly.
Umm — NVP?
Thanks,
CAM
Curt – NVP = Name Value Pair. -Michael
I have worked on both technologies and respect both. Honestly, I think I smell marketing. For people who make decisions based on comments, I’d encourage you all to do an evaluation before you go with any solution.
BTW – the simple calculation of usable space works like this….
8PB Raw = 4PB mirrored.
4PB Less File System + Overhead (30%) = 2.8PB.
Compression:
60% avg = 2.8PB * 2.5 = 7PB Usable
70% avg = 2.8PB * 3.3 = 9.3PB Usable
Ergo, a 20PB box would be:
60% avg = 7PB * 2.5 = 17.5PB Usable
70% avg = 7PB * 3.3 = 23.1PB Usable
The basic rule of thumb is:
With generalized compression, Raw Disk Size = Usable Disk Size.
So using Oliver’s 80% figure, we’re talking 14 PB?
Or is 80% more like a peak than an average number?
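For what it’s worth, here is that arithmetic as a small Python sketch, with Oliver’s 80% figure plugged in alongside Michael’s 60% and 70% cases. The 50% mirroring and 30% overhead factors are taken straight from the comment above, and whether 80% is an average or a peak is exactly the open question:

```python
# Sanity check of the usable-capacity arithmetic above. Assumptions, taken from
# the comment (not official figures): mirroring halves raw disk, ~30% of what
# remains goes to file system and other overhead, and compression at rate r
# multiplies the rest by 1 / (1 - r) -- e.g. 60% -> 2.5x, 70% -> ~3.3x, 80% -> 5x.

def usable_pb(raw_pb: float, compression: float,
              mirroring: float = 0.5, overhead: float = 0.30) -> float:
    after_mirroring = raw_pb * mirroring
    after_overhead = after_mirroring * (1 - overhead)
    return after_overhead / (1 - compression)


for raw in (8, 20):
    for rate in (0.60, 0.70, 0.80):
        print(f"{raw} PB raw @ {rate:.0%} compression -> "
              f"{usable_pb(raw, rate):.1f} PB usable")

# 8 PB raw comes out to 7.0 / 9.3 / 14.0 PB usable at 60% / 70% / 80%,
# and 20 PB raw to 17.5 / 23.3 / 35.0 PB -- so yes, by this rule of thumb
# an 80% average on today's 8 PB would be roughly 14 PB usable.
```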
Curt,
As I stated previously, the “thrown out” part of your statement could not be further from the truth.
The casual question over lunch was: “Do you still use vendor XYZ?” And my response was a simple “No.”
For various reasons that I will not go into in this forum, we have simply selected a different vendor for V2 of our Singularity project. The same is true for many areas of our business. That said, we treat our vendors, current or past, with respect. The guys at Greenplum have gone above and beyond during the time we worked with them on a next-generation prototype.
We value and respect their entire team and, as Michael previously stated, have simply selected a different vendor for the next-generation system implementation.
I realize that provocative statements drive traffic to your blog, but I would appreciate it if you could remove any exaggerations from what reads like a quote from myself.
I am sorry, but I cannot support your statements in this blog post.
Oliver,
Thank you (and ditto Michael) for correcting any connotations you feel people may have wrongly inferred from what I wrote.
Usually that’s something I have to do myself.
Best,
CAM
[…] Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn’t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are exceptions.) […]
[…] of Teradata also sat in on the latter part of the conversation. Things I learned included… Lire l’article Article liésIBM va-t-il s’emparer de Netezza ?La Revue de Presse de l’été […]
[…] That eBay comment was particularly interesting. […]
[…] (ditto), Luke Lonergan (ditto), Todd Walter (almost unrecognizable without his usual cowboy gear), Oliver Ratzesberger, and a bunch of actual science […]
Can anyone tell me what the job of an ETL developer at eBay might be?
Thanks in advance.
What are the scalability limits of existing data warehouse products?…
Those aren’t all the same thing. Oracle RAC isn’t shared-nothing, although Exadata gets some of the shared-nothing benefits. Microsoft’s shared-nothing offering is very immature, as it was based on the troubled DATAllegro acquisition. Anyhow, Terada…
[…] I got into a flap with EMC Greenplum. I blindsided them on a story; they retaliated for the story by, among other things, screwing me over business-wise. Why did I […]
[…] this time that is indeed the phrase that was […]
[…] ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into […]
[…] an important concept on the monitoring-oriented side of business intelligence and — if Oliver Ratzesberger is to be believed — in investigative analytics as well. But the operational side may actually […]
[…] the first issue, which is size, let me point out that eBay have two data warehouse with many petabytes running Teradata. Obviously, Teradata is far from cutting edge new stuff. I didn’t heard of a […]
[…] says eBay’s traffic volumes produce huge data, not just big data. In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes — equivalent to about 266 years […]
[…] explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form […]