October 15, 2008
eBay doesn’t love MapReduce
The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was a parallel efficiency of 18%.
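eBay didn’t spell out its methodology, but on the classic definition, parallel efficiency is just speedup divided by processor count. A quick illustrative sketch (the timings and node count below are hypothetical, not eBay’s actual figures):

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_workers: int) -> float:
    """Classic definition: speedup (serial time / parallel time) divided by worker count.

    1.0 means every worker was fully busy doing useful work;
    0.18 means workers were, on average, busy only 18% of the time.
    """
    speedup = t_serial / t_parallel
    return speedup / n_workers

# Example: a job that takes 100 hours on one node and 11.1 hours on 50 nodes
# achieves a speedup of ~9x, i.e. roughly 18% parallel efficiency.
print(parallel_efficiency(100.0, 11.1, 50))  # ≈ 0.18
```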
7 Responses to “eBay doesn’t love MapReduce”
I can imagine that this is a difficult thing to measure with Map/Reduce. I know Google’s implementation runs the same part of the query on several nodes, to protect against any one node having a performance issue that slows down the return of the result. So while this part is parallel, it is parallelism for redundancy; whether to include it in parallel efficiency determinations is debatable.
Hmm. I would imagine eBay wasn’t including 2-4X more redundancy than they think they really need to get the work done.
CAM
Tony: At least in the public Map/Reduce paper, redundant tasks are only started toward the very end of the entire Map/Reduce job, so they shouldn’t represent a very significant percentage of the total work required by the job (so the paper argues, anyway).
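To put that argument in rough numbers: if backup tasks are launched only for the last few percent of tasks, the redundant work amounts to a few percent of the job, nowhere near enough to explain an 18% efficiency figure. A back-of-the-envelope sketch (all numbers here are hypothetical, not from the paper or from eBay):

```python
# Rough estimate of how much extra work "backup" (speculative) tasks add
# if they are launched only near the end of a job, as the Google MapReduce
# paper describes. All numbers are hypothetical.

total_tasks = 10_000          # map + reduce tasks in the job
straggler_fraction = 0.01     # tasks still running when backups are launched
backup_copies = 1             # one redundant copy per straggler

redundant_tasks = total_tasks * straggler_fraction * backup_copies
overhead = redundant_tasks / total_tasks
print(f"Redundant work: {overhead:.1%} of the job")  # 1.0%
```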
I think this is going to be heavily dependent on exactly what you use MapReduce for. Some things are in its sweet spot more than others, depending on how much work the map part is and how much work the reduce part is. I think it’s a bit premature for anyone to draw general conclusions about MapReduce from this one anecdote. (Note: I have never used MapReduce and am not in any way an expert; but I have read about it and believe I get the idea.)
Hi Curt,
This is an interesting topic and I would like to share my thoughts here. MapReduce is a parallelization paradigm, and hardware utilization is heavily implementation-dependent. When MapReduce is used just to process text files, as is the case with some popular implementations, the text files are randomly split across a large number of nodes and there is no notion of a “schema” as it exists in a relational database. The lack of a schema requires brute-force reading of all the data and a lot of shuffling over the network during MapReduce execution. This, in turn, puts a heavy load on disk I/O and the network, leaving processors waiting the majority of the time.

In contrast, Aster’s implementation of MapReduce inside a database achieves much more efficient utilization. Aster has tightly integrated MapReduce with SQL, and SQL/MR functions (Aster’s In-Database MapReduce) are seamlessly invoked as part of SQL. The query planner plans for SQL/MR functions just as it does for other SQL operations. This means Aster’s MapReduce can take advantage of database functionality: schema-awareness, globally optimal query planning, partition pruning, and indexes. The result is that a much smaller amount of data is read from disk, reducing disk I/O. When data does have to be shuffled, Aster nCluster’s Optimized Transport capabilities kick in, making network transport more efficient. Hence, Aster nCluster brings data to the processors more efficiently and keeps them working at much higher levels of utilization, by reducing I/O and using the network more efficiently. More details on our In-Database MapReduce can be found at http://www.asterdata.com/product/mapreduce.php
Thanks,
Ajeet
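The heart of Ajeet’s argument, that schema awareness lets an engine skip most of the data rather than scanning all of it, can be sketched in a few lines. This is a toy illustration of partition pruning in general, not Aster’s actual implementation:

```python
# Toy contrast between a schema-free scan (read everything, then filter)
# and a schema-aware, partitioned scan (touch only the matching partition).
# Didactic sketch only; not how Aster nCluster is actually built.

from collections import defaultdict

# Data partitioned by date, as a schema-aware store might lay it out.
partitions = defaultdict(list)
for row in [("2008-10-01", 10), ("2008-10-01", 20), ("2008-10-15", 5)]:
    partitions[row[0]].append(row)

def brute_force_scan(query_date):
    """Schema-free style: read every row, then filter."""
    rows_read = sum(len(p) for p in partitions.values())
    result = [r for p in partitions.values() for r in p if r[0] == query_date]
    return result, rows_read

def pruned_scan(query_date):
    """Schema-aware style: read only the matching partition."""
    matching = partitions.get(query_date, [])
    return list(matching), len(matching)

print(brute_force_scan("2008-10-15"))  # reads 3 rows to return 1
print(pruned_scan("2008-10-15"))       # reads 1 row to return 1
```

The gap between the two row counts is exactly the wasted I/O that leaves processors idle in the schema-free case.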
[…] Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”. […]
[…] the time. Back then I exchanged emails with industry watcher Curt Monash who wrote at the time, “eBay doesn’t love MapReduce.” At eBay, we thoroughly evaluated Hadoop and came to the clear conclusion that MapReduce is […]