MapReduce

Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:

October 15, 2008

eBay doesn’t love MapReduce

The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.

Categories: eBay, MapReduce, Parallelization

7 Comments

September 5, 2008

More on known MapReduce application areas

In surveying MapReduce applications to date, I said that they fell mainly into three overlapping categories:

Text tokenization, indexing, and search
Creation of other kinds of data structures (e.g., graphs)
Data mining and machine learning

and really should have included a fourth:

Data transformation

Nokia just released another MapReduce implementation, Disco, and its list of applications to date fits right into that template. The relevant quote is:

This far Disco has been succesfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data.

Categories: MapReduce

Three different implementations of MapReduce

So far as I can see, there are three implementations of MapReduce that matter for enterprise analytic use – Hadoop, Greenplum’s, and Aster Data’s.* Hadoop has of course been available for a while, and used for a number of different things, while Greenplum’s and Aster Data’s versions of MapReduce – both in late-stage beta – have far fewer users.

*Perhaps Nokia’s Disco or another implementation will at some point join the list.

Earlier this evening I posted some Mike Stonebraker criticisms of MapReduce. It turns out that they aren’t all accurate across all MapReduce implementations. So this seems like a good time for me to stop stalling and put up a few notes about specific features of different MapReduce implementations. Here goes. Read more

Categories: Aster Data, Greenplum, MapReduce

3 Comments

September 4, 2008

Mike Stonebraker’s counterarguments to MapReduce’s popularity

In response to recent posting I’ve done about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnnings. In particular, Mike says: Read more

Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL

5 Comments

September 1, 2008

Yes, but what are the Very Biggest benefits of MapReduce?

On behalf of On-Demand Enterprise, nee’ Grid Today, Dennis Barker asked me to clarify the most important benefits, features, etc. to various constituencies (business users, programmers, DBAs, etc.) of the Greenplum and Aster Data MapReduce announcements. Questions like that are hard to answer simply. Here’s why.

The core benefit of MapReduce is price/performance (because it allows the cost benefits of parallelization to be applied to analyses that are hard to parallelize otherwise). Large price/performance gains commonly mix together three kinds of benefits.

1. They let you do what you did before, for less money.
2. They let you do a better version of what you did before, for similar money.
3. They let you do new things that didn’t make economic sense before, but now do.
Read more

Categories: Analytic technologies, Data warehousing, MapReduce

Three approaches to parallelizing data transformation

Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.

*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.

Categories: Aster Data, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization, Pervasive Software

8 Comments

August 26, 2008

Why MapReduce matters to SQL data warehousing

Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.” The long answer goes something like this.

The core ideas of MapReduce are: Read more

Categories: Analytic technologies, Data warehousing, MapReduce, Parallelization

24 Comments

August 26, 2008

Known applications of MapReduce

Most of the actual MapReduce applications I’ve heard of fall into a few areas:

Text tokenization, indexing, and search
Creation of other kinds of data structures (e.g., graphs)
Data mining and machine learning

That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online. Read more

Categories: MapReduce, RDF and graphs, Text

16 Comments

August 25, 2008

MapReduce links

For whatever reason, I seem to be making the peripheral posts about MapReduce tonight before getting to the meat of the issues. So be it. There’s a rich set of links out there about MapReduce, and here are some of the best of them:

Aster Data introduced MapReduce integrated into its SQL data warehouse DBMS tonight. Aster’s site features an excellent white paper.
Exactly the same is true of Greenplum.
Google Labs offers the seminal MapReduce research paper. It also has a broken link to an associated slide presentation, which fortunately is available here.
One can get a good sense of MapReduce by reading up on the open source implementation Hadoop.
In particular, this list of Hadoop applications is the longest list of MapReduce applications I know of (ahead even of Google’s long internal list).
Joel Spolsky explained the core MapReduce concept a couple of years ago.

Categories: MapReduce, Parallelization

8 Comments

August 25, 2008

MapReduce sound bites

Last Thursday, both Greenplum and Aster Data — the two most recent of my numerous data warehouse specialist customers — both told me of the same major innovation. Both were rushing to announce it first, before anybody else did. This led to considerable tap dancing, with the upshot being that both are releasing the information tonight or tomorrow morning.

What’s going on is that Aster Data and Greenplum have both integrated MapReduce into their respective MPP shared-nothing data warehouse DBMS. Read more

Categories: Analytic technologies, Aster Data, Greenplum, MapReduce, Parallelization

11 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in