MapReduce

Analysis of implementations of, and issues associated with, the parallel programming framework MapReduce.

April 14, 2009

Stonebraker, DeWitt, et al. compare MapReduce to DBMS

Along with five other coauthors (the lead author seems to be Andy Pavlo), famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is a set of benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):

A couple of different flavors of a Grep task originally proposed in a Google MapReduce paper.
A database query on simulated clickstream data.
A join on the same clickstream data.
Two aggregations on the clickstream data.
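
To make those task types concrete, here is a minimal single-process sketch in Python of how a Grep-style task and a per-key aggregation break down into map and reduce steps. It is an illustration only, not the paper's benchmark code: the helper `run_mapreduce`, the record layout, the field names, and the sample data are all assumptions, and a real Hadoop job would spread the same phases across many nodes.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Grep-style task: emit every line that contains the pattern.
def grep_mapper(pattern):
    def mapper(line):
        if pattern in line:
            yield line, 1
    return mapper

# Shared reducer: add up the values collected for one key.
def sum_reducer(key, values):
    return sum(values)

# Aggregation-style task: total revenue per source IP (hypothetical fields).
def revenue_mapper(visit):
    yield visit["source_ip"], visit["revenue"]

lines = ["abc XYZ def", "no match here", "XYZ again"]
matches = run_mapreduce(lines, grep_mapper("XYZ"), sum_reducer)

visits = [
    {"source_ip": "10.0.0.1", "revenue": 5},
    {"source_ip": "10.0.0.2", "revenue": 2},
    {"source_ip": "10.0.0.1", "revenue": 3},
]
revenue = run_mapreduce(visits, revenue_mapper, sum_reducer)
```

Here `matches` holds the two lines containing the pattern, and `revenue` maps each source IP to its summed revenue. The grouping step in the middle is the part a DBMS would express as GROUP BY.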

Categories: Analytic technologies, Hadoop, MapReduce, Michael Stonebraker, Parallelization, Vertica Systems

April 3, 2009

Amazon Elastic MapReduce

Amazon is introducing a beta of Amazon Elastic MapReduce. What it boils down to is cheap, on-demand Hadoop.

This seems like a great way to experiment with MapReduce and see if you like it. But for serious use, I don’t know why you wouldn’t prefer MapReduce more closely integrated into a DBMS.

Categories: Amazon and its cloud, Cloud computing, MapReduce

March 31, 2009

Twitter is considering using MapReduce

From a Twitter job listing (formatting mine). The most interesting section is “Additional preferred experience.”

Categories: Analytic technologies, Data warehousing, MapReduce, Specific users, Web analytics

March 7, 2009

Three Greenplum customers’ applications of MapReduce

Greenplum (and Truviso) advisor Joseph Hellerstein offers a few examples of MapReduce applications (specifically Greenplum MapReduce), namely:

The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).

Roger talked about using MapReduce to extract structured entities from text for doing tech trend analyses from billions of rows of online job postings. Brian (who is a mathematician by training) was talking about implementing conjugate gradient and Support Vector Machines in parallel SQL to support “hypertargeting” for advertisers. I mentioned how Jonathan Goldman at LinkedIn was using SQL and MapReduce to do graph algorithms for social network analysis.

Incidentally: While it’s been some months since I asked, my sense is that the O’Reilly text extraction is home-grown, and primitive compared to what one could do via commercial products. That said, if the specific application is examining job postings, I’m not sure how much value more sophisticated products would add. After all, tech job listings are generally written in a style explicitly designed to ensure that most or all of their meaning is conveyed simply by a bag of keywords. And by the way, this effort has been underway for quite some time.

Related link

Greenplum has a page on the O’Reilly relationship. However, the part that isn’t behind a registration barrier is trivial — and I wouldn’t know one way or the other about the registration-required part.

Categories: Analytic technologies, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Specific users, Web analytics

February 23, 2009

MapReduce user eHarmony chose Netezza over Aster or Greenplum

Depending on which IDG reporter you believe, eHarmony has either 4 TB of data or more than 12 TB, stored in Oracle but now analyzed on Netezza. Interestingly, eHarmony is a Hadoop/MapReduce shop, but chose Netezza over Aster Data or Greenplum even so. Price was apparently an important aspect of the purchase decision. Netezza also seems to have had a very smooth POC.

Categories: Application areas, Aster Data, Benchmarks and POCs, Data warehousing, Greenplum, MapReduce, Netezza, Oracle, Predictive modeling and advanced analytics, Pricing

February 12, 2009

An example of Aster Data’s nPath/MapReduce syntax

Perhaps in response to my prior post on Aster Data’s introduction of MapReduce-based nPath, Steve Wooledge of Aster offers a more detailed example. The particular case he works through is:

… the question: for SEO/SEM-driven traffic that stay on our site only for 5 or less pageviews and then leave our site and never return in the same session, what are the top referring search queries and what are the top path of navigated pages on our site?
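
The logic of that question can be sketched in plain Python, independent of Aster's actual nPath syntax, which is not reproduced here. The session layout, field names, and sample data below are hypothetical; the sketch assumes each session has already been assembled into an ordered list of page views tagged with the referring search query, if any.

```python
from collections import Counter

MAX_PAGEVIEWS = 5  # the "5 or less pageviews" threshold from the question

def short_search_sessions(sessions):
    """Keep search-referred sessions that ended within MAX_PAGEVIEWS pages."""
    return [
        s for s in sessions
        if s["search_query"] is not None and len(s["pages"]) <= MAX_PAGEVIEWS
    ]

def top_queries_and_paths(sessions, n=3):
    """Count referring queries and full navigation paths for short sessions."""
    short = short_search_sessions(sessions)
    queries = Counter(s["search_query"] for s in short)
    paths = Counter(tuple(s["pages"]) for s in short)
    return queries.most_common(n), paths.most_common(n)

sessions = [
    {"search_query": "cheap flights", "pages": ["/home", "/deals"]},
    {"search_query": "cheap flights", "pages": ["/home", "/deals"]},
    {"search_query": "hotels", "pages": ["/home"]},
    # Not search-driven, so excluded.
    {"search_query": None, "pages": ["/home", "/about"]},
    # Six pageviews, so excluded.
    {"search_query": "hotels", "pages": ["/a", "/b", "/c", "/d", "/e", "/f"]},
]
top_queries, top_paths = top_queries_and_paths(sessions)
```

The hard part in practice is building those per-session sequences from raw click rows at scale, which is presumably where the MapReduce side comes in.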

Categories: Analytic technologies, Aster Data, Data warehousing, MapReduce, Web analytics

Aster Data nPath

Edit: Unfortunately, this post and its sequel rely on Aster Data posts that Aster’s buyer Teradata no longer makes easily available.

At the same time as it rolled out its cloud story, Aster Data told of nPath, a MapReduce-based feature in nCluster. As best I understand it, the core idea of nPath is that it preprocesses sequential data via MapReduce so that you can then do ordinary SQL on it. (Steve Wooledge’s blog post about nPath outlines why that might be needed. Point 1 in Mayank Bawa’s August, 2008 post is much more concise. 😉 ) Now, that might seem to contradict the syntax, which is all about MapReduce being invoked via SQL — still, it’s what’s really going on.

That leads to two obvious questions: What is nPath used (or useful) for? And how is the preprocessing done anyway?
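
As a rough picture of the general idea only (preprocess sequential data first, then run ordinary SQL on the result), here is a small Python sketch using the standard `sqlite3` module. It is not Aster's implementation or syntax; the table name, column names, `sessionize` helper, and click data are all made up for illustration.

```python
import sqlite3
from itertools import groupby

# Hypothetical raw clickstream rows: (session_id, timestamp, page).
clicks = [
    (1, 10, "/home"), (1, 20, "/search"), (1, 30, "/checkout"),
    (2, 5, "/home"), (2, 15, "/search"),
    (3, 7, "/home"), (3, 9, "/search"), (3, 12, "/checkout"),
]

def sessionize(rows):
    """Preprocessing step: collapse each session's ordered clicks into one row."""
    ordered = sorted(rows)  # sorts by session_id, then timestamp
    for session_id, group in groupby(ordered, key=lambda r: r[0]):
        yield session_id, " > ".join(page for _, _, page in group)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session_paths (session_id INTEGER, path TEXT)")
conn.executemany("INSERT INTO session_paths VALUES (?, ?)", sessionize(clicks))

# Once the sequences are flattened, ordinary SQL can aggregate over them.
path_counts = conn.execute(
    "SELECT path, COUNT(*) AS n FROM session_paths "
    "GROUP BY path ORDER BY n DESC, path"
).fetchall()
```

In a parallel system, the `sessionize` step is the part that would be distributed by session key; the final query is plain SQL.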

Categories: Aster Data, Data warehousing, MapReduce, Predictive modeling and advanced analytics, Web analytics

February 10, 2009

Aster Data in the cloud

Aster Data is in the news, bragging about a cloud version of nCluster, and providing both a press release and a blog post on the subject. It seems there are three actual customers, two of which have been publicly named. One of them, ShareThis, is in production (2 terabytes of data on 9 nodes, planning to scale to 10-18 TB on 24 or so nodes by year-end). All seem to be doing something in the area of internet marketing, web analytics or otherwise, which makes sense, as the same could be said of almost all Aster customers overall. That said, it seems that these customers are doing their primary analytic processing remotely, which makes Aster’s experience in that regard more akin to Kognitio’s than to Vertica’s.

Categories: Analytic technologies, Application areas, Aster Data, Cloud computing, Data warehousing, MapReduce, Software as a Service (SaaS), Web analytics

November 15, 2008

High-performance analytics

For the past few months, I’ve collected a lot of data points to the effect that high-performance analytics (i.e., beyond straightforward query) is becoming increasingly important. And I’ve written about some of them at length. For example:

MapReduce – controversial or in some cases even disappointing though it may be – has a lot of use cases.
It’s early days, but Netezza and Teradata (and others) are beefing up their geospatial analytic capabilities.
Memory-centric analytics is in the spotlight.

Ack. I can’t decide whether “analytics” should be a singular or plural noun. Thoughts?

Another area that’s come up, which I haven’t blogged about so much, is data mining in the database. Data mining accounts for a large part of data warehouse use. The traditional way to do data mining is to extract data from the database and dump it into SAS. But there are problems with this scenario, including: …

Categories: Aster Data, Data warehousing, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Netezza, Oracle, Parallelization, SAS Institute, Teradata

October 22, 2008

Update on Aster Data Systems and nCluster

On my West Coast swing last week, I spent a few hours at Aster Data, which has now officially put out Version 3 of nCluster. Highlights included: …

Categories: Application areas, Aster Data, Data warehousing, Database compression, MapReduce, Market share and customer counts, Parallelization, Specific users, Theory and architecture, Web analytics
