MapReduce
Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:
MapReduce webinars and annotated slides
As previously noted, I’m giving a webinar twice today — i.e., Thursday, October 15 — at 10:00 am and 1:00 pm Eastern time.
- The subject is MapReduce.
- The sponsor is Aster Data.
- Part of the webinar will be an explanation of MapReduce basics, especially the conflict between theory/propaganda and reality.
- As you might guess from the identity of the sponsor, there will be an emphasis on how MapReduce and SQL play nicely with each other.
- You can register for the webinar on Aster’s site.
- (Edit) The webinar replay can be found here.
- I’ve already uploaded the slides from which I will present. (But not the ones from which Aster folks will be talking. I’ve seen those, and there’s some good technical crunch in some of them.) The “Notes” under the slides have a number of relevant URLs for follow-up, as well as a small number of explanatory comments (e.g., as to why one slide simply has a quote from and corresponding picture of Shakespeare).
Categories: Aster Data, MapReduce, Presentations | 6 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
I have some presentations coming up (all on October Thursdays)
On Thursday, October 15, and two different times (10:00 am and 1:00 pm Eastern time), I’ll be giving a webinar for Aster Data on MapReduce. The content is very much work in progress, but it definitely will:
- Be overviewy in nature
- Emphasize SQL/MapReduce integration
Then, on the evening of Thursday, October 22, there’s something called the Boston Big Data Summit, in Waltham, where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …) It’s being put together by Amrith Kumar (who I don’t really know) and Bob Zurek (who everybody knows). This is the inaguaral meeting. It seems I’m both giving the keynote and running the subsequent panel, one of whose participants will be Ellen Rubin. Read more
Categories: Analytic technologies, Aster Data, Cloud computing, MapReduce, Presentations | 4 Comments |
Oracle’s version of “actually, we’ve been doing MapReduce all along too”
In a recent blog post, Jean-Pierre Dijcks of Oracle makes the argument that Oracle has supported MapReduce all along, essentially because:
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Map steps.
- You can do lots of procedural logic in the Oracle database, in a broad choice of languages, so in particular you can do Reduce steps.
- Oracle offers a mechanism for parallelizing procedural logic.
Oracle doesn’t appear to have an explicit Map/Reduce programming interface, but I wouldn’t be surprised if Oracle Consulting cranked one out at some point to meet customer demand.
The post goes on to claim the usual in-database MapReduce benefit of avoiding the overhead of intermediate query result materialization. Presumably, then, Oracle’s quasi-MapReduce would also lack query fault-tolerance.
Categories: Analytic technologies, MapReduce, Oracle, Parallelization | 1 Comment |
Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more
Categories: Analytic technologies, Hadoop, MapReduce, Objectivity and Infinite Graph, Open source, Parallelization, SciDB, Scientific research | 4 Comments |
MapReduce tidbits
I’ve never had children, and so have never had to supervise squabbling siblings, each accusing the other of selfishness and insufficient sharing. Perhaps the MapReduce vendors are a form of karmic payback. Be that as it may, my client Cloudera has organized Hadoop World on October 2 in New York, and my other client Aster Data is hosting a MapReduce-centric Big Data Summit the night before, at the same venue. Even if you don’t go, both conference’s agenda pages offer a peek into what’s going on in MapReduce applications. I’m not going either, but even so I hope to post an overview of MapReduce uses after the conferences serve to publicize some of them.
Even better, I plan to hold a couple of webinars on MapReduce, the first at 10 am (blech) and 1 pm Eastern time on October 15. They’re sponsored by Aster Data, and so will have a strong SQL/MapReduce orientation.
In connection with its conference, Aster is introducing an nCluster-Hadoop connector — i.e., a loader from HDFS (Hadoop Distributed File System) implemented in SQL/MapReduce. In particular: Read more
Categories: Aster Data, Cloudera, Data warehousing, Hadoop, MapReduce | 7 Comments |
Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra.
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- Metadata
- SLAs/high availability/other workload management
- Data retention policies
- Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
Categories: Analytic technologies, Data warehousing, Hadoop, MapReduce, Open source, Oracle, Petabyte-scale data management, Web analytics, Yahoo | 6 Comments |
HadoopDB
Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.
The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:
- Open/cheap/free
- Query fault-tolerance
- The related concept of tolerating node degradation that isn’t an outright node failure.
HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more
Teradata and Netezza are doing MapReduce too
Netezza told me a while ago that it planned to introduce MapReduce, and agreed yesterday this was no longer NDAed. Stephen Brobst of Teradata* let slip at XLDB that Teradata has MapReduce too, apparently implemented but not yet generally available.
I don’t have details in either case. Netezza and Teradata evidently aren’t taking MapReduce as seriously as Aster Data, or even Greenplum or Vertica. But MapReduce has become pretty much of a “checkmark” item for large-database analytic DBMS vendors even so.
*Technically, Brobst is not and never has been a Teradata employee — but he’s widely and correctly regarded as being “of Teradata” even so. 🙂
Categories: Data warehousing, MapReduce, Netezza, Teradata | 6 Comments |
Vertica’s version of MapReduce integration
I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:
- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
- Cloudera’s DBInputFormat assumes the database runs on a single computer, or a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, and so Vertica adapted the DBInputFormat technology accordingly.
- Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and ones who don’t.
- Vertica lets you do Reduce functions (or Map functions, if you don’t push them down to the database) on a separate cluster than you run the database software. Vertica asserts that its customers and prospects all want to do this. Right here is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
- Indeed, Vertica lets you Reduce into a different DBMS than Vertica, if you choose.
- Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there were some limits on how fast one can add or subtract nodes in a Vertica grid, because there’s data redistribution involved. But one can add/change/delete Hadoop clusters extremely quickly.
Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically: Read more