Hadoop

Discussion of Hadoop. Related subjects include:

MapReduce
Open source database management systems

December 30, 2009

Clearing up MapReduce confusion, yet again

I’m frustrated by a constant need — or at least urge 🙂 — to correct myths and errors about MapReduce. Let’s try one more time: Read more

December 12, 2009

The legit part of the NoSQL idea

I’ve written some snarky things about the “NoSQL” concept – or at least the moniker. (Carl Olofson’s term “non-schematic databases” seems less bad.) Yet I’m actually favorable about the increasing use of SQL alternatives. Perhaps I should pull those thoughts together. Read more

October 18, 2009

Three big myths about MapReduce

Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

Read more

October 10, 2009

How 30+ enterprises are using Hadoop

MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:

Read more

October 4, 2009

Jacek Becla on issues in scientific data management

Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own. Read more

October 3, 2009

Issues in scientific data management

In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:

However: Read more

October 1, 2009

MapReduce tidbits

I’ve never had children, and so have never had to supervise squabbling siblings, each accusing the other of selfishness and insufficient sharing. Perhaps the MapReduce vendors are a form of karmic payback. Be that as it may, my client Cloudera has organized Hadoop World on October 2 in New York, and my other client Aster Data is hosting a MapReduce-centric Big Data Summit the night before, at the same venue. Even if you don’t go, both conferences’ agenda pages offer a peek into what’s going on in MapReduce applications. I’m not going either, but even so I hope to post an overview of MapReduce uses after the conferences serve to publicize some of them.

Even better, I plan to hold a couple of webinars on MapReduce, at 10 am (blech) and 1 pm Eastern time on October 15. They’re sponsored by Aster Data, and so will have a strong SQL/MapReduce orientation.

In connection with its conference, Aster is introducing an nCluster-Hadoop connector — i.e., a loader from HDFS (Hadoop Distributed File System) implemented in SQL/MapReduce. In particular: Read more

October 1, 2009

Yahoo wants to do decapetabyte-scale data warehousing in Hadoop

My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point where it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.

Highlights of our visit included:

Read more

September 13, 2009

HadoopDB

Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.

The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:

HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
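To make the architecture concrete, here is a minimal sketch of the HadoopDB idea in Java: each map task pushes a SQL query fragment to the DBMS instance co-located on its node (PostgreSQL in this sketch) and emits rows for Hadoop to shuffle and combine. This is not HadoopDB’s actual code; the table name, query fragment, and connection details are illustrative assumptions, and the real system derives per-node fragments from the user’s query rather than hard-coding one.

```java
// Sketch only: a map task that queries the node-local DBMS via JDBC and
// emits partial aggregates for Hadoop to combine across nodes.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalSqlMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical per-node fragment; "clicks_local" stands in for this node's slice of the data.
    private static final String FRAGMENT =
        "SELECT url, COUNT(*) AS hits FROM clicks_local GROUP BY url";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Connection details are assumptions; each node's DBMS listens on localhost.
        // In practice this would be paired with an InputFormat that hands each map
        // task exactly one split per local database chunk, so the fragment runs once.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/warehouse", "hadoop", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(FRAGMENT)) {
            while (rs.next()) {
                // Emit partial counts; a summing reducer completes the query.
                context.write(new Text(rs.getString("url")),
                              new LongWritable(rs.getLong("hits")));
            }
        } catch (SQLException e) {
            throw new IOException("local DBMS query failed", e);
        }
    }
}
```

A reducer that sums the per-node partial counts would finish the job, with Hadoop handling the cross-node shuffle that a parallel DBMS would otherwise do internally.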

In my opinion, however, the real opportunity for HadoopDB may lie elsewhere. Read more

September 13, 2009

Fault-tolerant queries

MapReduce/Hadoop fans sometimes raise the question of query fault-tolerance. That is — if a node fails, does the query need to be restarted, or can it keep going? For example, Daniel Abadi et al. trumpet query fault-tolerance as one of the virtues of HadoopDB. Some of the scientists at XLDB spoke of query fault-tolerance as being a good reason to leave 100s or 1000s of terabytes of data in Hadoop-managed file systems.
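For context on what “keep going” means in Hadoop itself: because map output is materialized to local disk, the framework can re-run just the tasks that were on a failed node rather than restarting the whole job. The sketch below, assuming the Hadoop 0.20-era JobConf API, shows the settings that govern this per-task retry behavior; the particular values are illustrative, not recommendations.

```java
// Sketch: the Hadoop 0.20-era settings behind mid-query fault tolerance.
// If a node dies, only its tasks are retried; the job keeps going.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class RetryTolerantJob {
    public static JobConf configure() {
        JobConf job = new JobConf(new Configuration(), RetryTolerantJob.class);

        // Re-attempt a failed map or reduce task up to 4 times, potentially
        // on different nodes, before declaring the job failed.
        job.setMaxMapAttempts(4);
        job.setMaxReduceAttempts(4);

        // Optionally tolerate a small percentage of permanently failed map
        // tasks instead of aborting the whole query.
        job.setMaxMapTaskFailuresPercent(1);

        // Speculative execution: re-run straggler tasks elsewhere and keep
        // whichever attempt finishes first.
        job.setMapSpeculativeExecution(true);
        job.setReduceSpeculativeExecution(true);

        return job;
    }
}
```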

When we discussed this subject a few months ago in a couple of comment threads, it seemed to be the case that:

This raises an obvious (pair of) question(s) — why and/or when would anybody ever care about query fault-tolerance? Read more
