MapReduce
Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:
Hadoop futures and enhancements
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1” plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2”.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
Categories: Cloudera, Greenplum, Hadapt, Hadoop, HBase, MapR, MapReduce, Parallelization, Zettaset | 20 Comments |
Cloudera and Hortonworks
My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:
- Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn’t have proprietary code; it’s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).
- Paid subscription support for Apache Hadoop and, in connection with that …
- … proprietary software that all support customers automatically get. There are two points to this proprietary software:
- It adds value for the customer.
- It makes Cloudera’s support job easier.
- Professional services around Hadoop.
- Training and conferences around Hadoop, which probably don’t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.
Hortonworks spun out of Yahoo last week, taking with it parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.
Categories: Cloudera, Hadoop, Hortonworks, IBM and DB2, MapReduce, Open source, Yahoo | 9 Comments |
Hadapt update
I met with the Hadapt guys today. I think I can be a bit crisper than before in positioning Hadapt and its use cases, namely:
- Hadapt is additional software on a cluster that also runs fully functional Hadoop/HDFS. (To date that’s been Cloudera’s Hadoop distribution more often than straight-from-Apache Hadoop, but that’s not a requirement.)
- The cluster also runs a DBMS on every node, such as PostgreSQL or one of Infobright/Vectorwise.
- Hadapt’s software manages parallel SQL queries by distributing them to the DBMS living on each node. Hadapt says that the resulting query performance far outshines Hive’s. (A toy sketch of the general pattern appears after this list.)
- Hadapt further says that, by exploiting the partner DBMS, its SQL functionality outpaces Hive’s as well.
- Target Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the relational store.
- In particular, Hadapt seems like an interesting choice when you want to use that relational data as you work on other data that’s still in HDFS, or if you want to keep using the relational data in other kinds of MapReduce jobs.
- That all fits well with my thoughts about the importance of derived data.
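To make the split-execution idea a bit more concrete, here is a toy sketch of the general scatter-gather pattern. To be clear, this is my own illustration, not anything Hadapt has shown me: fan the same SQL out over JDBC to the DBMS instance on each node, then union the results. Real split execution also has to handle joins, repartitioning, and node failure, which is where the hard part lies.

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

/** Toy scatter-gather: run the same SQL against each node-local DBMS, union the results. */
public class ScatterGatherQuery {
    public static List<Object[]> run(List<String> nodeJdbcUrls, String sql) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nodeJdbcUrls.size());
        List<Future<List<Object[]>>> parts = new ArrayList<>();
        for (String url : nodeJdbcUrls) {
            parts.add(pool.submit(() -> {
                List<Object[]> rows = new ArrayList<>();
                // Each node's DBMS (e.g. PostgreSQL) executes its local fragment of the query.
                try (Connection c = DriverManager.getConnection(url);
                     Statement s = c.createStatement();
                     ResultSet rs = s.executeQuery(sql)) {
                    int cols = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        Object[] row = new Object[cols];
                        for (int i = 0; i < cols; i++) row[i] = rs.getObject(i + 1);
                        rows.add(row);
                    }
                }
                return rows;
            }));
        }
        List<Object[]> result = new ArrayList<>();
        for (Future<List<Object[]>> part : parts) result.addAll(part.get()); // gather step
        pool.shutdown();
        return result;
    }
}
```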
Other developments since I wrote about Hadapt a few months ago include:
- Hadapt is in beta now.
- Hadapt has added adult supervision in the form of Philip Wickline, late of Endeca.
In other news, Hadapt is our newest client.
Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Forthcoming Oracle appliances
Edit: I checked with Oracle, and it’s indeed TimesTen that’s supposed to be the basis of this new appliance, as per a comment below. That would be less cool, alas.
Oracle seems to have said on yesterday’s conference call that Oracle OpenWorld (first week in October) will feature appliances based on Tangosol and Hadoop. As I post this, the Seeking Alpha transcript of Oracle’s call is riddled with typos. Bolded comments below are by me. Read more
Categories: Data warehouse appliances, Hadoop, In-memory DBMS, MapReduce, Memory-centric data management, Object, Oracle | 8 Comments |
Metaphors amok
It all started when I disputed James Kobielus’ blogged claim that Hadoop is the nucleus of the next-generation cloud EDW. Jim posted again to reiterate the claim, only this time he wrote that all EDW vendors [will soon] bring Hadoop into the heart of their architectures. (All emphasis mine.)
That did it. I tweeted, in succession:
- Actually, I vote for Hadoop as the lungs of the EDW — first place of entry for essential nutrients.
- Data integration can be the heart of the EDW, pumping stuff around. RDBMS/analytic platform can be the brain.
- iPad-based dashboards that may engender envy, but which actually are only used occasionally and briefly … well, you get the picture.*
*Woody Allen said in Sleeper that the brain was his second-favorite organ.
Of course, that body of work was quickly challenged. Responses included: Read more
Categories: Analytic technologies, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Fun stuff, Hadoop, Humor, MapReduce | Leave a Comment |
Patent nonsense: Parallel Iron/HDFS edition
Alan Scott commented with concern about Parallel Iron’s patent lawsuit attacking HDFS (Hadoop Distributed File System), filed in — where else? — Eastern Texas. The patent in question — US 7,415,565 — seems to in essence cover any shared-nothing block storage that exploits a “configurable switch fabric”; indeed, it’s more oriented to OLTP (OnLine Transaction Processing) than to analytics. For example, the Background section starts: Read more
Categories: EMC, Hadoop, MapReduce, Parallelization, Storage | 9 Comments |
Hadoop confusion from Forrester Research
Jim Kobielus started a recent post:
Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”
What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.
So far, so good. But I disagree with almost everything Jim wrote after that.
Jim’s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that
Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition
where by “cloud” I presume Jim means first and foremost “private cloud.”
I don’t think any of that matches Hadoop’s actual strengths and weaknesses, whether now or in the 3-7 year future. My reasoning starts:
- Hadoop is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool …
- … but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won’t close as long as the others remain a moving target.
- I don’t think MapReduce is going to fail altogether; it’s too well-suited for too many use cases. (For a reminder of just how small the programming model is, see the sketch after this list.)
- Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, unless perhaps it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.
- The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.
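As a reminder of how small that programming model really is, here is the canonical word count, essentially the example that ships with Hadoop (0.20-era org.apache.hadoop.mapreduce API). The framework supplies the input splitting, shuffling, and fault tolerance around these two little functions, which is exactly why the model stretches across so many use cases.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts the shuffle has grouped under each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```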
As for the rest of Jim’s claim — I see three main candidates for the “nucleus of the next-generation enterprise data warehouse,” each with better claims than Hadoop:
- Relational DBMS, much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.
- Grand cosmic data integration tools. (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.
- Nothing. (The architecture could have several strong members, none of which is truly the “nucleus.”) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.
Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Theory and architecture | 9 Comments |
Alternatives for Hadoop/MapReduce data storage and management
There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is run by a master node (specifically, the namenode), which is a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two. (The replication factor is at least configurable; see the sketch below.)
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
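In fairness on the replication point: three copies is a configurable default, not a hard-wired design choice. Below is a minimal sketch against the standard Hadoop Configuration/FileSystem API; the file path is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default is 3 copies; DBMS-style setups often settle for 2.
        conf.setInt("dfs.replication", 2); // applies to files created via this conf
        FileSystem fs = FileSystem.get(conf);
        // Or override replication for one existing file (hypothetical path):
        fs.setReplication(new Path("/data/events/part-00000"), (short) 2);
    }
}
```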
Different entities have different ideas about how such deficiencies should be addressed. Read more
Categories: Aster Data, Cassandra, Cloudera, Data warehouse appliances, DataStax, EMC, Greenplum, Hadapt, Hadoop, IBM and DB2, MapReduce, MongoDB, Netezza, Parallelization | 22 Comments |