Some notes on Hadoop (mainly) and appliances
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC’s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think, and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.
3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was:
Finally, big data or the searching of large amounts of data using Hadoop. After Hadoop finishes filtering the data, the place you want to put that data is an Oracle Database, and that’s what a lot of our customers are doing. And we are exploiting the trend, the big data technology and the big data trend, if you prefer, by building a Hadoop appliance that attaches to the Oracle Exadata database or any Oracle Database for that matter. But you don’t have to buy our Hadoop appliance if you can use whatever servers you want running Hadoop, and we provide the interface between Hadoop and the Oracle Database.
In other words, Oracle is saying “We’d like to sell you a Hadoop appliance, but you can run Hadoop in some other way and we’ll coexist with it just fine.” That makes sense; refusing to coexist with Hadoop is not exactly a realistic option.
4. Back in June, I expressed great skepticism about the idea of a Hadoop appliance. There was at least partial pushback in the comment thread from both Amr Awadallah and Eric Baldeschwieler. Oops.
Their reasoning seems to be centered around matters of installation, administration, and general packaging.
5. A month ago I noted aggressive near-term plans for Apache Hadoop evolution. As noted above, one reason this is needed is competition from folks like MapR. Also, I note that:
- Three years ago, Oliver Ratzesberger’s group at eBay complained that CPU utilization running Hadoop was at 18%.
- Now Oliver uses a figure of 10-15%, and attributes an even lower figure to (I’m guessing here) Yahoo. (Another possibility might be Facebook.)
- In between, eBay became one of the biggest and most prominent users of Hadoop.
The moral of eBay’s Hadoop adventures, as I see it, is neither “Hadoop sucks!” nor “Hadoop doesn’t suck!”; rather, it’s that there’s a lot of scope for Hadoop to operate differently in the future than it does today.
Similarly, whatever throughput Yahoo does or doesn’t get, it clearly has adopted Hadoop at the expense of the columnar-in-Postgres system it previously was so proud of.
Also, there has been a claim going around that — notwithstanding NameNode’s status as a single point of Hadoop failure — no Hadoop installation has ever lost data due to a NameNode failure. The folks at MapR beg to differ, and sent over some links that sure seem to say the opposite.
6. Since we’ve just established that Hadoop will change, rapidly and pretty fundamentally, what exactly is the benefit of an appliance that is “balanced” for Hadoop usage today?
Comments
2 Responses to “Some notes on Hadoop (mainly) and appliances”
Oliver is talking about a 12,000-core cluster, which is gigantic and way bigger than most Hadoop installs. From everything I’ve heard, the NameNode becomes a major blocking point at that range.
I believe NameNode federation is intended to address that.
I also wonder whether Oliver should be considered a hostile witness with regard to Hadoop. (-:
The thing that I wonder about all these appliance plays is how internal and external clouds will influence them. Apache Whirr gives you the same effect as an appliance in a lot of ways without the appliance.
From what I can tell, only the EMC Greenplum HD Community Edition is available on their new DCA, not the Enterprise version from MapR.