June 10, 2015
Hadoop generalities
Occasionally I talk with an astute reporter — there are still a few left 🙂 — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are:
- The Hadoop hype is generally justified, but …
- … what exactly constitutes “Hadoop” is trickier than one might think, in at least two ways:
- Hadoop is much more than just a few core projects.
- Even the core of Hadoop is repeatedly re-imagined.
- RDBMS are a good analogy for Hadoop.
- As a general rule, Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones. That kind of thing is standard for any comparable technology, both because enabling new applications can be valuable and because migration is a pain.
- Data transformation, as pre-processing for analytic RDBMS use, is an exception to that general rule. That said …
- … it’s been adopted quickly because it saves costs. But of course a business that’s only about cost savings may not generate a lot of revenue.
- Dumping data into a Hadoop-centric “data lake” is a smart decision, even if you haven’t figured out yet what to do with it. But of course, …
- … even if zero-application adoption makes sense, it isn’t exactly a high-value proposition.
- I’m generally a skeptic about market numbers. Specific to Hadoop, I note that:
- The most reliable numbers about Hadoop adoption come from Hortonworks, since it is the only pure-play public company in the market. (Compare, for example, the negligible amounts of information put out by MapR.) But Hortonworks’ experiences are not necessarily identical to those of other vendors, who may compete more on the basis of value-added service and technology rather than on open source purity or price.
- Hadoop (and the same is true of NoSQL) is most widely adopted at digital companies rather than at traditional enterprises.
- That said, while all traditional enterprises have some kind of digital presence, not all have ones of the scope that would mandate a heavy investment in internet technologies. Large consumer-oriented companies probably do, but companies with more limited customer bases might not be there yet.
- Concerns about skill shortages are exaggerated.
- The point of distributed processing frameworks such as Spark or MapReduce is to make distributed analytic or application programming not much harder than any other kind (see the sketch after this list).
- If a new programming language or framework needs to be adopted — well, programmers nowadays love learning that kind of stuff.
- The industry is moving quickly to make distributed systems easier to administer. Any skill shortages in operations should prove quite temporary.
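To make the last couple of points concrete, here is a minimal PySpark sketch of the kind of job described above – rolling up raw event logs into a summary that an analytic RDBMS can then ingest. The paths and log layout are hypothetical; the point is simply that the distributed part adds little extra programming difficulty.

```python
# Minimal sketch (hypothetical paths and log format): roll up raw event
# logs on Hadoop before loading the much smaller summary into an
# analytic RDBMS.
from pyspark import SparkContext

sc = SparkContext(appName="event_rollup")  # on a real cluster this would run under YARN

# Assumed layout: tab-separated lines of timestamp, user_id, event_type, ...
events = sc.textFile("hdfs:///raw/events/2015-06-10/*")

counts = (events
          .map(lambda line: line.split("\t"))
          .filter(lambda f: len(f) >= 3)
          .map(lambda f: ((f[2], f[0][:10]), 1))      # key = (event_type, date)
          .reduceByKey(lambda a, b: a + b))

# Persist the summary; a separate step would bulk-load it into the DWH.
counts.map(lambda kv: "\t".join([kv[0][0], kv[0][1], str(kv[1])])) \
      .saveAsTextFile("hdfs:///summaries/events_by_type_and_day/2015-06-10")

sc.stop()
```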
Categories: Application areas, Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Hortonworks, MapR, MapReduce, Market share and customer counts, Open source, Pricing
Comments
6 Responses to “Hadoop generalities”
Hi Curt,
Great questions, good answers.
I’d appreciate more clarification on good use cases beyond the data dump (lake), which I find pretty weak, though quite on par with current industry DW practices.
The comparison with RDBMS adoption patterns is also useful. I have a problem with the Gartner hype cycle – surely not all tech follows the same path (even at a different pace).
Are there any sources in your Software Memories blog or elsewhere that describe how RDBMS early adoption occurred (before the ’90s)?
Thanks, Ranko.
I would compare Hadoop with a new operating system. It has its VFS (called DFS) and several implementations of it. It has YARN, which defines what a Hadoop application is and manages its resource allocation. There are also several virtual machines (like the JVM or CLR) – MapReduce, Spark, Tez – and there are some “native” applications like HBase.
This operating system is indeed oriented toward data processing, and RDBMS are very popular on it.
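To make the analogy concrete, here is an illustrative PySpark sketch (the resource settings are made up) of handing the same application code to YARN rather than to a standalone cluster – Spark playing the role of the “virtual machine” and YARN the resource layer:

```python
# Illustrative only (made-up resource settings): the application code is
# unchanged; only the resource layer -- YARN -- differs, which is what
# makes the "Hadoop as operating system" analogy work.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("yarn_layering_demo")
        .setMaster("yarn-client")              # YARN allocates the containers
        .set("spark.executor.memory", "2g")
        .set("spark.executor.instances", "4"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())       # trivial job, scheduled by YARN
sc.stop()
```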
Ranko,
I’ve probably stressed company and technology history more than adoption history, come to think of it. But I’ll say this — except for DB2, RDBMS adoption closely tracked the adoption of alternatives to IBM mainframes. Those alternatives were initially a variety of minicomputers with proprietary OS, mainly DEC VAX/VMS, and then later UNIX-based systems, plus a couple of data warehouse appliances (mainly Teradata).
The rise of relational data warehousing and of modern BI was in the 1990s. Indeed, early in the 1990s Ted Codd placed his bets on MOLAP rather than relational DW; he was then quickly proved to be more wrong than right.
Thanks Curt.
Hadoop adoption will of course be different from RDBMS adoption (history doesn’t repeat, but it rhymes).
It then looks like RDBMS adoption had a very long gestation time – from 1970 (Codd’s paper) to the ’90s take-off.
Hadoop is already 11 years old (if we pick the Google MR paper’s publication as the start date). The Internet and other factors like the Cloud might speed adoption up (compared to pre-Internet times). Finding convincing “standard corporation” use cases and other complexities might slow it down.
It was easier with OLTP – relatively clearly defined requirements and use cases.
But, as you and Merv said: it looks like Hadoop is now mainstream. And it looks like it will climb up from the current perceived lull.
Curt, while I think it’s a good short list of fears in the market about the Apache Ecosystem, I don’t think it portends what is about to occur in the market. I think we would all be better served to observe the interactions of such fundamental change against an industry “adoption curve”. Mebbe not as broad as client/server to internet, but certainly in the analytics space it is… If I look at the market in that framework, what I see is that the “early adopters” already have adopted – tech companies, for example – and there are a lot of failures there from an ROI perspective.
But as we head up the curve of adoption – those companies not yet in the Apache Ecosystem are going to write checks to exploit the time to market and other long term ROI benefits of Open Source. They are also going to demand enterprise fit and accountability… And just like every other technology, they will evaluate the overall costs – such as training and maintenance – which early adopters do not.
My fear is that we’re in the transition phase, and all that extra engineering required to make something work outside of the early adopters – where 80% of the cost of engineering happens to be – is going to slow adoption.
I work as a business intelligence analyst at a traditional enterprise that is currently investing heavily in Hadoop infrastructure. So far I’m seeing two main patterns:
(1) aggregation / summarisation of event data en route to the relational DWH, where there is already a clearly defined use case for the data but the volumes are too high for the DWH to handle directly: effectively an ETL pre-processor for the classic data warehouse
(2) exploratory analysis of high volume data sources that *might* turn out to contain high value information, but where either the volumes are too high, or the potential usefulness still too vague, for a relational DWH load process to make sense.
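For pattern (2), a minimal sketch of what such an exploratory pass might look like in PySpark – the paths and field positions are hypothetical – before anyone commits to building a formal DWH load process:

```python
# Hypothetical sketch of pattern (2): cheap exploratory passes over raw,
# high-volume data whose value is not yet proven. Paths and field
# positions are made up.
from pyspark import SparkContext

sc = SparkContext(appName="exploratory_pass")

clicks = sc.textFile("hdfs:///datalake/clickstream/2015/06/*")

# Peek at a small sample to understand the shape of the data.
for line in clicks.takeSample(False, 5):
    print(line)

# Rough distribution over one candidate dimension, to judge whether the
# source is worth modelling properly in the warehouse.
top_days = (clicks
            .map(lambda line: (line.split(",")[0][:10], 1))   # assume field 0 is a timestamp
            .reduceByKey(lambda a, b: a + b)
            .takeOrdered(10, key=lambda kv: -kv[1]))
print(top_days)

sc.stop()
```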