Hadoop execution enhancements
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
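To make the quoted overhead concrete, here is a minimal sketch (class names and the intermediate path are mine) of a two-stage computation, word count followed by sort-by-frequency, written as classic Hadoop MapReduce. The first job's entire output has to be materialized to HDFS purely so that a second, separately launched job can read it back:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count (job 1) followed by sort-by-frequency (job 2). In plain
// MapReduce the only way to chain the stages is to materialize job 1's
// full output to HDFS and launch a second job over it -- the latency and
// throughput overhead the Tez DAG model is meant to remove.
public class WordCountThenSort {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String w : line.toString().split("\\s+"))
                if (!w.isEmpty()) ctx.write(new Text(w), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    // Job 2: flip (word, count) to (count, word) so the shuffle sorts by count.
    public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            ctx.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/wordcount-intermediate"); // materialized to HDFS

        Job count = Job.getInstance(conf, "word-count");
        count.setJarByClass(WordCountThenSort.class);
        count.setMapperClass(TokenMapper.class);
        count.setReducerClass(SumReducer.class);
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(count, new Path(args[0]));
        FileOutputFormat.setOutputPath(count, intermediate);
        if (!count.waitForCompletion(true)) System.exit(1);   // full job launch #1

        Job sort = Job.getInstance(conf, "sort-by-count");
        sort.setJarByClass(WordCountThenSort.class);
        sort.setMapperClass(SwapMapper.class);                // identity reducer by default
        sort.setNumReduceTasks(1);                            // total order: one reducer (sketch simplification)
        sort.setOutputKeyClass(IntWritable.class);
        sort.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(sort, intermediate);
        FileOutputFormat.setOutputPath(sort, new Path(args[1]));
        System.exit(sort.waitForCompletion(true) ? 0 : 1);    // full job launch #2
    }
}
```

In a Tez-style DAG, those two stages would be vertices of a single job, with the intermediate data flowing between them rather than being written out and read back.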
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
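As a rough illustration of that flexibility, here is the sort of pipeline Spark permits, written against Spark's (much later) 2.x Java API; the tiny datasets and paths are my own inventions. Sample, map, reduce, join, and group-by chain freely within one job, with no forced map-then-reduce alternation and no intermediate HDFS writes:

```java
import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Spark primitives chained "more or less in any order":
// sample -> map -> reduce -> join -> group-by, all inside one job.
public class SparkPrimitives {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("primitives"));

        JavaRDD<String> words = sc.textFile(args[0])
            .sample(false, 0.1)                                  // sample: 10%, no replacement
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

        JavaPairRDD<String, Integer> counts = words
            .mapToPair(w -> new Tuple2<>(w, 1))                  // map
            .reduceByKey((a, b) -> a + b);                       // reduce (by key)

        // A second, tiny dataset to join against (contents are made up).
        JavaPairRDD<String, String> labels = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("hadoop", "noun"), new Tuple2<>("tez", "noun")));

        counts.join(labels)                                      // join
              .groupByKey()                                      // group-by
              .saveAsTextFile(args[1]);

        sc.stop();
    }
}
```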
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
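In my own hypothetical notation (this is not the Tez API, just a restatement of the decomposition above):

```java
import java.util.EnumSet;

// Hypothetical notation (not the actual Tez API) for the six primitives
// and how classic Map and Reduce steps decompose into them.
public class TezPrimitives {
    enum Primitive {
        HDFS_INPUT, HDFS_OUTPUT,
        SORTED_INPUT, SORTED_OUTPUT,
        SHUFFLED_INPUT, SHUFFLED_OUTPUT
    }

    // A Map step = HDFS input + output sorting + output shuffling.
    static final EnumSet<Primitive> MAP =
        EnumSet.of(Primitive.HDFS_INPUT, Primitive.SORTED_OUTPUT, Primitive.SHUFFLED_OUTPUT);

    // A Reduce step = input sorting + input shuffling + HDFS output.
    static final EnumSet<Primitive> REDUCE =
        EnumSet.of(Primitive.SORTED_INPUT, Primitive.SHUFFLED_INPUT, Primitive.HDFS_OUTPUT);

    public static void main(String[] args) {
        System.out.println("Map    = " + MAP);
        System.out.println("Reduce = " + REDUCE);
        // A Tez job can recombine these more freely, e.g. a reduce-like vertex
        // feeding another reduce-like vertex with no HDFS write in between.
    }
}
```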
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement to materialize (onto disk) intermediate results you don’t actually want is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
Comments
Are you seeing companies starting to offload apps from their data warehouses onto Hadoop? It seems like the roadmap points to continuous improvement, with issues such as concurrency being solved relatively rapidly.
An important point to make here is that Hadoops are not just ecosystems but also assembly languages for distributed filesystems and databases.
The original frameworks are adequate for batch processing of big files: perhaps an order of magnitude slower than a quickly written custom program, plus some startup time.
That startup time is a current hot button for twitchy-fingered web priorities. This is leading to YARN, Impala, and many other approaches to short and mixed workloads.
Everyone needs to be clear that we are in transition. Current technology is broken, and the next iteration of Hadoops (not Hadoop2!) will be better at complex workload management.
What will evolve will be a collection of:
– Java API components to perform tasks at a high level (and higher-level languages consuming them)
– a local “OS” and a distributed high-level OS to perform the work
– a distributed file system that will take on either file-system semantics or structured ones
We have these now, but everything is brute force. YARN can’t do much without constrained workloads; it just dishes out uncontrolled processes, with manual optimization. We have no Hadoop of Hadoops with hierarchical management….
If you are planning a YARN implementation, you should still segregate small from large workloads on different clusters. You should also make sure your work is constrained, with little mixing of small and medium-sized work.
Francois,
If by “offload” you mean “port”, that’s probably much more myth than reality. But workloads on Hadoop, if they could be done at all without Hadoop, would otherwise be done in RDBMS and ETL tools. Or Splunk. Heck, sometimes they’d be done BETTER in alternative technologies, e.g. RDBMS with an analytic platform flavor. So in that sense, Hadoop often offloads from somewhere else.
Most hadoop I see (99%) can be done in RDBMS, so most of it is a combination of:
– tech feeler
– RDBMS license play (not economical to use RDBMS, archive logs and hope, archive/preprocess inputs, …)
– tactical easy stuff, such as distributed work
The other 1% is of enormous scale that would not fit in a tactical RDBMS. With the exception of HBase (evolving schema in iterative development, and good random access), most hadoop won’t work with any sort of traditional warehouse.
BUT – it is starting to suck in stuff at the periphery. Semistructured or DQ-failing stuff that would have been 99% discarded is now brought into hadoop, with weak links to the DW. It’s nibbling at the edges.
No sane person would migrate a DW to hadoop – it’s missing too much: it can scan, but it mostly can’t join, there’s little indexing, and there’s no governance.
Thanks aaron/Curt. Do you see any sort of boomerang or backlash effect, where the economics of putting the data back onto an RDBMS start to make sense?
I’ve heard Teradata is down to $30k/TB list for its 6000 series. If data warehouse appliances are down to under $10k/TB in two or three years, do people view that as making more sense for workloads that are better suited for an RDBMS?
I suppose some would argue the late-binding flexibility of Hadoop is what makes it so attractive, rather than the cost.
My take –
Current price model:
Server:
– $3K 16 core
– $1K 64G RAM + 400G SSD
– $.15 x 24 3TB SATA drives
Total of <$4K for a server with ~70TB net, no RAID
Cost of RDBMS – license range:
– pg or a few community editions ($0, chat support)
– traditional RDBMS $15–25K/core (~$250K + $35K annually)
– classic analytics DBMS ranges from ~$100K to $3,000M+
Cost of Hadoop support – annual range from $0-$150K
(granted – this HW intentionally has no redundancy, so hadoop needs a master and 3x redundancy, rdbms not as clear. Compression changes density as well)
So – this nifty workstation for a data scientist costs virtually nothing and is a rounding error in cost relative to license and support.
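For concreteness, here is the back-of-envelope per-TB arithmetic, taking the thread’s figures at face value (the <$4K hardware total, ~70TB net, and a ~$250K traditional-RDBMS license):

```java
// Rough $/TB comparison using only the numbers quoted above.
public class PerTbSketch {
    public static void main(String[] args) {
        double hardware = 4_000;       // "<$4K for a server with ~70TB net"
        double license  = 250_000;     // "traditional RDBMS ... ~$250K"
        double netTb    = 70;

        System.out.printf("hardware: ~$%.0f/TB%n", hardware / netTb); // ~$57/TB
        System.out.printf("license:  ~$%.0f/TB%n", license / netTb);  // ~$3,571/TB
        // Two orders of magnitude apart: hence "a rounding error."
    }
}
```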
The rest of the decade will be a dynamic in RDBMS where SQL is a commodity: a conflict between established vendors attempting to preserve pricing (lock-in, enterprise integration and recovery) and new ones looking to establish themselves (price is less critical to them if they don’t have recurring revenues to cannibalize). New vendors are at a big disadvantage in the near term, but….
The license/sales model will drive how the other stuff goes, mostly. This in turn will drive most of how analytics shapes up. My guess is the market breaks apart in 3-5 years. The legacy vendors will be in a real legacy mode – needing to preserve revenue with little growth and unable to compete with lower cost spreads.
Meanwhile, Hadoop will grow in different ways. It is a long way from a generic analytic solution, but is a useful thing for specialized use cases such as giant data and compute intensive scanning (e.g., scoring) and weak semantics data. Data will grow in both camps independently. Most structured data will remain in RDBMS.
Francois,
You’re behind the times. 🙂 Netezza was down to $20K/TB after compression in 2009. http://www.dbms2.com/2009/07/30/the-netezza-price-point/ Other firms have responded. Infobright is trying to go especially low.
Yup – the low end of supported pricing I’ve seen is http://aws.amazon.com/redshift/pricing/ ($2–10K/TB/year; with networking and secondary disk, you can be under $5K/TB/year). Established vendors, SW-only, are breaking under $10K/TB/year, twice that for appliances – based on source data….
Thanks guys. You are right, Curt, but Teradata has traditionally been at the high end of the pricing spectrum. I’d note that their 2000 series is down below $15k/TB list and their 1500 (singularity) stuff is down below $4k/TB list.
Question: how do you think solid state drives impact the price/performance dynamic between Hadoop and MPP systems? Would SSDs (particularly for systems like Teradata that heavily use random i/o) dramatically swing the price/performance conversation back in MPP’s favor?
Francois,
1. Hadoop is just as MPP as Teradata, Netezza, Vertica, et al. I reject terminology to the contrary. 🙂
2. If we impute software license revenue to both Teradata and the entire world of Hadoop distributions in their Hive uses, Teradata’s advantage might be 100X. (Please nobody quote that figure; I didn’t actually check any numbers.) Hardware, of course, is a different story.
3. You’re right that flash helps Teradata in analytic processing vs. Hadoop for those analytic workloads Teradata can handle, but I might question whether it’s “dramatic”. Teradata’s done pretty well in the disk world.
4. Flash potentially helps the HBase/Hadoop integration story.
5. Tez potentially cuts Hadoop’s sequential I/O burden.
Fair point on the terminology, Curt!
It seems like there is a limit to what Hadoop can address, however: workloads that you might term “operational analytics,” with hundreds or thousands of concurrent users, for example. I’m sure this is on the roadmap for Cloudera and Hortonworks, but Hadoop seems to have limitations from a performance perspective (particularly with regard to concurrency), at least until the cost of memory drops enough. Please correct me if I’m wrong.