Hadoop’s next refactoring?
I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:
- Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
- In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
- Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.
Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:
- HDFS caching, about which I know relatively little.
- Tachyon, in which I gather Spark’s creators continue to believe.
- Cloudera’s deal to run Hadoop against Isilon storage and its associated interest in also supporting other kinds of storage, such as the object storage commonly found in clouds.
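The Isilon/object-storage point, together with the earlier Parquet/ORC point, can be made concrete with a small sketch. This is my own illustration, not anything from Cloudera or the Spark project, and it uses the current Spark DataFrame API, which is newer than this post; the paths, bucket, and column names are invented, and reading an s3a:// path additionally assumes the hadoop-aws connector and credentials are configured. The idea is simply that the same columnar read works against HDFS or an object store, because both are addressed through Hadoop-FileSystem-compatible URIs.

```scala
// Illustrative only: paths, bucket, and column names are invented.
import org.apache.spark.sql.SparkSession

object ParquetAnywhere {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-anywhere").getOrCreate()

    // Swap the URI scheme to change the backing store; the query itself is unchanged.
    val path = "hdfs:///warehouse/events"   // or "s3a://some-bucket/warehouse/events"
    val events = spark.read.parquet(path)

    // Columnar formats such as Parquet let the engine read only the columns a query
    // touches, which is where much of the analytic speedup comes from.
    events.select("user_id", "event_type")
      .where("event_type = 'purchase'")
      .show()

    spark.stop()
  }
}
```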
The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:
- Works well with Hadoop execution engines (Spark, Tez, Impala …).
- Works well with other software people might want to put on their Hadoop nodes.
- Interfaces nicely to HDFS, Isilon, object storage, et al.
- Is fully parallel any time it needs to talk with persistent or external storage.
- Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.
Tachyon could, I imagine, become that. HDFS caching probably could not.
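To sketch what that might look like in practice: because Tachyon exposes a Hadoop-compatible FileSystem interface, one Spark job could publish a dataset into the memory layer and another job, or another engine that speaks the same interface, could read it back without a round trip through HDFS. This is a minimal sketch of mine, not a documented recipe; the master host, port, and paths are invented, and the Tachyon client and its Hadoop FileSystem binding are assumed to be on the classpath.

```scala
// Illustrative only: host, port, and paths are invented.
import org.apache.spark.{SparkConf, SparkContext}

object TachyonExchange {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-exchange"))

    // Producer side: publish results into the memory-centric layer.
    val scores = sc.parallelize(Seq("alice,0.91", "bob,0.47"))
    scores.saveAsTextFile("tachyon://tachyon-master:19998/shared/scores")

    // Consumer side (could be a different job or a different engine that speaks the
    // Hadoop FileSystem API): read the dataset back without touching HDFS.
    val shared = sc.textFile("tachyon://tachyon-master:19998/shared/scores")
    println(s"rows exchanged: ${shared.count()}")

    sc.stop()
  }
}
```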
In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.
Related links
- One recent trigger for this line of thinking was Eric Frenkiel’s idea of MemSQL/Spark integration.
- The most recent Tachyon slide deck I’ve seen cited my April, 2012 post on many kinds of memory-centric data management.
Comments
Hi!
When I read “What’s missing is a nice, flexible distributed memory layer”, I immediately thought about HDFS caching. Could you elaborate more on the reasons you don’t believe in it for this purpose?
Also, for the sake of completeness, Hortonworks seems to be heading towards a Hive|Pig-on-Tez-on-Spark architecture. There is a strong overlap between Tez and Spark, but the former could take on the pure-Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching). The latter would be reduced to its execution layer, without the need for a nasty Spark/Hadoop coupling. It’s nice, but it leaves open questions about the rest of the Spark ecosystem (SparkSQL, SparkML).
Funnily, Cloudera has recently announced a PoC of Hive-on-Spark. With Impala and SparkSQL, it’s the third SQL framework they have put forward.
Thomas,
Maybe I’m missing something, but I don’t see how HDFS caching helps when HDFS isn’t used for persistent storage.
As for Tez running over Spark — this is the first I’ve heard of it, and if it’s true I have trouble reconciling other things Hortonworks has said, e.g. what I quoted in my recent Spark vs. Tez post.
Curt
Thanks for the insightful post.
TL;DR
I work at Cray, and one of the contexts we see this in is ‘traditional’ HPC users wanting to do analytics. E.g., you generate data out of a large high-fidelity simulation (higher fidelity = closer to reality = bigger data). Visualization tools alone, which are common front ends to this data, are not going to work as data sizes become larger, because they scale as a function of eyeballs. Not only that, increasingly the simulation output is just one source of data, which needs to be augmented with external data gathered asynchronously.
So you’re squarely in ‘analytic toolset’ territory. Spark is an increasingly attractive proposition in scientific computing/HPC because it doesn’t come with HDFS baggage (or YARN, for that matter). But the immediate next problem is making Spark talk with, say, an MPI program. The simplest possible solution is to use the file system (typically a global parallel FS) as the data exchange layer. But that’s not great, since, well, it’s gated by the global FS bandwidth, i.e. the neck of the funnel. So the next question is ‘how to build a high-performance, close-to-compute distributed storage layer’ for data interchange – which brings up exactly the problem you’re talking about.
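For what it’s worth, a minimal sketch of that “file system as the exchange layer” baseline might look like the following. This is my illustration, not a Cray recipe; the parallel-FS mount point, rank count, and file layout are assumptions. The Spark side writes one part file per partition onto a path every node can see, and each MPI rank then opens its matching part-NNNNN file.

```scala
// Illustrative only: mount point, rank count, and file layout are assumptions.
import org.apache.spark.{SparkConf, SparkContext}

object SparkToMpiHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-to-mpi-handoff"))

    val exchangeDir = "file:///lus/scratch/exchange/run042"  // globally mounted parallel FS

    // Match the partition count to the number of MPI ranks, so rank i can simply
    // read part-0000i (zero-padded) from the exchange directory.
    val mpiRanks = 64
    sc.textFile("file:///lus/scratch/observations/*.csv")
      .map(_.toUpperCase)   // stand-in for whatever analytic transformation runs here
      .repartition(mpiRanks)
      .saveAsTextFile(exchangeDir)

    sc.stop()
  }
}
```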
Tachyon is interesting on many fronts – not the least of which is that it leverages the processing frameworks to recover using recompute, rather than replication (no 3x, with caveats). Also promising is the ability to plug-and-play filesystems underneath. You’ve probably already seen this, but I think this paper clarifies its motivations: https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/disk-irrelevant_hotos2011.pdf.
Another driver we’re seeing emerge at the same time – it’s not obvious that the coarse-grained distributed memory model (RDD) in Spark is optimal for all classes of analytics or data processing (graph traversals, for example). In such cases, the processing program (say a graph query engine) would ‘prefer’ a finer-grained shared-memory view across distributed memory. This isn’t so much a data interchange problem, but it shows how the locus of data management is moving towards memory hierarchies starting at DRAM, spanning local and remote NVRAM (e.g. SSD), and finally ‘far storage’ on disk.
Thanks, Venkat.
Cray/uRika/Yarcdata was part of what I was thinking of when I wrote this post. 🙂
Indeed, HDFS caching works together with persistence, and when data is also triple-replicated, writes get expensive.
At the same time, a common trick to speed up writes of intermediate data into HDFS is to set the replication factor to one for those files. Doing this together with in-memory caching might give good results.
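A minimal sketch of that trick, using the stock Hadoop FileSystem API with invented paths: intermediate files are created with (or later lowered to) a replication factor of one, so a write is acknowledged after a single copy lands. The usual caveat is that unreplicated intermediate data must be recomputed if its node fails.

```scala
// Illustrative only: paths are invented.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SingleReplicaScratch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val scratch = new Path("/tmp/job-1234/intermediate.part0")

    // Option 1: create the file with replication = 1 up front.
    val out = fs.create(scratch, true, 4096, 1.toShort, fs.getDefaultBlockSize(scratch))
    out.write("intermediate results...\n".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Option 2: lower the replication factor of an already-written file.
    fs.setReplication(scratch, 1.toShort)

    fs.close()
  }
}
```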
Thanks for your answers. The confusion is mine. I was thinking of sharing computation results between different jobs that might need it (kind of like a table useful to many jobs running on the same cluster, so useful that you may want to cache it). You seem to be talking about distributing an intermediate dataset to the various tasks inside the same job (like a cached RDD in Spark), which is a different story.
I haven’t seen your Spark vs. Tez post; I’m going to read it. The integration I was alluding to might be no more than a possibility, or an option that was once considered; I don’t know the current status. But I clearly remember hearing it from a Hortonworks representative a couple of months ago.
Thomas,
Actually, a big part of my objection to HDFS caching in this role is a belief that HDFS won’t be the persistent storage choice for all Hadoop.
I agree that in-memory data grids (IMDGs) have the potential to address Hadoop’s inter-program data interchange challenge for several reasons. First, they employ a fine-grained, object-oriented data storage model which integrates seamlessly with business logic written in Java, C#, and other o-o languages using straightforward create/read/update/delete (CRUD) APIs. They offer more flexibility and finer-grained data access than Spark RDDs, and while they can serve as an efficient HDFS cache, they do not rely on HDFS as a backing store. Second, they allow fully parallel access across a cluster of commodity servers (and can be co-located on a Hadoop cluster), enabling linearly scalable throughput for feeding an execution engine or for data motion to/from a backing store. Third – and often overlooked – they provide high availability through data replication, which allows IMDGs to be deployed in live, operational environments requiring immediate recovery after a server failure (and unable to tolerate the delays required to checkpoint and recompute data to handle failures); IMDGs enable live, mission-critical data to be analyzed while it changes in real time. Lastly, like Spark, they can incorporate highly efficient data-parallel execution on in-memory datasets and avoid data motion to maximize performance. (Our ScaleOut hServer product runs Apache MapReduce and Hive unchanged using YARN to schedule jobs and supports additional data-parallel execution models.)
The main limitation in using IMDGs for inter-program data interchange is that they cannot host huge datasets that exceed available memory but would fit on disk. However, in practice this usually is not a problem because “live” datasets in many applications (e.g., ecommerce, finserv, and cable media) tend to fit in memory without difficulty. Also, when performing data interchange, we expect applications to pipeline data through an IMDG, so the entire dataset does not have to fit in memory at once. Note the underlying assumption that an IMDG is not used as a persistent data store (high availability notwithstanding), which is the proper role for HDFS, MongoDB, and other scalable, disk-based stores.
I do not see HDFS’s inability to do random reads/writes on the list (only MapR seems to be able to do it).
The community seems to be aware of that need, and has already started to address it:
Heterogeneous Storage will expose an abstract storage layer:
https://issues.apache.org/jira/browse/HDFS-5682
Thanks.
Is the “quota management support” something like DBMS workload management? Quotas for what?
GridGain / Apache Ignite is a nice flexible distributed memory layer. Fully parallel. Open source. So additional adapters can be contributed. http://gridgain.com/products/in-memory-data-fabric/
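For illustration only (not GridGain’s documentation; the cache and key names are invented), here is a minimal sketch of two co-located programs exchanging results through an Apache Ignite cache rather than through HDFS, assuming both processes join the same Ignite cluster started with the default configuration:

```scala
// Illustrative only: cache and key names are invented.
import org.apache.ignite.{Ignite, IgniteCache, Ignition}

object IgniteExchange {
  def main(args: Array[String]): Unit = {
    val ignite: Ignite = Ignition.start()

    // A distributed cache visible to every program on the grid.
    val shared: IgniteCache[String, String] =
      ignite.getOrCreateCache[String, String]("shared-results")

    // Producer side: publish a result.
    shared.put("model/alice/score", "0.91")

    // Consumer side (typically a separate process on the same grid): read it back.
    println(s"alice -> ${shared.get("model/alice/score")}")

    ignite.close()
  }
}
```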
[…] Hadoop needs a memory-centric storage grid. […]
[…] When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid. […]
[…] Even the core of Hadoop is repeatedly re-imagined. […]