Spark, Shark, and RDDs — technology notes
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark offers a richer set of primitive operations, including map, reduce, sample, join, and group-by, which can be chained in more or less any order. All of the primitives operate in parallel across an RDD's partitions. (A minimal sketch of such chaining appears after this list.)
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
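To make the "chain primitives in any order" point concrete, here's a minimal sketch against Spark's Scala API. The input path, the toy lookup data, and the local master setting are all hypothetical; the point is just that map, reduce, sample, and join compose freely over RDDs.

```scala
import org.apache.spark.SparkContext

object PrimitivesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "primitives-sketch")

    // Build an RDD of (word, 1) pairs with flatMap/map; the input path is hypothetical.
    val words = sc.textFile("hdfs:///some/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Reduce, sample, and join in whatever order the job calls for.
    val counts  = words.reduceByKey(_ + _)
    val sampled = counts.sample(withReplacement = false, fraction = 0.1)
    val lookup  = sc.parallelize(Seq(("spark", "engine"), ("shark", "sql")))
    val joined  = sampled.join(lookup)          // (word, (count, tag)) pairs

    joined.collect().foreach(println)
    sc.stop()
  }
}
```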
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once (see the persistence sketch after this list).
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
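To illustrate the partitioning and "doesn't have to be entirely in memory" points, here is a small sketch using Spark's Scala API. The input path is hypothetical; StorageLevel.MEMORY_AND_DISK tells Spark to keep what fits in RAM and spill the rest of the cached partitions to disk.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[4]", "persistence-sketch")

// Hypothetical input; the second argument asks for at least eight partitions,
// so the data can be spread across the cluster (or local threads, here).
val docs = sc.textFile("hdfs:///some/docs", 8)

// Cache the derived RDD for reuse; partitions that don't fit in memory spill to disk.
val parsed = docs.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()   // first action computes and caches the RDD
parsed.count()   // subsequent actions read the cached copy
```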
Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously reliable hardware. Unlike MapReduce, Spark doesn’t persist intermediate result sets to disk (unless they’re too large to fit into RAM). Rather, Spark’s main fault-tolerance strategy is:
- RDDs are written by single operations (typically executed in a distributed fashion).
- If there’s a failure, the operation is replayed over just the portion of the data that was on the affected node (a conceptual sketch of the idea follows this list).
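The replay idea can be sketched in a few lines of Scala. This is purely conceptual and not Spark's actual internals: each partition of an RDD remembers the parent data and the operation that produced it, so a lost partition can be rebuilt by rerunning just that operation.

```scala
// Conceptual sketch only, not Spark's real classes.
// A partition's "lineage": how to fetch its parent data, and the operation applied to it.
case class Lineage[A, B](parentPartition: Int => Seq[A], op: A => B)

// Reading a partition: use the cached copy if the node holding it is still alive,
// otherwise replay the defining operation over just that partition's parent data.
def readPartition[A, B](lineage: Lineage[A, B],
                        partition: Int,
                        cache: Map[Int, Seq[B]]): Seq[B] =
  cache.getOrElse(partition, lineage.parentPartition(partition).map(lineage.op))
```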
Further, Reynold Xin emailed:
Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
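For what it's worth, speculative execution is off by default in Spark and is switched on through the spark.speculation property. A minimal sketch using the SparkConf API follows; the configuration mechanism has changed across releases, so treat the exact incantation as illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask Spark to re-launch suspiciously slow copies of tasks on other nodes;
// whichever copy finishes first wins.
val conf = new SparkConf()
  .setAppName("speculation-sketch")
  .setMaster("local[4]")
  .set("spark.speculation", "true")

val sc = new SparkContext(conf)
```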
Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:
- Shark borrows the Hive optimizer for up-front join reordering and so on.
- Shark can dynamically re-plan work in progress to:
- Change how work is partitioned among nodes.
- Select a join algorithm appropriate for the cardinalities of intermediate result sets (see the toy sketch after this list).
Further Shark smarts are to be added down the road.
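As a toy illustration of that join-selection point (this is not Shark's actual code), here is what choosing a join algorithm from observed intermediate sizes might look like: once the runtime knows how big an intermediate result actually is, a small input can be broadcast to every node for a map-side join, while larger inputs fall back to a shuffle join. The names and the 10 MB threshold are made up for the example.

```scala
// Conceptual sketch only; the names and threshold are hypothetical.
sealed trait JoinStrategy
case object BroadcastJoin extends JoinStrategy   // ship the small side to every node
case object ShuffleJoin   extends JoinStrategy   // repartition both sides by join key

def chooseJoin(intermediateBytes: Long,
               broadcastThreshold: Long = 10L * 1024 * 1024): JoinStrategy =
  if (intermediateBytes <= broadcastThreshold) BroadcastJoin else ShuffleJoin
```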
And finally, Shark stores its RDDs in a columnar format, a subject that has already been discussed on this blog.
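To illustrate what a columnar in-memory format buys (a conceptual sketch, not Shark's actual classes): instead of holding one Java object per row, a partition holds one array per column, which cuts per-object overhead and lets a scan touch only the columns it needs.

```scala
// Conceptual sketch only; Row and ColumnarPartition are hypothetical names.
case class Row(userId: Int, country: String, spend: Double)

case class ColumnarPartition(userIds: Array[Int],
                             countries: Array[String],
                             spends: Array[Double])

def toColumnar(rows: Seq[Row]): ColumnarPartition =
  ColumnarPartition(rows.map(_.userId).toArray,
                    rows.map(_.country).toArray,
                    rows.map(_.spend).toArray)

// A scan that only needs the spend column never touches userIds or countries.
def totalSpend(p: ColumnarPartition): Double = p.spends.sum
```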
Comments
For anyone interested in learning more about Spark and Shark, here are their homepages: http://spark-project.org, http://shark.cs.berkeley.edu.
I’m confused, is this related to Sparqlcity?
Spark is no relation to Sparql.