Spark, Shark, and RDDs — technology notes
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark offers a richer set of primitive operations, including map, reduce, sample, join, and group-by, which can be chained in more or less any order. All of the primitives operate in parallel across an RDD's partitions. (A minimal sketch of such chaining appears after this list.)
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
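To make the "chain primitives in any order" point concrete, here's a minimal sketch against Spark's Scala API. The input path, the toy lookup data, and the local master setting are all hypothetical; the point is just that map, reduce, sample, and join compose freely over RDDs.

```scala
import org.apache.spark.SparkContext

object PrimitivesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "primitives-sketch")

    // Build an RDD of (word, 1) pairs with flatMap/map; the input path is hypothetical.
    val words = sc.textFile("hdfs:///some/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Reduce, sample, and join in whatever order the job calls for.
    val counts  = words.reduceByKey(_ + _)
    val sampled = counts.sample(withReplacement = false, fraction = 0.1)
    val lookup  = sc.parallelize(Seq(("spark", "engine"), ("shark", "sql")))
    val joined  = sampled.join(lookup)          // (word, (count, tag)) pairs

    joined.collect().foreach(println)
    sc.stop()
  }
}
```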
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once (see the persistence sketch after this list).
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
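To illustrate the partitioning and "doesn't have to be entirely in memory" points, here is a small sketch using Spark's Scala API. The input path is hypothetical; StorageLevel.MEMORY_AND_DISK tells Spark to keep what fits in RAM and spill the rest of the cached partitions to disk.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[4]", "persistence-sketch")

// Hypothetical input; the second argument asks for at least eight partitions,
// so the data can be spread across the cluster (or local threads, here).
val docs = sc.textFile("hdfs:///some/docs", 8)

// Cache the derived RDD for reuse; partitions that don't fit in memory spill to disk.
val parsed = docs.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()   // first action computes and caches the RDD
parsed.count()   // subsequent actions read the cached copy
```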
Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously reliable hardware. Unlike MapReduce, Spark doesn’t persist intermediate result sets to disk (unless they’re too large to fit into RAM). Rather, Spark’s main fault-tolerance strategy is:
- RDDs are written by single operations (typically executed in a distributed fashion).
- If there’s a failure, the operation is replayed over just the portion of the data that was on the affected node (a conceptual sketch of the idea follows this list).
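The replay idea can be sketched in a few lines of Scala. This is purely conceptual and not Spark's actual internals: each partition of an RDD remembers the parent data and the operation that produced it, so a lost partition can be rebuilt by rerunning just that operation.

```scala
// Conceptual sketch only, not Spark's real classes.
// A partition's "lineage": how to fetch its parent data, and the operation applied to it.
case class Lineage[A, B](parentPartition: Int => Seq[A], op: A => B)

// Reading a partition: use the cached copy if the node holding it is still alive,
// otherwise replay the defining operation over just that partition's parent data.
def readPartition[A, B](lineage: Lineage[A, B],
                        partition: Int,
                        cache: Map[Int, Seq[B]]): Seq[B] =
  cache.getOrElse(partition, lineage.parentPartition(partition).map(lineage.op))
```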
Further, Reynold Xin emailed:
Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
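For what it's worth, speculative execution is off by default in Spark and is switched on through the spark.speculation property. A minimal sketch using the SparkConf API follows; the configuration mechanism has changed across releases, so treat the exact incantation as illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask Spark to re-launch suspiciously slow copies of tasks on other nodes;
// whichever copy finishes first wins.
val conf = new SparkConf()
  .setAppName("speculation-sketch")
  .setMaster("local[4]")
  .set("spark.speculation", "true")

val sc = new SparkContext(conf)
```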
Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:
- Shark borrows the Hive optimizer for up-front join reordering and so on.
- Shark can dynamically re-plan work in progress to:
- Change how work is partitioned among nodes.
- Select a join algorithm appropriate for the cardinalities of intermediate result sets (see the toy sketch after this list).
Further Shark smarts are to be added down the road.
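As a toy illustration of that join-selection point (this is not Shark's actual code), here is what choosing a join algorithm from observed intermediate sizes might look like: once the runtime knows how big an intermediate result actually is, a small input can be broadcast to every node for a map-side join, while larger inputs fall back to a shuffle join. The names and the 10 MB threshold are made up for the example.

```scala
// Conceptual sketch only; the names and threshold are hypothetical.
sealed trait JoinStrategy
case object BroadcastJoin extends JoinStrategy   // ship the small side to every node
case object ShuffleJoin   extends JoinStrategy   // repartition both sides by join key

def chooseJoin(intermediateBytes: Long,
               broadcastThreshold: Long = 10L * 1024 * 1024): JoinStrategy =
  if (intermediateBytes <= broadcastThreshold) BroadcastJoin else ShuffleJoin
```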
And finally, Shark stores its RDDs in a columnar format, a subject that has already been discussed on this blog.
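To illustrate what a columnar in-memory format buys (a conceptual sketch, not Shark's actual classes): instead of holding one Java object per row, a partition holds one array per column, which cuts per-object overhead and lets a scan touch only the columns it needs.

```scala
// Conceptual sketch only; Row and ColumnarPartition are hypothetical names.
case class Row(userId: Int, country: String, spend: Double)

case class ColumnarPartition(userIds: Array[Int],
                             countries: Array[String],
                             spends: Array[Double])

def toColumnar(rows: Seq[Row]): ColumnarPartition =
  ColumnarPartition(rows.map(_.userId).toArray,
                    rows.map(_.country).toArray,
                    rows.map(_.spend).toArray)

// A scan that only needs the spend column never touches userIds or countries.
def totalSpend(p: ColumnarPartition): Double = p.spends.sum
```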
Comments
For anyone interested in learning more about Spark and Shark, here are their homepages: http://spark-project.org, http://shark.cs.berkeley.edu.
I’m confused, is this related to Sparqlcity?
Spark is no relation to Sparql.