Introduction to Spark, Shark, BDAS and AMPLab
UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
Mike Franklin* and his colleagues, who recently introduced me to all this, are focused on the database parts, including Spark and Shark. A recent slide deck gives details; Slide 11 in particular shows some of the project elements (I gather that everything on that slide is expected some time in 2013). A fuller accounting of project components may be found on the AMPLab website.
*Mike is the guy on whose work Truviso was based.
The most obvious improvements in Spark over MapReduce are:
- Richer and more flexible syntax, in that:
  - You can do stuff beyond Map and Reduce.
  - You can mix steps at will.
- An alternate approach to fault tolerance, in which data doesn’t have to be written to disk between steps.
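To make those bullets concrete, here is a toy single-machine sketch in Python. It is not Spark's API; `ToyDataset`, `read_log` and the sample log lines are all made up for illustration. It shows steps being chained in whatever order you like, with the dataset remembering its recipe of steps instead of writing each intermediate result to disk, so that a lost result can be rebuilt by replaying the recipe. My understanding is that Spark's fault tolerance rests on a similar recompute-from-lineage idea, but treat that as an assumption rather than a description of its internals.

```python
class ToyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # function that re-reads the original input
        self._steps = tuple(steps)   # recorded transformations (the "lineage")

    def map(self, fn):
        # Record the step; nothing is computed or written out yet.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Replay every recorded step over the input, entirely in memory.
        items = iter(self._source())
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce_by_key(self, fn):
        # Combine the values of (key, value) pairs that share a key.
        out = {}
        for key, value in self.collect():
            out[key] = fn(out[key], value) if key in out else value
        return out


def read_log():
    # Made-up input: "<method> <path> <status>"
    return ["GET /a 200", "GET /b 500", "GET /a 500", "POST /a 200"]


# map -> filter -> map -> reduce, mixed freely rather than one Map and one Reduce
pipeline = (
    ToyDataset(read_log)
    .map(str.split)
    .filter(lambda fields: fields[2] == "500")
    .map(lambda fields: (fields[1], 1))
)
errors_by_path = pipeline.reduce_by_key(lambda a, b: a + b)
```

Calling `pipeline.collect()` a second time simply replays the same steps from the input, which is the toy version of recovering a lost intermediate result without a copy on disk.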
The most obvious improvements in Shark over Hive are:
- It uses Spark, which performs better than MapReduce.
- It has columnar, in-memory data structures.
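For readers who haven't dealt with columnar layouts, a minimal Python sketch of the idea follows. The table, the column names and the helper functions are invented for illustration, and this says nothing about how Shark actually lays out memory. The point is only that an aggregate over one column can scan one compact array instead of walking every field of every row.

```python
from array import array

# Row layout: each record keeps all of its fields together.
rows = [
    ("alice", "US", 3.5),
    ("bob", "DE", 1.0),
    ("carol", "US", 2.5),
]

# Columnar layout: one sequence per column; numeric columns can be packed
# into a typed array of doubles instead of a list of Python objects.
columns = {
    "user": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "watch_hours": array("d", (r[2] for r in rows)),
}


def total_hours_row_store(rows):
    # Has to visit every row tuple to pull out one field.
    return sum(r[2] for r in rows)


def total_hours_column_store(columns):
    # Touches only the one column the query needs.
    return sum(columns["watch_hours"])
```

Both functions return the same total; the difference is in how much data each one has to look at to get there.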
Not spilling intermediate results to disk is an important point. We normally think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, which seem to be top-of-mind as a design point for the AMPLab guys.
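Here is a small Python sketch of why that matters for iterative algorithms. Everything in it is an assumption for illustration: `CountingSource` stands in for an expensive read from disk, and `fit_slope` is plain gradient descent fitting a slope `w` in `y = w * x` on made-up points. An iterative algorithm passes over the same data many times, so re-reading the input on every pass costs one read per iteration, while keeping it in memory costs one read in total.

```python
class CountingSource:
    """Hypothetical data source that counts how often it is read."""

    def __init__(self, points):
        self._points = points
        self.reads = 0

    def read(self):
        self.reads += 1          # stands in for an expensive disk read
        return list(self._points)


def fit_slope(get_points, iterations=50, learning_rate=0.05):
    # Gradient descent on mean squared error for the model y = w * x.
    w = 0.0
    for _ in range(iterations):
        points = get_points()    # one pass over the data per iteration
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * grad
    return w


POINTS = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # made-up data with slope 2

# Re-reading the input every iteration.
from_disk = CountingSource(POINTS)
w_reread = fit_slope(from_disk.read)

# Reading once and iterating over the in-memory copy.
in_memory = CountingSource(POINTS)
cached = in_memory.read()
w_cached = fit_slope(lambda: cached)
```

The two runs arrive at the same slope, but `from_disk.reads` ends up equal to the number of iterations while `in_memory.reads` stays at one.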
There seems to be quite a bit of interest in and even adoption of these projects. The AMPLab guys seemed more comfortable talking about that for the record via email, and so with permission I quote (lightly edited):
We’ve seen Spark used for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. These range from replacing Hive or Pig for simple SQL queries, to anomaly detection, to interactive dashboards where users can drill into data. Two examples of companies that have talked publicly about their Spark use cases are:
- Conviva (Ion Stoica’s video analytics company), one of our earlier users, which has used it to replace a large fraction of their queries.
- Quantifind, a company that performs predictive analytics and text mining on social data to help marketers at large entertainment companies.
See http://data-informed.com/blog/2012/10/17/spark-an-open-source-engine-for-iterative-data-mining/ for a short writeup on both of these use cases. Other users we know about are performing web analytics and BI-like workloads.
Several companies have also contributed to the open source projects. For example, Yahoo! has contributed a JDBC server to Shark, and is working on a bytecode optimizer.
We have a growing user community. Our meetup group is approaching 500 members. To date, meetups have been hosted by Airbnb, Groupon, Yelp, Palantir, Conviva, and Klout. More details at http://www.meetup.com/spark-users/.
Finally, we held a Big Data bootcamp for industrial practitioners back in August that offered two days of training using Spark and Shark. The bootcamp was sold out for on-site attendance and 5000 people attended via online live streaming. Details at http://ampcamp.berkeley.edu.
You can find the list of public contributors to Spark and Shark on the two projects' GitHub pages.
I went through the list and identified the companies the contributors are associated with based on public information. Below is the partial list, roughly in the order of lines of code contributed.
- UC Berkeley AMPLab
- Yahoo!
- Conviva
- Quantifind
- Clearstory Data
- Time Out (UK)
- GoodData (Czech)
- AdMobius
- Nuxeo (France)
- Princeton University
More on Spark and Shark technology in a separate post.
Comments
Thanks for writing this up, Curt. For more details on Spark and Shark, here are the homepages: http://spark-project.org and http://shark.cs.berkeley.edu.
Thanks for covering AMPLab. The work on Shark/Spark is very innovative and looks as if it has excellent potential. The other interesting team in Soda Hall is Joe Hellerstein’s group, who have been doing work around CALM. That work is more relevant to OLTP processing, though it also has interesting applications to analytics. I hope you will have a chance to interview them the next time you visit UC Berkeley, assuming you have not done so already.
Robert,
As you mention, our discussions with Curt have focused on a few components of the BDAS stack that have had recent releases.
While AMPLab’s emphasis is on analytics, we do work on OLTP as well. Projects include the MDCC (Multi-Data Center Consistency) protocol, and the Probabilistically Bounded Staleness (PBS) framework – the latter of which is being done in collaboration with Joe and his group.
AMPLab also has efforts on scalable Machine Learning and using Crowdsourcing and Human Computation for analytics. I wrote a recent blog post including some of these other projects at https://amplab.cs.berkeley.edu/2012/12/02/a-snapshot-of-database-research-in-the-amplab/
Of course, this is just one effort in the Big Data area on campus. Suffice it to say that Curt could indeed find lots to write about in Soda Hall and elsewhere at Berkeley.
Thanks for the link, Mike. The output from your lab, let alone Cal CS as a whole, is quite impressive. I didn’t know about Mesos but am looking at it now. Cheers, Robert