Spark on fire
Spark is on the rise, to an even greater degree than I thought last month.
- Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
- A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
- MapR has joined Cloudera in supporting Spark, and indeed — unlike Cloudera — is supporting the full Spark stack.
- Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:
- The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
- A rich set of programming primitives in connection with that model. (A minimal code sketch appears after this list.)
- Support also for highly iterative processing, of the kind found in machine learning.
- Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
- A clever approach to fault-tolerance.
- Spark is a major contender in streaming.
- There’s some cool machine-learning innovation using Spark.
- Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*
*Yes, my fingerprints are showing again.
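To make the primitives-plus-iteration point concrete, here is a minimal Scala sketch. The input path, record layout, and particular transformations are all illustrative assumptions rather than anything from a real workload.

```scala
// A minimal sketch of the RDD programming model: a handful of primitives
// (map, filter, reduceByKey, join) forming a small DAG, plus cache() so that
// repeated passes over the same data stay in memory. Paths and record layout
// are made up for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations (reduceByKey, join)

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RDD sketch"))

    // Hypothetical click log, one "userId<TAB>pageUrl" record per line.
    val clicks = sc.textFile("hdfs:///logs/clicks")
      .map(_.split("\t"))
      .filter(_.length == 2)
      .map(fields => (fields(0), fields(1)))
      .cache()                                  // reused below without rereading HDFS

    // Clicks per user.
    val perUser = clicks.mapValues(_ => 1L).reduceByKey(_ + _)

    // Clicks per (user, page), joined back to compute each page's share of that
    // user's traffic -- a multi-stage DAG that classic MapReduce would have to
    // express as several chained jobs.
    val perUserPage = clicks.map { case (u, p) => ((u, p), 1L) }.reduceByKey(_ + _)
    val share = perUserPage
      .map { case ((u, p), n) => (u, (p, n)) }
      .join(perUser)
      .map { case (u, ((p, n), total)) => (u, p, n.toDouble / total) }

    share.take(10).foreach(println)
    sc.stop()
  }
}
```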
The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):
… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.
With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.
In an unfortunate non-development, Tachyon is not (yet?) part of Spark, and so it is hard for a Spark job’s data to be shared with other jobs (Spark or otherwise) or processes. That said:
- The tight integration of data structures and processing gives performance benefits similar to those of in-process (vs. out-of-process) in-database analytic functions. (It also of course raises similar stability concerns, but those seem less important in the case of Spark than of a true DBMS.)
- From a Hadoop vendor’s standpoint, Tachyon’s benefit of not requiring HDFS (Hadoop Distributed File System) isn’t important, and Tachyon somewhat conflicts with a newish effort called HDFS Caching.
A couple of Spark machine learning stories are very cool, in that they involve intra-day retraining of models. The better-known one is Yahoo’s, which in a prototype built in 120 lines of code trains a new model for recommendation of each candidate top-page news story. When I challenged that anecdote, Ion told me about his own former company Conviva, which retrains models every minute to decide which particular source of streaming video each client system will be connected to.
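For a feel of what minute-by-minute retraining can look like in Spark, here is a hedged sketch, not Yahoo's or Conviva's actual code: a loop that periodically re-fits an MLlib ALS recommendation model on recently landed ratings. The input location, interval, and model parameters are all assumptions.

```scala
// A hedged sketch of intra-day retraining: re-fit a recommendation model on a
// fixed schedule. Input location, interval, and ALS parameters are illustrative.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object PeriodicRetrain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("periodic retrain"))

    while (true) {
      // Hypothetical input: "userId,itemId,rating" records that upstream jobs
      // keep appending under this directory.
      val ratings = sc.textFile("hdfs:///data/recent_ratings")
        .map(_.split(","))
        .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))
        .cache()

      // Retrain from scratch; because the data is cached in memory, the
      // iterative ALS passes are cheap relative to a disk-based framework.
      val model = ALS.train(ratings, 10 /* rank */, 5 /* iterations */)
      // In a real system the new model would be pushed to the serving layer here.
      println(s"retrained on ${ratings.count()} ratings")

      ratings.unpersist()
      Thread.sleep(60 * 1000)   // once a minute, per the Conviva anecdote
    }
  }
}
```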
I am generally skeptical of immature SQL efforts, and SparkSQL is no exception. That said, it seems to be going in sensible directions, which should be welcome to those folks who used or were planning to use Shark anyway.
- SparkSQL actually has its own optimizer, rather than using the inappropriate Hive one. As with many new optimizers, it’s starting out rule-based, but is planned to become cost-based down the road.
- SparkSQL can run queries against data that's either inside Spark or outside-but-accessible. (A minimal sketch follows this list.)
- SparkSQL can be accessed via Python and other APIs.
- Spark works with the Hive metastore, née HCatalog.
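Here is a minimal sketch of that SparkSQL usage, assuming the Spark 1.0-era SQLContext API; the table name, schema, input path, and query are made up.

```scala
// A minimal SparkSQL sketch (Spark 1.0-era API). The PageView schema, the input
// path, and the query are illustrative assumptions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class PageView(url: String, userId: Long)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSQL sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

    // "Inside Spark" data: an ordinary RDD of case classes, registered as a table.
    val views = sc.textFile("hdfs:///data/pageviews")
      .map(_.split("\t"))
      .map(f => PageView(f(0), f(1).toLong))
    views.registerAsTable("pageviews")

    // Plain SQL over that RDD; the result is itself an RDD and can feed further
    // Spark transformations.
    val counts = sqlContext.sql("SELECT url, COUNT(*) FROM pageviews GROUP BY url")
    counts.take(10).foreach(println)
    sc.stop()
  }
}
```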
And finally, there’s no public news as to what Databricks’ own business is. I think that’s a bit silly, but in fairness:
- The Spark 1.0 launch will consume every bit of marketing bandwidth they have.
- They don’t yet want to commit to a delivery date of their first offering.
Comments
Perhaps one addition to this welcome presentation: SparkSQL as a streaming option. That could hit direct competitors such as StreamSQL and others.
More evidence of the demise of MapReduce is the recent announcement (April 25, 2014) from Apache Mahout: the project will no longer be adding MapReduce algorithms, as a move toward “modern data processing systems.”
https://mahout.apache.org/
There is a lot of excitement around Apache Spark, and here at MapR we are fully supporting it (together with Databricks) in our distribution. There are a lot of benefits, from faster development to in-memory execution, where optimization techniques maximize data locality across multiple iterations. We co-hosted a webinar (http://bit.ly/1rJXRna) with Databricks two days ago, which is a great primer on the use cases and drew over 50 great questions (e.g., on the differences between Spark Streaming and Storm).
Re: “replacement for Hadoop MapReduce”, we take a balanced view, in that Hadoop MapReduce has been tried and true in production at thousands of companies. MapR has many customers executing tens of thousands of jobs with this framework. Our focus is to provide broad support for multiple execution engines, based on customer choice and the complementary benefits of each approach. The Hadoop community is thriving with innovation, so giving customers more choice and backward compatibility (http://bit.ly/1lORVIM) across releases of different projects is important. Over time, Hadoop MR jobs may be moved to Spark jobs, but it will be a measured, “enterprise” approach.
That said, MapR is “all in” on Apache Spark and, as you pointed out, is the only commercial Hadoop distribution that supports the whole Spark stack. More details are on our blog (http://bit.ly/1ktzUxB).
Specifically, Patrick’s link says new Mahout additions will be over Spark, at least if you want to run them in parallel. But old routines will still be supported.
Thanks for the find!
So when will the “most official description of what Spark now contains” be hosted on the actual Apache Spark project pages at http://spark.apache.org, rather than on a vendor’s site?
A key part of the success is the permissively licensed code and the independent governance of the Apache Spark project itself. While we certainly appreciate all the IP various vendors have donated (either directly, or through their employee committers) to the ASF, it would be nice to see a little more credit to the independent Apache Spark PMC and their committers as a group.
Thanks!
There is one thing that prevents me from designating Spark as the Hadoop MapReduce replacement: throughput. Where will the advantage in throughput come from?
When reading from HDFS, Spark reuses Hadoop input formats, and its shuffle does not look superior to Hadoop's implementation either.
Without an advantage in throughput, I think Spark will take the iterative and interactive jobs, but heavy batch processing will be left in Hadoop MR.
Hi David!
One thing you may be forgetting — Spark’s advantages in programmability. It has a lot more primitives than Tez, which in turn has somewhat more than classical MapReduce.
But if you’re saying that many existing processes that run well on MR will continue to do so for quite a while, I wouldn’t argue with that.
[…] claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some […]
During a Spark benchmark I found that Spark's shuffle performance is indeed inferior to Hadoop MapReduce's. If we take enough data to disregard startup overhead, a Hive group by is faster than the same group by implemented in Spark.
We did our best to avoid the most obvious performance pitfalls: we used Kryo serialization for Spark, and made Hadoop utilize all available CPU by giving enough memory to YARN.
I am curious whether this is “by design” or whether it is possible to tune Spark to perform better in such cases.
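For reference, switching Spark to Kryo serialization in that era looked roughly like the following; this is a sketch, with a placeholder registrator class, not the exact configuration used.

```scala
// A sketch of switching Spark's serializer to Kryo (Spark 1.x-style configuration).
// The registrator class name is a placeholder for an application-provided KryoRegistrator;
// omit that setting if no custom class registration is needed.
import org.apache.spark.{SparkConf, SparkContext}

object KryoConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // placeholder
    val sc = new SparkContext(conf)
    // ... job code as before ...
    sc.stop()
  }
}
```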
Hi David,
Spark committer here. Thanks for the feedback. The goal of the Spark project definitely includes making it perform well in interactive, iterative, and high throughput batch jobs. Even for on-disk data, Spark typically outperforms Hadoop MapReduce in real workloads where multiple stages of MRs are required, because of fast scheduling (as you pointed out) and the ability to understand DAGs of tasks. There are also some advanced features such as data partitioning that applications can exploit that simply don’t exist in MapReduce or equivalent frameworks.
Some of our own benchmarks as well as workloads reported by the community do include high intensity shuffles, and Spark performs quite favorably. That said, as Spark becomes more popular and is exposed to a wider variety of workloads, we do find room for improvement. In particular, both Databricks and the Spark community at large have plans to improve shuffle, including general optimizations as well as making it more pluggable to exploit characteristics of specific hardware or setup. The design of Spark makes it easy to implement many of these optimizations.
Meantime, please reach out if you need help tuning Spark.
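To illustrate the data-partitioning point, here is a hedged sketch; the record layout, paths, and partition count are made up.

```scala
// A hedged sketch of exploiting RDD partitioning: hash-partition a frequently
// joined dataset once, cache it, and let later joins shuffle only the other side.
// Paths, record layout, and the partition count are illustrative.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations (partitionBy, join)

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioning sketch"))

    // Hypothetical user table: "userId<TAB>name".
    val users = sc.textFile("hdfs:///data/users")
      .map(_.split("\t")).map(f => (f(0), f(1)))
      .partitionBy(new HashPartitioner(100))
      .cache()

    // Hypothetical event log: "userId<TAB>action". Because Spark knows how
    // `users` is partitioned, this join moves only `events` across the network.
    val events = sc.textFile("hdfs:///data/events")
      .map(_.split("\t")).map(f => (f(0), f(1)))

    val joined = events.join(users)
    println(joined.count())
    sc.stop()
  }
}
```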
David,
You mentioned a benchmark with Spark. Can you elaborate on what benchmarking tool you used? We have been primarily using YCSB, but that is not integrated with the Spark API.
Thanks,
Brian
Hi Brian,
We didn't use any special tools. We just took several jobs we have in production with Hadoop and implemented them in Spark. In this particular example, it was a job that boils down to a “group by.”
David
[…] that time, Databricks CEO Ion Stoica told database industry analyst Curt Monash the same, although he also mentioned plans to continue developing an interactive engine called […]
[…] human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” […]