Notes on Spark and Databricks — generalities
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (née Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:
- Data-transformation-only use cases are important, but they don’t dominate.
- Most other use cases typically have a data transformation element as well …
- … which has to be started before any other work can be done.
And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.
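To make the data-transformation point concrete, here is the shape of a typical cleanup job. This is a plain-Python analogue (lists and dicts standing in for a DataFrame, so it runs without a Spark cluster); in PySpark the same chain would be a filter plus a column-typing step, and all field names here are invented for illustration.

```python
# Plain-Python analogue of a typical Spark "data transformation" chain.
# In PySpark this would be roughly df.filter(...).select(...);
# here ordinary lists and dicts stand in for a DataFrame.
raw_events = [
    {"user": "a", "ts": "2016-07-01T10:00:00", "amount": "12.50"},
    {"user": "b", "ts": "bad-timestamp",       "amount": "3.00"},
    {"user": "a", "ts": "2016-07-02T11:30:00", "amount": "7.25"},
]

def is_valid(event):
    """Drop malformed rows -- the unglamorous bulk of data transformation."""
    return event["ts"].count("-") == 2 and event["ts"].count(":") == 2

def normalize(event):
    """Cast string fields to usable types and derive a date column."""
    return {"user": event["user"],
            "date": event["ts"][:10],
            "amount": float(event["amount"])}

cleaned = [normalize(e) for e in raw_events if is_valid(e)]
print(cleaned)
```

The point of the sketch is only the pattern: filter out bad records, coerce types, derive columns. That step precedes whatever analytic or machine-learning work comes next.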
Spark is becoming the default platform for machine learning.
Largely, this is a corollary of:
- The previous point.
- The fact that Spark was originally designed with machine learning as its principal use case.
To do machine learning you need two things in your software:
- A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
- Support for machine learning workflows. That’s where Spark evidently stands alone.
And thus I have conversations like:
- “Are you doing anything with Spark?”
- “We’ve gotten more serious about machine learning, so yes.”
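The "workflow support" point can be made concrete. Spark's ML library is built around a pipeline of stages, each with a fit step (learn from training data) and a transform step (apply what was learned). Here is a toy version of that pattern in plain Python, so it runs anywhere; the stages themselves (a scaler and a threshold "model") are invented for illustration.

```python
# Toy version of the fit/transform pipeline pattern that Spark's
# ML library is built around, sketched in plain Python.
class Scaler:
    """Learns a max from the data, then rescales inputs to [0, 1]."""
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Threshold:
    """A stand-in 'model': labels values at or above a cutoff as 1."""
    def __init__(self, cutoff):
        self.cutoff = cutoff
    def fit(self, data):
        return self  # nothing to learn in this toy stage
    def transform(self, data):
        return [1 if x >= self.cutoff else 0 for x in data]

class Pipeline:
    """Chains stages: fit each one, feed its output to the next."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Scaler(), Threshold(0.5)])
labels = pipe.fit_transform([2.0, 10.0, 6.0])
print(labels)  # [0, 1, 1]
```

Individual algorithms are replaceable; the glue that chains preprocessing, featurization, and modeling into one object is the part where, per the argument above, Spark stands out.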
SparkSQL (née Shark) is puttering along.
SparkSQL is pretty much following the Hive trajectory.
- Useful from Day One as an adjunct to other kinds of processing.
- A tease, and occasionally useful as a SQL engine in its own right, but really not very good until it has had years to mature.
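The "adjunct to other kinds of processing" pattern looks like this in practice: SQL handles the relational step, and its output feeds procedural code. The sketch below uses sqlite3 (Python's built-in SQL engine) rather than SparkSQL so it runs without a cluster, and the table and column names are invented; in Spark the same shape would be a `spark.sql(...)` call whose result feeds further Spark processing.

```python
# SQL as an adjunct: the relational step runs in SQL, then results
# feed ordinary procedural code. sqlite3 stands in for SparkSQL here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 100.0), ("east", 40.0), ("west", 60.0)])

# SQL does the aggregation ...
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# ... and the answer flows into non-SQL processing downstream.
totals = {region: total for region, total in rows}
print(totals)  # {'east': 40.0, 'west': 160.0}
```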
Databricks reports good success in its core business of cloud-based machine learning support.
Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:
- As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
- Databricks customers typically already have their data in the Amazon cloud.
- Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in these segments:
- Media
- Financial services (especially but not only insurance)
- Health care/pharma
- The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.
Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.
Spark Streaming’s long-term success is not assured.
To a first approximation, things look good for Spark Streaming.
- Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
- The “traditional” alternatives of Storm and Samza are pretty much done.
- Newer alternatives (Twitter’s Heron, Confluent’s Kafka Streams, Apache Flink) aren’t yet established.
- Cloudera is a big fan of Spark Streaming.
- Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
- Cool new Spark Streaming technology is coming out.
But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?
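The design concern is easier to see with Spark Streaming's core model in view: the stream is chopped into "micro-batches", and each small batch is processed as a batch job. Here is a toy sketch of that model in plain Python (batching by count rather than by Spark's fixed time interval, with invented event names); it also hints at the mismatch above, since every record waits for its batch boundary rather than being handled as an individual short-request insert.

```python
# Micro-batching, the model underlying Spark Streaming: buffer incoming
# events, then process each buffer as a small batch job. This toy
# batches by count; Spark Streaming batches by a fixed time interval.
def micro_batches(events, batch_size):
    """Group a stream of events into batches of at most batch_size."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

stream = ["e1", "e2", "e3", "e4", "e5"]
counts = [len(b) for b in micro_batches(stream, batch_size=2)]
print(counts)  # [2, 2, 1]
```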
Databricks is not keeping a tight grip on Spark leadership.
For starters:
- Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
- Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed likely to be a big revenue opportunity for Databricks.
At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:
- If you want the story on where Spark is going, you do what I did — you ask Databricks.
- Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
- Databricks employs ~1/3 of Spark committers.
- Databricks organizes the Spark Summit.
But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.
Related links
Starting with my introduction to Spark, previous overview posts include those in:
Comments
6 Responses to “Notes on Spark and Databricks — generalities”
[…] my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion […]
Re: default platform for ML/DL: Google TensorFlow is making strong inroads in this space. TF was recently open-sourced in its distributed incarnation (the single-node version was open-sourced last December). TF can run on heterogeneous hardware (CPU, GPU, mobile). This might be an event of great importance, similar to the release of the MapReduce paper by Google, with the difference that Google is actually releasing the code this time.
Ranko – an issue with TensorFlow at scale is wrangling the data into the right structure, for which, as Curt mentions above, Spark is the only game in town. I expect to see some Spark tools evolve toward feeding data into TensorFlow.
[…] Spark and Databricks are both prospering. […]
[…] CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world […]
Spark is not becoming the default machine-learning platform. That’s wishful thinking.
The reason is that Databricks has not invested enough in MLlib over the years, so it doesn’t have a defensible technological edge in ML. Instead, the space is highly fragmented. Apache Mahout is just one of many competitors that come to mind for traditional ML on the JVM.
One of the reasons Spark probably won’t win the race here is that it’s not computationally efficient for large linear algebra operations. What it does best is fast ETL, which makes it an important part of an ML pipeline, but unable to offer a full solution.
Beyond traditional ML like random forests, if you look at deep learning, the default platform for the JVM is Deeplearning4j. DL4J has a sophisticated Spark integration, and it’s certified on CDH.
http://deeplearning4j.org/spark
DL4J integrates with Spark as a data-access layer, using it to orchestrate multiple host threads, and Spark does OK on that task. DL4J also makes Spark run fast on multiple GPUs. Performance is on par with Caffe on non-trivial image-processing jobs.
http://deeplearning4j.org/gpu
For the computations, you need different tools. We use ND4J, a Java/C++ scientific computing library that uses JavaCPP to avoid the overhead of the JNI.
ND4J is n-dimensional arrays for Java. We basically ported Numpy to the JVM for the purpose of scientific computing, large matrix manipulations, etc. It also comes with a Scala API, ND4S.
http://nd4j.org/
https://github.com/deeplearning4j/libnd4j
https://github.com/bytedeco/javacpp
https://github.com/deeplearning4j/nd4s