Databricks and Spark update
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update on both his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
  - Spark-as-a-Service.
  - Currently running on Amazon only.
  - Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago, the capability to schedule jobs and the like was added.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:
- 15 certified distributions.
- ~40 certified applications.
- 2000 people trained last year by Databricks alone.
Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.
We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:
In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.
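For concreteness, here's a minimal sketch of what that looks like in a Spark shell, roughly as of the Spark 1.3-era API; the JSON path and the age/dept column names are made up for illustration:

```scala
// In the Spark shell (roughly Spark 1.3+), a SQLContext is available as sqlContext.
// The file path and the "age"/"dept" columns are hypothetical.
val people = sqlContext.jsonFile("hdfs:///data/people.json")  // infers a schema

people.printSchema()                 // show the inferred named columns
people.filter(people("age") > 21)    // relational-style operations...
      .groupBy("dept")
      .count()
      .show()                        // ...optimized under the hood
```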
SparkR is also on the rise, although it comes with the usual parallel-R story (sketched after this list), to the effect that:
- You can partition data, run arbitrary R on every partition, and aggregate the results.
- A handful of algorithms are truly parallel.
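Here's that partition-and-aggregate pattern sketched in Spark's Scala API for concreteness (early SparkR exposes analogous per-partition and aggregation operations from R); the input path and the per-partition computation are placeholders:

```scala
// Hypothetical input; each partition is processed independently.
val data = sc.textFile("hdfs:///data/events")

// Run an arbitrary single-threaded computation on each partition...
val perPartition = data.mapPartitions { rows =>
  val nonEmpty = rows.count(_.nonEmpty)   // stand-in for arbitrary user code
  Iterator(nonEmpty)
}

// ...then aggregate the per-partition results.
val total = perPartition.reduce(_ + _)
```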
So of course is Spark Streaming. And then there are Spark Packages, which are, speaking loosely, a kind of user-defined function; a usage sketch follows the bullets below.
- Thankfully, Ion did not give me the usual hype about how a public repository of user-created algorithms is a Great Big Deal.
- Ion did point out that providing an easy way for people to publish their own algorithms is a lot easier than evaluating every candidate contribution to the Spark project itself. 🙂
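To make the Spark Packages point concrete, here's a hypothetical session using spark-csv, a Databricks-published package; the exact version coordinate and the CSV path are assumptions:

```scala
// Launch the shell with a package resolved from spark-packages.org / Maven:
//
//   spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
//
// The package's data source is then available by name (Spark 1.3-era API):
val cars = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true"))
cars.show()
```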
I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.
Comments
“The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.”
Why do you need Spark or Hadoop for what can fit on one host with a lot of RAM and/or a lot of SSD?
You might need hundreds of CPU cores to process those hundreds of gigabytes at interactive speed, and that is not something we can pack into a single server at a reasonable price.
As David mentioned, using more nodes can improve interactivity, since many workloads are CPU- and/or I/O-bound. This is important because one of the main values of Databricks Cloud is letting users interactively query and process their data.