Databricks and Spark update
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update on both his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
  - Spark-as-a-Service.
  - Currently running on Amazon only.
  - Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago, the capability to schedule jobs and the like was added.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:
- 15 certified distributions.
- ~40 certified applications.
- 2000 people trained last year by Databricks alone.
Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.
We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:
In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.
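For concreteness, here's a minimal sketch of what that looks like in a Spark shell, roughly as of the Spark 1.3-era API; the JSON path and the age/dept column names are made up for illustration:

```scala
// In the Spark shell (roughly Spark 1.3+), a SQLContext is available as sqlContext.
// The file path and the "age"/"dept" columns are hypothetical.
val people = sqlContext.jsonFile("hdfs:///data/people.json")  // infers a schema

people.printSchema()                 // show the inferred named columns
people.filter(people("age") > 21)    // relational-style operations...
      .groupBy("dept")
      .count()
      .show()                        // ...optimized under the hood
```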
SparkR is also on the rise, although it comes with the usual parallel-R story (sketched after this list), to the effect that:
- You can partition data, run arbitrary R on every partition, and aggregate the results.
- A handful of algorithms are truly parallel.
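Here's that partition-and-aggregate pattern sketched in Spark's Scala API for concreteness (early SparkR exposes analogous per-partition and aggregation operations from R); the input path and the per-partition computation are placeholders:

```scala
// Hypothetical input; each partition is processed independently.
val data = sc.textFile("hdfs:///data/events")

// Run an arbitrary single-threaded computation on each partition...
val perPartition = data.mapPartitions { rows =>
  val nonEmpty = rows.count(_.nonEmpty)   // stand-in for arbitrary user code
  Iterator(nonEmpty)
}

// ...then aggregate the per-partition results.
val total = perPartition.reduce(_ + _)
```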
So of course is Spark Streaming. And then there are Spark Packages, which are, speaking loosely, a kind of user-defined function; a usage sketch follows the bullets below.
- Thankfully, Ion did not give me the usual hype about how a public repository of user-created algorithms is a Great Big Deal.
- Ion did point out that providing an easy way for people to publish their own algorithms is a lot easier than evaluating every candidate contribution to the Spark project itself. 🙂
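To make the Spark Packages point concrete, here's a hypothetical session using spark-csv, a Databricks-published package; the exact version coordinate and the CSV path are assumptions:

```scala
// Launch the shell with a package resolved from spark-packages.org / Maven:
//
//   spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
//
// The package's data source is then available by name (Spark 1.3-era API):
val cars = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "cars.csv", "header" -> "true"))
cars.show()
```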
I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.
Comments
“The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.”
Why do you need Spark or Hadoop for what can fit on one host with a lot of RAM and/or a lot of SSD?
You might need hundreds of CPU cores to process those hundreds of gigabytes at interactive speed, and that is not something we can pack into a single server at a reasonable price.
As David mentioned, using more nodes can improve interactivity, since many workloads are CPU- and/or I/O-bound. This is important because one of the main values of Databricks Cloud is letting users interactively query and process their data.