Notes on Spark and Databricks — technology
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on DataFrames.
- Unlike the similarly named APIs in R and Python, Spark DataFrames work with nested data (see the sketch just after this list).
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now run all TPC-DS queries.
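To illustrate the nested-data point, here's a minimal PySpark sketch; the record layout and column names are made up for illustration, but dot notation is how struct columns are addressed in both the DataFrame API and SparkSQL.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("nested-dataframe-sketch").getOrCreate()

# A toy record whose "address" column is itself a struct -- the kind of nesting
# that flat R and pandas data frames don't natively represent.
df = spark.createDataFrame([Row(name="Alice", address=Row(city="Oakland", zip="94601"))])

# Nested fields are addressed with dot notation, in the DataFrame API and in SparkSQL alike.
df.select("name", "address.city").show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name, address.zip FROM people").show()
```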
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are under NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm takes a small gradient-descent step for each new data point.
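To make the per-point updating idea concrete, here's a generic sketch of online stochastic gradient descent for a linear model with squared-error loss. This is my illustration of the general technique, not Databricks' actual streaming-training code, and the function and parameter names are made up.

```python
import numpy as np

def online_sgd_update(weights, x, y, learning_rate=0.01):
    """One small gradient-descent step for a single new observation.

    A generic sketch of per-point updating, not Databricks' implementation.
    """
    error = weights @ x - y
    gradient = error * x  # gradient of 0.5 * (w.x - y)^2 with respect to w
    return weights - learning_rate * gradient

# Usage: fold each arriving (x, y) pair into the model as it streams in.
weights = np.zeros(3)
for x, y in [(np.array([1.0, 0.5, -0.2]), 1.0),
             (np.array([0.3, -1.0, 0.8]), 0.0)]:
    weights = online_sgd_update(weights, x, y)
```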
The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for DStreams, the idea is to do set-based SQL processing even though the membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
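For a concrete feel of the snapshot-versus-delta distinction, here's a minimal structured streaming sketch in PySpark, with a placeholder socket source and console sink. The “complete” output mode re-emits the full result snapshot on each trigger, while “append” (and, in later Spark versions, “update”) emits only new or changed rows, i.e. the deltas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# A streaming DataFrame; the socket source here is just a stand-in for any real source.
events = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

# A standing SQL-style aggregation over a set whose membership grows over time.
counts = events.groupBy("value").count()

# "complete" re-emits the whole result snapshot on each trigger; "append"/"update"
# modes emit only the rows that changed since the last trigger.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```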
Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.
- Some of the reason is that SparkSQL is too immature to be great.
- Some is annoyance that Databricks isn’t putting everything it has into open source.
- Some is that everything has its architectural trade-offs.
To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.
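As a small illustration of what reading and writing from other forms of persistent storage looks like in practice, here's a PySpark sketch of round-tripping a DataFrame through Parquet; the path is just a placeholder, and the same calls work against S3, HDFS, or local disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-persistence-sketch").getOrCreate()

# A synthetic million-row DataFrame to stand in for real data.
df = spark.range(0, 1000000).withColumnRenamed("id", "event_id")

# Persist to Parquet on whatever storage backs the cluster; the path is a placeholder.
df.write.mode("overwrite").parquet("/tmp/events.parquet")

# Reading it back is a single call; Spark decodes the columnar data in parallel across cores.
events = spark.read.parquet("/tmp/events.parquet")
events.count()
```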