Introduction to data Artisans and Flink
data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example: Read more
More about Databricks and Spark
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is: Read more
Notes on DataStax and Cassandra
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are: Read more
Categories: Cassandra, DataStax, Microsoft and SQL*Server, NoSQL, Specific users | 2 Comments |
Notes on Spark and Databricks — technology
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. Read more
Categories: Benchmarks and POCs, Cloud computing, Databricks, Spark and BDAS, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP) | 3 Comments |
Notes on Spark and Databricks — generalities
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (nee’ Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy: Read more
Categories: Cloudera, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Market share and customer counts, Predictive modeling and advanced analytics | 6 Comments |
Notes on vendor lock-in
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is renegotiated, based on then-current levels of usage.
Categories: Amazon and its cloud, Buying processes, Cassandra, Exadata, Facebook, IBM and DB2, Microsoft and SQL*Server, MongoDB, Neo Technology and Neo4j, Open source, Oracle, SAP AG | 12 Comments |
Notes from a long trip, July 19, 2016
For starters:
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
- Spark and Databricks are both prospering, and of course enhancing their technology as well.
- Ditto DataStax.
- Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.
Subjects I’d like to add to that list include:
- MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
Governments vs. tech companies — it’s complicated
Numerous tussles fit the template:
- A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
- The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
- That’s what customers prefer.
- That’s what other governments require.
- Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play. 🙂 )
As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions: Read more
Categories: Amazon and its cloud, Google, Microsoft and SQL*Server, Surveillance and privacy, Web analytics | 2 Comments |
Kafka and more
In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:
- The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
- “Schema Management”
- Is used to define fields and so on.
- Is not equivalent to what is ordinarily meant by schema validation, in that …
- … it allows schemas to change, but puts constraints on which changes are allowed.
- Is done in plug-ins that live with the producer or consumer of data.
- Is based on the Hadoop-oriented file format Avro.
Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products. Read more
Kafka and Confluent
For starters:
- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low to mid teens in paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple:
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
So it seems fair to say:
- Kafka offers the benefits of hub vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)