Kafka and Confluent
For starters:
- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low-to-mid teens of paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple (a minimal code sketch follows the list):
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
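To make the publish/subscribe model concrete, here is a minimal sketch using Kafka's standard Java client. The broker address, topic name, and consumer group are hypothetical placeholders, and this illustrates the general client API rather than anything Confluent-specific:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PubSubSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // A producer publishes records to a named topic; the value format is
        // up to the application (JSON here, but Avro etc. work just as well).
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("clickstream", "user-42", "{\"page\":\"/home\"}"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "analytics"); // consumers in a group split the partitions
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Any number of consumer groups can independently subscribe to the same topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("clickstream"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
            }
        }
    }
}
```

Note that the producer and consumer know nothing about each other; each talks only to the Kafka hub.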
So it seems fair to say:
- Kafka offers the benefits of hub vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.
The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault-tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options (sketched in code after the list):
- Guaranteed to have the last update of anything.
- Complete for the past N days.
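For illustration, here is a hedged sketch of how those two options map onto per-topic configuration, using the Java AdminClient that ships with recent Kafka releases. The topic names, partition counts, and replication factors are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // "Guaranteed to have the last update of anything":
            // log compaction keeps at least the latest record per key.
            NewTopic compacted = new NewTopic("user-profiles", 8, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            // "Complete for the past N days": time-based retention,
            // here 7 days expressed in milliseconds.
            NewTopic timeBased = new NewTopic("clickstream", 8, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(compacted, timeBased)).all().get();
        }
    }
}
```

Either way, the data outlives its initial delivery.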
Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while (see the replay sketch after this list):
- Kafka can offer strong guarantees of delivering data in the correct order (within each partition).
- Persisted data is automagically broken up into partitions.
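As a concrete illustration of both points, a consumer can pin itself to a single partition and rewind to the beginning of whatever is still retained; within a partition, records come back exactly in append order. The topic and partition here are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Pin the consumer to one partition; within a partition,
            // records come back in exactly the order they were appended.
            TopicPartition partition = new TopicPartition("clickstream", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition)); // replay retained history

            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```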
Technical tidbits include:
- Data is generally fresh to within 1.5 milliseconds.
- 100s of MB/sec/server is claimed. I didn’t ask how big a server was.
- LinkedIn runs >1 trillion messages/day through Kafka.
- Others in that throughput range include but are not limited to Microsoft and Netflix.
- A message is commonly 1 KB or less.
- At a guesstimate, 50%ish of messages are in Avro. JSON is another frequent format.
Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:
- Runs in different processes from core Kafka …
- … if it doesn’t actually run on a different cluster.
For example, connectors have their own pools of processes.
I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:
- Perhaps 80% of Kafka code comes from Confluent.
- LinkedIn has contributed most of the rest.
- However, as is typical in open source, the general community has contributed some connectors.
- The general community also contributes “esoteric” bug fixes, which Jay regards as evidence that Kafka is in demanding production use.
Jay has a rather erudite and wry approach to naming and so on.
- Kafka got its name because it was replacing something he regarded as Kafkaesque. OK.
- Samza is an associated project that has something to do with transformations. Good name. (The central character of The Metamorphosis was Gregor Samsa, and the opening sentence of the story mentions a transformation.)
- In his short book about logs, Jay has a picture caption “ETL in Ancient Greece. Not much has changed.” The picture appears to be of Sisyphus. I love it.
- I still don’t know why he named a key-value store Voldemort. Perhaps it was something not to be spoken of.
What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:
- Kafka is limited and simple. But of course Confluent has plans to broaden its capabilities.
- It’s long been hard to decide whether to talk about “events”, “streams” or both.
- “Streaming” has another tech meaning, in the context of video, songs, etc.
- One candidate name, “event hub”, has already been grabbed by IBM and Microsoft for their specific offerings.
- Naming is always hard in general.
Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.
And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.
Related links
- My October 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.
Comments
If data is fresh to within 1.5 milliseconds and this uses disk, then the data must be visible before being durable on 1+ hosts. An fsync takes longer than that unless you have HW RAID cards or fast storage.
Mark,
You are right. In Kafka, data is considered "committed", and therefore visible, when it is acknowledged by the leader and all in-sync brokers (administrators can control the minimum number of these brokers). "Acknowledged" means "in memory", but not necessarily on disk.
The idea is that if the minimum number of in-sync brokers is 3, it is unlikely that data that was visible to clients will then get lost, because all 3 would have to crash.
Hope this clarifies?
Just to add to what Gwen has said, this is the case with any distributed system that promotes durability. Well-built systems will always account for flaky parts. Disks lie. Networks introduce latency. Servers fail. Replicating with the "Rule of 3" is always a good plan.
Some notes:
– Twitter uses Kafka (see https://blog.twitter.com/2015/handling-five-billion-sessions-a-day-in-real-time), as do most high-volume media companies
– Kafka is a highly scalable pub/sub messaging layer. The use of the word "event" is market speak until Kafka can do more than feed messages from source to target(s). Compare it to message brokers, which can process data, whereas Kafka simply feeds it to consumers at high volume. For a sense of what event processing is, compare CEP systems, which allow ETL-like workflows against streaming data so that they can correlate events.
– Storm, Samza, and Spark Streaming (and perhaps Flink) provide those ETL-like capabilities, but they need a resilient message broker like Kafka
– Scale is a killer value-add, with Kafka being 2 orders of magnitude more scalable than whatever is in place. Other FOSS and proprietary message brokers cannot process at these volumes, forcing developers to resort to hacks. Kafka is so scalable that you can usually feed any shareable data through it – creating sharing/caring.
Mark,
I’m with Gwen and Patrick on this. The modern default is to declare victory when sufficiently many servers have what they need in RAM, without waiting for any of them to complete a write to persistent storage.
Kafka does support log.flush.interval.messages for per-message flushing, but, as the docs say, it's not as efficient as replication. It can still be turned on.
From the docs:
…We generally feel that the guarantees provided by replication are stronger than sync to local disk, however the paranoid still may prefer having both and application level fsync policies are still supported….
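For concreteness, here is a hedged sketch of the knobs discussed in this thread: producer-side acknowledgment plus the broker/topic-level settings mentioned above. The values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurabilitySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // acks=all: the leader waits for the full in-sync replica set to
        // acknowledge (in memory, not necessarily fsynced to disk) before
        // the write is considered "committed" and visible to consumers.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream", "user-42", "payload"));
        }

        // Broker/topic-side settings referenced in this thread (set via
        // topic or broker config, not on the producer):
        //   min.insync.replicas=3         - minimum replicas that must acknowledge
        //   log.flush.interval.messages=1 - force a flush per message (the
        //                                   "paranoid", less efficient option)
    }
}
```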
How is Kafka different from Fluentd? Both are open source data collectors.