Cloudera Kudu deep dive
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 is an overview of Kudu technology.
- Part 2 (this post) is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Let’s talk in more detail about how Kudu stores data.
- As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
- Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date. (There’s a sketch of all this right after the list.)
- A major design goal for Kudu is that compaction should never block — nor greatly slow — other work. In support of that:
- Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
- There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default is 128 MB, which is 4X the size of a DiskRowSet.
- When a compaction finishes, Kudu runs a little optimization to figure out which 128 MB to compact next.
- Every tablet has its own write-ahead log.
- This creates a practical limitation on the number of tablets …
- … because each tablet is causing its own stream of writes to “disk” …
- … but it’s only a limitation if your “disk” really is all spinning disk …
- … because multiple simultaneous streams work great with solid-state memory.
- Log retention is configurable, typically the greater of 5 minutes or 128 MB.
- Metadata is cached in RAM. Therefore:
- ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
- To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control 🙂 — and Todd Lipcon agreed.
- Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
- Servers write to their respective write-ahead logs, then acknowledge having done so.
- If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
- There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.
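To make the WOS/ROS split and the per-DiskRowSet delta stores concrete, here is a toy Python sketch of the flow described above. The store names follow Kudu’s terminology, but the code is my own simplification (real DiskRowSets are columnar files on disk, and real deltas carry timestamps), not Kudu’s actual implementation.

```python
"""Toy model of a Kudu tablet's storage: an in-memory write store (MemRowSet)
plus flushed DiskRowSets, each with its own in-memory delta store.
Illustrative only; not Kudu's actual data structures."""


class DiskRowSet:
    """A flushed set of rows, plus a DeltaMemStore of pending updates/deletes."""

    def __init__(self, rows):
        self.base = dict(rows)   # key -> row, frozen at flush time
        self.deltas = {}         # DeltaMemStore: key -> new row, or None for "deleted"

    def contains_key(self, key):
        return key in self.base

    def update(self, key, row):
        self.deltas[key] = row   # updates land in the DeltaMemStore, not the base data

    def delete(self, key):
        self.deltas[key] = None

    def read(self, key):
        # Base data is only trustworthy once the delta overlay is applied.
        if key in self.deltas:
            return self.deltas[key]
        return self.base.get(key)


class Tablet:
    def __init__(self):
        self.mem_row_set = {}    # the WOS: new inserts land here
        self.disk_row_sets = []  # the ROS: periodically flushed data

    def insert(self, key, row):
        self.mem_row_set[key] = row

    def flush(self):
        # Periodic flush of the in-memory row store into a new DiskRowSet.
        if self.mem_row_set:
            self.disk_row_sets.append(DiskRowSet(self.mem_row_set))
            self.mem_row_set = {}

    def update(self, key, row):
        if key in self.mem_row_set:
            self.mem_row_set[key] = row
            return
        for drs in self.disk_row_sets:
            if drs.contains_key(key):
                drs.update(key, row)
                return

    def delete(self, key):
        if key in self.mem_row_set:
            del self.mem_row_set[key]
            return
        for drs in self.disk_row_sets:
            if drs.contains_key(key):
                drs.delete(key)
                return

    def read(self, key):
        # Queries are federated: check the WOS first, then each DiskRowSet
        # (whose answer already reflects its own DeltaMemStore).
        if key in self.mem_row_set:
            return self.mem_row_set[key]
        for drs in self.disk_row_sets:
            if drs.contains_key(key):
                return drs.read(key)
        return None
```

The point to notice is that a read may have to consult several in-memory structures before it can trust what is on disk, which is the “check in multiple places” issue revisited below.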
Leaving aside the ever-popular possibilities of:
- Cluster-wide (or larger) equipment outages
- Bugs
the main failure scenario for Kudu is:
- The leader version of a tablet (within its replica set) goes down.
- A new leader is elected.
- The workload is such that the client didn’t notice and adapt to the error on its own.
Todd says that Kudu’s MTTR (Mean Time To Recovery) for write availability tests internally at 1-2 seconds in such cases, and shouldn’t really depend upon cluster size.
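Each tablet’s replicas form their own Raft consensus group, which is why recovery involves just those few servers rather than the whole cluster. Below is a toy sketch of the election step; it is my own drastic simplification (real Raft involves terms, votes, and randomized timeouts), meant only to show why an acknowledged write survives the failover.

```python
# Toy sketch of leader failover within one tablet's replica set.
# A drastic simplification of Raft, for illustration only.

class Replica:
    def __init__(self, name, last_log_index):
        self.name = name
        self.last_log_index = last_log_index  # how much of the tablet's WAL it holds


def elect_new_leader(replicas, failed_leader):
    """Pick a new leader from the surviving replicas of one tablet.

    The invariant that matters: a write was only acknowledged once a majority
    of WALs had it, so (in this prefix-log toy) the most up-to-date survivor
    still holds every acknowledged write, and electing it loses nothing."""
    survivors = [r for r in replicas if r is not failed_leader]
    majority = len(replicas) // 2 + 1
    if len(survivors) < majority:
        return None  # no quorum: the tablet is unavailable, not inconsistent
    return max(survivors, key=lambda r: r.last_log_index)


replicas = [Replica("ts-a", 120), Replica("ts-b", 120), Replica("ts-c", 118)]
print(elect_new_leader(replicas, failed_leader=replicas[0]).name)  # ts-b
```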
Beyond that, I had some difficulties understanding details of the Kudu write path(s). An email exchange ensued, and Todd kindly permitted me to post some of his own words (edited by me for clarification and format).
Every tablet has its own in-memory store for inserts (MemRowSet). From a read/write path perspective, every tablet is an entirely independent entity, with its own MemRowSet, rowsets, etc. Basically the flow is:
- The client wants to make a write (i.e. an insert/update/delete), which has a primary key.
- The client applies the partitioning algorithm to determine which tablet that key belongs in.
- The information about which tablets cover which key ranges (or hash buckets) is held in the master. (But since it is cached by the clients, this is usually a local operation.)
- It sends the operation to the “leader” replica of the correct tablet (batched along with any other writes that are targeted to the same tablet).
- Once the write reaches the tablet leader:
- The leader enqueues the write to its own WAL (Write-Ahead Log) and also enqueues it to be sent to the “follower” replicas.
- Once it has reached a majority of the WALs (i.e. 2/3 when the replication factor = 3), the write is considered “replicated”. That is to say, it’s durable and would always be rolled forward, even if the leader crashed at this point.
- Only now do we enter the “storage” part of the system, where we start worrying about MemRowSets vs DeltaMemStores, etc.
Put another way, there is a fairly clean architectural separation into three main subsystems:
- Metadata and partitioning (map from a primary key to a tablet, figure out which servers host that tablet).
- Consensus replication (given a write operation, ensure that it is durably logged and replicated to a majority of nodes, so that even if we crash, everyone will agree whether it should be applied or not).
- Tablet storage (now that we’ve decided a write is agreed upon across replicas, actually apply it to the database storage).
These three areas of the code are separated as much as possible — for example, once we’re in the “tablet storage” code, it has no idea that there might be other tablets. Similarly, the replication and partitioning code don’t know much of anything about MemRowSets, etc. – that’s entirely within the tablet layer.
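To make that three-layer separation concrete, here is a toy Python sketch of my own of the write path: partition lookup, then majority-acknowledged logging, then tablet storage. The function names and data shapes are invented for illustration; Kudu’s real interfaces look nothing like this, but the division of labor is the same.

```python
# Toy sketch of Kudu's three write-path subsystems, kept deliberately separate.
# Names and data shapes are invented for illustration.

def find_tablet(primary_key, partition_map):
    """(1) Metadata/partitioning: map a primary key to its tablet.
    partition_map is a sorted list of (start_key, tablet) pairs; since clients
    cache it, this is normally a purely local operation."""
    owner = None
    for start_key, tablet in partition_map:
        if primary_key >= start_key:
            owner = tablet
        else:
            break
    return owner


def replicate(write_op, leader, followers):
    """(2) Consensus: append to the leader's WAL, ship to followers, and call
    the write 'replicated' once a majority of WALs (leader included) have it."""
    leader["wal"].append(write_op)
    acks = 1                                   # the leader's own WAL
    for f in followers:
        if f["up"]:                            # in reality: asynchronous RPC
            f["wal"].append(write_op)
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                    # durable; will be rolled forward


def apply_to_storage(write_op, replica):
    """(3) Tablet storage: only now does MemRowSet / DeltaMemStore logic run."""
    replica["mem_row_set"][write_op["key"]] = write_op["row"]


def write(primary_key, row, partition_map):
    """The client-visible flow, stitching the three layers together."""
    tablet = find_tablet(primary_key, partition_map)
    op = {"key": primary_key, "row": row}
    if replicate(op, tablet["leader"], tablet["followers"]):
        apply_to_storage(op, tablet["leader"])  # followers apply it as well
        return True
    return False


leader = {"wal": [], "mem_row_set": {}}
followers = [{"wal": [], "up": True}, {"wal": [], "up": False}]  # one follower down
partition_map = [("a", {"leader": leader, "followers": followers})]
print(write("bob", {"visits": 3}, partition_map))  # True: 2 of 3 WALs have the op
```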
As for reading — the challenge isn’t in the actual retrieval of the data so much as in figuring out where to retrieve it from. What I mean by that is:
- Data will always be either in memory or in a persistent column store. So I/O speed will rarely be a problem.
- Rather, the challenge to Kudu’s data retrieval architecture is finding the relevant record(s) in the first place, which is slightly more complicated than in some other systems. For upon being told the requested primary key, Kudu still has to:
- Find the correct tablet(s).
- Find the record(s) on the (rather large) tablet(s).
- Check various in-memory stores as well.
The “check in multiple places” problem doesn’t seem to be of much concern, because:
- All that needs to be checked is the primary key column.
- The on-disk data is front-ended by Bloom filters.
- The cases in which a Bloom filter returns a false positive are generally the same busy ones where the key column is likely to be cached in RAM.
- Cloudera just assumes that checking a few different stores in RAM isn’t going to be a major performance issue.
When it comes to searching the tablets themselves:
- Kudu tablets feature data skipping among DiskRowSets, based on value ranges for the primary key.
- The whole point of compaction is to make the data skipping effective. (There’s a sketch of both checks just below.)
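Here is a toy Python sketch of that narrowing: first skip any DiskRowSet whose primary-key range cannot contain the key, then ask each survivor’s Bloom filter before touching its data. The Bloom filter below is a deliberately crude stand-in (two hash positions derived from one SHA-1 digest), included only to show where the check sits in the lookup path, not how Kudu builds its filters.

```python
import hashlib

# Toy illustration of how a tablet narrows a primary-key lookup:
# (1) skip DiskRowSets whose key range cannot contain the key, then
# (2) ask each survivor's Bloom filter before touching its data.
# The Bloom filter is a crude stand-in, not Kudu's implementation.

class ToyBloom:
    def __init__(self, keys, nbits=1024):
        self.nbits = nbits
        self.bits = [False] * nbits
        for k in keys:
            for i in self._positions(k):
                self.bits[i] = True

    def _positions(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        # Derive two bit positions from one digest (plenty for a toy).
        return [int.from_bytes(digest[:4], "big") % self.nbits,
                int.from_bytes(digest[4:8], "big") % self.nbits]

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[i] for i in self._positions(key))


class ToyDiskRowSet:
    def __init__(self, rows):                  # rows: dict of key -> row
        self.rows = rows
        self.min_key = min(rows)
        self.max_key = max(rows)
        self.bloom = ToyBloom(rows.keys())

    def lookup(self, key):
        # Data skipping: a cheap key-range test prunes most DiskRowSets,
        # provided compaction has kept their ranges from overlapping too much.
        if not (self.min_key <= key <= self.max_key):
            return None
        if not self.bloom.might_contain(key):
            return None
        return self.rows.get(key)              # only now touch the actual data


def point_lookup(key, disk_row_sets):
    for drs in disk_row_sets:
        row = drs.lookup(key)
        if row is not None:
            return row
    return None
```

When compaction has kept the key ranges from overlapping much, a point lookup typically ends up touching the data of only one DiskRowSet.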
Finally, Kudu pays a write-time (or compaction-time) cost to boost retrieval speeds from inside a particular DiskRowSet, by creating something that Todd called an “ordinal index” but agreed with me would be better called something like “ordinal offset” or “offset index”. Whatever it’s called, it’s an index that tells you the number of rows you would need to scan before getting the one you want, thus allowing you to retrieve (except for the cost of an index probe) at array speeds.
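Here is a toy sketch of that idea, with invented names and a plain binary search standing in for the index structure: once the key has been translated to a row ordinal, fetching any column’s value for that row is a direct array access (decoding costs aside).

```python
import bisect

# Toy illustration of an "offset index": translate a primary key into its row
# ordinal within a DiskRowSet, then fetch each wanted column by that ordinal
# at array speed. Invented structure, not Kudu's file format.

class ToyColumnarRowSet:
    def __init__(self, sorted_keys, columns):
        # columns: {"column_name": [value_for_row_0, value_for_row_1, ...]}
        # Rows are stored sorted by primary key, so ordinal == sorted position.
        self.keys = sorted_keys
        self.columns = columns

    def ordinal_of(self, key):
        """The index probe: how many rows precede the one we want."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

    def fetch(self, key, wanted_columns):
        i = self.ordinal_of(key)
        if i is None:
            return None
        # One probe, then plain array reads for each requested column.
        return {c: self.columns[c][i] for c in wanted_columns}


rs = ToyColumnarRowSet(
    sorted_keys=["alice", "bob", "carol"],
    columns={"visits": [7, 3, 9], "country": ["US", "CA", "US"]},
)
print(rs.fetch("bob", ["visits"]))  # {'visits': 3}
```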
Comments
I am trying to categorize Kudu alongside some known technologies, and I am facing the following questions:
1. It cannot be called 100% storage, because it cannot store arbitrary data. So it cannot replace HDFS and will run alongside it. Is that true?
2. It is not classical NoSQL, since it is not schema-less, and schema migrations can become a significant hassle, as with any fixed-schema solution.
I would call it an “analytical database distributed storage engine”, built to be reused by execution engines like Impala and Spark.
Aside from categorization, I have a few more questions:
1. It is not clear what happens with nested data. I see it as a significant part of the data processed in the Hadoop ecosystem, and especially by Spark.
2. What happens with massive data inserts? It is a long-standing problem with many NoSQL systems – how to load terabytes of data in hours.
3. As a long-time advocate of predicate push-down technologies, I am curious: if it is C++, how will predicate push-down work securely, at least from the stability viewpoint? Is there some kind of sandboxing to be introduced?
4. If Kudu is run on the same cluster as HDFS, how will space be shared or balanced? We don’t yet have a YARN for disk space…
Hey David,
Thanks for the questions. Will do my best to answer, though it may turn into an essay:
> 1. It cannot be called 100% storage, because it cannot store arbitrary data. So it cannot replace HDFS and will run alongside it. Is that true?
That’s correct. In the Hadoop ecosystem we use the term “storage” a bit more loosely than its traditional “enterprise storage” sense — we call anything that stores persistent data “storage”. So, Kudu is storage by our terminology, but it’s not a filesystem, and doesn’t replace HDFS, unless all of your data strictly fits Kudu’s relational data model. Our assumption is that most folks will run Kudu and HDFS together, and put each dataset in the storage engine that makes the most sense.
> 2. It is not classical NoSQL, since it is not schema-less, and schema migrations can become a significant hassle, as with any fixed-schema solution.
Yep, it’s true that it’s not “classical” NoSQL (though the term made me chuckle a bit – was only 6.5 years ago I was at the very first NoSQL meetup in SF!). However, we put a lot of effort into ensuring that adding and dropping columns is very fast and low-impact – right now it blocks inserts for a couple of seconds on each server, one at a time, but we have a design that we can move to later which would only block for a couple hundred microseconds. So, given the common case of schema changes where you’ve just decided to collect some new data, that’s pretty low impact.
I would also argue that, the way many people use NoSQL databases, they aren’t “schemaless”, but rather there is an “implied schema” which is baked into your application. You still do schema migrations — it just involves changing your code to read or write new columns (or JSON fields) instead of physically modifying the table schema itself.
Early in Kudu’s development we considered a much more “open” schema design, but ended up opting for the relational table model for our first release. One of the major advantages here is its simplicity with integration into other tools (eg SQL engines or Spark) where the table metadata provides everything the engine needs to know for table mapping without requiring any special syntax from the user.
I’m sure in the future we’ll start to offer more flexible options as well — we’ve discussed for a while the idea of a protobuf or JSON-typed field with operators within Kudu that will perform efficient (LLVM-jitted) pushdown and projection. Hopefully this mixed model will satisfy use cases where there truly is a rapidly changing or unknown schema.
> 1. It is not clear what happens with nested data. I see it as a significant part of the data processed in the Hadoop ecosystem, and especially by Spark.
We don’t support it yet. Agreed that nested data is important, which is why we’re continuing to support plenty of storage options outside of Kudu 🙂 The next release of Impala, for example, will support querying nested data from Parquet files. I definitely anticipate adding some support for nesting within Kudu in the future, but as Curt mentioned in one of his posts here, these types of projects are huge and take a long time to get right. We opted for “quality” and “time” on the quality-time-scope triangle for our first release. Hopefully the community and customer base will agree that that was the correct trade-off!
> 2. What happens with massive data inserts? It is a long-standing problem with many NoSQL systems – how to load terabytes of data in hours.
I’ve seen this be more of a concern in artificial benchmark scenarios and less in real workloads. If you are loading 1TB/hour, that’s 8.8PB/year before replication. If you retain three years of data, you’ll need 80PB of space on your cluster. With 36TB per machine, you’re talking 2000+ machines. That 1TB/hr spread across the 2000 machines you’ll need for storage is 500MB/hour/machine which only works out to 138KB/sec if I did my math right, which is pretty trivial.
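In code, for anyone who wants to check that arithmetic (decimal units; three years of retention, 3x replication, 2,000 machines):

```python
# Back-of-envelope check of the numbers above (decimal units).
tb_per_hour = 1
pb_per_year = tb_per_hour * 24 * 365 / 1000               # ~8.8 PB/year before replication
cluster_pb = pb_per_year * 3 * 3                           # 3 years retained, 3 replicas: ~79 PB
machines = cluster_pb * 1000 / 36                          # at 36 TB/machine: ~2,200 machines
kb_per_sec_per_machine = tb_per_hour * 1e9 / 2000 / 3600   # spread over 2,000 machines: ~139 KB/s
print(pb_per_year, cluster_pb, machines, kb_per_sec_per_machine)
```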
Of course data loading is a concern when you’re migrating from another system, but less so in the ongoing operations.
To answer your question more directly: our data loading speed is decent, but it’s not the area we’ve spent our time on thus far. For example, doing a ‘CREATE TABLE x STORED AS PARQUET AS SELECT * FROM other_table’ is 5-6x faster than doing the same to create a Kudu table within Impala. We believe we can improve that by at least 3x with low hanging fruit, but time will tell. For our first public release we put most of our effort into the read path optimizations.
> 3. As a long-time advocate of predicate push-down technologies, I am curious: if it is C++, how will predicate push-down work securely, at least from the stability viewpoint? Is there some kind of sandboxing to be introduced?
We only offer pushdown of pre-defined predicates (eg range predicates on columns) so there isn’t a sandboxing concern. We have some ideas on how to sandbox arbitrary UDFs in the future, but so far have only done a few prototypes/experiments in that area.
BTW, despite being a big fan of Java, I don’t think sandboxing arbitrary Java is much easier than C++ (maybe the opposite) — it’s still hard to “kill” a Java thread, and the lack of shared memory primitives and access to OS sandboxing features makes it tough to sandbox via fork/exec. (I worked a bit on the sandboxing task executor in Hadoop years back.)
> 4. If Kudu is run on the same cluster as HDFS, how will space be shared or balanced? We don’t yet have a YARN for disk space…
HDFS has a configuration which allows it to stop using a disk once the amount of free space drops below a certain threshold. Kudu doesn’t yet have this same feature, but we’re planning on adding it. Beyond that, it’s a matter of monitoring and alerting. In the future we have plans to add table-level (or perhaps user-level) disk quotas and better local space isolation features. But, for now, it’s much the same as how MR scratch space shares space with HDFS. We know this isn’t a great answer, and we’re working on improving it before a GA release.
Sorry for the long-winded answer. Feel free to ask more questions if the above isn’t clear.
-Todd (lead eng on Kudu)
Hi Todd,
Thank you for such a comprehensive answer! If I understand you right, it is more or less correct to categorize Kudu as an “analytics-oriented NoSQL”.
I agree with most points, except maybe the one about the need for massive data loads. I have seen many cases where a company does massive pre-calculation daily and then loads TBs of data into HBase. Another case is just creating ad-hoc temporary datasets for research. That leads me to the question:
what use case do you see as “typical” for Kudu?
Another, purely technical question I would like to ask:
Will Kudu provide memory mapped access to enable high performance engines to do their best?
David
Kudu is not a DBMS. Kudu plus a front-end (e.g. Impala) is more or less a DBMS. So saying that Kudu is “relational” or “NoSQL” is hard to make meaningful. 🙂
That said — if one has inherently non-tabular data, is Kudu likely to offer much in the near future?
> I have seen many cases where a company does massive pre-calculation daily and then loads TBs of data into HBase.
Yea, if you’re only pre-calculating and bulk-loading, with no plans to later update the data, Kudu might not be the right fit. I could see us adding a “bulk load” API which bypasses locking, consensus, etc, if such use cases turn out to be in high demand. (I actually worked on this feature in HBase back in 2010: https://issues.apache.org/jira/browse/HBASE-1923 )
> what use case do you see as “typical” for Kudu?
I won’t say “typical” until we have more deployments 🙂 But I can share that Xiaomi, one of our development partners, is using it to implement a service which streams data in as it’s generated off IoT style devices, and then allows their customers to perform ad-hoc analytics. Originally they were planning to build this using Parquet, but Kudu is now allowing them to offer fast random access lookups (eg fetching all the data for one customer or device in a few milliseconds), updates (eg correcting past data), and support for late arriving inserts (eg if a connection was down for some hours, they can still insert old data without partition-swapping heroics).
> Will Kudu provide memory mapped access to enable high performance engines to do their best?
We had an intern who was exploring shared-memory transport for localhost access over the summer. We built a lot of the code but didn’t get it to the point where it showed substantial perf improvements, due to page table faulting overhead. We need to do a bit more work to re-use shmem segments across RPCs. You can see where this project left off here: http://gerrit.cloudera.org:8080/#/c/958/
-Todd
> That said — if one has inherently non-tabular data, is Kudu likely to offer much in the near future?
It’s certainly possible to do:
CREATE TABLE kvstore (key BINARY NOT NULL PRIMARY KEY, value BINARY NOT NULL);
and you’ll get pretty good performance for key lookups as well as scans of the binary data. But, we don’t have any support for medium or large BLOBs, so it’s probably not a particularly good fit. I’d typically recommend HBase for cases where your data isn’t well structured – most of the benefits of a true columnar store come from understanding typing and type-specific columnar compression.
So David was incorrect to label Kudu as “NoSQL”? That’s what I thought. 🙂
Well, I could have written that CREATE TABLE example using the Java or C++ API too, so I guess it’s “MaybeSQL” or “BYOSQL” 🙂
I would say that the Xiaomi case does sound like NoSQL. Without Kudu, I would say that HBase is the likely solution.
Having said that, I would speculate that Kudu will enable solutions to many use cases which today have mostly awkward solutions:
– We have stream of data which we have to analyze and some data arrives late.
– We have 99% of static data, with rare updates.
– We have history of events, with a lot of reports, but there is a need to find a few events for technical support in real time.
- We have a segmented dataset (usually in SaaS companies), and we need to query each segment separately.
Yes, David, I think it is best to compare the alternatives.
For use cases similar to the ones you describe, the current Hadoop solution has been to have HBase for random reads/inserts/updates/deletes, Impala on Parquet for fast analytics, and of course some improvised “ETL” jobs/stream processing to copy data from HBase to Parquet.
With Kudu, the app could just do random reads/writes to Kudu, and point Impala to the same copy of it. No ETL, just one (replicated) copy of the data, and hopefully reasonable / decent performance for both the OLTP workload and the analytics workload.
Sounds great to me; it should likely be appealing to many use cases – especially the more structured ones 🙂
Ofir, David — you guys hit the nail on the head describing use cases. Thanks!
Regarding the Xiaomi use case in particular, Binglin Chang from their team will be presenting on it during my Strata session on Wednesday, if you’re in town for the conference.
Stonebraker:
If it turns out that the DBMS point of view prevails in the marketplace over time, then HDFS will atrophy as the DBMS vendors abandon its features. In such a world, there is a local file system at each node, a parallel DBMS supporting a high-level query language, and various tools on top or extensions defined via user-defined functions. In this scenario, Hadoop will morph into a standard shared-nothing DBMS …
Good find! The relevant link is apparently http://cacm.acm.org/blogs/blog-cacm/177467-hadoop-at-a-crossroads/fulltext
There has always been a local file system at each node with HDFS.
The question for me is whether:
1) storage + processing are tightly bundled as in parallel DBMS
or
2) there is layering so we get a great distributed storage system on top of which different processing solutions can be built.
I think the Cloudera solution is closer to #1. I prefer #2, but I am not suggesting that HDFS is the right answer for #2.
Mark,
What are you hoping this distributed storage system will be great at storing? Is it everything from, say, single database records to video files? I’m not sure that’s possible in its purest form.
I agree, it can’t be all things or it will be less than stellar.
Possible features:
* write once files
* compression
* online Reed Solomon (target replication factor = 1MB)
* much better availability & throughput than HDFS
Maybe not an option, but it will have a minimum size at which reads are performant, maybe >= 1MB
And that target replication factor = 1MB above makes no sense. The target is < 3.
Hey Mark,
I think Kudu is somewhere between your #1 and #2. Kudu itself is a storage system, and doesn’t offer any “parallel database” functionality (eg join/agg/etc). We rely on loosely coupled (but well integrated) SQL systems like Impala running on top of Kudu in order to provide those features. I jokingly called this “BYO SQL” recently, but actually think it’s a good description.
That said, it’s a storage system specifically for tabular data. I think it’s impossible to provide a pure “storage” abstraction (ie bytes-based) and also allow the kind of performance that Kudu provides across both random and sequential workloads.
Would be interested in chatting more about the perceived shortcomings of HDFS for your “#2” use case, but this probably isn’t the right venue 🙂
-Todd
Jeff pointed me to http://getkudu.io/kudu.pdf and I will read that before trying to have more opinions.
Todd – Have you looked at Druid? What are the differences with Kudu? As far as I see, there is definitely an advantage in Kudu for doing random updates… but beyond that?
VJ
Jumping in here — I’m not aware of a way to do analytic SQL over Druid.
http://www.dbms2.com/2012/06/16/metamarkets-druid-overview/
Curt — Spark can be integrated over Druid to provide the analytic SQL capabilities.
I doubt the performance of that combination is very good yet.