Hadoop
Discussion of Hadoop. Related subjects include:
Cloudera Kudu deep dive
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 is an overview of Kudu technology.
- Part 2 (this post) is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Let’s talk in more detail about how Kudu stores data.
- As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
- Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date.
- A major design goal for Kudu is that compaction should never block — nor greatly slow — other work. In support of that:
- Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
- There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default figure = 128 MB, which is 4X the size of a DiskRowSet.
- When done, Kudu runs a little optimization to figure out which 128 MB to compact next.
- Every tablet has its own write-ahead log.
- This creates a practical limitation on the number of tablets …
- … because each tablet is causing its own stream of writes to “disk” …
- … but it’s only a limitation if your “disk” really is all spinning disk …
- … because multiple simultaneous streams work great with solid-state memory.
- Log retention is configurable, typically the greater of 5 minutes or 128 MB.
- Metadata is cached in RAM. Therefore:
- ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
- To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control 🙂 — and Todd Lipcon agreed.
- Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
- Servers write to their respective write-ahead logs, then acknowledge having done so.
- If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
- There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.
Categories: Cloudera, Columnar database management, Hadoop, Solid-state memory, SQL/Hadoop integration | 25 Comments |
Introduction to Cloudera Kudu
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 (this post) is an overview of Kudu technology.
- Part 2 is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.
*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes. 🙂
For starters:
- Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
- Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
- Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
- Kudu doesn’t support multi-row transactions.
- There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
- Kudu is rather columnar, except for transitory in-memory stores.
- Kudu’s core design points are that it should:
- Accept data very quickly.
- Immediately make that data available for analytics.
- More specifically, Kudu is meant to accept, along with slower forms of input:
- Lots of fast random writes, e.g. of web interactions.
- Streams, viewed as a succession of inserts.
- Updates and inserts alike.
- The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
- Low-latency business intelligence.
- Predictive model scoring.
- Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
- Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.
Rocana’s world
For starters:
- My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
- Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
- … cofounder Eric Sammer.
- Rocana recently told me it had 35 people.
- Rocana has a very small number of quite large customers.
Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:
- Actual operations — figuring out exactly what isn’t working, ASAP.
- Security.
Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:
- The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
- Firewall output.
- Database server logs.
- Point-of-sale data (at a retailer).
- “Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)
Data messes
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
Inconsistency can take multiple forms, including: Read more
Hadoop generalities
Occasionally I talk with an astute reporter — there are still a few left 🙂 — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are: Read more
Teradata will support Presto
At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata)::
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
- Facebook at various points in time created both Hive and now Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: Read more
IT-centric notes on the future of health care
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
These vastly greater amounts of data cited above will allow for greatly changed analytics.
Read more
MemSQL 4.0
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as in-memory OTLP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark. Hadoop and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.
Notes on analytic technology, May 13, 2015
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more
Notes, links and comments, May 2, 2015
I’m going to be out-of-sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.
1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.
2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:
- Databricks does not directly license or support on-premises Spark users. Rather …
- … it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.
However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.
Databricks, by the way, has grown to >50 people.
3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.