More notes on HBase
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
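To make the Bloom filter and time-range points above concrete, here is a minimal sketch using the standard HBase Java client (2.x-style API). The table name, column family, and time window are made up for illustration; this shows the general shape of the feature, not anything Cloudera specifically showed me.

```java
// Hedged sketch: create a table whose column family keeps a row-level Bloom
// filter, then run a scan restricted to a timestamp window. Table/family names
// ("events", "d") are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomAndTimeRangeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Row-level Bloom filter: point reads for absent keys can skip
            // most HFiles without touching them.
            TableName name = TableName.valueOf("events");
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
                    .setBloomFilterType(BloomType.ROW)
                    .build())
                .build());

            // Time-range scan: files whose recorded timestamp range falls
            // entirely outside the window can be skipped.
            long now = System.currentTimeMillis();
            Scan scan = new Scan().setTimeRange(now - 3600_000L, now);
            try (Table table = conn.getTable(name);
                 ResultScanner rs = table.getScanner(scan)) {
                rs.forEach(r -> System.out.println(Bytes.toString(r.getRow())));
            }
        }
    }
}
```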
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
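To illustrate the entity-bucketing idea, here is a hedged sketch of the usual row-key trick: concatenate an entity identifier with a timestamp, so that one device's readings sort together and a per-device time slice becomes a short range scan. The table, family, and key layout below are assumptions for illustration, not Cloudera's or OpenTSDB's actual schema.

```java
// Hedged sketch of an entity-bucketed row key: fixed-width device id followed
// by a big-endian timestamp. Names ("device_readings", "m", "device-42") are
// illustrative assumptions.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityKeySketch {
    // All readings for one device sort together, so a per-device time slice
    // is one contiguous key range rather than a cluster-wide query.
    static byte[] rowKey(String deviceId, long epochMillis) {
        return Bytes.add(Bytes.toBytes(String.format("%-16s", deviceId)),
                         Bytes.toBytes(epochMillis));
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("device_readings"))) {

            long now = System.currentTimeMillis();
            table.put(new Put(rowKey("device-42", now))
                .addColumn(Bytes.toBytes("m"), Bytes.toBytes("temp"), Bytes.toBytes("21.5")));

            // One device, last hour: a single short range scan.
            Scan scan = new Scan()
                .withStartRow(rowKey("device-42", now - 3600_000L))
                .withStopRow(rowKey("device-42", now + 1));
            try (ResultScanner rs = table.getScanner(scan)) {
                rs.forEach(r -> System.out.println(Bytes.toString(r.getRow())));
            }
        }
    }
}
```

As I understand it, OpenTSDB's real row key is more elaborate (metric and tag identifiers plus a base timestamp), but it relies on the same sorting principle.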
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:
- Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
- eBay has an HBase database largely on spinning disk, used to inform its search engine.
Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
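As a rough illustration of how those storage choices surface in the product, here is a hedged sketch of per-column-family cache and storage hints via the HBase 2.x Admin API. The table and family names are made up, and the setStoragePolicy hint only matters on releases and clusters where HDFS volumes are actually tagged as SSD.

```java
// Hedged sketch of per-column-family cache/storage hints; "events" and "d"
// are illustrative assumptions, and "ONE_SSD" presumes SSD-tagged HDFS volumes.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageHintsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableName table = TableName.valueOf("events");
            ColumnFamilyDescriptor existing =
                admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("d"));

            // Start from the current settings, then favor the in-memory tier of
            // the block cache and request SSD placement for the family's files.
            admin.modifyColumnFamily(table,
                ColumnFamilyDescriptorBuilder.newBuilder(existing)
                    .setInMemory(true)            // favored tier of the block cache
                    .setBlockCacheEnabled(true)   // cache data blocks on read
                    .setStoragePolicy("ONE_SSD")  // HDFS storage policy hint
                    .build());
        }
    }
}
```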
4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.
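For the replication point, here is a hedged sketch of what peering looks like through the HBase 2.x Admin API. The peer id, ZooKeeper quorum string, and table details are placeholders; running the mirror-image call on the remote cluster is what makes the arrangement master/master.

```java
// Hedged sketch of cross-cluster replication setup; peer id, quorum string,
// and table/family names are made-up placeholders.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.util.Bytes;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Register the remote cluster as a replication peer; the same call
            // pointing back at this cluster, run over there, closes the loop.
            admin.addReplicationPeer("remote_dc",
                ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk-remote-1,zk-remote-2,zk-remote-3:2181:/hbase")
                    .build());

            // Only globally scoped column families ship their edits to peers.
            TableName table = TableName.valueOf("events");
            ColumnFamilyDescriptor existing =
                admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("d"));
            admin.modifyColumnFamily(table,
                ColumnFamilyDescriptorBuilder.newBuilder(existing)
                    .setScope(HConstants.REPLICATION_SCOPE_GLOBAL)
                    .build());
        }
    }
}
```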
5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.
- It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
- An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
- Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.
In general, Cloudera suggested that HBase was used in a fair number of OEM situations.
6. Finally, I have one number: As of January 2014, there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.
Related link
- A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.
Comments
I think sorting is a very important, and relatively rare, property among NoSQL systems. The fact that records are stored sorted by their key gives important flexibility in system design. For example, an architect can compose a key from user id and time to get efficient access to all records related to a given user in a given time frame. Other systems, built on consistent hashing, cannot guarantee data proximity, so almost any query will touch all servers in the cluster.
In other words, the fact that data is sorted makes HBase a much better solution for DWH (data warehousing) than many others built on consistent hashing.
>> Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data.
Apple?
David, I am not sure what you mean by "rare." Key-sorted files are not only common in the NoSQL world, e.g. HBase or Cassandra (if you use OrderPreservingPartitioner); they are also common in RDBMSs, e.g. Index-Organized Tables in Oracle.
By "rare," I mean that the default choice for most NoSQL systems is consistent hashing, and many of them do not give us any other choice.
Cassandra, by default, uses hash partitioning to distribute data among nodes, and data is sorted only within a node.
The DataStax documentation says: "Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons … "
In a nutshell: because of hot spots.
Couchbase, Riak, and Voldemort use consistent hashing.
I do not think their respective designers are unaware of the benefits of sorting, but they are also aware of the toll: the "hot regions" problem, in HBase terms. So they chose to keep their systems simpler and focus on the "access by key" case, where sorting is not required.
In my view, only NoSQL solutions that are committed to fighting the "hot regions" problem should be considered when sorting is needed. To the best of my knowledge, only HBase does so (at least among popular NoSQL systems).
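For readers who want to see what fighting hot regions can look like in practice, here is a minimal sketch of key salting, one common mitigation; the bucket count and key layout are assumptions for illustration only.

```java
// Hedged sketch of key salting: prefix the sorted key with a small hash-derived
// salt so monotonically increasing keys spread across regions.
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {
    static final int BUCKETS = 16;  // illustrative bucket count

    // Salting by a hash of the entity keeps each entity's rows contiguous in one
    // bucket, while different entities spread across regions; a scan across all
    // entities then needs one scan per bucket.
    static byte[] saltedKey(String userId, long epochMillis) {
        byte salt = (byte) Math.floorMod(userId.hashCode(), BUCKETS);
        return Bytes.add(new byte[] { salt },
                         Bytes.toBytes(userId + "#"),
                         Bytes.toBytes(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toStringBinary(saltedKey("user-17", System.currentTimeMillis())));
    }
}
```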