More notes on HBase
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
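To make the Bloom filter and time-range points above concrete, here is a minimal sketch using the standard HBase Java client (2.x-style API). The table name, column family, and time window are made up for illustration; this shows the general shape of the feature, not anything Cloudera specifically showed me.

```java
// Hedged sketch: create a table whose column family keeps a row-level Bloom
// filter, then run a scan restricted to a timestamp window. Table/family names
// ("events", "d") are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomAndTimeRangeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Row-level Bloom filter: point reads for absent keys can skip
            // most HFiles without touching them.
            TableName name = TableName.valueOf("events");
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
                    .setBloomFilterType(BloomType.ROW)
                    .build())
                .build());

            // Time-range scan: files whose recorded timestamp range falls
            // entirely outside the window can be skipped.
            long now = System.currentTimeMillis();
            Scan scan = new Scan().setTimeRange(now - 3600_000L, now);
            try (Table table = conn.getTable(name);
                 ResultScanner rs = table.getScanner(scan)) {
                rs.forEach(r -> System.out.println(Bytes.toString(r.getRow())));
            }
        }
    }
}
```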
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
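To illustrate the entity-bucketing idea, here is a hedged sketch of the usual row-key trick: concatenate an entity identifier with a timestamp, so that one device's readings sort together and a per-device time slice becomes a short range scan. The table, family, and key layout below are assumptions for illustration, not Cloudera's or OpenTSDB's actual schema.

```java
// Hedged sketch of an entity-bucketed row key: fixed-width device id followed
// by a big-endian timestamp. Names ("device_readings", "m", "device-42") are
// illustrative assumptions.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityKeySketch {
    // All readings for one device sort together, so a per-device time slice
    // is one contiguous key range rather than a cluster-wide query.
    static byte[] rowKey(String deviceId, long epochMillis) {
        return Bytes.add(Bytes.toBytes(String.format("%-16s", deviceId)),
                         Bytes.toBytes(epochMillis));
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("device_readings"))) {

            long now = System.currentTimeMillis();
            table.put(new Put(rowKey("device-42", now))
                .addColumn(Bytes.toBytes("m"), Bytes.toBytes("temp"), Bytes.toBytes("21.5")));

            // One device, last hour: a single short range scan.
            Scan scan = new Scan()
                .withStartRow(rowKey("device-42", now - 3600_000L))
                .withStopRow(rowKey("device-42", now + 1));
            try (ResultScanner rs = table.getScanner(scan)) {
                rs.forEach(r -> System.out.println(Bytes.toString(r.getRow())));
            }
        }
    }
}
```

As I understand it, OpenTSDB's real row key is more elaborate (metric and tag identifiers plus a base timestamp), but it relies on the same sorting principle.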
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:
- Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
- eBay has an HBase database largely on spinning disk, used to inform its search engine.
Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
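As a rough illustration of how those storage choices surface in the product, here is a hedged sketch of per-column-family cache and storage hints via the HBase 2.x Admin API. The table and family names are made up, and the setStoragePolicy hint only matters on releases and clusters where HDFS volumes are actually tagged as SSD.

```java
// Hedged sketch of per-column-family cache/storage hints; "events" and "d"
// are illustrative assumptions, and "ONE_SSD" presumes SSD-tagged HDFS volumes.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageHintsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableName table = TableName.valueOf("events");
            ColumnFamilyDescriptor existing =
                admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("d"));

            // Start from the current settings, then favor the in-memory tier of
            // the block cache and request SSD placement for the family's files.
            admin.modifyColumnFamily(table,
                ColumnFamilyDescriptorBuilder.newBuilder(existing)
                    .setInMemory(true)            // favored tier of the block cache
                    .setBlockCacheEnabled(true)   // cache data blocks on read
                    .setStoragePolicy("ONE_SSD")  // HDFS storage policy hint
                    .build());
        }
    }
}
```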
4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.
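For the replication point, here is a hedged sketch of what peering looks like through the HBase 2.x Admin API. The peer id, ZooKeeper quorum string, and table details are placeholders; running the mirror-image call on the remote cluster is what makes the arrangement master/master.

```java
// Hedged sketch of cross-cluster replication setup; peer id, quorum string,
// and table/family names are made-up placeholders.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.util.Bytes;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Register the remote cluster as a replication peer; the same call
            // pointing back at this cluster, run over there, closes the loop.
            admin.addReplicationPeer("remote_dc",
                ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk-remote-1,zk-remote-2,zk-remote-3:2181:/hbase")
                    .build());

            // Only globally scoped column families ship their edits to peers.
            TableName table = TableName.valueOf("events");
            ColumnFamilyDescriptor existing =
                admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("d"));
            admin.modifyColumnFamily(table,
                ColumnFamilyDescriptorBuilder.newBuilder(existing)
                    .setScope(HConstants.REPLICATION_SCOPE_GLOBAL)
                    .build());
        }
    }
}
```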
5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.
- It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
- An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
- Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.
In general, Cloudera suggested that HBase was used in a fair number of OEM situations.
6. Finally, I have one number: As of January 2014, there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.
Related link
- A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.
Comments
I think sorting is a very important, and relatively rare, property among NoSQL systems. The fact that records are stored sorted by their key gives important flexibility in system design. For example, an architect can compose a key from user id and time to get efficient access to all records related to a given user in a given time frame. Other systems, built on consistent hashing, cannot guarantee data proximity, so almost any query will touch all servers in the cluster.
In other words, the fact that data is sorted makes HBase a much better solution for DWH (data warehousing) than many others built on consistent hashing.
>> Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data.
Apple?
David, I am not sure what you mean by "rare." Key-sorted files are not only common in the NoSQL world, e.g. HBase or Cassandra (if you use OrderPreservingPartitioner); they are also common in RDBMSs, e.g. Index-Organized Tables in Oracle.
By "rare," I mean that the default choice for most NoSQL systems is consistent hashing, and many of them do not give us any other choice.
Cassandra, by default, uses hash partitioning to distribute data among nodes, and data is sorted only within a node.
The DataStax documentation says: "Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons … "
In a nutshell: because of hot spots.
Couchbase, Riak, and Voldemort use consistent hashing.
I do not think their respective designers are unaware of the benefits of sorting, but they are also aware of the toll: the "hot regions" problem, in HBase terms. So they chose to keep their systems simpler and focus on the "access by key" case, where sorting is not required.
In my view, only NoSQL solutions that are committed to fighting the "hot regions" problem should be considered when sorting is needed. To the best of my knowledge, only HBase does so (at least among popular NoSQL systems).
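For readers who want to see what fighting hot regions can look like in practice, here is a minimal sketch of key salting, one common mitigation; the bucket count and key layout are assumptions for illustration only.

```java
// Hedged sketch of key salting: prefix the sorted key with a small hash-derived
// salt so monotonically increasing keys spread across regions.
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {
    static final int BUCKETS = 16;  // illustrative bucket count

    // Salting by a hash of the entity keeps each entity's rows contiguous in one
    // bucket, while different entities spread across regions; a scan across all
    // entities then needs one scan per bucket.
    static byte[] saltedKey(String userId, long epochMillis) {
        byte salt = (byte) Math.floorMod(userId.hashCode(), BUCKETS);
        return Bytes.add(new byte[] { salt },
                         Bytes.toBytes(userId + "#"),
                         Bytes.toBytes(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toStringBinary(saltedKey("user-17", System.currentTimeMillis())));
    }
}
```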