August 6, 2013

Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
~250 employees.
~70-75 subscription customers.

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

10ish nodes for a typical starting cluster.
100ish nodes for a typical “data lake” committed adoption.
Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
HBase used in >50% of installations.
Hive probably even more than that.
Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

It’s been in preview/release candidate/commercial beta mode for weeks.
Q3 is the goal; H2 is the emphatic goal.
Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to last for a multi-year period, much as Hadoop 1’s have. But there was also some YARN stabilization work into May.

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year will have passed between Cloudera starting to support them and Hortonworks offering Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:

Providing a Hive-friendly execution environment in Hadoop 2.0. For example, this seems to be a main point of Tez, although Tez is meant to support Pig and so on as well. (Recall the close relationship between Hortonworks and Pig fan Yahoo.)
Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet.
Improving Hive itself, notably in:
- SQL functionality.
- Query planning and optimization.
- Vectorized execution (Microsoft seems to be helping significantly with that).

Specific notes include:

Some of the Hive improvements — e.g. SQL windowing, better query planning over MapReduce 1 — came out in May.
Others — e.g. Tez port — seem to be coming soon.
Yet others — notably a true cost-based optimizer — haven’t even been designed yet.
Hive apparently often takes 4-5 seconds to plan a query, with a lot of the problem being slowness in the metadata store. (I hope that that’s already improved in HCatalog, but I didn’t think to ask.) Hortonworks thinks 100 milliseconds would be a better number.
Other SQL functionality that got mentioned was UDFs (User Defined Functions) and sub-queries. In general, it sounds as if the Hive community is determined to someday falsify the “Hive supports a distressingly small subset of SQL” complaint.

As for ORC:

ORC manages data in 256 megabyte chunks of rows. Within such chunks, ORC is columnar.
Hortonworks asserts that ORC is ahead of Parquet in such areas as indexing and predicate pushdown, and only admits a Parquet advantage in one area: the performance advantages of being written in C.
The major contributors to ORC are Hortonworks, Microsoft, and Facebook. There are ~10 contributors in all.
ORC has a 2-tiered compression story.
- “Lightweight” type-specific compression is mandatory, for example:
  - Dictionary/tokenization, for single columns within chunks.
  - Run-length encoding for integers.
- Block-level compression on top of that is optional, via a collection of usual-suspect algorithms.
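
To make the two-tier idea concrete, here is a toy Python sketch. It is not ORC's actual on-disk format or API. The function names and the two-column (country, clicks) schema are hypothetical, and zlib over a JSON dump merely stands in for whichever block-level codec is chosen. It only shows the shape of the scheme: split a chunk of rows into columns, apply dictionary and run-length encoding per column, then optionally compress the whole result.

```python
import json
import zlib
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an integer token into a per-chunk dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    tokens: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        tokens.append(index[v])
    return dictionary, tokens


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Inverse of run_length_encode."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def encode_chunk(rows: List[Tuple[str, int]], block_compress: bool = False):
    """Encode one chunk of (country, clicks) rows column by column.

    Tier 1 (always applied): dictionary + run-length encoding per column.
    Tier 2 (optional): generic compression over the encoded chunk.
    """
    # Pivot the row-oriented chunk into columns.
    countries = [r[0] for r in rows]
    clicks = [r[1] for r in rows]

    dictionary, tokens = dictionary_encode(countries)
    chunk = {
        "country_dict": dictionary,
        "country_tokens": run_length_encode(tokens),
        "clicks": run_length_encode(clicks),
    }
    if block_compress:
        # Stand-in for the optional block-level codec.
        return zlib.compress(json.dumps(chunk).encode("utf-8"))
    return chunk
```

Calling `encode_chunk(rows)` on a handful of rows returns the per-column encodings as a dict, while `encode_chunk(rows, block_compress=True)` returns the same structure as compressed bytes. Real columnar formats add indexes, type-specific encoders, and binary layouts that this sketch leaves out.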

Finally, I asked Hortonworks what it sees as a typical or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics included:

2 x 6 = 12 cores.
12 or so disks, usually 2-3 terabytes each. 4 TB disks are beginning to show up in “outlier” cases.
Usually 72 gigs or more of RAM. 128 gigs is fairly common. 256 sometimes happens.
10GigE is showing up at some web companies, but Hortonworks groaned a bit about the expense. Hearing that, I didn’t even ask about Infiniband, its use in certain Hadoop appliances notwithstanding.
Hortonworks isn’t seeing much solid-state drive adoption yet, some NameNodes excepted. No doubt that’s a cost issue.
Hortonworks sees GPUs only for “outlier” cases.
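
For a rough sense of what those disk counts mean in capacity terms, here is a back-of-the-envelope Python sketch. It assumes HDFS's default replication factor of 3 and ignores space reserved for the operating system, temporary data, and compression, none of which Hortonworks and I discussed. The function names are my own, and the node counts are the "10ish" and "100ish" cluster sizes mentioned earlier.

```python
def node_capacity_tb(disks: int = 12, tb_per_disk: float = 2.0,
                     replication: int = 3) -> tuple:
    """Return (raw_tb, usable_tb) for one node under simple assumptions."""
    raw = disks * tb_per_disk
    # Each block is stored `replication` times, so usable space shrinks.
    return raw, raw / replication


def cluster_capacity_tb(nodes: int, disks: int = 12, tb_per_disk: float = 2.0,
                        replication: int = 3) -> tuple:
    """Return (raw_tb, usable_tb) for a cluster of identical nodes."""
    raw, usable = node_capacity_tb(disks, tb_per_disk, replication)
    return nodes * raw, nodes * usable


if __name__ == "__main__":
    for nodes, tb in [(10, 2.0), (100, 3.0)]:
        raw, usable = cluster_capacity_tb(nodes, tb_per_disk=tb)
        print(f"{nodes} nodes x 12 x {tb} TB: {raw} TB raw, {usable} TB usable")
```

Under these assumptions, a 12-disk node with 2-3 TB disks holds 24-36 TB raw, or roughly 8-12 TB of user data after replication.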

Related links

I’ve been posting quite a bit about SQL-on-Hadoop. Links can be found in my June Dan Abadi post.
When I posted in March about the great expense and difficulty of building a good DBMS, I was thinking especially of SQL-on-Hadoop.

Categories: Cloudera, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Datameer, Facebook, Greenplum, Hadoop, Hortonworks, IBM and DB2, MapR, MapReduce, Market share and customer counts, Microsoft and SQL*Server, MicroStrategy, Open source, Petabyte-scale data management, Solid-state memory, SQL/Hadoop integration, Tableau Software, Teradata, Yahoo

Comments

12 Responses to “Hortonworks, Hadoop, Stinger and Hive”

Hortonworks, Hadoop, Stinger and Hive — Tech News and Analysis on August 6th, 2013 9:11 pm

[…] Hortonworks, Hadoop, Stinger and Hive By Derrick Harris 1 min ago Aug. 6, 2013 – 6:11 PM PDT […]
sacharya on August 6th, 2013 10:10 pm

Hadoop/OpenStack integration may be closer than you may think. Check out project Savanna:
https://wiki.openstack.org/wiki/Savanna

It's being actively developed in the OpenStack community, and you can get a working Hadoop cluster deployed right now, although it's nowhere near production-ready just yet.
Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services | Big Data Cloud on August 7th, 2013 12:12 am

[…] via Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services. […]
Himanshu Bari on August 7th, 2013 6:09 pm

Thanks for the comments Curt.
Just wanted to provide more information on the OpenStack front:
– Our OpenStack integration has received good customer and partner interest for beta testing and feedback.
– Swift integration is part of the effort, but the crux of the value proposition for phase 1 is to enable provisioning of the complete Hortonworks Data Platform on OpenStack using templates, in a few clicks and in an easily repeatable fashion. Hortonworks is actively working with the OpenStack community on Project Savanna for this, and we have made great progress! For more information, check out the slides and video of our presentation and demo at Hadoop Summit. Links below:

http://www.slideshare.net/Hadoop_Summit/elterman-speidel-june26455pmhall1v2
http://www.youtube.com/watch?v=3bI1WjB-5AM&feature=youtu.be
Things I keep needing to say | DBMS 2 : DataBase Management System Services on August 12th, 2013 8:25 am

[…] The transition from Hadoop 1 to Hadoop 2 will be drastic. […]
Un verano cargado de macrodatos – resumen de noticias | BigData4Success on August 12th, 2013 7:43 pm

[…] ha dado que hablar. (1, 2, 3 y 4) Y ha sonado bastante en los medios especializados por la repentina salida de uno de sus […]
Big Data annd No SQL links | Fresh Water Perl on August 19th, 2013 6:16 am

[…] Hadoop 2 […]
Hortonworks business notes | DBMS 2 : DataBase Management System Services on August 24th, 2013 8:47 am

[…] Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack. […]
Cloudera Sentry and other security subjects | DBMS 2 : DataBase Management System Services on August 25th, 2013 11:39 am

[…] have unpleasant performance consequences. From there, I segued the discussion to Accumulo. Unlike Hortonworks, Cloudera sees Accumulo demand strictly in the Federal government, where Accumulo is baked into […]
Distinctions in SQL/Hadoop integration | DBMS 2 : DataBase Management System Services on February 9th, 2014 1:51 pm

[…] most detailed discussions of Impala and Stinger were last June and August, respectively. Categories: Cloudera, Data integration and middleware, […]
Hardware and storage notes | DBMS 2 : DataBase Management System Services on April 30th, 2014 10:05 pm

[…] one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or […]
Layering of database technology & DBMS with multiple DMLs | DBMS 2 : DataBase Management System Services on June 15th, 2014 2:02 am

[…] — Hive, Impala, Stinger, Shark and so on (including […]
