Hortonworks, Hadoop, Stinger and Hive
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14” Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for as many years as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year separates Cloudera’s first support for them from Hortonworks’ Hadoop 2.0 offering.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:
- Providing a Hive-friendly execution environment in Hadoop 2.0. For example, this seems to be a main point of Tez, although Tez is also meant to support Pig and so on as well. (Recall the close relationship between Hortonworks and Pig fan Yahoo.)
- Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet.
- Improving Hive itself, notably in:
  - SQL functionality.
  - Query planning and optimization.
  - Vectorized execution (Microsoft seems to be helping significantly with that).
Specific notes include:
- Some of the Hive improvements — e.g. SQL windowing, better query planning over MapReduce 1 — came out in May.
- Others — e.g. Tez port — seem to be coming soon.
- Yet others — notably a true cost-based optimizer — haven’t even been designed yet.
- Hive apparently often takes 4-5 seconds to plan a query, with a lot of the problem being slowness in the metadata store. (I hope that that’s already improved in HCatalog, but I didn’t think to ask.) Hortonworks thinks 100 milliseconds would be a better number.
- Other SQL functionality that got mentioned was UDFs (User Defined Functions) and sub-queries. In general, it sounds as if the Hive community is determined to someday falsify the “Hive supports a distressingly small subset of SQL” complaint.
As for ORC:
- ORC manages data in 256 megabyte chunks of rows. Within such chunks, ORC is columnar.
- Hortonworks asserts that ORC is ahead of Parquet in such areas as indexing and predicate pushdown, and only admits a Parquet advantage in one area: the performance advantages of being written in C.
- The major contributors to ORC are Hortonworks, Microsoft, and Facebook. There are ~10 contributors in all.
- ORC has a 2-tiered compression story.
  - “Lightweight” type-specific compression is mandatory, for example:
    - Dictionary/tokenization, for single columns within chunks.
    - Run-length encoding for integers.
  - Block-level compression on top of that is optional, via a collection of usual-suspect algorithms.
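To make the two “lightweight” techniques concrete, here is a minimal Python sketch of dictionary encoding followed by run-length encoding on a single column of values. It illustrates the general ideas only; it is not ORC’s actual on-disk encoding, and the function names and sample column are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer id into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> id
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Map integer ids back to the original values."""
    return [dictionary[i] for i in ids]


def rle_encode(ints):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(ints)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the full sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column from one chunk of rows.
states = ["CA", "CA", "CA", "NY", "NY", "CA", "TX", "TX"]

dictionary, ids = dictionary_encode(states)   # strings become small integers
runs = rle_encode(ids)                        # repeated integers become runs

# Decoding reverses both steps and recovers the original column.
restored = dictionary_decode(dictionary, rle_decode(runs))
```

Because the data is columnar within a chunk, similar values sit next to each other, which is what makes both steps pay off. Optional block-level compression would then be applied to the output of steps like these.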
Finally, I asked Hortonworks what it sees as a typical or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics included:
- 2 x 6 = 12 cores.
- 12 or so disks, usually 2-3 terabytes each. 4 TB disks are beginning to show up in “outlier” cases.
- Usually 72 gigs or more of RAM. 128 gigs is fairly common. 256 sometimes happens.
- 10GigE is showing up at some web companies, but Hortonworks groaned a bit about the expense. Hearing that, I didn’t even ask about Infiniband, its use in certain Hadoop appliances notwithstanding.
- Hortonworks isn’t seeing much solid-state drive adoption yet, some NameNodes excepted. No doubt that’s a cost issue.
- Hortonworks sees GPUs only for “outlier” cases.
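For a rough sense of what those disk figures mean per node, here is a back-of-the-envelope Python sketch. The replication factor of 3 is HDFS’s default and is my assumption here, not something Hortonworks stated; the calculation also ignores space set aside for the operating system, temporary data, and any compression.

```python
DISKS_PER_NODE = 12      # "12 or so disks"
DISK_SIZES_TB = (2, 3)   # "usually 2-3 terabytes each"
REPLICATION = 3          # assumption: HDFS default replication factor


def node_capacity_tb(disks, disk_tb, replication):
    """Return (raw TB, TB of unique data after replication) for one node."""
    raw = disks * disk_tb
    return raw, raw / replication


for disk_tb in DISK_SIZES_TB:
    raw, unique = node_capacity_tb(DISKS_PER_NODE, disk_tb, REPLICATION)
    print(f"{disk_tb} TB disks: {raw} TB raw, about {unique:.0f} TB of unique data per node")
```

Under those assumptions a typical node holds 24-36 TB raw, or roughly 8-12 TB of unique data.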
Related links
- I’ve been posting quite a bit about SQL-on-Hadoop. Links can be found in my June Dan Abadi post.
- When I posted in March about the great expense and difficulty of building a good DBMS, I was thinking especially of SQL-on-Hadoop.
Comments
12 Responses to “Hortonworks, Hadoop, Stinger and Hive”
Hadoop/OpenStack integration may be closer than you may think. Check out project Savanna:
https://wiki.openstack.org/wiki/Savanna
It’s being actively developed in the OpenStack community, and you can get a working Hadoop cluster deployed right now, although it’s nowhere near production-ready just yet.
Thanks for the comments, Curt.
Just wanted to provide more information on the OpenStack front:
– Our OpenStack integration has received good customer and partner interest for beta testing and feedback.
– Swift integration is part of the effort, but the crux of the value proposition for phase 1 is to enable provisioning of the complete Hortonworks Data Platform on OpenStack using templates, in a few clicks and in an easily repeatable fashion. Hortonworks is actively working with the OpenStack community on Project Savanna for this and we have made great progress! For more information, check out the slides and video of our presentation and demo at Hadoop Summit. Links below:
http://www.slideshare.net/Hadoop_Summit/elterman-speidel-june26455pmhall1v2
http://www.youtube.com/watch?v=3bI1WjB-5AM&feature=youtu.be