SQL/Hadoop integration
Discussion of SQL-on-Hadoop and other forms of SQL/Hadoop integration.
Impala and Parquet
I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet (and presumably, in the future, Impala) can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest in Parquet.
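To make the “true columnar” point above concrete, here is a toy sketch (Scala, and emphatically not Parquet’s actual code or on-disk layout) of why a columnar organization helps analytic scans: a query that touches one column never reads the others, and runs of similar values compress well.

```scala
// Toy illustration only: the same three rows, stored row-wise and column-wise.
object ColumnarSketch {
  case class PageView(userId: Long, url: String, bytes: Long)

  // Row-oriented layout: each record's fields are stored together.
  val rows = Seq(
    PageView(1L, "/a", 200L),
    PageView(2L, "/b", 4096L),
    PageView(1L, "/a", 512L)
  )

  // Column-oriented layout: each column's values are stored contiguously.
  val userIdCol: Array[Long]  = rows.map(_.userId).toArray
  val urlCol: Array[String]   = rows.map(_.url).toArray
  val bytesCol: Array[Long]   = rows.map(_.bytes).toArray

  def main(args: Array[String]): Unit = {
    // A query like "SELECT SUM(bytes)" only has to scan one array here;
    // urlCol and userIdCol are never touched.
    println(bytesCol.sum) // 4808
  }
}
```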
Dave DeWitt responds to Daniel Abadi
A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum HAWQ, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.
Read more
Categories: Benchmarks and POCs, Cloudera, Clustering, Data warehousing, Greenplum, Hadapt, Hadoop, MapReduce, Microsoft and SQL*Server, PostgreSQL, SQL/Hadoop integration | 6 Comments |
SQL-Hadoop architectures compared
The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft PolyBase.
- I argued with Dave about the differences between PolyBase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly edit and post.
I love my life.
Per Daniel (emphasis mine): Read more
Categories: Aster Data, Data warehousing, Greenplum, Hadapt, Hadoop, MapReduce, Microsoft and SQL*Server, SQL/Hadoop integration, Theory and architecture | 13 Comments |
Teradata SQL-H
As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data accessible to the same code that would work on Teradata Aster …
*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.
… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.
I hope that’s clear. 🙂
Categories: Data integration and middleware, Data warehousing, Emulation, transparency, portability, Hadoop, SQL/Hadoop integration, Teradata | 2 Comments |
DBMS development and other subjects
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact. Read more
Greenplum HAWQ
My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
and
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
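For what it’s worth, here is a minimal conceptual sketch of the invisible-loading idea, in Scala; it is not anybody’s actual implementation, Hadapt’s or HAWQ’s, and every name in it is made up. The first query over a raw HDFS file is answered straight from the raw data, while a background task quietly converts that data into the engine’s own format so that later queries run faster.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

// Purely conceptual sketch of "invisible loading"; all types are stand-ins.
object InvisibleLoadingSketch {
  // "Loaded" stands in for the engine's own (e.g. columnar) storage format.
  final case class Loaded(rows: Vector[Map[String, String]])

  private val loaded = TrieMap.empty[String, Loaded] // what has been converted so far
  private implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-in for scanning a raw file sitting in HDFS.
  private def scanRaw(path: String): Iterator[Map[String, String]] =
    Iterator(Map("url" -> "/a"), Map("url" -> "/b"))

  private def convert(path: String): Loaded = Loaded(scanRaw(path).toVector)

  // Answer from converted data if available; otherwise answer from the raw
  // scan and kick off conversion in the background.
  def query(path: String)(predicate: Map[String, String] => Boolean): Seq[Map[String, String]] =
    loaded.get(path) match {
      case Some(converted) => converted.rows.filter(predicate)
      case None =>
        Future { loaded.putIfAbsent(path, convert(path)) } // background conversion
        scanRaw(path).filter(predicate).toSeq
    }
}
```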
Categories: EMC, Greenplum, Hadapt, Hadoop, HBase, SQL/Hadoop integration | 14 Comments |
Spark, Shark, and RDDs — technology notes
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark offers a richer set of primitive operations, including map, reduce, sample, join, and group-by. You can compose these more or less in any order; a brief sketch follows at the end of this post. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
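Here is a minimal Spark sketch of the points above: primitive operations composing freely, and an RDD cached in memory rather than being persisted to disk between steps. It is written against the current org.apache.spark package names rather than the 2012-era ones, and the input path and sample data are of course made up.

```scala
import org.apache.spark.SparkContext

object RDDSketch {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster the same code runs over
    // partitioned, distributed RDDs.
    val sc = new SparkContext("local[*]", "rdd-sketch")

    // An RDD of (url, hits) pairs derived from a (hypothetical) tab-separated log.
    // cache() asks Spark to keep this RDD in memory across the steps below,
    // instead of recomputing it or re-reading the file each time.
    val hits = sc.textFile("weblog.txt")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(fields => (fields(1), 1)) // (url, 1)
      .reduceByKey(_ + _)
      .cache()

    // Primitives compose in any order: join the hit counts against a second
    // RDD of (url, owner) pairs, then re-aggregate by owner.
    val owners = sc.parallelize(Seq(("/a", "alice"), ("/b", "bob")))
    val byOwner = hits.join(owners) // (url, (hitCount, owner))
      .map { case (_, (hitCount, owner)) => (owner, hitCount) }
      .reduceByKey(_ + _)

    byOwner.collect().foreach(println)
    sc.stop()
  }
}
```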
Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration | 11 Comments |
Introduction to Spark, Shark, BDAS and AMPLab
UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
Categories: ClearStory Data, Databricks, Spark and BDAS, Hadoop, MapReduce, Parallelization, Specific users, SQL/Hadoop integration | 11 Comments |
More on Cloudera Impala
What I wrote before about Cloudera Impala was quite incomplete. After a follow-up call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
Categories: Cloudera, Data models and architecture, Data warehousing, Hadoop, HBase, MapReduce, Open source, Predictive modeling and advanced analytics, SQL/Hadoop integration | 12 Comments |
Quick notes on Impala
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.; a minimal JDBC sketch follows this list.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
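As a concrete illustration of the Hive-compatible JDBC point, here is a minimal Scala sketch of a client query against Impala. The driver class, host, port, connection options, and table are assumptions that vary by release and setup; the point is only that ordinary java.sql code and HiveQL-style SQL are what is involved.

```scala
import java.sql.DriverManager

object ImpalaJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Assumptions: the HiveServer2-style driver class and the host/port/table
    // below are placeholders; check your distribution's docs for the real ones.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/default")
    try {
      val stmt = conn.createStatement()
      // Plain HiveQL-ish SQL; per the compatibility point above, the same
      // statement text could also be sent to Hive itself.
      val rs = stmt.executeQuery(
        "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10")
      while (rs.next()) println(s"${rs.getString("url")}\t${rs.getLong("hits")}")
    } finally {
      conn.close()
    }
  }
}
```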
Beyond that: Read more