Teradata SQL-H, using HCatalog
When I grumbled about the conference-related rush of Hadoop announcements, one example of many was Teradata Aster’s SQL-H. Still, it’s an interesting idea, and a good hook for my first shot at writing about HCatalog. Indeed, other than the Talend integration bundled into Hortonworks’ HDP 1, Teradata SQL-H is the first real use of HCatalog I’m aware of.
The Teradata SQL-H idea is:
- Register your Hadoop data to HCatalog. I’ll confess to being unclear about the details of how that works, for example in the case of data that just doesn’t fit well into flat relational tables. Stay tuned for future posts. For now, I’ll just note that:
- HCatalog is closely based on Hive’s metadata management. If you’ve run Hive against the data, HCatalog should already know about it.
- HCatalog can handle Pig and HBase data as well.
- Write SQL DDL (Data Definition Language) so that your Aster cluster knows about the data (a sketch follows this list).
- Write any Teradata Aster SQL/MR against that data. Some of the execution will be done on the Hadoop cluster, but pulling data back into Aster may well be necessary.
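To make that concrete, here's a minimal sketch of the two halves. The Hive DDL is standard; the Aster half is hypothetical — I'm assuming a SQL/MR-style table function (here called hcat_table(), with made-up parameter names) as the bridge, since Teradata hasn't shown me SQL-H's actual syntax.

```sql
-- Hive side: registering the data. Because HCatalog shares Hive's
-- metastore, running this DDL is enough for HCatalog to know about
-- the table. Column names and the HDFS path are illustrative.
CREATE EXTERNAL TABLE web_logs (
  ts      STRING,
  user_id BIGINT,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';

-- Aster side: the DDL that tells the Aster cluster about the data.
-- hcat_table() and its parameters are hypothetical stand-ins for
-- whatever SQL-H actually provides; the SQL/MR-style invocation is
-- just my assumption about the shape of the syntax.
CREATE VIEW web_logs_h AS
SELECT *
FROM hcat_table(
  ON (SELECT 1)
  SERVER('hcat-host:9083')   -- hypothetical: HCatalog metastore endpoint
  DBNAME('default')          -- hypothetical: Hive database name
  TABLENAME('web_logs')      -- hypothetical: the table registered above
);
```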
At least in theory, Teradata SQL-H lets you use a full set of analytic tools against your Hadoop data, with little limitation except price and/or performance. Teradata thinks the performance of all this can be much better than if you just use Hadoop (35X was mentioned in one particularly favorable example), but perhaps much worse than if you just copy/extract the data to an Aster cluster in the first place.
So what might the use cases be for something like SQL-H? Offhand, I’d say:
- SQL-H use cases are probably focused in areas where copying the data to Aster in advance doesn’t make a lot of sense. So presumably …
- … the Hadoop clusters involved would hold a lot more data than you’d want to pay for storing in Teradata Aster. E.g., think of cases where Hadoop is used as a big bit bucket or archival data store.
- There could be a kind of investigative workflow (sketched below). First you play around with the Hadoop data via SQL-H. Then, when you think you're onto something, you set up ETL (Extract/Transform/Load) to get the data into Aster and ratchet up the effort.
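Here's a hedged sketch of that workflow, reusing the hypothetical web_logs_h view from above; the DISTRIBUTE BY clause is an assumption about Aster's CREATE TABLE syntax.

```sql
-- Step 1, exploration: query the Hadoop-resident data in place via
-- SQL-H; only the rows the query needs ever reach the Aster side.
SELECT url, COUNT(*) AS hits
FROM web_logs_h
WHERE ts >= '2012-07-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 100;

-- Step 2, commitment: once the exploration pans out, copy the
-- interesting slice into a native Aster table and iterate there.
-- DISTRIBUTE BY HASH is my assumption about the required syntax.
CREATE TABLE web_logs_jul
DISTRIBUTE BY HASH(user_id) AS
SELECT * FROM web_logs_h
WHERE ts >= '2012-07-01';
```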
By way of contrast, the whole thing makes less sense for dashboarding kinds of uses, unless the dashboard users are very patient when they want to drill down.
Comments
I think the days of good old MPP databases are over (at least when we talk about "big data analytics"). All attempts to marry Teradata, Aster, Greenplum, etc. with Hadoop look unnatural. The combination of Hive and R probably covers more than 90% of all possible use cases in analytical data processing: extract a sample of the data from the Hadoop cluster using Hive and run R scripts on that sample. I am not aware of a single use case where processing 100% of the data (terabytes and petabytes) is a MUST.
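For what it's worth, the Hive half of the workflow the commenter describes can be as simple as the following (the table names are made up; rand() sampling is one common approach):

```sql
-- Pull a ~1% random sample of the full Hadoop-resident table into
-- a small table that an R script can then read and analyze.
CREATE TABLE web_logs_sample AS
SELECT *
FROM web_logs
WHERE rand() < 0.01;
```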
HCatalog is a "dictionary" or "catalog". It's used to store the metadata about the structure of data, not the data itself. In this way, Pig and all other implementations can map structure at runtime. Say what you want about "unstructured" data, but the vast majority of applications bind a structure to the underlying data so it can be consumed… this just makes that declaration portable across platforms. And who is using it? Ask the guys at Yahoo. Indispensable.
Hi Vlad,
I work at Teradata Aster and I appreciate your comments.
We are very customer driven. We’ve talked to many Hadoop customers before developing the SQL-H functionality.
Extracting samples and using R may work for some use cases, but the majority of enterprise Hadoop customers want a scalable way to do SQL & BI processing on their Hadoop data. Also, not everyone is willing to move to R, given the broad adoption of SQL-based tools.
I also understand that R breaks down in the gigabyte range, which is too little (let me know if you've heard otherwise).
Thanks,
Cesar
Hi Cesar,
Thanks for commenting!
Your “gigabyte range” figure for R breaking down sounds very odd to me. R assumes all data is in memory, which might be what you’re thinking of. But various vendors try to work around even that limitation.
Thanks Curt for the info, it makes sense. Thanks also for writing this note. Regards.