IBM’s ETL
Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL (Extract/Transform/Load) technology, and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:
- Informatica
- IBM/Ascential
- Ab Initio
However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required.
IBM wanted to follow up about its stance on Hadoop-based ETL, which may roughly be summarized as:
- Obviously, HDFS (Hadoop Distributed File System) is an increasingly important target for ETL.
- Also, there are some workloads for which ETL and Hadoop-based analytic processing are so interwoven that ETL — or rather ELT/ETLT — should be done on Hadoop.
- For that reason alone, it makes sense to support Hadoop as an execution engine, much as other vendors do and Informatica will.
- However, IBM is in no rush to offer such support, because IBM’s own ETL engine has great parallel performance as it stands.
I.e., IBM is in effect saying “Those other guys have to rush to Hadoop so they can do parallel ETL at all, but we don’t have that problem.” Indeed, IBM says that its users today run ETL jobs with 100s of sub-jobs, which might be equivalent to 10s of MapReduce steps.
Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology. But some basics include:
- There’s a coordinator process that breaks a job up, and perhaps compiles it, into a set of sub-jobs/sub-flows.
- But there’s no head node; data is piped directly from execution node to execution node as needed.
- Data tends to be distributed among nodes in line with the next join key.
- The system tries not to spill intermediate results back to disk. Examples of when that might not quite work out include:
  - Sorts.
  - Cases where different parts of the process finish at significantly different times.
All that makes sense by analogy to how scale-out analytic RDBMS are designed.
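To make the "distributed in line with the next join key" idea concrete, here is a minimal sketch of hash-partitioning two inputs on a join key so that each node can join its own slice without talking to the others. This is purely illustrative and is not IBM's code or API; the function names (`partition_by_key`, `local_hash_join`, `parallel_join`), the row-as-dict representation, and the use of CRC32 as the partitioning hash are all my assumptions, and the "nodes" are just in-memory lists.

```python
import zlib
from collections import defaultdict


def partition_by_key(rows, key, num_nodes):
    """Route each row to a (simulated) node chosen by hashing its join-key value.

    Hypothetical helper: a real engine would ship rows over the network
    rather than append them to local lists.
    """
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


def local_hash_join(left, right, key):
    """Join two co-located partitions on `key`; no other node is consulted."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


def parallel_join(left_rows, right_rows, key, num_nodes=4):
    """Partition both inputs on the join key, then join partition by partition."""
    left_parts = partition_by_key(left_rows, key, num_nodes)
    right_parts = partition_by_key(right_rows, key, num_nodes)
    joined = []
    # Each (lp, rp) pair stands in for the work one node would do on its own.
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(local_hash_join(lp, rp, key))
    return joined


customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]
orders = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 1, "amount": 7},
    {"cust_id": 3, "amount": 1},
]
result = parallel_join(customers, orders, "cust_id", num_nodes=3)
```

Because both inputs are partitioned with the same hash on the same key, matching rows always land on the same node, which is the same reasoning scale-out analytic RDBMS use when they redistribute data ahead of a join.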
Relevant history and names seem to include:
- IBM InfoSphere Information Server, or something like that.
- Ascential, a data integration company IBM bought some years ago.
- DataStage, Ascential’s main product name.
- Torrent, a company Ascential bought 6-7 years ago, which provided the architecture for what IBM sells in most new ETL deals.
- PX, which I gather is the name of that architecture.
Also, it may or may not be interesting to note that:
- When independent, Ascential was juggling several different ETL engines, due to acquisition or whatever.
- With various company name changes, Ascential more or less spun into and then back out of Informix.
- IBM acquired Informix and then Ascential in two separate and apparently unrelated deals.
Comments
7 Responses to “IBM’s ETL”
Some links to the history
http://it.toolbox.com/blogs/infosphere/lee-scheffler-interview-the-ghost-of-datastage-past-8819
And about Orchestrate
http://www.pr3systems.com/Operator_Combination_and_Control.pdf
Geordee,
That first historical link is outstanding. I didn’t know DataStage was based on Pick, but of course that makes sense given the company history.
Thanks!
CAM
Curt,
I love the statement “Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology.”
I assume you chose the words very carefully.
Max
PX stood for “Parallel Extender” originally, which was the way the product family complexity was managed, i.e., you could buy original DataStage, or DataStage with PX for much more money.
This PX meant the Torrent Orchestrate engine (Orchestrate(TM) was the actual name of the technology), added as an alternative and mostly compatible back-end to DataStage.
The original DataStage (not PX) technology did evolve out of Pick stuff, but I think that’s really a little too thin a way to think about it. It’s a pretty rich product itself.
Later I do think PX became the informal IBM term for the scalable backend, or the product with the scalable backend, somewhat ambiguously.
Is there any information on the architecture anywhere, particularly on scaling of job processes?