August 26, 2008

Three approaches to parallelizing data transformation

Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). That is, the needed data transformations are done on the MPP system, rather than on the (probably SMP) system the data comes from.* If the data transformation is applied on a record-by-record basis, then it’s automatically fully parallelized, because each record can be processed on whichever node holds it. Even if the transforms are more complex, considerable parallel processing may still be going on.

*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
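To illustrate why record-by-record transforms parallelize so easily, here is a minimal Python sketch. It is not any vendor's implementation: the `transform` function, the field names, and the round-robin partitioning are all hypothetical, and a thread pool merely stands in for the nodes of an MPP system. The point is only that each partition can be transformed without looking at any other.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record transform: clean the name, convert cents to dollars."""
    return {
        "name": record["name"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def partition(records, n):
    """Round-robin split into n partitions, standing in for rows spread across nodes."""
    return [records[i::n] for i in range(n)]


def transform_partition(part):
    """Transform one partition; needs no data from any other partition."""
    return [transform(r) for r in part]


def parallel_transform(records, workers=4):
    """Transform every partition independently, then concatenate the results."""
    parts = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_partition, parts)
    # Output order follows the partitioning, not the original input order.
    return [row for part in results for row in part]
```

Because `transform` depends only on its own input record, the result contains the same rows as a serial loop, whatever the number of partitions. A transform that needs other records (a join or an aggregate, say) would not have that property and would require data movement between partitions.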

But depending on your needs, at least two other approaches to data transformation parallelization could also be considered. Pervasive Software, which has a big data integration software business of its own, built a new ETL tool. Its foundation was a middle-tier, multi-core-friendly Java dataflow engine, which has now been split out as Pervasive DataRush. The product is in the early stages of being released, which may be a good excuse for the website confusingly suggesting both of the following:

You can have DataRush for free.
If DataRush doesn’t produce a 30X speedup for you, you can get your money back.

The third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts it:

Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.
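As a rough illustration of how a cleansing and standardization job fits the MapReduce pattern, here is a plain-Python sketch. It does not use Aster's SQL/MR interface or any real MapReduce framework; the record fields, the function names, and the "keep the longest name" merge rule are all assumptions made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value): the key is a standardized email, the value a cleaned record."""
    email = record["email"].strip().lower()
    yield email, {"email": email, "name": record["name"].strip().title()}


def shuffle(pairs):
    """Group values by key, as a MapReduce framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse duplicates for one key (hypothetical rule: keep the longest name)."""
    return max(values, key=lambda v: len(v["name"]))


def run(records):
    """Run map, shuffle, and reduce in a single process, in key order."""
    pairs = (pair for rec in records for pair in map_phase(rec))
    groups = shuffle(pairs)
    return [reduce_phase(k, vs) for k, vs in sorted(groups.items())]
```

The map step is record-by-record, so it parallelizes trivially. The reduce step handles one key at a time, so it can be spread across nodes once the shuffle has brought matching records together.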

Some of our recent links about MapReduce

Categories: Aster Data, Data integration and middleware, Data warehousing, EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization, Pervasive Software


Comments

8 Responses to “Three approaches to parallelizing data transformation”

MapReduce links | DBMS2 -- DataBase Management System Services on August 27th, 2008 5:20 am

[…] Another application of MapReduce […]
Yves de Montcheuil on August 28th, 2008 4:24 am

Curt, on this topic, I would like to point you to Talend, the first open source data integration software. Talend Open Studio is also the first solution to support both the ETL and ELT approaches natively – and of course the ETLT approach as well.
Unlike tools like Sunopsis (now Oracle Data Integrator), arguably the pioneer of ELT, and engine-based tools such as Informatica or DataStage that support only ETL (ELT is only an afterthought), Talend supports both approaches natively, always providing the best performance.
More info on Talend Open Studio: http://www.talend.com/products-data-integration/talend-open-studio.php

Yves @ Talend
Luke Lonergan on August 28th, 2008 1:15 pm

Talend is good stuff – they have a very active development team on the forefront of ELT pushdowns. Two thumbs up!
Why MapReduce matters to SQL data warehousing | DBMS2 -- DataBase Management System Services on August 28th, 2008 2:45 pm

[…] Another application of MapReduce […]
Infology.Ru » Blog Archive » Three approaches to parallelizing the data transformation process on September 29th, 2008 5:22 pm

[…] Author: Curt Monash Original publication date: 2008-08-26 Translation: Oleg Kuzmenko Source: Curt Monash's blog […]
Infology.Ru » Blog Archive » Why does MapReduce matter so much to data warehouses? on October 5th, 2008 2:59 am

[…] Another application of MapReduce […]
Pervasive DataRush | DBMS2 -- DataBase Management System Services on January 7th, 2009 9:21 pm

[…] made a few references to Pervasive DataRush in the past — like this one — but I’ve never gotten around to seriously writing it up. I’ll now try to […]
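ETL and parallel execution | Alex's personal blog on January 13th, 2009 1:03 am

[…] can be scaled very easily. And in another article I read that Pervasive Software also offers a commercial programmable ETL API that makes it easy to run ETL tasks in parallel, […]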
