Three approaches to parallelizing data transformation
Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
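To make the record-by-record point concrete, here is a minimal Python sketch. It is not any vendor's implementation: the field names, the round-robin `partition` helper, and the use of a thread pool as a stand-in for MPP nodes are all assumptions made for illustration. The point is only that when a transform looks at one record at a time, each partition can be processed with no knowledge of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_record(record):
    """Hypothetical per-record transform: normalize a few illustrative fields."""
    return {
        "customer_id": int(record["customer_id"]),
        "country": record["country"].strip().upper(),
        "amount_cents": round(float(record["amount"]) * 100),
    }


def partition(records, n_partitions):
    """Split rows round-robin, roughly as an MPP system might spread them over nodes."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def transform_partition(rows):
    # Each partition is handled without reference to any other partition,
    # which is what makes a record-by-record transform trivially parallel.
    return [transform_record(r) for r in rows]


def parallel_transform(records, n_partitions=4):
    """Transform all partitions concurrently; worker threads stand in for nodes."""
    parts = partition(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(transform_partition, parts)
    # Flatten the per-partition results back into one list of rows.
    return [row for part in results for row in part]
```

Because the rows are dealt out round-robin and then concatenated partition by partition, the output order generally differs from the input order. That is harmless for a load into a table, but worth remembering if anything downstream depends on ordering.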
But depending on your needs, at least two other approaches to data transformation parallelization could also be considered. The second comes from Pervasive Software, which has a big data integration software business of its own and built a new ETL tool. The foundation was a middle-tier, multi-core-friendly Java dataflow engine, which has now been split out as Pervasive DataRush. The product is in the early stages of being released, which may be a good excuse for the website confusingly suggesting both of:
- You can have DataRush for free.
- If DataRush doesn’t produce a 30X speedup for you, you can get your money back.
The third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts it:
Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.
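As a rough illustration of cleansing and standardization expressed in MapReduce terms, here is a small single-process Python sketch. It is not Aster's SQL/MR API, and every name in it (`map_standardize`, `reduce_latest`, `run_mapreduce`, and the record fields) is hypothetical. The map step standardizes each record independently and emits it under a cleaned key. The reduce step then collapses duplicates that share a key.

```python
from collections import defaultdict


def map_standardize(record):
    """Map step: clean one raw record and emit it under a standardized key."""
    email = record["email"].strip().lower()
    name = " ".join(record["name"].split()).title()
    yield email, {"email": email, "name": name, "updated": record["updated"]}


def reduce_latest(key, values):
    """Reduce step: of all cleaned records sharing a key, keep the newest.

    Assumes 'updated' is an ISO date string, so string order is date order.
    """
    return max(values, key=lambda v: v["updated"])


def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        # The map phase touches one record at a time, so it could run anywhere.
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are processed in sorted order only to make the output deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]
```

In a real system the driver would be the MapReduce framework, spreading the map calls and the per-key reduce calls across nodes. The sketch keeps everything in one process so that the division of labor between the two steps is easy to see.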
Some of our recent links about MapReduce
- The integration of MapReduce with SQL data warehousing
- Three major applications of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce
Comments
8 Responses to “Three approaches to parallelizing data transformation”
[…] Another application of MapReduce […]
Curt, on this topic, I would like to point you to Talend, the first open source data integration software. Talend Open Studio is also the first solution to support both the ETL and ELT approaches natively – and of course the ETLT approach as well.
Unlike tools like Sunopsis (now Oracle Data Integrator), arguably the pioneer of ELT, and engine-based tools such as Informatica or DataStage that support only ETL (ELT is only an afterthought), Talend supports both approaches natively, always providing the best performance.
More info on Talend Open Studio: http://www.talend.com/products-data-integration/talend-open-studio.php
Yves @ Talend
Talend is good stuff – they have a very active development team on the forefront of ELT pushdowns. Two thumbs up!
[…] Author: Curt Monash. Original publication date: 2008-08-26. Translation: Oleg Kuzmenko. Source: Curt Monash’s blog […]
[…] made a few references to Pervasive DataRush in the past — like this one — but I’ve never gotten around to seriously writing it up. I’ll now try to […]
[…] can be scaled easily. In another article I read that Pervasive Software also offers a commercial programmatic ETL API that makes it easy to run ETL tasks in parallel, […]