May 12, 2011

Data integration vendors and Hadoop

There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:

Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.
ETL vendors have development environments that let you specify/script/whatever ETL jobs. ETL vendors want those development tools to be usable for building ETL processes that execute via MapReduce/Hadoop.
In particular, this allows ETL vendors to exploit the parallel-processing capabilities of MapReduce.
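
To make the parallelism point concrete, here is a minimal sketch of an ETL-style job expressed as map and reduce phases. It runs in a single Python process and does not use any Hadoop or vendor API; the sample records, the field layout, and every function name (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made up for illustration. The idea it tries to show is that the map step touches one record at a time, so a framework could run it on many nodes at once, and the reduce step only needs the values for one key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input: raw delimited lines, as they might sit in a file in HDFS.
RAW_LINES = [
    "2011-05-01,us,12.50",
    "2011-05-01,de,7.25",
    "bad record",
    "2011-05-02,us,3.00",
]


def map_phase(line: str) -> Iterator[Tuple[str, float]]:
    """Extract and transform one record, emitting (country, amount)."""
    parts = line.split(",")
    if len(parts) != 3:
        return  # drop malformed records instead of failing the job
    _date, country, amount = parts
    yield country.upper(), float(amount)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group mapped values by key, standing in for the framework's shuffle."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[float]) -> Tuple[str, float]:
    """Aggregate all values for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle, and reduce in sequence and return the loadable result."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(RAW_LINES))
```

In a real deployment the framework, not this script, would split the input, schedule the map and reduce tasks across nodes, and write the result back out; a vendor's development tool would presumably generate the equivalent of `map_phase` and `reduce_phase` from a graphical or scripted job definition.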

Some additional twists include:

Pentaho announced business intelligence and ETL for Hadoop last year.
Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)
Syncsort is considering replicating some Hive functionality, starting with joins, hopefully running much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)
SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.

Finally, my former clients at Pervasive, who haven’t briefed me for a while, seem to have told Doug Henschen that they have pointed DataRush at MapReduce.* However, I couldn’t find evidence of same on the Pervasive DataRush website beyond some help in using all the cores on any one Hadoop node.

*Also see that article because it names a bunch of ETL vendors doing Hadoop-related things.

Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Parallelization, Pentaho, Pervasive Software, SnapLogic, Syncsort

Comments

One Response to “Data integration vendors and Hadoop”

Keith Kohl on May 16th, 2011 7:42 am

Thanks for the blog post, Curt. It’s very interesting to hear what you’re seeing in the market.

By way of an update, our sort “plug-in” contribution to the Hadoop community is progressing. We have received several suggestions from the community, which we are working to incorporate. The community also asked that we use the new MR framework that is being proposed, and we are investigating this. As you note, we’ll be providing Hive-like functionality, with users creating MR processes graphically with DMExpress. This should significantly simplify MR development. As we’ve discussed, DMExpress’s biggest strength is large-file processing throughput. The DMExpress engine will execute the data processing within the Hadoop framework. The net result is much easier development in a graphical environment and extremely fast throughput.

In addition, Syncsort is working to benchmark higher volumes on a larger cluster than what we announced on May 4, to demonstrate what we strongly believe to be higher scalability. We are working with comScore to do this. Stay tuned…

–Keith Kohl
