Data integration vendors and Hadoop
There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:
- Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.
- ETL vendors have development environments that let you specify/script/whatever ETL jobs. They want those tools to be usable for developing ETL processes that execute via MapReduce/Hadoop.
- In particular, this allows ETL vendors to exploit the parallel-processing capabilities of MapReduce.
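To make the "ETL executed via MapReduce" idea concrete, here is a minimal pure-Python sketch of a small ETL step written as a map function plus a reduce function. It is an illustration only: it does not use Hadoop or any vendor's tooling, and the record layout, `RAW_LINES`, `map_record`, `reduce_key`, and `run_job` are all hypothetical names invented for the example.

```python
from collections import defaultdict

# Hypothetical raw records, standing in for lines read from files in HDFS.
# Assumed layout: date,country,amount
RAW_LINES = [
    "2011-05-01,us,19.99",
    "2011-05-01,de,5.00",
    "bad line",
    "2011-05-02,us,7.50",
]


def map_record(line):
    """Extract and transform: parse one line, emit (key, value) pairs."""
    parts = line.split(",")
    if len(parts) != 3:
        return  # skip malformed records
    _date, country, amount = parts
    try:
        yield country.upper(), float(amount)
    except ValueError:
        return  # skip records whose amount is not numeric


def reduce_key(key, values):
    """Aggregate: total the amounts collected for one key."""
    return key, round(sum(values), 2)


def run_job(lines):
    """Run map, group by key, then reduce, all in one process.

    On a real cluster the map and reduce calls would be spread across
    nodes; here they run sequentially so the data flow is easy to follow.
    """
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return [reduce_key(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    for country, total in run_job(RAW_LINES):
        print(country, total)
```

Because each call to `map_record` depends only on its own input line, and each call to `reduce_key` depends only on one key's values, the work can be split across many machines. That independence is the parallelism the ETL vendors are hoping to exploit.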
Some additional twists include:
- Pentaho announced business intelligence and ETL for Hadoop last year.
- Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)
- Syncsort is considering replicating some Hive functionality, starting with joins, hopefully running much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)
- SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.
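As a rough illustration of what "pluggable sort" could mean, the sketch below shows a toy shuffle step that takes its sort routine as a parameter. This is not Hadoop's API and says nothing about Syncsort's actual technology; `shuffle`, `sort_fn`, and `merge_sort` are hypothetical names, and the merge sort is just a stand-in for "some alternative sort implementation."

```python
def shuffle(pairs, sort_fn=sorted):
    """Group mapper output by key, ordering it with a caller-supplied sort.

    sort_fn is the pluggable part: any callable that takes an iterable of
    (key, value) pairs and returns them as a sorted list.
    """
    grouped = []
    for key, value in sort_fn(pairs):
        if grouped and grouped[-1][0] == key:
            grouped[-1][1].append(value)
        else:
            grouped.append((key, [value]))
    return grouped


def merge_sort(items):
    """Stand-in alternative sort: a plain top-down merge sort."""
    items = list(items)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


if __name__ == "__main__":
    mapper_output = [("b", 2), ("a", 1), ("b", 1)]
    print(shuffle(mapper_output))                      # default sort
    print(shuffle(mapper_output, sort_fn=merge_sort))  # swapped-in sort
```

The point of the design is that the grouping logic never needs to know which sort it was handed, so a faster implementation can be substituted without changing the rest of the job.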
Finally, my former clients at Pervasive, who haven’t briefed me for a while, seem to have told Doug Henschen that they have pointed DataRush at MapReduce.* However, I couldn’t find evidence of same on the Pervasive DataRush website beyond some help in using all the cores on any one Hadoop node.
*Also see that article because it names a bunch of ETL vendors doing Hadoop-related things.
Comments
One Response to “Data integration vendors and Hadoop”
Thanks for the blog post, Curt. It’s very interesting to hear what you’re seeing in the market.
By way of an update, our sort “plug-in” contribution to the Hadoop community is progressing. We have received several suggestions from the community, which we are working to incorporate. The community also asked that we use the new MR framework that is being proposed, and we are investigating this. As you note, we’ll be providing Hive-like functionality, with users creating MR processes graphically with DMExpress. This should significantly simplify MR development. As we’ve discussed, DMExpress’s biggest strength is large-file processing throughput. The DMExpress engine will execute the data processing within the Hadoop framework. The net result is much easier development in a graphical environment and extremely fast throughput.
In addition, Syncsort is working to benchmark higher volumes on a larger cluster than what we announced on May 4, to demonstrate what we strongly believe to be higher scalability. We are working with comScore to do this. Stay tuned…
–Keith Kohl