Syncsort extends Hadoop MapReduce
My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely-related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already. 🙂
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
More details can be found in a slide deck Syncsort graciously allowed me to post.
And just to be clear:
- Syncsort DMX-h ETL is not focused on getting data in and out of HDFS (Hadoop Distributed File System). Rather, it uses Hadoop to support generic ETL.
- Syncsort DMX-h ETL does not invoke Hive or Pig. Rather, it’s based on Java call-outs to DMX.
Let’s turn now to the Syncsort Hadoop patch, which:
- Was primarily designed for Hadoop 2, and was adopted into Hadoop 2.0.3 in January.
- Has been backported to work with Hadoop 1 and now ships with the Cloudera Hadoop distribution.
- Is somewhat confusingly referred to by Syncsort as “pluggable sort”.
Both versions of DMX-h depend upon this patch.
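Concretely, in Hadoop 2 these hooks surface as job configuration properties naming the map-side output collector and reduce-side shuffle consumer classes. A hedged sketch of what selecting an alternative implementation looks like (the property names are the Hadoop 2 plugin points as I understand them; the com.example class names are hypothetical placeholders, not Syncsort’s actual classes):

```xml
<!-- mapred-site.xml or per-job configuration, Hadoop 2.x.
     The com.example class names below are placeholders. -->
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <!-- default: org.apache.hadoop.mapred.MapTask$MapOutputBuffer -->
  <value>com.example.AlternativeMapOutputCollector</value>
</property>
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <!-- default: org.apache.hadoop.mapreduce.task.reduce.Shuffle -->
  <value>com.example.AlternativeShufflePlugin</value>
</property>
```

The plugin classes themselves are Java implementations of Hadoop-internal interfaces; swapping them in is how a vendor can substitute its own sort (or skip one) without forking Hadoop.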
The point of the Syncsort Hadoop patch is to let you interrupt Map and Reduce steps at the points where they expect to perform a sort. You may then invoke a different algorithm or program altogether. This offers two kinds of potential benefits:
- Performance, for example via:
- An alternative sort algorithm, e.g. Syncsort’s. (This is the idea behind DMX-h Sort Edition.)
- Not doing a (full) sort at all, but rather returning for example:
- Top N results
- A count
- Functionality, for example via the various ETL capabilities of DMX — which is of course the idea behind DMX-h ETL Edition.
I am curious as to whether other functionality use cases will emerge.
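To illustrate the “Top N instead of a full sort” idea in the abstract (this is plain Python for exposition, not DMX or Hadoop code): a bounded min-heap finds the N largest values in O(n log N) time, versus O(n log n) for sorting everything.

```python
import heapq

def top_n(records, n):
    """Return the n largest values without sorting the whole input.

    Maintains a min-heap of at most n elements: each incoming record
    either displaces the smallest current survivor or is discarded.
    Cost is O(len(records) * log n), versus O(len(records) * log len(records))
    for sorted(records)[-n:].
    """
    heap = []
    for r in records:
        if len(heap) < n:
            heapq.heappush(heap, r)
        elif r > heap[0]:
            heapq.heapreplace(heap, r)
    # Only the n survivors ever need a total order.
    return sorted(heap, reverse=True)

data = [17, 3, 42, 8, 23, 15, 4, 16]
print(top_n(data, 3))  # [42, 23, 17]
```

The same reasoning applies to counts and other aggregates: if the consumer of a Map or Reduce step’s output never needed a total order, replacing the sort with a cheaper computation is a straightforward win.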
Comments
Syncsort gen1 made an outstanding mainframe sort/merge package. It is also a great preprocessing tool for large extract data sets such as for DW to speed ETL. The world has changed. The goal of packages of this ilk was to optimize RAM use and sequential read/writes – which are issues that no longer exist.
ETL tools (with the occasional exception of Ab Initio) are a confusing product category in the current market. They are substantially slower than hand-written code (again due to fast CPUs with lots of RAM). The pitch gets reduced to:
– develop using *limited* developers. There is occasional value in non-programmer analysts touching code, but ETL programmers are generally not productive or valuable – this is mostly a management/sales pitch strategy.
– metadata with lineage. This is generally only useful in projects with under 200 developers with strong management.
Net-net – Syncsort seems to be jumping from a dead market to a hopeless position in a frozen one.
oops – should have said OVER 200 developers
Aaron,
If you’re suggesting that ETL tools can in some cases be like second-rate DBMS, you have a point.
Still, a large portion of Hadoop use is for what could be called ETL. So the story “Do what you were doing before for the non-relational parts, and we’ll help you with the relational ones” isn’t entirely crazy.
My issue is that, now that CPU is so much faster than IO and plentiful in Hadoop environments, how can there be a product that optimizes sort/merge specifically?
This seems to be a cure without a disease.
If this is a specific RDBMS-Hadoop adaptor, I still don’t see the value. If it is pointed RDBMS-to-Hadoop, the RDBMS is likely the bottleneck, so why have a separate tool for that connection? If it is Hadoop-to-RDBMS, what is the value vs. Sqoop?
I guess there may be some optimizations, but it seems such a point product that it may be more useful licensed to RDBMS vendors.
If much of the work is already being done in Java/XML/shell commands, this really probably shouldn’t be packaged as an ETL tool, but rather as a superadaptor or an ESB.
Aaron,
If you’re saying that nothing is ever CPU-bound any more, I don’t know what you’re talking about.
I think you missed the point. The CPU needs of a sorting algorithm are proportionate to the amount of data. Better sorting in large memory systems is unlikely to reduce IO, but may reduce CPU.
Historically, sorting was constrained by both IO and CPU. CPU speed/density kept increasing faster than IO, and by ~1990, most sorting became IO bound.
In shared systems doing Hadoop-style work today, I can’t remember any case where sorting commonly became CPU bound.
My point is that this is a treatment in need of a disease. If the vendor has a counter argument – I would be interested in hearing it.
I wouldn’t say there is no return on trying to better utilize the CPU on sorts etc.
There is a tremendous return on better CPU utilization from a datacentre cost perspective (power, heat). Furthermore, modern CPU architecture requires a particular bent of mind for your algorithms; and returns can be quite stunning:
MonetDB Paper: http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf
Apache Tez (a new runtime) incorporates many, many sort optimizations for MR-like applications, interactive SQL queries etc. – again, we see lots of return there.
Hope this helps.
Thanks Arun – that is a seminal paper, and my point is also made by the authors there:
“we point out that the “artificially” high bandwidths generated by MonetDB/MIL make it harder to scale the system to disk-based problems efficiently, simply because memory bandwidth tends to be much larger (and cheaper) than I/O bandwidth.” (And, BTW, sad how little Volcano has integrated with Hadoop.)
It is instructive to look at Hadoop system sorting. I am amazed at how little large scale sorting (even 10GB or more) actually happens in ones I look at. This is likely relevant to the huge effort that is going on in relational::Hadoop integration (which may be more targeted at preserving existing license streams than user value), but good sorting can obviate the need for many of those constructs.
I’ve been following Tez with interest. DAGs both for workflow and for data relationship modeling and algorithms based on that are clearly a key path for Hadoop to do unique work.