Introduction to Syncsort and DMExpress
Let’s start with some Syncsort basics.
- Syncsort was founded in 1968.
- As you might guess from its name and age, Syncsort started out selling software for IBM mainframes, used for sorting data. However, for the past 30 or so years, Syncsort’s products have gone beyond sort to also do join, aggregation, and merge. This was the basis for Syncsort’s expansion into the more general ETL (Extract/Transform/Load) business.
- As you might further guess, along the way there was a port to UNIX, development of a GUI (Graphical User Interface), and a change of ownership as Syncsort’s founder more or less cashed out.
- At this point, Syncsort sees itself primarily as a data integration/ETL company, whose main claim to fame is performance, with further claims of linear scaling and no manual tuning.*
One of Syncsort’s favorite value propositions is to contrast the cost of doing ETL in Syncsort, on commodity hardware, to the cost of doing ELT (Extract/Load/Transform) on high-end Teradata gear.
*I forget whether Syncsort actually bothered to say “almost” when making those claims, but one should of course assume the word is in there.
Syncsort general highlights include:
- Syncsort now has about 350 employees and $100 million in revenue.
- Syncsort’s company reboot occurred in April, 2008. Syncsort’s founder was largely bought out by investors, and new management started coming in.
- Syncsort says it has three main businesses:
  - Data protection: this is the smallest Syncsort business, and I didn’t ask further about it.
  - Mainframe sort, apparently now under the product name MFX rather than Syncsort. (However, Syncsort says that for the past 30 years its sort products have done more than sort, specifically also join, aggregation, and merge. That’s the basis for the whole move to ETL.)
  - Data integration, which I think really means “open systems rather than mainframe.” This is the biggest part of the whole.
- Syncsort’s main data integration product is called DMExpress. There are also a bunch of installations of a legacy UNIX product called (you guessed it!) Syncsort.
- There are about 900 DMExpress customers. Syncsort guesses that around 60% of them are using DMExpress for more than just sorting.
- Syncsort cites its main data integration competitors as being, no surprise, Ab Initio (their favorite choice to compare themselves to), Informatica (with PowerCenter), and IBM (with DataStage). Syncsort evidently also sees Talend reasonably often, Pervasive rarely, and expressor never.
The high-level technology picture for Syncsort DMExpress is:
- DMExpress is focused on big-batch loading, not low-latency streaming. In theory one could fire up DMExpress every 10-15 seconds, but Syncsort didn’t make that sound like a common use case.
- Core competencies of DMExpress seem to include sorting, aggregation, joins, merging, compression, and DBMS-style optimization.
- Syncsort asserts that 80% of the cycles in ETL are taken up with sorting and aggregation, and hence that any advantages in performance or scalability DMExpress has in those areas translate to general performance and scalability advantages in ETL.
- Syncsort DMExpress runs on dedicated boxes, with fast direct-attached storage; 15,000 RPM disks are common, and Syncsort wishes it could persuade more of its customers to use solid-state drives instead. Syncsort believes DMExpress is at its most differentiated when buffers overflow and swapping is needed.
- DMExpress compression is just gzip. Syncsort asserts that sorting makes a big difference in gzip’s compression ratio.
- DMExpress is faster on Linux than on Unix.
- In general, Syncsort makes a big deal out of using as few CPU cycles as possible. (Given the product’s history, that makes sense.) Syncsort’s core performance claim is that DMExpress handles data at close to raw I/O rates.
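The claim above that sorting helps gzip is easy to sanity-check on a toy data set. The following is a minimal sketch, not anything from Syncsort: the record layout, row counts, and function names are all made up for illustration, and real ETL data will behave differently in degree.

```python
import gzip
import random


def make_records(n_rows=20_000, n_keys=200, seed=0):
    """Build hypothetical CSV-like rows drawn at random from a small key set."""
    rng = random.Random(seed)
    keys = [
        f"customer-{k:06d},region-{k % 17:02d},status-active"
        for k in range(n_keys)
    ]
    return [rng.choice(keys) for _ in range(n_rows)]


def gzip_size(rows):
    """Return the gzip-compressed size, in bytes, of the rows joined by newlines."""
    payload = ("\n".join(rows) + "\n").encode("ascii")
    return len(gzip.compress(payload))


rows = make_records()
unsorted_size = gzip_size(rows)
sorted_size = gzip_size(sorted(rows))

print(f"unsorted: {unsorted_size} bytes compressed")
print(f"sorted:   {sorted_size} bytes compressed")
```

Sorting puts identical rows next to each other, so gzip can encode long runs as a handful of back-references instead of one per row. How much that matters in practice depends on how repetitive the real data is.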
Syncsort DMExpress competitive claims include:
- DMExpress supposedly uses only 25% as much CPU as competitors, even Ab Initio.
- DMExpress does direct I/O from disk, with large buffers so that reads can be nicely sequential. Apparently this is called “partition parallelism,” and other ETL vendors do it too. But Syncsort claims differentiation in that this happens automagically.
- Syncsort asserts that managing general parallelism is painful in, say, Informatica PowerCenter. But again, DMExpress does that automagically.
- DMExpress starts one thread per file read in. Syncsort asserts that Ab Initio, by contrast, starts many more processes than that.
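To make the "one thread per file read in" idea concrete, here is a rough sketch of that threading pattern in Python. It only illustrates the general shape; it says nothing about how DMExpress is actually implemented, and the function names and temporary files are hypothetical.

```python
import os
import tempfile
import threading


def count_lines(path, counts, slot):
    """Read one file sequentially and store its line count in counts[slot]."""
    with open(path, "rb") as handle:
        counts[slot] = sum(1 for _ in handle)


def read_one_thread_per_file(paths):
    """Start exactly one reader thread per input file and wait for all of them."""
    counts = [0] * len(paths)
    threads = [
        threading.Thread(target=count_lines, args=(path, counts, slot))
        for slot, path in enumerate(paths)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counts


def demo():
    """Create three small temporary input files and read them in parallel."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        for index, n_rows in enumerate((10, 20, 30)):
            path = os.path.join(workdir, f"input_{index}.txt")
            with open(path, "w") as handle:
                handle.writelines(f"row {i}\n" for i in range(n_rows))
            paths.append(path)
        return read_one_thread_per_file(paths)


print(demo())
```

The point of the pattern is that the number of workers is tied to the number of inputs, rather than being a tuning knob somebody has to set by hand.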
Syncsort estimates that one DMExpress customer is loading 1000 records/second/machine on 500 machines, around the clock. That would be about 2 billion records/hour, which is not implausible given who the customer is. Syncsort also told a story of an unnamed customer for whom Oracle utterly choked on joining 5 tables of 1 terabyte each. (27 days to run with clever workarounds.) DMExpress did the join in 6 hours and the whole load in 15.
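For anybody who wants to check the arithmetic, here it is spelled out. The only inputs are the figures quoted above; the speedup ratios assume the 27-day Oracle figure is comparable to the DMExpress times, which the anecdote does not make fully clear.

```python
# Throughput: 1000 records/second/machine on 500 machines.
RECORDS_PER_SECOND_PER_MACHINE = 1_000
MACHINES = 500
SECONDS_PER_HOUR = 3_600

records_per_hour = RECORDS_PER_SECOND_PER_MACHINE * MACHINES * SECONDS_PER_HOUR
print(f"{records_per_hour:,} records/hour")  # 1,800,000,000, i.e. about 2 billion

# Join anecdote: 27 days on Oracle vs. 6 hours (join) and 15 hours (whole load).
oracle_hours = 27 * 24          # 648 hours
join_speedup = oracle_hours / 6
load_speedup = oracle_hours / 15
print(f"join: about {join_speedup:.0f}x, whole load: about {load_speedup:.0f}x")
```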
By the way, I gather that Syncsort DMExpress is sometimes nicknamed “DMX”.
Syncsort became a client since the last time I posted a vendor client list.
Comments
9 Responses to “Introduction to Syncsort and DMExpress”
Curt – thanks for the post and for nicely capturing the main points from the recent conversation (disclosure for DBMS2 readers: I lead DI product management for Syncsort).
I think it is also important to point out that many of our customers use DMExpress to augment their existing PowerCenter or DataStage environments and address performance issues. We see waning performance as a byproduct of the large DI vendors competing against each other feature for feature. Because DMExpress leverages the industry standard for metadata interchange (MITI), we can easily import slow running PowerCenter or DataStage routines and run almost immediately. We also maintain lineage when exporting the mapping.
Under Syncsort’s new management, we’ve also focused on simplifying our licensing and packaging model to make it easier for customers to get more value from their DMExpress investments.
How exactly do we measure the “cost of doing ELT (Extract/Load/Transform) on high-end Teradata gear”?
Given that we must already have the Teradata server for query processing, where does the ELT cost come from?
Adding ETL software and servers into the flow into Teradata adds to the cost, surely?
I don’t doubt DMX has its capabilities, I just don’t think the attempt to contrast ETL v ELT cost adds to the value proposition message.
If I had a penny for every MVS Syncsort job I’d run “back in the day”…
The contention, correct or otherwise, is that Teradata machines that would otherwise have insufficient throughput work just fine if some of their duties are offloaded.
Paul Johnson has a good comment. Now Syncsort claims to compete with Teradata?
Offloading a particular kind of functionality is a limited kind of competition.
We are not claiming to compete with Teradata and actually see ourselves as quite complementary to them. What we are seeing with our customers is that they have had to push processing into Teradata (or other databases – source and target warehouse) because their ETL engine couldn’t handle the throughput requirements as well as a scalable database like Teradata.
Many of these customers have made a large investment (many times more than once) in their database environments and have not realized a linear gain in ELT capacity with the investment made. They are also being asked to 1) shorten batch windows, 2) add sources & reports, and 3) provide intra-day updates to the warehouse while the end users are using it. As customers point out, there is the double whammy that once transformations are pushed to the database by the ETL engine, the often expensive ETL software simply becomes a scheduler executing the pushed down SQL. Customers are even telling us they’re writing the SQL in Teradata, copying it into query objects of major ETL tools for subsequent pushdown and scheduling. Needless to say, this is a huge waste of expensive ETL software and a huge labor cost.
We are suggesting that customers are better off putting the “T” back where it belongs and letting the warehouse service the business users. We’ve seen too many instances where pushing the “T” into the database creates management, agility and metadata governance challenges since these transformations are represented as SQL. When we explain DMExpress capabilities (minimal or no tuning, extreme efficiency, throughput at I/O rates, extreme performance, etc.) customers ask us to take on the transformation requirements, freeing up capacity on Teradata to fulfill its mission: service user queries and reports.
We believe that we offer a unique and efficient processing layer that reduces the cost structure and labor costs associated with managing transformations in the face of exploding data volumes. We also understand how Teradata is primarily focused on support for user queries contained in analytic and reporting applications.
I want to know more about the life support of the product. Do you have primary support? For how long?
I’m using DMExpress DMXMMSRT 14.%-m.%-d SunOS 5.10 SPARC 64-bit. It uses two files namely:
adp1340_sort01.jcl:
//SORT01 EXEC PGM=SYNCSORT,PARM=EQUALS
//SORTIN DD DSN=IGNORED,
// DCB=(LRECL=499,RECFM=FB)
//SORTOUT DD DSN=IGNORED
//SYSIN DD *
SORT FIELDS=(1,6,A,32,3,A,46,28,A,44,2,A,35,9,A,74,16,A),FORMAT=CH
END
/*
adp1340_sort01.map:
SORTIN_DSN=$DATADIR/$REGION/adp1340/gasserv.ppmf1340
SORTOUT_DSN=$DATADIR/$REGION/adp1340/gasserv.ppmf1340.sort
But the result is in ASCII collating sequence. Is it possible to change it to EBCDIC collating sequence? Thanks
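For readers wondering why the two collating sequences give different results, here is a minimal Python sketch. It does not show how to configure DMExpress. It treats each entry in the SORT FIELDS statement above as a 1-based (start, length) pair sorted ascending as character data, and compares the resulting keys either as ASCII bytes or as EBCDIC bytes. The choice of code page cp037 is an assumption (other EBCDIC code pages exist), and the sample records are invented.

```python
# (start, length) pairs taken from the SORT FIELDS statement; positions are
# 1-based, every field is ascending ("A") and character data (FORMAT=CH).
SORT_FIELDS = [(1, 6), (32, 3), (46, 28), (44, 2), (35, 9), (74, 16)]
LRECL = 499  # fixed record length from the DCB parameter


def extract_key(record):
    """Concatenate the key fields in the order the SORT statement lists them."""
    return "".join(
        record[start - 1:start - 1 + length] for start, length in SORT_FIELDS
    )


def ascii_key(record):
    """Key bytes in ASCII order: digits < uppercase < lowercase."""
    return extract_key(record).encode("ascii")


def ebcdic_key(record):
    """Key bytes in EBCDIC order (cp037 assumed): lowercase < uppercase < digits."""
    return extract_key(record).encode("cp037")


def make_record(first_field):
    """Build a hypothetical fixed-width record, space-padded to LRECL."""
    return first_field.ljust(LRECL)


records = [make_record(k) for k in ("a00001", "A00001", "100001")]

# sorted() is stable, which matches the intent of PARM=EQUALS: records with
# equal keys keep their input order.
ascii_order = [r[:6] for r in sorted(records, key=ascii_key)]
ebcdic_order = [r[:6] for r in sorted(records, key=ebcdic_key)]

print("ASCII: ", ascii_order)   # ['100001', 'A00001', 'a00001']
print("EBCDIC:", ebcdic_order)  # ['a00001', 'A00001', '100001']
```

So the same SORT FIELDS statement can legitimately produce different output orders on a mainframe and on an ASCII platform whenever the key fields mix digits, uppercase, and lowercase characters.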