Data warehouse load speeds in the spotlight
Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design to overcome the traditional slow load speed of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now.
As is always the case in private benchmarks – or indeed audited public ones – the specific numbers shouldn’t be taken too seriously. Back in February, Vertica talked of sustainable load speeds of 3-5 megabytes/second. That’s around 10-18 gigabytes/hour. Multiply by 16 for the number of nodes in this latest benchmark, tack on the factor of your choice for better hardware, and you still come out more than an order of magnitude away from the 5+TB/hour figure. (Edit: Omer Trajman clarifies in the comments that this comes from the difference between 1 stream of trickle feed and 8 streams of bulk load. Bulk rates are about 3X trickle rates, per stream, and the benchmark was done at 8 streams/node.)
But all trailing zeros aside, I’m hearing more and more about data warehouse load speeds these days. E.g., data integration startup Expressor Software hopes fast loading will be one of its claims to fame. Aster Data’s architecture includes dedicated bulk load nodes. Kickfire bragged back in April of 100 GB/hour load speeds. Many other database vendors emphasize bulk and/or incremental load speed as well.
There are at least two main reasons for this emphasis on data warehouse load speed. First, increasingly many data warehouse use cases are 24/7, making batch windows problematic. Examples can include call centers, website personalization, or manufacturing, plus any kind of global enterprise or global SaaS operation. Second, there are a lot of use cases where, 24/7-ness even aside, data comes in fast. Clickstream/network event is the most obvious example. Telecom is another biggie. And all meltdowns aside, there is quite a lot of database use — including columnar databases — in the financial trading “tick store” market.
All that said — I think most commercial data warehouse DBMS will provide most users with much more load speed than they actually need. More important in most cases will be the performance overhead created by the loading, as well as the load speed and hardware utilization of the data integration middleware itself. But at some point in the technology stack, data warehouse load speed is an increasingly non-trivial subject.
Comments
17 Responses to “Data warehouse load speeds in the spotlight”
Leave a Reply
Vertica trickle loads (concurrent load and query) at 3-5 MB/s per node per load stream. The numbers in this benchmark came in on par with bulk load numbers Vertica sees in the field of 10-15 MB/s per node per load stream. In this case 16 nodes with 8 streams per node.
On Syncsort’s use in DW loading:
A major British airline (used to) use it extensively to aggregate data before loading it onto Teradata servers. That was in the days of TD Model 4 and NCR 3600 kit. It was blindingly fast.
Was this benchmark insert only or insert and update? From Vertica white papers, amazing performance is provided for INSERT statements but maybe not for UPDATE statements.
JG,
Interesting! This particular Syncsort product — DM Express – seems to have been introduced in 2005, but presumably they got the idea from somewhere.
Mark,
I’m pretty sure it’s insert, not update.
These analytic DBMS aren’t designed for great update performance.
CAM
Curt,
We used to get “great” performance bulk loading Oracle by turning off indexing and referential integrity checks. Then, of course, you had to take a few days off when you turned them back on, but we could brag about how fast we “loaded” data. So I’m wondering if Vertica’s bulk loading leaves off some necessary processing that its trickle feed includes.
-NR
A couple of additional data points (a.k.a. claims):
Greenplum includes this quotation in their analyst briefing — “In our proof of concept, we saw Greenplum reach data loading speeds of over three terabytes per hour, and we know that the database can scale even further than that.” Steven Hirsch, Chief Data Officer, NYSE Euronext
And an executive from Vhayu, which sells a storage & query engine for market tick data and has adapters for Streambase & Coral8 among CEP tools, told me in November that Vhayu won a customer that said that Vertica wasn’t handling its real-time insert rates. RT inserts != bulk loads.
Seth, I haven’t read the Vertica announcement yet, but my sense is that they were including ETL functions from Syncsort, not just screaming a neat flat file. I didn’t see similar ETL comments in the other announcements.
We also used to use sort utilities, including syncsort, to stage the files to improve bulk loading figures. So you have to look very carefully at these benchmarks.
-NR
twitter: nraden
Yep. Definitely ETL. Even mentioned that a small fraction of the records were deliberately erroneous.
ETL is the legacy world, extract a whole thing and then load. Not really scalable.
The active data warehouse is the future, when only incremental transactional changes to be loaded into data warehouse. See the DataMirror (now IBM), Shareplex or Wisdomforce that successfully do that for years
These are great questions and many are answered in the full benchmark report (http://www.vertica.com/php/pdfgateway?file=ETL-World-Record-Audit-Report.pdf). I can clarify trickle vs. bulk loads in Vertica and in this benchmark as well as what Vertica handles regarding real-time and near real time.
The Vertica engine does not internally distinguish between trickle and bulk loads. There are no indexes to turn on and off, no transaction log and no special bulk load modes. Data is available for fast querying right after it commits.
We do differentiate between trickle and bulk loads using common industry terminology. A trickle feed is where data is sent to the database at a rate that accommodates queries running concurrently and is typically continuous (24×7). A bulk load assumes no queries for some period of time and full resources are consumed by the data load.
This benchmark was designed to test bulk loads – sending data to the server as fast as possible. At the end of the load data was immediately available for fast querying. We have numerous benchmarks (http://www.vertica.com/benchmarks) and customers (http://www.vertica.com/customers) that demonstrate trickle load capabilities. In telecommunications and financial services, trickle loads are a critical feature that prompted customers to deploy Vertica.
Note that Vertica by itself is designed for near-real time (delay in seconds) not real-time or CEP analysis. We recently announced the deployment of Streambase and Vertica at BlueCrest Capital to handle both streaming data and historical data (http://www.vertica.com/company/news_and_events/StreamBase-and-Vertica-Announce-Customer-Deployment-at-BlueCrest).
[…] I did get careless when I neglected to doublecheck something I already knew. The conclusion of this post isn’t really consistent with what ParAccel told me way back in February, 2007 about how much […]
[…] month, when Vertica releases its “benchmark” of data warehouse load speeds, I didn’t realize it had previously released some actual customer-experience load rates as […]
[…] There has been a renewed buzz in the data integration vendor world around the coveted tpc-h benchmark. A discussion about the latest can be found at dbms2. […]
[…] Software plans to blow the Vertica/Syncsort “benchmark” out of the water, to wit What I know already is that our numbers will between 7 and 8 min to load […]
[…] warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which then are hotly debated. I was told […]
[…] last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark. I wrote at the time that this showed Vertica wasn’t painfully slow at loading, always a […]