Expressor pre-announces a data loading benchmark leapfrog
Expressor Software plans to blow the Vertica/Syncsort “benchmark” out of the water, to wit:
What I know already is that our numbers will between 7 and 8 min to load one TB of data and will set another world record for the tpc-h benchmark.
The whole blog post has a delightful air of skepticism, e.g.:
Sometimes the mention of a join and lookup are documented but why? If the files are load ready what is there to join or lookup?
… If the files are load ready and the bulk load interface is used, what exactly is done with the DI product?
My guess… nothing.
… But what I can’t figure out is what is so complex about this test in the first place?
More from Vertica on data warehouse load speeds
Last month, when Vertica released its “benchmark” of data warehouse load speeds, I didn’t realize it had previously released some actual customer-experience load rates as well. In a July 2008 white paper that seems thankfully free of any registration requirements, Vertica cited four examples:
- (Comcast) Trickle loads 48MB/minute – SNMP data generated by devices in the Comcast cable network is trickle loaded on a 24×7 basis at rates as high as 135,000 rows/second. The system runs on 5 HP ProLiant DL 380 servers.
- (Verizon) Bulk loads to memory 300MB/minute – 50MB to 300MB of call detail records (1K record size—150 columns per row) are loaded every 10 minutes. Vertica runs on 6 HP ProLiant DL380 servers.
- (Level 3 Communications) Bulk loads to disk 5GB/minute - The loading and enrichment (i.e., summary table creation) of 1.5TB of call detail records formerly took 5 days in a row-oriented data warehouse database. Vertica required 5 hours to load the same data.
- (”Global investment firm”) Trickle loads 2.6GB/minute - Historic financial trade and quote (TaQ) data was bulk loaded into the database at a rate of 125GB/hour. New TaQ data was trickled into the database at rates as high as 90,000 rows per second (480b per row).
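For readers who want to check how the headline rates relate to the details, here is a minimal sketch of the unit conversions for two of the four examples. It assumes decimal units (1 TB = 1,000 GB, 1 GB = 10^9 bytes) and reads “480b per row” as 480 bytes; the white paper excerpt doesn't spell out either, and the function names are made up for illustration.

```python
# Unit conversions for some of the load rates quoted above.
# Assumptions (not stated in the source): decimal units, and "480b" = 480 bytes.

def gb_per_minute(gigabytes: float, hours: float) -> float:
    """Average load rate in GB/minute for a volume loaded over a number of hours."""
    return gigabytes / (hours * 60)

def rows_to_gb_per_minute(rows_per_second: float, bytes_per_row: float) -> float:
    """Convert a row rate into GB/minute, given an average row width in bytes."""
    return rows_per_second * bytes_per_row * 60 / 1e9

# Level 3 Communications: 1.5 TB loaded in 5 hours.
level3_rate = gb_per_minute(1500, 5)

# "Global investment firm": bulk load at 125 GB/hour, trickle at 90,000 rows/second.
taq_bulk_rate = gb_per_minute(125, 1)
taq_trickle_rate = rows_to_gb_per_minute(90_000, 480)

print(f"Level 3 bulk load:  {level3_rate:.2f} GB/minute")
print(f"TaQ bulk load:      {taq_bulk_rate:.2f} GB/minute")
print(f"TaQ trickle load:   {taq_trickle_rate:.2f} GB/minute")
```

Under those assumptions, the 5GB/minute and 2.6GB/minute headlines line up with the 1.5TB-in-5-hours and 90,000-rows-per-second details, respectively.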
| Categories: Vertica Systems | Leave a Comment |
ParAccel’s market momentum
After my recent blog post, ParAccel is once again angry that I haven’t given it proper credit for its accomplishments. So let me try to redress the failing.
- ParAccel has disclosed the names of two customers, LatiNode and Merkle (presumably as an add-on to Merkle’s Netezza environment). And ParAccel has named two others under NDA. Four disclosed or semi-disclosed customers is actually more than DATAllegro has/had, although I presume DATAllegro’s three known customers are larger, especially in terms of database size.
- ParAccel sports a long list of partners, and has put out quite a few press releases in connection with these partnerships. While I’ve never succeeded in finding a company that took its ParAccel partnership especially seriously, I’ve only asked three or four of them, which is a small fraction of the total number of partners ParAccel has announced, so in no way can I rule out that somebody, somewhere, is actively helping ParAccel try to sell its products.
- ParAccel repeatedly says it has beaten Vertica in numerous proofs-of-concept (POCs), considerably more than the two cases in which it claims to have actually won a deal against Vertica competition.
- ParAccel has elicited favorable commentary from such astute observers as Seth Grimes and Doug Henschen.
- ParAccel has been noted for running TPC-H benchmarks in memory much more quickly than other vendors run them on disk.
Uh, that’s about all I can think of. What else am I forgetting? Surely that can’t be ParAccel’s entire litany of market success!
| Categories: Data warehousing, ParAccel, Uncategorized | Leave a Comment |
ParAccel actually uses relatively little PostgreSQL code
I often find it hard to write about ParAccel’s technology, for a variety of reasons:
- With occasional exceptions, ParAccel is reluctant to share detailed information.
- With occasional exceptions, ParAccel is reluctant to say anything for attribution.
- In ParAccel’s version of an “agile” development approach, product details keep changing, as do plans and schedules. (The gibe that ParAccel’s product plans are whatever their current sales prospect wants them to be — while of course highly exaggerated — isn’t wholly unfounded.)
- ParAccel has sold very few copies of its products, so it’s hard to get information from third parties.
ParAccel is quick, however, to send email if I post anything about them they think is incorrect.
All that said, I did get careless when I neglected to doublecheck something I already knew. Read more
| Categories: Data warehousing, ParAccel, PostgreSQL | 3 Comments |
Ordinary OLTP DBMS vs. memory-centric processing
A correspondent from China wrote in to ask about products that matched the following application scenario: Read more
| Categories: In-memory DBMS, McObject, Memory-centric data management, OLTP, Oracle TimesTen, solidDB | 4 Comments |
More grist for the column vs. row mill
Daniel Abadi and Sam Madden are at it again, following up on their blog posts of six months ago arguing for the general superiority of column stores over row stores (for analytic query processing). The gist is to recite a number of bases for superiority, beyond the two standard ones of less I/O and better compression, and seems to be based largely on Section 5 of a SIGMOD paper they wrote with Nabil Hachem.
A big part of their argument is that if you carry the processing of columnar and/or compressed data all the way through in memory, you get lots of advantages, especially because everything’s smaller and hence fits better into Level 2 cache. There also is some kind of join algorithm enhancement, which seems to be based on noticing when the result wound up falling into a range according to some dimension, and perhaps using dictionary encoding in a way that will help induce such an outcome.
The main enemy here is row-store vendors who say, in effect, “Oh, it’s easy to shoehorn almost all the benefits of a column-store into a row-based system.” They also take a swipe — for being insufficiently purely columnar — at unnamed columnar Vertica competitors, described in terms that seemingly apply directly to ParAccel.
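To make “carrying the processing of compressed data all the way through” a bit more concrete, here is a minimal sketch of dictionary encoding on a single column, with a filter evaluated directly on the integer codes rather than on the original strings. This is a generic illustration, not Abadi and Madden's algorithm or any vendor's implementation; the column names and data are invented.

```python
# Toy dictionary encoding of one column, plus a filter run on the encoded form.
# All names and data here are hypothetical, for illustration only.

def dictionary_encode(column):
    """Return (dictionary, codes); the dictionary is sorted, so code order matches value order."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    return dictionary, codes

# Two columns of the same hypothetical table, stored separately (column-wise).
region = ["east", "west", "east", "north", "west", "east"]
revenue = [10, 20, 30, 40, 50, 60]

dictionary, codes = dictionary_encode(region)

# Evaluate "WHERE region = 'east'" by comparing small integers, not strings.
target_code = dictionary.index("east")
east_revenue = sum(r for c, r in zip(codes, revenue) if c == target_code)

print(dictionary)    # the distinct values, stored once
print(codes)         # one small integer per row
print(east_revenue)  # total revenue for the rows that matched
```

Because the dictionary is sorted, a range predicate on the values also becomes a range predicate on the codes, which is one way encoded data can stay encoded deeper into query execution.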
| Categories: Columnar database management, Data warehousing, Database compression, ParAccel, Vertica Systems | 2 Comments |
Database archiving and information preservation
Two similar companies reached out to me recently – SAND Technology and Clearpace. Their current market focus is somewhat different: Clearpace talks mainly of archiving, and sells first and foremost into the compliance market, while SAND has the most traction providing “near-line” storage for SAP databases.* But both stories boil down to pretty much the same thing: Cheap, trustworthy data storage with good-enough query capabilities. E.g., I think both companies would agree the following is a not-too-misleading first-approximation characterization of their respective products:
- Fully functional relational DBMS.
- Claims of fast query performance, but that’s not how they’re sold.
- Huge compression.
- Careful attention to time-stamping and auditability.
| Categories: Archiving and information preservation, Clearpace, Database compression, SAND Technology | 2 Comments |
Introduction to Clearpace
Clearpace is a UK-based startup in a similar market to what SAND Technology has gotten into – DBMS archiving, with a strong focus on compression and general cost-effectiveness. Clearpace launched its product NParchive a couple of quarters ago, and says it now has 25 people and $1 million or so in revenue. Clearpace NParchive technical highlights include:
Introduction to SAND Technology
SAND Technology has a confused history. For example:
- SAND has been around in some form or other since 1982, starting out as a Hitachi reseller in Canada.
- In 1992 SAND acquired a columnar DBMS product called Nucleus, which originally was integrated with hardware (in the form of a card). Notwithstanding what development chief Richard Grondin views as various advantages vs. Sybase IQ, SAND has only had limited success in that market.
- Thus, SAND introduced a second, similarly-named product, which could also be viewed as a columnar DBMS. (As best I can tell, both are called SAND/DNA.) But it’s actually focused on archiving, aka the clunkily named “near-line storage.” And it’s evidently not the same code line; e.g., the newer product isn’t bit-mapped, while the older one is.
- The near-line product was originally focused on the SAP market. Now it’s moving beyond.
- Canada-based SAND had offices in Germany and the UK before it did in the US. This leads to an oddity – SAND is less focused on the SAP aftermarket in Germany than it still is in the US.
SAND is publicly traded, so its numbers are on display. It turns out to be doing $7 million in annual revenue, and losing money.
OK. I just wanted to get all that out of the way. My main thoughts about the DBMS archiving market are in a separate post.
| Categories: Archiving and information preservation, Columnar database management, Data warehousing, SAND Technology | 3 Comments |
How to buy an analytic DBMS (overview)
I went to London for a couple of days last week, at the behest of Kognitio. Since I was in the neighborhood anyway, I visited their offices for a briefing. But the main driver for the trip was a seminar Thursday at which I was the featured speaker. As promised, the slides have been uploaded here.
The material covered on the first 13 slides should be very familiar to readers of this blog. I touched on database diversity and the disk-speed barrier, after which I zoomed through a quick survey of the data warehouse DBMS market. But then I turned to material I’ve been working on more recently – practical advice directly on the subject of how to buy an analytic DBMS.
I started by proposing a seven-part segmentation self-assessment:
| Categories: Data warehousing | 4 Comments |
The “baseball bat” test for analytic DBMS and data warehouse appliances
More and more, I’m hearing about reliability, resilience, and uptime as criteria for choosing among data warehouse appliances and analytic DBMS. Possible reasons include:
- More data warehouses are mission-critical now, with strong requirements for uptime.
- Maybe reliability is a bit of a luxury, but the products are otherwise good enough now that users can afford to be a bit pickier.
- Vendor marketing departments are blowing the whole subject out of proportion.
The truth probably lies in a combination of all these factors.
Making the most fuss on the subject is probably Aster Data, who like to talk at length both about mission-critical data warehouse applications and Aster’s approach to making them robust. But I’m also hearing from multiple vendors that proofs-of-concept now regularly include stress tests against failure, in what can be – and indeed has been – called the “baseball bat” test. Prospects are encouraged to go on a rampage, pulling out boards, disk drives, switches, power cables, and almost anything else their devious minds can come up with to cause computer carnage.
| Categories: Data warehouse appliances, Data warehousing | 4 Comments |
Kognitio and WX-2 update
I went to Bracknell Wednesday to spend time with the Kognitio team. I think I came away with a better understanding of what the technology is all about, and why certain choices have been made.
Like almost every other contender in the market,* Kognitio WX-2 queries disk-based data in the usual way. Even so, WX-2’s design is very RAM-centric. Data gets on and off disk in mind-numbingly simple ways – table scans only, round-robin partitioning only (as opposed to the more common hash), and no compression. However, once the data is in RAM, WX-2 gets to work, happily redistributing as seems optimal, with little concern about which node retrieved the data in the first place. (I must confess that I don’t yet understand why this strategy doesn’t create ridiculous network bottlenecks.) How serious is Kognitio about RAM? Well, they believe they’re in the process of selling a system that will include 40 terabytes of the stuff. Apparently, the total hardware cost will be in the $4 million range.
*Exasol is the big exception. They basically use disk as a source from which to instantiate in-memory databases.
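For anybody unclear on the round-robin vs. hash distinction mentioned above, here is a minimal sketch of the two partitioning schemes. It is a generic illustration under made-up data, not Kognitio's code; the function names, the `customer_id` key, and the use of CRC32 as the hash are all assumptions.

```python
# Toy comparison of round-robin and hash partitioning across nodes.
# Everything here (names, data, choice of hash) is hypothetical.
import zlib

def round_robin_partition(rows, n_nodes):
    """Deal rows out to nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes

def hash_partition(rows, n_nodes, key):
    """Send each row to the node chosen by a hash of its key column."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode())
        nodes[h % n_nodes].append(row)
    return nodes

def nodes_holding(nodes, key, value):
    """Count how many nodes hold at least one row with row[key] == value."""
    return sum(1 for node in nodes if any(row[key] == value for row in node))

rows = [{"customer_id": i % 5, "amount": i} for i in range(20)]
rr = round_robin_partition(rows, 4)
hp = hash_partition(rows, 4, "customer_id")

print([len(node) for node in rr])             # round-robin: perfectly even sizes
print([len(node) for node in hp])             # hash: sizes depend on the key values
print(nodes_holding(rr, "customer_id", 0))    # a customer's rows are scattered
print(nodes_holding(hp, "customer_id", 0))    # a customer's rows are co-located
```

The trade-off the sketch shows: round-robin guarantees even data distribution with no thought required, but rows that will later be joined or grouped on a key land on different nodes, so they have to be redistributed at query time.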
Other technical highlights of the Kognitio WX-2 story include:
| Categories: Application areas, Data warehousing, Kognitio, Scientific research | 1 Comment |
Data warehouse load speeds in the spotlight
Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design to overcome the traditional slow load speed of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now.
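To put the 5 ½ terabytes per hour figure on the same footing as Expressor's pre-announced “between 7 and 8 min to load one TB” (quoted elsewhere on this page), here is a minimal conversion sketch. It assumes both parties mean the same thing by a terabyte and by “loaded,” which is far from guaranteed.

```python
# Put the two load-speed claims into the same units.
# Assumption: "TB" and "loaded" mean the same thing in both claims.

def minutes_per_tb(tb_per_hour: float) -> float:
    """Convert a rate in TB/hour into minutes needed per TB."""
    return 60 / tb_per_hour

def tb_per_hour(minutes_for_one_tb: float) -> float:
    """Convert minutes-per-TB into a rate in TB/hour."""
    return 60 / minutes_for_one_tb

vertica_syncsort = minutes_per_tb(5.5)   # the benchmark discussed above
expressor_slow = tb_per_hour(8)          # Expressor's claim, slow end
expressor_fast = tb_per_hour(7)          # Expressor's claim, fast end

print(f"Vertica/Syncsort: {vertica_syncsort:.1f} minutes per TB")
print(f"Expressor claim:  {expressor_slow:.1f} to {expressor_fast:.1f} TB per hour")
```

On those assumptions, Expressor's claimed range would be roughly 1.4 to 1.6 times the Vertica/Syncsort rate.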
Another dubious “end of computer history” argument
In a typically snarky Register article, Chris Mellor raises a caution about the use of future many-cored chips in IT. In essence, he says that today’s apps run in a relatively small number of threads each, and modifying them to run in many threads is too difficult. Hence, most of the IT use for many-cored chips will be via hypervisors that assign apps to cores as makes sense.
Mellor has a point, but he’s overstating it.
| Categories: Parallelization, Theory and architecture | 3 Comments |
The Teradata Accelerate program
An article in Intelligent Enterprise clued me in that Teradata has announced the Teradata Accelerate program. A little poking around revealed a press release in which — lo and behold — I am quoted,* to wit:
“The Teradata Accelerate program is a great idea. There’s no safer choice than Teradata technology plus Teradata consulting, bundled in a fixed-cost offering,” said Curt Monash, president of Monash Research. “The Teradata Purpose Built Platform Family members are optimized for a broad range of business intelligence and analytic uses.”
| Categories: Data warehousing, Pricing, Teradata | Leave a Comment |
High-end MySQL use
To a large extent, MySQL lives in two different alternate universes from most other DBMS. One is for low-end, simple database applications. For example, of all the DBMS I write about, MySQL is the one I actually use in my own business — because MySQL sits underneath WordPress, and WordPress is what runs my blogs. My largest database (the one for DBMS2) contains 12 megabytes of data in 11 tables, none of which has yet reached 5000 rows in size. Read more
| Categories: Google, MySQL, OLTP, Open source, Parallelization | Leave a Comment |
Interpreting the results of data warehouse proofs-of-concept (POCs)
When enterprises buy new brands of analytic DBMS, they almost always run proofs-of-concept (POCs) in the form of private benchmarks. The results are generally confidential, but that doesn’t keep a few stats from occasionally leaking out. As I noted recently, those leaks are problematic on multiple levels. For one thing, even if the results are to be taken as accurate and basically not-misleading, the way vendors describe them leaves a lot to be desired.
Here’s a concrete example to illustrate the point. One of my vendor clients sent over the stats from a recent POC, in which its data warehousing product was compared against a name-brand incumbent. 16 reports were run. The new product beat the old 16 out of 16 times. The lowest margin was a 1.8X speed-up, while the best was a whopping 335.5X.
My client helpfully took the “simple average” — i.e. the mean – of the 16 factors, and described this as an average 62X drubbing. But is that really fair?
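One way to see why the choice of “average” matters is to try several on the same numbers. The sketch below uses 16 made-up speed-up factors, chosen only so that the minimum (1.8X), maximum (335.5X), and mean (62X) match the figures above; the actual POC results were not disclosed, so everything in between is hypothetical.

```python
# Different "averages" of the same 16 speed-up factors.
# The factors are HYPOTHETICAL; only min, max, and mean match the post.
import statistics

speedups = [1.8, 2.1, 2.5, 3.0, 3.4, 4.0, 5.2, 6.5,
            8.0, 11.0, 15.0, 22.0, 122.0, 200.0, 250.0, 335.5]

arithmetic = statistics.mean(speedups)           # the vendor's "simple average"
median = statistics.median(speedups)             # the typical report
geometric = statistics.geometric_mean(speedups)  # average of the ratios as ratios
# If every report took the same time on the old system, the speed-up of the
# whole 16-report workload is the harmonic mean of the individual factors.
harmonic = statistics.harmonic_mean(speedups)

print(f"arithmetic mean: {arithmetic:.1f}X")
print(f"median:          {median:.2f}X")
print(f"geometric mean:  {geometric:.1f}X")
print(f"harmonic mean:   {harmonic:.1f}X")
```

With a skewed distribution like this, a handful of huge outliers drives the arithmetic mean, while the other measures come out several times lower. Requires Python 3.8 or later for `statistics.geometric_mean`.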
| Categories: Data warehousing | 7 Comments |
MySQL Query Analyzer
Given how the product’s rollout has been handled, it seems necessary to comment on MySQL’s recently released MySQL Query Analyzer without actually having much information on the subject. Mark Callaghan offers a good take — he’s generally very favorable, but notes that MySQL has some limitations that Query Analyzer has trouble getting around.
| Categories: MySQL | 2 Comments |
Silly website tricks
Vertica’s marketing is usually good-to-outstanding, but they made a funny misstep this time. If you go to the Vertica home page, you’ll see seasonal art suggesting that their product is a turkey and/or that it’s terrified it’s about to get the ax.
Live by the pun, die by the pun.
| Categories: Humor, Vertica Systems | 3 Comments |
Graphjam: I can haz BI
Charts and graphs, from the folks who brought you a whole lot of cute kitten photos:
| Categories: Business intelligence, Fun stuff, Humor | Leave a Comment |
When people don’t want accurate predictions made about them
In a recent article on governmental anti-terrorism data mining efforts — and the privacy risks associated with same — The Economist wrote (emphasis mine):
Abdul Bakier, a former official in Jordan’s General Intelligence Department, says that tips to foil data-mining systems are discussed at length on some extremist online forums. Tricks such as calling phone-sex hotlines can help make a profile less suspicious. “The new generation of al-Qaeda is practising all that,” he says.
Well, duh. Terrorists and fraudsters don’t want to be detected. Algorithms that rely on positive evidence of bad intent may work anyway. But if you rely on evidence that shows people are not bad actors, that’s likely to work about as well as Bayesian spam detectors.* Read more
| Categories: Analytic technologies, Data warehousing | 1 Comment |
High-performance analytics
For the past few months, I’ve collected a lot of data points to the effect that high-performance analytics – i.e., beyond straightforward query — is becoming increasingly important. And I’ve written about some of them at length. For example:
- MapReduce – controversial or in some cases even disappointing though it may be – has a lot of use cases.
- It’s early days, but Netezza and Teradata (and others) are beefing up their geospatial analytic capabilities.
- Memory-centric analytics is in the spotlight.
Ack. I can’t decide whether “analytics” should be a singular or plural noun. Thoughts?
Another area that’s come up which I haven’t blogged about so much is data mining in the database. Data mining accounts for a large part of data warehouse use. The traditional way to do data mining is to extract data from the database and dump it into SAS. But there are problems with this scenario, including:
| Categories: Analytic technologies, Aster Data, Data warehousing, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Netezza, Oracle, Parallelization, SAS Institute, Teradata | 5 Comments |
Beyond query
I sometimes describe database management systems as “big SQL interpreters,” because that’s the core of what they do. But it’s not all they do, which is why I describe them as “electronic file clerks” too. File clerks don’t just store and fetch data; they also put a lot of work into neatening, culling, and generally managing the health of their information hoards.
Already 15 years ago, online backup was as big a competitive differentiator in the database wars as any particular SQL execution feature. Security became important in some market segments. Reliability and availability have been important from the get-go. And manageability has been crucial ever since Microsoft lapped Oracle in that regard, back when SQL Server had little else to recommend it except price.*
*Before Oracle10g, the SQL Server vs. Oracle manageability gap was big.
Now data warehousing is demanding the same kinds of infrastructure richness.*
| Categories: Data warehousing, Microsoft and SQL*Server, Oracle | 1 Comment |
The query from hell, and other stories
I write about a lot of products whose core job boils down to Make queries run fast. Without exception, their vendors tout stories of remarkable performance gains over conventional/incumbent DBMS (reported improvement is usually at least 50-fold, and commonly 100-500+). They further claim at least 2-3X better performance than their close competitors. In making these claims, vendors usually stress that their results come from live customer benchmarks. In few if any of the cases, I judge, are they lying outright. So what’s going on? Read more
| Categories: Data warehousing | Leave a Comment |
MySQL is being used in an IBM Lotus appliance
Apparently, IBM is rolling out an appliance for small businesses. MySQL is under the covers. The appliance won’t have a keyboard or monitor, so there won’t be a lot of database administration going on.
Before Solid and solidDB were acquired by IBM, one of the things Solid was proudest of was some embedded apps in which solidDB ran for years in boxes without keyboards or monitors.
I still think it’s a pity that IBM isn’t using solidDB as broadly as the technology deserves. Even so, this is a nice endorsement of MySQL for reliable zero-DBA mid-range use.
| Categories: DBMS product categories, IBM and DB2, Mid-range, MySQL, solidDB | Leave a Comment |
