Database compression
Analysis of technology that compresses data within a database management system.
The eternal bogosity of performance marketing
Chris Kanaracus uncovered a case of Oracle actually pulling an ad after having been found “guilty” of false advertising. The essence seems to be that Oracle claimed 20X hardware performance vs. IBM, based on a comparison done against 6-year-old hardware running an earlier version of the Oracle DBMS. My quotes in the article were:
- “Everybody’s guilty of that kind of exaggeration.”
- “Oracle tends to be even a little guiltier than others.”
- “If your new system can’t outperform somebody else’s old system by a huge factor on at least some queries, you’re doing something wrong.”
- “Use newer, better hardware; use newer, better software; have a top sales engineer do a great job of tuning it and of course you’ll see huge performance results.”
Another example of Oracle exaggeration was around the Exadata replacement of Teradata at Softbank. But the bogosity flows both ways. Netezza used to make a flat claim of 50X better performance than Oracle, while Vertica’s standard press release boilerplate long boasted
50x-1000x faster performance at 30% the cost of traditional solutions
Of course, reality is a lot more complicated. Even if you assume apples-to-apples comparisons in terms of hardware and software versions, performance comparisons can vary greatly depending upon queries, databases, or use cases. For example:
- Many queries are inherently much faster over columnar storage than over row-based. (A toy sketch of why appears after this list.)
- Different data sets respond very differently to various compression algorithms.
- Some analytic RDBMS can maintain strong performance at high levels of concurrent usage. Some can’t.
- Some queries that run very fast on one DBMS without tuning might require careful tuning in another system.
- Some DBMS scale out much better than others.
- Vendors optimize for different usage assumptions, which may or may not apply in your particular case.
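To make the first of those points concrete, here is a toy Python sketch (the table, column count, and sizes are all invented for illustration, not measurements of any particular DBMS) of why a query that touches one column out of many reads far fewer bytes from a columnar layout than from a row-based one:

```python
# Toy illustration: bytes scanned for a query like "SELECT SUM(price) FROM sales".
# All figures are assumptions, not measurements of any particular DBMS.

ROWS = 100_000_000          # rows in the imaginary sales table
COLUMNS = 40                # columns per row
BYTES_PER_VALUE = 8         # pretend every value is a fixed 8 bytes

# Row store: rows are laid out contiguously, so a scan reads every column.
row_store_bytes = ROWS * COLUMNS * BYTES_PER_VALUE

# Column store: only the one column the query touches is read.
column_store_bytes = ROWS * 1 * BYTES_PER_VALUE

print(f"row store scan   : {row_store_bytes / 1e9:,.0f} GB")
print(f"column store scan: {column_store_bytes / 1e9:,.0f} GB")
print(f"ratio            : {row_store_bytes / column_store_bytes:.0f}x")
```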
And so, vendor marketing claims about across-the-board performance should be viewed with the utmost suspicion.
Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Exadata, Netezza, Oracle, Vertica Systems | Leave a Comment |
Clustrix 4.0 and other Clustrix stuff
It feels like time to write about Clustrix, which I last covered in detail in May, 2010, and which is releasing Clustrix 4.0 today. Clustrix and Clustrix 4.0 basics include:
- Clustrix makes a short-request processing appliance.
- As you might guess from the name, Clustrix is clustered — peer-to-peer, with no head node.
- The Clustrix appliance uses flash/solid-state storage.
- Traditionally, Clustrix has run a MySQL-compatible DBMS.
- Clustrix 4.0 introduces JSON support. More on that below.
- Clustrix 4.0 introduces a bunch of administrative features, and parallel backup.
- Also in today’s announcement is a Rackspace partnership to offer Clustrix remotely, at monthly pricing.
- Clustrix has been shipping product for about 4 years.
- Clustrix has 20 customers in production, running >125 Clustrix nodes total.
- Clustrix has 60 people.
- List price for a (smallest size) Clustrix system is $150K for 3 nodes. Highest-end maintenance costs 15%.
- There’s also a $100K version meant for high availability/disaster recovery. Over half of Clustrix’s customers use off-site disaster recovery.
- Clustrix is raising a C round. Part of it has already been raised from insiders, as a kind of bridge.
The biggest Clustrix installation seems to be 20 nodes or so. Others seem to have 10+. I presume those disaster recovery customers have 6 or more nodes each. I’m not quite sure how the arithmetic on all of that works; perhaps the 125ish count of nodes is a bit low.
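As a rough sanity check on that arithmetic, here is a back-of-envelope tally in Python. Every per-customer split below is my assumption, not a Clustrix figure; the point is just that the stated numbers push up against a ~125-node total even under conservative guesses.

```python
# Back-of-envelope tally of Clustrix nodes, using assumed per-customer splits.
# None of these splits come from Clustrix; they are guesses for illustration.

customers = 20
biggest_installation = 20        # "biggest ... seems to be 20 nodes or so"
dr_customers = 11                # "over half" of 20 customers use off-site DR
dr_nodes_each = 6                # presumed minimum for an HA/DR configuration
other_customers = customers - 1 - dr_customers
other_nodes_each = 3             # the smallest system sold is 3 nodes

low_end = (biggest_installation
           + dr_customers * dr_nodes_each
           + other_customers * other_nodes_each)
print(f"low-end estimate: {low_end} nodes")
# Prints 110; crediting any of the "10+"-node installations pushes the total
# past 125, which is why the 125ish figure looks a bit low.
```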
Clustrix technical notes include: Read more
Categories: Cloud computing, Clustering, Clustrix, Database compression, Market share and customer counts, MySQL, OLTP, Pricing, Structured documents | 4 Comments |
Why I recommend avoiding Kognitio
Since my recent post about Kognitio, things have gotten worse. The company is insistently pushing the marketing message that Kognitio has always been an in-memory product, and at one point went so far as to publicly pretend that I had agreed.
I do not agree. Yes, it’s fair to say — as I did in 2008 — that Kognitio is very RAM-centric, but that’s not at all the same thing. In particular:
- I did due diligence for Warburg Pincus’ original investment in Kognitio in the 1990s (it was then called White Cross). I have no memory of an in-memory positioning, nor of discussing same with anybody.
- I checked my notes from a 2006 briefing, which included Kognitio CTO Roger Gaskell. There was no claim that Kognitio was an in-memory product.
- Indeed, as I also posted in 2008, Kognitio keeps indexes on disk. If you use indexes on disk, you’re not an in-memory product.
The truth is that Kognitio offers a disk-based DBMS that has long been worked on by a small team. I believe that the team really has put considerable effort into how Kognitio uses RAM. But there’s no basis to give Kognitio credit for being “really” in-memory vs. a variety of other analytic RDBMS alternatives. And a row-based product that doesn’t currently offer compression is at a large disadvantage versus, say, columnar products that already do.*
*Columnar systems don’t clobber row-based ones in-memory as extremely as they do in some disk-based use cases. But even in-memory it’s good not to have to move around data that isn’t relevant to your query.
Until Kognitio gets at least somewhat more honest in its marketing, I recommend avoiding Kognitio like the plague. It’s simply not a big enough company to buy from unless you have some level of trust in the management team.
Categories: Columnar database management, Database compression, In-memory DBMS, Kognitio, Memory-centric data management | 1 Comment |
Disk, flash, and RAM
Three months ago, I pointed out that it is hard to generalize about memory-centric database management, because there are so many different kinds. That said, there are some basic points that I’d like to record as background for any future discussion of the subject, focusing on differences between disk and RAM. And while I’m at it, I’ll throw in a few comments about flash memory as well.
This post would probably be better if I had actual numbers for the speeds of various kinds of silicon operations, but I’ll do what I can without them.
For most purposes, database speed is a function of a few kinds of numbers:
- CPU cycles consumed.
- I/O throughput.
- I/O wait time.
- Network throughput.
- Network wait time.
The amount of storage used is also important, both directly — storage hardware costs money — and because if you save storage via compression, you may get corresponding benefits in I/O. Power consumption and similar costs are usually tied to hardware efficiency; the less gear you use, the less floor space and cooling you may be able to get away with.
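To see how those numbers interact, here is a deliberately crude model in Python. All figures are invented; the point is that elapsed time is governed by whichever resource is the bottleneck, and that compression buys its benefit by shrinking the I/O term.

```python
# Deliberately crude model of where query time goes; every figure is made up.
# Assumes CPU, I/O, and network work can overlap, so the slowest resource
# dominates in the best case, while the costs simply add up in the worst case.

data_scanned_gb = 100            # uncompressed bytes the query must touch
compression_ratio = 4            # 4:1 compression means 1/4 the bytes read
io_throughput_gb_s = 2           # aggregate disk bandwidth
network_gb_s = 10                # aggregate interconnect bandwidth
cpu_seconds = 20                 # CPU cycles consumed, expressed as seconds

io_seconds = (data_scanned_gb / compression_ratio) / io_throughput_gb_s
net_seconds = (data_scanned_gb / compression_ratio) / network_gb_s

print("best case :", max(cpu_seconds, io_seconds, net_seconds), "seconds")
print("worst case:", cpu_seconds + io_seconds + net_seconds, "seconds")
```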
When databases move to RAM from spinning disk, major consequences include: Read more
Categories: Database compression, Memory-centric data management, Solid-state memory, solidDB | 6 Comments |
Approximate query results
In theory:
- A database query is a predicate.
- A DBMS matches the data it manages against the predicate and sends back those records for which the predicate is true.
And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.
Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.
First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.
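For instance, here is a toy ranking function (not any particular engine’s scoring model) to illustrate that the result of a text query is an ordering rather than a set:

```python
def rank_documents(query, documents):
    """Toy relevancy ranking: score each document by how often query terms appear.

    Unlike a relational predicate, every document gets a score rather than a
    yes/no answer; the "result" is an ordering, not a set.
    """
    terms = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]

docs = {
    "a": "columnar storage compresses well",
    "b": "row storage and columnar storage compared",
    "c": "memory centric data management",
}
print(rank_documents("columnar storage", docs))   # ['b', 'a'] -- ranked, not filtered
```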
Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:
- You do a SPARQL query.
- You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the same house, you might extend the query to cover two people connected by any kind of address or building.
Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.
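Here is a hypothetical sketch of the radius-widening tactic; the helper function and thresholds are invented for illustration:

```python
import math

def distance_km(a, b):
    """Rough great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def nearby(points, center, min_results=10, radius_km=5, max_radius_km=100):
    """Widen the search radius until there are 'enough' results (or give up)."""
    while radius_km <= max_radius_km:
        hits = [p for p in points if distance_km(p, center) <= radius_km]
        if len(hits) >= min_results:
            return hits, radius_km
        radius_km *= 2            # fish for more results with a larger radius
    return hits, max_radius_km
```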
Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
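A minimal sketch of the interpolation idea, assuming an evenly spaced time series with missing readings (the function is mine, not any vendor’s):

```python
def interpolate_gaps(series):
    """Fill None gaps in an evenly spaced time series with linear interpolation."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

# e.g. readings taken every minute, with two missing measurements
print(interpolate_gaps([10.0, None, None, 16.0, 18.0]))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```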
As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.
Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:
- Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
- Metamarkets, which does fast cardinality estimates via HyperLogLog.
- Aster Data, which was the first company to point out to me that median, decile, quintile, and similar calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results. (A generic sketch of why follows this list.)
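As a generic illustration of why approximate quantiles parallelize so much better than exact ones (this is not Aster’s algorithm, just a common sampling approach), each node can summarize its own shard and only small samples need to cross the network:

```python
import random

def approximate_median(shards, sample_per_shard=1000, seed=0):
    """Estimate a median by sampling each shard locally and merging the samples.

    An exact median needs a global view of all the data; sampling keeps the
    per-node work and the network traffic small, at the cost of some error.
    """
    rng = random.Random(seed)
    merged = []
    for shard in shards:                      # in an MPP system this runs per node
        k = min(sample_per_shard, len(shard))
        merged.extend(rng.sample(shard, k))   # only the sample leaves the node
    merged.sort()
    return merged[len(merged) // 2]

# Three toy "shards" that together hold the integers 0..999,999.
shards = [list(range(i, 1_000_000, 3)) for i in range(3)]
print(approximate_median(shards))             # close to the true median of ~500,000
```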
The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:
- (Always) Well, there’s a lot of custom coding.
- (Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.
Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.
Categories: Aster Data, Business intelligence, Data models and architecture, Data warehousing, Database compression, Infobright, Text, Vertica Systems, Yarcdata and Cray | 3 Comments |
Introduction to MemSQL
I talked with MemSQL shortly before today’s launch. MemSQL technology basics are:
- In-memory relational DBMS.
- Being released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development too.
- Subset of SQL-92.
- MySQL wire-compatible (SQL coverage issues excepted).
MemSQL’s performance claims include:
- Read performance 10% or so worse than memcached.
- Write performance 20% or so better than memcached.
- 1.2 million inserts/second on a 64-core machine with 1/2 TB of RAM.
- Similarly, 1/2 billion records loaded in under 20 minutes.
MemSQL company basics include: Read more
Categories: Database compression, In-memory DBMS, Investment research and trading, Market share and customer counts, memcached, MemSQL, OLTP, Pricing, Web analytics | 3 Comments |
Workday update
In August 2010, I wrote about Workday’s interesting technical architecture, highlights of which included:
- Lots of small Java objects in memory.
- A very simple MySQL backing store (append-only, <10 tables).
- Some modernistic approaches to application navigation.
- A faceted approach to BI.
I caught up with Workday recently, and things have naturally evolved. Most of what we talked about (by my choice) dealt with data management, business intelligence, and the overlap between the two.
It is now reasonable to say that Workday’s servers fall into at least seven tiers, although we talked mainly about five that work together as a kind of giant app/database server amalgamation. The three that do noteworthy data management can be described as:
- In-memory objects and transactions. This is similar to what Workday had before.
- Persistent MySQL. Part of this is similar to what Workday had before. In addition, Workday is now storing certain data in tables in the ordinary relational way.
- In-memory caching and indexing. This has three aspects:
- Indexes for the ordinary relational tables, organized in interesting ways.
- Indexes for Workday’s search-box navigation (as per my original Workday technical post, you can search across objects, task-names, etc.).
- Compressed copies of the Java objects, used to instantiate other servers as needed. The most obvious uses of this are:
- Recovery for the object/transaction tier.
- Launch for the elastic compute tier. (Described below.)
Two other Workday server tiers may be described as: Read more
Kognitio’s story today
I had dinner tonight with the Kognitio folks. So far as I can tell:
- Branding has been mercifully simplified. Everything is now called “Kognitio” (as opposed to, for example, “WX2”).
- Notwithstanding its long history of selling disk-based DBMS and denigrating memory-only configurations, Kognitio now says that in fact it’s always been an in-memory DBMS vendor.
- Notwithstanding its long history of selling (or attempting to sell) analytic DBMS, Kognitio wants to be viewed as an accelerator to your existing DBMS. This is apparently inspired in part by SAP HANA, notwithstanding that HANA’s direction is to evolve into a hybrid OLTP/analytic general-purpose DBMS.
- Notwithstanding its lack of analytic platform features, Kognitio wants to be viewed as selling an analytic platform.
- Notwithstanding its memory-centric focus, Kognitio doesn’t want to compress data. Kognitio’s opinion — which to my knowledge is shared by few people outside Kognitio — seems to be that the CPU cost of compression/decompression isn’t justified by the RAM savings from compression.
- Kognitio still is pushing a cloud/SaaS (Software as a Service) story. Even if you want to use Kognitio (the product) on-premises, Kognitio (the company) calls that “private cloud” and offers to let you pay annually.
Kognitio believes that this story is appealing, especially to smaller venture-capital-backed companies, and backs that up with some frieNDA pipeline figures.
Between that success claim and SAP’s HANA figures, it seems that the idea of using an in-memory DBMS to accelerate analytics has legs. This makes sense, as the BI vendors — Qlik Tech excepted — don’t seem to be accomplishing much with their proprietary in-memory alternatives. But I’m not sure that Kognitio would be my first choice to fill that role. Rather, if I wanted to buy an unsuccessful analytic RDBMS to use as an in-memory accelerator, I might consider ParAccel, which is columnar, has an associated compression story, has always had a hybrid memory-centric flavor much as Kognitio has, and is well ahead of Kognitio in the analytic platform derby. That said, I’ll confess to not having talked with or heard much about ParAccel for a while, so I don’t know if they’ve been able to maintain technical momentum any more than Kognitio has.
Categories: Cloud computing, Data warehousing, Database compression, Kognitio, Memory-centric data management, ParAccel, Software as a Service (SaaS) | 2 Comments |
IBM DB2 10
Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Compression.
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel.
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
Categories: Data warehousing, Database compression, IBM and DB2, RDF and graphs, Solid-state memory, Workload management | 6 Comments |
Hardware and components — lessons from Teradata
I love talking with Carson Schmidt, chief of Teradata’s hardware engineering (among other things), even if I don’t always understand the details of what he’s talking about. It had been way too long since our last chat, so I requested another one. We were joined by Keith Muller, who I presume is pictured here. Takeaways included:
- Teradata performance growth was slow in the early 2000s, but has accelerated since then; Intel gets a lot of the credit (and blame) for that.
- Carson hopes for a performance “discontinuity” with Intel Ivy Bridge.
- Teradata is not afraid to use niche special-purpose chips.
- Teradata’s views can be taken as well-informed endorsements of InfiniBand and SAS 2.0.