Hadoop YARN — beyond MapReduce
A lot of confusion seems to have built up around the following facts:
- Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
- Something called YARN (Yet Another Resource Negotiator) is involved.
- One purpose of the whole thing is to make MapReduce not be required for Hadoop.
- MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
- Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.
Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.
- YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
- The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
- My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
- If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking tens of seconds to start a job is old news; job startup now takes a low single-digit number of seconds.
- However, starting all that Java still takes hundreds of milliseconds at best: 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server (see the back-of-envelope sketch after this list).
- Thus, if you want human real-time interaction, Hadoop MapReduce is not, and likely never will be, the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth, because of MapReduce rather than because of Hadoop.
- In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).
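To make the arithmetic concrete, here is a minimal back-of-envelope sketch in Python. Only the 200–500 millisecond JVM startup figures and the low-single-digit-seconds job start come from the discussion above; the scheduling times, wave counts, and the additive model itself are illustrative assumptions.

```python
# Back-of-envelope model of MapReduce job latency, using the figures above.
# The per-JVM startup costs (0.2 s ideal, 0.5 s realistic) come from the post;
# the scheduling times and wave counts are assumptions made for illustration.

def job_floor_seconds(scheduling_s, jvm_startup_s, task_waves=1):
    """Lower bound on wall-clock latency before any real work happens.

    scheduling_s  -- time for the framework to accept and place the job
    jvm_startup_s -- time to spin up one wave of task JVMs
    task_waves    -- how many sequential waves of tasks the job needs
    """
    return scheduling_s + task_waves * jvm_startup_s

ideal = job_floor_seconds(scheduling_s=1.0, jvm_startup_s=0.2)
realistic = job_floor_seconds(scheduling_s=3.0, jvm_startup_s=0.5, task_waves=2)

print(f"Ideal case:     ~{ideal:.1f} s before the first byte of work")
print(f"Realistic case: ~{realistic:.1f} s before the first byte of work")
# Either way, the floor is measured in seconds, which is why human real-time
# interaction points toward already-running processes instead.
```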
A data distribution idea at Vertica and Clustrix
Yesterday I wrote:
Clustrix has one cool idea I haven’t heard from anybody else, which I’m calling index distribution. The idea is that each index can be distributed differently across the cluster … i.e. on different distribution keys. Clustrix thinks that paying special attention to index distribution and movement is helpful to the performance of distributed joins.
While that’s true, I thought I’d heard something similar from Vertica; so I checked, and indeed I had. Vertica famously lets you store columns in different sort orders, in both reasonable senses:
- Different columns in a table can be sorted in different ways.
- A single column, which is stored multiple times for the usual reasons of replication safety, can be sorted differently in its different copies.
It turns out that those copies can be distributed on different keys as well.
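To illustrate the shared idea, here is a minimal Python sketch of keeping two copies of the same rows distributed on different keys, so that a join on either key can be answered node-locally. The node count, table, and hash scheme are illustrative assumptions, not how Vertica or Clustrix actually implement it.

```python
# Minimal sketch of the data distribution idea: the same logical data is kept
# in two physical copies, each hash-distributed on a different key, so that a
# join on either key can be resolved without reshuffling rows across nodes.
# Node counts, table contents, and the hash function are illustrative.

NODES = 4

def node_for(key):
    """Pick a node by hashing the distribution key."""
    return hash(key) % NODES

orders = [
    {"order_id": 101, "customer_id": 7, "amount": 30.0},
    {"order_id": 102, "customer_id": 9, "amount": 12.5},
    {"order_id": 103, "customer_id": 7, "amount": 99.9},
]

# Copy/"index" A: distributed on order_id (good for point lookups by order).
by_order = {n: [] for n in range(NODES)}
# Copy/"index" B: the same rows distributed on customer_id (good for joins
# against a customer table that is itself distributed on customer_id).
by_customer = {n: [] for n in range(NODES)}

for row in orders:
    by_order[node_for(row["order_id"])].append(row)
    by_customer[node_for(row["customer_id"])].append(row)

# A join on customer_id can now be evaluated node-locally against copy B,
# because matching rows from both sides hash to the same node.
for n in range(NODES):
    print(f"node {n}: {len(by_order[n])} rows by order_id, "
          f"{len(by_customer[n])} rows by customer_id")
```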
Related link
- Vertica projections explained at length (September, 2011)
Clustrix 4.0 and other Clustrix stuff
It feels like time to write about Clustrix, which I last covered in detail in May, 2010, and which is releasing Clustrix 4.0 today. Clustrix and Clustrix 4.0 basics include:
- Clustrix makes a short-request processing appliance.
- As you might guess from the name, Clustrix is clustered — peer-to-peer, with no head node.
- The Clustrix appliance uses flash/solid-state storage.
- Traditionally, Clustrix has run a MySQL-compatible DBMS.
- Clustrix 4.0 introduces JSON support. More on that below.
- Clustrix 4.0 introduces a bunch of administrative features, and parallel backup.
- Also in today’s announcement is a Rackspace partnership to offer Clustrix remotely, at monthly pricing.
- Clustrix has been shipping product for about 4 years.
- Clustrix has 20 customers in production, running >125 Clustrix nodes total.
- Clustrix has 60 people.
- List price for a (smallest size) Clustrix system is $150K for 3 nodes. Highest-end maintenance costs 15%.
- There’s also a $100K version meant for high availability/disaster recovery. Over half of Clustrix’s customers use off-site disaster recovery.
- Clustrix is raising a C round. Part of it has already been raised from insiders, as a kind of bridge.
The biggest Clustrix installation seems to be 20 nodes or so. Others seem to have 10+. I presume those disaster recovery customers have 6 or more nodes each. I’m not quite sure how the arithmetic on that all works; perhaps the 125ish count of nodes is a bit low.
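For what it’s worth, here is one hypothetical way the arithmetic could come out. Only the totals (20 production customers, >125 nodes, a 3-node entry configuration) come from Clustrix; the split of customers across cluster sizes is pure assumption.

```python
# Back-of-envelope check on the node arithmetic above. Only the totals
# (20 production customers, >125 nodes) come from Clustrix; how the
# customers split across cluster sizes is purely an assumption.

assumed_clusters = (
    [20]        # one installation of about 20 nodes
    + [10] * 3  # a few more at 10+ nodes
    + [6] * 10  # off-site DR customers at 6+ nodes (over half of 20)
    + [3] * 6   # the rest at the 3-node entry configuration
)

total = sum(assumed_clusters)
print(f"{len(assumed_clusters)} customers, {total} nodes")  # 20 customers, 128 nodes
# Even a conservative split like this lands a bit above 125, which is why
# the stated node count looks, if anything, slightly low.
```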
Clustrix technical notes include: Read more
Why I recommend avoiding Kognitio
Since my recent post about Kognitio, things have gotten worse. The company is insistently pushing the marketing message that Kognitio has always been an in-memory product, and at one point went so far as to publicly pretend that I had agreed.
I do not agree. Yes, it’s fair to say — as I did in 2008 — that Kognitio is very RAM-centric, but that’s not at all the same thing. In particular:
- I did due diligence for Warburg Pincus’ original investment in Kognitio in the 1990s (it was then called White Cross). I have no memory of an in-memory positioning, nor of discussing same with anybody.
- I checked my notes from a 2006 briefing, which included Kognitio CTO Roger Gaskell. There was no claim that Kognitio was an in-memory product.
- Indeed, as I also posted in 2008, Kognitio keeps indexes on disk. If you use indexes on disk, you’re not an in-memory product.
The truth is that Kognitio offers a disk-based DBMS that has long been worked on by a small team. I believe that the team really has put considerable effort into how Kognitio uses RAM. But there’s no basis to give Kognitio credit for being “really” in-memory vs. a variety of other analytic RDBMS alternatives. And a row-based product that doesn’t currently offer compression is at a large disadvantage versus, say, columnar products that already do.*
*Columnar systems don’t clobber row-based ones in-memory as extremely as they do in some disk-based use cases. But even in-memory it’s good not to have to move around data that isn’t relevant to your query.
Until Kognitio gets at least somewhat more honest in its marketing, I recommend avoiding Kognitio like the plague. It’s simply not a big enough company to buy from unless you have some level of trust in the management team.
Disk, flash, and RAM
Three months ago, I pointed out that it is hard to generalize about memory-centric database management, because there are so many different kinds. That said, there are some basic points that I’d like to record as background for any future discussion of the subject, focusing on differences between disk and RAM. And while I’m at it, I’ll throw in a few comments about flash memory as well.
This post would probably be better if I had actual numbers for the speeds of various kinds of silicon operations, but I’ll do what I can without them.
For most purposes, database speed is a function of a few kinds of numbers:
- CPU cycles consumed.
- I/O throughput.
- I/O wait time.
- Network throughput.
- Network wait time.
The amount of storage used is also important, both directly — storage hardware costs money — and because if you save storage via compression, you may get corresponding benefits in I/O. Power consumption and similar costs are usually tied to hardware efficiency; the less gear you use, the less floor space and cooling you may be able to get away with.
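As a toy illustration of how those components trade off, here is a crude Python cost model. Every throughput and latency figure in it is an assumption picked for illustration, not a benchmark; the point is only that compression trades CPU cycles for I/O, and that moving to RAM collapses the I/O terms so that CPU and network dominate instead.

```python
# Toy cost model for the speed components listed above. Every number here is
# an assumption chosen for illustration; the point is only to show how the
# pieces add up and how compression trades CPU cycles for I/O.

def query_time_s(bytes_scanned, compression_ratio,
                 io_mb_per_s, cpu_mb_per_s, network_s=0.0, io_wait_s=0.005):
    """Crude serial estimate: I/O wait + I/O transfer + CPU + network."""
    mb_on_storage = bytes_scanned / compression_ratio / 1e6
    io_s = io_wait_s + mb_on_storage / io_mb_per_s
    cpu_s = (bytes_scanned / 1e6) / cpu_mb_per_s   # decompression + predicate work
    return io_s + cpu_s + network_s

GB = 1e9
# Uncompressed scan from spinning disk vs. a 4:1 compressed scan.
print(f"disk, no compression: {query_time_s(10*GB, 1, io_mb_per_s=150, cpu_mb_per_s=2000):.1f} s")
print(f"disk, 4:1 compressed: {query_time_s(10*GB, 4, io_mb_per_s=150, cpu_mb_per_s=1000):.1f} s")
# In RAM the I/O term collapses, so CPU cycles and network dominate instead.
print(f"RAM,  4:1 compressed: {query_time_s(10*GB, 4, io_mb_per_s=20000, cpu_mb_per_s=1000, io_wait_s=0):.1f} s")
```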
When databases move to RAM from spinning disk, major consequences include: Read more
Approximate query results
In theory:
- A database query is a predicate.
- A DBMS matches the data it manages against the predicate and sends back those records for which the predicate is true.
And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.
Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.
First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.
Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:
- You do a SPARQL query.
- You modify the query to accept results from higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the same house, you might extend the query to cover two people connected by any kind of address or building.
Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.
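Here is a minimal Python sketch of the radius-widening tactic: keep relaxing the query until it returns enough partial matches. The coordinates, the target result count, and the radius schedule are all made-up examples.

```python
# Minimal sketch of "extend the allowed radius to fish for more results":
# widen the search radius in steps until enough partial matches come back.
# The points, the target count, and the radius schedule are all assumptions.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

places = {"A": (40.71, -74.00), "B": (40.80, -73.95), "C": (41.00, -73.70), "D": (42.36, -71.06)}
center = (40.75, -73.99)

def nearby(radius_km):
    return [name for name, (lat, lon) in places.items()
            if haversine_km(center[0], center[1], lat, lon) <= radius_km]

wanted = 3
for radius_km in (5, 15, 40, 100):          # progressively relax the query
    hits = nearby(radius_km)
    if len(hits) >= wanted:
        break
print(f"radius {radius_km} km -> {hits}")
```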
Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
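A minimal sketch of the interpolation idea, with made-up sensor readings: if the queried timestamp was never measured, fabricate a value from its neighbors and answer as if it were exact.

```python
# Minimal sketch of answering a query from interpolated (synthetic) readings:
# when a timestamp was never measured, fabricate a value by linear
# interpolation between its neighbors and answer "precisely" from that.
# The sensor readings and the queried timestamp are made-up examples.

readings = [(0, 10.0), (60, 16.0), (120, 13.0)]   # (seconds, temperature)

def value_at(t):
    """Return a measured value if we have one, else linearly interpolate."""
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        if t0 <= t <= t1:
            if t == t0:
                return v0
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("timestamp outside the recorded range")

print(value_at(45))   # 14.5 -- synthetic, but reported as if it were exact
```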
As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.
Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:
- Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
- Metamarkets, which does fast cardinality estimates via HyperLogLog.
- Aster Data, which was the first company to point out to me that median, decile, quintile, and similar calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results. (The sketch after this list illustrates the idea.)
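Here is a sketch of why approximate medians are cheap in a shared-nothing setting: each node ships only a small random sample to the coordinator rather than its whole partition. This illustrates the general idea only; it is not Aster’s (or anyone else’s) actual algorithm, and the data sizes are arbitrary.

```python
# Sketch of why approximate medians are cheap in a shared-nothing setting:
# each node ships a small random sample to the coordinator instead of its
# whole partition, and the coordinator takes the median of the samples.
# This is an illustration of the general idea, not any vendor's algorithm.

import random, statistics

random.seed(42)
NODES, ROWS_PER_NODE, SAMPLE_PER_NODE = 8, 100_000, 500

# Each "node" holds its own partition of the data.
partitions = [[random.gauss(50, 15) for _ in range(ROWS_PER_NODE)] for _ in range(NODES)]

# Exact median: conceptually requires gathering (or merging) all rows.
exact = statistics.median(x for part in partitions for x in part)

# Approximate median: each node sends only a sample; the coordinator merges.
samples = [x for part in partitions for x in random.sample(part, SAMPLE_PER_NODE)]
approx = statistics.median(samples)

moved_exact = NODES * ROWS_PER_NODE
moved_approx = NODES * SAMPLE_PER_NODE
print(f"exact {exact:.2f} vs approx {approx:.2f}, "
      f"moving {moved_approx:,} rows instead of {moved_exact:,}")
```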
The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:
- (Always) Well, there’s a lot of custom coding.
- (Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.
Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.
Database diversity revisited
From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.
The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:
- I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
- One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
- Even so, one product or technology may be well-suited to address a couple different kinds of challenges.
The Big Seven database challenges that almost any enterprise faces are: Read more
Introduction to Yarcdata
Cray’s strategy these days seems to be:
- Move forward with the classic supercomputer business.
- Diversify into related areas.
At the moment, the main diversifications are:
- Boxes that are like supercomputers, but at a lower price point.
- Storage.
- “(Big) data”.
The last of the three is what Cray subsidiary Yarcdata is all about. Read more
Is salesforce.com going to stick with Oracle?
Surprisingly often, I’m asked “Is salesforce.com going to stick with Oracle?” So let me refer to and expand upon my previous post about salesforce.com’s database architecture by saying:
- Today, salesforce.com uses Oracle as one of several ways to store data.
- salesforce.com’s use of Oracle isn’t very relational.
- salesforce.com is investing in HBase, after exploring other NoSQL options.
- salesforce.com surely has a very inexpensive Oracle license, reducing pressure to move any time soon. However …
- … salesforce.com’s use of Oracle has flipped from being a marketing advantage to a marketing liability.*
- It will be some years before any NoSQL option is mature enough to handle salesforce.com’s work.
- Especially through Heroku, salesforce.com is getting ever more experience with PostgreSQL.
Some day, Marc Benioff will probably say “We turned off Oracle across most of our applications a while ago, and nobody outside the company even noticed.”
*in that
- The marketing benefit “Oracle — it’s what the trustworthy big boys use” hardly matters any more.
- The marketing annoyance of Larry Ellison citing salesforce.com’s use of Oracle keeps growing.
Note: This blog post is less readable than it would be if I’d found a better workaround to WordPress’ bugs in the area of nested bullet points. I’m sorry.
Teradata SQL-H, using HCatalog
When I grumbled about the conference-related rush of Hadoop announcements, one example of many was Teradata Aster’s SQL-H. Still, it’s an interesting idea, and a good hook for my first shot at writing about HCatalog. Indeed, other than the Talend integration bundled into Hortonworks’ HDP 1, Teradata SQL-H is the first real use of HCatalog I’m aware of.
The Teradata SQL-H idea is:
- Register your Hadoop data to HCatalog. I’ll confess to being unclear about the details of how that works, for example in the case of data that just doesn’t fit well into flat relational tables; stay tuned for future posts. (A minimal sketch of the registration step appears after this list.) For now, I’ll just note that:
- HCatalog is closely based on Hive’s metadata management. If you’ve run Hive against the data, HCatalog should already know about it.
- HCatalog can handle Pig and HBase data as well.
- Write SQL DDL (Data Definition Language) so that your Aster cluster knows about the data.
- Write any Teradata Aster SQL/MR against that data. Some of the execution will be done on the Hadoop cluster, but pulling data back into Aster may well be necessary.
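Since HCatalog is closely based on Hive’s metadata management, the registration step presumably boils down to putting a table definition over the HDFS files into the shared metastore. Here is a hedged sketch of what that might look like; the host, path, table name, schema, and the choice of the PyHive client are all illustrative assumptions rather than anything Teradata documents.

```python
# Sketch of the registration step: since HCatalog is built on Hive's metadata
# management, defining an external table over files already sitting in HDFS
# should be enough for HCatalog (and hence tools like SQL-H) to see them.
# The host, path, table name, and use of the PyHive client are assumptions.

from pyhive import hive  # any Hive/HCatalog client would do

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
    ts        BIGINT,
    user_id   STRING,
    url       STRING,
    status    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/data/raw/weblogs'
"""

conn = hive.Connection(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()
cursor.execute(ddl)          # the table definition lands in the shared metastore
cursor.execute("SHOW TABLES")
print(cursor.fetchall())     # weblogs is now visible to anything using HCatalog
```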
At least in theory, Teradata SQL-H lets you use a full set of analytic tools against your Hadoop data, with little limitation except price and/or performance. Teradata thinks the performance of all this can be much better than if you just use Hadoop (35X was mentioned in one particularly favorable example), but perhaps much worse than if you just copy/extract the data to an Aster cluster in the first place.
So what might the use cases be for something like SQL-H? Offhand, I’d say:
- SQL-H use cases are probably focused in areas where copying the data to Aster in advance doesn’t make a lot of sense. So presumably …
- … the Hadoop clusters involved would hold a lot more data than you’d want to pay for storing in Teradata Aster. E.g., think of cases where Hadoop is used as a big bit bucket or archival data store.
- There could be a kind of investigative workflow. First you play around with the Hadoop data via SQL-H. Then when you think you’re onto something, you set up ETL (Extract/Transform/Load) to get the data into Aster and ratchet up the effort.
By way of contrast, the whole thing makes less sense for dashboarding kinds of uses, unless the dashboard users are very patient when they want to drill down.