Database diversity
Discussion of choices and variety in database management system architecture.
Mike Stonebraker calls for the complete destruction of the old DBMS order
Last week, Dan Weinreb tipped me off to something very cool: Mike Stonebraker and a group of MIT/Brown/Yale colleagues are calling for a complete rewrite of OLTP DBMS. And they have a plan for how to do it, called H-Store, as per a paper and an associated slide presentation.
Categories: Database diversity, In-memory DBMS, Memory-centric data management, Michael Stonebraker, OLTP, Theory and architecture, VoltDB and H-Store | 36 Comments |
Mike Stonebraker’s DBMS taxonomy
In a response to my recent five-part series on DBMS diversity, Mike Stonebraker has proposed his own taxonomy of data management technologies over on Vertica’s Database Column blog. (Edit: Some good stuff disappeared when Vertica nuked that blog.)
- OLTP DBMSs focused on fast, reliable transaction processing
- Analytic/Data Warehouse DBMSs focused on efficient load and ad-hoc query performance
- Science DBMSs — after all MATLAB does not scale to disk-sized arrays
- RDF stores focused on efficiently storing semi-structured data in this format
- XML stores focused on semi-structured data in this format
- Search engines — the big players all use proprietary engines in this area
- Stream Processing Engines focused on real-time StreamSQL
- “Lean and Mean,” less-than-a-database engines focused on doing a small number of things very well (embedded databases are probably in this category)
- MapReduce and Hadoop — after all Google has enough “throw weight” to define a category
He goes on to say that each will be architected differently, except that — as he already convinced me back in July — RDF will be well-managed by specialty data warehouse DBMS. Read more
Categories: Data types, Database diversity, Michael Stonebraker, Mid-range, OLTP, RDF and graphs, Theory and architecture | 6 Comments |
Database management system choices – beyond relational
This is the fifth of a five-part series on database management system choices. For the first post in the series, please click here.
Relational database management systems have three essential elements:
- Rows and columns. Theoretically, rows and columns may be inessential to the relational model. But in reality, they are built into the design of every real-world relational product. If you don’t have rows and columns, you’re not using the product to do what it was well-designed for.
- Predicate logic. Theoretically, everything can be fitted into a predicate Procrustean bed. But if you’re looking for relevancy rankings on a text search, binary logic is a highly convoluted way to get them.
- Fixed schemas. Database theorists commonly assume that databases have fixed schemas. If this means that 90%+ of all information is null or missing, they have elegant ways of dealing with that. Even so, as computing gets ever more concerned with individuals — each with his/her/its unique “profile(s)” — fixed schemas get ever harder to maintain.
If any of these three elements is missing or inappropriate, then a traditional relational database management system may not be the best choice.
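To make the fixed-schema point above concrete, here is a minimal, hypothetical sketch in plain Python (not tied to any particular DBMS): per-individual "profiles" with varying attributes force a fixed relational schema to carry a column for every attribute anyone might have, so many cells end up null, while a document-style representation stores only what each individual actually has.

```python
# Hypothetical illustration: sparse per-user "profiles" under a fixed schema
# versus a schema-flexible, document-style representation.

# Three users, each with mostly different attributes.
profiles = [
    {"id": 1, "name": "Alice", "favorite_team": "Red Sox", "shoe_size": 8},
    {"id": 2, "name": "Bob", "newsletter_opt_in": True},
    {"id": 3, "name": "Carol", "favorite_team": "Cubs", "twitter_handle": "@carol"},
]

# A fixed relational schema needs a column for every attribute any user might have.
all_columns = sorted({key for p in profiles for key in p})

# Flattening into rows shows how sparse the table becomes: many cells are None.
rows = [[p.get(col) for col in all_columns] for p in profiles]

null_cells = sum(cell is None for row in rows for cell in row)
total_cells = len(rows) * len(all_columns)
print(f"columns: {all_columns}")
print(f"{null_cells} of {total_cells} cells are null "
      f"({100 * null_cells / total_cells:.0f}%)")

# The document-style originals, by contrast, store only what each user has.
for p in profiles:
    print(p)
```

And every new kind of attribute widens the table further, which is the maintenance burden the bullet list alludes to.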
Categories: Data types, Database diversity, Theory and architecture | 3 Comments |
Database management system choices — mid-range-relational
This is the fourth of a five-part series on database management system choices. For the first post in the series, please click here.
The other threat to the high-end relational DBMS vendors aims squarely at the heart of their business. It’s the mid-range relational database management systems, which are doing an ever-larger fraction of what their high-end cousins can. That said, different products do different things well. So if you’re not blindly paying up for the security of an all-things-to-all-people high-end DBMS, there are a number of factors you might want to consider.
Categories: Database diversity, EnterpriseDB and Postgres Plus, Mid-range, MySQL, OLTP, PostgreSQL, Theory and architecture | 3 Comments |
Database management system choices – relational data warehouse
This is the third of a five-part series on database management system choices. For the first post in the series, please click here.
High-end OLTP relational database management system vendors try to offer one-stop shopping for almost all data management needs. But as I noted in my prior post, their product category is facing two major competitive threats. One comes from specialty data warehouse database management system products. I’ve covered those extensively in this blog, with key takeaways including:
- Specialty data warehouse products offer huge cost advantages versus less targeted DBMS. This applies to purchase/maintenance and administrative costs alike. And it’s true even when the general-purpose DBMS boast data warehousing features such as star indexes, bitmap indexes, or sophisticated optimizers.
- The larger the database, the bigger the difference. It’s almost inconceivable to use Oracle for a 100+ terabyte data warehouse. But if you only have 5 terabytes, Oracle is a perfectly viable – albeit annoying and costly – alternative.
- Most specialty data warehouse products have a shared-nothing architecture. Smaller parts are cheaper per unit of capacity. Hence shared-nothing/grid architectures are inherently cheaper, at least in theory. In data warehousing, that theoretical possibility has long been made practical (see the sketch below).
- Specialty data warehouse products with row-based architectures are commonly sold in appliance formats. In particular, this is true of Teradata, Netezza, DATAllegro, and Greenplum. One reason is that they’re optimized to stream data off of disk fairly sequentially, as opposed to relying on random seeks.
- Specialty data warehouse products with columnar architectures are commonly available in software-only formats. Even so, Vertica and ParAccel also boast appliance deals, with HP and Sun respectively.
- There is tremendous technical diversity and differentiation in the specialty data warehouse system market.
Let me expand on that last point. Different features may or may not be important to you, depending on whether your precise application needs include: Read more
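As a rough illustration of the shared-nothing point above, here is a hypothetical Python sketch (not any real MPP product's API) of the scatter/gather pattern such systems rely on: rows are hash-partitioned across independent nodes, each node aggregates only its own partition, and a coordinator merges the partial results.

```python
# Hypothetical sketch of shared-nothing aggregation: hash-partition the rows,
# aggregate each partition independently, then merge the partial results.

from collections import defaultdict

def hash_partition(rows, num_nodes, key):
    """Assign each row to a 'node' by hashing its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

def local_aggregate(partition, group_key, value_key):
    """Each node computes partial sums over only its own data."""
    partial = defaultdict(float)
    for row in partition:
        partial[row[group_key]] += row[value_key]
    return partial

def merge(partials):
    """The coordinator combines partial results -- the only cross-node step."""
    totals = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)

sales = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": 250.0},
    {"region": "east", "amount": 75.0},
    {"region": "north", "amount": 40.0},
]

partitions = hash_partition(sales, num_nodes=3, key="region")
partials = [local_aggregate(p, "region", "amount") for p in partitions]
print(merge(partials))   # e.g. {'east': 175.0, 'west': 250.0, 'north': 40.0}
```

The only data that crosses node boundaries is the small set of partial aggregates, which is why adding cheap nodes scales the expensive part of the work.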
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, Database diversity, Theory and architecture | 20 Comments |
Database management system choices – 4 categories of relational
This is the second of a five-part series on database management system choices. For the first post in the series, please click here.
For the most part, relational database management systems divide into four major classes:
- High-end OLTP (OnLine Transaction Processing) relational DBMS. Oracle is the flagship for this category, followed by DB2.
- Specialty data warehouse DBMS. Teradata is the leader here, followed by Netezza, DATAllegro, ParAccel, Vertica, Infobright, Greenplum, Kognitio, Sybase IQ, and a host of others.
- Mid-range relational database management systems. Most of the contenders here fall into one or more of three categories: Open-source-based relational DBMS (MySQL, PostgreSQL, EnterpriseDB); reseller-focused relational DBMS (Progress OpenEdge, Pervasive PSQL); or crippled “editions” of high-end systems. Microsoft SQL Server was once a clear mid-range system, but now is better classified as high-end OLTP.
- Embedded relational database management systems. The leader of this category is Sybase’s SQL Anywhere. Also significant are memory-centric products Oracle TimesTen and solidDB.
Categories: Database diversity, OLTP, Theory and architecture | 9 Comments |
Database management system choices — overview
This is the first in a 5-part series of posts on data management product choices. By pre-arrangement, Mike Stonebraker is responding on The Database Column, starting with his own taxonomy of DBMS types.
In the 1990s, most database management experts believed that a single general-purpose DBMS could meet substantially all needs. If you just kept adding in enough datatypes and data access methods (e.g., specialized indexes), your DBMS could eventually do a good job of meeting almost any requirement. And so, from the late 1990s into the beginning of this decade, it seemed that technology was supporting business trends, and the DBMS industry was inexorably consolidating. There was an oligopoly of high-end vendors, who sold increasingly similar super-sophisticated database management systems. Nothing else in database management seemed to matter.
Well, we were wrong. The big thing we overlooked is that database optimizations go down to the level of actual storage. Read more
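To give a toy, hedged illustration of what "optimizations go down to the level of actual storage" means, here is a sketch contrasting row-oriented and column-oriented layouts of the same table: an analytic query over one column touches far fewer values in the columnar layout, while an OLTP-style fetch of a whole record favors the row layout.

```python
# Toy illustration: the same table laid out row-wise and column-wise.
# Which layout is "optimized" depends entirely on the access pattern.

table = [
    {"order_id": 1, "customer": "acme", "amount": 120.0, "status": "shipped"},
    {"order_id": 2, "customer": "globex", "amount": 80.0, "status": "open"},
    {"order_id": 3, "customer": "acme", "amount": 300.0, "status": "open"},
]

# Row store: each record's fields are stored together.
row_store = [tuple(r.values()) for r in table]

# Column store: each column's values are stored together.
column_store = {col: [r[col] for r in table] for col in table[0]}

# Analytic query -- total of one column -- touches only that column's values
# in the column store, but every field of every row in the row store.
analytic_total = sum(column_store["amount"])
values_read_columnar = len(column_store["amount"])
values_read_rowwise = sum(len(row) for row in row_store)
print(analytic_total, values_read_columnar, values_read_rowwise)  # 500.0 3 12

# OLTP-style lookup -- fetch one whole record -- is a single contiguous read
# in the row store, but a gather across every column in the column store.
record = row_store[1]
reassembled = {col: column_store[col][1] for col in column_store}
print(record, reassembled)
```

No single layout wins both workloads, which is why one-size-fits-all engines struggle once databases get large.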
Categories: Database diversity, Parallelization, Theory and architecture | 14 Comments |
CouchDB — lazy database design taken to excess?
I’ve run into a research/alpha/whatever project called CouchDB a couple of times now. It’s yet another “Who needs relational databases? Who needs schemas?” kind of idea. More specifically, CouchDB is for taking random documents and banging them into databases, then calculating views on the fly as needed. It’s REST-friendly. Lucene and a web server are built in.
Damien Katz seems to be the driving force behind CouchDB, and his discussion of document-oriented development seems to be a good starting point. Read more
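To give a flavor of the "views computed on the fly" idea, here is a hypothetical Python sketch; CouchDB itself defines views in JavaScript and serves them over REST, so this is only the shape of the concept: heterogeneous documents go in as-is, and a map function emits key/value pairs that are sorted into a queryable view.

```python
# Hypothetical sketch of a document store with map-style views, in the spirit
# of CouchDB (which actually defines its views in JavaScript over HTTP/REST).

documents = [
    {"_id": "a1", "type": "post", "author": "damien", "title": "Why documents?"},
    {"_id": "a2", "type": "comment", "author": "curt", "on": "a1", "text": "Hmm."},
    {"_id": "a3", "type": "post", "author": "damien", "title": "Views"},
]

def by_author(doc):
    """Map function: emit (key, value) pairs for the documents we care about."""
    if doc.get("type") == "post":
        yield (doc["author"], doc["title"])

def build_view(docs, map_fn):
    """Apply the map function to every document and sort by emitted key."""
    return sorted(
        (key, value) for doc in docs for key, value in map_fn(doc)
    )

# The "view" is computed from the raw documents on demand -- no fixed schema.
print(build_view(documents, by_author))
# [('damien', 'Views'), ('damien', 'Why documents?')]
```

Documents with shapes the map function doesn't recognize are simply ignored, which is exactly the "lazy database design" being debated here.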
Categories: CouchDB, Data models and architecture, Database diversity, Structured documents, Theory and architecture | 4 Comments |
A passionate defense of MapReduce
Mark Chu-Carroll has weighed in with a passionate defense of MapReduce. I only see one thing he got wrong, which was to overlook the great shared-nothing parallelism of today’s data warehouse appliances and specialty data warehouse DBMS. But that doesn’t detract from his overall point, which is that MapReduce is designed to help with parallel computing in general, not database querying in particular.
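For readers who haven't seen the pattern spelled out, here is a minimal sketch of MapReduce itself in plain Python (not Google's or Hadoop's API): a map function turns each input into key/value pairs, the pairs are grouped by key, and a reduce function folds each group; the map and reduce steps are the parts that parallelize across machines.

```python
# Minimal sketch of the MapReduce pattern (not any particular framework's API).

from collections import defaultdict

def map_fn(document):
    """Map step: emit (word, 1) for every word -- runs independently per input."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce step: fold all the values emitted for one key."""
    return (word, sum(counts))

def map_reduce(inputs, mapper, reducer):
    # Shuffle: group every emitted value by its key.
    groups = defaultdict(list)
    for item in inputs:                 # the map calls could run on many machines
        for key, value in mapper(item):
            groups[key].append(value)
    # Reduce each group independently -- also parallelizable.
    return dict(reducer(key, values) for key, values in groups.items())

docs = ["the hammer the nail", "the screw"]
print(map_reduce(docs, map_fn, reduce_fn))
# {'the': 3, 'hammer': 1, 'nail': 1, 'screw': 1}
```

Nothing in the pattern presumes tables or SQL, which is Chu-Carroll's point: it is a general parallel-programming idiom, not a query processor.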
He also has the best version I know of an old observation, namely:
… [relational database] people have found the most beautiful, wonderful, perfect hammer in the whole world. It’s perfectly balanced – not too heavy, not too light, and swings just right to pound in a nail just right every time. The grip is custom-made, fitted to the shape of the owner’s hand, so that they can use it all day without getting any blisters. It’s also beautifully decorated – encrusted with gemstones and gold filigree – but only in places that won’t detract from how well it works as a hammer. It really is the greatest hammer ever. Relational database guys love their hammer. It’s just such a wonderful tool! And when they make something with it, it really comes out great. In fact, they like it so much that they think it’s the only tool they need. If you give them a screw, they’ll just pound it in like it’s a nail. And when you point out to them that dammit, it’s a screw, not a nail, they’ll say “I know that. But you can’t expect me to use a crappy little screwdriver when I have a magnificent hammer like this!”
Categories: Data models and architecture, Database diversity, MapReduce, Parallelization | Leave a Comment |
Amazon Dynamo — when primary key access is enough
Amazon has a very decentralized technical operation. But even the individual pieces have interestingly huge scale. Thus, various things they’re doing are of interest.
They recently presented a research paper on a high-performance transactional system called Dynamo. (Hat tip to Dare Obasanjo.) A key point is the following:
There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key only interface to meet the requirements of these applications.
Now, I don’t think too many organizations beyond Amazon are going to decide that they can’t afford the overhead of an RDBMS for such OLTP-like applications. But I do think it will become increasingly common to find other reasons to eschew traditional OLTP relational architectures. Maybe you’ll want the schema flexibility of XML. Or perhaps you’ll be happy with a fixed relational schema, but will want to optimize for analytic performance.
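To show how little interface such applications actually need, here is a hypothetical sketch (not Amazon's API) of a primary-key-only store: just get and put, with the key hashed to pick which "node" holds the item, which is roughly the access pattern the paper describes for shopping carts and session state.

```python
# Hypothetical sketch of a primary-key-only store: get/put by key, with the key
# hashed to choose a "node". No joins, no ad-hoc queries, no fixed schema.

class KeyValueStore:
    def __init__(self, num_nodes=3):
        # Each "node" is just a dict here; in a real system it would be a server.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = KeyValueStore()
store.put("cart:alice", {"items": ["book", "lamp"]})
store.put("session:42", {"user": "bob", "expires": "2008-01-01"})
print(store.get("cart:alice"))   # {'items': ['book', 'lamp']}
print(store.get("cart:carol"))   # None -- nothing stored under that key
```

Because every operation names its key up front, partitioning and replication decisions can be made per key, which is where Dynamo spends its real engineering effort.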