Readings in Database Systems
Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:
- They’re both titanic figures in the database industry.
- They both gave me testimonials on the home page of my business website.
- They both have been known to use the present tense when the future tense would be more accurate. 🙂
I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.
But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**
*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.
**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.
Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:
- Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS (see the sketch after this list).
- Physical hierarchical models are horrible.
- Rather, you should implement the logical hierarchical model over a columnar RDBMS.
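To make the first of those points concrete, here is a minimal sketch of JSON as a datatype inside an RDBMS, using PostgreSQL's jsonb type via psycopg2. The table, field names, and connection string are hypothetical; the GIN index is one way to make every field of the stored documents searchable, which also foreshadows the index-per-field observation below.

```python
import json
import psycopg2  # assumes a reachable PostgreSQL 9.4+ instance

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
cur = conn.cursor()

# A relational table whose 'body' column holds a logical hierarchy
cur.execute("CREATE TABLE events (id serial PRIMARY KEY, body jsonb)")
cur.execute(
    "INSERT INTO events (body) VALUES (%s)",
    (json.dumps({"user": "alice", "cart": [{"sku": "a1", "qty": 2}]}),),
)

# A GIN index covers every field of the documents, loosely analogous
# to the index-everything habit of native XML/JSON stores
cur.execute("CREATE INDEX events_body_idx ON events USING gin (body)")

# Containment query that the index can serve
cur.execute(
    "SELECT body -> 'user' FROM events WHERE body @> %s",
    ('{"user": "alice"}',),
)
print(cur.fetchone())
conn.commit()
```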
My responses start:
- Nested data structures are more important than Mike’s discussion seems to suggest.
- Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
- Even NoSQL stores should, and I think in most cases will, have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice (see the sketch after this list).
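On the joins point, document stores have indeed been moving that way. Here is a minimal sketch using MongoDB's $lookup aggregation stage, which shipped in MongoDB 3.2 at roughly the same time as the Red Book's new edition; the database, collections, and field names are made up.

```python
from pymongo import MongoClient

db = MongoClient().shop  # hypothetical database

# Left-outer-join orders to customers at query time, instead of
# denormalizing customer data into every order document
pipeline = [
    {"$lookup": {
        "from": "customers",         # collection to join against
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",            # joined documents land in this array field
    }}
]
for order in db.orders.aggregate(pipeline):
    print(order["_id"], order["customer"])
```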
In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.
- I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
- I agree with the emphasis on “data in motion”.
- While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
- The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
- MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
- The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
- Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
- The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
- Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
- The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework (see the sketch after this list).
- Similarly, except in the short term, I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
- Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
- I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. 🙂 They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
- I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. 🙂 I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
- Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
- There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).
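To flesh out the modularization point referenced above, here is a sketch of the kind of mix-and-match stack I have in mind, with HDFS supplying storage and Spark supplying both DBMS-style analytics and machine learning over the same data. It uses Spark 1.x APIs; the HDFS path and column names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="modular-stack")
sqlContext = SQLContext(sc)

# Hadoop supplies the storage layer; Spark supplies the execution engine
df = sqlContext.read.json("hdfs:///events/2015/")  # hypothetical path

# DBMS-style analytics over the data ...
df.groupBy("user_id").count().show()

# ... and advanced analytics over the very same data
points = df.select("x", "y").rdd.map(lambda row: [row.x, row.y])
model = KMeans.train(points, k=3)
print(model.clusterCenters)
```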
Related links
- I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
- Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
- I’m not current on SciDB, but I did write a bit about it in 2010.
- I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
- Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.
Comments
I agree that most OLTP deployments can use an in-memory DBMS, when “can” considers only database size versus the RAM available on large DRAM servers and small clusters. But are new deployments choosing to do that?
Maybe Amazon has data on that from EC2 customers.
My perception from people making this claim is that it ignores whether customers:
* are willing to pay for it
* are willing to pay for the power for it
* have large DRAM servers available to handle it
It is also easier for new deployments to go in-memory. The question is whether they will still be in-memory several years later, when they have too much data.
Hi Mark,
There are several different issues mixed together in your skepticism, and rightly so. For starters:
1. Existing OLTP systems work over existing OLTP DBMS. Migration, to borrow a phrase from Zork, is slow and tedious.
2. For a long time, many systems have been configured so that most accesses only go to RAM. If memory serves, the figure for SAP a decade ago was 99 1/2%. (However, fraction of accesses and fraction of work in accessing may be two very different things …)
3. If OLTP data is of the most classical kind — records of transactions engaged in by humans — then the databases and their growth are limited in size by the amount of actual business activity they track. Your challenge of “Won’t they soon outgrow RAM?” applies mainly to apps with a strong machine-generated aspect and/or to ones that capture interactions more than just actions.
“Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.”
http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf
“Nested data structures are more important than Mike’s discussion seems to suggest.”
Permanodes provide modeling, storing, searching, sharing & syncing data in the post-PC era
https://en.wikipedia.org/wiki/Camlistore
http://camlistore.org/docs/schema/permanode
[…] my recent Stonebraker-oriented post about database theory and practice over the decades, I […]
I think that JSON stores like MongoDB / Elasticsearch are filling a very important need.
Data in many cases is hierarchical by nature. When it arrives in a large stream, there is no time to normalize it into tables. And when some event arrives with a slightly non-standard structure, we cannot start redesigning our schema. We cannot throw the event away, either…
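For illustration, a minimal sketch with the elasticsearch-py client (index and field names are made up): two events of different shapes land in the same index with no schema redesign.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# A plain event ...
es.index(index="events", doc_type="event",
         body={"user": "alice", "action": "click"})

# ... and one that arrived with an extra, non-standard field;
# no schema migration is required before storing it
es.index(index="events", doc_type="event",
         body={"user": "bob", "action": "purchase",
               "cart": [{"sku": "a1", "qty": 2}]})
```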
Regarding HDFS, I also do not agree that it is bad. I remember tons of critiques: it is slow, a single point of failure, read-only…
I know of very few real-life problems with it. There are not many things with the same level of robustness. It solves the problem, after all.
MapReduce and its generalizations, FlumeJava (for Googlers) and Spark (for the rest of us), give us the capability to build our own execution plans.
We can also call them directed acyclic execution graphs of computations. I see the proliferation of hints in traditional databases as proof that the capability to write one’s own execution plan is a serious need…
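A small PySpark sketch of that build-your-own-execution-plan idea (the paths are hypothetical): the chained transformations only construct a DAG, which can be inspected before any computation runs.

```python
from pyspark import SparkContext

sc = SparkContext(appName="dag-demo")

errors = (sc.textFile("hdfs:///logs/app/")               # hypothetical path
            .filter(lambda line: "ERROR" in line)
            .map(lambda line: (line.split()[0], 1))
            .reduceByKey(lambda a, b: a + b))

# Nothing has executed yet; the lines above only built a DAG
print(errors.toDebugString())

# An action triggers the plan we wrote ourselves
errors.saveAsTextFile("hdfs:///out/error_counts")
```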
OLTP systems tend to evolve over time into hybrid “analytic” systems, with both singleton read/write characteristics (e.g., order form entry) and more advanced things like lookup (e.g., view order history). Singleton workloads are completely solved for human interfaces: RDBMS on disk, RDBMS on SSD, and KVP stores all work. The I/O stress *always* comes from the analytic parts, and there are loads of solutions. OLTP is now down to cost, support models, and analytics integration.
There are strong use cases for M/R at the other end of the spectrum from RDBMS: roll your own. If your application is big enough, it will break the commercial options, and Hadoop solves a fundamental problem: schedule big batch work, and let the programmer drive. That is why it is the first option for someone betting a business on a complex new app (and why pros go nuts looking at those systems).
The arrays, nested data structures, and hierarchical data feeds and stores mentioned above, plus schema-on-read, solve a couple of other issues. Some data fits poorly into sets: time series (a common use for arrays) and inconsistent hierarchies (a common use for triple stores). This fits David’s example; it’s really the use of Hadoop *as a building block* for a customized, use-case-specific data management system.
This is very similar to OODBMS back in the day, when the DBMS was a toolset rather than a product (and you were responsible for locking and consistency), except now we have scale, cost, and data as leverage.
Expect the world to divide into on-the-box SQL-on-Hadoop uses, and the exceptional cases where big data is a differentiator.
Agree that BI is important, and will continue to be (durr); that machine learning is becoming more important; that abstract data type usefulness is over-stated; that UDF usefulness is over-stated.
The difficulties around “First we build the perfect schema” data warehouse projects cannot be over-stated.
This is the subject of a blog article that I’ve been writing in my head for a while. Maybe this is the reminder I needed!