Michael Stonebraker
Discussion of the views of database management pioneer Mike Stonebraker. Related subjects include:
- Vertica
- StreamBase
- H-Store
- (in Software Memories) Stonebraker’s accomplishments
Flash, other solid-state memory, and disk
If there’s one subject on which the New England Database Summit changed or at least clarified my thinking,* it’s future storage technologies. Here’s what I now think:
- Solid-state memory will soon be the right storage technology for a large fraction of databases, OLTP and analytic alike. I’m not sure whether the initial cutoff in database size is best thought of as terabytes or 10s of terabytes, but it’s in that range. And it will increase over time, for the usual cheaper-parts reasons.
- That doesn’t necessarily mean flash. PCM (Phase-Change Memory) is coming down the pike, with perhaps 100X the durability of flash, in terms of the total number of writes it can tolerate. On the other hand, PCM has issues in the face of heat. More futuristically, IBM is also high on magnetic racetrack memory. IBM likes the term storage-class memory to cover all this — which I find regrettable, since the acronym SCM is way overloaded already. 🙂
- Putting a disk controller in front of solid-state memory is really wasteful. It wreaks havoc on I/O rates.
- Generic PCIe interfaces don’t suffice either, in many analytic use cases. Their I/O is better, but still not good enough. (Doing better yet is where Petascan – the stealth-mode company I keep teasing about – comes in.)
- Disk will long be useful for very large databases. Kryder’s Law, about disk capacity, shows at least as high an annual improvement as Moore’s Law does for chip capacity, the disk rotation speed bottleneck notwithstanding (see the back-of-envelope sketch after this list). Disk will long be much cheaper than silicon for data storage. And cheaper silicon in sensors will lead to ever more machine-generated data that fills up a lot of disks.
- Disk will long be useful for archiving. Disk is the new tape.
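Purely as a back-of-envelope illustration of the Kryder’s Law point above: the growth rates below are assumptions I am plugging in (roughly 40% a year for disk areal density, and a doubling every two years for chips), not measured figures, so treat this as a sketch of the compounding argument rather than a forecast.

```python
# Back-of-envelope comparison of compounding capacity growth.
# The annual growth rates are illustrative assumptions, not measured data.

KRYDER_ANNUAL = 0.40           # assume ~40%/year growth in disk areal density
MOORE_ANNUAL = 2 ** 0.5 - 1    # assume a doubling every 2 years for chips (~41%/year)

def capacity_multiple(rate, years):
    """Capacity multiple after `years` of compound growth at `rate`."""
    return (1 + rate) ** years

for years in (5, 10):
    print(f"After {years} years: disk x{capacity_multiple(KRYDER_ANNUAL, years):.1f}, "
          f"chips x{capacity_multiple(MOORE_ANNUAL, years):.1f}")
```

Plug in different estimates and the ordering can flip either way; the point is simply that disk capacity compounds about as fast as chip capacity, rotation-speed stagnation notwithstanding.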
*When the first three people to the question microphone include both Mike Stonebraker and Dave DeWitt, your thinking tends to clarify in a hurry.
Related links
- A slide deck on storage-class memories by Mohan of IBM, similar to the one he presented at the NEDB Summit.
- A much more detailed IBM presentation on storage-class memories.
- Oracle’s and Teradata’s beliefs about the importance of solid-state memory.
Other posts based on my January, 2010 New England Database Summit keynote address
- Data-based snooping — a huge threat to liberty that we’re all helping make worse
- Interesting trends in database and analytic technology
- Open issues in database and analytic technology
Categories: Data warehousing, Michael Stonebraker, Presentations, Solid-state memory, Storage, Theory and architecture | 3 Comments |
New England Database Summit (January 28, 2010)
New England Database Day has now, in its third year, become a “Summit.” It’s a nice event, providing an opportunity for academics and business folks to mingle. The organizers are basically the local branch of the Mike Stonebraker research tree, with this year’s programming head being Daniel Abadi. It will be on Thursday, January 28, 2010, once again in the Stata Center at MIT. It would be reasonable to park in the venerable 4/5 Cambridge Center parking lot, especially if you’d like to eat at Legal Seafood afterwards.
So far there are two confirmed speakers — Raghu Ramakrishnan of Yahoo and me. My talk title will be something like “Database and analytic technology: The state of the union”, with all wordplay intended.
There’s more information at the official New England Database Summit website. There’s also a post with similar information on Daniel Abadi’s DBMS Musings blog.
Edit after the event:
Posts based on my January, 2010 New England Database Summit keynote address
- Data-based snooping — a huge threat to liberty that we’re all helping make worse
- Flash, other solid-state memory, and disk
- Interesting trends in database and analytic technology
- Open issues in database and analytic technology
Categories: Analytic technologies, Data warehousing, Michael Stonebraker, Presentations, Theory and architecture | 4 Comments |
Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm (the classic paradigm is sketched just after this list)
- MapReduce is a single technology
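For concreteness, here is the kind of thing people mean by the classic Map-Reduce programming paradigm: a toy word count in plain Python, with an explicit map step that emits key-value pairs, a shuffle step that groups them by key, and a reduce step that aggregates each group. This is a sketch of the programming model only; real implementations such as Hadoop distribute these phases across a cluster.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (key, value) pair for every word in a document."""
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a final result."""
    return key, sum(values)

docs = {1: "the map step emits pairs", 2: "the reduce step sums the pairs"}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)   # {'the': 3, 'map': 1, 'step': 2, 'emits': 1, 'pairs': 2, ...}
```

The second myth above is precisely the belief that using a MapReduce product locks you into writing everything in this emit/group/aggregate style.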
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Greenplum, Hadoop, Log analysis, MapReduce, Michael Stonebraker, Parallelization, Web analytics | 11 Comments |
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needed a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more
Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture | 2 Comments |
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
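To make the schema flexibility point concrete, here is a tiny, hypothetical illustration (my example, with made-up fields): in many of these stores a record is just a bag of attributes, so different records need not share the same fields, whereas a relational table commits you to a column list up front.

```python
# Hypothetical illustration of schema flexibility: each record is just a dict,
# and different records can carry different attributes.
events = [
    {"user": "alice", "action": "click", "url": "/home"},
    {"user": "bob", "action": "purchase", "sku": "X42", "amount": 19.99},
    {"user": "carol", "action": "click", "url": "/pricing", "referrer": "ad-7"},
]

# No schema change is needed when a new attribute shows up; consumers just
# read the fields they care about and ignore the rest.
for event in events:
    print(event["user"], event["action"], event.get("url", "-"))
```

Modeling the same data relationally means either a wide, NULL-heavy table or an ALTER TABLE whenever a new attribute appears, which is the trade-off the schema flexibility argument is about.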
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
There always seems to be a fire drill around MapReduce news
Last August I flew out to see my new clients at Greenplum. They told me they planned to roll out MapReduce in a few weeks, and asked for my help in publicizing it. From their offices I went to dinner with non-clients Aster Data, who told me they’d gotten wind of a Greenplum MapReduce announcement and planned to come out ahead of it. A couple of hours later, Aster signed up as a client. In something of a pickle — but not one of my own making — I knocked heads, and persuaded both vendors to announce MapReduce at the same time, namely the following Monday. Lots of publicity ensued for both vendors, and everybody was reasonably satisfied. Read more
Categories: About this blog, Analytic technologies, Aster Data, Greenplum, MapReduce, Michael Stonebraker, Vertica Systems | 1 Comment |
Stonebraker, DeWitt, et al. compare MapReduce to DBMS
Along with five other coauthors — the lead author seems to be Andy Pavlo — famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):
- A couple of different flavors of a Grep task originally proposed in a Google MapReduce paper.
- A database query on simulated clickstream data
- A join on the same clickstream data.
- Two aggregations on the clickstream data.
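I don’t have the paper’s exact schemas or queries in front of me, so the sketch below is purely illustrative, with made-up table and column names and SQLite standing in for the real systems; it just shows the general shape of the aggregation-style tests, namely a GROUP BY over clickstream-like data.

```python
import sqlite3

# Illustrative only: the table and column names are made up, not the paper's.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (source_ip TEXT, url TEXT, ad_revenue_cents INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("10.0.0.1", "/a", 12), ("10.0.0.1", "/b", 30), ("10.0.0.2", "/a", 5)],
)

# An aggregation-style test: total ad revenue per source IP.
query = """
    SELECT source_ip, SUM(ad_revenue_cents)
    FROM visits
    GROUP BY source_ip
    ORDER BY source_ip
"""
for row in conn.execute(query):
    print(row)   # ('10.0.0.1', 42) then ('10.0.0.2', 5)
```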
Categories: Analytic technologies, Hadoop, MapReduce, Michael Stonebraker, Parallelization, Vertica Systems | 6 Comments |
New England Database Day this Friday January 30
Dan Weinreb, to whose opinions I usually give great weight, spoke very favorably of last year’s New England Database Day conference. Well, this year’s is taking place on Friday. It’s at MIT and it’s free, with easy registration. A list of papers is here.
It’s pretty obvious who’s running the show. Sam Madden’s name is given as a contact; elsewhere the event is referred to as being organized by Madden and Mike Stonebraker. Of the six identified papers, 2-3 look as if their subjects or people could have come straight from Vertica’s Database Column blog. But that hardly means the event will be one long Vertica commercial. For example, the other papers include one from Netezza and one on Flash memory data access methods.
I really doubt I’ll make it to Cambridge in time for the 9:00 am opening remarks ;), but I’ll try to swing by later on.
Categories: Michael Stonebraker, Theory and architecture, Vertica Systems | 3 Comments |
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to recent posts I’ve done about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says: Read more
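To illustrate the flavor of that argument (my sketch, not Mike’s): the map step of a typical MapReduce job is a per-row transform or filter, the reduce step is a grouped aggregate, and a SQL engine expresses both in a single declarative statement that a parallel DBMS can then spread across nodes. SQLite and the made-up data below are just there to keep the example runnable; the parallelism is exactly the part SQLite does not supply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (url TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?)",
    [("/search?q=db",), ("/home",), ("/search?q=mr",), ("/about",)],
)

# A grep-then-count MapReduce job (map = filter rows matching a pattern,
# reduce = count per key) collapses into one declarative statement.
query = """
    SELECT url, COUNT(*)          -- reduce: grouped aggregate
    FROM log
    WHERE url LIKE '%search%'     -- map: per-row filter
    GROUP BY url
"""
for row in conn.execute(query):
    print(row)   # ('/search?q=db', 1), ('/search?q=mr', 1)
```

Whether that declarative form covers everything people actually use MapReduce for is, of course, the crux of the debate.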
Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL | 5 Comments |
Microsoft is buying DATAllegro
I’ve long argued that:
- Oracle and Microsoft are doomed in the data warehouse market unless they acquire MPP/shared-nothing data warehouse DBMS and/or data warehouse appliances.
- DATAllegro is the ideal acquisition for either of them.
Microsoft has now validated my claim by agreeing to buy DATAllegro. As you probably know, we’ve been covering DATAllegro extensively, as per the links listed below.
Basic deal highlights include: Read more