DBMS development and other subjects
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
As for MarkLogic’s Enterprise NoSQL messaging — it basically equates “NoSQL” to “short-request dynamic-schema”, and in 2013 I have little quarrel with that definition.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely? (See the layout sketch below for what I mean by the distinction.)
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
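To spell out what that first question is asking, here is a minimal sketch of the three candidate layouts in plain Python. It is purely conceptual and assumes nothing about the actual on-disk format of Parquet or any of the other projects; the table contents and row-group size are made up.

```python
# Toy illustration of three physical layouts for the same logical table.
# Conceptual only -- this is not Parquet (or any other project's) code, and
# the table contents and row-group size are made up.

rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
    {"id": 4, "name": "d", "amount": 40.0},
]
columns = list(rows[0])  # ["id", "name", "amount"]

# Row store: each record's values are stored contiguously.
row_layout = [tuple(r[c] for c in columns) for r in rows]

# "True" columnar: each column is stored contiguously across the whole table,
# typically as its own file or segment, so scanning one column touches nothing else.
column_layout = {c: [r[c] for r in rows] for c in columns}

# PAX-like / row-group layout: rows are first chunked into groups (here, 2 rows
# per group); within each group the data is organized column by column, so the
# columns of the same rows stay together in one block or file.
GROUP_SIZE = 2
row_group_layout = [
    {c: [r[c] for r in rows[i:i + GROUP_SIZE]] for c in columns}
    for i in range(0, len(rows), GROUP_SIZE)
]

print(row_layout)
print(column_layout)
print(row_group_layout)
```

In the second and third layouts, compression and any nested-data encoding naturally become per-column-chunk concerns, which is part of why the questions above hang together.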
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.
Related links
- One database to rule them all? (February, 2013)
- NewSQL thoughts (January, 2013)
- Bottleneck Whack-A-Mole (August, 2009)
Comments
Glad to see someone else agree with my maxim that databases are the one part of the IT landscape where old is potentially a good thing: http://robtweed.wordpress.com/2013/03/15/the-uncertainty-principle/
Well, let’s not take it to extremes!
Could you explain that flippant response?
We’re talking, after all, about a database technology that underpins a product you’ve discussed before in positive terms:
http://www.dbms2.com/category/products-and-vendors/intersystems-cache-ensemble/
I think that, despite what in many cases is their immaturity, using DBMS invented within the past 5-15 years is generally a better idea than using Oracle et al., unless there’s some legacy reason to use the older stuff. My post carries more force the broader the aspirations are for a new system; the narrower your goals, the sooner you can achieve them.
And to your specific point, I’m not convinced that resurrecting Mumps is a good idea.
At the end of the day, the key criteria for databases come down to pretty basic things such as:
– performance
– scalability
– maintainability
– robustness
– quality of available technical support
– flexibility of data model
– ease of access across as wide a range of languages as possible, including the very latest and most popular
– ability to work efficiently in modern web and mobile architectures
– availability as an Open Source version
Everything else is just irrelevant fad, fashion and/or subjective personal opinion.
“I’m not convinced that resurrecting Mumps is a good idea.”
Sadly, a common sentiment, usually for all the wrong reasons. See:
http://robtweed.wordpress.com/2013/01/22/can-a-phoenix-rise-from-the-ashes-of-mumps/
That’s a lot of paragraphs to say “Nobody else has solved this problem in decades, but I have, and I’ll tell you how some other time.”
OK, I accept Rule 1 but I’d raise a couple of points:
* Surely Hadoop already is a DBMS (what else could it possibly be?)
* Hadoop has already had 7+ years & 10s of $m of development.
SQL on Hadoop clearly needs time to develop. However, the current offering is actually good enough for many cases where scale is an issue. It’s limited and brutally inefficient, but it gets the job done at arbitrary scale. Plus scaling is relatively painless. No current analytic DB can honestly claim to offer that right now, in my experience.
Hadoop in 2013 promises ‘if our SQL supports it then it can get as big as you want’. There is also an implied idea that no vendor will be able to hold you over a barrel in the future, either for an upgrade or for scaling. It’s the principle of ‘no more blank checks.’
‘That’s a lot of paragraphs to say “Nobody else has solved this problem in decades, but I have, and I’ll tell you how some other time.”’
Well I guess I’d assumed you’d continue reading, having set the scene:
http://robtweed.wordpress.com/2013/01/23/can-a-phoenix-arise-from-the-ashes-of-mumps-part-2/
http://robtweed.wordpress.com/2013/01/24/a-phoenix-rises/
Apologies if it challenges your attention span!
Curt – (if you don’t mind my interrupting this “Pick vs. MUMPS: Choosing the DBMS for the future of your PDP8” fest)
Can you expand on your thought about choosing a DBMS invented in the last 5-15 years. It’s …err… controversial. Except for a few companies such as Google, I can’t think of anyone using a dbms invented in the past 5 years at all in strategic prod. Over 90% of the time I’ve tried a dbms that was < 10YO, I've hit fundamental capability problems (Vertica being the notable exception.)
I like tech. I like new. But most of the new stuff is niche, incomplete in capability, inadequate in API, not adequately manageable, and likely only relevant to very unique circumstances. Can you enumerate newsqls that you'd recommend putting business critical stuff on? My take is that most need more time (and perhaps more money than they have in pocket) to gel.
Aaron,
To restrict my answer to stuff I’ve posted about here:
1. Workday and salesforce.com in essence wrote their own DBMS.
2. Substantially every major web company tracks interaction data in some combination of MySQL and NoSQL. I don’t know what fraction of online order-taking is done in Oracle/SQL Server/DB2/etc., but I’d guess it is far under 100%.
3. I wouldn’t hesitate to endorse relying on Vertica or Netezza, for a considerable range of workloads each. In specific contexts, I say favorable things about Infobright, Greenplum, ParAccel and others as well.
4. In the publishing/publishing-like area, MarkLogic evidently rocks.
I do agree that if you’re looking for a GENERAL-PURPOSE DBMS, you’re likely to be happiest with Oracle or one of its traditional alternatives. But increasingly, I dispute that’s what you should be looking for.
Joe,
Hadoop is getting pretty decent at what it was originally designed for, and reasonable variants of same. Beyond that, we’ll see.
YARN and now Tez were sort of designed with RDBMS-like mixed-workload management issues in mind, but I wouldn’t forecast success with the same confidence I would if equally smart people actually worked at a focused RDBMS vendor. Building a successful cost-based optimizer is always a lot harder than people think it is — but that point is also very workload-dependent, as many optimizations are so straightforward that a heuristic optimizer or a lame CBO is just fine.
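To illustrate the optimizer point, here is a toy sketch (plain Python, invented purely for illustration, not anything from a real engine; the table sizes, selectivities, and cost formula are all made up): the heuristic applies a rule of thumb, while the cost-based version enumerates plans and, more importantly, has to estimate cardinalities, which is where the real difficulty lies.

```python
from itertools import permutations

# Toy contrast between a heuristic and a cost-based join-order choice.
# Everything here (sizes, selectivities, the cost formula) is made up;
# no real optimizer is remotely this simple.

tables = {"orders": 1_000_000, "customers": 50_000, "regions": 20}

# Assumed join selectivities between table pairs (hypothetical numbers).
selectivity = {
    frozenset({"orders", "customers"}): 1e-5,
    frozenset({"orders", "regions"}): 5e-2,
    frozenset({"customers", "regions"}): 5e-2,
}

def heuristic_order(tables):
    """Rule of thumb: join the smallest tables first."""
    return tuple(sorted(tables, key=tables.get))

def cost_based_order(tables, selectivity):
    """Enumerate join orders and pick the one with the lowest estimated cost.
    The genuinely hard part in practice is estimating the cardinalities
    (the selectivity numbers), not the search itself."""
    def estimated_cost(order):
        cost, rows, joined = 0, tables[order[0]], {order[0]}
        for t in order[1:]:
            sel = min(selectivity[frozenset({t, j})] for j in joined)
            rows = rows * tables[t] * sel   # crude intermediate-result estimate
            cost += rows
            joined.add(t)
        return cost
    return min(permutations(tables), key=estimated_cost)

print(heuristic_order(tables))                # smallest-first order
print(cost_based_order(tables, selectivity))  # cheapest estimated order
```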
Hadoop’s design is antithetical to multi-temperature data management; that will take a long time to fix. Nobody from the Hadoop community has ever indicated to me that he’s working on it, or even thinking hard about it. Indeed, the first part of the fix must be to get people to stop screaming in their heads “Noooo! Affinity is always a terrible idea!!!”
Etc., etc.
Looks like we’re mostly agreeing, but my take is that the envelope to use without flinching pushes to >10 years from start. Your examples:
– Workday (I know nothing about it)
– salesforce.com > 10YO
– MySQL and other traditional rdbms > 10YO
- NoSQL (love many of them; don’t see much transactional stuff in ones that look like dbms, but lots in distributed caches/KVs, and loads of batch stuff in db-ish processing)
– Vertica – around 10YO including C-store (though it looked relatively solid after ~7)
– Netezza – >10YO
- Infobright – (I don’t know much about this, but didn’t it come from UofWarsaw (Wrablewski[sp?]) > 10 years ago?)
– Greenplum pg>30YO (even the GP parallelism is >10YO)
– ParAccel (never seen it)
– MarkLogic >10YO
so, of the ones I’ve seen in the semi-wild, most really took a while to grow to be non-scary decisions. I have nothing against specialized products, but just need them to work well….
Things that are wildly popular don’t look much like rdbms in terms of semantics. A lot are caches (possibly with eventual consistency). A lot are filesystems. Many are specialized indices (SOLR, Neo4J).
The tradeoffs are the kickers here. Traditional rdbms has some issues: scale, price, sometimes specialized optimization, often distribution semantics, relationship semantics, processing of semistructured data.
Of these, the newsql winners need to attack one of the problems in a way that makes it a clear winner. Looking at the history is instructive – for example, Red Brick looked like it had a niche for a while.
An important point here is that nothing is scaling to very large very efficiently at this point. Another is that a lot of the no-/new-sql use is more bypassing the traditional rdbms price structure than trying for new features or efficiency.
Aaron,
I guess it depends on what you mean by a “non-scary decision”. In plenty of use cases, selecting Oracle gives a higher chance of project failure than trying Vertica, even Vertica 4. Ditto for many of the other projects you mention.
There are certainly good use cases for columnar RDBMS. One not mentioned is Hana columnar – the only thing that may be under 5YO I’ve seen in corporate strategic nonweb environments.
Parquet seems like a PAX format, and referring to such implementations as “columnar store” might be confusing, as noted by Daniel Abadi (http://bit.ly/149QONj). When multiple columns are stored together in the same data block or file, it may more accurately be referred to as “row group” or “column family” storage.
One key reason why “true” column store is typically not implemented in Hadoop is that HDFS’s default block placement policy does not guarantee co-location (i.e. blocks are placed somewhat randomly over the cluster). As a result, data of columns that belong to the same row might be placed across different datanodes, and reconstructing such rows will lead to tremendous network overhead.
HDFS’s default placement policy can be replaced with one that forces all columns (or column families) belonging to a specific row to be placed and replicated to the same datanodes. While this will alleviate the network overhead, it might lead to other challenges such as high (and inefficient) write I/O, an unbalanced cluster, and reduced availability.
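To make the co-location point concrete, here is a toy simulation in plain Python (nothing here is HDFS, Parquet, or real placement-policy code; the datanode names, column names, and row-group count are made up). With independent placement, most row groups end up with their column chunks spread across nodes, so rebuilding rows needs the network; with forced co-location they do not.

```python
import random

# Toy model of the placement issue: if each column chunk of a row group is
# placed independently (HDFS-style pseudo-random placement), the columns of a
# given row range usually land on different datanodes, and reconstructing rows
# requires network reads. Forcing co-location avoids that.
# Simulation only -- not HDFS, Parquet, or any real placement-policy code.

DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names
COLUMNS = ["id", "name", "amount"]         # hypothetical columns
ROW_GROUPS = 100

def default_placement():
    """Each column chunk of each row group is placed independently."""
    return {(rg, col): random.choice(DATANODES)
            for rg in range(ROW_GROUPS) for col in COLUMNS}

def colocated_placement():
    """All column chunks of a row group are forced onto the same node."""
    placement = {}
    for rg in range(ROW_GROUPS):
        node = random.choice(DATANODES)
        for col in COLUMNS:
            placement[(rg, col)] = node
    return placement

def row_groups_needing_network(placement):
    """Count row groups whose column chunks span more than one node."""
    return sum(len({placement[(rg, col)] for col in COLUMNS}) > 1
               for rg in range(ROW_GROUPS))

print("independent placement:", row_groups_needing_network(default_placement()))
print("co-located placement: ", row_groups_needing_network(colocated_placement()))
```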
Actually, HANA’s predecessor shipped in 2005 or so: http://www.dbms2.com/2006/09/20/saps-bi-accelerator/
@Eli,
Parquet’s design allows for true columnar storage to be implemented if/when Hadoop allows that to happen. Columns do not have to be in the same file, though currently they are due to HDFS specifics.
Dmitriy,
Thanks for the clarification. My comments about true columnar store in Parquet should be read in the context of using it with HDFS.
The point I’m trying to make is that implementing true columnar storage in a shared-nothing architecture requires control over block placement, which is not available in the default HDFS implementation.
There are several ways to overcome this challenge:
1. Change HDFS policy to force data locality, so all columns for a specific range of rows will be stored on the same storage node
2. Use default HDFS policy (no forced locality) but implement a shared-everything architecture over HDFS. The remote query nodes will be able to assemble the [partial] rows as needed from the various data nodes.
“1. Workday and salesforce.com in essence wrote their own DBMS.”
I worked for SF a few years back and I can tell you that their data reside on Oracle.
Sure. But the layer between the apps and Oracle is like a DBMS itself.