DBMS development and other subjects
The cardinal rules of DBMS development
Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
In particular:
- Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
- Mixed workload management is harder than you’re assuming it is.
- Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
DBMS with Hadoop underpinnings …
… aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well aware of this fact.
But note that the HadoopDB prototype — on which Hadapt was based — was completed and the paper presented in 2009.
MarkLogic …
… has been around long enough to make a good DBMS. It used to make a solid XML DBMS. Now SQL and JSON are also in the mix. The SQL part is a reversal of MarkLogic’s long-time stance. The JSON part gets MarkLogic out of the usually-losing side of the XML/JSON debate.
As for MarkLogic’s Enterprise NoSQL messaging — it basically equates “NoSQL” to “short-request dynamic-schema”, and in 2013 I have little quarrel with that definition.
RDBMS-oriented Hadoop file formats are confusing
I’ve recently tried asking both Cloudera and Hortonworks about the “columnar” file formats beneath their respective better-Hive efforts, each time getting the response “Let me set you up with a call with the right person.” Cloudera also emailed over a link to Parquet, evidently the latest such project.
Specific areas about which I’m confused (and the same questions apply to any of these projects, as they seem similarly intended) include but are not limited to:
- Is it truly columnar (doesn’t seem so, based on the verbiage), or more PAX-like, or something else entirely? (See the layout sketch below for what I mean by the distinction.)
- What’s the nested data structure story? (It seems there is one.)
- What’s the compression story?
Come to think of it, the name “Parquet” suggests that either:
- Rows and columns are mixed together.
- Somebody has the good taste to be a Celtics fan.
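To spell out what that first question is asking, here is a minimal sketch of the three candidate layouts in plain Python. It is purely conceptual and assumes nothing about the actual on-disk format of Parquet or any of the other projects; the table contents and row-group size are made up.

```python
# Toy illustration of three physical layouts for the same logical table.
# Conceptual only -- this is not Parquet (or any other project's) code, and
# the table contents and row-group size are made up.

rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
    {"id": 4, "name": "d", "amount": 40.0},
]
columns = list(rows[0])  # ["id", "name", "amount"]

# Row store: each record's values are stored contiguously.
row_layout = [tuple(r[c] for c in columns) for r in rows]

# "True" columnar: each column is stored contiguously across the whole table,
# typically as its own file or segment, so scanning one column touches nothing else.
column_layout = {c: [r[c] for r in rows] for c in columns}

# PAX-like / row-group layout: rows are first chunked into groups (here, 2 rows
# per group); within each group the data is organized column by column, so the
# columns of the same rows stay together in one block or file.
GROUP_SIZE = 2
row_group_layout = [
    {c: [r[c] for r in rows[i:i + GROUP_SIZE]] for c in columns}
    for i in range(0, len(rows), GROUP_SIZE)
]

print(row_layout)
print(column_layout)
print(row_group_layout)
```

In the second and third layouts, compression and any nested-data encoding naturally become per-column-chunk concerns, which is part of why the questions above hang together.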
Whither analytic platforms?
I’ve been a big advocate of analytic platform technology, but interest hasn’t increased as much as I expected. Teradata Aster seems to be doing well, but not so extremely well that IBM Netezza, Sybase IQ, et al. feel the need to be aggressive in their responses. Vendors have, for the most part, put decent capabilities in place; but the energy I’d looked for isn’t there.
I think that problems include:
- Analytic platforms are marketed too purely as a development play. Selling six-to-seven figure application development deals is hard.
- But selling analytic performance — the other main benefit — is harder than it used to be. Good enough is often good enough. In particular …
- … a lot of analytic work is being conceded, rightly or wrongly, to Hadoop.
- More generally, selling advanced analytic tools is commonly a tough, niche-oriented business.
Also, some of the investigative analytics energy has been absorbed by business intelligence tools, specifically ones with “discovery” interfaces — Tableau, QlikView, and so on.
Related links
- One database to rule them all? (February, 2013)
- NewSQL thoughts (January, 2013)
- Bottleneck Whack-A-Mole (August, 2009)
Comments
Glad to see someone else agree with my maxim that databases are the one part of the IT landscape where old is potentially a good thing: http://robtweed.wordpress.com/2013/03/15/the-uncertainty-principle/
Well, let’s not take it to extremes!
Could you explain that flippant response?
We’re talking, after all, about a database technology that underpins a product you’ve discussed before in positive terms:
http://www.dbms2.com/category/products-and-vendors/intersystems-cache-ensemble/
I think that, despite what in many cases is their immaturity, using DBMS invented within the past 5-15 years is generally a better idea than using Oracle et al., unless there’s some legacy reason to use the older stuff. My post carries more force the broader the aspirations are for a new system; the narrower your goals, the sooner you can achieve them.
And to your specific point, I’m not convinced that resurrecting Mumps is a good idea.
At the end of the day, the key criteria for databases come down to pretty basic things such as:
– performance
– scalability
– maintainability
– robustness
– quality of available technical support
– flexibility of data model
– ease of access across as wide a range of languages as possible, including the very latest and most popular
– ability to work efficiently in modern web and mobile architectures
– availability as an Open Source version
Everything else is just irrelevant fad, fashion and/or subjective personal opinion.
“I’m not convinced that resurrecting Mumps is a good idea.”
Sadly, a common sentiment, usually for all the wrong reasons. See:
http://robtweed.wordpress.com/2013/01/22/can-a-phoenix-rise-from-the-ashes-of-mumps/
That’s a lot of paragraphs to say “Nobody else has solved this problem in decades, but I have, and I’ll tell you how some other time.”
OK, I accept Rule 1 but I’d raise a couple of points:
* Surely Hadoop already is a DBMS (what else could it possibly be?)
* Hadoop has already had 7+ years & 10s of $m of development.
SQL on Hadoop clearly needs time to develop. However, the current offering is actually good enough for many cases where scale is an issue. It’s limited and brutally inefficient, but it gets the job done at arbitrary scale. Plus scaling is relatively painless. No current analytic DB can honestly claim to offer that right now, in my experience.
Hadoop in 2013 promises ‘if our SQL supports it then it can get as big as you want’. There is also an implied idea that no vendor will be able to hold you over a barrel in the future, either for an upgrade or for scaling. It’s the principle of ‘no more blank checks.’
‘That’s a lot of paragraphs to say “Nobody else has solved this problem in decades, but I have, and I’ll tell you how some other time.”’
Well I guess I’d assumed you’d continue reading, having set the scene:
http://robtweed.wordpress.com/2013/01/23/can-a-phoenix-arise-from-the-ashes-of-mumps-part-2/
http://robtweed.wordpress.com/2013/01/24/a-phoenix-rises/
Apologies if it challenges your attention span!
Curt – (if you don’t mind my interrupting this “Pick vs. MUMPS: Choosing the DBMS for the future of your PDP8” fest)
Can you expand on your thought about choosing a DBMS invented in the last 5-15 years. It’s …err… controversial. Except for a few companies such as Google, I can’t think of anyone using a dbms invented in the past 5 years at all in strategic prod. Over 90% of the time I’ve tried a dbms that was < 10YO, I've hit fundamental capability problems (Vertica being the notable exception.)
I like tech. I like new. But most of the new stuff is niche, incomplete in capability, inadequate in API, not adequately manageable, and likely only relevant to very unique circumstances. Can you enumerate newsqls that you'd recommend putting business critical stuff on? My take is that most need more time (and perhaps more money than they have in pocket) to gel.
Aaron,
To restrict my answer to stuff I’ve posted about here:
1. Workday and salesforce.com in essence wrote their own DBMS.
2. Substantially every major web company tracks interaction data in some combination of MySQL and NoSQL. I don’t know what fraction of online order-taking is done in Oracle/SQL Server/DB2/etc., but I’d guess it is far under 100%.
3. I wouldn’t hesitate to endorse relying on Vertica or Netezza, for a considerable range of workloads each. In specific contexts, I say favorable things about Infobright, Greenplum, ParAccel and others as well.
4. In the publishing/publishing-like area, MarkLogic evidently rocks.
I do agree that if you’re looking for a GENERAL-PURPOSE DBMS, you’re likely to be happiest with Oracle or one of its traditional alternatives. But increasingly, I dispute that’s what you should be looking for.
Joe,
Hadoop is getting pretty decent at what it was originally designed for, and reasonable variants of same. Beyond that, we’ll see.
YARN and now Tez were sort of designed with RDBMS-like mixed-workload management issues in mind, but I wouldn’t forecast success with the same confidence I would if equally smart people actually worked at a focused RDBMS vendor. Building a successful cost-based optimizer is always a lot harder than people think it is — but that point is also very workload-dependent, as many optimizations are so straightforward that a heuristic optimizer or a lame CBO is just fine.
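To illustrate the optimizer point, here is a toy sketch (plain Python, invented purely for illustration, not anything from a real engine; the table sizes, selectivities, and cost formula are all made up): the heuristic applies a rule of thumb, while the cost-based version enumerates plans and, more importantly, has to estimate cardinalities, which is where the real difficulty lies.

```python
from itertools import permutations

# Toy contrast between a heuristic and a cost-based join-order choice.
# Everything here (sizes, selectivities, the cost formula) is made up;
# no real optimizer is remotely this simple.

tables = {"orders": 1_000_000, "customers": 50_000, "regions": 20}

# Assumed join selectivities between table pairs (hypothetical numbers).
selectivity = {
    frozenset({"orders", "customers"}): 1e-5,
    frozenset({"orders", "regions"}): 5e-2,
    frozenset({"customers", "regions"}): 5e-2,
}

def heuristic_order(tables):
    """Rule of thumb: join the smallest tables first."""
    return tuple(sorted(tables, key=tables.get))

def cost_based_order(tables, selectivity):
    """Enumerate join orders and pick the one with the lowest estimated cost.
    The genuinely hard part in practice is estimating the cardinalities
    (the selectivity numbers), not the search itself."""
    def estimated_cost(order):
        cost, rows, joined = 0, tables[order[0]], {order[0]}
        for t in order[1:]:
            sel = min(selectivity[frozenset({t, j})] for j in joined)
            rows = rows * tables[t] * sel   # crude intermediate-result estimate
            cost += rows
            joined.add(t)
        return cost
    return min(permutations(tables), key=estimated_cost)

print(heuristic_order(tables))                # smallest-first order
print(cost_based_order(tables, selectivity))  # cheapest estimated order
```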
Hadoop’s design is antithetical to multi-temperature data management; that will take a long time to fix. Nobody from the Hadoop community has ever indicated to me that he’s working on it, or even thinking hard about it. Indeed, the first part of the fix must be to get people to stop screaming in their heads “Noooo! Affinity is always a terrible idea!!!”
Etc., etc.
Looks like we’re mostly agreeing, but my take is that the envelope to use without flinching pushes to >10 years from start. Your examples:
– Workday (I know nothing about it)
– salesforce.com > 10YO
– MySQL and other traditional rdbms > 10YO
- NoSQL (love many of them; don’t see much transactional stuff in ones that look like dbms, but lots in distributed caches/KVs, and loads of batch stuff in db-ish processing)
– Vertica – around 10YO including C-store (though it looked relatively solid after ~7)
– Netezza – >10YO
- Infobright – (I don’t know much about this, but didn’t it come from UofWarsaw (Wrablewski[sp?]) > 10 years ago?)
– Greenplum pg>30YO (even the GP parallelism is >10YO)
– ParAccel (never seen it)
– MarkLogic >10YO
so, of the ones I’ve seen in the semi-wild, most really took a while to grow to be non-scary decisions. I have nothing against specialized products, but just need them to work well….
Things that are wildly popular don’t look much like rdbms in terms of semantics. A lot are caches (possibly with eventual consistency). A lot are filesystems. Many are specialized indices (SOLR, Neo4J).
The tradeoffs are the kickers here. Traditional rdbms has some issues: scale, price, sometimes specialized optimization, often distribution semantics, relationship semantics, processing of semistructured data.
Of these, the newsql winners need to attack one of the problems in a way that makes it a clear winner. Looking at the history is instructive – for example, Red Brick looked like it had a niche for a while.
An important point here is that nothing is scaling to very large very efficiently at this point. Another is that a lot of the no-/new-sql use is more bypassing the traditional rdbms price structure than trying for new features or efficiency.
Aaron,
I guess it depends on what you mean by a “non-scary decision”. In plenty of use cases, selecting Oracle gives a higher chance of project failure than trying Vertica, even Vertica 4. Ditto for many of the other projects you mention.
There are certainly good use cases for columnar RDBMS. One not mentioned is Hana columnar – the only thing that may be under 5YO I’ve seen in corporate strategic nonweb environments.
Parquet seems like a PAX format, and referring to such implementations as “columnar store” might be confusing, as noted by Daniel Abadi (http://bit.ly/149QONj). When multiple columns are stored together in the same data block or file, it may more accurately be referred to as “row group” or “column family” storage.
One key reason why “true” column store is typically not implemented in Hadoop is that HDFS’s default block placement policy does not guarantee co-location (i.e. blocks are placed somewhat randomly over the cluster). As a result, data of columns that belong to the same row might be placed across different datanodes, and reconstructing such rows will lead to tremendous network overhead.
HDFS’s default placement policy can be replaced with one that forces all columns (or column families) belonging to a specific row to be placed and replicated to the same datanodes. While this will alleviate the network overhead, it might lead to other challenges such as high (and inefficient) write I/O, an unbalanced cluster, and reduced availability.
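To make the co-location point concrete, here is a toy simulation in plain Python (nothing here is HDFS, Parquet, or real placement-policy code; the datanode names, column names, and row-group count are made up). With independent placement, most row groups end up with their column chunks spread across nodes, so rebuilding rows needs the network; with forced co-location they do not.

```python
import random

# Toy model of the placement issue: if each column chunk of a row group is
# placed independently (HDFS-style pseudo-random placement), the columns of a
# given row range usually land on different datanodes, and reconstructing rows
# requires network reads. Forcing co-location avoids that.
# Simulation only -- not HDFS, Parquet, or any real placement-policy code.

DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names
COLUMNS = ["id", "name", "amount"]         # hypothetical columns
ROW_GROUPS = 100

def default_placement():
    """Each column chunk of each row group is placed independently."""
    return {(rg, col): random.choice(DATANODES)
            for rg in range(ROW_GROUPS) for col in COLUMNS}

def colocated_placement():
    """All column chunks of a row group are forced onto the same node."""
    placement = {}
    for rg in range(ROW_GROUPS):
        node = random.choice(DATANODES)
        for col in COLUMNS:
            placement[(rg, col)] = node
    return placement

def row_groups_needing_network(placement):
    """Count row groups whose column chunks span more than one node."""
    return sum(len({placement[(rg, col)] for col in COLUMNS}) > 1
               for rg in range(ROW_GROUPS))

print("independent placement:", row_groups_needing_network(default_placement()))
print("co-located placement: ", row_groups_needing_network(colocated_placement()))
```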
Actually, HANA’s predecessor shipped in 2005 or so: http://www.dbms2.com/2006/09/20/saps-bi-accelerator/
@Eli,
Parquet’s design allows for true columnar storage to be implemented if/when Hadoop allows that to happen. Columns do not have to be in the same file, though currently they are due to HDFS specifics.
Dmitriy,
Thanks for the clarification. My comments about true columnar store in Parquet should be read in the context of using it with HDFS.
The point I’m trying to make is that implementing true columnar storage in a shared-nothing architecture requires control over block placement, which is not available in the default HDFS implementation.
There are several ways to overcome this challenge:
1. Change HDFS policy to force data locality, so all columns for a specific range of rows will be stored on the same storage node
2. Use default HDFS policy (no forced locality) but implement a shared-everything architecture over HDFS. The remote query nodes will be able to assemble the [partial] rows as needed from the various data nodes.
“1. Workday and salesforce.com in essence wrote their own DBMS.”
I worked for SF a few years back and I can tell you that their data reside on Oracle.
Sure. But the layer between the apps and Oracle is like a DBMS itself.