Data models and architecture

Discussion of issues in data modeling, and whether databases should be consolidated or loosely coupled. Related subjects include:

September 24, 2013

JSON in DB2

There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.

For starters, let’s note that there are at least four strategies IBM could have used.

Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.

IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:

Hardcore internet and/or machine-generated data, for example from a website.
Enterprise data aggregation, for example a “360-degree customer view.”

IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who enjoy working on and with their MongoDB skills. Read more

Categories: Data models and architecture, IBM and DB2, MongoDB, NoSQL, pureXML, Structured documents

2 Comments

September 21, 2013

Schema-on-need

Two years ago I wrote about how Zynga managed analytic data:

Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.

What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:

Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.

That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done. 😉

Categories: Data models and architecture, Data warehousing, MongoDB, PostgreSQL, Schema on need, Structured documents

10 Comments

September 8, 2013

Layering of database technology & DBMS with multiple DMLs

Two subjects in one post, because they were too hard to separate from each other

Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.

Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:

The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
MySQL storage engines.
MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).

Other examples on my mind include:

Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
FoundationDB’s strategy, and specifically its acquisition of Akiban.

And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.

In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more

Categories: Aerospike, Akiban, Aster Data, Cache, Calpont, Cloudera, Data models and architecture, Database diversity, Databricks, Spark and BDAS, DATAllegro, Derived data, Greenplum, Hadapt, Hadoop, JPMorgan Chase, NoSQL, NuoDB, Parallelization, Solid-state memory, SQL/Hadoop integration, Structured documents, Text

7 Comments

July 20, 2013

The refactoring of everything

I’ll start with three observations:

Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.

As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more

Categories: Business intelligence, Cloud computing, Clustering, Data models and architecture, Exadata, IBM and DB2, In-memory DBMS, Memory-centric data management, Microsoft and SQL*Server, NewSQL, NoSQL, Oracle, SAP AG, Software as a Service (SaaS), Telecommunications, Teradata, Workday

5 Comments

June 23, 2013

Impala and Parquet

I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:

Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.

Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest is in Parquet.

Categories: Cloudera, Columnar database management, Data models and architecture, Data warehousing, Database compression, Hadoop, HBase, MapReduce, Market share and customer counts, Specific users, SQL/Hadoop integration, Workload management

11 Comments

April 14, 2013

Introduction to Deep Information Sciences and DeepDB

I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:

DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
- DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
- Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
- DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
- The Deep guys have plans and designs for scale-out — transparent sharding and so on.

*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.

Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:

Akiban definitely does. (Note: Stay tuned for some next-steps company news about Akiban.)
Tokutek has planted a small stake there too.
Key-value-store-backed NuoDB and GenieDB probably leans that way. (And SanDisk evidently shut down Schooner’s RDBMS while keeping its key-value store.)
VoltDB, Clustrix, ScaleDB and MemSQL seem more strictly tabular, except insofar as text search is a requirement for everybody. (Edit: Oops; I forgot about Clustrix’s approach to JSON support.)

Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.

Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.

Categories: Akiban, Cloud computing, Clustrix, Columnar database management, Data models and architecture, Database compression, GenieDB, Market share and customer counts, Memory-centric data management, MySQL, NewSQL, NoSQL, NuoDB, OLTP, Oracle, ScaleDB, Schooner Information Technology, Tokutek and TokuDB, Transparent sharding, VoltDB and H-Store

Some notes on new-era data management, March 31, 2013

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

Workloads and use cases vary greatly.
In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing performance claims seem particularly confused. Reasons include but are not limited to:

It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
Many workloads are inherently single node (replication aside). Others are not.

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included: Read more

Categories: Benchmarks and POCs, Cassandra, Clustering, Couchbase, Data models and architecture, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, In-memory DBMS, Investment research and trading, Market share and customer counts, MarkLogic, Memory-centric data management, MongoDB, NewSQL, NoSQL, Tokutek and TokuDB

8 Comments

February 21, 2013

One database to rule them all?

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times. 🙂

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more

Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Database diversity, GenieDB, GIS and geospatial, Hadoop, IBM and DB2, MarkLogic, Michael Stonebraker, Microsoft and SQL*Server, NewSQL, NoSQL, Oracle, PostgreSQL, SAP AG, Solid-state memory, Storage, Structured documents, Text, Theory and architecture, Tokutek and TokuDB

20 Comments

December 13, 2012

Spark, Shark, and RDDs — technology notes

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.

The key concept here seems to be the RDD. Any one RDD:

Is a collection of Java objects, which should have the same or similar structure.
Can be partitioned/distributed and shuffled/redistributed across the cluster.
Doesn’t have to be entirely in memory at once.

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

At the moment, RDDs expire at the end of a job.
This restriction will be lifted in a future release.

Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration

11 Comments

November 1, 2012

Data models and architecture

JSON in DB2

Schema-on-need

Layering of database technology & DBMS with multiple DMLs

The refactoring of everything

Impala and Parquet

Introduction to Deep Information Sciences and DeepDB

Some notes on new-era data management, March 31, 2013

One database to rule them all?

Spark, Shark, and RDDs — technology notes

More on Cloudera Impala

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin