Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

September 21, 2013

Schema-on-need

Two years ago I wrote about how Zynga managed analytic data:

Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.

What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:

That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done. 😉
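To make the virtual-to-physical migration concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the `events` table, the `extras` column holding name-value pairs as JSON text, and the `promote_to_column` helper are hypothetical names, not anything Zynga or eBay is known to use.

```python
import json
import sqlite3

def promote_to_column(conn, table, key, sql_type="TEXT"):
    """Move one name-value key out of the JSON 'extras' blob into a real column."""
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    if not (table.isidentifier() and key.isidentifier() and sql_type.isalpha()):
        raise ValueError("unsafe identifier")
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {key} {sql_type}")
    rows = conn.execute(f"SELECT id, extras FROM {table}").fetchall()
    for row_id, extras in rows:
        pairs = json.loads(extras)
        value = pairs.pop(key, None)  # rows without the key get NULL
        conn.execute(
            f"UPDATE {table} SET {key} = ?, extras = ? WHERE id = ?",
            (value, json.dumps(pairs), row_id),
        )
    conn.commit()

# Hypothetical table: one ordinary column plus a bag of name-value pairs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, extras TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, extras) VALUES (?, ?)",
    [
        (1, json.dumps({"level": 7, "device": "phone"})),
        (2, json.dumps({"level": 3})),
        (3, json.dumps({"device": "tablet"})),
    ],
)

# Once "level" is clearly needed for a while, give it a physical column.
promote_to_column(conn, "events", "level", "INTEGER")
levels = conn.execute(
    "SELECT user_id, level FROM events ORDER BY user_id"
).fetchall()
```

After the call, `level` can be queried and indexed like any other column, while keys that were never promoted stay in `extras`. A production system would presumably do the backfill in batches rather than row by row.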

Read more

September 8, 2013

Layering of database technology & DBMS with multiple DMLs

Two subjects in one post, because they were too hard to separate from each other.

Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
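As a toy illustration of that parser, planner, and execution engine split, the sketch below handles a single hypothetical query shape (`SELECT col FROM table [WHERE col > n]`) over in-memory rows. The `Query` class, the step tuples, and the function names are all assumptions made for the example; no real DBMS is structured this simply.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Query:
    column: str
    table: str
    filter_column: Optional[str] = None
    threshold: Optional[int] = None

_PATTERN = re.compile(
    r"SELECT\s+(\w+)\s+FROM\s+(\w+)(?:\s+WHERE\s+(\w+)\s*>\s*(-?\d+))?\s*$",
    re.IGNORECASE,
)

def parse(sql: str) -> Query:
    """Parser layer: text in, structured query out."""
    m = _PATTERN.match(sql.strip())
    if m is None:
        raise ValueError(f"unsupported query: {sql!r}")
    column, table, filter_column, threshold = m.groups()
    return Query(column, table, filter_column,
                 int(threshold) if threshold is not None else None)

def plan(query: Query) -> list:
    """Planner layer: structured query in, ordered list of steps out."""
    steps = [("scan", query.table)]
    if query.filter_column is not None:
        steps.append(("filter", query.filter_column, query.threshold))
    steps.append(("project", query.column))
    return steps

def execute(steps: list, tables: dict) -> list:
    """Execution layer: runs the steps against rows stored as dicts."""
    rows = []
    for step in steps:
        if step[0] == "scan":
            rows = list(tables[step[1]])
        elif step[0] == "filter":
            _, col, threshold = step
            rows = [r for r in rows if r[col] > threshold]
        elif step[0] == "project":
            rows = [r[step[1]] for r in rows]
    return rows

tables = {"players": [{"name": "ann", "score": 12}, {"name": "bob", "score": 5}]}
result = execute(plan(parse("SELECT name FROM players WHERE score > 10")), tables)
```

Because each layer only sees the previous layer's output, any one of them could in principle be swapped for something written by a different team, which is the modularity the paragraph above starts from.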

Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:

Other examples on my mind include:

And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.

In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include:  Read more

September 3, 2013

The Hemisphere program

Another surveillance slide deck has emerged, as reported by the New York Times and other media outlets. This one is for the Hemisphere program, which apparently:

Other notes include:

I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
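The arithmetic behind that kind of conjecture is simple enough to sketch. The inputs below are made-up placeholders, not figures from the slide deck or from this post; only the "100s of bytes" order of magnitude comes from the paragraph above.

```python
def estimated_petabytes(records_per_day: int, bytes_per_record: int, years: int) -> float:
    """Raw (uncompressed, unindexed) size of a record archive, in petabytes."""
    total_bytes = records_per_day * bytes_per_record * 365 * years
    return total_bytes / 10**15

# Hypothetical inputs: 1 billion records a day, 200 bytes each, kept for 10 years.
example = estimated_petabytes(1_000_000_000, 200, 10)
```

With those placeholder inputs the raw total comes to 0.73 petabytes, so anything in the range of billions of records per day retained for a decade or more lands at petabyte scale before indexes or replication are counted.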

Hemisphere Project unknowns start:  Read more

August 31, 2013

Tokutek’s interesting indexing strategy

The general Tokutek strategy has always been:

But the details of “writes indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.

Adding further confusion is that Tokutek now has two product lines:

TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.

So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:

The core ideas of Tokutek’s architecture start: Read more

August 24, 2013

Hortonworks business notes

Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call (for which Rob Bearden didn’t bother showing up) include, in no particular order:

In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social.  Read more

August 17, 2013

Aerospike 3

My clients at Aerospike are coming out with their Version 3 and, as several of my clients do, have encouraged me to front-run what otherwise would be the Monday embargo.

I encourage such behavior with arguments including:

Aerospike 2’s value proposition, let us recall, was:

… performance, consistent performance, and uninterrupted operations …

  • Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
  • Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
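A claim of the form "99.9% of responses within 5 milliseconds" can be checked against measured latencies with a few lines of code. The sketch below is an assumption about how one might do that; the function names and the sample data are hypothetical and have nothing to do with Aerospike's own tooling.

```python
def fraction_within(samples_ms, limit_ms):
    """Share of latency samples at or under the limit."""
    if not samples_ms:
        raise ValueError("no samples")
    return sum(1 for s in samples_ms if s <= limit_ms) / len(samples_ms)

def meets_target(samples_ms, limit_ms=5.0, required_fraction=0.999):
    """True if at least required_fraction of samples are within limit_ms."""
    return fraction_within(samples_ms, limit_ms) >= required_fraction

# Made-up measurements: mostly sub-millisecond, a few slower, one outlier.
good_run = [0.8] * 9990 + [4.0] * 9 + [30.0]
# Made-up measurements with too many slow responses (20 out of 10,000).
bad_run = [0.8] * 9980 + [30.0] * 20

good_ok = meets_target(good_run)
bad_ok = meets_target(bad_run)
```

Here `good_run` passes because only 1 response in 10,000 exceeds 5 ms, while `bad_run` fails because 20 in 10,000 do. Note that a check like this says nothing about how bad the outliers are, which is why the claim about node outages is stated separately.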

The major support for such claims is Aerospike’s success in selling to the digital advertising market, which is probably second only to high-frequency trading in its low-latency demands. For example, Aerospike’s CMO Monica Pal sent along a link to what apparently is:

Read more

August 12, 2013

Things I keep needing to say

Some subjects just keep coming up. And so I keep saying things like:

Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.

Most generalizations about Hadoop are false. Reasons include:

Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.

Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.

Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)

Read more

August 6, 2013

Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger —  but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and more than a year separates Cloudera’s first support for them from Hortonworks’ Hadoop 2.0 offering.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:  Read more

August 4, 2013

Data model churn

Perhaps we should remind ourselves of the many ways data models can be caused to churn. Here are some examples that are top-of-mind for me. They do overlap a lot, and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I’ve written about dynamic schemas for the past several years.

Just to confuse things further — some of these examples show the importance of RDBMS, while others highlight the relational model’s limitations.

The old standbys

Product and service changes. Simple changes to your product line may not require any changes to the databases recording their production and sale. More complex product changes, however, probably will.

A big help in MCI’s rise in the 1980s was its new Friends and Family service offering. AT&T couldn’t respond quickly, because it couldn’t get the programming done, where by “programming” I mainly mean database integration and design. If all that was before your time, this link seems like a fairly contemporaneous case study.

Organizational changes. A common source of hassle, especially around databases that support business intelligence or planning/budgeting, is organizational change. Kalido’s whole business was based on accommodating that, last I checked, as were a lot of BI consultants’. Read more

July 31, 2013

“Disruption” in the software industry

I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂

You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:

In response (this is the Innovator’s Solution part):

But not all cleverness is “disruption”.

Here are some of the examples that make me think of the whole subject. Read more
