Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

Any subcategory
Database diversity
Explicit support for specific data types
(in Text Technologies) Text search

January 15, 2013

Tokutek update

Alternate title: TokuDB updates 🙂

Now that I’ve addressed some new NewSQL entrants, namely NuoDB and GenieDB, it’s time to circle back to some more established ones. First up are my clients at Tokutek, about whom I recently wrote:

Tokutek turns a performance argument into a functionality one. In particular, Tokutek claims that TokuDB does a much better job than alternatives of making it practical for you to update indexes at OLTP speeds. Hence, it claims to do a much better job than alternatives of making it practical for you to write and execute queries that only make sense when indexes (or other analytic performance boosts) are in place.

That’s all been true since I first wrote about Tokutek and TokuDB in 2009. However, TokuDB’s technical details have changed. In particular, Tokutek has deemphasized the ideas that:

Vaguely justified the “fractal” metaphor, namely …
… the stuff in that post about having one block each sized for each power of 2, …
… which seem to be a form of what is more ordinarily called “cache-oblivious” technology.

Rather, Tokutek’s new focus for getting the same benefits is to provide a separate buffer for each node of a b-tree. In essence, Tokutek is taking the usual “big blocks are better” story and extending it to indexes. TokuDB also uses block-level compression. Notes on that include: Read more
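As an aside on the per-node buffer idea, here is a minimal Python sketch of a two-level tree whose internal node queues pending inserts and pushes them down to its leaves in bulk once the buffer fills. It illustrates the general buffered-tree technique only and is not TokuDB's actual implementation; the class names, the `buffer_size` parameter, and the two-level layout are all assumptions made for the example.

```python
import bisect


class Leaf:
    """A leaf holding sorted keys; `writes` counts simulated block writes."""

    def __init__(self):
        self.keys = []
        self.writes = 0

    def apply(self, batch):
        for key in batch:
            bisect.insort(self.keys, key)
        self.writes += 1  # the whole batch costs one block write


class BufferedNode:
    """Hypothetical internal node that buffers inserts before touching leaves.

    Assumption for illustration: `pivots` split the key space among the
    leaves, and the buffer is flushed once it holds `buffer_size` keys.
    """

    def __init__(self, pivots, buffer_size=4):
        self.pivots = pivots
        self.children = [Leaf() for _ in range(len(pivots) + 1)]
        self.buffer = []
        self.buffer_size = buffer_size

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Group buffered keys by destination leaf, then write each leaf once.
        batches = {}
        for key in self.buffer:
            idx = bisect.bisect_right(self.pivots, key)
            batches.setdefault(idx, []).append(key)
        for idx, batch in batches.items():
            self.children[idx].apply(batch)
        self.buffer = []

    def contains(self, key):
        # A lookup must check the buffer as well as the destination leaf.
        if key in self.buffer:
            return True
        leaf = self.children[bisect.bisect_right(self.pivots, key)]
        pos = bisect.bisect_left(leaf.keys, key)
        return pos < len(leaf.keys) and leaf.keys[pos] == key
```

In this toy version, four inserts that would each cost a leaf write in an unbuffered tree are grouped into one write per affected leaf. The price is that lookups have to consult the buffer too.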

Categories: Akiban, Database compression, Market share and customer counts, NewSQL, Tokutek and TokuDB

7 Comments

January 12, 2013

Introduction to NuoDB

NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:

SQL.
Transactions.
Very flexible topology, including:
- Local replicas.
- Remote replicas.
Easy deployment and management.

Categories: Cache, Cloud computing, Clustering, Database compression, NewSQL, NuoDB

5 Comments

January 5, 2013

NewSQL thoughts

I plan to write about several NewSQL vendors soon, but first here’s an overview post. Like “NoSQL”, the term “NewSQL” has an identifiable, recent coiner — Matt Aslett in 2011 — yet a somewhat fluid meaning. Wikipedia suggests that NewSQL comprises three things:

OLTP- (OnLine Transaction Processing)/short-request-oriented SQL DBMS that are newer than MySQL.
Innovative MySQL engines.
Transparent sharding systems that can be used with, for example, MySQL.

I think that’s a pretty good working definition, and will likely remain one unless or until:

SQL-oriented and NoSQL-oriented systems blur indistinguishably.
MySQL (or PostgreSQL) laps the field with innovative features.

To date, NewSQL adoption has been limited.

NewSQL vendors I’ve written about in the past include Akiban, Tokutek, CodeFutures (dbShards), Clustrix, Schooner (Membrain), VoltDB, ScaleBase, and ScaleDB, with GenieDB and NuoDB coming soon.
But I’m dubious whether, even taken together, all those vendors have as many customers or production references as any of 10gen, Couchbase, DataStax, or Cloudant.*

That said, the problem may lie more on the supply side than in demand. Developing a competitive SQL DBMS turns out to be harder than developing something in the NoSQL state of the art.

Categories: Akiban, Cloudant, Clustering, Clustrix, Couchbase, DataStax, dbShards and CodeFutures, Market share and customer counts, MySQL, NewSQL, NoSQL, OLTP, Oracle, Parallelization, PostgreSQL, ScaleBase, ScaleDB, Schooner Information Technology, Tokutek and TokuDB, Transparent sharding, VoltDB and H-Store

19 Comments

January 5, 2013

Data(base) virtualization — a terminological mess

Data/database virtualization seems to be a hot subject right now, and vendors of a broad variety of different technologies are all claiming to be in the space. A terminological mess has ensued, as Monash’s First and Third Laws of Commercial Semantics are borne out in spades.

If something is like “virtualization”, then it should resemble hypervisors such as VMware. To me:

The core feature of a hypervisor is that it allows many somethings to run and coexist where ordinarily only one something would come into play. Here the “many somethings” are virtual machines and what’s going on inside them, and the “one something” is the ordinary operating system/hardware computing stack.
A core feature of original VMware was that the “many somethings” could be quite different — for example, the operating environments of numerous different hardware systems you wanted to decommission, or of new systems that you didn’t want to buy quite yet.
Important features of hypervisors include:
- The ability to have multiple virtual machines run side by side at once, safely.
- Flexible and powerful workload management if the virtual machines do contend for resources.
- Easy management.
- The negative feature of having sufficiently low overhead.

Anything that claims to be “like virtualization” should be viewed in that light. Read more

Categories: Clustering, Data integration and middleware, ScaleDB, Theory and architecture, Transparent sharding

5 Comments

December 13, 2012

Spark, Shark, and RDDs — technology notes

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
There’s an approach to launching tasks quickly (in roughly 5 milliseconds) that I unfortunately didn’t grasp.

The key concept here seems to be the RDD. Any one RDD:

Is a collection of Java objects, which should have the same or similar structure.
Can be partitioned/distributed and shuffled/redistributed across the cluster.
Doesn’t have to be entirely in memory at once.

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

At the moment, RDDs expire at the end of a job.
This restriction will be lifted in a future release.
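To make the partitioning and chaining of primitives more concrete, here is a toy Python sketch of a partitioned in-memory dataset with map, group-by, and reduce. It is not Spark's API; the `ToyRDD` class and its method names are hypothetical, everything runs in one process, and the "shuffle" is just a hash of the key modulo the partition count.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Hypothetical in-memory stand-in for a partitioned dataset (not Spark)."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions):
        parts = [[] for _ in range(num_partitions)]
        for i, item in enumerate(items):
            parts[i % num_partitions].append(item)  # round-robin placement
        return cls(parts)

    def map(self, fn):
        # Applied to each partition independently, so it could run in parallel.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def group_by_key(self):
        # "Shuffle": send each (key, value) pair to the partition owning the key.
        n = len(self.partitions)
        shuffled = [{} for _ in range(n)]
        for part in self.partitions:
            for key, value in part:
                shuffled[hash(key) % n].setdefault(key, []).append(value)
        return ToyRDD([list(d.items()) for d in shuffled])

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partials.
        partials = [_reduce(fn, part) for part in self.partitions if part]
        return _reduce(fn, partials)

    def collect(self):
        return [x for part in self.partitions for x in part]


words = ToyRDD.from_items(["a", "b", "a", "c", "b", "a"], num_partitions=3)
counts = (words.map(lambda w: (w, 1))
               .group_by_key()
               .map(lambda kv: (kv[0], sum(kv[1]))))
print(sorted(counts.collect()))  # [('a', 3), ('b', 2), ('c', 1)]
```

The operations can be chained in more or less any order because each one returns a new partitioned dataset rather than writing results to disk.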

Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration

11 Comments

December 13, 2012

Introduction to Spark, Shark, BDAS and AMPLab

UC Berkeley’s AMPLab is working on a software stack that:

Is meant (among other goals) to improve upon Hadoop …
… but also to interoperate with it, and which in fact …
… uses significant parts of Hadoop.
Seems to have the overall name BDAS (Berkeley Data Analytics System).

The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).

Specific projects of note in all that include:

Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
Spark, a replacement for MapReduce and the associated execution stack.
Shark, a replacement for Hive.

Categories: ClearStory Data, Databricks, Spark and BDAS, Hadoop, MapReduce, Parallelization, Specific users, SQL/Hadoop integration

11 Comments

December 12, 2012

Some trends that will continue in 2013

I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.

Trends that I think will continue in 2013 include:

Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.

Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.

Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).

Newer BI (business intelligence) interfaces. Advanced visualization (e.g. Tableau or QlikView) and mobile BI are both hot. So, more speculatively, are “social” BI interfaces.

Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.

MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.

Categories: Business intelligence, Hadoop, MySQL, NewSQL, NoSQL, Open source, Oracle, Pricing, Software as a Service (SaaS), Surveillance and privacy

3 Comments

December 2, 2012

Are column stores really better at compression?

A consensus has evolved that:

Columnar compression (i.e., value-based compression) compresses better than block-level compression (i.e., compression of bit strings).
Columnar compression can be done pretty well in row stores.

Still somewhat controversial is the claim that:

Columnar compression can be done even better in column stores than in row-based systems.

A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)

One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.

In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. Read more
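For readers who want a concrete picture of value-based compression, here is a minimal Python sketch that dictionary-encodes one column and then run-length-encodes the resulting codes. It is an illustration under stated assumptions, not any vendor's scheme; the function names and the sample rows are invented for the example.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []
    index = {}
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse runs of equal codes into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def decode(dictionary, runs):
    """Invert both steps and recover the original column values."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical rows as a row store would hold them: (country, status).
rows = [("US", "open"), ("US", "open"), ("US", "closed"),
        ("DE", "closed"), ("DE", "closed")]

# A row store must first pull the column's values out of each row;
# a column store already keeps them contiguous.
country = [row[0] for row in rows]

dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)  # runs == [[0, 3], [1, 2]]
assert decode(dictionary, runs) == country
```

The extraction step (`country = [row[0] for row in rows]`) is the part a column store gets for free, which is one way to read the "finding the values in a column is easier" argument.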

Categories: Columnar database management, Database compression, Databricks, Spark and BDAS, In-memory DBMS, Netezza

10 Comments

November 29, 2012

Notes on Microsoft SQL Server

I’ve been known to gripe that covering big companies such as Microsoft is hard. Still, Doug Leland of Microsoft’s SQL Server team checked in for phone calls in August and again today, and I think I got enough to be worth writing about, albeit at a survey level only.

Subjects I’ll mention include:

Hadoop
Parallel Data Warehouse
PolyBase
Columnar data management
In-memory data management (Hekaton)

One topic I can’t yet comment about is MOLAP/ROLAP, which is a pity; if anybody can refute my claim that ROLAP trumps MOLAP, it’s either Microsoft or Oracle.

Microsoft’s slides mentioned Yahoo refining a 6 petabyte Hadoop cluster into a 24 terabyte SQL Server “cube”, which was surprising in light of Yahoo’s history as an Oracle reference.

Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Hadoop, Hortonworks, In-memory DBMS, MapReduce, Market share and customer counts, Microsoft and SQL*Server, Oracle, Yahoo

10 Comments

November 19, 2012

Couchbase 2.0

My clients at Couchbase checked in.

After multiple delays, Couchbase 2.0 is well into beta, with general availability being delayed by the holiday season as much as anything else.
Couchbase (the company) now has >350 subscription customers, almost all for Couchbase (the product) — which is to say for what was known as Membase, which is basically a persistent version of Memcached.
There also are many users of open source Couchbase, most famously LinkedIn.
Orbitz is a much-mentioned flagship paying Couchbase customer.
Couchbase customers mainly seem to be replacing a caching layer, Memcached or otherwise.
Couchbase headcount is just under 100.

The big changes in Couchbase 2.0 versus the previous (1.8.x) version are:

JSON storage, including secondary indexes.
Multi-data-center replication.
A back-end change from SQLite to a heavily forked version of CouchDB, called Couchstore.

Couchbase 2.0 is upwards-compatible with prior versions of Couchbase (and hence with Memcached), but not with CouchDB.
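As a rough illustration of what JSON storage with a secondary index adds to a plain key-value cache, here is a toy Python sketch. It is not the Couchbase API; the `ToyDocumentStore` class, its method names, and the single indexed field are assumptions made for the example, and indexed values are assumed to be hashable scalars.

```python
import json
from collections import defaultdict


class ToyDocumentStore:
    """Hypothetical key-value store of JSON documents with one secondary index."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.docs = {}                 # primary key -> JSON text
        self.index = defaultdict(set)  # field value -> set of primary keys

    def put(self, key, document):
        self.delete(key)  # drop any stale index entry for this key first
        self.docs[key] = json.dumps(document)
        if self.indexed_field in document:
            self.index[document[self.indexed_field]].add(key)

    def delete(self, key):
        raw = self.docs.pop(key, None)
        if raw is None:
            return
        old = json.loads(raw)
        if self.indexed_field in old:
            value = old[self.indexed_field]
            self.index[value].discard(key)
            if not self.index[value]:
                del self.index[value]

    def get(self, key):
        # Plain key-value lookup, as a cache would offer.
        raw = self.docs.get(key)
        return None if raw is None else json.loads(raw)

    def find_by_field(self, value):
        # Lookup by document content, which a pure key-value cache cannot do.
        return sorted(self.index.get(value, set()))
```

The point of the sketch is the extra bookkeeping: every write has to keep the index consistent with the stored document, which a memcached-style cache never has to do.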

Technology notes on Couchbase 2.0 include: Read more

Categories: Basho and Riak, Cache, Cassandra, Clustering, Couchbase, MapReduce, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents

5 Comments
