Memory-centric data management
Analysis of technologies that manage data entirely or primarily in random-access memory (RAM). Related subjects include:
- Oracle TimesTen
- solidDB
- QlikTech
- SAP’s BI Accelerator
- Exasol
- Solid-state memory as a replacement for disk
Thoughts on in-memory columnar add-ons
Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:
- Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
- … because it makes sense.
- The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
- … the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
- Notwithstanding certain merits of this approach, I don’t believe it amounts to a complete alternative to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.
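To make the columns+pointers point concrete, here is a deliberately naive Python sketch, my own illustration rather than anybody's actual implementation, of how a secondary index is already most of the way to being a column, and how keeping one per column amounts to a crude column store:

```python
# Toy illustration only: a secondary index is essentially (column value, row pointer)
# pairs, which is most of the way to a column store. All names here are made up.

rows = [
    {"id": 1, "region": "East", "amount": 100},
    {"id": 2, "region": "West", "amount": 250},
    {"id": 3, "region": "East", "amount": 175},
]

# A classic secondary index on one column: sorted (value, row_id) pairs.
region_index = sorted((r["region"], r["id"]) for r in rows)

# "Index every column" and you effectively have a column store.
column_store = {
    col: [(r[col], r["id"]) for r in rows]
    for col in ("id", "region", "amount")
}

# An analytic query then scans only the columns it needs.
east_ids = {rid for val, rid in column_store["region"] if val == "East"}
total = sum(val for val, rid in column_store["amount"] if rid in east_ids)
print(total)  # 275
```

Real products of course add compression, late materialization and so on; the sketch is only meant to show the structural resemblance.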
Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:
- I argued back in 2011 that traditional databases will wind up in RAM, basically because …
- … Moore’s Law will make it ever cheaper to store them there.
- Still, cheaper != cheap, so this is a technology only to use with your most valuable data — i.e., that transactional stuff.
- These are very tabular technologies, without much in the way of multi-structured data support.
Layering of database technology & DBMS with multiple DMLs
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
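To fix ideas, here is a vastly oversimplified Python sketch of that trinity (mine, with made-up structures), in which each layer's only knowledge of the others is the data structure it hands to the next; that hand-off is precisely what makes mix-and-match layering plausible:

```python
# Vastly oversimplified sketch of the parser -> planner -> executor pipeline.
# Each layer sees only the output of the previous one.

def parse(sql):
    # Pretend parser: handles only "SELECT <col> FROM <table>".
    _, col, _, table = sql.split()
    return {"op": "select", "column": col, "table": table}

def plan(ast, catalog):
    # Pretend planner: always chooses a full scan (no indexes in this toy).
    return {"op": "scan", "table": catalog[ast["table"]], "column": ast["column"]}

def execute(plan_node):
    # Pretend execution engine: runs the plan against in-memory "storage".
    return [row[plan_node["column"]] for row in plan_node["table"]]

catalog = {"t": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}
print(execute(plan(parse("SELECT a FROM t"), catalog)))  # [1, 3]
```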
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others. (A sketch of the pushdown idea follows this list.)
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
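Here, as promised a couple of bullets up, is a hypothetical sketch of the predicate-pushdown split. The point is simply what has to cross the boundary between the two layers; none of this is any particular vendor's code:

```python
# Hypothetical illustration of predicate pushdown between a "database" layer
# and a smart "storage" layer. What matters is how much data crosses the boundary.

STORAGE = [{"id": i, "amount": i * 10} for i in range(100_000)]

def storage_scan_dumb():
    # Dumb storage: ship every row to the database layer and filter there.
    return STORAGE

def storage_scan_pushdown(predicate):
    # Smart storage: apply the predicate locally, ship only qualifying rows.
    return [row for row in STORAGE if predicate(row)]

# The database layer wants rows with amount > 999_980.
pred = lambda row: row["amount"] > 999_980

shipped_dumb = storage_scan_dumb()
shipped_smart = storage_scan_pushdown(pred)
print(len(shipped_dumb), len(shipped_smart))  # 100000 vs. 1
```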
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
Cloudera Hadoop strategy and usage notes
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search.
- “Math”, which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more
Aerospike 3
My clients at Aerospike are coming out with their Version 3 and, as several of my clients do, have encouraged me to front-run what otherwise would be the Monday embargo.
I encourage such behavior with arguments including:
- “Nobody else is going to write in such technical detail anyway, so they won’t mind.”
- “I’ve done this before. Other writers haven’t complained.”
- “In fact, some other writers like having me go first, so that they can learn from and/or point to what I say.”
- “Hey, I don’t ask for much in the way of exclusives, but I’d be pleased if you threw me this bone.”
Aerospike 2’s value proposition, let us recall, was:
… performance, consistent performance, and uninterrupted operations …
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds. (A sketch of how one might check such figures follows this list.)
- Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
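Claims like those are at least straightforward to sanity-check. Below is a rough sketch using Aerospike's Python client; the host, namespace, set and bin names are all made up for illustration, and I'm certainly not vouching for any particular numbers:

```python
import time
import aerospike  # Aerospike's Python client

# Host, namespace and set names below are illustrative, not a recommendation.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

latencies = []
for i in range(10_000):
    key = ("test", "demo", "user%d" % i)
    start = time.perf_counter()
    client.put(key, {"visits": i})            # single-record write
    latencies.append(time.perf_counter() - start)

latencies.sort()
p999 = latencies[int(0.999 * len(latencies)) - 1]
print("99.9th percentile write latency: %.3f ms" % (p999 * 1000))
client.close()
```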
The major support for such claims is Aerospike’s success in selling to the digital advertising market, which is probably second only to high-frequency trading in its low-latency demands. For example, Aerospike’s CMO Monica Pal sent along a link to what apparently is:
- a video by a customer named Brightroll …
- … who enjoy SLAs (Service Level Agreements) such as those cited above (they actually mentioned five 9s)* …
- … at peak loads of 10-12 million requests/minute.
Things I keep needing to say
Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)
The refactoring of everything
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
Webinar Wednesday, June 26, 1 pm EST — Real-Time Analytics
I’m doing a webinar Wednesday, June 26, at 1 pm EST/10 am PST called:
Real-Time Analytics in the Real World
The sponsor is MemSQL, one of my numerous clients to have recently adopted some version of a “real-time analytics” positioning. The webinar sign-up form has an abstract that I reviewed and approved … albeit before I started actually outlining the talk. 😉
Our plan is:
- I’ll review the multiple technologies and use cases that various companies call “real-time analytics”. I’m not planning for this part to be at all MemSQL-focused.*
- MemSQL will review some specific use cases they feel their product — memory-centric scale-out RDBMS — has proven it supports.
*MemSQL is debuting pretty high in my rankings of content sponsors who are cool with vendor neutrality. I sent them a draft of my slides mentioning other tech vendors and not them, and they didn’t blink.
In other news, I’ll be in California over the next week. Mainly I’ll be visiting clients — and 2 non-clients and some family — 10:00 am through dinner, but I did set aside time to stop by GigaOm Structure on Wednesday. I have sniffles/cough/other stuff even before I go. So please don’t expect a lot of posts until I’ve returned, rested up a bit, and also prepared my webinar deck.
Introduction to Deep Information Sciences and DeepDB
I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:
- DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
- DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
- DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
- Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk. (A generic sketch of the append-only pattern follows below.)
- DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
- The Deep guys have plans and designs for scale-out — transparent sharding and so on.
*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.
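For clarity on the append-only point above, here is a generic Python sketch of the pattern, sequential writes plus a separately maintained index. It illustrates the general idea only, not Deep's actual design:

```python
# Generic sketch of an append-only write path with a separately maintained index.
# This is the general pattern, not DeepDB's (or anyone else's) actual design.

class AppendOnlyStore:
    def __init__(self):
        self.log = []      # sequential "disk" writes: (key, value) in arrival order
        self.index = {}    # key -> position of the latest version in the log

    def put(self, key, value):
        self.log.append((key, value))          # cheap sequential append
        self.index[key] = len(self.log) - 1    # index points at the newest version

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

store = AppendOnlyStore()
store.put("k1", "v1")
store.put("k1", "v2")   # an "update" is just another append
print(store.get("k1"))  # v2; old versions linger in the log until compaction
```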
Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:
- Akiban definitely does. (Note: Stay tuned for some next-steps company news about Akiban.)
- Tokutek has planted a small stake there too.
- Key-value-store-backed NuoDB and GenieDB probably lean that way. (And SanDisk evidently shut down Schooner’s RDBMS while keeping its key-value store.)
- VoltDB, Clustrix, ScaleDB and MemSQL seem more strictly tabular, except insofar as text search is a requirement for everybody. (Edit: Oops; I forgot about Clustrix’s approach to JSON support.)
Edit: MySQL has some sort of an optional NoSQL interface, and hence so presumably do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.
Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.
Some notes on new-era data management, March 31, 2013
Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Performance confusion
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use. (A sketch follows this list.)
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
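MongoDB illustrates that flexibility nicely. Here is a sketch using PyMongo; it assumes a three-node replica set named rs0, current PyMongo spellings, and made-up collection names:

```python
from pymongo import MongoClient, WriteConcern

# Assumes a three-node replica set named "rs0" on localhost; names are illustrative.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.test

# Loose: acknowledged once it's in one node's RAM; disk and replicas catch up later.
loose = db.get_collection("events", write_concern=WriteConcern(w=1, j=False))

# The increasingly common middle ground: synchronously in RAM on 2+ nodes,
# asynchronously to disk on each of them.
replicated = db.get_collection("events", write_concern=WriteConcern(w=2, j=False))

# Strictest of these three: flushed to the primary's on-disk journal before the ack.
durable = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))

for coll in (loose, replicated, durable):
    coll.insert_one({"type": "click", "concern": str(coll.write_concern.document)})
```

Benchmark numbers produced under one of those settings obviously say little about performance under another, which is much of my point.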
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included: Read more
Platfora at the time of first GA
Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.
In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.
Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more
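If the distinction sounds fuzzy, “memory-centric” here means the hot working set lives in RAM while the full data mart spills to disk. A crude Python sketch of the general pattern (mine, emphatically not Platfora's code):

```python
from collections import OrderedDict
import shelve

# Crude sketch of "memory-centric, not purely in-memory": a bounded RAM cache
# in front of a disk-resident store. Illustration only, not Platfora's design.

class MemoryCentricStore:
    def __init__(self, path, ram_slots):
        self.disk = shelve.open(path)   # everything lives on disk
        self.ram = OrderedDict()        # the hot subset lives in RAM
        self.ram_slots = ram_slots

    def put(self, key, value):
        self.disk[key] = value
        self._touch(key, value)

    def get(self, key):
        if key in self.ram:             # RAM hit: fast path
            self.ram.move_to_end(key)
            return self.ram[key]
        value = self.disk[key]          # RAM miss: fetch from disk, then cache
        self._touch(key, value)
        return value

    def _touch(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_slots:
            self.ram.popitem(last=False)   # evict the least recently used entry

store = MemoryCentricStore("marts.db", ram_slots=2)
for i in range(5):
    store.put("agg%d" % i, {"sum": i})
print(store.get("agg0"))  # served from disk, then cached back into RAM
```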