Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

Any subcategory
Database diversity
Explicit support for specific data types
(in Text Technologies) Text search

November 1, 2012

Notes and comments — October 31, 2012

Time for another catch-all post. First and saddest — one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples — even if you’ve never heard of him before, I see that Dan ‘s 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.

Otherwise, in no particular order:

1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.

2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.

The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.

Categories: Cloudera, Data integration and middleware, Hadoop, HBase, Informatica, Metamarkets and Druid, MongoDB, NoSQL, Open source, Telecommunications

2 Comments

October 24, 2012

Quick notes on Impala

Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.

In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:

Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
Unlike Hive, Impala does not work through Hadoop MapReduce.
Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).

Beyond that: Read more

Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration

6 Comments

October 23, 2012

Introduction to Platfora

When I wrote last week that I have at least 5 clients claiming they’re uniquely positioned to support BI over Hadoop (most of whom partner with a 6th client, Tableau) the non-partnering exception I had in mind was Platfora, Ben Werther’s oh-so-stealthy startup that is finally de-stealthing today. Platfora combines:

An interesting approach to analytic data management.
Business intelligence tools integrated with that.

The whole thing sounds like a perhaps more general and certainly non-SaaS version of what Metamarkets has been offering for a while.

The Platfora technical story starts: Read more

Categories: Business intelligence, Columnar database management, Data models and architecture, Data warehousing, Database compression, Hadoop, Memory-centric data management, Platfora

6 Comments

October 17, 2012

Hadoop/RDBMS integration: Aster SQL-H and Hadapt

Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:

You can dump any kind of data you want into Hadoop’s file system.
You can have data in a scale-out RDBMS to get good performance on analytic SQL.
You can access all the data (not just the relationally stored part) via SQL.
You can do MapReduce on all the data (not just the Hadoop-stored part).
To varying degrees, Hadapt and Aster each offer three kinds of advantage over Hadoop-with-Hive:
- SQL performance is (much) better.
- SQL functionality is better.
- At least some of your employees — the “business analysts” — can invoke MapReduce processes through SQL, if somebody else (e.g. your techies or the vendor’s) coded them up in the first place.

Of course, there are plenty of differences. Those start: Read more

Categories: Aster Data, Hadapt, Hadoop, Pricing, SQL/Hadoop integration, Teradata

5 Comments

October 17, 2012

The Teradata Aster Big Analytics Aster/Hadoop appliance

My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:

You can fill a rack with nodes either for the Aster DBMS or for Hadoop (Hortonworks flavor), or you can combine them in the same box.
If you combine them, they share management software (adapted from mainstream Teradata’s) and Infiniband.
An Aster node has 16 2.6-gigahertz cores and 24 900GB disk drives.
A Hadoop node has 12 2.0-gigahertz cores and 12 3TB drives.
A central part of Teradata’s strategy is that Aster and Hadoop nodes can work together via SQL-H.
The Teradata Aster Big Analytics Appliance is based on a family of Dell servers that fit more compactly into racks than do Teradata’s traditional products.
The Teradata Aster Big Analytics Appliance replaces a previous interim Teradata Aster appliance that used similar hardware to that in other Teradata systems.

My views on the Teradata Aster Big Analytics Appliance start: Read more

Categories: Aster Data, Hadapt, Pricing, SQL/Hadoop integration, Teradata

3 Comments

October 16, 2012

Hadapt Version 2

My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:

A very tight integration between an RDBMS-based analytic platform and Hadoop …
… that is decidedly immature as an analytic RDBMS …
… but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).

Solr is in the mix as well.

Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”– and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:

Dump multi-structured data into Hadoop.
Refine or just move some of it into an RDBMS.
Bring in data from other RDBMS.
Process of all the above via Hadoop MapReduce.
Process of all the above via SQL.
Use full-text indexes on the data.

Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.

At the highest level, Hadapt works like this: Read more

Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text

4 Comments

October 12, 2012

(Relational) database (management system) — three analytic glossary draft entries

These are three closely-related draft entries for the DBMS2 analytic glossary. Please comment with any ideas you have for their improvement!

1. Database management system (DBMS)

In our definition, a database management system (DBMS) is:

Software that manages the reading and writing of data …
… through an application programming interface (API) …
… that depends solely upon the values of the data and similar logical information.

Commonly, that API takes the form of a data manipulation language (DML) such as SQL or MDX, but our definition allows for APIs as simple as those of key-value stores.

There are two major alternatives to our definition:

The above could be a definition of “data management software”, with the term “DBMS” reserved for systems with a true DML.
Many vendors and industry observers abbreviate “database management system” or “data management software” as “database”.

Two important distinctions among categories of DBMS and the processing they’re optimized for are:

Relational vs. non-relational
Short-request vs. analytic vs. general-purpose

2. Database

The term database has two common meanings in IT: Read more

Categories: Analytic glossary, Data models and architecture

3 Comments

October 1, 2012

Notes on the Oracle OpenWorld Sunday keynote

I’m not at Oracle OpenWorld, but as usual that won’t keep me from commenting. My bottom line on the first night’s announcements is:

At many large enterprises, Oracle has a lock on much of their IT efforts. (But not necessarily in the internet or investigative analytics areas.) Tonight’s announcements serve to strengthen that.
Tonight’s announcements do little to help Oracle in other market segments.

In particular:

1. At the highest level, my view of Oracle’s strategy is the same as it’s been for several years:

Clayton Christensen’s The Innovator’s Solution teaches us that Oracle should focus on selling a thick stack of technology to its highest-end customers, and that’s exactly what Oracle does focus on.

2. Tonight’s news is closely in line with what Oracle’s Juan Loaiza told me three years ago, especially:

Oracle thinks flash memory is the most important hardware technology of the decade, one that could lead to Oracle being “bumped off” if they don’t get it right.

Juan believes the “bulk” of Oracle’s business will move over to Exadata-like technology over the next 5-10 years. Numbers-wise, this seems to be based more on Exadata being a platform for consolidating an enterprise’s many Oracle databases than it is on Exadata running a few Especially Big Honking Database management tasks.

3. Oracle is confusing people with its comments on multi-tenancy. I suspect:

What Oracle is talking about when it says “multi-tenancy” is more like consolidation than true multi-tenancy.
Probably there are a couple of true multi-tenancy features as well.

4. SaaS (Software as a Service) vendors don’t want to use Oracle, because they don’t want to pay for it.* This limits the potential impact of Oracle’s true multi-tenancy features. Even so: Read more

Categories: Business intelligence, Cloud computing, Columnar database management, Data warehouse appliances, Data warehousing, Exadata, Memory-centric data management, Oracle, Software as a Service (SaaS), Solid-state memory, Storage

9 Comments

September 27, 2012

Hoping for true columnar storage in Oracle12c

I was asked to clarify one of my July comments on Oracle12c,

I wonder whether Oracle will finally introduce a true columnar storage option, a year behind Teradata. That would be the obvious enhancement on the data warehousing side, if they can pull it off. If they can’t, it’s a damning commentary on the core Oracle codebase.

by somebody smart who however seemed to have half-forgotten my post comparing (hybrid) columnar compression to (hybrid) columnar storage.

In simplest terms:

Columnar storage and columnar compression are two different things. The main connections are:
- Columnar storage can make columnar compression more effective.
- In different ways, both technologies reduce I/O.
EMC Greenplum, Teradata Aster, and Teradata Classic are all originally row-based systems that have gone hybrid columnar.
Vertica is an originally column-based system that has gone hybrid columnar.

Categories: Aster Data, Columnar database management, Data warehousing, Database compression, Greenplum, Oracle, Teradata, Vertica Systems

4 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in