Data types
Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Database diversity
Confusion about metadata
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
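To make the data-skipping bullet concrete, here is a minimal Python sketch (made-up data, made-up block sizes) of how per-block value ranges let a scan skip blocks that cannot possibly satisfy a predicate:

```python
# A minimal sketch of data skipping, with made-up data and block sizes:
# each block of rows carries min/max metadata for one column, and a scan
# consults that metadata to skip blocks that cannot possibly match.
blocks = [
    {"min": 1,   "max": 99,  "rows": list(range(1, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def scan_greater_than(threshold):
    hits = []
    for block in blocks:
        if block["max"] <= threshold:   # metadata says nothing here can match
            continue                    # so the block is never read
        hits.extend(value for value in block["rows"] if value > threshold)
    return hits

print(len(scan_greater_than(250)))      # 49: only the last block was scanned
```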
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
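A tiny illustration of that contrast, with made-up values: the relational-style row below is just bare values whose meaning lives in a catalog somewhere else, while the document carries its field names, a form of structural metadata, along with the data.

```python
import json

# The same record, stored two ways (values are made up).

# Relational style: the row is just values; the structural metadata
# (column names, types) lives separately, in the system catalog.
row = ("12345", "Jane Doe", "2013-11-01")

# Document style: the document carries its field names, a form of
# structural metadata, along with the data itself.
doc = json.dumps({"order_id": "12345",
                  "customer": "Jane Doe",
                  "order_date": "2013-11-01"})
print(row, doc)
```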
Related links
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
Categories: Data models and architecture, Hadoop, Structured documents, Surveillance and privacy, Telecommunications
Spark and Databricks
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption. 🙂
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated. 🙂
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
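For a sense of what that looks like in practice, here is a minimal PySpark sketch; the HDFS paths and field positions are hypothetical, and it is meant only to illustrate the in-memory, HDFS-backed model described above:

```python
# A minimal PySpark sketch, assuming hypothetical HDFS paths and field
# positions: persistence stays in HDFS, while Spark caches the parsed
# working set in memory and reuses it across analytic operations.
from pyspark import SparkContext

sc = SparkContext(appName="spark-sketch")

# Spark reads from, and could later write back to, an existing data store.
events = sc.textFile("hdfs:///logs/events/*.tsv")

# Parse into (user_id, amount) pairs and keep the working set in memory.
parsed = (events
          .map(lambda line: line.split("\t"))
          .map(lambda fields: (fields[0], float(fields[2])))
          .cache())

# ETL-like manipulation: aggregate and filter, reusing the cached data
# without rereading HDFS.
spend_per_user = parsed.reduceByKey(lambda a, b: a + b)
big_spenders = spend_per_user.filter(lambda kv: kv[1] > 1000.0)

print(big_spenders.take(10))
```

The same cached dataset could just as well feed machine learning or graph operations, which is the sense in which Spark is a general analytic execution engine rather than a single-purpose tool.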
The games of Watson
IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.
I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype from three or more decades past. For example, the leukemia treatment advisor now hopefully being built with Watson sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:
- MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
- Cyc is connected to the computer science community’s standard unit of bogosity.
Categories: Health care, IBM and DB2, Scientific research, Text
Comments on the 2013 Gartner Magic Quadrant for Operational Database Management Systems
The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:
- I admire the raw research.
- The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
- Some of the details are questionable.
- There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
- The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)
Anyhow: Read more
JSON in Teradata
I coined the term schema-on-need last month. More precisely, I coined it while being briefed on JSON-in-Teradata, which was announced earlier this week, and is slated for availability in the first half of 2014.
The basic JSON-in-Teradata story is as you expect:
- A JSON document is stuck into a relational field. (Oddly, Teradata wasn’t yet sure whether the field would be a BLOB or VARCHAR or something else.) Edit: See Dan Graham’s comment below.
- Fields within the JSON document can be indexed on.
- Those fields can be referenced in SQL statements much as regular Teradata columns can.
- You have to retrieve the whole document. Edit: See Dan Graham’s comment below.
- To avert the performance pain of retrieving the whole document, you can of course copy any particular field into a column of its own. (That’s the schema-on-need part of the story.)
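As a toy illustration of the indexing bullet above, the Python sketch below keeps JSON documents in an opaque field and builds a separate index on one attribute inside them; it implies nothing about Teradata’s actual mechanism or syntax.

```python
import json

# Documents sit in an opaque field; a separate index maps one attribute's
# values to row ids, so lookups on that attribute need not parse every
# document. (Illustrative only; this is not Teradata's mechanism.)
table = {
    101: json.dumps({"sku": "A-17", "qty": 3}),
    102: json.dumps({"sku": "B-02", "qty": 1}),
    103: json.dumps({"sku": "A-17", "qty": 9}),
}

def build_index(rows, attr):
    index = {}
    for row_id, doc in rows.items():
        index.setdefault(json.loads(doc).get(attr), []).append(row_id)
    return index

sku_index = build_index(table, "sku")
print(sku_index["A-17"])    # [101, 103]
```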
JSON virtual columns are referenced a little differently than ordinary physical columns are. Thus, if you materialize a virtual column, you have to change your SQL. If you’re doing business intelligence through a semantic layer, or otherwise have some kind of declarative translation, that’s probably not a big drawback. If you’re coding analytic procedures directly, it still may not be a big drawback — hopefully you won’t reference the virtual column too many times in code before you decide to materialize it instead.
My Bobby McFerrin* imitation notwithstanding, Hadapt illustrates a schema-on-need approach that is slicker than Teradata’s in two ways. First, Hadapt has full SQL transparency between virtual and physical columns. Second, Hadapt handles not just JSON, but anything represented by key-value pairs. Still, like XML before it but more concisely, JSON is a pretty versatile data interchange format. So JSON-in-Teradata would seem to be useful as it stands.
*The singer in the classic 1988 music video Don’t Worry Be Happy. The other two performers, of course, were Robin Williams and Bill Irwin.
Categories: Data models and architecture, Data warehousing, Hadapt, Schema on need, Structured documents, Teradata
Aster 6, graph analytics, and BSP
Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implement it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
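Concretely, vertex-centric BSP looks roughly like the following sketch of connected components via label propagation; the graph is a toy, and the code mirrors the general Pregel/Giraph style rather than anything specific to Teradata Aster:

```python
# A pure-Python sketch of vertex-centric BSP, in the Pregel/Giraph style:
# in each superstep, active vertices send messages to neighbors; after a
# barrier, every vertex folds its inbox into its state. The toy graph and
# the connected-components computation are illustrative only, and nothing
# here reflects Teradata Aster's actual implementation.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}

label = {v: v for v in graph}     # tentative component label per vertex
changed = set(graph)              # every vertex is active in superstep 0

while changed:                    # each loop iteration is one superstep
    # "Send" phase: active vertices push their label to their neighbors.
    inbox = {v: [] for v in graph}
    for v in changed:
        for neighbor in graph[v]:
            inbox[neighbor].append(label[v])
    # Barrier, then "compute" phase: adopt the smallest label seen so far.
    changed = set()
    for v, messages in inbox.items():
        if messages and min(messages) < label[v]:
            label[v] = min(messages)
            changed.add(v)

print(label)    # {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd'}
```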
Suggested use cases are mostly in marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is: Read more
Libraries in Teradata Aster
I recently wrote (emphasis added):
My clients at Teradata Aster probably see things differently, but I don’t think their library of pre-built analytic packages has been a big success. The same goes for other analytic platform vendors who have done similar (generally lesser) things. I believe that this is because such limited libraries don’t do enough of what users want.
The bolded part has been, shall we say, confirmed. As Randy Lea tells it, Teradata Aster sales qualification includes the determination that at least one SQL-MR operator is relevant to the use case. (“Operator” seems to be the word now, rather than “function”.) Randy agreed that some users prefer hand-coding, but believes a large majority would like to push work to data analysts/business analysts who might have strong SQL skills, but be less adept at general mathematical programming.
This phrasing will all be less accurate after the release of Aster 6, which extends Aster’s capabilities beyond the trinity of SQL, the SQL-MR library, and Aster-supported hand-coding.
Randy also said:
- A typical Teradata Aster production customer uses 8-12 of the prebuilt functions (but now they seem to be called operators).
- nPath is used in almost every Aster account. (And by now nPath has morphed into a family of about 5 different things.)
- The Aster collaborative filtering operator is used in almost every account.
- Ditto a/the text operator.
- Several business intelligence vendors are partnering for direct access to selected Teradata Aster operators — mentioned were Tableau, TIBCO Spotfire, and Alteryx.
- I don’t know whether this is on the strength of a specific operator or not, but Aster is used to help with predictive parts failure applications in multiple industries.
And Randy seemed to agree when I put words in his mouth to the effect that the prebuilt operators save users months of development time.
Meanwhile, Teradata Aster has started a whole new library for relationship analytics.
Categories: Application areas, Aster Data, Data warehousing, Predictive modeling and advanced analytics, Teradata, Text
JSON in DB2
There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.
For starters, let’s note that there are at least four strategies IBM could have used.
- Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
- Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
- Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
- Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.
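To make the contrast concrete, here is a tiny sketch of strategy 1 (the whole document as one BLOB-like value) versus strategy 4 (shredding into flat relational rows); the data is made up and no DB2 API is implied:

```python
import json

# Strategy 1 vs. strategy 4 on one made-up document; no DB2 API is implied.
doc = {"cust_id": 7, "name": "Acme", "phones": ["555-0100", "555-0199"]}

# Strategy 1: the whole document is one opaque BLOB-like value per row.
blob_row = (doc["cust_id"], json.dumps(doc).encode("utf-8"))

# Strategy 4: shred the document into flat rows across two relational tables,
# one for scalar attributes and one for the repeating phone numbers.
customer_row = (doc["cust_id"], doc["name"])
phone_rows = [(doc["cust_id"], phone) for phone in doc["phones"]]

print(blob_row)
print(customer_row, phone_rows)
```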
IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:
- Hardcore internet and/or machine-generated data, for example from a website.
- Enterprise data aggregation, for example a “360-degree customer view.”
IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who enjoy exercising their MongoDB skills. Read more
Categories: Data models and architecture, IBM and DB2, MongoDB, NoSQL, pureXML, Structured documents
Schema-on-need
Two years ago I wrote about how Zynga managed analytic data:
Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.
What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:
- Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
- The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
- Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.
That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done. 😉
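A minimal sketch of that lifecycle, in plain Python with made-up field names and no particular DBMS implied: attributes inside a JSON blob behave as virtual columns extracted at query time, until one of them earns promotion to a physical column.

```python
import json

# Rows start as an ordinary key plus one JSON blob; attributes inside the
# blob act as virtual columns, extracted at query time. Field names and
# data are made up, and no particular DBMS is implied.
rows = [
    {"user_id": 1, "payload": json.dumps({"game": "poker", "level": 7})},
    {"user_id": 2, "payload": json.dumps({"game": "farm", "level": 3})},
]

def virtual_column(row, attr):
    """Schema-on-read: parse the blob and pull out one attribute."""
    return json.loads(row["payload"]).get(attr)

# Querying the virtual column parses the blob for every row, every time.
high_levels = [r["user_id"] for r in rows if virtual_column(r, "level") > 5]

# Schema-on-need: once "level" is clearly needed for a while, materialize it
# as a real column, and later queries skip the JSON parsing entirely.
for r in rows:
    r["level"] = virtual_column(r, "level")

high_levels = [r["user_id"] for r in rows if r["level"] > 5]
print(high_levels)    # [1]
```

The promotion step is the “on need” part: queries that used to parse the blob on every row now read an ordinary column.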
Categories: Data models and architecture, Data warehousing, MongoDB, PostgreSQL, Schema on need, Structured documents
Layering of database technology & DBMS with multiple DMLs
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
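The predicate-pushdown split in particular is easy to caricature in a few lines. In the sketch below, with invented names and data that imply no specific product, the storage layer evaluates the pushed-down filter and the database layer aggregates only what comes back:

```python
# A storage layer evaluates the pushed-down predicate close to the data and
# ships back only matching rows; the database layer then works on the much
# smaller result. Names and data are invented; no specific product is implied.
STORAGE = [
    {"order_id": 1, "region": "EU", "amount": 40.0},
    {"order_id": 2, "region": "US", "amount": 75.0},
    {"order_id": 3, "region": "US", "amount": 12.5},
]

def storage_scan(predicate):
    """Storage layer: apply the predicate during the scan."""
    return [row for row in STORAGE if predicate(row)]

def database_sum(rows, column):
    """Database layer: aggregate only the rows the storage layer returned."""
    return sum(row[column] for row in rows)

# Roughly "SELECT SUM(amount) WHERE region = 'US'", with the filter pushed down.
us_rows = storage_scan(lambda row: row["region"] == "US")
print(database_sum(us_rows, "amount"))    # 87.5
```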
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more