Data types

Analysis of data management technology optimized for specific data types, such as text, geospatial, object, RDF, or XML. Related subjects include:

Any subcategory
Database diversity

May 22, 2010

Notes on SciDB and scientific data management

I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.

The main new activity I know of has been in the open source SciDB project. Read more

Categories: Analytic technologies, Data warehousing, eBay, GIS and geospatial, Microsoft and SQL*Server, SciDB, Scientific research, Web analytics

5 Comments

April 8, 2010

Information found in public-facing social networks

Here are some examples illustrating two recent themes of mine, namely:

Easily available information reveals all sorts of things about us.
Graph-based analysis is on the rise.

Pete Warden scraped all of Facebook’s social graph (at least for the United States), and put up a really interesting-looking visualization of same. Facebook’s lawyers came down on him, and he quickly agreed to destroy the data he’d scraped, but also published ideas on how other people could duplicate his work.

Warden has since given an interview in which he outlines some of the things researchers hoped to do with this data: Read more

Categories: Analytic technologies, Facebook, RDF and graphs, Surveillance and privacy

1 Comment

April 5, 2010

Notes on the evolution of OLTP database management systems

The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part). OLTP (OnLine Transaction Processing) and general purpose DBMS startups, however, have not yet done as well, with such success as there has been (MySQL, Intersystems Cache’, solidDB’s exit, etc.) generally accruing to products that originated in the 20th Century.

Nonetheless, OLTP/general-purpose data management startup activity has recently picked up, targeting what I see as some very real opportunities and needs. So as a jumping-off point for further writing, I thought it might be interesting to collect a few observations about the market in one place. These include:

Big-brand OLTP/general-purpose DBMS have more “stickiness” than analytic DBMS.
By number, most of an enterprise’s OLTP/general-purpose databases are low-volume and low-value.
Most interesting new OLTP/general-purpose data management products are either MySQL-based or NoSQL.
It’s not yet clear whether MySQL will prevail over MySQL forks, or vice-versa, or whether they will co-exist.
The era of silicon-centric relational DBMS is coming.
The emphasis on scale-out and reducing the cost of joins spans the NoSQL and SQL-based worlds.
Users’ insistence on “free” could be a major problem for OLTP DBMS innovation.

I shall explain. Read more

Categories: Akiban, Analytic technologies, Business intelligence, Data warehousing, EnterpriseDB and Postgres Plus, Exadata, Market share and customer counts, Memory-centric data management, Mid-range, MySQL, NoSQL, OLTP, Open source, Oracle, PostgreSQL, RDF and graphs, Solid-state memory, VoltDB and H-Store, Web analytics

8 Comments

April 3, 2010

Akiban highlights

Akiban responded quickly to my complaints about its communication style, and I chatted for a couple of hours with senior Akiban techies Ori Herrnstadt, Peter Beaman and Jack Orenstein. It’s still early days for Akiban product development, so some details haven’t been determined yet, and others I just haven’t yet pinned down. Still, I know a lot more than I did a day ago. Highlights of my talk with Akiban included: Read more

Categories: Akiban, MySQL, Object, OLTP, Software as a Service (SaaS)

4 Comments

March 14, 2010

Toward a NoSQL taxonomy

I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:

NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions

Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I’d be happier, however, with at least three parts to the taxonomy:

How data looks logically on a single node
How data is stored physically on a single node
How data is distributed, replicated, and reconciled across multiple nodes, and whether applications have to be aware of how the data is partitioned among nodes/shards. Read more
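
To make the third axis concrete, here is a minimal Python sketch, assuming simple hash-based sharding over in-memory dicts. The names shard_for and ShardedStore are hypothetical and are not taken from MongoDB, Cassandra, or any other product; real systems add replication and reconciliation, which this sketch leaves out. It only shows the case where the application does not have to know how data is partitioned.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and so is not stable across runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Hypothetical key-value store that hides partitioning from callers."""

    def __init__(self, num_shards: int) -> None:
        # Each dict stands in for one node/shard.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Returns None when the key is absent.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

In a partition-aware design, by contrast, the application itself would call something like shard_for and talk to the chosen node directly, which is exactly the distinction the third part of the taxonomy asks about.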

Categories: Cassandra, Data models and architecture, NoSQL, Parallelization, RDF and graphs, Structured documents, Theory and architecture

13 Comments

March 12, 2010

Some NoSQL links

I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more

Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB

5 Comments

February 22, 2010

Aster Data nCluster 4.5

Like Vertica, Netezza, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:

Aster Data Analytic Foundation, a set of analytic packages prebuilt in Aster’s SQL-MapReduce
Aster Data Developer Express, an Eclipse-based IDE (Integrated Development Environment) for developing and testing applications built on Aster nCluster, Aster SQL-MapReduce, and Aster Data Analytic Foundation

And in other Aster news:

Along with the development GUI in Aster nCluster 4.5, there is also a new administrative GUI.
Aster has certified that nCluster works with Fusion-io boards, because at least one retail industry prospect cares. However, that in no way means that arm’s-length Fusion-io certification is Aster’s ultimate solid-state memory strategy.
I had the wrong impression about how far Aster/SAS integration has gotten. So far, it’s just at the connector level.

Aster Data Developer Express evidently does some cool stuff, like providing some sort of parallelism testing right on your desktop. It also generates lots of stub code, saving humans from the tedium of doing that. Useful, obviously.

But mainly, I want to write about the analytic packages. Read more

Categories: Aster Data, Data warehousing, Investment research and trading, Predictive modeling and advanced analytics, RDF and graphs, SAS Institute, Teradata

9 Comments

February 1, 2010

Open issues in database and analytic technology

The last part of my New England Database Summit talk was on open issues in database and analytic technology. This was closely intertwined with the previous section, and also relied on a lot that I’ve posted here. So I’ll just put up a few notes on that part, with lots of linkage to prior discussion of the same points. Read more

Categories: Analytic technologies, Business intelligence, Cloud computing, Data warehousing, Presentations, RDF and graphs, Software as a Service (SaaS), Solid-state memory, Theory and architecture

4 Comments

January 15, 2010

Intersystems Cache’ highlights

I talked with Robert Nagle of Intersystems last week, and it went better than at least one other Intersystems briefing I’ve had. Intersystems’ main product is Cache’, an object-oriented DBMS introduced in 1997 (before that Intersystems was focused on the fourth-generation programming language M, renamed from MUMPS). Unlike most other OODBMS, Cache’ is used for a lot of stuff one would think an RDBMS would be used for, across all sorts of industries. That said, there’s a distinct health-care focus to Intersystems, in that:

MUMPS, the original Intersystems technology, was focused on health care.
The reasons Intersystems went object-oriented have a lot to do with the structure of health-care records.
Intersystems’ biggest and most visible ISVs are in the health-care area.
Intersystems is actually beginning to sell an electronic health records system called TrakCare around the world (but not in the US, where it has lots of large competitive VARs).

Note: Intersystems Cache’ is sold mainly through VARs (Value-Added Resellers), aka ISVs/OEMs. I.e., it’s sold by people who write applications on top of it.

So far as I understand – and this is still pretty vague and apt to be partially erroneous – the Intersystems Cache’ technical story goes something like this: Read more