Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

Any subcategory
Database diversity
Explicit support for specific data types
(in Text Technologies) Text search

June 5, 2011

Hadoop confusion from Forrester Research

Jim Kobielus started a recent post:

Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”

What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.

So far, so good. But I disagree with almost everything Jim wrote after that.

Jim’s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that

Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition

where by “cloud” I presume Jim means first and foremost “private cloud.”

I don’t think any of that matches Hadoop’s actual strengths and weaknesses, whether now or in the 3-7 year future. My reasoning starts:

Hadoop is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool …
… but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won’t close as long as the others remain a moving target.
I don’t think MapReduce is going to fail altogether; it’s too well-suited for too many use cases.
Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, perhaps unless it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.
The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.

As for the rest of Jim’s claim — I see three main candidates for the “nucleus of the next-generation enterprise data warehouse,” each with better claims than Hadoop:

Relational DBMS, much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.
Grand cosmic data integration tools. (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.
Nothing. (The architecture could have several strong members, none of which is truly the “nucleus.”) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.

Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Theory and architecture

9 Comments

June 1, 2011

The essence of an application

Once upon a time, information technology was strictly about — well, information. And by “information” what was meant was “data”.* An application boiled down to a database design, plus a straightforward user interface, in whatever the best UI technology of the day happened to be. Things rarely worked quite as smoothly as the design-database/press-button/generate-UI propaganda would have one believe, but database design was clearly at the center of application invention.

*Not coincidentally, two of the oldest names for “IT” were data processing and management information systems.

Eventually, there came to be three views of the essence of IT:

Data — i.e., the traditional view, still exemplified by IBM and Oracle.
People empowerment — i.e., Microsoft-style emphasis on UI friendliness and efficiency.
Operational workflow — i.e., SAP-style emphasis on actual business processes.

Graphical user interfaces were a major enabling technology for that evolution. Equally important, relational databases made some difficult problems easy(ier), freeing application designers to pursue more advanced functionality.

Based on further technical evolution, specifically in analytic and consumer technologies, I think we should now take that list up to five. The new members I propose are:

Investigative analytics.
Emotional response.

1 Comment

May 30, 2011

Another category of derived data

Six months ago, I argued the importance of derived analytic data, saying

… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:

Aggregates, when they are maintained, generally for reasons of performance or response time.

Calculated scores, commonly based on data mining/predictive analytics.

Text analytics.

The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.

Adjusted data, especially in scientific contexts.

Probably there are yet more examples that I am at the moment overlooking.

Well, I did overlook at least one category. 🙂

A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.

Categories: Data models and architecture, Derived data, Hadoop, MarkLogic

2 Comments

May 29, 2011

When it’s still best to use a relational DBMS

There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. “Ted” Codd himself once suggested that relational DBMS weren’t best for analytics.* Analysis of machine-generated log data doesn’t always have a naturally relational aspect. And I could go on with more examples yet.

*Actually, he didn’t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one — but he was. And he was wrong anyway about the necessity for MOLAP. But let’s overlook those details. 🙂

Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):

Data re-use. Ted Codd’s famed original paper referred to shared data banks for a reason.
The benefits of normalization, which include:
- You only have to do programming work of writing something once …
- … and you don’t have to do the programming work of keeping multiple versions of the information consistent.
- You only have to do processing work of writing something once.
- You only have to buy storage to hold each fact once.
Separation of concerns.
- Different people can worry about programming and “database stuff.”
- Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).
Maturity and momentum, as reflected in the availability of:
- People.
- A broad variety of mature relational DBMS.
- Vast amounts of packaged software that “talks” SQL.
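
The normalization and separation-of-concerns points above can be illustrated with a minimal sketch, using Python's built-in sqlite3 module. The table and column names (customers, orders, city) and the sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3


def build_demo():
    """Create a tiny normalized schema in memory (hypothetical tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'Acme', 'Boston');
        INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5);
    """)
    return conn


def move_customer(conn, customer_id, new_city):
    # The city is stored once, so one UPDATE is the only write needed;
    # there are no duplicate copies to keep consistent.
    conn.execute("UPDATE customers SET city = ? WHERE id = ?",
                 (new_city, customer_id))


def order_cities(conn):
    # Every order picks up the customer's current city through the join.
    return conn.execute("""
        SELECT o.id, c.city
        FROM orders o JOIN customers c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()


def add_index(conn):
    # Performance tuning separated from programming: the query text
    # in order_cities() does not change when an index is added.
    conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")


conn = build_demo()
move_customer(conn, 1, "Chicago")
print(order_cities(conn))
```

In a denormalized design that copied the city onto each order row, the same change would need one write per order, plus application logic to keep the copies in agreement.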

Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as: Read more

Categories: Analytic technologies, Data models and architecture, Database diversity, MOLAP, NoSQL, Object, Theory and architecture

21 Comments

May 23, 2011

Traditional databases will eventually wind up in RAM

In January, 2010, I posited that it might be helpful to view data as being divided into three categories:

Human/Tabular data – i.e., human-generated data that fits well into relational tables or arrays.
Human/Nontabular data — i.e., all other data generated by humans.
Machine-Generated data.

I won’t now stand by every nuance in that post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to dispute:

Traditional database data (records of human transactional activity, referred to as “Human/Tabular data” above) will not grow as fast as Moore’s Law makes computer chips cheaper.

And that point has a straightforward corollary, namely:

It will become ever more affordable to put traditional database data entirely into RAM. Read more
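
The corollary is a matter of compound arithmetic, which a rough sketch can make concrete. Every number below is a hypothetical assumption rather than a figure from the post: a 10 TB database growing 20% per year, and RAM starting at $10 per GB with the price halving every two years.

```python
def ram_cost_over_time(data_tb, data_growth, price_per_gb,
                       price_halving_years, years):
    """Return (year, cost) pairs for holding the whole database in RAM.

    All inputs are illustrative assumptions:
      data_growth          annual growth of the data, e.g. 0.20 for 20%
      price_halving_years  how often the RAM price per GB halves
    """
    costs = []
    for year in range(years + 1):
        size_gb = data_tb * 1024 * (1 + data_growth) ** year
        price = price_per_gb * 0.5 ** (year / price_halving_years)
        costs.append((year, size_gb * price))
    return costs


# Hypothetical scenario: data grows more slowly than RAM gets cheaper.
for year, cost in ram_cost_over_time(10, 0.20, 10.0, 2, 10):
    print(f"year {year:2d}: ${cost:,.0f}")
```

The cost falls every year as long as the annual data growth factor (1.2 here) stays below the annual price improvement factor (the square root of 2, about 1.41, under the two-year halving assumption).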

Categories: Analytic technologies, Cache, In-memory DBMS, memcached, Memory-centric data management, OLTP, Oracle, Oracle TimesTen, SAP AG, solidDB, Storage, Theory and architecture, VoltDB and H-Store

28 Comments

May 18, 2011

Starcounter high-speed memory-centric object-oriented DBMS, coming soon

Since posting recently about Starcounter, I’ve had the chance to actually talk with the company (twice). Hence I know more than before. 🙂 Starcounter:

Has been around as a company since 2006.
Has developed memory-centric object-oriented DBMS technology that has been OEMed by a few application software companies (especially in bricks-and-mortar retailing and in online advertising).
Is planning to actually launch an OODBMS product sometime this summer.
Has 14 employees (most or all of whom are in Sweden, which is also where I think Starcounter’s current customers are centered).
Is planning to shift emphasis soon to the US market.

Starcounter’s value propositions are programming ease (no object/relational impedance mismatch) and performance. Starcounter believes its DBMS has 100X the performance of conventional DBMS at short-request transaction processing, and 10X the performance of other memory-centric and/or object-oriented DBMS (e.g. Oracle TimesTen, or Versant). That said, Starcounter has not yet tested VoltDB. Starcounter does not claim performance much beyond that of disk-based DBMS on analytic tasks such as aggregations.

The key technical aspect to Starcounter is integration between the DBMS and the virtual machine, so that the same copy of the data is accessed by both the DBMS and the application program, without any movement or transformation being needed. (Starcounter isn’t aware of any other object-oriented DBMS that work this way.) Transient and persistent data are handled in the same way, seamlessly.

Other Starcounter technical highlights include: Read more

Categories: Data models and architecture, In-memory DBMS, Memory-centric data management, Object, OLTP, Starcounter, Theory and architecture

3 Comments

May 17, 2011

Terminology: poly-structured data, databases, and DBMS

My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it: “multi-” or “poly-”?

*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me. 🙂

The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.

The definitions I’m proposing are:

A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
Data is poly-structured to the extent that it is best represented in a poly-structured database.
A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.


Categories: Object, Structured documents, Text, Theory and architecture

23 Comments

May 15, 2011

What to do about “unstructured data”

We hear much these days about unstructured or semi-structured (as opposed to structured) data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured: just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.

*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.

Second, a more accurate distinction is not whether a database has one structure or none — it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
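
The "one structure or many" distinction can be sketched in a few lines of Python. This is an illustration only: the sample documents are made up, and structure_signature is a hypothetical helper that treats the nesting of tag names as a stand-in for a schema.

```python
import xml.etree.ElementTree as ET


def structure_signature(element):
    """Describe an element by its tag names and nesting, ignoring values."""
    children = tuple(sorted(structure_signature(child) for child in element))
    return (element.tag, children)


def count_structures(xml_documents):
    """Count the distinct structures found across a set of XML documents."""
    return len({structure_signature(ET.fromstring(doc))
                for doc in xml_documents})


# Hypothetical documents: two share a structure, two do not.
docs = [
    "<order><id>1</id><total>9.5</total></order>",
    "<order><id>2</id><total>20</total></order>",
    "<order><id>3</id><total>5</total><coupon>X1</coupon></order>",
    "<memo><author>pat</author><body>hello</body></memo>",
]
print(count_structures(docs))  # 3
```

Rows in a single relational table would all produce the same signature, because the table definition fixes the columns; a document collection can add a new signature with any insert.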

One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.

*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.

All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more

Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture

19 Comments

May 4, 2011

IBM InfoSphere Warehouse pricing, packaging, compression and more

IBM InfoSphere Warehouse 9.7.3 has been announced, and is planned for general availability late this month. IBM InfoSphere Warehouse is, in essence, DB2-plus, where the “plus” comprises:

DPF (Data Partitioning Feature) — i.e., the ability to do shared-nothing scale-out.
Unimportant add-ons — e.g., a mere 5 seats of the Cognos BI tool.

The main news in this release of InfoSphere Warehouse is probably pricing. While IBM has long had a funky server-power-based pricing scheme, it is now adding per-terabyte pricing, with a twist: IBM InfoSphere Warehouse now can be bought per terabyte of compressed user data. Specifically:

IBM InfoSphere Warehouse 9.7.3 Enterprise Edition can be bought for production for $70K or so per terabyte of compressed user data.
IBM InfoSphere Warehouse 9.7.3 Departmental Edition can be bought for production for $35K or so per terabyte of compressed user data.
Development/test seats of IBM InfoSphere Warehouse cost about $2K per user.
High availability/disaster recovery instances are priced as if they were managing 1 TB each — unless, of course, you have an active-active configuration, in which case they’re priced according to their full amount of data.
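
The pricing rules above amount to simple arithmetic, sketched below. The dollar figures are the approximate ones quoted above ("or so"), not an official price list. Two points are assumptions of this sketch: that a standby instance is charged at the same per-terabyte rate as its edition, and that an active-active instance holds the same amount of data as production. The function and parameter names are hypothetical.

```python
# Approximate list prices per terabyte of compressed user data.
PRICE_PER_TB = {"enterprise": 70_000, "departmental": 35_000}
DEV_TEST_PER_USER = 2_000  # approximate price per development/test seat


def compressed_terabytes(raw_tb, compression_ratio):
    """Pricing is on compressed data, so divide raw size by the ratio."""
    return raw_tb / compression_ratio


def estimate_license_cost(edition, compressed_tb, dev_test_users=0,
                          standby_instances=0, active_active_instances=0):
    per_tb = PRICE_PER_TB[edition]
    production = per_tb * compressed_tb
    # Standby HA/DR instances are priced as if they managed 1 TB each.
    standby = per_tb * 1 * standby_instances
    # Active-active instances are priced on their full amount of data
    # (assumed here to equal the production volume).
    active_active = per_tb * compressed_tb * active_active_instances
    dev_test = DEV_TEST_PER_USER * dev_test_users
    return production + standby + active_active + dev_test


# Hypothetical deployment: 60 TB raw at 3x compression, one standby,
# ten development/test seats, Enterprise Edition.
tb = compressed_terabytes(60, 3)
print(estimate_license_cost("enterprise", tb, dev_test_users=10,
                            standby_instances=1))
```

The sketch makes the twist visible: a better compression ratio lowers the bill directly, since the billable terabytes are counted after compression.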

Per-terabyte pricing is generally a good way to think about analytic DBMS costs, for at least two reasons: Read more

Categories: Data warehousing, Database compression, IBM and DB2, Pricing

1 Comment

May 3, 2011

Oracle and IBM workload management

When last night’s Oracle/Exadata post got too long — and before I knew Oracle would request a different section be cut — I set aside my comments on Oracle’s workload management story to post separately. Elements of Oracle’s workload management story include:

Oracle’s workload management product is called Oracle Database Resource Manager.
Oracle Database Resource Manager has long managed CPU. For Exadata, Oracle added in management of I/O. Management of RAM is coming.
Another aspect of Oracle workload management is “instance caging.” If you’re running multiple instances of Oracle on the same box – e.g. one with 128 cores and thus 256 threads – instance caging can keep an instance confined to a specific number of threads.
Policies can let some classes of user get access to more threads in Oracle Parallel Query than others do.*
Oracle offers a QoS (Quality of Service) layer, at least on Exadata, that tries to use Oracle’s workload management capabilities to enforce SLAs (Service Level Agreements). For example, if you want a certain query to always be answered in no more than 0.3 seconds, it tries to make that happen. However, this technology is new in the current Oracle release, and will be enhanced going forward.

*Recall that “degrees of parallelism” in Oracle Parallel Query can now be set automagically.
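
The idea behind instance caging can be shown with a toy model. This is not Oracle's interface; the function, the instance names, and the cage sizes are all hypothetical, and only the 256-thread figure comes from the example above.

```python
def check_cages(total_threads, cages):
    """Summarize a caging plan.

    cages maps an instance name to the maximum number of threads that
    instance may use. The plan is oversubscribed when the caps add up
    to more threads than the box has.
    """
    for name, cap in cages.items():
        if cap < 1 or cap > total_threads:
            raise ValueError(f"invalid cage for {name}: {cap}")
    allocated = sum(cages.values())
    return {
        "allocated": allocated,
        "spare": total_threads - allocated,
        "oversubscribed": allocated > total_threads,
    }


# A 128-core box with 256 threads, shared by three hypothetical instances.
plan = {"oltp": 128, "reporting": 64, "etl": 32}
print(check_cages(256, plan))
```

A plan whose caps sum to no more than the thread count partitions the box, so no instance can starve another; caps that sum to more than the thread count allow contention whenever the instances peak together.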

One reason I split out this discussion of workload management is that I also talked with IBM’s Tim Vincent yesterday, who added some insight to what I already wrote last August about DB2/InfoSphere Warehouse workload management. Specifically:

DB2/InfoSphere Warehouse workload management has multiple ways to manage use of CPU resources.
DB2/InfoSphere Warehouse workload management doesn’t directly manage consumption of I/O or RAM resources. However, it can influence usage of I/O or RAM by:
- Limiting the number of rows read or returned.
- Adjusting priorities as to which queries get to prefetch the most records.
DB2/InfoSphere Warehouse workload management doesn’t allow you to directly set an SLA mandating query response time. However, if query response times exceed a target SLA, DB2/InfoSphere Warehouse workload management can cause a statistics dump that might help you tune your way out of the problem.
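
The last point describes a threshold-and-react pattern rather than SLA enforcement, which the toy sketch below illustrates. It is not DB2's API: the function names, the query labels, the timings, and the 2-second target are all hypothetical.

```python
def queries_breaching_sla(response_times, target_seconds):
    """Return the queries whose response time exceeded the target."""
    return {query: seconds
            for query, seconds in response_times.items()
            if seconds > target_seconds}


def monitor(response_times, target_seconds, capture_statistics):
    """Trigger a diagnostic capture when any query misses the target.

    Nothing here makes a slow query faster; the breach only causes
    statistics to be collected for later tuning.
    """
    breaches = queries_breaching_sla(response_times, target_seconds)
    if breaches:
        capture_statistics(breaches)
    return breaches


# Hypothetical measurements, in seconds.
observed = {"daily_sales": 1.2, "customer_360": 4.8, "inventory": 0.4}
captured = []
monitor(observed, 2.0, captured.append)
print(captured)
```

The contrast with a QoS layer that tries to enforce a response time is that the reaction here is purely observational.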

Categories: Data warehousing, IBM and DB2, Oracle, Workload management
