May 17, 2011

Terminology: poly-structured data, databases, and DBMS

My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it: “multi-” or “poly-”?

*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me. 🙂

The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.

The definitions I’m proposing are:

A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
Data is poly-structured to the extent that it is best represented in a poly-structured database.
A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.

Please note:

There are many different degrees of being poly-structured; that’s why I used the phrase “to the extent that”, instead of a simple “if”.
And as always, no technology categorization is ever precise.

Examples of poly-structure include:

XML or JSON documents/objects describe themselves. Add a new one to a database with a different structure than the others and — presto! — you have changed the overall structure. Thus:
- XML and JSON data is apt to be poly-structured.
- XML and JSON databases are apt to be poly-structured.
- MarkLogic, MongoDB, et al. are poly-structured DBMS.
A text document is inherently poly-structured. Some queries might look at it as a bag of words; others might group the words via stemming and synonyms; others might actually exploit the document’s grammatical structure. Text search engines are poly-structured because they support all those kinds of queries.
A single log file can be somewhat poly-structured, in that different views of it might extract different kinds of name-value pairs, or different temporal relationships.
A database that seamlessly includes a variety of log files, each with its own structure(s), is quite poly-structured.
A classic relational database is not very poly-structured, because DDL (Data Definition Language) isn’t really in “the ordinary course” of programming or update.
However, views add a bit of poly-structure to relational databases that is not present in, say, IMS databases.
An object-oriented DBMS is highly poly-structured, as is Workday’s internal data store.
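
Two of the examples above can be made concrete with a small sketch. This is only an illustration, not how any particular DBMS works: the sample documents, the log line, and the helper names (`collection_structure`, `name_value_view`, `temporal_view`) are all made up for the purpose. The first part shows a JSON collection whose overall structure changes just because a differently shaped document was added; the second shows two views pulling different structure out of the same log line.

```python
import json
import re


def collection_structure(docs):
    """Return the union of top-level field names across all documents.

    This is a crude stand-in for "the overall structure" of a collection.
    """
    fields = set()
    for doc in docs:
        fields.update(doc.keys())
    return fields


# A hypothetical collection that starts with one shape of document.
docs = [json.loads('{"name": "Ann", "email": "ann@example.com"}')]
before = collection_structure(docs)

# An ordinary insert of a differently shaped document; no DDL involved.
docs.append(
    json.loads('{"name": "Bob", "phones": ["555-0100"], "address": {"city": "Boston"}}')
)
after = collection_structure(docs)

# A hypothetical log line, read two different ways.
LOG_LINE = "2011-05-17T09:11:00 user=morgan action=login status=ok"


def name_value_view(line):
    """View the line as a set of name=value pairs, ignoring the timestamp."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def temporal_view(line):
    """View the line as a timestamped event, ignoring the pair structure."""
    timestamp, _, rest = line.partition(" ")
    return {"timestamp": timestamp, "event": rest}


if __name__ == "__main__":
    print(sorted(before))
    print(sorted(after))
    print(name_value_view(LOG_LINE))
    print(temporal_view(LOG_LINE))
```

The point of the sketch is that nothing resembling a schema change was issued in either case. The structure shifted in the ordinary course of update (the JSON insert) or of query (the two log views).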

So what do you think? Do these definitions work?

Categories: Object, Structured documents, Text, Theory and architecture


Comments

23 Responses to “Terminology: poly-structured data, databases, and DBMS”

Morgan Goeller on May 17th, 2011 9:11 am

I would also consider technologies like data federation, virtualization, and abstraction in your definition.

Technologies like Composite Software, IBM InfoSphere, and others can really change the dynamics of your data ecosphere.
Curt Monash on May 17th, 2011 9:36 am

Morgan,

Federation and abstraction certainly fit into what I called “DBMS2” in the middle of the last decade. But I’m not sure that they tie closely into the definitions in this particular post.
unholyguy on May 17th, 2011 9:37 am

I like this definition overall.

I think there needs to be more emphasis in there somewhere around change over time. XML, JSON, etc. can maintain multiple different versions of the schema simultaneously and still be accessed in the same way. So it is not always a “change it” scenario.
Curt Monash on May 17th, 2011 4:41 pm

unholyguy,

I’m not sure what distinction you’re drawing.

If a database has a bunch of different structures, the structures aren’t entirely physical; there’s something “virtual” about them. So you can have multiple structures at once, different ones of which are invoked/emphasized in different operations.

What I’ve struggled a bit to capture is how this is deeper than the flexibility of the traditional relational model.
Dave Duggal on May 18th, 2011 6:31 am

‘Subject to change’ is a much more powerful concept than just ‘many’.

How about ‘dynamic-structure’, as the structure is virtual and configurable at run-time (which is what we do)? Alternatively, ‘polymorphic-structure’. “Poly” by itself doesn’t really connote ‘change’.
Curt Monash on May 18th, 2011 6:34 am

Dave,

I find the term “polymorphic” a bit pretentious in most of its uses. And it’s easy to oversell flexibility anyhow. When things are flexible, often any one user will do one thing with it and not change all that much going forward (thus using only a small fraction of its power).
Dave Duggal on May 18th, 2011 6:58 am

Understood, but technically polymorphic is a better fit to what you are describing; it’s also a term fairly well understood by the software development / computer science community writ large.

I support your decomposition effort so I’m not trying to nitpick.

In our case, dynamic structure is a function of the system method; it’s canonical, so the flexibility is inherent – every interaction benefits from late-binding for situational awareness.

Ultimately, data serves the business. Static structures constrain variance, preclude context and are the enemy of agility.
Curt Monash on May 18th, 2011 7:54 am

I think the word “polymorphic” has become rather — as it were — overloaded. Hence my reluctance to pile yet more duties on it.

Also, shorter terms are better than longer ones.
Dave Duggal on May 18th, 2011 9:43 am

Makes sense. The thinking is more important than the term, and the term works. Great series of posts.
Duncan Irving on May 19th, 2011 1:04 am

Curt – I like the general thrust of your thoughts and it is an interesting debate. The concept of a dynamically changing framework in data structures or in the DBMS is appropriate and there is scope for more rigour in the naming convention.
My 2p’th:
Poly- and Multi- have the same meaning, being the Greek and Latin prefixes for “many”.
Polymorphic means “many forms” and as Dave points out, is well used in the programming community but doesn’t carry the sense of time-dependence or time-variance.
To express the concept of change, or indeed the potential for change, through time, might I suggest “mutable”? Mutable objects and mutable constructs such as arrays are already well-understood by programmers and there is no ambiguity around the nomenclature either.
Curt Monash on May 19th, 2011 1:16 am

Duncan,

I like the idea of “mutably structured” or something like that. My only objections, really, are:

1. I’m looking to save syllables. (Phonemes, really, but to a first approximation it’s reasonable to say “syllables”.)
2. “Mutably structured” sounds like it should be in a movie starring Sigourney Weaver.
Traditional databases will eventually wind up in RAM | DBMS 2 : DataBase Management System Services on May 24th, 2011 1:41 am

[…] post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to […]
The Data Blog: Aster Data Blog » Blog Archive » Multi-structured Data: Platform Capabilities Required for Big Data Analytics on June 13th, 2011 9:57 am

[…] been a topic of discussion lately with IDC Research (where we first saw the term referenced) and other industry analysts. It is also the upcoming topic of a webcast we’re doing with the IDC on June […]
Hadapt update | DBMS 2 : DataBase Management System Services on July 6th, 2011 6:48 pm

[…] Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the […]
Charlie Reitzel on September 14th, 2011 9:12 am

Strong vs. Loose Types – Data Quality

In programming, a similar distinction is drawn between strongly typed and loosely typed languages. As most programmers know, C++ is the textbook “strongly typed” language. Smalltalk is the archetypal “loosely typed” language, where methods can be attached to individual objects, dynamically at runtime. Look, Ma, no classes!

Perl is another loosely typed language, albeit with a decent Class system. Java is generally thought of as having strong types, but with reflection and various byte code manipulation techniques, you can actually approach Smalltalk-like flexibility.

Anyway, I think the analogy to database schemas is clear enough. How does the DBMS control the schema? You can see a continuum, with Codd & Date on the one end and a bag o’ tags on the other.

In most cases, the DBMS is not the limiting factor. Rather, as others have alluded, the logic that consumes the data will impose constraints on what “fields” must be present, the domain of the values, etc. In analytics, I think they call this “data quality”.
Michael Brackett on September 14th, 2011 2:46 pm

The discussion is great, because ‘unstructured’ is the wrong term. It’s part of the lexical challenge in data resource management, and is promoted by people pumping the words without really knowing what they are saying.

However, all the discussion is very physically oriented toward databases and database processing. How about stepping outside the database environment and looking at an organization’s total data resource? Then look at the possible ways that data are structured.

I’ve done just that and prefer to use the terms ‘highly-structured’ to replace ‘semi-structured’ and ‘super-structured’ to replace ‘unstructured’. The terms seem to work quite well.

The terms poly- and multi- imply that something could have many different structures or exist in many different forms. That’s not quite true. The point is that the structure is more intricate, more detailed, more inter-twined, etc., than typical tabular data that can be accessed by SQL.
Another category of derived data : DBMS 2 : DataBase Management System Services on April 24th, 2012 12:00 am

[…] surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; just […]
Glassbeam instantiates a lot of trends | DBMS 2 : DataBase Management System Services on October 30th, 2013 10:37 am

[…] has an analytic technology stack focused on poly-structured machine-generated […]
Hadoop/RDBMS integration: Aster SQL-H and Hadapt | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:07 am

[…] right there you have an argument for flexible investigative or iterative analytics, over multi-structured (and relational) data. And if you think about how to combine information from all those data […]
Hadapt Version 2 | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:07 am

[…] multi-structured data into […]
Snowflake Computing | DBMS 2 : DataBase Management System Services on December 12th, 2014 2:42 am

[…] addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that […]
Which analytic technology problems are important to solve for whom? | DBMS 2 : DataBase Management System Services on April 9th, 2015 7:52 am

[…] data to regulators can have dire consequences. That said, when we hear about poor governance of poly-structured data, I question whether that data is being used in the applications where strong governance is actually […]
Data messes | DBMS 2 : DataBase Management System Services on August 7th, 2015 11:02 am

[…] and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those […]
