Terminology: poly-structured data, databases, and DBMS
My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it: “multi-” or “poly-”?
*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me. 🙂
The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.
The definitions I’m proposing are:
- A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
- Data is poly-structured to the extent that it is best represented in a poly-structured database.
- A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.
- There are many different degrees of being poly-structured; that’s why I used the phrase “to the extent that”, instead of a simple “if”.
- And as always, no technology categorization is ever precise.
Examples of poly-structure include:
- XML or JSON documents/objects describe themselves. Add a new one to a database with a different structure than the others and — presto! — you have changed the overall structure. Thus:
- XML and JSON data is apt to be poly-structured.
- XML and JSON databases are apt to be poly-structured.
- MarkLogic, MongoDB, et al. are poly-structured DBMS.
- A text document is inherently poly-structured. Some queries might look at it as a bag of words; others might group the words via stemming and synonyms; others might actually exploit the document’s grammatical structure. Text search engines are poly-structured because they support all those kinds of queries.
- A single log file can be somewhat poly-structured, in that different views of it might extract different kinds of name-value pair, or different temporal relationships.
- A database that seamlessly includes a variety of log files, each with its own structure(s), is quite poly-structured.
- A classic relational database is not very poly-structured, because DDL (Data Description Language) isn’t really in “the ordinary course” of programming or update.
- However, views add a bit of poly-structure to relational databases that is not present in, say, IMS databases.
- An object-oriented DBMS is highly poly-structured, as is Workday’s internal data store.
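To make the XML/JSON example above concrete, here is a minimal sketch in Python. It assumes a toy in-memory store (the `DocumentStore` class and the `shape` helper are hypothetical names invented for illustration, not any real product's API) and treats a document's "structure" as nothing more than its set of top-level field names. Under those assumptions, an ordinary insert is enough to change the overall structure, with no DDL step involved.

```python
import json


def shape(doc):
    """Return the sorted tuple of top-level field names of one document."""
    return tuple(sorted(doc))


class DocumentStore:
    """Hypothetical toy store: a list of parsed JSON objects, no declared schema."""

    def __init__(self):
        self.docs = []

    def insert(self, raw_json):
        # Each document describes itself, so nothing is checked against a schema.
        self.docs.append(json.loads(raw_json))

    def structures(self):
        """The set of distinct shapes currently present in the store."""
        return {shape(d) for d in self.docs}


store = DocumentStore()
store.insert('{"name": "Ada", "email": "ada@example.com"}')
store.insert('{"name": "Bob", "email": "bob@example.com"}')
before = store.structures()

# An ordinary update adds a document with a different structure than the others.
store.insert('{"name": "Cy", "phones": ["555-0100"], "vip": true}')
after = store.structures()

print(len(before), len(after))  # 1 2
```

A real document DBMS tracks far more than top-level field names, of course, but the sketch shows the core point: the structure changed in the ordinary course of update.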
So what do you think? Do these definitions work?
Comments
23 Responses to “Terminology: poly-structured data, databases, and DBMS”
I would also consider technologies like data federation, virtualization, and abstraction in your definition.
Technologies like Composite Software, IBM Infosphere, and others can really change the dynamics of your data ecosphere.
Morgan,
Federation and abstraction certainly fit into what I called “DBMS2” in the middle of the last decade. But I’m not sure that they tie closely into the definitions in this particular post.
I like this definition overall.
I think there needs to be more emphasis in there somewhere around change over time. XML, JSON, etc. can maintain multiple different versions of the schema simultaneously and still be accessed in the same way. So it is not always a “change it” scenario.
unholyguy,
I’m not sure what distinction you’re drawing.
If a database has a bunch of different structures, the structures aren’t entirely physical; there’s something “virtual” about them. So you can have multiple structures at once, different ones of which are invoked/emphasized in different operations.
What I’ve struggled a bit to capture is how this is deeper than the flexibility of the traditional relational model.
‘Subject to change’ is a much more powerful concept than just ‘many’.
How about ‘dynamic-structure’, as the structure is virtual and configurable at run-time (which is what we do)? Alternatively, ‘polymorphic-structure’. “Poly” by itself doesn’t really connote ‘change’.
Dave,
I find the term “polymorphic” a bit pretentious in most of its uses. And it’s easy to oversell flexibility anyhow. When things are flexible, often any one user will do one thing with it and not change all that much going forward (thus using only a small fraction of its power).
Understood, but technically “polymorphic” is a better fit for what you are describing. It’s also a term fairly well understood by the software development / computer science community writ large.
I support your decomposition effort, so I’m not trying to nitpick.
In our case, dynamic structure is a function of the system method. It’s canonical, so the flexibility is inherent: every interaction benefits from late-binding for situational awareness.
Ultimately, data serves the business. Static structures constrain variance, preclude context and are the enemy of agility.
I think the word “polymorphic” has become rather — as it were — overloaded. Hence my reluctance to pile yet more duties on it.
Also, shorter terms are better than longer ones.
Makes sense. The thinking is more important than the term, and the term works. Great series of posts.
Curt – I like the general thrust of your thoughts and it is an interesting debate. The concept of a dynamically changing framework in data structures or in the DBMS is appropriate and there is scope for more rigour in the naming convention.
My 2p’th:
Poly- and multi- have the same meaning, being the Greek and Latin prefixes for “many”.
Polymorphic means “many forms” and as Dave points out, is well used in the programming community but doesn’t carry the sense of time-dependence or time-variance.
To express the concept of change, or indeed the potential for change, through time, might I suggest “mutable”? Mutable objects and mutable constructs such as arrays are already well-understood by programmers and there is no ambiguity around the nomenclature either.
Duncan,
I like the idea of “mutably structured” or something like that. My only objections, really, are:
1. I’m looking to save syllables. (Phonemes, really, but to a first approximation it’s reasonable to say “syllables”.)
2. “Mutably structured” sounds like it should be in a movie starring Sigourney Weaver.
Strong vs. Loose Types – Data Quality
In programming, a similar distinction is drawn between strongly typed and loosely typed languages. As most programmers know, C++ is the textbook “strongly typed” language. Smalltalk is the archetypal “loosely typed” language, where methods can be attached to individual objects, dynamically at runtime. Look, Ma, no classes!
Perl is another loosely typed language, albeit with a decent class system. Java is generally thought of as having strong types, but with reflection and various bytecode manipulation techniques, you can actually approach Smalltalk-like flexibility.
Anyway, I think the analogy to database schemas is clear enough. How does the DBMS control the schema? You can see a continuum, with Codd & Date on the one end and a bag o’ tags on the other.
In most cases, the DBMS is not the limiting factor. Rather, as others have alluded, the logic that consumes the data will impose constraints on what “fields” must be present, the domain of the values, etc. In analytics, I think they call this “data quality”.
The discussion is great, because ‘unstructured’ is the wrong term. It’s part of the lexical challenge in data resource management, and is promoted by people pumping the words without really knowing what they are saying.
However, all the discussion is very physically oriented toward databases and database processing. How about stepping outside the database environment and looking at an organization’s total data resource? Then look at the possible ways that data are structured.
I’ve done just that and prefer to use the terms ‘highly-structured’ to replace ‘semi-structured’ and ‘super-structured’ to replace ‘unstructured’. The terms seem to work quite well.
The terms poly- and multi- imply that something could have many different structures or exist in many different forms. That’s not quite true. The point is that the structure is more intricate, more detailed, more intertwined, etc., than typical tabular data that can be accessed by SQL.