Confusion about metadata
Here are a couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
  - In a relational database, structural metadata is rather separate from the data itself.
  - In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
  - Column statistics that inform RDBMS optimizers.
  - Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
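To make the "value ranges" item concrete, here is a minimal sketch of data skipping. It assumes each partition keeps a (min, max) pair for one integer column. The names `Partition`, `make_partition`, and `scan_equal` are hypothetical, chosen for illustration, and not taken from any particular product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    """A chunk of rows plus the value-range metadata kept about it."""
    rows: List[int]
    min_value: int
    max_value: int


def make_partition(rows: List[int]) -> Partition:
    # The (min, max) pair is metadata derived from the data itself.
    return Partition(rows=rows, min_value=min(rows), max_value=max(rows))


def scan_equal(partitions: List[Partition], target: int) -> Tuple[List[int], int]:
    """Return the matching rows and the number of partitions actually read."""
    matches: List[int] = []
    partitions_read = 0
    for p in partitions:
        # Skip any partition whose value range cannot contain the target.
        if target < p.min_value or target > p.max_value:
            continue
        partitions_read += 1
        matches.extend(v for v in p.rows if v == target)
    return matches, partitions_read


partitions = [make_partition(r) for r in ([1, 5, 9], [10, 14, 19], [20, 25, 29])]
matches, partitions_read = scan_equal(partitions, 14)
print(matches, partitions_read)
```

In this toy case only one of the three partitions is read. The range metadata says nothing about how the data is structured; it only tells the engine where not to look.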
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
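The contrast in the first two reasons can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a description of any specific DBMS: `catalog`, `customer_rows`, `read_row`, and `field_paths` are hypothetical names, and a JSON object stands in for a document-database record.

```python
import json

# Relational style: the schema lives in a catalog, separate from the rows.
catalog = {"customers": ["id", "name", "city"]}  # structural metadata
customer_rows = [(1, "Alice", "Boston"), (2, "Bob", "Denver")]  # bare values


def read_row(table, row):
    # A row is only interpretable after looking up the schema elsewhere.
    return dict(zip(catalog[table], row))


# Document style: each document carries its own field names and nesting.
document = json.loads('{"id": 1, "name": "Alice", "address": {"city": "Boston"}}')


def field_paths(doc, prefix=""):
    # Recover the structure from the document itself, with no catalog lookup.
    paths = []
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            paths.extend(field_paths(value, path + "."))
        else:
            paths.append(path)
    return paths


relational_view = read_row("customers", customer_rows[0])
document_paths = field_paths(document)
print(relational_view)
print(document_paths)
```

The tuple `(1, "Alice", "Boston")` means nothing without the catalog, whereas the document can be walked to recover its own field names and nesting.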
Related links
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
Comments
5 Responses to “Confusion about metadata”
Thanks, Curt, for this simple and straightforward explanation. Glad to see I wasn’t the only one cringing when hearing about the NSA’s “metadata”. Even Jack Bauer’s CTU, with their often stretched tech concepts, weren’t making this kind of mistake…
I would label a Call Detail Record event data and not metadata. I think they call it metadata to try and downplay what they are doing. In Information Management circles you throw around the term metadata if you want everyone in the room to go to sleep. If they referred to it as Personal Private Information instead of metadata there would have been a few more headlines.
[…] article Confusion about Metadata speaks about some additional aspects of metadata management that are getting more relevant these days. […]
[…] February, 2014 post on various metadata-related confusions notes some egregious governmental spin. Categories: Health care, Predictive modeling and […]