Another category of derived data
Six months ago, I argued the importance of derived analytic data, saying
… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking.
Well, I did overlook at least one category. 🙂
A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.
Actually, what made me think of writing this post was a few conversations at MarkLogic’s April user conference. For example, MarkLogic likes to break lunch up into subject-specific tables, hosted either by a partner company, or by one of the analysts who is attending anyway. So they asked me to hold a table about having Hadoop and MarkLogic work together. When I showed up, I discovered that most of the users at the table worked for a single organization; what’s more, they were skeptical about the table’s discussion subject, and wanted to see if I could persuade them otherwise. I gently pointed out that I hadn’t actually picked the subject, and asked them what their use cases might be like. Those turned out to be classified …
… but have no fear! Your hero thought quickly, and soon was holding forth about various ways one might combine the two technologies for various intelligence tasks. The one that finally struck a chord was — you guessed it! — metadata management. It seems they had colleagues with a lot of machine-generated data maintained in Hadoop and, upon reflection, they thought MarkLogic might be a good way to manage the metadata for same.
So should metadata management be handled relationally? Looking at my first three tests for when going relational is a slam-dunk choice:
- I don’t think the application suites exploiting derived metadata are complex enough to support a strong pro-relational bias.
- I don’t think the benefits of normalization are intense enough to mandate relational storage. (Also, since provenance matters, some of the traditional benefits of normalization are obviated — you may actually want out-of-date information in some cases.)
- There certainly are some cases where you can set up a fixed schema, have one row of metadata per object, and be happy. In those cases, a relational database likely suffices, and is probably the right choice, but …
… I’m not sure how numerous the cases are where a simple, fixed database design isn’t a good fit. Thoughts?
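To make the "fixed schema, one row of metadata per object" case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample file names are all hypothetical, invented purely for illustration; nothing here reflects how any of the organizations mentioned above actually store their metadata.

```python
import sqlite3

# Hypothetical fixed schema: exactly one row of derived metadata per object.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE object_metadata (
        object_id   TEXT PRIMARY KEY,   -- e.g. a file name (assumed)
        size_bytes  INTEGER NOT NULL,   -- derived from the file itself
        source      TEXT NOT NULL,      -- which system produced the file
        recorded_at TEXT NOT NULL       -- ISO-8601 timestamp
    )
    """
)


def record_metadata(object_id, size_bytes, source, recorded_at):
    """Insert the metadata row for an object, replacing any previous row."""
    conn.execute(
        "INSERT OR REPLACE INTO object_metadata VALUES (?, ?, ?, ?)",
        (object_id, size_bytes, source, recorded_at),
    )


def objects_from(source):
    """Return the ids of all objects produced by the given source."""
    cursor = conn.execute(
        "SELECT object_id FROM object_metadata WHERE source = ? ORDER BY object_id",
        (source,),
    )
    return [row[0] for row in cursor.fetchall()]


# Made-up sample objects.
record_metadata("file-001.dat", 1048576, "sensor-a", "2011-04-01T12:00:00")
record_metadata("file-002.dat", 2097152, "sensor-b", "2011-04-02T08:30:00")
```

Note that this design overwrites the old row whenever an object's metadata is recomputed. If provenance matters and you want the out-of-date values too, one row per object stops being enough, which is exactly where the fixed-schema approach starts to strain.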
Comments
2 Responses to “Another category of derived data”
I’m a bit confused about the whole premise here. Some data can be derived and some is entirely additional. For example, if you store a digital photograph in your photo library folder, metadata like “how large is the photo” can be computed from the photo. But when I annotate it to say “This is Alice and Bob standing in front of the hotel we stayed at”, that can’t be derived. Usually when I hear the word “metadata”, the latter is what comes to my mind, although maybe that’s just me.