Data that is derived, augmented, enhanced, adjusted, or cooked
On this food-oriented weekend, I could easily go on long metaphorical flights about the distinction between “raw” and “cooked” data. I’ll spare you that part — reluctantly, given my fondness for fresh fruit, sushi, and steak tartare — but there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
There are probably more examples that I'm overlooking at the moment. But even these should suffice to establish my point, as might the broad list of synonyms for "derived data" I've used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time, and the like. This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic DBMS and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using. In Oracle, especially pre-Exadata, you’re apt to have a lot of materialized views. In Netezza, not so much. If you’re using a MOLAP (Multidimensional OnLine Analytic Processing) tool such as Essbase, you’re probably going crazy with pre-calculated roll-ups. And if you’re using Skytide, you may not be keeping the unaggregated raw data at all.
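To make the aggregates case concrete, here is a minimal sketch, deliberately not tied to Oracle, Netezza, Essbase, Skytide, or any other particular product, of keeping a pre-computed roll-up alongside the raw detail so that summary queries don't rescan the detail rows every time. The schema and figures are invented for illustration.

```python
# Minimal sketch of maintaining a derived roll-up table next to raw fact data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_raw (region TEXT, sale_date TEXT, amount REAL);
    -- The "derived" table: one row per (region, month) roll-up.
    CREATE TABLE sales_monthly_agg (region TEXT, month TEXT, total REAL,
                                    PRIMARY KEY (region, month));
""")

def load_and_aggregate(rows):
    """Insert raw rows and refresh the derived roll-up in one pass."""
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", rows)
    conn.execute("DELETE FROM sales_monthly_agg")
    conn.execute("""
        INSERT INTO sales_monthly_agg
        SELECT region, substr(sale_date, 1, 7) AS month, SUM(amount)
        FROM sales_raw GROUP BY region, month
    """)
    conn.commit()

load_and_aggregate([("West", "2010-02-01", 120.0), ("West", "2010-02-15", 80.0)])
# Summary queries now hit the small derived table instead of the raw detail.
print(conn.execute("SELECT * FROM sales_monthly_agg").fetchall())
```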
Something similar could be said about the simpler forms of data mining scoring; if you’re just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that’s about as far as it goes. For more complex kinds of predictive analytic models, real-time scoring could be prohibitively slow. Ditto for social graph analysis, and the same goes for the other examples as well.
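As a hedged illustration of the weighted-sum case, here is a sketch in which the same score can either be computed at query time or precomputed and stored as a derived column; the feature names and weights are made up.

```python
# The "simple weighted sum" case: cheap enough to compute on the fly,
# but storing the derived score is also an option.
WEIGHTS = {"recency": 0.5, "frequency": 0.3, "monetary": 0.2}

def score(customer):
    """A simple weighted-sum propensity score, computable at query time."""
    return sum(WEIGHTS[f] * customer[f] for f in WEIGHTS)

customers = {
    "c1": {"recency": 0.9, "frequency": 0.4, "monetary": 0.7},
    "c2": {"recency": 0.2, "frequency": 0.8, "monetary": 0.1},
}

# Option A: compute at query time.
on_the_fly = {cid: score(c) for cid, c in customers.items()}

# Option B: precompute once and store the derived score next to the raw row.
for cid, c in customers.items():
    c["propensity_score"] = score(c)
```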
Text analytics requires a lot of processing per document. You need to tokenize (i.e., identify the boundaries of) the words, sentences, and paragraphs; identify the words’ meaning; map out the grammar; resolve references such as pronouns; and often do more besides (e.g. sentiment analysis). There are a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you do a query. Not coincidentally, MarkLogic — a huge fraction of whose business to date is for text-oriented uses — thinks heavily in terms of the enhancement and augmentation of data.
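Here is a deliberately crude sketch of that enrich-once, query-many pattern; the tokenization, sentence splitting, and sentiment steps are stand-ins for what real NLP tooling would do, and the word lists are invented.

```python
# Each document goes through several expensive steps once; the derived
# annotations, not the raw text alone, are what queries run against.
import re

POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def enrich(doc_id, text):
    tokens = re.findall(r"\w+", text.lower())          # tokenization stand-in
    sentences = re.split(r"[.!?]+\s*", text.strip())   # sentence boundaries
    sentiment = (sum(t in POSITIVE for t in tokens)
                 - sum(t in NEGATIVE for t in tokens)) # crude sentiment score
    # The derived record gets persisted; nobody redoes this per query.
    return {"doc_id": doc_id, "tokens": tokens,
            "sentence_count": len([s for s in sentences if s]),
            "sentiment": sentiment}

enriched = [enrich(i, t) for i, t in enumerate(
    ["The service was great. Delivery was poor."])]
```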
If you look through a list of actual Hadoop or other MapReduce use cases, you'll see that a lot of them boil down to "crunch data in a big batch job to get it ready for further processing." Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can happen to time series or traditional relational tables as well. See, for example, the use cases in the last two Aster Data slide decks I posted. Generally, those are not processes that you want to try to run in real time.
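To illustrate the pattern without writing actual Hadoop code, here is a plain-Python map/reduce sketch that turns raw weblog lines into a derived page-hits table; the log format is a made-up example.

```python
# "Crunch logs in a big batch job" written as plain map/reduce functions.
from collections import defaultdict

def map_phase(log_line):
    # e.g. "2010-02-13 10:01:02 /products/42 userA" -> (page, 1)
    parts = log_line.split()
    if len(parts) >= 3:
        yield parts[2], 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for page, n in pairs:
        counts[page] += n
    return dict(counts)   # derived "page hits" table, loaded downstream

logs = ["2010-02-13 10:01:02 /products/42 userA",
        "2010-02-13 10:01:05 /products/42 userB",
        "2010-02-13 10:02:11 /home userA"]

page_hits = reduce_phase(pair for line in logs for pair in map_phase(line))
```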
Scientists have a massive need to adjust or “cook” data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:
- Log processing not unlike that done in various commercial sectors.
- Assigning data to different kinds or densities of coordinate grids — “regridding” — often through a process of interpolation/approximation/estimation (see the sketch after this list).
- Adjusting/normalizing data for all kinds of effects (e.g. weather cycles).
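As a rough illustration of the regridding item above, here is a sketch that interpolates irregularly sampled readings onto a regular grid; the numbers are invented, and real regridding is of course far more involved.

```python
# Sensor readings on an irregular time grid get interpolated onto a regular
# one; the regridded (derived) series is what analyses typically run against.
import numpy as np

raw_times = np.array([0.0, 1.3, 2.7, 4.1, 6.0])        # irregular sample times
raw_values = np.array([10.2, 11.0, 10.8, 12.1, 11.5])  # raw readings

regular_grid = np.arange(0.0, 6.5, 0.5)                      # target regular grid
regridded = np.interp(regular_grid, raw_times, raw_values)   # linear interpolation

# 'regridded' is the cooked/derived dataset one would store alongside
# (or instead of) the raw, irregularly sampled readings.
```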
Examples where data adjustment would be needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
Similar issues arise in marketing applications, even beyond the kind of straightforward predictive analytics-based scoring and psychographic/clustering results one might expect. For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when in fact you’re a 32-year-old male. If you have 400 Facebook friends, the vast majority of whom are in your age range, look at web sites about cars, poker, and video games, and have a propensity to click on ads featuring scantily-clad females, increasingly many analytic systems are smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
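A hedged sketch of that score-offline, serve-online pattern: the behavioral inference runs as a batch job, and the serving path is just a key lookup against the stored result. The segment names and signals here are invented.

```python
# Expensive inference in advance; sub-second serving via a stored lookup.
def infer_segment(profile):
    """Batch-time inference from behavioral signals (stand-in logic)."""
    if profile["friend_median_age"] < 40 and "cars" in profile["site_topics"]:
        return "young_male_auto_enthusiast"
    return "general"

profiles = {
    "user_123": {"declared_age": 90, "friend_median_age": 31,
                 "site_topics": {"cars", "poker", "video games"}},
}

# Run in advance; results stored in the serving database.
segment_by_user = {uid: infer_segment(p) for uid, p in profiles.items()}

def serve_ad(user_id):
    # Sub-second path: a single lookup, no modeling at request time.
    return segment_by_user.get(user_id, "general")
```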
Related links
- The #1 or #2 kind of data mart in the Teradata user base is derived data based on a central data store, especially at eBay. Actually, those figures probably cover both derived data and pure extracts — but based on how the conversation reported in that link went, I’d guess that in most cases true derived data is involved.
- There’s a whole other area of derived OLTP data, such as bank balances or inventory levels. It’s important too.
Comments
One small point: Mike Stonebraker recently gave a talk about the DBMS being developed to store the data from the LSST (amongst other things), called SciDB. One requirement is that the “cooking” can be done again later. Of course, one would rather not pay for this again, but there are various reasons you might want to, and the capability must be retained. E.g., suppose you find out that the new-comet-finding algorithm you had run over the data had a bug or has since been improved? The intention is for the cooked data to be tagged with “provenance” metadata that would say “we got this by taking dataset X and running software Y of version Z over it.” None of this contradicts anything you said; I’m just adding it.
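To illustrate the provenance tagging Dan describes, here is a tiny sketch in which each cooked dataset records which raw dataset and which software version produced it; the field names are hypothetical, not SciDB's actual schema.

```python
# Tag cooked data with enough provenance to re-run the cooking later if the
# algorithm turns out to have had a bug or has since been improved.
from dataclasses import dataclass

@dataclass
class Provenance:
    source_dataset: str     # "dataset X"
    software: str           # "software Y"
    version: str            # "version Z"

@dataclass
class CookedDataset:
    name: str
    provenance: Provenance

comets = CookedDataset(
    name="candidate_comets_2010_02",
    provenance=Provenance(source_dataset="lsst_raw_images_night_1234",
                          software="comet_finder", version="2.3"),
)
```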
Dan,
Quite right. As previously noted in, e.g., my posts about SciDB :), scientists sometimes frame the issue not as “We’ve always kept raw data; let’s keep derived data too” but more as “We’ve always kept cooked data, but we really need to keep the original raw stuff as well.”
This is a cousin of the trend in commercial data warehousing to keep all detailed data, when in the past only summaries might have been kept.
We’ve been dealing with reports in this way for many years.
Although much of this cleansed, aggregated data is accessible somewhere else in raw format, and could with some work be recreated, a surprising amount is exceedingly complex to reproduce after the fact.
This is due to highly complex, point-in-time business rules in LOB systems. It mostly occurs in financial analysis and auditing.
In addition, there are often regulatory requirements for old aggregated data.
Here are links to a couple of real-world examples. I don’t mean this as a pitch, but that’s the real-world experience I have of these issues:
http://www.datawatch.com/case_studies/Monarch/banking/Columbia_Bank_data.pdf
http://www.datawatch.com/case_studies/Monarch/banking/MissionBank_User_Story.pdf
Gareth
[…] by definition, derived data is based on lower level raw data, it could be derived again as needed. DBMS2 takes us through some thinking about how better to handle derived […]
[…] months ago, I argued the importance of derived analytic data, saying … there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted […]
[…] scenarios where you incrementally derive and enhance data, it’s natural to want to keep everything in the same place. (That also helps with […]
[…] suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be […]
[…] far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of […]
[…] stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important […]