Three quick notes about derived data
I had one of “those” trips last week:
- 20 meetings, a number of them running multiple hours.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend.
Hortonworks founder/CTO Eric Baldeschwieler, aka Eric 14, doesn’t generally use the term, but these days he’s a big proponent of the derived data story. Eric likes to position Hadoop as a “data refinery”, where, among other things, you transform data and do “iterative analytics” on it. And he’s getting buy-in; for example, that formulation was prominent in the joint Teradata/Hortonworks vision announcement.
The KXEN guys don’t use the term “derived data” much either, but they tend to see the idea as central to predictive modeling even so. The argument in essence is that traditional predictive modeling consists of three steps:
- Think hard about exactly which variables you want to model on.
- Do transformations on those variables so that they fit into your favored statistical algorithm (commonly linear regression, although KXEN favors nonlinear choices).
- Press a button to run the algorithm.
#3 is the most automated part, and #1 is what KXEN thinks its technology makes unnecessary. Hence #2, they suggest, is often the bulk of the modeling effort — except now they want to automate that away too.
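To make those three steps concrete, here’s a minimal sketch of the traditional workflow in Python. The data set, column names, and specific transformations are illustrative assumptions on my part, not anything KXEN showed me.

```python
# Illustrative sketch only -- data file, column names, and transforms
# are hypothetical, not taken from any KXEN example.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("customers.csv")  # hypothetical modeling data set

# Step 1: the analyst hand-picks the variables believed to matter.
raw = df[["income", "tenure_months", "recent_purchases"]]

# Step 2: transform them to suit the chosen algorithm. This is the step
# KXEN argues is the bulk of the effort, and now wants to automate.
X = pd.DataFrame({
    "log_income": np.log1p(raw["income"]),                    # tame a skewed distribution
    "tenure_years": raw["tenure_months"] / 12.0,              # rescale to friendlier units
    "buys_often": (raw["recent_purchases"] > 3).astype(int),  # bin a count into a flag
})

# Step 3: press the button.
model = LinearRegression().fit(X, df["spend_next_quarter"])
print(dict(zip(X.columns, model.coef_)))
```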
And then there are my new clients at MarketShare, a predictive modeling consulting company focused on marketing use cases, which also has a tech layer (accelerated via the acquisition of JovianDATA). It turns out that a typical MarketShare model is fed by a low double-digit number of other models, each of which is doing some kind of data transformation. The final step is typically a linear regression, yielding coefficients of the sort that marketers recognize and (think they) understand. Earlier steps are typically transformations on individual variables. I didn’t see many examples, but the transformations clearly go beyond the traditional rescaling — log, log (x/(1-x)), binning, whatever — to involve multiplication by what could be construed as other variables. I.e., there seemed to be a polynomial flavor to the whole thing.
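For flavor, here is a miniature, hypothetical analogue of that kind of pipeline in Python. The variable names, the marketing data set, and the use of scikit-learn are my own illustration, not MarketShare’s actual stack; the point is just the mix of rescalings, binning, and variable-times-variable terms feeding a final linear regression.

```python
# Hypothetical miniature of a staged derived-data pipeline; names and
# library choices are mine, not MarketShare's.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("marketing_mix.csv")  # hypothetical weekly marketing data

derived = pd.DataFrame({
    # classic rescalings
    "log_tv_spend": np.log1p(df["tv_spend"]),
    "logit_reach": np.log(df["reach"] / (1 - df["reach"])),  # log(x/(1-x)), for reach in (0,1)
    # binning
    "heavy_promo": (df["promo_weeks"] >= 2).astype(int),
    # the "polynomial flavor": one variable multiplied by another
    "tv_x_search": df["tv_spend"] * df["search_spend"],
})

# Final step: a plain linear regression, whose coefficients are the kind
# of numbers marketers recognize.
fit = LinearRegression().fit(derived, df["sales"])
print(pd.Series(fit.coef_, index=derived.columns))
```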
Comments
While it’s true that analysts like to transform the raw data in various ways to improve model fit, this practice creates a deployment problem, as the transformations used to create the predictive model must be reproduced exactly in the production environment.
This is non-trivial for customers with thousands of deployed models. One of our customers has a modeling operation that routinely creates more than 5,000 standard transformations for each modeling project. For obvious reasons, it’s impractical to generate and persist all of these variables in the production warehouse.
PMML 4.0 is helpful in this regard as it supports certain standard transformations, which can be produced on the fly when scoring. The problem will go away when we reach that state of nirvana where PMML supports every possible transformation and every analytic software vendor supports PMML export. In the case of one well-known vendor, that will be never.
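To make the deployment point concrete, here is a minimal sketch of what “recompute the transformations on the fly at scoring time” can look like. The function and field names are hypothetical, and this is not PMML itself; it just illustrates the idea PMML standardizes, namely that the derived-variable definitions must be shared, exactly, between the modeling and production environments.

```python
# Sketch of on-the-fly derivation at scoring time. Field names and
# coefficients are hypothetical; this illustrates the idea, not PMML.
import math

# One entry per derived variable, defined once and shared verbatim
# between the modeling environment and production scoring.
TRANSFORMS = {
    "log_income": lambda row: math.log1p(row["income"]),
    "tenure_years": lambda row: row["tenure_months"] / 12.0,
    "buys_often": lambda row: 1 if row["recent_purchases"] > 3 else 0,
}

def score(row, coefficients, intercept):
    """Recompute each derived variable, then apply the linear model."""
    derived = {name: f(row) for name, f in TRANSFORMS.items()}
    return intercept + sum(coefficients[name] * value for name, value in derived.items())
```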
An alternative is to use analytic methods that don’t require transformation, because they work well with a range of data types and distributions. In other words, instead of transforming the data to fit the requirements of the model, use a method that works well with the data “as is”. Analysts often reject this approach because (a) they learned in STAT 101 that linear regression is a great tool; (b) they never learned how to use anything else; (c) they imagine that the business cares about the very small differences in model “lift” they can obtain from messing around with the data; (d) model deployment is somebody else’s problem.
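As one concrete example of a method that copes with data “as is”, here is a sketch using a tree ensemble; the specific choice of gradient boosting is mine, since the comment above doesn’t name a particular technique, and the data set is the same hypothetical one as in the earlier sketches.

```python
# Tree ensembles split on thresholds, so skewed scales and odd
# distributions matter far less than they do for linear regression.
# Data set and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("customers.csv")

X = df[["income", "tenure_months", "recent_purchases"]]  # no logs, no binning
y = df["spend_next_quarter"]

model = GradientBoostingRegressor().fit(X, y)
```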
Thomas,
Thanks for chiming in!
Actually, it’s NOT obvious to me that adding 5000 columns to a data warehouse is intolerable.
More to the point — a large fraction of those transformations really will be easy and cheap to do on the fly. The exceptions to that rule are the ones that one really wants to persist. 🙂
It’s obvious to someone with a large traditional warehouse who is concerned about the impact of adding 5,000 derived fields to 300 million rows.
1.5 trillion fields with numeric values? Maybe I’m being a bit too blasé, but why is that a huge amount?
Curt,
Either blasé or facetious.
Regards
TD
Thomas,
What’s a few terabytes among friends?
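For what it’s worth, the back-of-the-envelope arithmetic behind that exchange, assuming 8 bytes per numeric value (my assumption, not stated above):

```python
# 5,000 derived fields x 300 million rows, at an assumed 8 bytes per value
values = 5_000 * 300_000_000   # 1.5 trillion field values
print(values * 8 / 1e12)       # ~12 terabytes, uncompressed
```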
[…] You develop an intermediate analytic result, and use it as input to the next round of analysis. This is roughly equivalent to saying that iterative analytics refers to a multi-step analytic process involving a lot of derived data. […]
[…] blog post stresses a need for “Big Data applications”. Coincidentally, I was advising MarketShare this week on some messaging, and was reminded that they (too) self-describe as an application […]