Integrating statistical analysis into business intelligence
Business intelligence tools have been around for two decades.* In that time, many people have had the idea of integrating statistical analysis into classical BI. Yet I can’t think of a single example that was more than a small, niche success.
*Or four decades, if you count predecessor technologies.
The first challenge, I think, lies in the paradigm. Three choices that come to mind are:
- “I think I may have discovered a new trend or pattern. Tell me, Mr. BI — is the pattern real?”
- “I think I see a trend. Help me, Mr. BI, to launch a statistical analysis.”
- “If you want to do some statistical analysis, here’s a link to fire it up.”
But the first of those approaches requires too much intelligence from the software, while the third requires too much numeracy from the users. So only the second option has a reasonable chance to work, and even that one may be hard to pull off unless vendors focus on one vertical market at a time.
The challenges in full automation start with the following:
- The software may not be able to reliably deduce:
- Exactly which hypothesis to test …
- … against exactly which data set.
- The software may not be able to reliably adjust for differences over time (see the sketch below) in areas such as:
- Your marketing choices and product cycles.
- Your competitors’ marketing choices and product cycles.
- Seasonality.
- Exogenous economic factors.
Perhaps someday those problems will be solved in full generality … but not soon.
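To make the seasonality point a bit more concrete: even the purely mechanical part of a time adjustment, such as splitting a metric into trend, seasonal, and residual components, presupposes that somebody has already picked the right series and the right period. Here is a minimal sketch, assuming a made-up monthly sales series and the statsmodels library; it is an illustration, not anything a particular BI product does.

```python
# Minimal sketch of a mechanical seasonality adjustment, using statsmodels.
# Assumptions: a made-up monthly "sales" series and a 12-month seasonal period.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2009-01-01", periods=36, freq="MS")    # three years, monthly
rng = np.random.default_rng(1)
sales = pd.Series(
    1000 + 10 * np.arange(36)                               # upward trend
    + 100 * np.sin(2 * np.pi * np.arange(36) / 12)          # yearly seasonality
    + rng.normal(0, 20, 36),                                # noise
    index=idx,
)

decomp = seasonal_decompose(sales, model="additive", period=12)
adjusted = sales - decomp.seasonal                          # seasonally adjusted series
print(adjusted.tail())
```

The arithmetic is the easy part. The hard part, per the list above, is knowing that twelve months is the right period, that this is the right series, and that last year's promotion calendar doesn't invalidate the comparison.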
On the other hand, just dumping statistical software onto everybody’s desk won’t work either, even if the software is something like KXEN. Some people are numerate enough to make good use of such capabilities, but many are not.
What that leaves us with is semi-automation. The template I envision, sketched in code after the list below, is:
- The software uses heuristics to come up with one or a few guesses for the objective variable or function. A human BI user makes the final decision.
- The software invokes a pre-identified choice of data set from which to pursue any particular objective. A human data architect or statistician previously set this up.
- The software invokes pre-identified time adjustments for any particular analysis, previously set up by a human data architect or statistician, subject to final approval by a human BI user.
- The software uses heuristics based on the actual BI query to guess at any relevant parameters identifying the specific data subset to consider. A human BI user makes the final decision.
At least that much human intervention will long be necessary.
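To illustrate, here is a rough sketch of what that division of labor might look like in code. Every name, table, and heuristic in it is hypothetical; the only point is where the guessing stops and the human decisions begin.

```python
# Sketch of the semi-automation template above. All names are hypothetical;
# this illustrates the human-in-the-loop hand-offs, not any real product's API.

# Set up in advance by a human data architect or statistician:
PREIDENTIFIED_DATASETS = {
    "response_rate": "crm.responses_mart",
    "sales_amount": "sales.orders_mart",
}
PREIDENTIFIED_TIME_ADJUSTMENTS = {
    "response_rate": ["seasonality_12mo", "promo_calendar"],
}


def guess_objectives(bi_query):
    """Heuristic guesses at the objective variable, based on the BI query text."""
    guesses = []
    if "campaign" in bi_query.lower():
        guesses.append("response_rate")
    if "sales" in bi_query.lower():
        guesses.append("sales_amount")
    return guesses or ["(no guess; ask the user)"]


def guess_filters(bi_query):
    """Heuristic guesses at the data subset, e.g. dimensions already in the query."""
    return {"region": "EMEA"} if "emea" in bi_query.lower() else {}


def ask_human(prompt, options):
    """Stand-in for the BI user's final decision (here we just take the first option)."""
    print(f"{prompt}: {options}")
    return options[0]


def launch_analysis(bi_query):
    objective = ask_human("Confirm objective", guess_objectives(bi_query))
    dataset = PREIDENTIFIED_DATASETS.get(objective)            # pre-identified, not guessed
    adjustments = PREIDENTIFIED_TIME_ADJUSTMENTS.get(objective, [])
    adjustments = ask_human("Approve time adjustments", [adjustments])
    filters = ask_human("Confirm data subset", [guess_filters(bi_query)])
    return {"objective": objective, "dataset": dataset,
            "adjustments": adjustments, "filters": filters}


print(launch_analysis("EMEA campaign response by month"))
```

Note that two of the four pieces, the data set and the time adjustments, are not guessed at all; they are looked up from what a human set up in advance, which is exactly the split the template calls for.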
Application areas where it might be easy to guess an objective function include:
- Many kinds of CRM. You probably want to know about sales levels, response rates, or something like that.
- Quality. You probably want to know about something like a defect rate.
- Accounting. In some contexts, it’s clear that you want to know about the incidence of something unwelcome, like bad debts or product returns.
Even so, there are many cases where humans will need to at least tweak the objective function (a toy illustration follows the list below).
- Do you want to measure a simple count of new customers, or do you want to weight them by expected lifetime value?
- Which intermediate anomaly is most crucial to your defect tracking?
- How far in arrears does a debt have to be to be “bad”?
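As a toy illustration of that first bullet (all numbers and column names made up), counting new customers and weighting them by expected lifetime value can tell very different stories:

```python
# Toy illustration of tweaking an objective function: a raw count of new customers
# vs. a count weighted by expected lifetime value. Data and column names are made up.
import pandas as pd

new_customers = pd.DataFrame({
    "segment": ["consumer", "consumer", "enterprise"],
    "expected_ltv": [300.0, 450.0, 25_000.0],
})

objective_count = len(new_customers)                  # "how many customers did we add?"
objective_ltv = new_customers["expected_ltv"].sum()   # "how much value did we add?"
print(objective_count, objective_ltv)                 # 3 vs. 25750.0
```

A campaign that looks great by one measure can look mediocre by the other, which is why a human has to pick.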
Choosing the data subset is also tricky. A good first approximation might be a query result set the human user recently looked at. But which one? Surely she’s bounced around and done some drilling down, so at least you need to give her a UI for rewinding a bit. She might also want to specify somewhat different dimensions — or ranges of dimensional values — for the predictive analysis than were used for the query that set the inquiry off.
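Here is a sketch of the "rewind a bit" idea, assuming only that the BI tool keeps a short history of recent query contexts. All names are hypothetical, and a real BI tool would track much richer query metadata.

```python
# Sketch of rewinding through recent query contexts to pick the analysis data set.
# All names are hypothetical; a real BI tool would track much richer query metadata.

class QueryHistory:
    def __init__(self):
        self._contexts = []                     # most recent last

    def record(self, description, dimensions):
        self._contexts.append({"description": description, "dimensions": dimensions})

    def rewind(self, steps_back=0):
        """Return a recent query context: 0 = latest, 1 = the one before that, etc."""
        return self._contexts[-(steps_back + 1)]


history = QueryHistory()
history.record("Sales by region, 2010", {"region": "*", "year": 2010})
history.record("Sales by region, EMEA drill-down", {"region": "EMEA", "year": 2010})

# The user rewinds one step, then widens a dimension for the predictive run.
context = history.rewind(steps_back=1)
context["dimensions"]["year"] = [2008, 2009, 2010]
print(context)
```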
And finally — how vertical should this functionality be? My first inclination is to say “Very”. Consider again the application scenarios I mentioned above. If we know that the fundamental issue is likely to be “Campaign response” or “Sales vs. comparables” or “Defect rate” or “Late payments”, in each case we can envision what the heuristics and user interfaces might be like. But otherwise, where do we even start?
Related link
- My definition of investigative analytics, wherein I said it’s all about discovering unknown patterns, as opposed to monitoring for known ones.
Comments
Interesting post, Curt. I think you also need to consider experimental design as part of the overall statistical analysis.
It is often the correct choice of experimental design that enables the ‘learn’ in ‘test & learn’…
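For instance, even the most basic design question, namely how many customers to put in each test cell, is a statistics question. A minimal sketch, assuming a two-cell test of response rates and the statsmodels library (the 2.0% and 2.5% rates are made-up examples):

```python
# Minimal sketch of a "test & learn" design question: how big must each test cell be
# to detect a lift from a 2.0% to a 2.5% response rate? The rates are made-up examples.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.025, 0.020)      # Cohen's h for the two response rates
n_per_cell = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_cell))                          # roughly 6,900 customers per cell
```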
Mark,
I was focusing on adoption that’s even more mass-market than what you’re probably thinking of.
Otherwise, that’s an excellent point!
Curt, one place this makes sense is in Financial Services – lots of stats and modeling. Just look at options – there’s a set model for estimating the correct price based on the underlying, etc.
A number of the vertical software products will calculate this kind of stuff. Of course, where they fall down is on the basic reporting, believe it or not. They can calculate the Value at Risk for a security, but getting a reasonable report that includes it is not a lot of fun (of course, they roll their own reporting language).
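To illustrate the kind of "set model" the comment alludes to, here is a minimal Python sketch of Black-Scholes pricing for a European call, plus a one-line parametric VaR calculation. All inputs are made up, and real risk systems obviously do far more.

```python
# Minimal sketch of "set model" pricing: Black-Scholes value of a European call,
# plus a one-line parametric VaR. All inputs below are made-up examples.
from math import exp, log, sqrt
from scipy.stats import norm


def black_scholes_call(spot, strike, rate, vol, t_years):
    d1 = (log(spot / strike) + (rate + vol ** 2 / 2) * t_years) / (vol * sqrt(t_years))
    d2 = d1 - vol * sqrt(t_years)
    return spot * norm.cdf(d1) - strike * exp(-rate * t_years) * norm.cdf(d2)


print(black_scholes_call(spot=100, strike=105, rate=0.02, vol=0.25, t_years=0.5))

# One-day 99% parametric VaR on a $1M position with 1.5% daily volatility (made up).
print(1_000_000 * 0.015 * norm.ppf(0.99))
```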
While the statistical analysis in a vertical solution is easier for all the reasons you’ve mentioned, the trouble is that the vertical app is limited to its own data. An analysis is only valuable if it tells you something you don’t already know, and the most interesting “discoveries” happen when you mix different datasets. For example, combining CRM data with accounting and census data.
A second “trap” for laymen is confusing positive correlation with cause and effect. For example, married women earn less than divorced women (my intern discovered this today). Does one affect the other? Is there a third factor? If we want to give this power to regular users, the software should be able to identify, or help identify, the root cause.
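To illustrate the "third factor" trap with purely synthetic data (nothing below comes from the intern's actual data set): in this sketch, age drives both marital status and earnings, so the raw comparison shows a large gap that mostly disappears once you condition on age.

```python
# Synthetic illustration of a confounder ("third factor"). Nothing here is real data:
# in this toy world, age drives both the chance of being divorced and earnings,
# and marital status has no direct effect at all.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(22, 65, n)
divorced = rng.random(n) < (age - 22) / (65 - 22)           # older -> more likely divorced
earnings = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)   # older -> higher earnings

df = pd.DataFrame({"age_band": (age // 10) * 10, "divorced": divorced, "earnings": earnings})

# Raw comparison: divorced women appear to out-earn married women by a wide margin.
print(df.groupby("divorced")["earnings"].mean())

# Conditioning on the confounder: within each age band, the gap shrinks dramatically.
print(df.groupby(["age_band", "divorced"])["earnings"].mean().unstack())
```

Whether software can surface the right "third factor" on its own is exactly the open question the comment raises.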
@curt: agreed with the conclusion that the functionality should be vertical. The model becomes valuable to a business user when it is customized for that particular business and vertical. The deep, granular model is used by the BI user to answer very pointed business questions. Each question might require further transformation of the model. For instance, the same marketing mix model could be deployed in an application to execute what-if scenarios like profit maximization and budget reduction. It could also be used to correctly attribute revenues for historical marketing spend. In our experience, the key automation challenge is not about generating a single, all-knowing model. Instead, the automation challenge is to build a modeling environment that applies business-specific heuristics to build, validate and deploy a highly granular and customized model.
Practicing analysts can tell you that domain experience is essential to success — a top-notch SEO analyst can’t walk into the actuarial department and have instant credibility. The analytic methods used aren’t the same, and domain knowledge enables analysts to distinguish useful insights from bullhockey.
Since the analytic methods differ from niche to niche, so does the analytic software. Geneticists like to use ASREML and actuaries like to use Emblem. Hey, they both do GLM, so we can save money if they share software, right? Wrong. The differences are small, but they mean a lot to the users, and good quants have more credibility with the business than anyone in IT.
It makes more sense to add BI capabilities to statistical tools than the other way around. BI users rarely care who makes the software as long as it looks pretty. Analytics users, on the other hand, care a lot about how the math is done inside, and whether or not the vendor has street cred — because they know it’s a lot harder to build a statistical algorithm correctly than it is to present a pretty chart of sales by region.
But why try to mosh the software together at all? The output of analytic software is data — predictions, patterns, relationships and trends — which you can push into anything you like for reporting and visualization.
TD
Thomas,
Some enterprises have full-fledged departments for statistical analysis or predictive modeling, well-enough staffed to meet all such needs across all areas of the business. For them, I think your comment makes tons of sense.
But many enterprises can’t afford that luxury. For them, I think the kinds of approaches I’m talking about could be a whole lot better than nothing.
An admirable goal. I think the place to start is with the recognition that the solution will be relative to the functions and ability of the user. By way of example: the check engine light in the car is a great example of exception notification for the average “user”. The diagnostics performed at the service station that advise the replacement of a part are a somewhat more sophisticated “intelligence” function, but don’t include any real investigative analytics. The results of the service station analysis, if provided in bulk to the warranty analyst, are fodder for lots of correlation studies, and it’s here that “canned” statistics provide some understanding of the impact to the business. The engineer who gets all the data related to failures, including environmental and usage data, is probably the best candidate for BI overview / statistical drill-down / statistical modelling capabilities with AI overtones, to highlight areas for research and to suggest alternatives to consider in designing the next generation of the part.
In most cases the inclusion of more sophisticated analysis will have to be tailored because the capability to do analysis with today’s computing power is unbounded and the time to get a job done in a day is bounded. Tailoring the solution will require professionals who are experts in analytics and experts in the subject matter.
So – most BI users are best served by the check engine light. However, there is an underserved market of users who are capable of understanding and acting on statistical analysis and who have the requisite subject matter expertise. We are getting there 🙂
Hi Curt,
A joint project between EMC/Greenplum, UC Berkeley and other universities, called MADlib, embeds statistical and numerical methods in Postgres and Greenplum. This allows in-database statistical explorations to be done at small and large scale. A project worth tracking, IMHO.
Nitin
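For anyone curious what in-database statistics looks like in practice, here is a hedged sketch of invoking MADlib's linear regression from Python. The table, columns, and connection string are hypothetical, and the exact call signature should be checked against the MADlib documentation for the version in use.

```python
# Hedged sketch of in-database regression via MADlib, driven from Python.
# Assumptions: MADlib is installed in the target Postgres/Greenplum database, a table
# sales(price float8, promo float8, units float8) exists, and the DSN is adjusted.
# Check the MADlib docs for the exact linregr_train signature in your version.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=bi_user")   # hypothetical connection
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS sales_linregr, sales_linregr_summary;")
cur.execute("""
    SELECT madlib.linregr_train(
        'sales',                   -- source table
        'sales_linregr',           -- output table of coefficients
        'units',                   -- dependent variable
        'ARRAY[1, price, promo]'   -- independent variables (1 = intercept)
    );
""")
cur.execute("SELECT coef, r2 FROM sales_linregr;")
print(cur.fetchone())              # coefficients and R-squared, computed in the database

conn.commit()
cur.close()
conn.close()
```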
Hi Nitin,
I’ve written a great deal about statistics/DBMS integration here in the past. Look around! I’ve written somewhat less about MADlib, however, because I so hated the original MAD skills paper, because I haven’t written that much about Greenplum recently, and because MADlib was still something of a joke back when I was still talking with Greenplum.
[…] recent post on broadening the usefulness of statistics presupposed two things about the statistical sophistication of business intelligence tool […]
[…] an agility standpoint, the integration of predictive modeling into business intelligence would seem like pure goodness. Unfortunately, the most natural ways to do such integration would […]
[…] A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical […]