Predictive modeling and advanced analytics

Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.

May 17, 2012

Thoughts on “data science”

Teradata is paying me to join a panel on “data science” in downtown Boston, Tuesday May 22, at 3:00 pm. A planning phone call led me to jot down a few notes on the subject, which I’m herewith adapting into a blog post.

For starters, I have some concerns about the concepts of data science and data scientist. Too often, the term “data scientist” is used to suggest that one person needs to have strong skills both in analytics and in data management. But in reality, splitting those roles makes perfect sense. Further:

The leader in raising these issues is probably Neil Raden.

But there’s one respect in which I think the term “data science” is highly appropriate. In conventional science, gathering data is just as much of an accomplishment as analyzing it. Indeed, most Nobel Prizes are given for experimental results. Similarly, if you’re doing data science, you should be thinking hard about how to corral ever more useful data. Techniques include but are not limited to:

May 7, 2012

Relationship analytics application notes

This post is part of a series on managing and analyzing graph data. Posts to date include:

In my recent post on graph data models, I cited various application categories for relationship analytics. For most applications, it’s hard to get a lot of details. Reasons include:

Even so, it’s fairly safe to say:

Read more

April 24, 2012

Three quick notes about derived data

I had one of “those” trips last week:

So please pardon me if things are a bit disjointed …

I’ve argued for a while that:

Here are a few notes on the derived data trend. Read more

February 27, 2012

Translucent modeling, and the future of internet marketing

There’s a growing consensus that there need to be limits on the predictive modeling done about consumers. That’s a theme of the Obama Administration’s recent work on consumer data privacy; it’s central to other countries’ data retention regulations; and it’s specifically borne out by the recent Target-pursues-pregnant-women example. Whatever happens legally, I believe this also calls for a technical response, namely:

Consumers should be shown key factual and psychographic aspects of how they are modeled, and be given the chance to insist that marketers disregard any or all of those aspects.

I further believe that the resulting technology should be extended so that

information holders can collaborate by exchanging estimates for such key factors, rather than exchanging the underlying data itself.

To some extent this happens today, for example with attribution/de-anonymization or with credit scores; but I think it should be taken to another level of granularity.

My name for all this is translucent modeling, rather than “transparent”, the idea being that key points must be visible, but the finer details can be safely obscured.
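The proposal can be made concrete with a small sketch. Everything below (the factor names, weights, and opt-out mechanism) is a hypothetical illustration of the idea, not any marketer's actual system:

```python
# Hypothetical sketch of "translucent modeling": the factor names, weights,
# and opt-out mechanism are illustrative assumptions, not a real system.
from dataclasses import dataclass, field

@dataclass
class ConsumerProfile:
    factors: dict                            # key factual/psychographic factors
    suppressed: set = field(default_factory=set)

    def visible_factors(self):
        """The aspects of the model a marketer must show the consumer."""
        return dict(self.factors)

    def suppress(self, factor):
        """Consumer insists the marketer disregard this aspect."""
        self.suppressed.add(factor)

    def score(self, campaign_weights):
        """Score the consumer for a campaign, honoring opt-outs."""
        return sum(w * self.factors.get(f, 0.0)
                   for f, w in campaign_weights.items()
                   if f not in self.suppressed)

profile = ConsumerProfile({"expecting_parent": 0.83, "bargain_hunter": 0.40})
profile.suppress("expecting_parent")         # consumer opts out of one factor
print(profile.score({"expecting_parent": 1.0, "bargain_hunter": 0.5}))  # 0.2
```

Note that collaborating information holders could exchange just the numeric estimates in `factors`, rather than the underlying behavioral data itself.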

Examples of dialog I think marketers should have with consumers include: Read more

February 27, 2012

The latest privacy example — pregnant potential Target shoppers

Charles Duhigg of the New York Times wrote a very interesting article, based on a forthcoming book of his, on two related subjects:

The predictive modeling part is that Target determined:

and then built a marketing strategy around early indicators of a woman’s pregnancy. Read more
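Purely for illustration, early-indicator scoring of that kind might be sketched as follows; the purchase signals, weights, and bias term here are hypothetical, not Target's actual model:

```python
# Hypothetical early-indicator scoring; the signals and weights are invented
# for illustration, not taken from Target's actual model.
import math

EARLY_INDICATORS = {
    "unscented_lotion": 1.4,      # hypothetical purchase signal and weight
    "prenatal_vitamins": 2.9,
    "cotton_balls_bulk": 0.8,
}

def pregnancy_score(basket):
    """Logistic score (0 to 1) over weighted purchase indicators."""
    z = sum(EARLY_INDICATORS.get(item, 0.0) for item in basket) - 2.0  # hypothetical bias
    return 1.0 / (1.0 + math.exp(-z))

print(round(pregnancy_score({"unscented_lotion", "prenatal_vitamins"}), 2))  # 0.91
```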

February 26, 2012

SAP HANA today

SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.

The HANA section of SAP’s website is a confusing and sometimes inaccurate mess. But an IBM white paper on SAP HANA gives some helpful background.

SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, currently from eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:

SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.
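To make the row-store/column-store distinction at HANA's core concrete, here is a generic teaching sketch of the two layouts; it is not HANA code, and the table is invented:

```python
# Generic sketch of row-oriented vs. column-oriented storage.
# Illustrative only; not SAP HANA code.

# Row store: each record stored contiguously (good for OLTP-style lookups).
rows = [
    {"id": 1, "region": "EMEA", "revenue": 120},
    {"id": 2, "region": "APJ",  "revenue": 95},
    {"id": 3, "region": "EMEA", "revenue": 210},
]

# Column store: each attribute stored contiguously (good for scans/aggregates).
columns = {
    "id":      [1, 2, 3],
    "region":  ["EMEA", "APJ", "EMEA"],
    "revenue": [120, 95, 210],
}

# An analytic query touches only the columns it needs:
emea_revenue = sum(r for r, g in zip(columns["revenue"], columns["region"])
                   if g == "EMEA")
print(emea_revenue)  # 330
```

The columnar layout also compresses well, which is part of why in-memory analytic DBMS favor it.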

Other SAP HANA components include: Read more

February 8, 2012

Comments on SAS

A reporter interviewed me via IM about how CIOs should view SAS Institute and its products. Naturally, I have edited my comments (lightly) into a blog post. They turned out to cluster into three groups, as follows:

February 6, 2012

Sumo Logic and UIs for text-oriented data

I talked with the Sumo Logic folks for an hour Thursday. Highlights included:

What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more
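Sumo Logic hasn't spelled out its algorithm here, but automated classification of log data is commonly done by clustering lines on the template that remains after masking variable tokens. A minimal sketch, with invented log lines:

```python
# Generic log-classification sketch: group lines by the template left after
# masking variable tokens. Not Sumo Logic's actual algorithm.
import re
from collections import Counter

def signature(line):
    """Reduce a raw log line to a template by masking hex ids and numbers."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

logs = [
    "user 1042 logged in from 10.0.0.7",
    "user 2211 logged in from 10.0.0.9",
    "disk /dev/sda1 at 91% capacity",
]

for template, n in Counter(signature(l) for l in logs).most_common():
    print(n, template)
# 2 user <NUM> logged in from <NUM>.<NUM>.<NUM>.<NUM>
# 1 disk /dev/sda<NUM> at <NUM>% capacity
```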

January 25, 2012

Departmental analytics — best practices

I believe IT departments should support and encourage departmental analytics efforts, where “support” and “encourage” are not synonyms for “control”, “dominate”, “overwhelm”, or even “tame”. A big part of that is:
Let, and indeed help, departments have the data they want, when they want it, served with blazing performance.

Three things that absolutely should NOT be obstacles to these ends are:

Read more

January 18, 2012

KXEN clarifies its story

I frequently badger my clients to tell their story in the form of a company blog, where they can say what needs saying without being restricted by the rules of other formats. KXEN actually listened, and put up a pair of CTO posts that make the company story a lot clearer.

Excerpts from the first post include (with minor edits for formatting, including added emphasis):

Back in 1995, Vladimir Vapnik … changed the machine learning game with his new ‘Statistical Learning Theory’: he provided the machine learning guys with a mathematical framework that allowed them finally to understand, at the core, why some techniques were working and some others were not. All of a sudden, a new realm of algorithms could be written that would use mathematical equations instead of engineering data science tricks (don’t get me wrong here: I am an engineer at heart and I know the value of “tricks,” but tricks cannot overcome the drawbacks of a bad mathematical framework). Here was a foundation for automated data mining techniques that would perform as well as the best data scientists deploying these tricks. Luck is not enough though; it was because we knew a lot about statistics and machine learning that we were able to decipher the nuggets of gold in Vladimir’s theory.

Read more
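The best-known algorithm to come out of Vapnik's Statistical Learning Theory is the support vector machine. As a quick illustration of SLT in practice (scikit-learn and the synthetic data are my choices for the example, not anything KXEN ships):

```python
# Minimal SVM example; scikit-learn and the synthetic dataset are this
# example's assumptions, not KXEN's technology.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="rbf")           # kernelized SVM, a direct product of SLT
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```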
