Trends in predictive modeling
I talked with Teradata yesterday about a bunch of stuff, including this week’s in-database predictive modeling announcements. The specific news was about partnerships with Fuzzy Logix and Revolution Analytics. But what I found more interesting was the surrounding discussion. In a nutshell:
- Teradata is finally seeing substantial interest in in-database modeling, rather than just in-database scoring (which has been important for years) and in-database data preparation (which is a lot like ELT: Extract/Load/Transform). The modeling-vs.-scoring contrast is sketched in code below, after the footnotes.
- Teradata is seeing substantial interest in R.
- It seems as if similar groups of customers are interested in both parts of that, such as:
- Usual-suspect consumer marketing sectors (telecom, credit card, retail).*
- Semiconductor manufacturing.**
- Parallelized SAS modeling on Teradata seems to be limited by the small number of algorithms that are parallelized. (SAS scoring, I presume, is a different matter.)
This is the strongest statement of perceived demand for in-database modeling I’ve heard. (Compare Point #3 of my July predictive modeling post.) And it fits with what I’ve been hearing about R.
*That’s very similar to the list of sectors for SAS HPA.
**To support their extremely high focus on product quality, semiconductor manufacturers have been using state-of-the-art analytic tools for at least 30 years.
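To make the scoring-vs.-modeling distinction concrete, here is a minimal sketch of the traditional out-of-database workflow, with SQLite standing in for the data warehouse and invented table and column names; at terabyte scale, the extract step is exactly what in-database modeling eliminates.

```r
# Sketch of out-of-database modeling: extract first, then fit.
# SQLite stands in for the warehouse; table and columns are hypothetical.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Fake "warehouse" table of customer behavior (purely illustrative).
set.seed(1)
customers <- data.frame(
  tenure_months = rpois(10000, 24),
  monthly_spend = rlnorm(10000, 4, 0.5)
)
customers$churned <- rbinom(10000, 1,
  plogis(-1 - 0.05 * customers$tenure_months + 0.002 * customers$monthly_spend))
dbWriteTable(con, "customers", customers)

# Step 1: move the data out of the database -- the step that dominates
# analytic cycle time at scale, and the step in-database modeling removes.
training <- dbGetQuery(con, "SELECT tenure_months, monthly_spend, churned
                             FROM customers")

# Step 2: fit the model client-side.
model <- glm(churned ~ tenure_months + monthly_spend,
             data = training, family = binomial)
summary(model)

dbDisconnect(con)
```

In-database modeling pushes the fitting step itself down to where the data lives, so only coefficients and diagnostics travel back, not the training set.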
In-database modeling is a performance feature, and performance can have several kinds of benefit, which may be summarized as “cheaper”, “better”, and “previously impractical”. My impression is that in-database modeling sits pretty far toward the “previously impractical” end of the spectrum; enterprises don’t adopt a new way of predictive modeling until they want to create models that can’t get done the old way.
Basically, I think that models are increasingly:
- Richer and more diverse than before. (See, for example, Point #5 of my July predictive modeling post.)
- Developed in a more experimental and quickly-iterative way than before.
I think the first point pretty much implies the second, but the converse isn’t as clear; one can tweak old-style models in quick-turnaround fashion even more easily than one can develop the more complex newer styles.
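As a toy illustration of that quick-iteration style (all data simulated, variable names invented), one can fit a ladder of increasingly rich model specifications and compare them in seconds:

```r
# Sketch of quick-iteration model development: fit several candidate
# specifications of increasing richness and compare them immediately.
library(splines)  # ns() for natural splines; ships with base R

set.seed(2)
n <- 5000
d <- data.frame(age = runif(n, 18, 80), spend = rlnorm(n, 4, 0.6))
d$respond <- rbinom(n, 1, plogis(-3 + 0.04 * d$age + 0.0015 * d$spend +
                                 0.00002 * d$age * d$spend))

candidates <- list(
  basic       = respond ~ age + spend,
  interaction = respond ~ age * spend,
  spline      = respond ~ ns(age, 4) + ns(spend, 4) + age:spend
)

fits <- lapply(candidates, glm, data = d, family = binomial)
sapply(fits, AIC)  # compare richness vs. fit quality in one line
```

The richer the candidate set, the more this kind of rapid comparison matters, which is the sense in which the first point implies the second.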
And finally: I’m not hearing that modeling, even when parallel in-database execution makes it fast, is commonly done on a complete many-terabyte dataset. It’s not a question I always remember to ask; for example, I didn’t bring it up with Teradata. But when I do, I rarely hear of models being trained on more than a few terabytes of data each.
Comments
There are many benefits to modeling in-database, not the least of which is the elimination of data movement. Even if one builds models on “only” a few terabytes, the effort to move that data adds serious latency to the analytic cycle time. Moreover, there are very few server-based analytic products that can work with terabyte-sized data sets, so analysts working outside of a database typically work with sets of 100GB or less.
SAS HPA has failed to gain acceptance, but not because it has a limited number of algorithms. The reasons this product has failed are (1) SAS has priced the product out of the market; (2) the product architecture restricts deployment choices; and (3) the product’s software engineering makes high demands on infrastructure, leading to exploding TCO. HPA runs “in the appliance” rather than “in the database”, so it requires specially constructed platforms that bulk up memory and reduce storage. In the case of Teradata, SAS HPA runs only on the dedicated 720 appliance, and not on other members of the Teradata family.
SAS’ more recent attempts to deploy HPA into Hadoop in a “run beside on the node” approach have also predictably run into roadblocks, since HPA cannot run on the typical commodity node servers that Hadoop customers use.
SAS scoring is also limited to those algorithms supported by the SAS Scoring Accelerator or the SAS Analytics Accelerator, a subset of SAS analytic capabilities. Since both of these products have limitations of their own (and also require substantial additional licensing fees), and since SAS/STAT does not export PMML, many SAS customers simply rebuild the scoring code in SQL, C, Java, R or Python.
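For a concrete sense of what that rebuild can look like, here is a minimal sketch that translates a fitted logistic regression into a SQL scoring expression; the customers table and customer_id column are hypothetical, and a real deployment would also need to handle factors, missing values, and non-linear models:

```r
# Sketch: hand-translating a fitted logistic regression into a SQL
# scoring expression, the kind of rebuild described above.
# (Works for simple linear predictors; trees etc. need more machinery.)
set.seed(3)
d <- data.frame(tenure = rpois(1000, 24), spend = rlnorm(1000, 4, 0.5))
d$churned <- rbinom(1000, 1, plogis(-1 - 0.05 * d$tenure + 0.002 * d$spend))

fit <- glm(churned ~ tenure + spend, data = d, family = binomial)
b <- coef(fit)

# Build "intercept + b1*tenure + b2*spend", then apply the inverse logit.
linpred <- paste(
  c(sprintf("%.6f", b[1]),
    sprintf("%.6f * %s", b[-1], names(b)[-1])),
  collapse = " + ")
scoring_sql <- sprintf(
  "SELECT customer_id, 1.0 / (1.0 + EXP(-(%s))) AS churn_score FROM customers",
  linpred)
cat(scoring_sql)
```

The emitted SQL runs wherever the data lives, with no PMML, no accelerator license, and no per-row data movement.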
Revolution R Enterprise, on the other hand, will run on any instance of Teradata 14.10, in any Teradata appliance. And any R code developed anywhere, on any platform, will run in Teradata in Revolution R.