Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
Further clarifying in-database MPP SAS
My recent post about SAS’ MPP/in-database efforts was based on a discussion in a shared ride to the airport, and was correspondingly rough. SAS’ Shannon Heath was kind enough to write in with clarifications, and to allow me to post same. Read more
Categories: Aster Data, Netezza, Parallelization, Predictive modeling and advanced analytics, SAS Institute | 4 Comments |
Notes and cautions about new analytic technology
As previously noted, I headlined Aster’s Big Data Summit in Washington, DC last Thursday. More than others, that talk did reuse material I’d presented before. I promised the audience that when I got back I’d put up a blog post linking to supporting material for the talk.
Part of the time, I talked about things I’ve written about before. For example: Read more
Categories: Aster Data, Business intelligence, Data warehousing, Predictive modeling and advanced analytics, Presentations | 3 Comments |
Clarifying the state of MPP in-database SAS
I routinely am briefed way in advance of products’ introductions. For that reason and others, it can be hard for me to keep straight what’s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the SAS Institute’s multi-year effort to get SAS integrated into various MPP DBMS, specifically Teradata, Netezza Twinfin(i), and Aster Data nCluster.
However, I chatted briefly Thursday with Michelle Wilkie, who is the SAS product manager overseeing all this (and also some other stuff, like SAS running on grids without being integrated into a DBMS). As best I understood, the story is: Read more
Categories: Aster Data, Data warehouse appliances, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, SAS Institute, Specific users, Teradata | 11 Comments |
Revolution Analytics seems very confused
Revolution Analytics is a relaunch of a company previously known as REvolution Computing, built around the open source R language. Last week they sent around email claiming they were a new company (false), and asking for briefings in connection with an embargo this morning. I talked to Revolution Analytics yesterday, and they told me the embargo had been moved to Thursday.* However, Revolution apparently neglected to tell the press the same thing, and there’s an article out today — quoting me, because I’d given quotes in line with the original embargo, before I’d had the briefing myself. And what’s all this botched timing about? Mainly, it seems to be for a “statement of direction” about software Revolution Analytics hasn’t actually developed yet.
*More precisely, they spoke as if the embargo had been Thursday all along.
Categories: Investment research and trading, Parallelization, Predictive modeling and advanced analytics, Revolution Analytics, SAS Institute | 13 Comments |
TwinFin(i) – Netezza’s version of a parallel analytic platform
Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows: Read more
Categories: Aster Data, Data warehouse appliances, Data warehousing, Hadoop, MapReduce, Netezza, Predictive modeling and advanced analytics, SAS Institute, Teradata | 10 Comments |
Aster Data nCluster 4.5
Like Vertica, Netezza, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:
- Aster Data Analytic Foundation, a set of analytic packages prebuilt in Aster’s SQL-MapReduce
- Aster Data Developer Express, an Eclipse-based IDE (Integrated Development Environment) for developing and testing applications built on Aster nCluster, Aster SQL-MapReduce, and Aster Data Analytic Foundation
And in other Aster news:
- Along with the development GUI in Aster nCluster 4.5, there is also a new administrative GUI.
- Aster has certified that nCluster works with Fusion I/O boards, because at least one retail industry prospect cares. However, that in no way means that arm’s-length Fusion I/O certification is Aster’s ultimate solid-state memory strategy.
- I had the wrong impression about how far Aster/SAS integration has gotten. So far, it’s just at the connector level.
Aster Data Developer Express evidently does some cool stuff, like providing some sort of parallelism testing right on your desktop. It also generates lots of stub code, saving humans from the tedium of doing that. Useful, obviously.
But mainly, I want to write about the analytic packages. Read more
Categories: Aster Data, Data warehousing, Investment research and trading, Predictive modeling and advanced analytics, RDF and graphs, SAS Institute, Teradata | 9 Comments |
How 30+ enterprises are using Hadoop
MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance
SAS on Netezza and other Netezza extensibility
I chatted with SAS CTO Keith Collins yesterday about the new SAS/Netezza in-database parallel data mining scoring offering. My impression is that this is very similar to SAS’ current Teradata support, notwithstanding SAS’ and Teradata’s apparent original intention of offering in-database modeling by now as well.
I gather this is a big performance-enhancing deal, just as it is for SPSS or Oracle’s own data mining over Oracle. However, I must confess to not yet understanding why. That is, I don’t know what’s so complicated about data mining scoring algorithms that makes hand-coding them in SQL particularly forbidding. My naive view of data mining is that you do a big regression to get a bunch of weights, and the resulting scoring algorithm is a linear combination of a few dozen variables. Evidently, that’s not quite right.
Anyhow, it turns out that SAS held off on this work until it could be done for TwinFin. That’s largely because TwinFin lets partners write code on Intel CPUs, while previously they had to write in C for Netezza’s FPGAs. I got a similar sense from at least one other Netezza partner as well.
Categories: Data warehouse appliances, Data warehousing, Netezza, Predictive modeling and advanced analytics, SAS Institute | 5 Comments |
MapReduce user eHarmony chose Netezza over Aster or Greenplum
Depending on which IDG reporter you believe, eHarmony has either 4 TB of data or more than 12 TB, stored in Oracle but now analyzed on Netezza. Interestingly, eHarmony is a Hadoop/MapReduce shop, but chose Netezza over Aster Data or Greenplum even so. Price was apparently an important aspect of the purchase decision. Netezza also seems to have had a very smooth POC. Read more
Categories: Application areas, Aster Data, Benchmarks and POCs, Data warehousing, Greenplum, MapReduce, Netezza, Oracle, Predictive modeling and advanced analytics, Pricing | 5 Comments |
Aster Data nPath
Edit: Unfortunately, this post and its sequel rely on Aster Data posts that Aster’s buyer Teradata no longer makes easily available.
At the same time as it rolled out its cloud story, Aster Data told of nPath, a MapReduce-based feature in nCluster. As best I understand it, the core idea of nPath is that it preprocesses sequential data via MapReduce so that you can then do ordinary SQL on it. (Steve Wooledge’s blog post about nPath outlines why that might be needed. Point 1 in Mayank Bawa’s August, 2008 post is much more concise. 😉 ) Now, that might seem to contradict the syntax, which is all about MapReduce being invoked via SQL — still, it’s what’s really going on.
That leads to two obvious questions: What is nPath used (or useful) for? and How is the preprocessing done anyway? Read more