High-performance analytics
For the past few months, I’ve collected a lot of data points to the effect that high-performance analytics – i.e., beyond straightforward query – is becoming increasingly important. And I’ve written about some of them at length. For example:
- MapReduce – controversial or in some cases even disappointing though it may be – has a lot of use cases.
- It’s early days, but Netezza and Teradata (and others) are beefing up their geospatial analytic capabilities.
- Memory-centric analytics is in the spotlight.
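Since MapReduce comes up repeatedly in this context, here’s a toy illustration of the model in plain Python – no framework, and the word-count task is just the classic teaching example, not anything from the posts linked above:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair per word -- the classic word-count mapper.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then combine each group's values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["a b a", "b a"]))
print(counts)  # {'a': 3, 'b': 2}
```

The point of the pattern is that the map and reduce steps are independently parallelizable across many nodes – which is exactly why the MPP database vendors are interested in it.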
Ack. I can’t decide whether “analytics” should be a singular or plural noun. Thoughts?
Another area that’s come up, which I haven’t blogged about as much, is data mining in the database. Data mining accounts for a large part of data warehouse use. The traditional way to do data mining is to extract data from the database and dump it into SAS. But there are problems with this scenario, including:
- There’s a lot of data to move.
- Therefore it’s tempting to analyze only a sample rather than the whole data set, which can have at least a slight negative effect on model accuracy.
- The result of the process is often some kind of scoring algorithm, and you may want to execute that in real time rather than in batch mode.
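To make the data-movement problem concrete, here’s a minimal sketch of the traditional workflow, with Python’s built-in SQLite standing in for the warehouse and a made-up `transactions` table standing in for real data. Every row crosses the wire before any analysis happens:

```python
import sqlite3

# SQLite stands in for the data warehouse; "transactions" is an invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 15.0), (2, 5.0)])

# Traditional approach: extract ALL the rows to the client side first.
rows = conn.execute("SELECT customer_id, amount FROM transactions").fetchall()

# Client-side "mining": total spend per customer (a trivial stand-in for
# whatever the SAS job would do with the dump).
totals = {}
for customer_id, amount in rows:
    totals[customer_id] = totals.get(customer_id, 0.0) + amount

print(totals)  # every transaction row crossed the wire to produce this
```

At warehouse scale, that `fetchall()` is the bottleneck the in-database approaches below are trying to eliminate.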
Various interesting fixes have been tried.
- SAS and Teradata are partnering quite closely to run SAS on Teradata boxes.
- Database management system vendors are building at least the data scoring part right into the DBMS. SAS rival SPSS – which relies more on just-in-time SQL and less on batch extracts anyway – reports that hooking into Oracle’s native scoring produces massive performance gains. (To put that another way – I finally got independent confirmation of what Oracle’s Charlie Berger has been telling me for years.)
- Data preparation can be handled by the general ELT/ETLT (Extract/(Transform)/Load/Transform – i.e., in-database data transformation) strategies of the data warehouse DBMS vendors.
- Oracle (more than most competitors, although SAS/Teradata are headed that way too) actually does all stages of data mining right in the database.
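A hedged sketch of what in-database scoring looks like in miniature – again SQLite via Python as a stand-in, and the segmentation rules are invented, not any vendor’s actual model. The scoring logic is expressed as SQL, so the DBMS evaluates it where the data lives and ships back only one small row per customer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, recency INTEGER, spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 3, 500.0), (2, 40, 50.0), (3, 10, 200.0)])

# In-database scoring: the (made-up) model is a SQL expression, so no raw
# data leaves the database -- only the resulting scores do.
scores = conn.execute("""
    SELECT customer_id,
           CASE WHEN recency < 7 AND spend > 100 THEN 'hot'
                WHEN recency < 30                THEN 'warm'
                ELSE 'cold' END AS segment
    FROM customers
""").fetchall()

print(scores)
```

Real products push far more than a CASE expression into the engine, but the shape of the win is the same: the expensive pass over the data happens inside the database.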
Vendors who are putting considerable marketing emphasis on parallel analytics include:
- Greenplum and Aster Data (especially MapReduce)
- Oracle (the data mining story and more)
- Teradata (the SAS deal, the geospatial effort, and more)
- Netezza (especially in connection with the Netezza Developer Network)
I’m sure others would say they belong on the list as well. It’s an important area of competitive differentiation.
Comments
You should mention leveraging SQL analytic functions and other SQL capabilities. If one can code complex “in database” SQL, it will often blow the pajamas off the time to transfer/crunch an equivalent SAS -> data dump -> crunch approach. If one can code equivalents to FOR/NEXT loops (e.g. via row_number() with logic), IF/THEN constructs (via CASE/WHEN), and procedural flow (via nested inline views), there are many set-based approaches where one can take on problems previously in the SAS/SPSS/“R” domain. -D
Dave,
Exactly!
Would you care to elaborate further? 🙂
Best,
CAM
CAM,
Sadly, I cannot elaborate much, since most of our SQL-based techniques are IP and can’t be shared in a public forum. I can say that many signal detection, scoring, interpolation, and fuzzy matching techniques can be coded with creative SQL.
-D
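The kind of set-based approach Dave describes can be sketched like this (SQLite via Python as a stand-in; the table, data, and threshold are all invented). CASE/WHEN supplies the IF/THEN, and row_number() in an inline view supplies the loop-like per-group ordering, so no client-side loop is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 [("a", 1, 10.0), ("a", 2, 99.0), ("b", 1, 5.0), ("b", 2, 6.0)])

# row_number() in the inline view picks the latest reading per sensor
# (the FOR/NEXT equivalent); CASE/WHEN then flags it (the IF/THEN equivalent).
latest = conn.execute("""
    SELECT sensor, value,
           CASE WHEN value > 50 THEN 'alert' ELSE 'ok' END AS status
    FROM (SELECT sensor, value,
                 ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY ts DESC) AS rn
          FROM readings)
    WHERE rn = 1
""").fetchall()

print(latest)
```

One statement, evaluated entirely inside the database – the set-based replacement for an extract-and-loop job.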