SAS on Netezza and other Netezza extensibility
I chatted with SAS CTO Keith Collins yesterday about the new SAS/Netezza in-database parallel data mining scoring offering. My impression is that this is very similar to SAS’ current Teradata support, notwithstanding SAS’ and Teradata’s apparent original intention of offering in-database modeling by now as well.
I gather this is a big performance-enhancing deal, just as it is for SPSS or Oracle’s own data mining over Oracle. However, I must confess to not yet understanding why. That is, I don’t know what’s so complicated about data mining scoring algorithms that makes hand-coding them in SQL particularly forbidding. My naive view of data mining is that you do a big regression to get a bunch of weights, and the resulting scoring algorithm is a linear combination of a few dozen variables. Evidently, that’s not quite right.
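To make that naive view concrete, here is a minimal Python sketch of a linear scoring function and the equivalent hand-written SQL expression. The weights, column names, and table name are hypothetical, chosen only for illustration; nothing here reflects how SAS or Netezza actually implement scoring, and real models evidently involve more than this.

```python
def linear_score(weights, intercept, row):
    """Score one record as intercept + sum(weight * value)."""
    return intercept + sum(w * row[name] for name, w in weights.items())


def scoring_sql(weights, intercept, table):
    """Render the same linear combination as a SQL SELECT statement."""
    terms = [repr(float(intercept))]
    terms += [f"({float(w)!r} * {name})" for name, w in weights.items()]
    return f"SELECT {' + '.join(terms)} AS score FROM {table}"


# Hypothetical weights and column names, for illustration only.
weights = {"age": 0.5, "income": 0.25}
intercept = 1.0

print(linear_score(weights, intercept, {"age": 40, "income": 8}))
print(scoring_sql(weights, intercept, "customers"))
```

If scoring really were this simple, generating the SQL by hand would be a one-liner per model, which is why the performance claims for in-database scoring suggest the real algorithms are more involved.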
Anyhow, it turns out that SAS held off on this work until it could be done for TwinFin. That’s largely because TwinFin lets partners write code on Intel CPUs, while previously they had to write in C for Netezza’s FPGAs. I got a similar sense from at least one other Netezza partner as well.
Comments
5 Responses to “SAS on Netezza and other Netezza extensibility”
Well, SAS has a pretty darn good GLM algorithm – able to do regressions of unlimited size in limited memory and parallelized as well. Probably better than most, certainly better than SPSS’ algorithm, last I looked.
Also, SAS has pretty much every known model selection method built on top of that algorithm, and whatever magic they do, their model selection is usually much faster than other software’s.
So I can see why many people would like to use SAS over some other package here.
I think you need to ask Netezza flat-out whether their UDFs execute in the FPGA or not. They certainly have left people with that impression but I do not believe it is correct.
My understanding is that Netezza UDFs execute in the operating system on the SPUs, running on the CPU in the SPU, not in the FPGA.
The NPS 10000 and earlier boxes used a proprietary OS based on the Nucleus Plus real-time OS. These systems also ran PowerPC CPUs. To my understanding, the UDF development environment executed on the Linux/Intel Host node and cross-compiled for Nucleus/PowerPC.
TwinFin now uses Intel Xeon and runs Linux, allowing UDF development on Red Hat Linux/Intel Xeon on the Host node and deployment on the SPU Linux variant/Intel Xeon. No more cross-compiler, Linux on both sides.
I suspect there may be some very specialized users programming the FPGA, such as 3-letter government agencies. But I do not believe ordinary developers are programming the FPGA; instead, they are writing UDF code that runs on the OS.
It would be nice if you could set the record straight here.
Disclaimer: The views expressed in this comment are my own and do not necessarily reflect the views of Teradata. The views and opinions expressed by others on this comment thread are theirs, not mine.
Hey John, just curious about the pros/cons of running the UDF in the FPGA vs. the OS?
Netezza UDX code executes on the CPU. I don’t ever recall Netezza implying that it executes on the FPGAs. The FPGA is used for operations such as decompression, simple filtering, dropping columns not used by the current SQL statement, transactional data visibility, etc. The publicly available information is pretty clear on the subject.
Everything else, such as UDX, case statements, joins, complex filtering, etc., is done on the CPU.
[…] old generation of products but not its latest one, even though SAS CTO Keith Collins told me exactly the opposite would be […]