SAS in its own cloud
The Register has a fairly detailed article about SAS expanding its cloud/SaaS offerings. I disagree with one part, namely:
SAS may not have a choice but to build its own cloud. Given the sensitive nature of the data its customers analyze, moving that data out to a public cloud such as the Amazon EC2 and S3 combo is just not going to happen.
And even if rugged security could make customers comfortable with that idea, moving large data sets into clouds (as Sun Microsystems discovered with the Sun Grid) is problematic. Even if you can parallelize the uploads of large data sets, it takes time.
But if you run the applications locally in the SAS cloud, then doing further analysis on that data is no big deal. It’s all on the same SAN anyway, locked down locally just as you would do in your own data center.
I fail to see why SAS's campus would be better than leading hosting companies' data centers for either data privacy/security or data upload speed. Rather, I think the major reasons for SAS building its own data center for cloud computing probably focus on:
- Choice of hardware. SAS works hard with hardware engineers to optimize its software for specific platforms. Also, last I looked, it was still pretty SMP-oriented, SAS’s deal with Teradata notwithstanding, but that speaks more against the Amazon cloud than it does against some of the more classic SaaS hosts.
- Why not? (Part 1) Yes, bigger SaaS vendors than SAS have chosen to outsource their hosting, notably Salesforce.com. Still, SAS’s effort seems big enough to get reasonable economies of scale.
- Why not? (Part 2) To the extent SAS finds hosting difficult — well, even that’s a benefit. It informs the development operation what it needs to do to make the software more manageable. True, Oracle, SAP et al. don’t seem persuaded by similar reasoning — but SAS has always marched to the beat of its own drum.
- Return to its roots. Unless I'm terribly mistaken, SAS started out in the 1970s as a time-sharing vendor, just as Information Builders did. (And unlike IBI, SAS has never gotten away from focusing on recurring revenue.)
Comments
Curt,
Why would the ‘smp-oriented’ nature of SAS be a problem on Amazon? EC2 provides servers that appear to be SMP servers from a client’s perspective — http://aws.amazon.com/ec2/#instance.
Mark,
I don’t know exactly what SAS is or isn’t best optimized for. And http://www.sas.com/partners/directory/sun/ZencosSPDS-X4500.pdf would seem, if anything, to speak against what I was suggesting.
But anyway — whether SAS needs to control its own hardware and whether SAS wants to control its own hardware aren’t exactly the same question. 😉
CAM
What SAS really needs to perform well is I/O to support the constant sorting it does. Lots and lots of I/O. The quants I have supported generally don't make heavy demands on CPU or memory; what they really need is solid-state disk.
You have to code SAS specifically to take advantage of more than one CPU, I think.
My knowledge is limited though, especially when it comes to the SAS application suite.
Just speculating, but I wonder if part of this SaaS play might be some kind of proprietary map/reduce framework for SAS? Would be kind of a natural fit in some ways.
SAS has to be careful, or versions of R integrated into cheap MPP data warehouse DBMSs — with or without MapReduce — will be an increasingly big threat.
@Guy, SAS has had built-in multi-threading for 5 years now.
One of the applications that SAS has offered in SaaS mode for years is SAS Drug Development. I believe that 21 CFR Part 11 regulation for these pharma users is part of the self-hosting decision.
I can think of one or two other technical factors that may be considerations, but SAS also just has a company culture of going their own way.
Yes, multithreading is built in in the sense that you don't have to pay extra for it. However, to take advantage of it, I believe you have to rewrite the code to use the special multithreaded calls. It's not an abstracted, configurable parameter at the global level, like it would be in a relational database.
http://www2.sas.com/proceedings/forum2007/036-2007.pdf
You don't have to code for it. Multi-threading is built in in the sense that it just happens for you. Starting with SAS 9, certain SAS procedures have been modified to take advantage of multi-threading (unless options are used to suppress this). The current list of multi-threaded procedures may be found at:
http://support.sas.com/rnd/scalability/procs/index.html
However, it is true that this happens at the level of an individual processing step, not an entire SAS program.
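Roughly, the distinction being described looks like this (a hypothetical Python analogy, not SAS syntax; the step names are invented for illustration): the threading lives inside an individual step, while the program as a whole still runs its steps one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Python analogy (not SAS code): an individual "step" may fan
# out to several threads internally, but the steps still run sequentially.

def sort_step(chunks):
    # Sort each chunk on its own thread, then do a final serial merge.
    with ThreadPoolExecutor() as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    merged = []
    for c in sorted_chunks:
        merged.extend(c)
    return sorted(merged)

def summary_step(rows):
    # A plain single-threaded step; nothing forces the whole program to be parallel.
    return {"n": len(rows), "total": sum(rows)}

data = [[5, 3, 9], [2, 8, 1], [7, 4, 6]]
rows = sort_step(data)       # this step is multi-threaded internally
print(summary_step(rows))    # this step is not
```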
I think that R has farther to go than many people realize in order to really catch up to SAS.
For example, you can hook SAS up to a terabyte of data and run some analysis. It might be slow and disk intensive, but it’ll work. Try a huge data set with R and it’ll fail; you’ll either need some custom programming or it’ll just be impossible, depending on the analysis.
For that and several other reasons, SAS seems safe for now. But I agree that doesn’t mean they should consider themselves safe in the long run.
Hans,
That depends on the specific implementation of R, doesn't it? 🙂
CAM
Well, not exactly. The built-in SAS analytics have always been coded to keep as little as possible in memory. Input data is read one line at a time, and temporary data is stored by default in a disk file. This is why SAS is by default so disk-intensive, although there are many possible configurations.
R (all versions to my knowledge) only deals with data in memory. If you want to handle more data than you have memory, you will have to chunk it through a piece at a time. There are libraries to help, but still it can be hard to get a handle on. And if your analytics library wants all of the data at once – well, you’re out of luck and it needs to be rewritten to handle chunks.
As of fairly recently, S-PLUS has a feature called Big Data that replaces the memory cache with a disk cache. So programs operate thinking they're dealing with memory, but really they read from disk. There is work to implement this in R, but I don't know of a widely used and stable version yet. So for now, the state is this: most R libraries don't handle chunking data, so most libraries can only handle as much data as you have memory. R users have been known to upgrade to 64-bit because of this. In the future, there are many possible solutions, including allowing R working "memory" to be cached on disk, or rewriting libraries to handle data differently.
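To make the chunking point concrete, here is a minimal Python sketch (not R or S-PLUS code; the file name and one-value-per-line layout are assumed for illustration) of computing a mean one chunk at a time, so that only running totals ever sit in memory:

```python
# Minimal out-of-core ("chunked") processing sketch in Python rather than R:
# only running totals are held in memory, never the full data set.

def chunked_mean(path, chunk_size=100_000):
    total, count = 0.0, 0
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(float(line))
            if len(chunk) == chunk_size:
                total += sum(chunk)    # fold the full chunk into running totals
                count += len(chunk)
                chunk = []
    if chunk:                          # fold in any leftover partial chunk
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")

# print(chunked_mean("measurements.txt"))   # hypothetical input file
```

An analysis that genuinely needs all the data at once (an exact median, say) can't be written this way without a rewrite, which is the limitation described above.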
What about the cost implications here? SAS charges an arm and a leg for its software. My question to you folks is whether you think those charges might start to decrease once SAS is available, like other software, within this cloud computing environment.
I wanted to point out the difference between SAS-as-a-service and SAS deployed in a computing cloud.
SAS-as-a-service makes a lot of sense. With SAS, you load in data and get out reports, summaries, etc. It's not like a database where you're constantly reading the data you load in. Internet latency is a killer for DBMS access, but not for getting reports from SAS. So it makes perfect sense for SAS to do SaaS. This will come with a completely different pricing structure.
Licensing SAS for the Amazon cloud – well, that is just a SAS license. I don't see why it would be cheaper just because it's on a computer that runs in a "cloud".
Hans,
My point was more that R on a sufficiently parallel system might be able to process more data …
Best,
CAM
Well, the answer really depends on the usage. The real answer is a little technical.
But many statisticians love SAS because so often they don't have to worry at all about things like memory use, whereas in R you always do.
@Hans it's not uncommon to have SAS filesystem "datamarts" holding persistent data to support the quants. The overhead between SAS and the RDBMS is still high; caching data locally and persistently makes sense in many situations.
@Curt Even a very distributed environment, like Hadoop, makes it very clear you should not expect to fit everything into memory.
The thing that makes the SAS map/reduce concept seem kind of reasonable to me is that SAS is natively much more in a map/reduce file-processing kind of mode than in a distributed RDBMS mode. I could also imagine using something like Hadoop Streaming to take existing SAS and run it distributed, provided you were clever with your hashing (and could find a way around the ridiculous license cost).
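As a rough illustration of the "clever with your hashing" idea (a hypothetical Python sketch in Hadoop Streaming style, not anything SAS actually ships): the mapper keys each record by its BY-group, so the shuffle delivers every row for a group to the same reducer, which then applies the per-group logic.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming-style sketch in Python (not SAS): the mapper
# emits a BY-group key so the shuffle routes all rows for a group to one
# reducer, where per-group processing (here just a sum) takes place.
import sys

def mapper():
    for line in sys.stdin:
        group, value = line.rstrip("\n").split(",")[:2]
        print(f"{group}\t{value}")            # key on the BY-group

def reducer():
    current, total = None, 0.0
    for line in sys.stdin:                    # input arrives sorted by key
        group, value = line.rstrip("\n").split("\t")
        if group != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = group, 0.0
        total += float(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The same script could in principle be handed to Hadoop Streaming as both the mapper and the reducer; the real licensing and multi-key BY-group subtleties are glossed over here.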
Trying to get SAS to run natively inside an RDBMS on the other hand is an order of magnitude different problem.
SAS is very far from SQL.
So far SAS/Teradata seems to be mostly trying to reverse-engineer SAS procedures as compiled C objects inside Teradata, one at a time.