Commercial software for academic use
As Jacek Becla explained:
- Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.
- What’s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.
Even so, I think that academic researchers, in the natural and social sciences alike, commonly overlook the wealth of commercial software that could help them in their efforts.
I further think that the commercial software industry could do a better job of exposing its work to academics, where by “expose” I mean:
- Give your stuff to academics for free.
- Call their attention to your free offering.
Reasons to do so include:
- Public benefit. Scientific research is important.
- Training future customers. There’s huge academic/commercial crossover, especially as students join the for-profit workforce.
The biggest issue is probably large-scale database management. There’s a feeling (permeating, for example, parts of the XLDB conference and the associated SciDB project) that data stores suitable for holding large amounts of data are either:
- Hadoop or
- Forbiddingly expensive.
I think that’s overstated. In particular:
- You can put >10 terabytes of machine-generated data (or any other kind) into Infobright and have it well taken care of; Infobright is open source.
- You can put >1 petabyte into [name redacted],* among others; [name redacted]* should be out soon with a generously free offering for academic users. Edit: That would be Vertica.
- Conventional relational queries, graph analysis, statistical analysis preparation and more can all be much faster in a good analytic DBMS than in alternative kinds of data stores.
- Integration between SQL and other analytic languages is ever improving, as analytic DBMS evolve into "analytic platforms" (a minimal sketch of that integration, from the scripting-language side, follows the footnote below).
*My permission to use the name was yanked after this post was largely drafted. I’m sufficiently pleased with the forthcoming offering itself that I can’t get upset about the procedural confusion.
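As a minimal sketch of what that SQL-to-scripting-language integration can look like in practice, the snippet below pushes the heavy relational work into the DBMS and pulls back only a reduced result. The connection string, table, and column names are purely hypothetical, and pandas/SQLAlchemy simply stand in for whatever client stack a lab actually uses.

```python
# Minimal sketch of SQL <-> analytic-language integration: do the heavy
# relational work (filter, group, aggregate) inside the analytic DBMS,
# then pull only the reduced result into Python for further analysis.
# Connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://researcher@dbhost/sky_survey")

daily = pd.read_sql(
    """
    SELECT obs_date, band, avg(flux) AS mean_flux, count(*) AS n_obs
    FROM detections
    WHERE quality_flag = 0
    GROUP BY obs_date, band
    """,
    engine,
)

# From here on, statistical work happens on a small in-memory frame
# rather than on the billions of rows that stayed in the database.
print(daily.describe())
```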
With a couple of exceptions, the statistics/predictive analytics situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.
Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, Mathworks/Matlab. (Who knew that Mathworks was a $600 million company, local to my geographical area?)
One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have some nice differentiation in not requiring you to know in advance which of your variables are most important. Econometricians and others who work with large numbers of independent variables might wish to take note.
If you think the true situation is nonlinear, and you’re trying to approximate it with linear models, you almost always have a large number of candidate variables to consider, because every monomial in the original independent variables becomes a potential regressor. True, those monomials aren’t actually independent of one another, but it might be interesting to pretend that they are and see whether any insights fall out that could help in more rigorous analysis.
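To make the monomial idea concrete, here is a toy sketch in Python. The data is synthetic, and scikit-learn merely stands in for whatever regression tooling one actually uses; nothing here reflects KXEN’s own methods.

```python
# Illustrative only: approximate an unknown nonlinear relationship by
# regressing on monomials (x1, x2, x1*x2, x1^2, ...) of the original
# independent variables, then inspecting which terms carry weight.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # three "independent" variables
y = 2*X[:, 0] - X[:, 1]*X[:, 2] + 0.5*X[:, 0]**2 \
    + rng.normal(scale=0.1, size=500)            # hidden nonlinear truth

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                   # 9 monomial regressors from 3 variables

model = LinearRegression().fit(X_poly, y)
for name, coef in zip(poly.get_feature_names_out(), model.coef_):
    print(f"{name:10s} {coef:+.3f}")             # larger coefficients hint at which terms matter
```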
I’d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of integrated analytic platforms. Admittedly, the early leaders in that area — Aster Data, perhaps followed by Netezza (now an IBM company) — aren’t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they’re more likely to offer appealing academic pricing.
There’s also the investigative analytics side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible — no pun intended — QlikTech and Tableau don’t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don’t seem to be doing much on the academic front either, more’s the pity.
Frankly, I think that most scientific analytic technology needs are also found in the business world.* That convergence will only get closer as businesses focus more on machine-generated data. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.
*The converse isn’t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around “one version of the truth”.
Edit: Some links that seem relevant to this year’s XLDB program
- Zynga and LinkedIn
- Objectivity Infinite Graph
- eBay as of last year’s XLDB (the most expensive blog post I ever wrote, in light of Greenplum’s subsequent response)
Comments
Hi Curt, there are some notable exceptions to the rule that commercial DBMS do not support scientific projects. Microsoft seems to have been very generous in providing DBMS technology to universities. For example, the Pan-STARRS PS1 project (http://pan-starrs.ifa.hawaii.edu/public/home.html) uses MS SQL Server, unless they have changed recently. Pan-STARRS incidentally gives new meaning to the phrase “spatial query.”
Curt,
You touched on many important points, and you did it very well!
One comment I’d make is that many scientific projects do not fall under the umbrella of academic use from the perspective of commercial software licenses. I suspect you did mean both academic and scientific use here.
I will also point out that sometimes it is not just the cost of commercial *software* that is the barrier. Commercial software often comes in an appliance, and that is problematic for multi-decade experiments, which are (pretty much always) required to be able to reproduce (all published) results.
The good news is that many larger scientific projects are willing to try… Pan-STARRS was just mentioned, SDSS is a good example, GAIA chose InterSystems Caché… I think it is a battle we can win!
Jacek,
I’m not sure I understand the appliance problem. A SQL query will return the same results from DBMS to DBMS, wrapped in hardware or otherwise, unless you use vendor-specific extensions. What’s more, relatively few of those extensions involve approximation, let alone non-reproducible approximation. Yes, there are some time series interpolations, but they’re deterministic. Yes, there are some fast approximate medians/deciles/whatever, but you can also do slower precise ones.
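A toy illustration of that exact-versus-approximate distinction, outside of any particular DBMS and purely for the reproducibility point:

```python
# Toy contrast between a deterministic exact median and a sampling-based
# approximation; the latter changes from run to run unless you pin the
# sampling seed, which is exactly the reproducibility concern raised above.
import random
from statistics import median

data_rng = random.Random(42)                   # fixed synthetic data set
data = [data_rng.gauss(0, 1) for _ in range(1_000_000)]

exact = median(data)                           # same answer on every run
approx = median(random.sample(data, 10_000))   # fast, but depends on the sample drawn

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```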
Price perhaps aside, I’m not understanding the reason not to use Vertica or Aster nCluster or Infobright or whatever, if they seem well-suited to the job.
Curt,
The appliance problem is related to being locked into specialized (and yes, typically expensive) hardware.
- In scientific environments with multi-lab projects and multi-project labs, hardware is often shared between projects or repurposed, and specialized hardware makes that much harder.
- Reproducing results generated 10 years ago is often an issue: it is far easier to virtualize the environment if you don’t have to deal with software that can only run on specialized but no-longer-supported hardware.
- A lot of data is correlated, and crossing the boundaries between different appliance boxes can be non-trivial in some cases.
- Debugging is another issue.
Jacek,
Got it. Anyhow, there are lots of commercial analytic DBMS products that aren’t tied to hardware. Indeed, pretty much the only ones that are tied are Netezza (strictly), Teradata (for all practical purposes, unless you have a pretty small database), and Oracle (ditto, if you’re viewing Oracle as an analytic rather than OLTP option).
FYI, Greenplum’s been giving away their engine to research for a long while now. Since it’s almost completely Postgres-compliant in its client tools and UDF APIs etc, it’s easy for academics to ramp up to it.
Chris Re’s Hazy project at Wisconsin is starting to use it, I believe — they do in-database statistical machine learning. Very impressive work, BTW: http://research.cs.wisc.edu/hazy/. Also, our former student Daisy Wang, now on the faculty at Florida, has used it to do in-database statistical ML for text analysis and entity extraction: http://www.cise.ufl.edu/~daisyw/
We’re planning to harvest research efforts like these via open source in MADlib: http://madlib.net. Currently only Postgres and Greenplum ports are supported, but I’m eager to get community energy around both new algorithmics and ports to other DBMSs. There are at least a couple of folks on the MADlib mailing lists talking about a Vertica port.
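For readers unfamiliar with the in-database approach, here is a rough, hand-rolled sketch of the underlying data-movement argument: one SQL aggregation pass reduces a raw table to the sufficient statistics for ordinary least squares, and only a tiny linear system leaves the database. MADlib itself performs the whole computation inside the DBMS via UDFs, so this is only an approximation of the idea; the table and column names are hypothetical.

```python
# Rough sketch of the "move the computation to the data" idea behind
# in-database ML: one aggregation pass inside the DBMS reduces N rows to a
# handful of sums, and the client only solves a tiny normal-equations system.
# Table/column names (measurements, x1, x2, y) are hypothetical.
import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=experiment user=researcher")
cur = conn.cursor()

# Sufficient statistics for OLS with regressors (1, x1, x2):
cur.execute("""
    SELECT count(*),
           sum(x1),    sum(x2),
           sum(x1*x1), sum(x1*x2), sum(x2*x2),
           sum(y),     sum(x1*y),  sum(x2*y)
    FROM measurements
""")
n, s1, s2, s11, s12, s22, sy, s1y, s2y = cur.fetchone()
cur.close()
conn.close()

# Solve (X'X) beta = X'y on the client; the matrices are 3x3 regardless of N.
xtx = np.array([[n,  s1,  s2],
                [s1, s11, s12],
                [s2, s12, s22]], dtype=float)
xty = np.array([sy, s1y, s2y], dtype=float)

beta = np.linalg.solve(xtx, xty)   # intercept and two slopes
print(beta)
```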
Joe,
Greenplum’s single-node edition (http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/) disappointed me when it turned out to be a dud. Glad to hear the real thing is being given away for free!