Notes on SciDB and scientific data management
I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.
The main new activity I know of has been in the open source SciDB project.
- A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn’t tell me who it was from. Zetics does not have its own web site.
- Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some LSST folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.
- Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it’s also a small slip from prior project plans.
- The array data model is an example of what’s being implemented first. (Duh — you can’t have a DBMS without a data model.) Support for uncertainty is an example of what’s been deferred until later. (There’s a small sketch of what an array data model amounts to after this list.)
- As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.
- It’s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there’s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.
- Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.
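Since “array data model” may be unfamiliar, here’s a tiny, purely illustrative Python sketch of the difference between addressing data relationally, row by keyed row, and addressing it by dimension coordinates in a dense array. This is not SciDB’s actual interface, just an illustration of the concept.

```python
import numpy as np

# Relational framing: each observation is a row keyed by attribute values.
rows = [
    {"lat_idx": 0, "lon_idx": 0, "temp": 14.2},
    {"lat_idx": 0, "lon_idx": 1, "temp": 14.6},
    {"lat_idx": 1, "lon_idx": 0, "temp": 13.9},
]

# Array framing: the grid itself is the data structure; cells are addressed
# by dimension coordinates, so slices, windows, and neighborhood operations
# come naturally.
grid = np.full((2, 2), np.nan)
for r in rows:
    grid[r["lat_idx"], r["lon_idx"]] = r["temp"]

print(grid[0, :])        # one slice along a dimension
print(np.nanmean(grid))  # aggregate over the whole array
```

The appeal of the array framing is that dimensional addressing makes operations like slicing, windowing, and regridding first-class, instead of something you reconstruct from keys and joins.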
In other scientific data management news,
- Microsoft put out a book called The Fourth Paradigm on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it’s worth skimming. I don’t think it’s worth actually reading. (I did read it.)
- XLDB4 will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.
Finally, you are surely aware of the whole “Climategate” mess, in which major climate researchers’ email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was a long series of Read Me files, in which an unfortunate programmer lamented the difficulty of reconstructing published results from the files at hand. These turned out to illustrate a classic problem that SciDB and its alternatives are meant to solve:
- Raw data was impossible to use without various adjustments to regularize it (the word “regridding” comes up a lot, for example); that massaging had to happen before any analytics could be done. (A sketch of what a regridding step looks like follows this list.)
- The raw data was thrown out or lost, and could not be reconstructed (why they couldn’t have asked the suppliers of the data to give it to them again was unclear in this case, since it wasn’t original experimental data).
- It was thus impossible to massage the data in any new or improved way.
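To make the failure mode concrete, here’s a small, purely illustrative Python sketch of a regridding step; the function and numbers are invented, not taken from the Read Me files. The step is lossy, which is exactly why throwing away the raw inputs forecloses any improved re-analysis later.

```python
import numpy as np

def regrid(readings, lat_edges, lon_edges):
    """Average irregular (lat, lon, value) readings onto a regular grid.

    This is the kind of "massaging" the Read Me files describe: it is
    lossy, so if only the regridded output survives and the raw readings
    are discarded, the step can never be redone differently.
    """
    grid_sum = np.zeros((len(lat_edges) - 1, len(lon_edges) - 1))
    grid_cnt = np.zeros_like(grid_sum)
    for lat, lon, value in readings:
        i = np.searchsorted(lat_edges, lat, side="right") - 1
        j = np.searchsorted(lon_edges, lon, side="right") - 1
        if 0 <= i < grid_sum.shape[0] and 0 <= j < grid_sum.shape[1]:
            grid_sum[i, j] += value
            grid_cnt[i, j] += 1
    with np.errstate(invalid="ignore"):
        return grid_sum / grid_cnt   # NaN where a cell has no readings

raw = [(10.2, 20.1, 14.0), (10.7, 20.8, 15.0), (11.4, 21.2, 13.5)]
print(regrid(raw, lat_edges=[10, 11, 12], lon_edges=[20, 21, 22]))
```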
Comments
Why has interest from “web analytics users” receded recently? Could this be due to the increased interest in Hadoop/Cassandra and similar products?
Michael,
SciDB is for analytics; Cassandra is for OLTP, hold the “T”, which I called HVSP in http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/.
Hadoop is a closer competitor, as are RDBMS, MapReduce-enabled or otherwise.
What is driving the move to Hadoop and other non-relational platforms is the cost and culture of RDBMS implementations.
The culture problem comes from data management systems forcing data to be transformed into a private, internal form, and from all the process that fronts it. Dimensional modeling is an example: we keep physicalizing dimensional designs simply because that’s what RDBMS products support, and we should stop.
On the cost front, the cost of generating data declines at roughly the inverse of Moore’s law, not counting non-native per-transaction data growth (I’m collecting more and more data about every event).
On the analytics side of this problem, getting a single metric takes many more scans of the full dataset, so that cost grows non-linearly with data size.
So: data costs are declining at the same rate as hardware costs, while data analytics costs are RISING per unit of data. Put quite simply, at the upper end of the data-size spectrum, data owners cannot afford to buy data management software.
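To put rough numbers on that (all of them invented, purely to show the shape of the argument):

```python
# Toy model of the argument above; every number is invented.
# Assumptions: data volume doubles each hardware generation, hardware cost
# per unit of work halves (Moore's-law-ish), and the work needed to compute
# one metric is super-linear in data size (repeated full scans, joins, etc.),
# modeled here as N ** 1.5.

N = 1e9              # units of data owned
cost_per_op = 1e-9   # dollars per unit of scan work

for gen in range(5):
    storage_cost = N * cost_per_op               # doubling * halving = flat
    one_metric_cost = (N ** 1.5) * cost_per_op   # grows ~1.4x per generation
    print(f"gen {gen}: storage ~${storage_cost:,.2f}, "
          f"one metric ~${one_metric_cost:,.2f}")
    N *= 2
    cost_per_op /= 2
```

Under these assumptions the storage bill stays flat from generation to generation, while the bill for a single scan-heavy metric grows roughly 40% per generation; at the upper end of the size spectrum it is the analytics, not the storage, that becomes unaffordable.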
I need a database with advanced statistical functions, or a statistics program that works transparently on a very large (ideally distributed) database.
What software do you suggest?
Spark (with some proper underlying file system) could be the solution in the future, but it only lets you do basic things; you can’t fit mixed-effects models or Bayesian models. SciDB has the same problem: you can only use the functions implemented in it, and there are few of them. You can also design your own algorithms, but that’s quite difficult.
R and similar programs let you import data from a database, but you can’t perform large operations properly; you can only get summaries or do it by chunks.
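To be concrete, “doing it by chunks” means something like the following toy Python sketch (not code from any of these systems); it works only because mean and variance decompose into per-chunk sums, which models like mixed-effects or Bayesian fits generally do not.

```python
import numpy as np

def chunked_mean_var(chunks):
    """Combine per-chunk sums into a global mean and (population) variance.

    Illustrates the "do it by chunks" workaround: only statistics that
    decompose into per-chunk pieces combine this easily.
    """
    n = total = total_sq = 0.0
    for chunk in chunks:
        x = np.asarray(chunk, dtype=float)
        n += x.size
        total += x.sum()
        total_sq += (x ** 2).sum()
    mean = total / n
    var = total_sq / n - mean ** 2
    return mean, var

# e.g. chunks pulled one cursor fetch at a time from a database
chunks = (np.random.randn(10_000) for _ in range(100))
print(chunked_mean_var(chunks))
```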
Hi Juan,
As per http://www.dbms2.com/2016/08/28/are-analytic-rdbms-and-data-warehouse-appliances-obsolete/, I’m not sure that I have a good answer for you.