Jacek Becla on issues in scientific data management
Just as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited his email too, and am posting it below, with some interspersed comments of my own.
Curt,
It is a very nicely written article!
I’ll quickly comment on two issues:
1) Open source
In addition to the reasons you described, science loves open source DBMSes because they are popular, easy to use, and easy to maintain. Larger projects tend to be highly distributed, with tens or even hundreds of collaborating sites. Some of these sites are as small as a professor (often working part-time on the project) plus a student or two. They will gladly deal with MySQL or Postgres, but not with any fancier DBMS (even if it is covered by a project-wide license). Similarly, data centers, which typically support many different experiments simultaneously, don’t want to deal with a zoo of DBMSes. Sure, we could convert formats (e.g., export to MySQL) for the smaller sites, but as we all know, this is often non-trivial due to vendor-specific optimizations, different flavors of SQL, stored procedures, etc.
Also, scientists often like to recompile their entire software stack on exotic platforms, or to take advantage of various optimizations, the latest compilers, etc. Past experience on some projects (e.g., BaBar, which at one point had 5+ million lines of complex C++ code tightly coupled with Objectivity/DB) was pretty unpleasant in that area.
Jacek isn’t the only person to mention the BaBar/Objectivity example as a bad experience leading to a “Never be trapped again by a vendor” desire for open source. But I think tight coupling with an object-oriented DBMS isn’t really the same thing as putting code on top of an engine that executes SQL or some other concise data manipulation language (DML).
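To illustrate that distinction in code, here’s a toy sketch of mine; the “object DBMS” API below is invented for illustration and is not Objectivity’s actual interface. The point is that the first style scatters vendor-specific types and call sites across the whole code base, while the second confines the vendor dependency to a narrow DML boundary:

    import sqlite3

    # Hypothetical sketch of the coupling difference; the "object DBMS"
    # base class below is invented, not Objectivity's real interface.

    # --- Object-DBMS style: vendor types woven through application classes ---
    class PersistentObject:                 # stands in for a vendor base class
        _store = []
        def save(self):
            PersistentObject._store.append(self)   # vendor call sites everywhere

    class Event(PersistentObject):          # every domain class inherits from it
        def __init__(self, run_id, energy):
            self.run_id, self.energy = run_id, energy

    # --- SQL style: the engine sits behind one narrow textual (DML) boundary ---
    def store_event(conn, run_id, energy):
        conn.execute("INSERT INTO events (run_id, energy) VALUES (?, ?)",
                     (run_id, energy))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (run_id INTEGER, energy REAL)")
    Event(1, 42.0).save()                   # persistence entangled with the class
    store_event(conn, 1, 42.0)              # persistence behind a thin DML call

Migrating away from the first style means touching every persistent class; migrating away from the second mostly means changing connection setup and SQL dialect.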
2) HadoopDB
Scientific data tends to be correlated and to exhibit adjacency properties; e.g., one of the most commonly executed queries in astronomy is a near-neighbor search. To the best of my knowledge, MapReduce works best for uncorrelated data sets. While it is possible that HadoopDB could build some powerful spatial indices that would simplify this, I doubt it will be high enough on their to-do list. Another option would be to “de-correlate” the data by building overlapping partitions, but I believe that would require non-trivial modifications to the HadoopDB internals.
I don’t immediately see why solving this problem in HadoopDB would be harder than in an MPP DBMS. But it does speak against some of the advantages I was proposing for the HadoopDB alternative.
Also, based on recent discussions with the HadoopDB team, they are planning to continue building HadoopDB as a research project, which means it is unclear (at least to me) how soon the product will be production-ready, and how well it will be maintained and supported in the long term.
Right. But it’s a little beside the point to complain that one particular open source MPP framework doesn’t have project momentum behind it when alternative open source MPP frameworks have even less going on.
That said, SciDB does have some ongoing effort, probably more than HadoopDB currently does.
Comments
I don’t understand Becla’s point 2. Hadoop relies on locality of reference to parallelize jobs, i.e. each node processes a highly correlated set of data, which would seem to be exactly the data a “near neighbor” search provides.
Thomas:
The issue arises for data near the edges: objects near the edge of a partition can have neighbors in an adjacent partition. In practice, that means data from any given partition must be correlated with data from all its adjacent partitions (assuming the search distance is smaller than the partition size; otherwise things get more complicated). That requires a distributed join – something Hadoop is not good at.
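For the curious, here is a minimal 1-D sketch of the overlapping-partition workaround Becla mentioned above; the partition width, function name, and layout are my own illustrative assumptions, not HadoopDB internals:

    # Each partition also stores a "halo" of objects within the search
    # radius of its edges, so every object's neighbors are locally
    # available and no distributed join is needed.

    PARTITION_WIDTH = 10.0

    def assign_partitions(x, radius):
        """Return every partition id that should store an object at x.

        The object goes to its home partition, plus to a neighbor
        partition when it sits within `radius` of a boundary (the halo).
        """
        home = int(x // PARTITION_WIDTH)
        parts = {home}
        if x - home * PARTITION_WIDTH < radius:          # near left edge
            parts.add(home - 1)
        if (home + 1) * PARTITION_WIDTH - x < radius:    # near right edge
            parts.add(home + 1)
        return parts

    # An object at x=9.95 with a 0.1 search radius is stored twice, so
    # both partitions 0 and 1 can answer near-neighbor queries without
    # talking to each other:
    print(assign_partitions(9.95, radius=0.1))  # {0, 1}

The price, of course, is duplicated storage along every partition boundary, and the bookkeeping gets harder in two or three dimensions – which is presumably why Becla expects non-trivial modifications to the internals.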