Martin Kersten on issues in scientific data management
Martin Kersten emailed a response to my post on issues in scientific data management. With his permission, I’ve lightly edited it, and am posting it below.
Dear Curt,
Thanks for the very nice story and perception on the XLDB meeting. It is a balanced view.
More philosophically I would add a few points:
1) A data management system architecture is a large collection of compromises amongst a number of competing parameters.
data management (hardware, data structure, algorithms, optimizers, languages) –> value-for-application
Given the cost of developing and maintaining a dbms, we see only a few parameter constellations in the current product offerings. And scientists have a hard time exploring the uncharted territory, because of the effort required and the uncertain benefits. (The same holds for researchers in the R&D labs of vendors.)
2) The research community needs a focus to move ahead. The array-dbms is such a focus, because it identifies an omission in the type structure being managed at all levels of a system. Articulation of this in the community will help to steer effort.
3) The recent ‘hype’ for going to a HadoopDB-like approach should be positioned carefully. It is so far a single-point experiment for a limited query domain space, carefully carved out to avoid all the issues that plague a distributed dbms. Within this space the techniques come from a different operating system functionality. [Not sure what he means by this.] It does not change the DBMS itself, and as such it is a repetition of middleware solutions for handling a cluster of independent MySQL instances.
This paper might be worth a look: http://ic2.epfl.ch/labos/publications/freenix2004.pdf
Generalizing it to a complete solution calls, e.g., for massive replication, so that you do not have to ship data around during query execution. This is paid for with more expensive updates, which is feasible only in certain domains. [I’d frame that point just as saying Hadoop-based solutions are unlikely to do as well at reducing data shipping as the better MPP DBMS.]
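As a rough illustration of that last trade-off, here is a toy cost model in Python. It is only a sketch under assumptions that are not in Martin's email or the referenced paper: the table is split evenly across the nodes, each partition is stored on `replicas` of them, a query on one node needs to see the whole table, and an update must be applied to every copy. The function names are made up for the example.

```python
def shipped_bytes_per_query(table_bytes, nodes, replicas):
    """Bytes one node must fetch remotely to see the whole table.

    Toy assumption: partitions are spread evenly, so each node holds
    a replicas/nodes fraction of the table locally.
    """
    if not 1 <= replicas <= nodes:
        raise ValueError("replicas must be between 1 and nodes")
    local_fraction = replicas / nodes
    return table_bytes * (1 - local_fraction)


def writes_per_update(replicas):
    """Number of copies that must be written for a single update."""
    return replicas


if __name__ == "__main__":
    table_bytes = 1_000_000_000  # hypothetical 1 GB table
    nodes = 8                    # hypothetical cluster size
    for replicas in (1, 2, 4, 8):
        shipped = shipped_bytes_per_query(table_bytes, nodes, replicas)
        writes = writes_per_update(replicas)
        print(f"replicas={replicas}: ship {shipped:.0f} bytes/query, "
              f"{writes} writes/update")
```

In this model, full replication (`replicas == nodes`) drives data shipping to zero, but every update then touches every node, which is why the approach suits read-mostly workloads better than update-heavy ones.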
Comments
3 Responses to “Martin Kersten on issues in scientific data management”
I think, unless I’m misinterpreting, that Martin is making the same point I was in my last blog on MR, namely that it addresses very specific domains (namely search, aka Google/Yahoo) primarily and that, being “outside” the engine, it is more of an OS component in its current form (file management/connectors in and out of the engine, Vertica-style) rather than an integral piece of the DBMS engine. I haven’t read the referenced paper (yet) but this one here is fairly eye-opening IMHO: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf
Thanks for reposting this!
J.
[…] as Martin Kersten did, Jacek Becla emailed a response to my post on issues in scientific data management. With his […]