October 3, 2009
Issues in scientific data management
In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up” — a critical requirement for scientists
However:
- I think that’s a dream/wish list. A lot of good could be done without meeting each of those six requirements in full.
- I think at least some of the XLDB/SciDB leaders realize this.
- In my opinion, a highly useful subset of the dream/wish list is achievable in the reasonably intermediate-term future, in either of two ways:
  - Through a Hadoop-centric open source effort, especially since HadoopDB opens up the possibility of letting DBMS creators offload MPP scaling challenges to somebody else.
  - From commercial MPP software-only (as opposed to appliance) DBMS vendors. I think they can develop the needed technology. I also think it could be in their business interest to make licensing arrangements of the sort that the scientific and research communities would need.
- Talking about “scientific” big data is unhelpfully vague. Let’s just focus on multi-dimensional measurement- or model-centric data, from disciplines such as seismology (under the Earth’s surface), climatology (over the surface), and astronomy (outer space). That would also include disciplines whose three-spatial-dimensions-plus-time data comes from inside a laboratory or other man-made environment, such as high-energy physics, fluid dynamics, and so on.
- One place in all that where there should be a commercial-company market is in oil/gas extraction. And by the way, the energy industry is increasing its uptake of data warehousing technology faster these days than any other sector I can think of, except perhaps for …
- … web companies that do log file analysis. Facebook’s log data has arrays-within-arrays reminiscent of the scientists’. eBay has been a major backer of XLDB/SciDB. It’s far from fully known yet just how much overlap there is between log-file-analyzers’ data management needs and those of big-data scientists. But there clearly are at least some commonalities.
- I don’t get the impression that scientists focused on modeling — e.g. climate-predictors — have been big participants in XLDB. That’s a pity for at least two reasons. First, modeling is at the heart of some of the most important global issues scientists address (e.g., climate change). Second, it might be an area of particularly rich overlap with commercial data management needs.
Now let’s step back and consider approximately what is meant by the requirements listed above.
- The requirements for an array structure are evidently pretty deep. You can glean some of the reasons from the scientific database use cases posted on the SciDB website. In particular:
  - Coordinate data naturally fits into arrays.
  - Coordinate data also naturally fits into geospatial ranges and the like.
  - The “grid” for the array can be imprecise — or calculated via transformation — for a whole lot of different reasons.
  - Different measurements may be available for different points in the array. (I think this may be the essence of the arrays-within-arrays requirement.)
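As a toy illustration of those points, here is a minimal Python sketch — all names hypothetical, and nothing like SciDB’s actual implementation — of an array whose grid maps to real-world coordinates via a transformation, and whose cells can each carry a different set of measurements:

```python
# Hypothetical sketch of the array data model discussed above: cells are
# addressed by integer grid coordinates, the grid maps to real-world
# coordinates through a (possibly imprecise) transformation, and different
# cells may hold different measurements.

class SciArray:
    def __init__(self, origin, spacing):
        self.origin = origin      # real-world coordinate of cell (0, 0)
        self.spacing = spacing    # grid spacing per dimension
        self.cells = {}           # sparse: (i, j) -> dict of measurements

    def world_coords(self, i, j):
        """Transform grid indices to (approximate) real-world coordinates."""
        return (self.origin[0] + i * self.spacing[0],
                self.origin[1] + j * self.spacing[1])

    def put(self, i, j, **measurements):
        # Cells need not share a schema (the arrays-within-arrays flavor).
        self.cells.setdefault((i, j), {}).update(measurements)

    def range_query(self, i_lo, i_hi, j_lo, j_hi):
        """Select cells inside a grid-coordinate range."""
        return {c: m for c, m in self.cells.items()
                if i_lo <= c[0] <= i_hi and j_lo <= c[1] <= j_hi}

a = SciArray(origin=(10.0, 20.0), spacing=(0.5, 0.5))
a.put(0, 0, temperature=14.2)
a.put(1, 3, temperature=13.9, salinity=35.1)   # extra measurement here
print(a.world_coords(1, 3))        # (10.5, 21.5)
print(a.range_query(0, 1, 0, 3))
```

The point of the sketch is only that ranges and coordinate transformations fall out of the array structure almost for free, in a way they don’t from sets of tuples.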
- Some reasons scientists want versioning and support for data provenance are pretty obvious — you never want to lose the record of what the instrument readings said, or ever were believed to say. But it goes further. Data is “cooked” — i.e., transformed/reduced — and stored in huge volumes. So you’d like to later on be able to go back to the raw data and re-cook it.
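The raw-versus-cooked point can be sketched in a few lines. This is a hypothetical toy, not SciDB’s actual versioning scheme: every write appends a new version with its lineage recorded, so the raw readings survive untouched and can be re-cooked at any time.

```python
# Toy "no update in place" store: versions are append-only, and each
# derived ("cooked") version records the transformation and parent it
# came from, so raw instrument readings are never lost.

class VersionedStore:
    def __init__(self):
        self.versions = []   # append-only list of (data, lineage)

    def ingest(self, raw):
        self.versions.append((raw, {"parent": None, "transform": "raw"}))
        return len(self.versions) - 1

    def cook(self, parent_id, transform, name):
        data, _ = self.versions[parent_id]
        self.versions.append((transform(data),
                              {"parent": parent_id, "transform": name}))
        return len(self.versions) - 1

    def lineage(self, vid):
        """Walk the provenance chain back to the raw data."""
        chain = []
        while vid is not None:
            chain.append(self.versions[vid][1]["transform"])
            vid = self.versions[vid][1]["parent"]
        return list(reversed(chain))

store = VersionedStore()
raw = store.ingest([3, 1, 4, 1, 5])
avg = store.cook(raw, lambda d: sum(d) / len(d), "mean-reduce")
print(store.lineage(avg))        # ['raw', 'mean-reduce']
print(store.versions[raw][0])    # [3, 1, 4, 1, 5]  -- still available to re-cook
```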
- The workflow requirement seems in many cases to stem from data movement needs, which in turn sometimes stem from political issues. I haven’t yet understood why workflow would actually need to be baked into a scientific DBMS.
- By the time the database management systems we’re talking about could conceivably be ready, the need will be at least in the 10s of petabytes. 100s of petabytes is a reasonable design goal.
- Not that I’ve run any numbers on the matter, but it seems plausible that query fault-tolerance will be needed, at least in some cases.
- In many sciences (astronomy seems to be an exception), the default choice is to keep data in files rather than a DBMS. For example, CERN has a 10 terabyte or so Oracle database holding just the metadata for a vastly larger collection of data files. Even if the pendulum swings toward greater use of DBMS, the ongoing need for external file access is pretty obvious.
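A toy sketch of that CERN-style split, with invented names throughout: the metadata sits in a small catalog that prunes the search, while the bulk data stays in external files that are scanned in place only at query time, never loaded into the database.

```python
# Hypothetical "external data objects" sketch: a small metadata catalog
# over large external files. Queries prune by metadata first, then scan
# only the surviving files in place.

import csv
import os
import tempfile

class FileCatalog:
    def __init__(self):
        self.meta = {}   # file path -> metadata dict (the small, queryable part)

    def register(self, path, **metadata):
        self.meta[path] = metadata

    def query(self, meta_predicate, row_filter):
        """Prune files by metadata, then scan survivors without loading them."""
        for path, md in self.meta.items():
            if meta_predicate(md):
                with open(path) as f:
                    for row in csv.reader(f):
                        if row_filter(row):
                            yield path, row

# Usage: one external data file that never enters any database.
d = tempfile.mkdtemp()
p1 = os.path.join(d, "run1.csv")
with open(p1, "w") as f:
    f.write("42.1,7\n39.8,9\n")

cat = FileCatalog()
cat.register(p1, instrument="seismometer", year=2009)
hits = list(cat.query(lambda md: md["year"] == 2009,
                      lambda row: float(row[0]) > 40.0))
```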
- I suspect that the insistence on open source is part legitimate, part knee-jerk excessive.
- “Free” is the best possible price, of course.
- Beyond cash cost, scientists want data access to be free of licensing encumbrance. There are two main reasons. First, people might want to manage subsets or copies of data remotely from its central repository, for a variety of reasons. Not all of those needs would be easy to design around, so any closed-source licensing would have to be very comprehensive (e.g., global or at least continent-wide “site” licensing).
- Second, they want assurance that data will always be accessible, even if licenses expire. That seems a little overwrought. Yes, moving data from one multi-petabyte repository to another could be a bit slow. But it’s not an eventuality to panic about.
- As for actual community development — scientists sure have a variety of exotic data management needs. But I’m not sure how much talent or resource there is among scientists to do true DBMS development (as opposed to, say, refining some UDFs). Yes, one XLDB attendee was both an astronomer and a PostgreSQL Major Contributor, but he seemed like an exception. On the other hand, it’s not entirely implausible that, in the right framework, some people with database talent could be recruited to donate some time to the general advancement of science.
- I don’t know much about management of uncertain data, and will duck that subject for now.