Native XML storage, Part 1 (technology)
IBM’s “Viper” version of DB2 is in open beta test, whatever that means, and Microsoft’s SQL Server 2005, nee Yukon, is in general release. Both have native XML capabilities surpassing Oracle’s – which is interesting in its own right, because it’s rare for either of those vendors to pull ahead of Oracle in an OLTP feature, and almost unprecedented for both to do so at once.
So let’s talk about native XML support, what it is, and who might or should care about it. (Well, the apps part is actually in a separate Part 2 post.) Most of this is based on research that’s several months old, but except for a scarcity of actual user interviews, that shouldn’t matter much.
There are two main non-native ways to put XML into a SQL database such as Oracle – shredding and LOBs (BLOBs or CLOBs – i.e., Binary or Character Large OBjects). Both can perform poorly, for different reasons. Shredding takes XML documents and distributes them among a bunch of tables. So one update in XML can become many updates when shredded, and one lookup in XML can become a complex join from shredded storage. LOB storage obviates those problems, but creates another – even when you’re only looking for part of a document, you have to retrieve and handle the whole thing, and the same goes for updates.
So native storage can be a good thing when you can afford neither the performance hit of shredding, nor of LOB storage, nor of any available hybrid. It also could be good if getting good performance from non-native storage, while possible, would create undue burdens on application development, or if there’s some other reason one or both of the shredding and LOB approaches isn’t viable.
One nice feature is that native-XML storage has almost no downside, at least if you get it from the high-end DBMS vendors. IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML. Both Microsoft and IBM claim that their administrative tools are slick enough that the DBA has little extra work from their offerings than would be present in a true single-server solution.
So how does the storage actually work? The basic idea is exactly what you’d think. Data is stored in name-value pairs, with pointers connecting parents to children. The secret sauce (and here I have less detail than I’d like) is the extra information that’s stored, either at the nodes directly, or in an overarching index. Obviously, there’s a tradeoff between update and retrieval speed. And equally obviously, I need to learn more of the particulars.
And on that somewhat lame note, let me point you at Part 2 of this post, which discusses whether and how this stuff will actually be used. (Preview: It will, big time – I think.)
Comments
9 Responses to “Native XML storage, Part 1 (technology)”
Leave a Reply
[…] The introduction and technical-implementation part of this discussion was in Part 1. […]
[…] Oracle lags behind both IBM and Microsoft on native XML. I can’t recall Oracle trailing both those vendors at once on anything as major before. […]
Hi Curt
You can find some more information including links to some published
“academic” papers on my weblog at http://sqljunkies.com/weblog/mrys.
Best regards
Michael
That would be information on Microsoft’s XML capabilities, Michael. Right?
Actually, the biggest area of information where I feel I’m lacking is not technical, but deployment. Who has used this stuff, for what?
Thanks,
CAM
Correct, it mostly is on SQL Server’s XML capabilities, although I point
to some other articles that describe Oracle or DB2 as well.
Regarding deployment, I see quite a bit of customers using it (several of
the internal and external early deployments used XML for one or the other
reason). I need to find some time to write something up.
For now, MySpace.com is looking into using it, several companies that need to
process vertical industry schemas (SportML etc) are using it etc..
Best regards
Michael
[…] Microsoft took slight exception to my claim that they lack fully general DBMS extensibility. The claim is actually correct, but perhaps it could lead to confusion. And anyhow there’s a distinction here worth drawing, namely: […]
[…] my surprise. When last I looked, the situation wasn’t much different from what it was back in November, 2005. Unless I’ve missed something (and please tell me if I have!), here’s what’s […]
[…] be forced into a multi-server strategy for MOLAP. And if it does that, maybe it will finally offer native XML and well-performing text search as […]
[…] been implying that the short list for native XML database engine vendors should be Mark Logic, IBM, and maybe Microsoft, on the theory that Progress […]