Overview of IBM DB2 pureXML
On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O’Mahony and Qi Jin). I’m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy …)
As I write it, I see there are a considerable number of holes, but that’s the way it seems to go when researching XML storage. I’m also writing up a September call from which I finally figured out (I think) the essence of how MarkLogic Server works – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML. Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.
What I understand so far about the basic DB2 pureXML architecture goes like this:
- DB2 pureXML stores XML in “true hierarchical format.” Based on all the discussion of indexing, I’m guessing that the way it does this is somewhat similar to that in MarkLogic.
- Unlike MarkLogic, DB2 pureXML gives you the choice of what tags to index on.
- In a big difference from MarkLogic, text search on DB2 pureXML involves two separate indexes: an XML index and a text index (the latter being of the usual inverted-list variety). You can text-search both contents and tags, with the usual CONTAINS semantics.
- DB2 pureXML has a data store separate from the rest of DB2’s, notwithstanding IBM’s references to XML “columns.” DB2’s general datatype extensibility framework is not used; I don’t wholly understand why.
- I neglected to ask how well DB2 backup, management tools, and so on extend to DB2 pureXML.
- You can talk to DB2 pureXML via two data manipulation languages: SQL/XML and XQuery. Both are compiled down to the same run-time instructions. IBM said there’s an abstraction layer sitting over both the relational store and the XML store that allows for this. I don’t totally understand what that means, since presumably the SQL/XML starts out by being sent to DB2’s parser.
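To make the two-index idea from the list above a bit more concrete, here is a toy Python sketch. It is emphatically not IBM's implementation; the sample documents, the `INDEXED_PATHS` set, and the helper names `walk` and `build_indexes` are all made up for illustration. It only shows the general shape of what was described: a structural index built on the paths you choose, plus a separate inverted-list index over both tag names and contents.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents, keyed by a document id.
DOCS = {
    1: "<customer><name>Ada</name><city>Toronto</city>"
       "<note>prefers email contact</note></customer>",
    2: "<customer><name>Bob</name><city>Dublin</city>"
       "<note>call before noon</note></customer>",
}

# Assumption: the administrator chooses which paths get a structural index.
INDEXED_PATHS = {"customer/city"}


def walk(elem, prefix=""):
    """Yield (path, element) for elem and every descendant, depth first."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    yield path, elem
    for child in elem:
        yield from walk(child, path)


def build_indexes(docs, indexed_paths):
    """Build a selective path/value index and a separate inverted text index."""
    path_index = defaultdict(set)  # (path, value) -> set of doc ids
    text_index = defaultdict(set)  # lowercase word -> set of doc ids
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for path, elem in walk(root):
            value = (elem.text or "").strip()
            # Structural index: only for the paths that were chosen.
            if path in indexed_paths and value:
                path_index[(path, value)].add(doc_id)
            # Text index: covers tag names as well as contents.
            text_index[elem.tag.lower()].add(doc_id)
            for word in value.lower().split():
                text_index[word].add(doc_id)
    return path_index, text_index


path_index, text_index = build_indexes(DOCS, INDEXED_PATHS)

# Structural lookup on an indexed path.
print(sorted(path_index.get(("customer/city", "Toronto"), set())))
# Text lookup, roughly in the spirit of a CONTAINS-style search.
print(sorted(text_index.get("email", set())))
```

In this sketch `customer/name` never enters the structural index because it was not chosen, yet the word "ada" is still findable through the text index. That is the kind of separation between the two indexes that the call seemed to describe.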
A big part of IBM’s XML business strategy is to support various (typically vertical market) XML standards. IBM has implemented support for these standards and made it freely downloadable. What does “support” mean? It surely starts with a DTD (Document Type Definition), and apparently also includes mappings to generic web services interfaces. It turns out that there are a lot of these standards, so I’m listing some in a separate post.
More generally, it seems that the sales and uses for IBM pureXML are concentrated in two main (overlapping) cases:
- When XML was going to be used anyway. One big example of this is the case of the standards-based industry data interchanges. Another example is when pureXML, albeit disk-based, acts as a kind of quasi-cache or mini-MDM (Master Data Management) hub for WebSphere-based enterprise application integration (EAI). IBM reports that DB2 pureXML has been sold as an intermediate EAI data store at least once each in banking, retailing, health care, and insurance.
- When schema flexibility is of great importance.
Experience teaches me that schema flexibility is a subject that can attract considerable flames, in the general vein of “Omigod! The relational model is perfect because it’s mathematically proven to ensure referential integrity!!” So I’ll split out the main discussion of that into yet another separate post, and keep going.
IBM actually breaks out the pureXML use cases into four main groups:
- Transactional. This comprises the transactional logging of information that just happens to be XML, such as in financial services.
- Forms-oriented. This comprises, for example, the tax authority use case.
- Service bus acceleration. That’s a fancy phrase to cover both the standards-based interchanges and the other EAI-related uses.
- Event-driven data warehousing. This one was kind of blurry to me. What I think it means is that if you have transactional data in XML, and you want to use it in near-real-time business intelligence, DB2 pureXML can help you with that.
The first, third, and fourth groups seem to fit into my “When XML was going to be used anyway” category. Part of “Schema flexibility” matches the second (forms-oriented) group; I’m not clear on where in IBM’s four buckets the rest of schema flexibility goes.
Finally, I asked directly in what areas there were significant numbers of DB2 pureXML customers. IBM offered two examples. One was financial services in general — especially in North America, notwithstanding the importance of the UNIFI standard in Europe. The other was health care data interchange outside the United States — especially in China, where regional and national centers are being established to more closely oversee local hospitals.
Related links
- IBM kindly gave me permission to make available the slide presentation from our August 29 briefing. The last page has a large number of links to further IBM pureXML resources.
- Conor O’Mahony has a good blog.
- As noted above, I am putting up separate posts on standards-based data interchange and schema flexibility.
Comments
7 Responses to “Overview of IBM DB2 pureXML”
[…] the most important or successful IBM pureXML-supported standards, in terms of downloads and other evidence of customer interest, […]
[…] O’Mahony, marketing manager for IBM’s DB2 pureXML talks a lot about one of my favorite hobbyhorses — schema flexibility — as a reason to […]
Hi Curt,
Answers to your questions above…
The reason that IBM did not use DB2’s general datatype extensibility framework is that we decided not to simply “extend” our existing infrastructure to support XML like many vendors did. Instead, we spent 5 years developing XML-specific infrastructure from the ground up.
IBM essentially has traditional relational infrastructure and XML infrastructure in the physical layer. It then has a unified runtime execution layer above this which, for the most part, “hides” the physical storage considerations. This unified execution runtime layer provides the data manipulation and retrieval languages. (I say “for the most part” because there are certain physical storage settings that you want to be able to configure). All the management tooling, including backup/restore, utilities, and high availability are supported for XML data.
DB2’s parser handles both SQL (with its extensions for XML) and native XQuery, and translates them into a common set of instructions. Of course, this is different from some relational vendors, who translate XQuery into SQL.
Conor, I’m still not getting it. (And I suspect this is a question for one of your highly techie colleagues.)
When one uses an object-relational/extensible DBMS’s extensibility, one is indeed still banging data into rows somewhere. That’s why Oracle and DB2 put text into BLOBs/CLOBs, for example.
But one can index however one likes, no? (Again, consider the text example.)
I guess what I’m missing is this — where does DB2’s datatype extensibility fail? What unacceptable overhead does it impose in the case of XML?
I’ve gotten the message that I used to overrate datatype extensibility like Oracle’s, DB2’s, and Informix/Illustra’s. What I haven’t figured out yet, however, is WHY I was wrong.
Thanks,
CAM
“I’ve gotten the message that I used to overrate datatype extensibility like Oracle’s, DB2’s, and Informix/Illustra’s. What I haven’t figured out yet, however, is WHY I was wrong.”
LOL, maybe you just weren’t wrong…?
[…] been implying that the short list for native XML database engine vendors should be MarkLogic, IBM, and maybe Microsoft, on the theory that Progress and Intersystems tried the market and pulled […]
[…] Notwithstanding what I wrote above — and to my surprise when I learned it — IBM did not rely on its general extensibility framework for XML support. […]