The Mark Logic story in XML database management
Mark Logic* has an interesting, complex story. They sell a technology stack based on an XML DBMS with text search designed in from the get-go. They usually want to be known as a “content” technology provider rather than a DBMS vendor, but not quite always.
*Note: Product name = MarkLogic, company name = Mark Logic.
I’ve agreed to do a white paper and webcast for Mark Logic (sponsored, of course). But before I start serious work on those, I want to blog based on what I know. As always, feedback is warmly encouraged.
Some of the big differences between MarkLogic and other DBMS are:
- MarkLogic’s primary DML/DDL (Data Manipulation/Description Language) is XQuery. Indeed, Mark Logic is in many ways the chief standard-bearer for pure XQuery, as opposed to SQL/XQuery hybrids.
- MarkLogic’s XML processing is much faster than many alternatives. A client told me last year that – in an application that had nothing to do with MarkLogic’s traditional strength of text search – MarkLogic’s performance beat IBM DB2/Viper’s by “an order of magnitude.” And I think they were using the phrase correctly (i.e., 10X or so).
- MarkLogic indexes all kinds of entities and facts, automagically, without any schema-prebuilding. (Nor, I gather, do they depend on individual documents carrying proper DTDs.) So there actually isn’t a lot of DDL. (Mark Logic claims in one test MarkLogic had more or less 0 DDL, vs. 20,000 lines in DB2/Viper.) What MarkLogic indexes includes, as Mark Logic puts it:
  - Every word
  - Every piece of structure
  - Every parent-child relationship
  - Every value.
- As opposed to most extended-relational DBMS, MarkLogic indexes all kinds of information in a single, tightly integrated index. Mark Logic claims this is part of the reason for MarkLogic’s good performance, and asserts that competitors’ lack of full integration often causes overhead and/or gets in the way of optimal query plans. (For example, Mark Logic claims that Microsoft SQL Server’s optimizer is so FUBARed that it always does the text part of a search first.) Interestingly, InterSystems’ object-oriented Caché does pretty much the same thing.
- MarkLogic is proud of its text search extensions to XQuery. I’ve neglected to ask how that relates to the XQuery standards process. (For example, text search wasn’t integrated into the SQL standard until SQL3.)
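To make the "index everything in one place" idea concrete, here is a toy Python sketch of a single inverted index that holds words, element names, parent-child pairs, and values side by side. It is only an illustration under my own assumptions, not MarkLogic's actual design: the term tuples (`"word"`, `"element"`, `"child"`, `"value"`) and the helper names `index_document` and `search` are hypothetical. The point it tries to show is that when all term kinds live in one index, a mixed text-plus-structure query can start from whichever posting list is smallest instead of always running the text part first.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_document(index, doc_id, xml_text):
    """Add one document's terms to a single shared inverted index.

    No schema or DTD is consulted; terms are derived from the document itself.
    (Tail text after child elements is ignored to keep the sketch short.)
    """
    root = ET.fromstring(xml_text)

    def visit(elem, parent_tag):
        # Every piece of structure: the element name.
        index[("element", elem.tag)].add(doc_id)
        # Every parent-child relationship.
        if parent_tag is not None:
            index[("child", parent_tag, elem.tag)].add(doc_id)
        text = (elem.text or "").strip()
        if text:
            # Every value: the element's full text content.
            index[("value", elem.tag, text)].add(doc_id)
            # Every word: lowercased tokens from that text.
            for word in re.findall(r"\w+", text.lower()):
                index[("word", word)].add(doc_id)
        for child in elem:
            visit(child, elem.tag)

    visit(root, None)


def search(index, *terms):
    """Return ids of documents matching all terms, smallest posting list first."""
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


index = defaultdict(set)
index_document(index, "d1", "<article><title>XML databases</title><year>2008</year></article>")
index_document(index, "d2", "<memo><title>Lunch menu</title><year>2008</year></memo>")

# One query mixing a word, a structural constraint, and a value.
hits = search(
    index,
    ("word", "xml"),
    ("child", "article", "title"),
    ("value", "year", "2008"),
)
print(hits)
```

A real engine would store compressed posting lists on disk and handle positions, attributes, and updates; the sketch only shows why one shared index makes it straightforward to pick the evaluation order per query.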
Other architectural highlights include:
- MarkLogic uses timestamps and appends for updates, rather than updates-in-place, much like Netezza or Illustra. Cleanup is done in the background. As long as your volume of changes (as opposed to inserts or reads) is sufficiently low, this can be more efficient than traditional approaches. Timestamping also makes it easy to write certain application functionality in publishing (“go live” times for content are a current use) and compliance (a possible future).
- MarkLogic is ACID-compliant. Thus, you can read data as soon as it’s inserted, without a separate re-indexing step. Other native XML systems may not have that property (e.g., Mark Logic asserts that DB2/Viper doesn’t).
- Mark Logic claims MarkLogic has relatively efficient (optional) range indexes. (This was in response to a question; details are secret.) Inverted-list DBMS like ADABAS and Model 204 have been doing decently efficient range queries for 30 years, so this claim is both credible and not terribly important.
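The timestamps-and-appends approach mentioned above can be sketched in a few lines of Python. This is a generic illustration of append-only, timestamped versioning with background cleanup, not a description of MarkLogic's internals; the class `AppendOnlyStore`, its method names, and the integer logical clock are all assumptions made for the example. An update never overwrites the old version; it appends a new one with a later timestamp, readers pick the newest version visible at their timestamp, and a separate cleanup pass discards versions nobody can still see.

```python
import itertools
from collections import defaultdict


class AppendOnlyStore:
    """Toy key-value store where updates are timestamped appends."""

    def __init__(self):
        self._clock = itertools.count(1)   # logical timestamps 1, 2, 3, ...
        self._log = defaultdict(list)      # key -> [(timestamp, value)], oldest first

    def put(self, key, value):
        """Append a new version; older versions are left untouched."""
        ts = next(self._clock)
        self._log[key].append((ts, value))
        return ts

    def get(self, key, as_of=None):
        """Return the newest version visible at `as_of` (default: latest)."""
        for ts, value in reversed(self._log.get(key, [])):
            if as_of is None or ts <= as_of:
                return value
        return None

    def cleanup(self, oldest_needed):
        """Background pass: drop versions that no reader at or after
        `oldest_needed` could still see. Returns how many were removed."""
        removed = 0
        for versions in self._log.values():
            keep_from = 0
            for i, (ts, _) in enumerate(versions):
                if ts <= oldest_needed:
                    keep_from = i  # latest version visible at oldest_needed
            removed += keep_from
            del versions[:keep_from]
        return removed


store = AppendOnlyStore()
t1 = store.put("story", "draft")
t2 = store.put("story", "final")

latest = store.get("story")             # newest version
earlier = store.get("story", as_of=t1)  # what a reader at time t1 sees
removed = store.cleanup(oldest_needed=t2)
print(latest, earlier, removed)
```

The same visibility rule is what makes "go live" times cheap to express: content stamped with a future time simply is not visible to readers whose timestamp is earlier. The cost shows up when changes are frequent, since every change leaves an old version behind for the cleanup pass to reclaim.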
Related links
- A companion post over on Text Technologies takes a text search view of MarkLogic.
- One of the leading sites on text analytics and general enterprise software marketing, Dave Kellogg’s Mark Logic CEO Blog.
- Edit: An October, 2008 post takes a deeper dive into the MarkLogic architecture.
Comments
3 Responses to “The Mark Logic story in XML database management”
[…] two posts this morning on Mark Logic and its MarkLogic product family. The main one, over on DBMS2, outlines the technical architecture – focusing on MarkLogic as an XML database management […]
[Disclaimer: I am an IBM employee.]
There are some IBM references in this post that I’d like to comment on:
– “XML processing” is a broad topic. A specific deployment, although a perfectly valid data point, is not necessarily representative of all XML processing tasks. MarkLogic has a lot of success with “content” use cases, but what about “data” use cases? For instance, I heard of a situation where MarkLogic struggled with insert/update/delete performance in an environment with an extreme number of small XML documents. I also heard of issues with MarkLogic returning very large result sets. My intention is not to pick on MarkLogic here. They are a good vendor with good technology. I’d just like to point out that there are situations where every vendor struggles, and to not put too much weight on those situations unless you know that they closely match your situation.
– IBM continues to be, to the best of my knowledge, the only vendor to disclose XML performance results with *all* details of the workload, hardware, and configuration that was being used (SourceForge: Transaction Processing over XML). IBM also has several customer case studies that substantiate excellent levels of performance, and numerous customers who have spoken about their positive experiences with DB2 pureXML at conferences worldwide.
– Regarding indexing, there is a cost to the MarkLogic approach. As you know, MarkLogic indexes all element values and all attribute values in all paths. And they maintain this index synchronously. This, of course, incurs a cost during insert, update, and delete operations. It is possible that this partly explains the performance issues I mentioned above. In my opinion, the DB2 approach of choosing exactly what you want to index is often better. (Note that to optimize what is indexed does require that you know the queries you will process.)
Also, keep in mind that hybrid relational/XML data servers offer several advantages over XML-only data servers, including:
– Many organizations have existing systems based on relational data and SQL (and ODBC or JDBC). In such situations, a complete rip-and-replace to use an XML-only vendor can be quite traumatic. Our experience is that a hybrid relational/XML store has eased the path to native XML adoption for many such organizations.
– We have also found that many application scenarios are not XML-only, but involve XML and relational data. But, of course, it is natural that our experiences match the areas in which we possibly have stronger synergies.
– Some XML-only vendors lack some of the more advanced DBMS capabilities, like point-in-time-restore capabilities. Also, the established relational/XML vendors do have a wide ecosystem of database tools vendors who work with their products.
A step into another world.