Data types
Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Any subcategory
- Database diversity
Truviso and EnterpriseDB blend event processing with ordinary database management
Truviso and EnterpriseDB announced today that there’s a Truviso “blade” for Postgres Plus. By email, EnterpriseDB’s Bob Zurek endorsed my tentative summary of what this means technically, namely:
There’s data being managed transactionally by EnterpriseDB.
Truviso’s DML has all along included ways to talk to a persistent Postgres data store.
If, in addition, one wants to do stream processing things on the same data, that’s now possible, using Truviso’s usual DML.
The Mark Logic story in XML database management
Mark Logic* has an interesting, complex story. They sell a technology stack based on an XML DBMS with text search designed in from the get go. They usually want to be known as a “content” technology provider rather than a DBMS vendor, but not quite always.
*Note: Product name = MarkLogic, company name = Mark Logic.
I’ve agreed to do a white paper and webcast for Mark Logic (sponsored, of course). But before I start serious work on those, I want to blog based on what I know. As always, feedback is warmly encouraged.
Some of the big differences between MarkLogic and other DBMS are:
- MarkLogic’s primary DML/DDL (Data Manipulation/Description Language) is XQuery. Indeed, Mark Logic is in many ways the chief standard-bearer for pure XQuery, as opposed to SQL/XQuery hybrids.
- MarkLogic’s XML processing is much faster than many alternatives. A client told me last year that – in an application that had nothing to do with MarkLogic’s traditional strength of text search – MarkLogic’s performance beat IBM DB2/Viper’s by “an order of magnitude.” And I think they were using the phrase correctly (i.e., 10X or so).
- MarkLogic indexes all kinds of entities and facts, automagically, without any schema-prebuilding. (Nor, I gather, do they depend on individual documents carrying proper DTDs.) So there actually isn’t a lot of DDL. (Mark Logic claims in one test MarkLogic had more or less 0 DDL, vs. 20,000 lines in DB2/Viper.) What MarkLogic indexes includes, as Mark Logic puts it:
  - Every word
  - Every piece of structure
  - Every parent-child relationship
  - Every value.
- As opposed to most extended-relational DBMS, MarkLogic indexes all kinds of information in a single, tightly integrated index. Mark Logic claims this is part of the reason for MarkLogic’s good performance, and asserts that competitors’ lack of full integration often causes overhead and/or gets in the way of optimal query plans. (For example, Mark Logic claims that Microsoft SQL Server’s optimizer is so FUBARed that it always does the text part of a search first.) Interestingly, InterSystems’ object-oriented Caché does pretty much the same thing.
- MarkLogic is proud of its text search extensions to XQuery. I’ve neglected to ask how that relates to the XQuery standards process. (For example, text search wasn’t integrated into the SQL standard until SQL3.)
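To make the "every word, every piece of structure, every parent-child relationship, every value" idea concrete, here is a minimal Python sketch of a single combined index built without any schema. It is an illustration only, not MarkLogic's actual index design; the function names `build_index` and `lookup` and the `(kind, key)` term layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(doc_id, xml_text, index=None):
    """Add one XML document to a single combined index.

    Every entry is keyed by (kind, key) and maps to the set of
    document ids containing it. No schema or DTD is consulted.
    Tail text (text after a child element) is ignored for brevity.
    """
    index = index if index is not None else defaultdict(set)
    root = ET.fromstring(xml_text)

    def visit(elem, parent_tag):
        # Every piece of structure: the element name itself.
        index[("element", elem.tag)].add(doc_id)
        # Every parent-child relationship.
        if parent_tag is not None:
            index[("parent-child", (parent_tag, elem.tag))].add(doc_id)
        text = (elem.text or "").strip()
        if text:
            # Every value: the element's full text content.
            index[("value", (elem.tag, text))].add(doc_id)
            # Every word: lowercased whitespace-separated tokens.
            for word in text.lower().split():
                index[("word", word)].add(doc_id)
        for child in elem:
            visit(child, elem.tag)

    visit(root, None)
    return index


def lookup(index, *terms):
    """Intersect the posting sets of several (kind, key) terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


index = build_index(
    "d1", "<article><title>XML storage</title><author>Ann</author></article>"
)
build_index("d2", "<memo><title>Storage costs</title></memo>", index)

# A text term and a structural term are resolved against the same index.
hits = lookup(index, ("word", "storage"), ("parent-child", ("article", "title")))
print(hits)
```

Because word terms and structural terms live in one index, a mixed query is a plain set intersection, and neither kind of term has to be evaluated first.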
Other architectural highlights include: Read more
Categories: Data types, IBM and DB2, MarkLogic, Structured documents | 3 Comments |
XML versus sparse columns in variable schemas
Simon Sabin makes an interesting point: If you can have 30,000 columns in a table without sparsity management blowing up, you can handle entities with lots of different kinds of attributes. (And in SQL Server you can now do just that.) The example he uses is products — different products can have different sets of possible colors, different kinds of sizes, and so on. An example I’ve used in the past is marketing information — different prospects can reveal different kinds of information, which may have been gathered via non-comparable marketing programs.
I’ve suggested this kind of variability as a reason to actually go XML — you’re constantly adding not just new information, but new kinds of information, so your fixed schema is never up to date. But I haven’t detected many actual application designers who agree with me …
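The trade-off can be sketched in a few lines of Python. This is a toy illustration under assumed data: the product records, attribute names, and helper functions below are all hypothetical, and the code does not model SQL Server's sparse-column storage. It only shows how a fixed wide table fills up with nulls when entities carry different attributes.

```python
# Hypothetical products, each with its own set of attributes.
products = [
    {"sku": "shirt-1", "color": "blue", "collar_size": 16},
    {"sku": "lamp-7", "wattage": 60, "shade_color": "white"},
    {"sku": "boot-3", "color": "brown", "shoe_size": 10, "width": "EE"},
]


def to_wide_table(records):
    """Flatten variable records into one fixed-schema table.

    The schema is the union of all attribute names; any attribute
    a record lacks becomes None (a null cell).
    """
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows


def null_fraction(rows):
    """Share of cells in the wide table that are null."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


columns, rows = to_wide_table(products)
frac = null_fraction(rows)
print(len(columns), "columns;", round(frac, 2), "of cells are null")

# A new *kind* of information is just another key in the document form,
# but it forces a schema change (a new column) in the wide-table form.
products.append({"sku": "phone-2", "screen_inches": 6.1})
new_columns, _ = to_wide_table(products)
```

With only three products, more than half of the wide table's cells are already null. The document-style records absorb the fourth product without any schema change, while the wide table gains a column.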
Categories: MySQL, Structured documents, Theory and architecture | 3 Comments |
Microsoft SQL Server Data Services
As usual, Microsoft forgot to brief me, but Mary Jo Foley reports on Microsoft SQL Server Data Services. A look at the official site clarifies that this database-in-a-cloud offering uses “Microsoft SQL Server as a data storage node.” However, there seems to be a software layer on top of SQL Server providing scale-out and appropriate management.
In addition to the more-than-SQL-Server layer, there seems to be a less-than-SQL-Server aspect as well. In particular, Microsoft SQL Server Data Services boasts “Support for simple types: string, numeric, datetime, boolean.” XML is the “primary wire format,” and hints dropped about the schema philosophy sound XMLish too.
Interestingly, Foley reports that Microsoft plans to offer an on-premises version of Microsoft SQL Server Data Services as well.
Mike Stonebraker’s DBMS taxonomy
In a response to my recent five-part series on DBMS diversity, Mike Stonebraker has proposed his own taxonomy of data management technologies over on Vertica’s Database Column blog. (Edit: Some good stuff disappeared when Vertica nuked that blog.)
- OLTP DBMSs focused on fast, reliable transaction processing
- Analytic/Data Warehouse DBMSs focused on efficient load and ad-hoc query performance
- Science DBMSs — after all MatLab does not scale to disk-sized arrays
- RDF stores focused on efficiently storing semi-structured data in this format
- XML stores focused on semi-structured data in this format
- Search engines — the big players all use proprietary engines in this area
- Stream Processing Engines focused on real-time StreamSQL
- “Lean and Mean,” less-than-a-database engines focused on doing a small number of things very well (embedded databases are probably in this category)
- MapReduce and Hadoop — after all Google has enough “throw weight” to define a category
He goes on to say that each will be architected differently, except that — as he already convinced me back in July — RDF will be well-managed by specialty data warehouse DBMS. Read more
Categories: Data types, Database diversity, Michael Stonebraker, Mid-range, OLTP, RDF and graphs, Theory and architecture | 6 Comments |
Database management system choices – beyond relational
This is the fifth of a five-part series on database management system choices. For the first post in the series, please click here.
Relational database management systems have three essential elements:
- Rows and columns. Theoretically, rows and columns may be inessential to the relational model. But in reality, they are built into the design of every real-world relational product. If you don’t have rows and columns, you’re not using the product to do what it was well-designed for.
- Predicate logic. Theoretically, everything can be fitted into a predicate Procrustean bed. But if you’re looking for relevancy rankings on a text search, binary logic is a highly convoluted way to get them.
- Fixed schemas. Database theorists commonly assume that databases have fixed schemas. If this means that 90%+ of all information is null or missing, they have elegant ways of dealing with that. Even so, as computing gets ever more concerned with individuals — each with his/her/its unique “profile(s)” — fixed schemas get ever harder to maintain.
If any of these three elements is missing or inappropriate, then a traditional relational database management system may not be the best choice.
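The predicate-logic point can be illustrated with a small Python sketch. A boolean predicate only says whether a document matches, while a relevancy ranking needs a graded score. The scoring formula below is one simple TF-IDF-style variant chosen for illustration, not the method of any particular product; the sample documents and function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical mini-corpus.
docs = {
    "a": "xml database with text search",
    "b": "text search text ranking text",
    "c": "relational database rows and columns",
}


def boolean_match(docs, term):
    """Predicate-style answer: a document either matches or it does not."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


def ranked(docs, query):
    """Graded answer: order matching documents by a relevancy score.

    Score = sum over query terms of (term frequency in the document)
    times log(1 + N / document frequency).
    """
    n = len(docs)
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.split():
            df = sum(term in other for other in tokenized.values())
            if df:
                score += counts[term] / len(words) * math.log(1 + n / df)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(boolean_match(docs, "text"))  # an unordered set of matches
print(ranked(docs, "text"))         # the same matches, best first
```

Both functions find the same two documents for "text", but only the second can say that document "b" is a better answer than document "a".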
Categories: Data types, Database diversity, Theory and architecture | 3 Comments |
Dan Weinreb on ObjectStore
Dan Weinreb was one of the key techies at Object Design, the company that made the object-oriented database management system ObjectStore. (Object Design later merged into Excelon, which was eventually sold to Progress, which has deemphasized but still supports ObjectStore.) Recently he wrote a pair of long and fascinating articles* about Object Design, ObjectStore, and OODBMS, the first of which makes the case that “object-oriented database management systems succeeded.”
Read more
CouchDB — lazy database design taken to excess?
I’ve run into a research/alpha/whatever project called CouchDB a couple of times now. It’s yet another “Who needs relational databases? Who needs schemas?” kind of idea. Rather, CouchDB is for taking random documents and banging them into databases, then calculating views on the fly as needed. It’s REST-friendly. Lucene and a web server are built in.
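The "bang documents in, calculate views on the fly" idea can be sketched in plain Python. This is not CouchDB's API. The `DocStore` class, its `put` and `view` methods, and the `by_author` map function are hypothetical names invented for this illustration of the general pattern.

```python
from collections import defaultdict


class DocStore:
    """Toy schemaless store: any dict can be saved under an id."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        # No schema check: documents may have entirely different fields.
        self._docs[doc_id] = doc

    def view(self, map_fn):
        """Compute a view on demand.

        map_fn(doc) yields zero or more (key, value) pairs per document;
        documents it does not care about simply yield nothing.
        """
        out = defaultdict(list)
        for doc in self._docs.values():
            for key, value in map_fn(doc):
                out[key].append(value)
        return dict(out)


def by_author(doc):
    """Example view: titles grouped by author, skipping other documents."""
    if "author" in doc:
        yield doc["author"], doc.get("title", "(untitled)")


store = DocStore()
store.put("1", {"author": "ann", "title": "Notes"})
store.put("2", {"type": "invoice", "total": 40})  # unrelated shape
store.put("3", {"author": "ann"})                 # missing title

result = store.view(by_author)
print(result)
```

The structure lives in the view function rather than in the stored data, so adding a new kind of document never requires changing a schema, only writing views that know about it.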
Damien Katz seems to be the driving force behind CouchDB, and his discussion of document-oriented development seems to be a good starting point. Read more
Categories: CouchDB, Data models and architecture, Database diversity, Structured documents, Theory and architecture | 4 Comments |
5 kinds of data structure and 16 kinds of data access method
My recent post about datatype extensibility zoomed over at least one head, as per the comment thread. Since then I’ve googled, and come to suspect that part of what I was assuming as common knowledge may not be so common after all. So I’m going to back up and explain a bit about data access methods, as well as the sub-topic of data structures. If you take nothing else away from this post, I hope it will at least remind you of the sheer variety of ways data can be stored on disk or in RAM.
First, let’s define the concept of data access method in three steps:
Categories: Data types | 3 Comments |
Who is actually using native XML?
Question of the day #2
Who is actually using native XML?
Mark Logic is having a fine time using its native XML engine for custom publishing. One outfit I know of is using a native XML engine for something like web analytics, but is driving me crazy by never coming through on permission to divulge details. There’s a bit of native XML use out there supporting the insurance industry’s ACORD standard.
And after that I quickly run out of examples of native XML use. Read more
Categories: Data types, IBM and DB2, MarkLogic, Microsoft and SQL*Server, Oracle, Structured documents | 3 Comments |