Relational DBMS versus text data
There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.
1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.
2. The standard way to manage text is via a full-text index, designed like this: For hundreds of thousands of words, the index maintains a list of which documents the word appears in, and at what positions in the document it appears. This is a columnar, memory-centric approach, that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if they tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the getgo.
3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have SQL DBMS extended to handle text via the SQL3 (or SQL/MM ) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.
But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 – 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).
4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).
5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is almost a small one compared with some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.
Comments
13 Responses to “Relational DBMS versus text data”
Leave a Reply
[…] One of the biggest is in the area of text. They fail to see how text data management is fundamentally different from tabular data management. Here’s a little article explaining why text doesn’t fit well into the relational model. • • • […]
I agree with this entirely — wrote an article on the related topic of searching text and how DBMSs just tend to be the wrong model, http://www.searchtools.com/info/database-search.html. I will link to this as well, it as you have more of the database architecture details.
[…] Curt Monash has some observations about search, meaning, semantics, the suitability of relational DBMS to manage text data and similar subjects: […]
[…] EDIT: Point #4 of my post on the mismatch between relational databases and text search is pretty relevant here. • • • […]
[…] Text search is a huge business on the web, and a separate big business in enterprises. And text doesn’t fit well into the relational paradigm at […]
Text Search and the Relational Model…
Since text search is one of my top areas of expertise, I hope I can explain to you why the relational model is perfectly capable of capturing the structure of text. I’ll start at the very bottom, explaining what text search really is….
[…] Po pierwsze, SQL/MM Full-Text na razie jest tylko w ułamkowej części implementowany przez niektóre silniki bazodanowe. Po drugie, relacyjne bazy danych mają swoje ograniczenia. Widoczne są one także na polu wyszukiwania “ze zrozumieniem” – bo w sumie do tego dążymy jak się chwilę zastanowić. Pisze o tym np. Curt Monash w artykule Relational DBMS versus text data. […]
[…] In December, 2005 I expounded on the mismatch between text data and the relational model. […]
[…] are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly […]
[…] First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model. […]
“Text search functions aren’t two-valued binary yes/no types of things”
Actually, they are. It is just that the volume of results one gets from such operators applied to a text corpus is so large as to be useless to a human being, the usual consumer. Instead the results are sorted by relevance, an expensive total ordering which was computed at index build time and embodied in document id order. The human sees only the most relevant portion of an exact result set.
Arguably, it is IR which has it right (every relationship that actually occurs is automatically indexed) and the RDBMS world which got it wrong (between tables, it is unclear which relationships actually exist and indexes are created by hand).
Scott,
Text search is, in principle, a function from a set of documents to a set of possible relevancy scores, which set has cardinality a lot higher than 2.
You are correct that an early stage in evaluation is apt to be a binary filter, but that doesn’t contradict my point.
Good answers in return of this question with real arguments and telling all concerning that.