The 4 main approaches to datatype extensibility
Based on a variety of conversations – including some of the flames about my recent confession that mid-range DBMS aren’t suitable for everything — it seems as if a quick primer may be in order on the subject of datatype support. So here goes.
“Database management” usually deals with numeric or alphabetical data – i.e., the kind of stuff that goes nicely into tables. It commonly has a natural one-dimensional sort order, which is very useful for sort/merge joins, b-tree indexes, and the like. This kind of tabular data is what relational database management systems were invented for.
But ever more, there are important datatypes beyond character strings, numbers and dates. Leaving out generic BLOBs and CLOBs (Binary/Character Large OBjects), the big four surely are:
- Text. Text search is a huge business on the web, and a separate big business in enterprises. And text doesn’t fit well into the relational paradigm at all.
- Geospatial. Information about locations on the earth’s surface is essentially two-dimensional. Some geospatial apps use three dimensions.
- Object. There are two main reasons for using object datatypes. First, the data can have complex internal structures. Second, it can comprise a variety of simpler types. Object structures are well-suited for engineering and medical applications.
- XML. A great deal of XML is, at its heart, either relational/tabular data or text documents. Still, there are a variety of applications for which the most natural datatype truly is XML.
Numerous other datatypes are important as well, with the top runners-up probably being images, sound, video, time series (even though they’re numeric, they benefit from special handling).
Four major ways have evolved to manage data of non-tabular datatype, either on their own or within an essentially relational data management environment.
- Utterly standalone servers. There are lots of search engines, geospatial engines, object-oriented database management systems, and so on. Some may have ODBC/JDBC SQL interfaces, to handle metadata (which is commonly tabular in nature) if nothing else. But even so, there’s little relational about them.
- True RDBMS extensibility. In the 1990s, awkwardly named object-relational database management systems were introduced, boasting the awkwardly named feature abstract datatypes. Oracle, DB2, Informix, and PostgreSQL are now of this kind. They let one write data access methods for data that’s right in the basic relational table structure, and get at it through extensions to SQL.
- Tightly coupled servers. A close relative of RDBMS extension via new access methods is to create new servers for new datatypes, well-integrated with your RDBMS. Your parser and optimizer are in charge of federating them; the user just writes extended-SQL statements.
- User-defined functions. User-defined functions are like datatype extensions, but vastly easier to write, in that they don’t have any special access methods. When their performance is good enough, UDFs are often the best way to handle extended-datatype needs.
So how does this all play out in real-world examples? It’s all over the place.
- Enterprise text search is divided among three modes – integrated into the RDBMS (Oracle and IBM), tightly-couple server (Microsoft, pre-FAST acquisition), and standalone (Autonomy, FAST pre-acquisition, Google, and most other vendors).
- Geospatial datatypes are embedded into extensible DBMS – generally via technology from ESRI – for OLTP uses. But for data warehousing, where you don’t need pinpoint record retrieval, UDFs are generally believed to suffice. (E.g. Teradata, Netezza.)
- Intersystems seems to stand alone in getting nontrivial revenue from the standalone OODBMS market.
- The XML situation is really confused: Oracle has been late getting its native XML strategy together; the tightly-coupled DB2 Viper engine has been a performance disappointment; Microsoft’s integrated native XML isn’t heard from much either; and text/XML integrated engine Marklogic is getting some non-text business almost by default. In addition, every serious relational vendor has a capability to “shred” XML into relational tables, and can of course also just bulk-handle XML via BLOBs/CLOBs.
Comments
10 Responses to “The 4 main approaches to datatype extensibility”
Leave a Reply
“including some of the flames about my recent confession that mid-range DBMS aren’t suitable for everything”
You confuse “being contradicted” with “being flamed”. Instead of being so touchy, you could just show some responsibility and correct the factual mistakes in the post you are referring to.
Why, thank you for showing up! This post was in part for you, because you’d expressed ignorance on the matter of datatypes.
Otherwise — the words “slander” and “disgusting” are indeed flames. And you’re way out of line for claiming that your statements of opinion are factual at all, let alone that they point out somebody else’s “factual mistakes.”
Unless you settle down and start thinking clearly, I have no interest in attempting substantive discussion with you.
CAM
“This post was in part for you, because you’d expressed ignorance on the matter of datatypes.”
Then thank you so much for enlightening me and the rest of us ignoramuses! I’m so glad you told us all about those fancy datatypes like images, I really should notify Frank Mash at fotolog immediately he must look out for another database that can handle them a.s.ap.
“the words “slander” and “disgusting” are indeed flames”
What can you expect? You started a smoke screen, it’s just natural that there must be a fire somewhere.
“And you’re way out of line for claiming that your statements of opinion are factual at all, let alone that they point out somebody else’s “factual mistakes.””
I and others have explained exactly and repeatedly where the factual mistakes are. You have been avoiding even an acknowledgement that you are aware the refutations exist, even though you have been given the pointers (and members of posterity can browse the past week of planetmysql.org archives to check for themselves). I’m really starting to believe you are doing this on purpose.
“Unless you settle down and start thinking clearly, I have no interest in attempting substantive discussion with you.”
Oh no…please don’t threaten me…does “thinking clearly” mean I can’t contradict you anymore? If so, that’s really going to be such a tough choice…oh the wretched dilemma!…am I really going to cut myself off of this world-class opportunity to have Curt attempt a substantive discussion with mere me?
[…] di articoli apparsi sulla rete ne voglio evidenziare due di DBMS2, il primo riguardante i 4 approcci principali all’estensibilità dei tipi di dati, il secondo invece elenca 14 motivi per cui non dovrebbe essere utilizzato un DBMS di fascia […]
Roland,
I’ll respond to the one on-topic thing you’ve said in this comment thread. I would guess strongly that the photos at Fotolog.com aren’t indexed directly. E.g., there’s no search that says “Show me all the photos with 3 or more faces.” (I’d be fascinated if this is wrong, by the way.) Rather, they most likely are indexed on the metadata, such as descriptions. (Nothing wrong with that; image indexing is still pretty primitive. But it illustrates my point.) I further note that there’s no native search of the descriptive text; they just use Google. (Nothing wrong with that either; I do it too.)
Thus, Fotolog does not appear to be evidence that MySQL is being used to store anything but alphanumeric data and BLOBs/CLOBs.
As for the off-topic stuff — we agree on one thing. There are plenty of other posts and comment threads, here and elsewhere, where the various opinions are noted.
CAM
Curt,
Great post and very informative about the big picture! Having been studying and working on XML databases for more than 5 years, I’ve seen good and bad usages of XML. In my personally opinion, the best thing about XML is that it doesn’t require a schema or it could define a very flexible schema, yet you can define relational views on top of it to “map” it to some schema. This saves a lot of headaches from schema evolutionand data/schema integration.
Also regarding to data types or data models, I thik graph data model and graph querylanguages may play a significant role in the future database research and development. There are many applications that can be naturally modeled as a database problem ongraphs (examples include social networks, scientific data, and RDF). Queries (e.g., route queries) on graphs are usually not expressible in pure SQL. Using vendor-dependent extensions or UDFs usually results in unacceptable performance on large data sets. So I guess graphs should be the next data type that major DBMS vendors should support in the near future.
Thanks,
Ning
Ning
Ning,
Cogito had an interesting approach to graphs, which I wrote about here. Its assets have now been bought by another company; stay tuned.
RDF is the other approach to graphs, of course. If you look at the RDF section here, you may find some thoughts congenial with yours. 🙂
CAM
Curt,
Thanks for the pointers. I’ve read your Cogito white paper and their web site. Although there is not much detail about the GQL, storage and query processing, it seems a more interesting approach comparing to some of the relational approaches developed for graph databases — one simple reason is because SQL is not expressive enough to specify complex graph queries. Also clustering nodes based on their proximity is also more promising than set-oriented storage. Graph partitioning techniques seems to be one of the keys to the performance.
Some technical questions about the MyFamily.com example mentioned in your white paper. Are there any details about the experiments? i.e., is the database RAM-based/disk-based? Are there any indexes? is the data set really a graph or a tree (seems tree is enough for the problem)?
I looking forward to your update.
Thanks,
Ning
Ning,
The answer to most of your questions is “I don’t know.”
That said, Cogito definitely was supporting graphs and not just trees. Whether they optimized clustering for general graphs or just for trees is of course a whole other matter, since the former is a much harder problem.
As I write this, either my calendar is wrong or the CEO of the successor company is late giving me a phone call. As I said — stay tuned.;)
EDIT: Oops. I was supposed to call him, not vice-versa …
CAM
[…] recent post about datatype extensibility zoomed over at least one head, as per the comment thread. Since then I’ve googled, and come […]