XML versus sparse columns in variable schemas
Simon Sabin makes an interesting point: If you can have 30,000 columns in a table without sparsity management blowing up, you can handle entities with lots of different kinds of attributes. (And in SQL Server you can now do just that.) The example he uses is products — different products can have different sets of possible colors, different kinds of sizes, and so on. An example I’ve used in the past is marketing information — different prospects can reveal different kinds of information, which may have been gathered via non-comparable marketing programs.
I’ve suggested this kind of variability as a reason to actually go XML — you’re constantly adding not just new information, but new kinds of information, so your fixed schema is never up to date. But I haven’t detected many actual application designers who agree with me …
Comments
3 Responses to “XML versus sparse columns in variable schemas”
Leave a Reply
Exactly. This is one of the sweet spots of XML databases: schema evolution and schema chaos.
With relational views defined on top of XML data, application designers should not feel lost. I guess the major reasons that prevent them adopting XML by now are performance and operational completeness, which should be mature eventually.
To make this mode of operation useful, there would be one column that contains a value that says what type this row is, and there would be metadata somewhere that says which columns are relevant to this row. (For XML, your metadata would say even more, namely where each column corresponds to which element or attribute in an XML schema.) This is basically a cheesy way to do polymorphism, in RDBMS’s which don’t have this kind of polymorphism natively defined.
Simon claims you have strongly-typed data, but what in the DBMS stops you from inserting a row representing socks but setting the cup size?
And what if you have special sports socks, which have an additional attribute saying what sport they’re aimed at? This column should be empty for regular socks. That’s a simple example of inheritance. You can represent the rows just fine but you need the metadata somewhere, and I don’t see how it would be in the DBMS, so the DBMS’s ability to do integrity checking is limited.
@Ning: Besides performance and features, the other barrier is cost. ATM, the native XML database of choice is MarkLogic and they know it. So, they charge you an arm and a leg for it. The list price is 6 figures IIRC.
@Daniel: Simon’s example that includes typed metadata is a case of what some call EAV/CV (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model). The metadata to create this type system is in the RDBMS as well, and your application essentially becomes the referential integrity engine. Fun times!