Data(base) virtualization — a terminological mess
Data/database virtualization seems to be a hot subject right now, and vendors of a broad variety of different technologies are all claiming to be in the space. A terminological mess has ensued, as Monash’s First and Third Laws of Commercial Semantics are borne out in spades.
If something is like “virtualization”, then it should resemble hypervisors such as VMware. To me:
- The core feature of a hypervisor is that it allows many somethings to run and coexist where ordinarily only one something would come into play. Here the “many somethings” are virtual machines and what’s going on inside them, and the “one something” is the ordinary operating system/hardware computing stack.
- A core feature of original VMware was that the “many somethings” could be quite different — for example, the operating environments of numerous different hardware systems you wanted to decommission, or of new systems that you didn’t want to buy quite yet.
- Important features of hypervisors include:
  - The ability to have multiple virtual machines run side by side at once, safely.
  - Flexible and powerful workload management if the virtual machines do contend for resources.
  - Easy management.
  - The negative feature of having sufficiently low overhead.
Anything that claims to be “like virtualization” should be viewed in that light. I.e., it isn’t real virtualization unless it has the ex uno plures* feature.
*“Out of one, many”. It turns out that e unum pluribus just means the same as e pluribus unum, namely “Out of many, one”; word order isn’t as important in Latin as in English.
Most commonly, “data/database virtualization” is used to denote some kind of transparent data federation.
- Forrester Research, in a recent Forrester Wave, conflates that with “Information as a Service”.
- Informatica’s data virtualization marketing page gives one vendor’s view as to which capabilities could be involved.
- Logical data warehouse would seem to be a related concept.
I think “virtualization” is a bad name for this, because there isn’t much ex uno plures going on. But at least it’s a name that’s in widespread use.
More solid is the sense of “database virtualization” used by Delphix. Their core idea is to take all your different database copies for production, test, development, archiving and so on, and to the extent possible turn them into one real database, plus a bunch of diffs. Cost savings are obvious if that works. The ex uno plures feature is present.
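To make “one real database, plus a bunch of diffs” concrete, here is a minimal copy-on-write sketch. It is not Delphix’s actual implementation, which isn’t described here; the class names `BaseDatabase` and `VirtualCopy` and the block-number model are all hypothetical, chosen only to illustrate the idea.

```python
class BaseDatabase:
    """The single real copy (hypothetical model): block number -> contents."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_no):
        return self._blocks[block_no]


class VirtualCopy:
    """Looks like a full private database, but stores only its own changes."""

    def __init__(self, base):
        self._base = base
        self._diffs = {}  # only the blocks this copy has overwritten

    def read(self, block_no):
        # Prefer this copy's private version; otherwise fall through to the base.
        if block_no in self._diffs:
            return self._diffs[block_no]
        return self._base.read(block_no)

    def write(self, block_no, data):
        # Writes never touch the shared base, so other copies are unaffected.
        self._diffs[block_no] = data

    def private_block_count(self):
        return len(self._diffs)


base = BaseDatabase({0: "customers", 1: "orders", 2: "invoices"})
test_db = VirtualCopy(base)
dev_db = VirtualCopy(base)

test_db.write(1, "orders (scrubbed for test)")
```

After the write, `test_db` sees its own version of block 1 while `dev_db` still sees the base version, and the only extra storage used is the one changed block. That is where both the cost savings and the ex uno plures feature come from.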
Recently, I’ve noticed that transparent sharding is being referred to as database virtualization, especially by ParElastic. Transparent sharding is a great feature, but I don’t think calling it “database virtualization” makes much sense.
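For readers who haven’t run into the term, transparent sharding can be sketched roughly as follows. This is a generic hash-routing toy, not ParElastic’s design; the `ShardedStore` class and its methods are hypothetical names used only for illustration.

```python
import hashlib


class ShardedStore:
    """Hypothetical key-value store that hides several shards behind one interface."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate physical database.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key picks the shard; callers never see this step.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]


store = ShardedStore(shard_count=4)
for customer in ["alice", "bob", "carol", "dave", "erin"]:
    store.put(customer, {"name": customer})
```

The application calls `put` and `get` as if there were a single database. In this toy, at least, the direction is many made to look like one, rather than one made to look like many.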
I noted back in October that the essence of multitenancy is a special-case version of ex uno plures. If somebody offered that and wanted to call it “virtualization”, I might not argue too much.
Weirdest of all is ScaleDB’s use of the term. ScaleDB seems to be claiming that:
- Any interesting database topology should be called “database virtualization”.
- The highest and best form of database virtualization is a clustered, shared-everything DBMS approach such as Oracle RAC.
Neither logic nor language supports ScaleDB’s side.
Comments
5 Responses to “Data(base) virtualization — a terminological mess”
Happy New Year Curt.
Good job diving right into another cloudy topic in the DBMS space. I started researching database virtualization technology a little last year, also with the VMware model as a backdrop, but found it didn’t quite fit, as you mention. I have not resumed my research yet (need to work through the OS/App model vs the DBMS/Data model for a proper solution?), but would welcome another of your classification writeups on vendors in this space if you’re looking for additional topics :^)
Regards,
Al D.
The broad definitions I’ve used for a pair of distinct but related concepts are:
– Something is virtual if it appears to be there but is not;
– Something is transparent if it appears *not* to be there although it *is*.
A virtual machine appears to be a machine, but it’s just a piece of something else. A VPN appears to be a private network made of real, private wires and routers and such, but that’s not really there – it’s a software construct. The Delphix example you give fits right in: There appear to be separate databases for product, test, and so on – real disks holding private copies of data and indexes and all the rest – but, again, these are really just software.
You can, of course, tie yourself in knots deciding when it’s virtual and when it’s “just software”. That network the VPN virtualized was much more than just the wires and routers – without the software construct of an “IP network” on top of it, it wouldn’t have been worth much. But in practice the lines are quite clear – and if they aren’t, “virtual” is really the wrong word.
— Jerry
[…] I disapprove, data virtualization seems to be the term that will win for describing data […]
Thanks for lucid discussion of these terms. The wiki pages on “data virtualization” and “database virtualization” have left much to be desired.
VMware set the precedent with virtualization, making many out of one, and in a similar way Delphix makes many virtual databases out of one set of database files.
I like Jerry’s comment that one out of many is “transparency.” If something doesn’t appear to be there but is there, then it’s transparent, such as when multiple databases are aggregated into one apparent data source. Multiple sources that look like one (say, an Oracle source and a SQL Server source encapsulated together) hide the individual players. Thus, as you say, “transparent data federation” would be a great term for that technology to use.
– Kyle Hailey
[…] platform (the virtualization idea of “ex uno plures”, that is, out of one, many). Curt Monash likes Delphix’s idea of database virtualization, but i still […]