The questionable benefits of terabyte-scale data warehouse virtualization
Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole, it seems virtualization’s major benefits don’t apply to large-database MPP data warehousing.
A couple of years ago, I outlined four criteria for when to virtualize.
- Just to be safe, don’t virtualize apps that are already I/O-bound or otherwise running flat-out. That rules out many data warehouse scenarios right there. Vertica insists it’s an exception, however, because compression helps it avoid being I/O-bound, and offers a quote from a customer who shares that optimism.
- Big enterprises have lots of production servers that are old, unreliable, and/or idle most of the time. Virtualize those. But modern analytic DBMS tend to run on modern hardware and OS. What’s more, they’re MPP, occupying multiple servers, because one isn’t enough. That’s very different from the fractional-server-consumption scenario that drives so much of virtualization’s success.
- If a server’s use is particularly spiky, it may be a great candidate for virtualization. This one might indeed apply to a few analytic use cases of the load-nightly, query-rarely ilk.
- Most development servers can and should be virtualized. That’s a better reason to virtualize BI or data mining than it is to virtualize data warehousing itself.
So basically none of these criteria speak well of virtualizing data warehousing, except in a few cases where you’re not entirely sure why you’re storing all that data in the first place.
There’s yet another problem with the whole idea of data warehouse virtualization. Suppose you turn the virtualization dial, and increase or decrease the number of nodes dedicated to your MPP analytic DBMS. What happens to the data? In most systems, the answer is “It gets redistributed among the disks.” That takes at least tens of minutes, and even that low figure holds only if you believe everything you hear from vendor marketers. Hours or days might also be realistic.
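To put rough numbers on that redistribution cost, here is a minimal Python sketch. It assumes naive hash-modulo row placement and a single aggregate transfer rate; both are illustrative assumptions, not a description of how Vertica or any other vendor actually places or moves data. The function names and the 10 TB / 500 MB/s figures are hypothetical.

```python
import hashlib


def node_for(key: int, n_nodes: int) -> int:
    """Place a row key on a node via hash-modulo (an assumed, naive scheme)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


def fraction_moved(n_old: int, n_new: int, sample_keys: int = 100_000) -> float:
    """Estimate the share of rows whose home node changes after resizing."""
    moved = sum(
        1 for k in range(sample_keys)
        if node_for(k, n_old) != node_for(k, n_new)
    )
    return moved / sample_keys


def redistribution_hours(data_tb: float, moved_fraction: float,
                         throughput_mb_per_s: float) -> float:
    """Hours needed to move the affected data at a given aggregate rate.

    Uses decimal units (1 TB = 1,000,000 MB) and ignores any rebuild,
    sort, or re-compression work after the copy.
    """
    moved_mb = data_tb * 1_000_000 * moved_fraction
    return moved_mb / throughput_mb_per_s / 3600


if __name__ == "__main__":
    share = fraction_moved(4, 5)  # grow a 4-node cluster to 5 nodes
    # Hypothetical inputs: 10 TB of data, 500 MB/s aggregate transfer rate.
    hours = redistribution_hours(10, share, 500)
    print(f"share of rows moved: {share:.0%}, estimated time: {hours:.1f} h")
```

Under these assumptions, going from 4 to 5 nodes relocates roughly 80% of the rows, and moving 80% of 10 TB at 500 MB/s takes about 4.4 hours. Smarter placement schemes (consistent hashing, for example) would move less data, but the transfer still scales with database size, which is why hours are plausible at terabyte scale.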
Bottom line: If you do your data warehousing on a shared-everything OLTP DBMS, virtualizing it might make sense. But the benefits of virtualizing MPP/shared-nothing analytic DBMS seem questionable, except in certain specialized use cases.
Comments
2 Responses to “The questionable benefits of terabyte-scale data warehouse virtualization”
What you say is definitely valid. I would like to present a different view.
The flexibility of virtual machines is very important. You can develop on a virtual machine, then just copy that virtual machine to the production server and duplicate it as many times as you need. You can add machines without changing the database configuration – this makes everything move faster.
In a few months we will have 6-core CPUs, so paying about 10% CPU overhead is not critical.
I am not sure that ESX cannot handle a heavy disk load. EMC claims that an ESX server is capable of 63,000 I/Os per second.
It reminds me of the C++ vs. Java debate. You pay in CPU cycles for a modern language, but the flexibility makes it worthwhile, and sometimes enables you to work smarter and build a faster solution.
Sure – if you build the world’s largest warehouse, VMware will probably not be a good choice. But most of us are building medium-sized solutions, and flexibility is worth losing 10% CPU.
We have seen virtualization work well (or be required) in a couple of use cases:
[1] Internal testing: Aster Data has used virtualization to test/QA our DBMS on > 200 virtualized nodes. (Note: we have a production customer with ~100 nodes.)
[2] Cloud computing: ShareThis runs a 3+ TB Aster nCluster system on Amazon Web Services, which is virtualized on Amazon machine images (AMIs). You can listen to ShareThis talk about their ability to scale up/down with no downtime, because we handle data movement and administration completely online: https://asterdata.webex.com/asterdata/lsr.php?AT=pb&rID=29679512&rKey=98F898EA5CCA0BE8
(Note: ShareThis also handles full and incremental backups online.)