February 23, 2009

The questionable benefits of terabyte-scale data warehouse virtualization

Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole, it seems virtualization’s major benefits don’t apply to large-database MPP data warehousing.

A couple of years ago, I outlined four criteria for when to virtualize.

So basically none of these criteria speak well of virtualizing data warehousing, except in a few cases where you’re not entirely sure why you’re storing all that data in the first place.

There’s yet another problem with the whole idea of data warehouse virtualization. Suppose you turn the virtualization dial, and increase or decrease the number of nodes dedicated to your MPP analytic DBMS. What happens to the data? In most systems, the answer is “It gets redistributed among the disks.” That takes tens of minutes at the very least, and even that figure holds only if you believe everything you hear from vendor marketers. Hours or days might be more realistic.
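To make the redistribution cost concrete, here is a minimal back-of-envelope sketch (my own illustration, not from any vendor). It assumes the common naive scheme of assigning each row to a node by hashing its distribution key modulo the node count; the 8-to-10-node expansion, the 10 TB warehouse size, and the 1 GB/s effective bandwidth are all hypothetical numbers.

```python
# Hypothetical sketch: fraction of rows that must move when an MPP
# cluster using naive modulo hash partitioning grows from 8 to 10 nodes.

import hashlib

def node_for(key: int, node_count: int) -> int:
    """Map a row key to a node via simple modulo hashing."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % node_count

ROWS = 100_000
moved = sum(1 for key in range(ROWS)
            if node_for(key, 8) != node_for(key, 10))
print(f"rows relocated: {moved / ROWS:.0%}")  # roughly 80%

# Back-of-envelope timing: 80% of a hypothetical 10 TB warehouse is
# ~8 TB to reshuffle. Even at an optimistic 1 GB/s of usable
# disk-plus-interconnect bandwidth, that is ~8,000 seconds, i.e.
# hours rather than minutes.
```

Smarter schemes, such as consistent hashing or reassigning many small pre-created partitions wholesale, can cut the moved fraction to roughly the share of data the new nodes should own, but those bytes still have to cross the interconnect and be rewritten to disk.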

Bottom line: If you do your data warehousing on a shared-everything OLTP DBMS, virtualizing it might make sense. But the benefits of virtualizing an MPP/shared-nothing analytic DBMS seem questionable, except in certain specialized use cases.

Comments

2 Responses to “The questionable benefits of terabyte-scale data warehouse virtualization”

  1. tzahi jakubovitz on February 23rd, 2009 4:43 pm

    What you say is definitely valid. I would like to present a different view.
    The flexibility of virtual machines is very important. The fact that you can develop on a virtual machine, then just copy it to the production server and duplicate it as many times as you need, and add machines without changing database configuration – this makes everything move faster.
    In a few months we will have 6-core CPUs, so paying about 10% CPU overhead is not critical.
    I am not sure that disks cannot be stressed under ESX. EMC claims that an ESX server is capable of 63,000 IOs per second.
    It reminds me of the C++ vs. Java issues. You pay in CPU cycles for a modern language, but the flexibility makes it worthwhile, and sometimes enables you to work smarter and build a faster solution.
    Sure – if you build the world’s largest warehouse, VMware will probably not be a good choice. But most of us are building medium-sized solutions, and flexibility is worth losing 10% CPU.

  2. Steve Wooledge on February 23rd, 2009 4:44 pm

    We have seen virtualization work well (or be required) in a couple of use cases:

    [1] Internal testing: Aster Data has used virtualization to test/QA our DBMS on > 200 virtualized nodes. (Note: we have a production customer with ~100 nodes.)

    [2] Cloud computing: ShareThis runs a 3+ TB Aster nCluster system on Amazon Web Services, which is virtualized on their Amazon machine images (AMIs). You can listen to ShareThis talk about their ability to scale up/down with no downtime, because of the way we handle data movement and administration completely online: https://asterdata.webex.com/asterdata/lsr.php?AT=pb&rID=29679512&rKey=98F898EA5CCA0BE8
    (Note: ShareThis also handles full and incremental backups online.)
