This is a great read. I’ve added comments from my own experience on the matter as well.
1. First and most important, look at the big picture for why you are implementing virtualization. Most managers look solely at VMware, XenServer, Hyper-V or any other virtual server product for ROI (return on investment). That’s a bad way to make IT decisions! Look at the big picture. How will virtualization affect everything and everyone it touches? For example: How will your storage be affected when you start sharing it with hungry VMs; will the I/O hold up? How will your system administrators handle their new responsibilities? And how will you handle the users when they start complaining that everything is slow because you didn’t consider I/O and those new responsibilities?
2. Once you have a big-picture view of what you want to do with virtualization, accept that you’re still probably missing a few things that you will learn along the way. Treat these challenges as growing pains; they are unavoidable. A virtual environment changes dynamically as it grows and upgrades. First there’s the experimental ESX (or free ESXi), Hyper-V or XenServer host that gets you started. Then, when the experimental host(s) fill up, there’s the small farm of host servers that lands when you actually start purchasing new hardware and full infrastructure licenses. Beware of the “wow, we can virtualize everything” period that happens from 50 to 200 virtual servers. At this point everything seems to work fine because you haven’t saturated your SAN I/O, or host memory and CPUs. But then there’s the point, somewhere around VM number 201 (201 is a relative number; it could be more or less depending on a number of factors), where panic is unavoidable if you haven’t prepared properly. That’s why you need to read the rest of this post.
3. Now that considerations 1 and 2 are out of the way (they are mainly there to make IT managers think), I’ll get to the good stuff. Have a backup strategy from the beginning that is designed for backing up VM images. Don’t rely on your legacy backup software for physical servers. Yes, NetBackup or the like can still do agent-based file backups of a VM; that’s a no-brainer. But how are you going to do a full system restore? Unless it’s just data, your backup administrator will still be trying to solve that riddle five hours after he begins the restore. You want a good solution that makes an image-level backup. Solutions: VCB, vRanger Pro, Veeam Backup, Avamar. These are all backup tools built specifically for virtualization. Avamar can work with any type of virtual environment, including Sun containers.
4. Know your storage limits. Capacity is only one part of the storage requirement. The other part is I/O, or IOPS (Input/Output Operations Per Second). VMs have different I/O needs. One hungry database or SharePoint VM on a LUN that shares its disk parity group with multiple LUNs can cause performance problems across all the LUNs in that group. The best way I have found to avoid this is to design your storage with the biggest I/O pool available. I/O begins at the disk: 15K disks deliver roughly 200 IOPS, 10K disks about 150 IOPS, and SATA disks 30 to 50 IOPS. Do the math (Deinoscloud added: check here and here for the formulas) and decide which is better. After capacity and I/O are considered, there is the pathing, which needs to be configured manually to split I/O down multiple paths to the SAN/NAS cache. I’ve seen million-dollar equipment brought to its knees because this stuff was overlooked. It’s usually not the equipment (HP, NetApp, EMC) causing the problem; it’s the configuration. Whether you plan to use FC, NFS or iSCSI, this is important for your storage administrator to consider. Otherwise, you will be playing VM storage Tetris, and I guarantee you will lose.
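The “do the math” step above can be sketched in a few lines. This is a back-of-the-envelope sketch only: it uses the rule-of-thumb per-disk figures from the text, and the RAID write penalties and the 70/30 read/write mix in the example are my own assumptions, not from the original post.

```python
# Rule-of-thumb per-disk IOPS figures from the text; real numbers vary by drive model.
DISK_IOPS = {"15k": 200, "10k": 150, "sata": 40}

# Assumed RAID write penalties: each front-end write costs this many back-end I/Os.
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def usable_iops(disk_type, disk_count, raid_level, read_pct):
    """Effective front-end IOPS of a parity group for a given read/write mix."""
    raw = DISK_IOPS[disk_type] * disk_count
    penalty = RAID_WRITE_PENALTY[raid_level]
    write_pct = 1.0 - read_pct
    # Front-end rate such that reads (1 back-end I/O each) plus writes
    # (penalty back-end I/Os each) fit within the raw spindle capacity.
    return raw / (read_pct + write_pct * penalty)

# Example: ten 15K disks in RAID5 with an assumed 70/30 read/write mix.
print(round(usable_iops("15k", 10, "raid5", 0.7)))  # → 1053
```

Note how the RAID5 write penalty cuts ten disks’ 2,000 raw IOPS roughly in half; that is exactly the kind of gap between nameplate capacity and real throughput that leads to VM storage Tetris.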
5. In conjunction with 4: VM template configuration. If you have a huge pool of I/O, you may never notice that your template configuration is poor. VM configuration is important and easy to overlook; most will find out how important when the I/O runs out. I’ve read this best practice on many blogs: “put data, OS and even swap files on separate LUNs.” I agree this is a good best practice, but I am taking it even further and adding a criterion: “separate LUNs on separate disk parity groups.” Here’s why: ten 15K disks will give you roughly 1,500 IOPS shared across every LUN carved from them. Depending on drive size you may have LUNs of 200 to 500 GB (each with 10 to 20 I/O-hungry VMs) all sharing the same IOPS. Splitting data, OS and swap onto more spindles gives you more IOPS, and possibly an alternate path to the second storage processor (active/active) or more cache assigned to another FC or NIC port. Make sure datastore names include what the LUN is for (data, OS or swap) and odd/even disk parity group (data goes on odd, OS goes on even). Deinoscloud added: I would add a general comment here: swapping is bad! If your VMs are swapping, it means you did not give them enough memory to run their apps properly. Check out this article to find out how much disk space you should allocate for swapping and, furthermore, how much memory you should allocate to your VMs.
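The arithmetic behind “separate LUNs on separate disk parity groups” is worth making explicit. A minimal sketch, using the post’s figure of roughly 1,500 IOPS for a ten-disk 15K group; the LUN and VM counts are illustrative assumptions:

```python
# One parity group of ten 15K disks ≈ 1500 usable IOPS (figure from the text).
GROUP_IOPS = 1500

def per_vm_budget(luns_on_group, vms_per_lun):
    """Average IOPS per VM when all LUNs carved from one parity group
    contend for the same spindles."""
    return GROUP_IOPS / (luns_on_group * vms_per_lun)

# Assumed layout: one group holding 4 LUNs with 15 VMs each.
shared = per_vm_budget(4, 15)          # every VM averages only 25 IOPS

# Split OS, data and swap across three separate parity groups instead:
# the same VMs now draw on three groups' worth of spindles.
split = 3 * GROUP_IOPS / (4 * 15)      # 75 IOPS per VM

print(round(shared), round(split))
```

Same capacity on paper, three times the spindle budget per VM, which is the whole point of spreading data, OS and swap across parity groups.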
6. Clean up your messes. Don’t leave old proof-of-concept (POC) VMs or equipment running after the POC is done. Nothing is harder than cleaning up a VM environment two years after everyone on the original project team has left and your inventory has grown to 500 VMs. When you hit your host and storage limits, this is the first place to look. Out of 500 VMs you can bet there are at least 50 zombies idly running and using up precious resources. Then there’s the cleanup of zombie VM folders left on the datastore by improperly deleted VMs (you know, the VM you said you’d delete later; that was two years ago). Cleanup also helps control “sprawl,” which is a fancy word for out of control. Deinoscloud added: a great and FREE tool is available to help you reclaim wasted disk space: vOptimizer WasteFinder.
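Hunting zombies scales badly by hand at 500 VMs, but the triage logic is simple enough to script against an inventory export. A sketch under assumed data: the inventory tuples, field names and the 180-day idle threshold are all hypothetical, not from any particular tool.

```python
from datetime import date

# Hypothetical inventory export: (vm_name, powered_on, last_io_date, owner).
inventory = [
    ("poc-web01", True, date(2008, 3, 1), "project-alpha"),
    ("prod-db01", True, date(2010, 5, 9), "dba-team"),
    ("poc-test7", False, date(2007, 11, 20), ""),
]

def find_zombies(vms, today, idle_days=180):
    """Flag VMs with no recent I/O, or no owner on record, as cleanup candidates."""
    zombies = []
    for name, powered_on, last_io, owner in vms:
        idle = (today - last_io).days > idle_days
        if idle or not owner:
            zombies.append(name)
    return zombies

print(find_zombies(inventory, date(2010, 6, 1)))  # → ['poc-web01', 'poc-test7']
```

The output is a candidate list, not a kill list: confirm with the (possibly departed) owners before deleting, which is exactly why recording an owner per VM from day one matters.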
7. You probably didn’t hear me the first time, so I’m saying “backups” again. I’m repeating it to make sure you have a backup solution that backs up the complete VM image. It’s no easy task to change your backup process two years and 500 VMs later, so make sure you do this right from the start.
8. Establish standards for your environment. All hosts should be on the certified version of ESX, or whatever hypervisor you use. Once you allow old hosts to stay around after you have decided to build new hosts on the current ESX version, it won’t be long before your virtual infrastructure is fragmented. Remember, virtualization is evolving almost daily, and new features arrive with each new version of ESX and Hyper-V. Live migration didn’t work on the old version of Hyper-V but it works on R2; it doesn’t work across R1 to R2 or R2 to R1, though. Get all those R1 hosts upgraded to R2 so everything is the same and live migration works. Keeping the standard isn’t easy because VM administrators are also system administrators: they have to land the servers, configure the hosts, deploy the VMs and configure the VMs. It’s the same people doing both jobs, and in some cases they are storage and network administrators too. Make sure you have enough staff to maintain your standards. I’ve known more than a few overworked, underpaid and misunderstood VM administrators in my time. Deinoscloud added: you pay peanuts, you get monkeys 🙂
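Spotting the fragmentation described above is a one-liner once you track host versions somewhere. A tiny sketch with made-up host names and version strings, just to show the drift check:

```python
# Hypothetical standard: the certified hypervisor build all hosts must run.
CERTIFIED = "ESX 3.5 U4"

# Hypothetical host -> installed-version map (e.g. from a CMDB export).
hosts = {"esx01": "ESX 3.5 U4", "esx02": "ESX 3.0.2", "esx03": "ESX 3.5 U4"}

# Hosts that have drifted from the standard and can break live migration.
drifted = sorted(h for h, v in hosts.items() if v != CERTIFIED)
print(drifted)  # → ['esx02']
```

Run something like this on a schedule and fragmentation shows up as a non-empty list instead of as a failed live migration at 2 a.m.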
9. I hate this one as much as any true IT professional, but someone has to keep doing the job if you leave and take a better-paying job somewhere else. Keep good documentation. Cool Visio diagrams of everything are nice for management if required, but even more important for the day-to-day support staff are “how to” documents. How to land and provision a host (hardware and hypervisor). How to deploy a VM. How to add disk space to the “C” drive of a VM. How to P2V a system. How to properly request more storage. How to decommission a VM. How to schedule a VM backup. How to recover a VM from a backup. And keep the “how to” documents up to date: you need a new set for each version of ESX because they are not the same; customizing the swap volume, for example, is different on 2.5, 3.0 and 3.5. Hyper-V and XenServer have their own little tweaks as well.
10. Don’t buy every tool out there thinking it’s going to fix everything I have spent the last two hours writing about. Listen to what I am saying. Listen to your support staff. Listen carefully to vendors who want to sell you something, because there is no silver bullet for poor planning. And while on the subject of vendors, scrutinize any consultant’s recommendation that comes with a direct connection to an equipment vendor. I’ve seen the best SAN money can buy collapse under 25 VMs because it was used haphazardly (VM storage Tetris). Many of the problems I have warned you about can be avoided if you plan. Read numbers 1 and 2 again until this makes sense. To the VM administrators fighting daily battles because most of what I have written about is already happening in your virtual environment: I feel your pain. To all the new bright-eyed IT managers and system administrators licking their chops because they are finally getting a budget to start virtualizing, I warn you: “Consider the big picture and plan, plan, plan!”
Hopefully this post has been helpful. Items not covered here include: how to monitor VMs and host servers, disaster recovery (DR) of virtual environments, capacity planning, forecasting, and hardware brands and types (servers, network and storage). These can be topics for the next top-10 list. My final note: “backups” will challenge traditional thinking, so heed my warnings.