The other day at a customer site, I had to troubleshoot some performance issues with a two-year-old HP MSA2000i. The ‘i’ stands for iSCSI. The storage runs two active/active iSCSI HBA controllers, and each controller also has one gigabyte of clustered R/W cache memory. I say clustered cache because each iSCSI HBA controller’s write cache is copied to the other iSCSI HBA controller’s cache. This is the default behaviour and can be disabled, provided you clearly understand the impact should one of the iSCSI HBA controllers fail! The storage was set up with two ESX 3.5 hosts attached, and those two hosts were recently upgraded to ESX 4.0U1. Whilst the performance is acceptable, I decided to install on the management VM a great new tool from VKernel called VKernel AppVIEW, which in the free version monitors up to your five most important VMs and displays detailed data on how these applications are running in your virtual environment. See screenshot below.
Immediately the tool came back with disk I/O alerts for the customer’s five most important VMs. The average disk I/O latency was around 19ms… Ouch, that’s not good, although the customer is satisfied with the overall performance of his virtual environment, or at least he hasn’t noticed any performance degradation… yet! So with the customer’s agreement, I decided to check every aspect of the storage setup, from the ESX 4.0 hosts, vSwitches and datastores, through the cabling & pSwitches, down to the storage device itself. Here are my tweaks and findings.
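To put that 19ms figure in context, here is a tiny triage helper using common community rules of thumb for average disk latency (the thresholds are illustrative guidance I use myself, not vendor figures):

```python
# Rough triage of average disk latency readings against common
# rules of thumb (thresholds are illustrative, not from the MSA docs).
def triage_latency(avg_ms):
    if avg_ms < 10:
        return "healthy"
    if avg_ms < 20:
        return "warning: investigate the storage path"
    return "critical: sustained queuing likely"

print(triage_latency(19))  # the value AppVIEW reported here
```

At 19ms we are right at the top of the warning band, which is exactly why a methodical walk through the whole storage path was worth the effort.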
As you can see in the pictures, there are two iSCSI VMkernel ports sharing the same vSwitch. Uplink is provided by two pNICs, both set to gigabit/full-duplex in active/active teaming mode. The pNICs are both Broadcom NetXtreme II BCM5708 models, and although both cards do TCP/IP offload (TOE) and accelerated iSCSI, unfortunately ESX 4.0U1 can’t use the TOE yet (N.B. it is supported only with iSCSI HBAs). TOE support is expected in the next vSphere release (read more at Virtualization.info). Finally, all Traffic Shaping is disabled here. The vSwitch setup looks good to me, let’s go deeper.
First issue discovered: all datastores located on the HP MSA2000i were set to MRU. Let’s reconfigure them to Round Robin (VMware). FYI, you would use MRU or Fixed Path if you were in an Active/Passive SP mode. The datastore setup looks good now, let’s move on to the next item.
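To illustrate why this switch matters, here is a minimal Python sketch (the path runtime names are hypothetical) contrasting MRU, which pins every I/O to the most recently used path, with Round Robin, which spreads I/Os across both active controllers:

```python
# Minimal sketch of MRU vs Round Robin path selection across two
# active/active controller paths (illustrative only).
from itertools import cycle

paths = ["vmhba33:C0:T0:L0", "vmhba33:C0:T1:L0"]  # hypothetical runtime names

def mru(io_count):
    # MRU sticks to the most recently used path: one controller sits idle.
    return [paths[0]] * io_count

def round_robin(io_count):
    # Round Robin alternates I/Os across both active paths.
    rr = cycle(paths)
    return [next(rr) for _ in range(io_count)]

print(mru(4))          # all four I/Os land on one path
print(round_robin(4))  # I/Os spread evenly across both controllers
```

With MRU on an active/active array, half of the available controller bandwidth is simply never used.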
– Software iSCSI module and LUN Queue depth
The software iSCSI module has a queue depth of 64 per LUN by default, which is a fairly good value in this environment. You can change the iSCSI queue depth with the vicfg-module (or esxcfg-module) command. Note that the iSCSI SAN Configuration Guide gives a wrong command. Here is the correct command to issue to your vSphere host: esxcfg-module -s iscsivmk_LunQDepth=XX iscsi_vmk
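As a back-of-the-envelope illustration of what the queue depth buys you (the numbers are assumed, not measured on this array), Little’s law ties the outstanding I/Os a LUN keeps in flight to the IOPS achievable at a given latency:

```python
# Little's law: IOPS = outstanding I/Os / latency.
# Illustrative figures only, not measurements from this customer's array.
def iops(outstanding_ios, latency_s):
    return outstanding_ios / latency_s

# At the observed ~19 ms average latency:
print(iops(32, 0.019))  # ~1684 IOPS with 32 outstanding I/Os
print(iops(64, 0.019))  # ~3368 IOPS if the array could sustain depth 64
```

Of course doubling the queue depth only helps if the array can actually service the extra concurrency; otherwise the latency simply climbs, which is why I want to measure before and after.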
I could not reboot the hosts at will (no vMotion here), thus I could not really evaluate the benefit of increasing the iSCSI LUN queue depth, but I’m keeping that in mind for the next datacenter maintenance weekend. The software iSCSI setup is OK, next item…
– Cabling and pSwitches
The customer uses CAT6 cables and unmanaged HP gigabit switches. The MTU is left at the default of 1500 bytes; the customer doesn’t use Jumbo Frames in his setup. I explained the benefits of Jumbo Frames to the customer, but we left it ‘as-is’ for the moment. The customer will consider it upon the next technology refresh. Good stuff, next item…
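To put a rough number on the benefit we are leaving on the table, here is a simple payload-efficiency calculation for standard vs jumbo MTU (header sizes are the standard Ethernet/IP/TCP figures; iSCSI PDU overhead and CPU savings are ignored for simplicity):

```python
# Rough per-frame payload efficiency for a TCP stream at a given MTU.
# 40B = 20B IPv4 header + 20B TCP header; 38B = 14B Ethernet header
# + 4B FCS + 20B preamble/inter-frame gap on the wire.
def payload_efficiency(mtu, headers=40):
    payload = mtu - headers
    wire = mtu + 38
    return payload / wire

print(round(payload_efficiency(1500), 3))  # ~0.949 at MTU 1500
print(round(payload_efficiency(9000), 3))  # ~0.991 at MTU 9000
```

A few percent of raw throughput, plus fewer frames per second for the switches and NICs to process, which is where the real win for software iSCSI usually comes from.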
– Storage (HP MSA2000i)
In the picture you can see the storage virtual disk layout: a 900GB RAID10 virtual disk with fast 10K HDDs and a 3000GB RAID5 virtual disk with 7.2K HDDs. The chunk size is 64KB, and fortunately the VMFS partitions are aligned, since they were created with the VC Client. Apart from the missing spare disk, so far so good.
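A couple of quick sanity checks in Python (the drive counts and the sector-128 start offset below are assumptions for illustration, not values taken from this array):

```python
# Usable capacity for the two classic RAID levels on the MSA.
def raid10_usable(disks, size_gb):
    return disks // 2 * size_gb   # mirrored pairs: half the raw capacity

def raid5_usable(disks, size_gb):
    return (disks - 1) * size_gb  # one disk's worth of parity

print(raid10_usable(6, 300))  # 900GB, e.g. six 300GB 10K drives (assumed)
print(raid5_usable(5, 750))   # 3000GB, e.g. five 750GB 7.2K drives (assumed)

# A VMFS partition start offset should be a multiple of the 64KB chunk:
def aligned(start_offset_bytes, chunk=64 * 1024):
    return start_offset_bytes % chunk == 0

print(aligned(128 * 512))  # sector 128 = 64KB, the typical VC Client offset
print(aligned(63 * 512))   # sector 63, the classic misaligned MBR offset
```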
In the picture we can see the Read Ahead Cache Size. It was set to Default, that is 64KB (based on the default 64KB chunk size). For the fast RAID10 virtual disk I set it to Maximum, which means the RAID controller dynamically calculates the maximum read-ahead size for the volume. This option is to be used only when disk drive latencies must be absorbed by cache. Do not use this setting if more than two volumes exist.
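A toy model (assuming a purely sequential stream reading one chunk at a time) of why a deeper read-ahead window absorbs disk latency:

```python
# Toy model: with a read-ahead window of N chunks, a purely sequential
# stream goes to disk only once per N chunks (illustrative only).
def sequential_hit_ratio(readahead_chunks):
    return 1 - 1 / readahead_chunks

print(sequential_hit_ratio(1))   # 0.0 -> no read-ahead, every read hits disk
print(sequential_hit_ratio(64))  # 0.984375 -> most reads served from cache
```

The flip side is that aggressive read-ahead wastes cache on random workloads, which is why the controller warns against using Maximum with more than two volumes.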
For the Cache Optimization, I left it at Standard, which is fine for storage dedicated to a VMware cluster environment with far more random than sequential accesses. The other Cache Optimization option, called Super-Sequential, is to be used only if your application is strictly sequential and requires extremely low latency, like video streaming/editing.
For the second virtual disk I chose different settings because it serves other purposes: it is mainly a backup disk. I say mainly because a couple of VMs still have VMDKs located there… Yes I know, it is not a smart move, but hey, as the customer says: “If we only had backups over there, it would be a waste of disk space…” and since VMware is about consolidation and better resource utilization, the customer’s point of view and arguments are valid IMO. Now I have to make sure the storage copes with this situation the best way it can. I set the Read Ahead Cache to 4MB and left the Cache Optimization at Standard.
N.B. To access those features on an HP MSA storage device, you must log on to the storage device with an account that has the Access Level set to Manage and the User Type set to Diagnostic.
In the SCSI Configuration Options, the Sync Cache Mode is set to Immediate (aka Write-Back Cache). This is the preferred method because the I/O ACK is returned without waiting for the data to be flushed to the disks. Flush to Disk (aka Write-Through) is preferred in rare cases, such as banking applications, where a good status is returned only after all data are flushed to disk.
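The latency difference is easy to picture with assumed numbers (the timings below are illustrative, not measurements from the MSA):

```python
# Toy comparison of write acknowledgement latency: write-back acks after
# the cache write, write-through only after the disk flush.
CACHE_WRITE_MS = 0.1   # assumed controller cache write time
DISK_FLUSH_MS = 8.0    # assumed rotational disk write time

def ack_latency(write_back):
    return CACHE_WRITE_MS if write_back else CACHE_WRITE_MS + DISK_FLUSH_MS

print(ack_latency(True))   # write-back: ~0.1 ms
print(ack_latency(False))  # write-through: ~8.1 ms
```

Roughly two orders of magnitude per write, which is why write-back plus mirrored cache (for safety) is the usual choice on this class of array.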
The Host Control of Write-Back Cache is set to Disabled. We don’t want the hosts, VMware ESX servers for instance, to control this behavior and switch off the Write-Back Cache, even though I don’t think the hosts can do that. That’s a question for the community!
Note that there are triggers that cause a controller to automatically switch from Write-Back caching to Write-Through caching. You can also specify actions for the system to take when Write-Through caching is triggered. Here I use the default settings. Next screenshot…
The Independent Cache Performance Mode is a very important feature. In the storage system’s default operating mode, Active/Active, data for volumes configured to use Write-Back Cache is automatically mirrored between the two controllers. Cache redundancy has a slight impact on performance but provides fault tolerance: if an iSCSI controller fails, the other iSCSI controller takes over WITH NO DATA LOSS because the cache is mirrored. You can disable cache redundancy, which permits independent cache operation for each controller, but in that case you no longer benefit from a redundant iSCSI controller setup.
The Link Speed should be set to Force 1Gbit. We don’t want auto-negotiation to ‘slow down’ things or eventually fall back to half-duplex mode. Also, because the storage is connected over a private network with its own network switches, we use neither CHAP nor VLAN tagging. The only thing I was willing to change but could not is the Jumbo Frames setting. I will have to wait for the next technology refresh or datacenter maintenance weekend.
Voilà, I have checked all those items and changed a few settings to have a stable baseline setup before going deeper in my investigation. The customer has already noticed an improvement, but this is not the end. Now I have to collect ESXTOP data for some time and then go through some analysis! There is also some work to do at the VM level: upgrade to the latest VMware Tools, upgrade to VMXNET3 (a couple of VMs are still using E1000), rethink the VMDK placement for two of the five most important VMs, etc… A good topic for my next blog post, who knows… Stay tuned!