The other day I read some blog posts by Scott Drummonds. The information in his posts is extremely valuable, especially if you are troubleshooting storage performance issues. Let me summarize here what I found most useful. First let's look at the relevant counters, then at how to read them using ESXTOP, and finally at how to correct the system.
| Counter | VirtualCenter | esxtop | Description |
|---|---|---|---|
| Queued Disk Commands | disk.queueLatency | QUED | Queued commands wait in the kernel queue for an open slot in the device driver queue. A large number of queued commands means a heavily loaded storage system. See Storage Queues and Performance for information on queues. |
| Queue Usage | Not available | %USD | Tracks the percentage of the device driver queue that is in use. See Storage Queues and Performance for info on this queue. |
| Active Commands | | ACTV | VirtualCenter reports the number of commands issued in the previous sample period; esxtop provides a live look at the number of commands being processed at any one time. Consider these counters a snapshot of activity, but don't consider any number here "too much" until large queues start developing. |
| HBA Load | Not available | LOAD | In esxtop the LOAD counter tracks how full the device queues are. Once LOAD exceeds one, commands start to queue in the kernel. See Storage Queues and Performance for information on these queues. |
| Storage Device Latency | disk.deviceReadLatency | DAVG/cmd | These counters track the latencies of the physical storage hardware, covering everything from the HBA to the platter. |
| Kernel Latency | | KAVG/cmd | These counters track the latencies added by the kernel's command processing. |
| Total Storage Latency | Not available | GAVG/cmd | This is the latency that the guest sees to the storage. It is the sum of the DAVG and KAVG stats. |
| SCSI Aborts | | ABRTS/s | These counters track SCSI aborts. Aborts generally occur because the array is taking far too long to respond to commands. |
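The relationship between the latency counters is simple enough to put into code. Here is a minimal sketch of the GAVG = DAVG + KAVG arithmetic, with warning thresholds; the 20 ms / 2 ms values are common rules of thumb I am assuming, not official VMware limits:

```python
# Rule-of-thumb thresholds (assumptions, not official limits)
DAVG_WARN_MS = 20.0   # device latency above this usually points at the array/HBA
KAVG_WARN_MS = 2.0    # kernel latency above this usually points at queuing

def gavg(davg_ms: float, kavg_ms: float) -> float:
    """GAVG/cmd, the latency the guest sees, is the sum of DAVG/cmd and KAVG/cmd."""
    return davg_ms + kavg_ms

def diagnose(davg_ms: float, kavg_ms: float) -> str:
    """Map the two latency components to a first guess at the bottleneck."""
    if davg_ms > DAVG_WARN_MS:
        return "high device latency: inspect the array and HBA"
    if kavg_ms > KAVG_WARN_MS:
        return "high kernel latency: commands are queuing in the VMkernel"
    return "latencies look healthy"

print(gavg(18.0, 0.5))      # the guest sees 18.5 ms per command
print(diagnose(25.0, 0.4))
```

In other words: a high GAVG with a low KAVG points below the kernel (HBA, fabric, array), while a high KAVG means commands are sitting in the VMkernel queues.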
As before, esxtop is the best place to start when investigating potential performance issues. To view the disk adapter information in esxtop, press the 'd' key once it is running.
On ESX Server 3.5, the storage statistics can also be displayed per VM (using 'v') or per storage device (using 'u'), but the same counters are shown in each view. Look at the following items:
- For each of the three storage views:
- On the adapter view (‘d’), each physical HBA is displayed on a row of its own with the appropriate adapter name. This short name may be checked against the more descriptive data provided through the Virtual Infrastructure Client to identify the hardware type.
- On ESX Server 3.5’s VM disk view (‘v’), each row represents a group of worlds on the ESX Server. Each VM will have its own row and rows will be displayed for the console, system, and other less-important (from a storage perspective) worlds. The groups’ IDs (GID) match those on the CPU screen and can be expanded by pressing ‘e’.
- On ESX Server 3.5’s disk device view (‘u’), each device is displayed on its own row.
- As with the other system screens, the disk displays can have groups expanded for more detailed information:
- The HBAs listed on the adapter display can be expanded with the 'E' key to show the worlds that are using those HBAs. By finding a VM's world ID, you can see that world's activity on the expanded row with the matching world ID (WID) column.
- The worlds for each VM can be displayed by expanding the VM row on the VM disk view with the ‘e’ key.
- The disk devices on the device display can be expanded to show usage by each world on the host.
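For anything beyond a live look, esxtop's batch mode (`esxtop -b -d 2 -n 10 > storage.csv`) writes these same counters to CSV for offline analysis. Below is a hedged sketch of pulling per-adapter DAVG/cmd averages out of such a file; the exact column header text ("Average Driver MilliSec/Command") is my assumption and may differ between ESX versions, and the sample data is made up:

```python
import csv
import io

# Made-up stand-in for a real esxtop batch-mode CSV (header format assumed)
SAMPLE = (
    '"Time",'
    '"\\\\esx1\\Physical Disk Adapter(vmhba1)\\Average Driver MilliSec/Command",'
    '"\\\\esx1\\Physical Disk Adapter(vmhba2)\\Average Driver MilliSec/Command"\n'
    '"10:00:00","4.2","11.9"\n'
    '"10:00:02","5.1","13.3"\n'
)

def average_davg(csv_text, header_substring="Average Driver MilliSec/Command"):
    """Return {column header: mean value} for every counter column that matches."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    result = {}
    for name in rows[0]:
        if header_substring in name:
            values = [float(r[name]) for r in rows]
            result[name] = sum(values) / len(values)
    return result

for adapter, davg in average_davg(SAMPLE).items():
    print(adapter, round(davg, 2))
```

Averaging over a longer capture smooths out the snapshot effect mentioned above for the ACTV counter.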
Correct the System
Corrections for these problems can include the following:
- Reduce the guests' and the host's need for storage.
- Some applications, such as databases, can use system memory to cache data and avoid disk access. Check inside the VMs to see whether they would benefit from larger caches, and give the VM more memory if resources permit. This may reduce the burden on the storage system.
- Eliminate all possible swapping to reduce the burden on the storage system. First verify that the VMs have the memory they need by checking swap statistics in the guest. Provide memory if resources permit. Next, as described in the “Memory” section of this paper, eliminate host swapping.
- Configure the HBAs and RAID controllers for optimal use. It may be worth reading Storage Queues and Performance for information on how disk queueing works.
- Increase the number of outstanding disk requests for the VM by adjusting the "Disk.SchedNumReqOutstanding" parameter. For detailed instructions, check the "Equalizing Disk Access Between Virtual Machines" section in the "Fibre Channel SAN Configuration Guide". This step and the following one must both be applied for either to take effect.
- Increase the queue depths of the HBAs. Check the section "Setting Maximum Queue Depth for HBAs" in the "Fibre Channel SAN Configuration Guide" for detailed instructions. Note that you have to set two variables to change queue depths correctly. This step and the previous one must both be applied for either to take effect.
- Make sure the appropriate caching is enabled on the disk controllers. You will need the vendor-provided tools to verify this.
- If latencies are high, inspect array performance using the vendor's array tools. When too many servers simultaneously access common elements on an array, the disks may have trouble keeping up. Consider array-side improvements to increase throughput.
- Balance load across the physical resources that are available.
- Spread heavily used storage across LUNs being accessed by different adapters. The presence of separate queues for each adapter can yield some efficiency improvements.
- Use multipathing or multiple links if the combined disk I/O exceeds the capacity of a single HBA.
- Use VMotion to migrate I/O-intensive VMs to different ESX Servers, if possible.
- Upgrade hardware, if possible. Storage system performance often bottlenecks storage-intensive applications, but for the very highest storage workloads (many tens of thousands of I/Os per second) a CPU upgrade in the ESX Server will increase the host's ability to handle I/O.
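The queue-depth and multipathing advice above has a simple capacity argument behind it. By Little's law, a path with queue depth Q and average per-command latency L can keep at most roughly Q / L commands per second in flight before queuing builds up. A minimal sketch of that back-of-the-envelope math, under idealized assumptions (uniform latency, no burstiness):

```python
def max_iops(queue_depth: int, avg_latency_ms: float) -> float:
    """Idealized upper bound on IOPS one path can sustain (Little's law)."""
    return queue_depth / (avg_latency_ms / 1000.0)

# Example: HBA queue depth 32, 5 ms average device latency -> ~6400 IOPS
print(max_iops(32, 5.0))
```

If the VMs behind one HBA need more than that, either spread the load across additional adapters/paths or reduce per-command latency on the array side; raising the queue depth alone only helps until the array latency climbs in response.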