A Year Of Blogging In Summary And Season’s Greetings

2011 is coming to an end and it’s time for some reflection on this year’s blogging experience! That sounds familiar :)

In May 2011 I joined VMware in a permanent position, as part of a top-notch team of Consultants. La crème de la crème, as we say in French.

I was honored to be named a vExpert 2011. That’s two years in a row!

The VMware vExpert program was created in 2009 to show appreciation for those individuals who have significantly contributed to the community of VMware users over the past year. Many thanks go to the committee.

This year I also successfully passed my VCP5 certification. En route to VCAP certifications now and eventually VCDX!

Unfortunately I had to slow down my blogging activities this year. There are priorities in my life at the moment and among them are my sweet baby girl and beloved wife.

Nevertheless, what would a year of blogging be without some blog site summary tables, statistics and charts ;)

Here are my top 10 posts of 2011 in terms of page views only. These are not necessarily my favorite blog posts though. Maybe an idea for another blog post :)

Title Views
Installing Oracle Database Client 10g Release 2 (10.2) on a Windows 2008 R2 x64 9,620
One Of The Most Powerful Shuttle Barebone For My VMware Home Lab 8,484
vSphere – Virtual Machine Startup and Shutdown Behavior 6,982
Microsoft Network Load Balancing (NLB) on VMware ESX 6,285
Upgrade ESXi4.0 to ESXi4.1 – The Unofficial Method 5,595
Understanding VMFS Block Size And File Size 4,850
How to increase the size of a local datastore … on an ESXi4? 4,826
How To Set Up a Trunk Port Between An ESXi4.1 And An HP ProCurve 1810g-24 4,563
Understanding disk IOPS 4,443
How To Troubleshoot a Broken RAID Volume On a QNAP Storage Device 4,363

 

Again a big thank you to all my readers.

Best Wishes and a Happy New Year 2012.

 

 

Cluster Profiles

This is the English version of a blog post from Raphael Schitz at hypervisor.fr.

Raphael is a very smart guy, a fellow vExpert and a PowerCLI guru. Recently he came up with a great idea, which turned into a great blog post and a powerful script available for free. All credit goes to Raphael.

There is no need to remind you of the benefits of Host Profiles in terms of configuration consistency and correctness across the datacenter. With PXE Manager and PowerCLI you can free yourself from the hassle of deployment, and with the help of Host Profiles you can automate and monitor host configuration management (these features were greatly improved in vSphere 5).

Unfortunately Cluster configuration management hasn’t improved at the same pace and remains tedious, with no visibility into changes. You configure your Cluster settings properly and 6 months later, after a few maintenance windows and some changes (e.g. Admission Control disabled and DRS set to Partially Automated), you find yourself in a situation where a broken blade powers off VMs which are unable to restart on other hosts in the Cluster because someone forgot to re-enable HA. We have experienced this situation, and we hope our latest PowerCLI script will help us change those bad habits and behaviors once and for all: meet Manage-ClusterProfile.

Manage-ClusterProfile was developed for three simple tasks:

  • export Cluster configuration and settings to a cluster profile file.
  • compare Cluster configuration and settings with a cluster profile file.
  • import a cluster profile file to an existing Cluster.

The cluster profile file, which is an XML file, contains the entire configuration and settings of a Cluster (HA, DRS, DPM, rules, swapfile, etc…) and therefore allows a detailed comparison of similarities and differences.
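To give an idea of what such a comparison involves, here is a minimal Python sketch. It is not Raphael’s PowerCLI script, and the element names (ClusterProfile, HA, AdmissionControl, DRS, Mode) are hypothetical; the real schema is whatever Manage-ClusterProfile exports. The sketch flattens two XML documents into path/value pairs and reports every setting that differs.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Flatten an XML document into a {path: text} dict of its leaf elements.

    Note: repeated sibling tags share one path, so the last one wins.
    That is good enough for a sketch, not for a real profile with rule lists.
    """
    root = ET.fromstring(xml_text)
    settings = {}

    def walk(element, path):
        children = list(element)
        if not children:
            settings[path] = (element.text or "").strip()
        for child in children:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return settings


def compare_profiles(reference_xml, current_xml):
    """Return {path: (reference_value, current_value)} for every difference."""
    reference = flatten(reference_xml)
    current = flatten(current_xml)
    drift = {}
    for path in sorted(set(reference) | set(current)):
        if reference.get(path) != current.get(path):
            drift[path] = (reference.get(path), current.get(path))
    return drift


# Hypothetical profile contents, for illustration only.
REFERENCE = (
    "<ClusterProfile><HA><Enabled>true</Enabled>"
    "<AdmissionControl>true</AdmissionControl></HA>"
    "<DRS><Mode>fullyAutomated</Mode></DRS></ClusterProfile>"
)
CURRENT = (
    "<ClusterProfile><HA><Enabled>true</Enabled>"
    "<AdmissionControl>false</AdmissionControl></HA>"
    "<DRS><Mode>partiallyAutomated</Mode></DRS></ClusterProfile>"
)

for path, (expected, actual) in compare_profiles(REFERENCE, CURRENT).items():
    print(f"{path}: expected {expected!r}, found {actual!r}")
```

An empty result means the Cluster still matches its profile; anything else is drift worth a look.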

Optionally you can send an email to vAdmins for instance.

The import function addresses only the Cluster’s own configuration and settings. For instance, Affinity Rules or any other per-VM settings (e.g. HA/DRS/DPM customization) are not imported.

The script has the following input parameters:

  • ManagedCluster [name of the Cluster]
  • Action [import|export|check]
  • ProfilePath [directory for export|path to xml cluster profile file for import and check]
  • SendMail [1 for enable]
  • ForceImport [1 for enable]

To summarize, this script allows you to create new Clusters by importing cluster profile templates with all your predefined configuration and settings tailored to your own criteria. Run as a scheduled task, it also allows you to track changes and stay compliant. Of course, when you make intentional changes to your Cluster, you will have to export them again to the cluster profile file that you use to track changes.

As usual do not hesitate to share your feedback and suggestions in the comment area :)

Enjoy !

Download Manage-ClusterProfile

A Little Sneak Peek At The Future Of Low Latency Ethernet

As the table below shows, network latency has improved far more slowly over the last three decades than other performance metrics for commodity computers.

While a 5-10μs round-trip latency seems achievable within a few years, what about reducing RPC latency to 1μs in the long term?

And if we just integrate NIC functionality onto the main CPU die…

Stephen M. Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum (one of the co-founders of VMware), and John K. Ousterhout at Stanford University co-authored a paper called It’s Time for Low Latency.

If you want a sneak peek at the future of low network latency, that’s definitely a paper to read ;)

Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs

I came across this technical paper called Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs

Abstract from the paper:

This white paper summarizes findings and recommends best practices to tune the different layers of an application’s environment for latency-sensitive workloads.

If you are about to virtualize low-latency workloads, or are simply looking at tuning your existing virtual environment for such workloads, this is the technical paper you need to read!

I like the tabulated summary at the end of the technical paper and I have pasted a copy here. It makes a very convenient checklist.

Some of these technical papers are like diamonds. To stay on top of the latest VMware technical papers and other information, create your own custom RSS feed at VMware.

DIMMs And The Intel Nehalem Memory Architecture Connection

In this post I want to focus only on the intrinsic connection between DIMMs and the Intel Nehalem processor, primarily with regard to the memory architecture. I was reading some good papers on this topic and discovered some interesting details that I wanted to share with you in this blog post.

At the end of the exercise, by picking the right combination of memory model, memory size, channel population and processor model you can make substantial cost savings, especially when buying at scale.

Beforehand we need to take a closer look at these two core server items, that is, memory and processors. Let’s start with the UDIMM and RDIMM memory architecture, and next I’ll go through the Intel Nehalem/Westmere memory architecture. Finally I will go through a couple of scenarios to exercise what we have learned here. For both scenarios I will pick what I think are the right memory and processor combinations. Feel free to comment and share your experience.

This is quite a long post so bear with me ;)

UDIMMs versus RDIMMs

There are some differences between UDIMMs and RDIMMs that are important in choosing the best options for memory performance. To make a long story short, here is a summary of the comparison between UDIMMs and RDIMMs:

  • Typically UDIMMs are a bit cheaper than RDIMMs.
  • For one DIMM per memory channel UDIMMs have slightly better memory bandwidth than RDIMMs.
  • For two DIMMs per memory channel RDIMMs have better memory bandwidth than UDIMMs.
  • For the same capacity, RDIMMs require more Watts per DIMM than UDIMMs.
  • RDIMMs also provide an extra measure of RAS:
    • Address / control signal parity detection.
    • RDIMMs can use x4 DRAMs so SDDC can correct all DRAM device errors even in independent channel mode.
  • UDIMMs are currently limited to 4GB in a Dual Rank mode.
  • UDIMMs are limited to two DIMMs per memory channel.

So you could go for UDIMMs because they are a bit cheaper, a bit faster and require less power than RDIMMs for the same capacity.

On the other hand you would go for RDIMMs if you need higher capacity per memory module, more reliable error control and data correction than UDIMMs.

So we have defined the pros and cons of these two memory models. Keep this in mind, and now let’s have a closer look at the Intel Nehalem/Westmere memory architecture and the processor models available.

[UPDATE] The LRDIMM case. This is a new type of memory and stands for Load Reduced DIMM. It allows massive memory expansion without sacrificing performance. Remember that as soon as you populate the third DIMM slot on a channel, the memory speed drops to 800MHz. LRDIMM increases capacity whilst maintaining high memory speed by fooling the memory controller. The LR buffer lets a quad-rank DIMM look like a dual-rank DIMM to the memory controller and therefore allows up to three DIMMs per channel; since that’s still below the limit of eight ranks per channel, the memory speed remains at 1333MHz. How cool is that :) Obviously you can’t mix LRDIMMs with either RDIMMs or UDIMMs. If you are looking for maximum capacity and increased performance, look at LRDIMMs. More in DDR3 for Dummies – 2nd edition.

Intel Nehalem-DP/Westmere-DP Memory Architecture and Processor Models

There is no difference in the memory architecture between Nehalem and Westmere. Let me summarize below what matters to me about the memory architecture:

  • A 2-way Xeon system (DP) has one QPI link to connect to the other socket and one QPI link to connect to the IOH chipset (IO Hub). Optionally you can have two IOHs.
  • QPI operates at a clock rate of either 2.4 GHz (=4.8GT/s), 2.93 GHz (=5.86GT/s), or 3.2 GHz (=6.4GT/s).
  • The QPI has a bi-directional maximum bandwidth of 6.4GT/s x 2 Bytes/transfer x 2 directions = 25.6GB/s.
  • GT/s is calculated with 20 bits in mind (or 20 lanes), whilst the GB/s is calculated on the real payload of 16 bits (2 Bytes). For more information on this particular topic, read An Introduction to the Intel® QuickPath Interconnect.
  • Nehalem/Westmere supports up to 18 DIMM slots with DDR3 memory.
  • In general servers support DDR3 DIMMs with a maximum memory clock speed of 166MHz, which gives a data rate of 1333MT/s. This is often misleadingly advertised as a clock rate by labeling the MT/s as MHz.
  • The three DDR3 channels to local DRAM support a maximum bandwidth of 3 x 8 Bytes x 1.333GT/s = ~31.99GB/s. That is ~10.66GB/s per channel.
  • At 1066MT/s the maximum bandwidth is ~25.58GB/s, that is ~8.53GB/s per channel.
  • At 800MT/s the maximum bandwidth is ~19.2GB/s, that is ~6.4GB/s per channel.
  • The available bandwidth to access memory blocks on the other socket is bound by the QPI link speed.
  • The available bandwidth through the QPI link is 12.8GB/s one way, which is approximately 40% of the bandwidth to the local DRAM.
  • At the time of authoring this post, 12MB is the maximum shared L3 cache available for Intel Xeon 5000 series.
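
If you want to redo the bandwidth arithmetic from the list above yourself, here is a small Python sketch. The function names are mine; the only inputs are the figures already quoted: a 2-byte QPI payload per transfer, 8 bytes per DDR3 transfer and three channels per socket.

```python
def qpi_bandwidth_gbs(gt_per_s, payload_bytes=2, directions=2):
    """QPI bandwidth in GB/s: transfers per second x payload bytes x directions."""
    return gt_per_s * payload_bytes * directions


def ddr3_bandwidth_gbs(mt_per_s, channels=3, bytes_per_transfer=8):
    """DDR3 bandwidth in GB/s for a number of 64-bit (8-byte) channels."""
    return channels * bytes_per_transfer * mt_per_s / 1000.0


# QPI at the three link speeds mentioned in the list.
for rate in (4.8, 5.86, 6.4):
    print(f"QPI {rate} GT/s: {qpi_bandwidth_gbs(rate):.2f} GB/s bi-directional")

# Local DRAM bandwidth at the three DDR3 data rates.
for rate in (1333, 1066, 800):
    total = ddr3_bandwidth_gbs(rate)
    print(f"DDR3-{rate}: {total:.2f} GB/s total, {total / 3:.2f} GB/s per channel")

# Remote access is bound by one direction of the QPI link.
one_way = qpi_bandwidth_gbs(6.4, directions=1)
print(f"Remote vs local bandwidth: {one_way / ddr3_bandwidth_gbs(1333):.0%}")
```

These are theoretical peaks only; real-world throughput will be lower.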

The diagram below shows the memory layout of a Nehalem DP Server. By the way DP stands for Dual-Processor.

Note the text in green, I will talk about that later in the post.

The next diagram lists the theoretical bandwidth for local and remote memory accesses.
Note that the remote memory access goes through the QPI link.

But those are not the only things you need to think about. There are other considerations that are often overlooked. For instance the memory frequency at which the system operates is determined by the minimum of three factors:

  1. DIMM frequency.
  2. Memory controller speed.
  3. Channel population scheme.
We can summarize this with the following formula:
System memory speed = MIN (Memory Controller speed, DIMM frequency, population)

First, memory controller speed is limited by the processor model. In general Xeon 5600 ‘X’ series processors run memory at a maximum speed of 1333 MHz. ‘L’ and ‘E’ series processors run at either 1066 or 800 MHz depending on the CPU clock frequency. This is not a constant rule, though, and there are exceptions that I will call ‘marketing exceptions’. It is better to look at the technical details for each processor model.

Second, the operating memory speed is dictated by the DIMM frequency. 1066 MHz DIMMs cannot run at 1333 MHz, but 1333 MHz and 1066 MHz DIMMs can both run at lower frequencies.

Finally, channel memory population schemes dictate that one DIMM-Per-Channel (DPC) or two DPC can run at either 1066 or 1333 MHz, depending on processor model and DIMM type. As soon as you put more than two DPC in any one memory channel, the speed of all the memory drops to 800 MHz.
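
The three rules can be sketched in a few lines of Python. This is a simplification under stated assumptions: the function and parameter names are mine, and the speed allowed with one or two DIMMs per channel is passed in as a parameter (defaulting to 1333 MHz), because in reality it depends on the processor model and DIMM type, so check the vendor’s population guidelines.

```python
def system_memory_speed(controller_mhz, dimm_mhz, dimms_per_channel,
                        limit_1_or_2_dpc_mhz=1333):
    """Return the operating memory speed in MHz as the minimum of three limits.

    controller_mhz       -- maximum speed of the CPU memory controller
    dimm_mhz             -- rated speed of the slowest DIMM installed
    dimms_per_channel    -- highest DIMM count on any one channel (DPC)
    limit_1_or_2_dpc_mhz -- assumed population limit for 1 or 2 DPC
    """
    if dimms_per_channel < 1:
        raise ValueError("at least one DIMM per channel is required")
    # More than two DIMMs on any channel drops all memory to 800 MHz.
    population_limit = limit_1_or_2_dpc_mhz if dimms_per_channel <= 2 else 800
    return min(controller_mhz, dimm_mhz, population_limit)


print(system_memory_speed(1333, 1333, 2))  # fast CPU, fast DIMMs, 2 DPC
print(system_memory_speed(1066, 1333, 1))  # the controller is the bottleneck
print(system_memory_speed(1333, 1333, 3))  # 3 DPC, the population rule wins
```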

The table below summarizes this topic:

The performance difference between 1333MHz and 1066MHz is about 8.5%, between 1333MHz and 800MHz about 28.5%, and between 1066MHz and 800MHz about 22%.

Below is a table grouping the different DIMM capacities and types available for an HP ProLiant BL460c G7. Note that in some circumstances you can drop to 800MHz by populating a second channel, e.g. on the HP BL490 G7.

On the same topic you also need to focus on the processor model. Intel has released many different product lines of Nehalem/Westmere processors, and each combination of a processor die and package has both a separate codename and a product code.

Just for the x86 server market, Intel has four different Xeon processor families/sequences, and for each family/sequence a bunch of different processor numbers such as the X5690 or the E5502.

Let’s have a look at the dual-socket Intel Xeon 5000 Processor Sequence and more precisely the 5500 and 5600 sequences. There you have something like 40 different processor numbers available, making your choice even more difficult.

For each processor number you have the processor clock rate, number of cores and threads, L3 cache size, QPI bus speed, HT technology, TDP, etc… All these processor characteristics are important for making the right choice, but they also make it overly complicated.

The right combination and Business Requirements

With all of these options (UDIMMs, RDIMMs, various DIMM sizes and speeds, low-voltage DIMMs, processor frequency and other processor technology features, etc.) there is a vast number of possibilities, and it’s not always obvious which combination of hardware elements fits together consistently and coherently with regard to both your business requirements and the server architecture. It’s like a giant puzzle of 1000 pieces of information that you need to put in logical order to come up with the best combinations.

Note that I’m not asking ‘which options give the highest performance’, because companies are not always tied to just a high-performance business requirement. Energy efficiency or high consolidation can also be your company’s number one business requirement.

Note that in these hard economic times, cost savings are mandatory for many companies and may override the traditional business requirements cited above. A cost savings rule helps to keep the company’s business requirements within budget boundaries.

Sure, your company can have other business requirements than the ones above; I know at least one company where the end-user experience is rated number one. A business requirement list is definitely not limited to three or four items.

In many cases companies have multiple business requirements: we need high performance and high consolidation at the lowest cost… Huh! The goal is to juggle these business requirements to come up with the right combinations.

Sometimes this turns into the Triangle Project with no viable combination :)

Scenarios

Imagine the following scenario: your company’s server vendor policy is HP and for this project you have picked the HP BL460c G7. The business requirement is high consolidation, thus you need memory, plenty of memory. You load up the server with 12x32GB RDIMM memory modules for a maximum memory size of 384GB running at … 800MHz. Now what processor would you choose in this case? Would you buy the X5690 @ $1663.00 or the E5649 @ $774.00?

In this specific config the memory runs at the same speed with both processors, that is 800MHz. The QPI speed is higher for the X5690, but you can’t use it at full throttle anyway because the memory speed is down to 800MHz. Thus between the two CPUs only the clock speed makes a difference: ~1GHz more for the X5690, but at more than 2x the price of the E5649. Is it worth the extra $889.00?

By loading up with 12x16GB memory modules for a total of 192GB, your memory frequency remains at 1333MHz, and fast processors, such as those in the X series, are now a valid option. But then you no longer stick to your business requirement! You have gone from the highest possible consolidation ratio (100%) to half of that (50%).

Another scenario: you have again picked an HP BL460c G7. This time the business requirement is energy efficiency. Remember that UDIMMs use less power than RDIMMs, thus you go for them and load up the server with 12x4GB UDIMM memory modules, one per slot. For the same capacity, RDIMMs require 0.5 to 1.0 Watt more. Now what processor would you choose? The one with the lowest power consumption might be a good choice, like the L5609. But then you do not benefit from the UDIMMs running at 1333MHz because the CPU supports a maximum of 1066MHz… What about going for 6x32GB RDIMM LV (1.35V instead of 1.5V) running at 1066MHz for a total capacity of 192GB (4x more than the UDIMM max capacity), and choosing the L5630, which also uses only 40W but has HT and Turbo Boost Technology for when you need extra power…

Take these two scenarios just for what they are. They may not reflect any real case. They are just meant to demonstrate the thinking process with the information we gathered today.

I don’t have a secret formula that will sort out this kind of puzzle either. At least I hope I have shed some light on these little-known but important links between memory and the Intel Nehalem architecture.

Here are some tools that I hope will help you pick the right combination:

There are two other unknown puzzle pieces I will shed some light on next time: processor clock frequency sensitive applications and memory bandwidth sensitive applications. So stay tuned ;)

Sources: wikipedia.org, intel.com, dell.com, hp.com and google.com

Enhanced Storage vMotion in vSphere 5 with Mirror Mode

This is a short post for those who are interested in the deep technical details of Mirror Mode, aka IO Mirroring, one of the key architectural pieces of the Enhanced Storage vMotion available in vSphere 5 Ent and Ent+. You can hear the term mentioned in the Profile-Driven Storage video and the Storage DRS video. Mirror Mode can leverage VAAI and enable the use of the copy off-load engines sometimes present in storage arrays, making it even more appealing.

You can deep dive into Mirror Mode and the other available architectures by reading The Design and Evolution of Live Storage Migration in VMware ESX, written by Ali Mashtizadeh, Emre Celebi, Tal Garfinkel and Min Cai, all from VMware.

We describe the evolution of live storage migration in VMware ESX through three separate architectures, and explore the performance, complexity and functionality trade-offs of each.

It All Started With This Question…

…I shot out on Twitter: Can anyone cite me an Active/Active storage array that supports ALUA?

An important piece of information is missing from my question. To be precise, I’m talking about symmetric storage arrays. So let me rephrase my question: Can anyone cite me a symmetric Active/Active storage array that supports ALUA?

Needless to say, I got flamed with harsh tweets and DMs.

Do you know what the A stands for in ALUA?… Your question is plain wrong… It is not necessary with symmetric arrays… It has nothing to do with symmetric arrays… You don’t understand the basics… Etc.

OK guys maybe my question is dumb but once again…are you sure it *really* is?

You can read on or jump to the end of this article to immediately find out if I was soooo wrong with my dumb question :)

When I wrote VMware PSA, MPP, NMP, PSP, MRU, … And Tutti Quanti! and Aloha ALUA I did a lot of research and read a lot of documents, and the more I dug in, the more questions popped up than answers. There is so much to learn about the SCSI protocol, storage arrays in general and ALUA in particular.

So people, let’s go back to the SPC-3 standards draft. The latest draft known at T10.org is spc3r23.pdf, but unfortunately it is not available for free. Fortunately you can find a copy at 13thmonkey.org.

Let’s have a look at chapters 5.7 up to 5.8.3 for the rest of this post. I’ll be doing a lot of copy/paste here because I want to stick to the SPC-3 standards draft as much as I can. I’ll be pasting diagrams as well, and sometimes I have added my own annotations in red. I’m not assuming anything; the information given here is written in black and white in the latest SPC-3 standards draft.

5.7 Multiple target port and initiator port behavior
SAM-3 specifies the behavior of logical units being accessed by application clients through more than one initiator port and/or through more than one target port…
If one target port is being used by an initiator port, accesses attempted through other target port(s) may:
a) Receive a status of BUSY; or
b) Be accepted as if the other target port(s) were not in use.

5.8.1 Target port group access overview
Logical units may be connected to the service delivery subsystem via multiple target ports (see SAM-3). The access to logical units through the multiple target ports may be symmetrical (see 5.8.3) or asymmetrical (see 5.8.2).

5.8.2.1 Introduction to asymmetric logical unit access
Asymmetric logical unit access occurs when the access characteristics of one port may differ from those of another port. SCSI target devices with target ports implemented in separate physical units may need to designate differing levels of access for the target ports associated with each logical unit.

5.8.2.3 Discovery of asymmetric logical unit access behavior
SCSI logical units with asymmetric logical unit access may be identified using the INQUIRY command. The value in the target port group support (TPGS) field (see 6.4.2) indicates whether or not the logical unit supports asymmetric logical unit access and if so whether implicit or explicit management is supported.

5.8.2.4.1 Target port asymmetric access states overview
For all SCSI target devices that report in the INQUIRY data that they support asymmetric logical unit access, all of the target ports in a target port group shall be in the same target port asymmetric access state with respect to the ability to route information to a logical unit. The target port asymmetric access states are:
a) Active/optimized;
b) Active/non-optimized;
c) Standby; and
d) Unavailable.

5.8.2.6 Preference Indicator
A device server may indicate one or more target port groups is a preferred target port group for accessing a logical unit by setting the PREF bit to one in the target port group descriptor (see 6.25). The preference indication is independent of the asymmetric access state. An application client may use the PREF bit value in the target port group descriptor to influence the path selected to a logical unit (e.g., a target port group in the standby target port asymmetric access state with the PREF bit set to one may be chosen over a target port group in the active/optimized target port asymmetric access state with the PREF bit set to zero). The value of the PREF bit for a target port group may change whenever an asymmetric access state changes.

5.8.2.7 Implicit asymmetric logical units access management
SCSI target devices with implicit asymmetric logical units access management are capable of setting the target port group asymmetric access state of each target port group using mechanisms other than the SET TARGET PORT GROUPS command.
All logical units that report in the standard INQUIRY data (see 6.4.2) that they support asymmetric logical units access and support implicit asymmetric logical unit access (i.e., the TPGS field contains 01b or 11b) shall:
a) Implement the INQUIRY command Device Identification VPD page identifier types 4h (see 7.6.3.7) and 5h (see 7.6.3.8); and
b) Support the REPORT TARGET PORT GROUPS command as described in 6.25.

5.8.2.8 Explicit asymmetric logical units access management
All logical units that report in the standard INQUIRY data (see 6.4.2) that they support asymmetric logical units access and support explicit asymmetric logical unit access (i.e., the TPGS field contains 10b or 11b) shall:
a) Implement the INQUIRY command Device Identification VPD page (see 7.6.3) identifier types 4h and 5h;
b) Support the REPORT TARGET PORT GROUPS command as described in 6.25; and
c) Support the SET TARGET PORT GROUPS command as described in 6.31.
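
To make the TPGS values quoted above (01b, 10b, 11b) a bit more tangible, here is a minimal Python sketch that decodes the two bits. The function names are mine. The position of the field (byte 5, bits 5 and 4 of the standard INQUIRY data) is my reading of section 6.4.2, so double check it against the draft before relying on it.

```python
def tpgs_from_inquiry(inquiry_data):
    """Extract the 2-bit TPGS field from standard INQUIRY data.

    Assumes TPGS sits in byte 5, bits 5..4 (my reading of SPC-3 6.4.2).
    """
    if len(inquiry_data) < 6:
        raise ValueError("standard INQUIRY data is too short")
    return (inquiry_data[5] >> 4) & 0b11


def decode_tpgs(tpgs):
    """Return (implicit, explicit) ALUA support flags for a TPGS value.

    00b: no asymmetric access support     01b: implicit only
    10b: explicit only                    11b: implicit and explicit
    """
    if not 0 <= tpgs <= 0b11:
        raise ValueError("TPGS is a 2-bit field")
    implicit = bool(tpgs & 0b01)
    explicit = bool(tpgs & 0b10)
    return implicit, explicit


# A made-up INQUIRY response with TPGS = 01b, for illustration only.
sample = bytes([0x00, 0x00, 0x05, 0x02, 0x1F, 0x10])
print(decode_tpgs(tpgs_from_inquiry(sample)))
```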

I kept the best part for the end, hoping you’re still reading :)

5.8.3 Symmetric logical unit access
A device server that provides symmetrical access to a logical unit may use a subset of the asymmetrical logical access features (see 5.8.2) to indicate this ability to an application client, providing an application client a common set of commands to determine how to manage target port access to a logical unit. Symmetrical logical unit access should be represented as follows:
a) The TPGS field in the standard INQUIRY data (see 6.4.2) indicates that implicit asymmetric access is supported;
b) The REPORT TARGET PORT GROUPS command is supported; and
c) The REPORT TARGET PORT GROUPS parameter data indicates that the same state (e.g., active/optimized state) is in effect for all target port groups.
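
Read as a checklist, points a) to c) can be sketched in Python as follows. This only illustrates the logic; the function name and the plain-string states are mine. A real initiator would take the TPGS value from the INQUIRY data and the per-group states from the REPORT TARGET PORT GROUPS response, and getting a response at all is what covers point b).

```python
def looks_symmetric(tpgs, group_states):
    """Check whether a logical unit presents itself as symmetric per 5.8.3.

    tpgs         -- 2-bit TPGS value from the standard INQUIRY data
    group_states -- one access state per target port group, as reported by
                    REPORT TARGET PORT GROUPS (plain strings in this sketch)
    """
    implicit_supported = tpgs in (0b01, 0b11)        # point a)
    same_state_everywhere = (
        len(group_states) > 0 and len(set(group_states)) == 1
    )                                                # point c)
    return implicit_supported and same_state_everywhere


# Every target port group reports active/optimized: symmetric behaviour.
print(looks_symmetric(0b01, ["active/optimized", "active/optimized"]))

# Mixed states: the classic asymmetric (ALUA) case.
print(looks_symmetric(0b01, ["active/optimized", "active/non-optimized"]))
```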

SO YES A SYMMETRIC ACTIVE/ACTIVE STORAGE ARRAY MAY USE A SUBSET OF THE ASYMMETRICAL LOGICAL ACCESS FEATURES!

This statement, written in the latest draft of the SPC-3 standards document, surprised me as well, and that’s why I shot out this question: Can anyone cite me a symmetric Active/Active storage array that supports ALUA?
3PAR does apparently! Thx for the info @mcowger. Who else does?

But this statement raises other questions as well:
-What’s the benefit of ALUA with symmetric arrays?
-What are those ‘application clients’ the document is referring to?
-What does a real life implementation of a symmetric array using the ALUA subset of commands look like?

If you have some answers for me to these questions please comment! I’m dying to know :)

 

Limiting Disk I/O From A Specific Virtual Machine

I wanted to follow up on my previous post, Is a Virtual Machine Bringing Your Storage Down?, with some tests in my home lab. Nothing ‘real life’ but enough to get familiar with these new features.

Controlling IOPS is crucial in an environment with shared storage resources. VMware vSphere 4.1 has several techniques to do that, and the premium feature is known as SIOC or Storage I/O Control. If you have the proper license just go for it and turn it on! Don’t forget to follow the requirements and recommendations.

Now if you haven’t bought vSphere 4.1 Enterprise Plus licenses, you still have other built-in storage features that will be used by your vSphere 4.1 hosts to manage available storage resources with great fairness. To name a few: Disk.schednumreqoutstanding, QFullSampleSize and QFullThreshold.

This week in my man cave, I’ve played a bit with a couple of those new vSphere 4.1 features, namely sched.scsix:x.throughputCap and sched.scsix:x.bandwidthCap. I have recorded and published a small video you can watch below.

In summary here are my observations:

  • Apparently you cannot use both parameters at the same time. It is either throughputCap or bandwidthCap per disk. If that makes sense to anyone please comment :)
  • If you do use both parameters for the same disk, they are just ignored by the host.
  • sched.scsi0:0.bandwidthCap must be set in Bps. The VMware KB isn’t that clear about this. Why can’t I set IOps here?
  • If for instance your VM has two disks and you set sched.scsi0:0.bandwidthCap=100IOps for the first disk and sched.scsi0:1.bandwidthCap=50IOps for the second disk, the values are added up and the actual limit for any of the disks is 150IOps. Is this a bug or by design?

How do these two virtual machine parameters play with Disk.schednumreqoutstanding, QFullSampleSize, QFullThreshold and SIOC? I don’t yet have the whole picture on this subject to write something relevant here, so if anyone knows more please enlighten us ;)

Is a Virtual Machine Bringing Your Storage Down?

Have you ever had a user running IOmeter on a virtual machine just to ‘test’ the performance of the virtual disk… at 9am…on a Monday morning??

Well, it might happen to you once but not twice, as you would immediately have powered off the virtual machine and possibly head shot the user :)

There is a more elegant way, though, to mitigate the risk of a legitimate virtual machine that is still able to bring your storage to its knees and ruin your Monday morning.

  • Power off the virtual machine.
  • Click Edit Settings.
  • Click the Options tab.
  • Under the Advanced: General section, click the Configuration Parameters button.
  • Click the Add Row button.
  • Add one or both of these settings for each virtual disk device:

sched.<diskname>.throughputCap = <value><unit>

For example: sched.scsi0:0.throughputCap = 10KIOps

sched.<diskname>.bandwidthCap = <value><unit>

For example: sched.scsi0:0.bandwidthCap = 10KBps

The <value> is either off or an integer, and the <unit> is a string beginning with K (KBps or KIOps), M (MBps or MIOps) or G (GBps or GIOps). If no units are specified, the default of K is assumed.

[UPDATE] You can cap below the k (thousand) by omitting K, M or G. e.g. 100IOps will cap at maximum 100 IO per second.

  • Start the virtual machine. The virtual machine I/O is limited to the specified values.

This is only available in vSphere 4.1
Source: VMware KB 1038241
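
If you generate these settings from a script, it helps to sanity-check the values first. Here is a minimal Python sketch of how I read the value syntax described above. It is an assumption, not VMware’s parser: K, M and G are taken as powers of 1000 (the "k (thousand)" of the update), and a bare integer with no unit at all is treated as K.

```python
import re

# Multipliers are assumed to be decimal; the KB does not say 1000 vs 1024.
_MULTIPLIER = {"": 1, "K": 1000, "M": 1000 ** 2, "G": 1000 ** 3}
_CAP_PATTERN = re.compile(r"^(\d+)([KMG]?)(IOps|Bps)?$")


def parse_cap(value):
    """Parse a throughputCap/bandwidthCap value.

    Returns None for "off", otherwise (amount, kind) where kind is
    "IOps", "Bps" or None when the value carries no unit string.
    """
    text = value.strip()
    if text.lower() == "off":
        return None
    match = _CAP_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised cap value: {value!r}")
    number, prefix, kind = match.groups()
    if not prefix and kind is None:
        prefix = "K"  # no unit at all: the default of K is assumed
    return int(number) * _MULTIPLIER[prefix], kind


for example in ("10KIOps", "100IOps", "10KBps", "10", "off"):
    print(example, "->", parse_cap(example))
```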

In-Guest Defragmentation – The Holy Grail For Best Performance?

On one of our internal mailing lists, a question regarding in-guest defragmentation came up again. The right answer to this question is simple: it depends!

OK that’s fair, but it depends on what? Well, mostly it depends on two things:

  1. what kind of storage device you store your virtual machines on. There are huge differences between an HP StorageWorks 2120 and an EMC Symmetrix V-MAX in terms of features and capabilities to improve the overall performance of the storage device.
  2. what kind of features you may leverage in your virtual environment. Features like thin provisioning, snapshots, data deduplication and replication, either at the ESX host or storage device level.

Scenario #1:

You use enterprise-class SAN/NAS devices. In such a configuration, with multiple hosts hitting the storage devices, it is very likely you have a random IO pattern. No worries, such storage devices are very smart and can deal with a random IO pattern using techniques like IO coalescing, read-ahead mechanisms, cache algorithms, RAID striping, etc…

Storage devices such as NetApp, with its WAFL mechanism, automatically fragment data across the disks. A technical report called NetApp and VMware vSphere Storage Best Practices says on page 78:

Virtual machines stored on NetApp storage arrays should not use disk defragmentation utilities as the WAFL file system is designed to optimally place and access data at a level below the GOS file system

Features like thin provisioning, snapshots, data deduplication and replication may be impacted by an in-guest defragmentation.

The IO load generated by the defrag process running inside the virtual machine will negate the benefit of having a thin disk by inflating it. What’s the point of using a thin disk if you bloat it with your in-guest defragmentation?

The same IO load will generate another huge amount of IOs in the background and mess with snapshots, growing them unexpectedly until they eventually reach the same size as the parent disk, creating many SCSI locks as well and bloating latency. Why would you do that to your virtual machine?

If you’re doing replication to the other side of the earth, you may send a fair amount of extra bytes across your WAN link, and you may end up saturating the link, causing high latency and disconnections. Are you sure you want to get the network team on your back?

Scenario #2

Dumb DAS devices. You have set up a virtual environment with no shared storage array, using one disk or maybe a few disks in a RAID configuration. In that case, in-guest defragmentation could be an improvement or I should say a mitigation of the overall performance degradation of your DAS. Scott Drummond of vpivot.com published an interesting article called Windows Guest Defragmentation, Take Two demoing the benefits of in-guest defragmentation in a very specific storage configuration, that is a DAS!

Again, even though you’re using a dumb DAS device, leveraging features like thin provisioning may actually render in-guest defragmentation inappropriate for the same reasons I explained in scenario #1.

You may attach the DAS to a smart SCSI controller with plenty of cache and able to do IO coalescing, but at the end of the day defragmenting is about writes, many writes, generating many IOs that neither the controller’s cache nor the physical disks in the DAS can absorb efficiently.

Summary:

My view is that in-guest defragmentation is, for most of the environments I have worked in, just totally inappropriate and may actually decrease overall performance. The overhead of running the defragmentation process is likely to be much more of a burden and to outweigh the hypothetical benefit.

As usual it depends on YOUR environment and the way you’ve designed your storage, the features you have enabled, your functional requirements, etc…

In any case, and this is valid for both of my scenarios here, the easiest way to get more performance out of your storage device is to align your VMs and your VMFS datastores to the storage device. VM alignment is critical, especially for Microsoft Windows versions prior to Vista and 2008. Another way to improve IO is to disable NTFS access time updates in the guest. And finally, at the ESX host level, make sure your VMFS datastores are aligned by creating them through the vCenter Client.
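
Checking alignment boils down to one modulo operation on the partition’s starting offset. Here is a minimal Python sketch, assuming 512-byte sectors and a 4 KB boundary by default; the right boundary depends on your storage device, so take the value from your array vendor’s documentation.

```python
SECTOR_BYTES = 512  # assumes classic 512-byte sectors


def is_aligned(start_sector, boundary_bytes=4096):
    """Return True if a partition starting at start_sector sits on the boundary."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


# Windows versions prior to Vista/2008 start the first partition at sector 63.
print("sector 63:", is_aligned(63))

# Vista/2008 and later default to a 1 MB offset, that is sector 2048.
print("sector 2048:", is_aligned(2048))

# The same check against a 64 KB boundary, with a partition at sector 128.
print("sector 128 @ 64KB:", is_aligned(128, boundary_bytes=64 * 1024))
```

Inside a guest you would feed it the starting offset reported by your partitioning tool.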