VMware HA Agent – Lack and Limitation


VMware HA stands for High Availability. Over the past 4 years VMware HA has evolved significantly, but the lack of some features and the limitations of existing ones are now showing up in large architecture designs. They could turn into a deployment show-stopper in the coming years, or perhaps months, if VMware doesn’t come up with an enhanced version or eventually switch to a radically new HA agent. First, I think a small introduction to VMware HA is necessary.

Where does that agent come from?

VMware HA is really based on a stripped-down version of the Legato Automated Availability Manager 5.1.2, aka LAAM. When EMC took over Legato in 2003, the L was dropped from the agent’s name. Smart people like Deepak Narain developed the agent at Legato back in 2002.

What does that agent do?

VMware HA’s main job is to monitor the ESX service console network interface card (NIC). Besides that main function, VMware HA provides high availability for virtual machines by pooling them: in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

VMware HA does the following:

  • Protects against a server failure by automatically restarting the virtual machines on other hosts within the cluster.
  • Protects against operating system failures by continuously monitoring a virtual machine and resetting it in the event that a failure is detected.

What happens in case of Failure Detection, Host Network Isolation and Operating System Failure?

According to the vSphere Availability Guide, HA agents communicate with each other and monitor the liveness of the hosts in the cluster. This is done through the exchange of heartbeats, by default every second. If a 15-second period elapses without the receipt of heartbeats from a host, and the host cannot be pinged, it is declared as failed. In the event of a host failure, the virtual machines running on that host are failed over, that is, restarted on the alternate hosts with the most available unreserved capacity (CPU and memory).

Host network isolation occurs when a host is still running, but it can no longer communicate with other hosts in the cluster. With default settings, if a host stops receiving heartbeats from all other hosts in the cluster for more than 12 seconds, it attempts to ping its isolation addresses. If this also fails, the host declares itself as isolated from the network. When the isolated host’s network connection is not restored for 15 seconds or longer, the other hosts in the cluster treat it as failed and attempt to fail over its virtual machines. However, when an isolated host retains access to the shared storage, it also retains the disk lock on virtual machine files. To avoid potential data corruption, VMFS disk locking prevents simultaneous write operations to the virtual machine disk files, so attempts to fail over the isolated host’s virtual machines fail. By default, the isolated host leaves its virtual machines powered on, but you can change the host isolation response to Shut Down VM or Power Off VM.
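
To make the timings above easier to follow, here is a minimal Python sketch of the two decisions: how the other hosts judge a silent host, and how a running host judges itself. It only mirrors the default 12-second and 15-second thresholds quoted above. The names (HostObservation, peer_view, self_view) are my own illustration and not part of any VMware API, and the real agent’s logic is certainly more involved.

```python
from dataclasses import dataclass

# Default timings quoted in the text above (seconds).
ISOLATION_CHECK_AFTER_S = 12   # a host pings its isolation addresses after this
FAILURE_DECLARED_AFTER_S = 15  # peers declare a silent host failed after this


@dataclass
class HostObservation:
    """What the other hosts know about one host (hypothetical structure)."""
    seconds_since_last_heartbeat: float
    responds_to_ping: bool


def peer_view(obs: HostObservation) -> str:
    """How the rest of the cluster classifies a monitored host."""
    if obs.seconds_since_last_heartbeat < FAILURE_DECLARED_AFTER_S:
        return "alive"
    # Heartbeats are missing: a successful ping still counts as a sign of life.
    return "alive" if obs.responds_to_ping else "failed"


def self_view(seconds_without_peer_heartbeats: float,
              isolation_address_reachable: bool) -> str:
    """How a running host classifies itself."""
    if seconds_without_peer_heartbeats <= ISOLATION_CHECK_AFTER_S:
        return "connected"
    # No heartbeats for more than 12 seconds: fall back to the isolation address.
    return "connected" if isolation_address_reachable else "isolated"


if __name__ == "__main__":
    print(peer_view(HostObservation(16, responds_to_ping=False)))
    print(self_view(13, isolation_address_reachable=False))
```

Note the 3-second gap between the two thresholds: a host can conclude it is isolated before its peers conclude it has failed, which is what gives the isolation response time to act.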

To monitor operating system failures, VMware HA monitors heartbeat information provided by the VMware Tools package installed in each virtual machine in the VMware HA cluster. Failures are detected when no heartbeat is received from a given virtual machine within a user-specified time interval. The virtual machine is then reset on its host.

Lack and Limitation, what are they?

-According to the vSphere 4.0 Configuration Maximums, the following need to be considered as limitations, especially when thinking about the vCloud, where you might require many more than 32 hosts in an HA cluster and many more than 40 guests per host:

  • Hosts per HA cluster -> 32 max
  • Virtual machines per host in HA cluster with 8 or fewer hosts -> 100 max
  • Virtual machines per host in HA cluster with 8 or fewer hosts for vSphere 4.0 Update 1 -> 160 max
  • Virtual machines per host in HA cluster with 9 or more hosts -> 40 max
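
A quick way to see what these limits mean for a design is to turn them into arithmetic. The Python sketch below encodes only the four figures listed above. The function names are mine, and the result is a theoretical ceiling from the per-host limits alone: it ignores failover capacity and any other limit that may apply.

```python
# Limits quoted above from the vSphere 4.0 Configuration Maximums.
MAX_HOSTS_PER_HA_CLUSTER = 32


def max_vms_per_host(hosts_in_cluster: int, update1: bool = False) -> int:
    """Per-host VM limit for an HA cluster of the given size."""
    if not 1 <= hosts_in_cluster <= MAX_HOSTS_PER_HA_CLUSTER:
        raise ValueError("an HA cluster holds between 1 and 32 hosts")
    if hosts_in_cluster <= 8:
        # Small clusters: 100 VMs per host, 160 with vSphere 4.0 Update 1.
        return 160 if update1 else 100
    # 9 hosts or more: the per-host limit drops to 40.
    return 40


def max_vms_per_cluster(hosts_in_cluster: int, update1: bool = False) -> int:
    """Ceiling on VMs in the cluster, from the per-host limits only."""
    return hosts_in_cluster * max_vms_per_host(hosts_in_cluster, update1)


if __name__ == "__main__":
    for hosts in (8, 9, 32):
        print(hosts, max_vms_per_cluster(hosts), max_vms_per_cluster(hosts, update1=True))
```

With these numbers, adding a ninth host lowers the ceiling from 800 VMs (8 x 100) to 360 (9 x 40), and a full 32-host cluster tops out at 1280 (32 x 40), the same as 8 hosts on Update 1 (8 x 160).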

-Next, VMware HA won’t protect you against the failure of the guest operating system by default. If you turn that on, you gain some level of protection against guest OS failure, but you still won’t get protection for a specific application within the guest operating system, unlike, for instance, an OS-level cluster. An advanced configuration setting, das.iostatsInterval, helps you avoid restarting a guest whose heartbeat has merely ceased functioning, by checking the guest’s I/O activity over a certain period of time.
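
The idea behind that I/O check can be sketched in a few lines of Python. This is only my reading of the behaviour described above, not VMware’s implementation: the function and parameter names are hypothetical, and I assume that any I/O activity seen during the stats window is enough to call off the reset.

```python
def should_reset_vm(seconds_since_tools_heartbeat: float,
                    heartbeat_timeout_s: float,
                    io_ops_in_stats_window: int) -> bool:
    """Decide whether a guest looks dead enough to be reset.

    heartbeat_timeout_s stands for the user-specified monitoring interval;
    io_ops_in_stats_window stands for the I/O activity observed during the
    window that das.iostatsInterval controls (assumed semantics).
    """
    if seconds_since_tools_heartbeat < heartbeat_timeout_s:
        return False  # VMware Tools is still sending heartbeats
    # Heartbeats stopped: only reset if the guest shows no I/O either.
    return io_ops_in_stats_window == 0


if __name__ == "__main__":
    # Heartbeat lost but the guest is still doing I/O: leave it alone.
    print(should_reset_vm(45, heartbeat_timeout_s=30, io_ops_in_stats_window=120))
    # Heartbeat lost and no I/O at all: reset.
    print(should_reset_vm(45, heartbeat_timeout_s=30, io_ops_in_stats_window=0))
```

Even with this second check, the decision is still made at the guest level: a hung application inside a healthy guest goes unnoticed.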

-Another point: VMware HA clusters spread over geographically dispersed data centers will be common designs sooner than we think and, as Chad Sakac posted in response to a blog post from Arnim Van Lieshout, HA needs:

  • A more “SRM like” ability to control restart conditions/sequencing and
  • A more transparent way to define primaries/secondaries.

-Finally, the current maximum of 5 primaries per VMware HA cluster is just not enough to cope with large vCloud environments. For those who run blades, one of the first recommendations is to avoid having those 5 primaries running on, for instance, a single blade chassis, also known as a ‘possible failure domain’. If the chassis goes down for any reason, you lose your HA capabilities! There is a way to configure an HA node as primary or secondary; however, it’s not possible to configure an ESX host as a “fixed” primary HA node. There is hope: back in September 2009, during VMworld 2009, Marc Sevigny from VMware revealed that a future release of HA would contain an option that would allow you to pick your primary hosts.
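
Checking a design for that blade chassis problem is easy to script once you know which hosts are primaries and which chassis each host sits in. The Python sketch below assumes you have collected both by hand or from your own inventory. The host and chassis names are made up, and nothing here queries HA itself.

```python
from typing import Dict, List, Optional


def chassis_with_all_primaries(primaries: List[str],
                               chassis_of: Dict[str, str]) -> Optional[str]:
    """Return the chassis holding every primary, or None if they are spread out."""
    chassis = {chassis_of[host] for host in primaries}
    return chassis.pop() if len(chassis) == 1 else None


def surviving_primaries(primaries: List[str],
                        chassis_of: Dict[str, str],
                        failed_chassis: str) -> List[str]:
    """Primaries that remain after one chassis goes down."""
    return [host for host in primaries if chassis_of[host] != failed_chassis]


if __name__ == "__main__":
    # Hypothetical inventory: esx01-esx05 in chassis A, esx06-esx08 in chassis B.
    chassis_of = {f"esx{i:02d}": "A" if i <= 5 else "B" for i in range(1, 9)}
    primaries = ["esx01", "esx02", "esx03", "esx04", "esx05"]

    print(chassis_with_all_primaries(primaries, chassis_of))
    print(surviving_primaries(primaries, chassis_of, failed_chassis="A"))
```

In this made-up layout all five primaries sit in chassis A, so losing that chassis leaves no primary to coordinate restarts.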

Conclusions

VMware HA is probably the feature with the most advanced settings, as stated by Duncan Epping! Don’t get me wrong, I share Duncan’s view: HA is awesome and many customers rely on it every day to get high availability across their VMware environments! But in my opinion HA has also reached its limits in many respects and definitely needs to be improved or, why not, totally rewritten to cope with the challenging new architecture designs of the vCloud initiative.

This is an open discussion where I throw in my ideas and point of view. I welcome anybody to leave comments!

About PiroNet

Didier Pironet is an independent blogger and freelancer with 15+ years of IT industry experience. Didier is also a former VMware Inc. employee, where he specialised in Datacenter and Cloud Infrastructure products as well as Infrastructure, Operations and IT Business Management products. Didier is passionate about technologies and he is found to be a creative and a visionary thinker, expressing with passion and excitement, hopefully inspiring and enrolling people to innovation and change.

7 Responses to VMware HA Agent – Lack and Limitation

  1. Duncan says:

    Unfortunately I am not allowed to say much due to the NDA I signed, but there’s some amazing stuff coming in regards to HA!

  2. daVikes says:

    Glad to hear that Duncan.

    I would like to see all of the Advanced Parameters in there by default as it is with the Advanced vCenter parameters.

    I wonder if it should be more integrated w/ vCenter in knowing / recognizing port groups? I have a Service Console Port Group / Fault Tolerance Port Group / Various VM VLAN Port Groups / iSCSI Port Groups all of which go through a number of Different Network Cards and it seems HA could understand which network cards are unavailable, what is affected and move VM’s around appropriately as needed and also if possible. Similar to how vDS is more integrated with vCenter but yet can operate independently of vCenter if that is unavailable. I wonder if it will be called vHA =)

    So perhaps for example if the Service Console PG uplinks goes down and all other port groups are fine, it knows that’s the case, and you could set it to do nothing. If the uplink ports for FT are down and say you have a FT enabled VM running, then it tries to Fail Over to the Secondary VM so its the new Primary, and creates a new secondary VM for you on a working host? Or say it is your VM’s VLAN port group that is the only Portgroup offline and it knows its affecting the SLA and it knows it can vMotion them or worse do a shutdown of those VM’s in the affected Portgroups and start those up on another Host?

    But it seems that to do some of that, you need to get it tied into vCenter a bit more than it is today?

    I’m totally speculating at this point, and have no adv. knowledge of anything just pointing out some things just thinking about here and where/how it could go? or maybe none of these make sense or is workable who knows, just throwing them out there.

  3. hypervisor says:

    no more primary nodes limit ! 🙂

  4. Pingback: Increase Number of VMware HA Primaries – How To? « DeinosCloud

  5. Vmware ESX says:

    VMware is starting to become a no brainer for most businesses. I have worked in the IT Industry for over 10 years and never before have I seen a single piece of technology make such an impact in such a short space of time. Nice entry.

  6. Pingback: What VMware DRS/HA/FT is missing? « DeinosCloud

  7. Pingback: What VMware DRS/HA/FT are missing? « DeinosCloud
