Could DINO Be The Future Of vSphere NUMA Scheduler?
DINO the future of vSphere NUMA scheduler uh!
First thing first, DINO is not Dino… Dino is one of the The Flintstones’s fictional characters.
Flintstones. Meet the Flintstones. They’re the modern stone age family.
From the town of Bedrock, They’re a page right out of history…yabba dabba doo time!
All right, all right. DINO is not Dino. So what is DINO? I leave this for later.
For now let’s focus on NUMA design and vSphere NUMA Scheduler.
So what is NUMA?
Wikipedia says: “Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures”
NUMA is often contrasted with Uniform Memory Access (UMA) which is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. Read more at Wikipedia.
Figure 1 shows a classic SMP system where there is usually a single pool of memory also referred as an Uniform Memory Access (UMA). That is memory access is equal for all processors. Contention-aware algorithms works well here.
The main drawback of the UMA architecture is that it doesn’t follow in scaling from symmetric multiprocessing (SMP) architectures where many processors must compete for bandwidth on the same system bus. That’s why server vendors added NUMA design on top of SMP design. The first commercial implementation of a NUMA-based Unix system was the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for Honeywell Information Systems Italy (HISI). In 1991 Honeywell’s computer division was sold to Groupe Bull. How interesting is that!
Figure 2 shows a classic SMP system with Distributed Shared Memory (DSM). In a DSM system there are multiple pools of memory and the latency to access memory depends on the relative position of the processor and memory. This is also referred to a Non-Uniform Memory Access or NUMA.
Major benefit; each processor has local memory with the lowest latency. On the opposite remote memory access is slower. Intel says that latency can go up to 70% and bandwidth as less than half of local access bandwidth.
But the biggest downside of DSM is that it only works well if the operating system is “NUMA-aware” and can efficiently place memory and processes. The OS scheduler and memory allocator play a critical role here.
vSphere is NUMA aware as long as the BIOS reports it. That is as long as the BIOS builds a System Resource Allocation Table (SRAT), so the ESX/ESXi host detects the system as NUMA and applies NUMA optimizations. If you enable node interleaving (also known as interleaved memory), the BIOS does not build an SRAT, so the ESX/ESXi host does not detect the system as NUMA. Does that mean that vSphere doesn’t do any optimization if you haven’t enabled NUMA in the BIOS? I guess it doesn’t since the scheduler doesn’t know the relationship between processor and local memory. That information is only given by the SRAT as I understand it.
What are vSphere NUMA optimizations I’m referring to?
Before we deep dive vSphere NUMA optimizations, first let’s define a Home Node. A Home Node is one of the system’s NUMA nodes containing processors and local memory, as indicated by the System Resource Allocation Table (SRAT).
They are two main vSphere NUMA optimization algorithms and settings you find in the vSphere NUMA Scheduler:
- Home Nodes and Initial Placement. When a virtual machine is powered on, ESX/ESXi assigns it a home node in a round robin fashion. To work around imbalanced systems when virtual machines are stopped or become idle, there is a second set of algorithms and settings called,
- Dynamic Load Balancing and Page Migration. ESX/ESXi combines the traditional initial placement approach with a dynamic rebalancing algorithm. Periodically (every two seconds by default), the system examines the loads of the various nodes and determines if it should rebalance the load by moving a virtual machine from one node to another. This calculation takes into account:
- the resource settings for virtual machines and
- resource pools to improve performance without violating fairness or resource entitlements.
To get a detailed description of the algorithms and settings used by ESX/ESXi to maximize application performance while still maintaining resource guarantees, visit vmware.com.
vSphere NUMA Scheduler has put in place pretty smart algorithms and settings when it comes to initial placement and memory management. I was wondering could it be better?
For instance, by managing contention for shared resources that occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers.
Sergey Blagodurov from Simon, Sergey Zhuravlev, Mohammad Dashti and Alexandra Fedorova, all from Simon Fraser University, have published a very interesting technical paper at Usenix.org about limitation of current NUMA design and a proposition of a new approach they called DINO which stands for Distributed Intensity NUMA Online.
Those guys have discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler.
Contention-aware algorithms focused primarily on UMA (Uniform Memory Access) systems, where there are multiple shared last-level caches (LLC), but only a single memory node equipped with the single memory controller, and memory can be accessed with the same latency from any core.
Remember that unlike on UMA systems, thread migrations are not cheap on NUMA systems because you also have to move the memory of the thread. So their approach to the problem is a mechanism that ensure that superfluous thread, those that are not likely to reduce contention, are not migrated in a NUMA system.
Existing contention aware algorithms perform NUMA-agnostic migration, and so a thread may end up running on a node remote from its memory. Actual vSphere NUMA scheduler is mitigating this issue by detecting when most of a VM’s memory is in a remote node and eventually load balancing and migrating memory as long as it doesn’t cause CPU contention to occur in that NUMA node.
Could DINO Be The Future Of vSphere NUMA Scheduler?
DINO organizes threads into broad classes according to their miss rates, and to perform migrations only when threads change their class, while trying to preserve thread-core affinities whenever possible. VMware vSphere NUMA optimizations would benefit from this by adding DINO approach to the existing optimization code by eventually migrate memory based on threads and their miss rates as well.
In vSphere 5.x VMware introduced vNUMA. It presents the physical NUMA typology to the guest operating system. vNUMA is enabled by default on VMs greater than 8 way but you can change this by modifying the numa.vcpu.min setting. Is this an attempt to hand over the critical NUMA scheduler job to the guest OS hoping it does a better job? I would say that it may seems a good approach but at the cost of losing control. In a shared environment such a VMware environment, the virtual machine monitor should be in control, always.
I’m not within the secret of Gods. I don’t have access to VMware developers and codes. Thus what I’m being saying here is based on a series of elements, readings, articles, vendor architecture documents that I have compiled and read through while preparing Santa Christmas Eve with an enhanced version of eggnog in my mug. Therefore I may be wrong, off-target, totally inaccurate in my conclusion…
If you have another point of view, piece of information I don’t have. If I missed something in my thought process just post a comment. I’ll be very happy to read from you!
Source: vmware, wikipedia.org, usenix.org, clavis.sourceforge.net