The last couple of posts about Bull’s BCS Architecture have been quite intense, and I hope they delivered the technical detail you were expecting.
Here are the links to the entire deep dive series so far:
Now I want to talk about another feature that Bull’s BCS Architecture leverages: Intel RAS.
What is RAS and what is its purpose?
Today’s crucial business challenges require the handling of unrecoverable hardware errors, while delivering uninterrupted application and transaction services to end users. Modern approaches strive to handle unrecoverable errors throughout the complete application stack, from the underlying hardware to the application software itself.
Such solutions involve three components:
- reliability, how the solution preserves data integrity,
- availability, how it guarantees uninterrupted operation with minimal degradation,
- serviceability, how it simplifies proactively and reactively dealing with failed or potentially failed components.
This post covers only the memory management mechanisms that provide reliability and availability. The next post will cover other mechanisms.
Memory Management mechanisms
Memory errors are among the most common hardware causes of machine crashes in production sites with large-scale systems.
Google® Inc. researchers conducted a two-year study of memory errors in Google’s server fleet (see Google Inc., “DRAM Errors in the Wild: A Large-Scale Field Study”).
Researchers observed that more than 8 percent of DIMMs and about one-third of the machines in the study were affected by correctable errors per year.
At the same time the annual percentage of detected uncorrected errors was 1.3 percent per machine and 0.22 percent per DIMM.
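To get a feel for what these per-unit rates mean at fleet scale, here is a rough back-of-the-envelope sketch. It assumes errors hit machines independently of each other, which is a simplification on my part and not something the study claims; the fleet sizes are made up for illustration.

```python
# Annual uncorrected-error rates quoted above from the Google study.
UE_RATE_PER_MACHINE = 0.013
UE_RATE_PER_DIMM = 0.0022


def prob_at_least_one(rate: float, units: int) -> float:
    """Chance that at least one of `units` is hit within a year.

    Assumes every unit fails independently (a simplifying assumption).
    """
    return 1.0 - (1.0 - rate) ** units


def expected_affected(rate: float, units: int) -> float:
    """Expected number of units hit within a year."""
    return rate * units


# Hypothetical fleet sizes, chosen only for illustration.
for fleet in (10, 100, 1000):
    p = prob_at_least_one(UE_RATE_PER_MACHINE, fleet)
    n = expected_affected(UE_RATE_PER_MACHINE, fleet)
    print(f"{fleet:5d} machines: P(at least one hit) = {p:.1%}, expected hits = {n:.1f}")
```

Even with a small per-machine rate, the odds of seeing at least one uncorrected error somewhere in a mid-sized fleet climb quickly, which is why the mechanisms below matter.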
Memory module capacity has increased, following Moore’s law, over the last two decades. In the 80’s you could buy 2MB memory modules; 20 years later, 32GB memory modules hit the market. That is roughly a 16,000x improvement.
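The growth figure is easy to check. The sketch below assumes binary units (2 MiB and 32 GiB) and a span of exactly 20 years, both of which are my reading of the numbers rather than anything more precise.

```python
import math

OLD_MODULE_BYTES = 2 * 1024**2    # 2MB module (taken as 2 MiB)
NEW_MODULE_BYTES = 32 * 1024**3   # 32GB module (taken as 32 GiB)
YEARS = 20                        # assumed time span between the two

ratio = NEW_MODULE_BYTES // OLD_MODULE_BYTES
doublings = math.log2(ratio)
months_per_doubling = YEARS * 12 / doublings

print(f"capacity ratio:      {ratio}x")
print(f"number of doublings: {doublings:.0f}")
print(f"months per doubling: {months_per_doubling:.1f}")
```

Under those assumptions the capacity doubled fourteen times, or about once every 17 months, which sits comfortably in Moore’s law territory.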
One of the unique reliability and availability features of the bullion is its RAM memory management and memory protection. From basic ECC up to Memory Mirroring, memory protection mechanisms can guarantee up to 100% memory reliability on the bullion.
Let’s have a look at some of those memory protection mechanisms available in the bullion:
Double Device Data Correction (DDDC)
Over and above traditional memory correction mechanisms such as ECC memory, which keeps a memory system effectively free from single-bit errors, the bullion provides much more sophisticated mechanisms such as Double Device Data Correction (DDDC), which corrects dual recoverable errors.
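To make single-bit correction concrete, here is a toy Hamming(7,4) code in Python. It is only an illustration of the principle: real ECC DIMMs typically protect 64-bit words with 8 check bits, and DDDC works at the level of whole DRAM devices, which this sketch does not attempt to model. The function names are mine.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the flipped position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1                      # simulate a single-bit memory error
recovered, position = decode(stored)
print(recovered == word, position)  # the data survives the bit flip
```

The syndrome directly names the position of the flipped bit, so the read path can repair it on the fly. Two flipped bits in the same codeword would defeat this toy code, which is exactly the gap the more advanced mechanisms aim to close.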
DIMM & Rank Sparing
The commonly available DIMM Sparing is now enhanced to provide Rank Sparing. With Rank Sparing of dual-rank DIMMs, only 12.5% of the memory is set aside to enhance the reliability of the memory system. If the level of ECC-corrected errors becomes too high, the system fails over to the spare. Note that DIMM and Rank Sparing do not protect against uncorrectable memory errors.
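Here is a minimal sketch of the sparing idea: count corrected errors per rank and fail over to the spare once a rank crosses a threshold. The 12.5% figure is consistent with reserving one rank out of eight, so the sketch assumes that layout (for example four dual-rank DIMMs in one sparing domain). The class name, the layout and the threshold value are all hypothetical; the real logic lives in the memory controller and firmware.

```python
class RankSparingDomain:
    """Toy model: N ranks, the last one held back as a spare (assumed layout)."""

    def __init__(self, total_ranks: int = 8, threshold: int = 100):
        self.total_ranks = total_ranks
        self.threshold = threshold        # hypothetical corrected-error limit
        self.spare = total_ranks - 1      # rank reserved as the spare
        self.spare_in_use = False
        self.corrected = {r: 0 for r in range(total_ranks - 1)}
        self.remap = {}                   # failed rank -> spare rank

    def spare_fraction(self) -> float:
        """Share of the installed memory that is reserved for sparing."""
        return 1 / self.total_ranks

    def report_corrected_error(self, rank: int) -> bool:
        """Count one ECC-corrected error; return True if it triggers failover."""
        if rank in self.remap:
            return False                  # this rank was already replaced
        self.corrected[rank] += 1
        if self.corrected[rank] > self.threshold and not self.spare_in_use:
            self.remap[rank] = self.spare
            self.spare_in_use = True
            return True
        return False


domain = RankSparingDomain()
print(f"reserved: {domain.spare_fraction():.1%}")
for _ in range(101):
    failed_over = domain.report_corrected_error(3)
print(failed_over, domain.remap)
```

Note that the model only reacts to corrected errors, and that a single spare can only be used once: after the failover, a second noisy rank has nowhere to go.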
MCA Recovery
In a virtualized environment, the Virtual Machine Manager (VMM) shares the silicon platform’s resources with each virtual machine (VM) running an OS and applications.
In systems without MCA recovery, an uncorrectable data error would cause the entire system and all of its virtual machines to crash, disrupting multiple applications.
With MCA recovery, when an uncorrectable data error is detected, the system can isolate the error to only the affected VM. Here the hardware notifies the VMM (VMware vSphere 5.x supports this), which then attempts to retire the failing memory page(s) and notify affected VMs and components.
If the failed page is in free memory then the page is retired and marked for replacement, and operation can return to normal. Otherwise, for each affected VM, if the VM can recover from the error it will continue operation; otherwise the VMM restarts the VM.
In all cases, once VM processing is done, the page is retired and marked, and operation returns to normal.
It is possible for the VM to notify its guest OS and have the OS take appropriate recovery actions, and even notify applications higher up in the software stack so that they take application-level recovery actions.
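The decision flow described above can be sketched in a few lines of Python. This is a toy model of the logic only, not how a VMM or vSphere implements it; the `VM` class, the `can_recover` flag and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    pages: set          # physical pages currently backing this VM
    can_recover: bool   # whether the guest can handle the error itself
    restarts: int = 0


def handle_uncorrectable_error(page: int, vms: list, retired: set) -> list:
    """Apply the recovery flow to one failing page and return a log of actions."""
    actions = []
    affected = [vm for vm in vms if page in vm.pages]
    if not affected:
        actions.append(f"page {page} is free memory, no VM affected")
    for vm in affected:
        if vm.can_recover:
            actions.append(f"{vm.name}: recovered, continues running")
        else:
            vm.restarts += 1
            actions.append(f"{vm.name}: restarted by the VMM")
        vm.pages.discard(page)
    # In all cases the page ends up retired and marked for replacement.
    retired.add(page)
    actions.append(f"page {page} retired")
    return actions


vms = [VM("db", {10, 11}, can_recover=True), VM("web", {20}, can_recover=False)]
retired = set()
for failing_page in (99, 10, 20):
    for line in handle_uncorrectable_error(failing_page, vms, retired):
        print(line)
```

The important property is the blast radius: a bad page costs at most the VMs that were using it, and every other VM keeps running.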
Here is a video demoing MCA Recovery (MCAR) with VMware vSphere 5.0:
Here is a diagram of the MCA recovery process:
MCA Recovery is cool, but its main drawback is that it does not offer 100% memory reliability. The scrubbing process that goes through all memory pages to detect the unrecoverable error takes some time, and a few CPU cycles too.
If you’re fortunate enough, MCA Recovery detects the error and reports it to the VMM (VMware vSphere 5.x); otherwise you most probably end up with a purple screen of death.
For 100% memory reliability, the bullion uses memory mirroring: data are written simultaneously to two different memory modules in lockstep mode. It is the best memory protection mechanism for both reliability and availability, as it protects against both correctable and uncorrectable memory errors. On four-memory-channel systems such as the bullion, you cut the number of usable DIMM slots in half.
The bullion can hold up to 4TB of memory, which is, surprisingly, double the maximum memory that VMware vSphere 5.1 supports so far 😉
Mirroring mode offers 100% memory reliability and availability, but it costs an arm, well two arms and maybe a leg as well… Memory performance also drops by as much as 50%.
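The trade-off is easy to see in a toy model: every write goes to two copies, a read falls back to the second copy when the first one is bad, and only half of the installed capacity is usable. The class and method names below are hypothetical, and the real mechanism is implemented in the memory controller, not in software.

```python
class MirroredMemory:
    """Toy model of mirrored memory: each address is stored in two places."""

    def __init__(self, installed_cells: int):
        self.usable_cells = installed_cells // 2   # half the capacity is the mirror
        self.primary = [0] * self.usable_cells
        self.mirror = [0] * self.usable_cells
        self.failed_primary = set()                # addresses with a bad primary copy

    def write(self, addr: int, value: int) -> None:
        # Every write lands in both copies, which is where the cost comes from.
        self.primary[addr] = value
        self.mirror[addr] = value

    def inject_uncorrectable_error(self, addr: int) -> None:
        """Simulate an uncorrectable error in the primary copy."""
        self.failed_primary.add(addr)
        self.primary[addr] = None

    def read(self, addr: int) -> int:
        # Fall back to the mirror when the primary copy is unusable.
        if addr in self.failed_primary:
            return self.mirror[addr]
        return self.primary[addr]


mem = MirroredMemory(installed_cells=16)
mem.write(5, 42)
mem.inject_uncorrectable_error(5)
print(mem.usable_cells, mem.read(5))
```

The read still succeeds after the injected error, which is the availability win; the halved `usable_cells` is the price paid for it.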
I’ve gone through a small subset of the many RAS features available. Below is a full list of the Intel Xeon processor E7 family advanced RAS features.
I’ve set up a little poll about the memory protection mechanism you rely on in your production environments. Thank you for taking the time to answer!
In the next post I will address some other RAS features available in the bullion. Stay tuned!
Source: Bull, Intel, Wikipedia