Bull’s BCS Architecture – Deep Dive – Part 3

The last couple of posts about Bull’s BCS Architecture have been quite intense, and I hope they delivered the level of technical detail you were expecting.

Here are the links to the entire deep dive series so far:

Now I want to talk about another feature that Bull’s BCS Architecture is leveraging: Intel RAS

What is RAS and what is its purpose?

Today’s crucial business challenges require the handling of unrecoverable hardware errors, while delivering uninterrupted application and transaction services to end users. Modern approaches strive to handle unrecoverable errors throughout the complete application stack, from the underlying hardware to the application software itself.

RAS Flow – Courtesy of Intel

Such solutions involve three components:

reliability, how the solution preserves data integrity,
availability, how it guarantees uninterrupted operation with minimal degradation,
serviceability, how it simplifies proactively and reactively dealing with failed or potentially failed components.

This post covers only the memory management mechanisms that provide reliability and availability. The next post will cover other mechanisms.

Memory Management mechanisms

Memory errors are among the most common hardware causes of machine crashes in production sites with large-scale systems.

Google® Inc. researchers conducted a two-year study of memory errors in Google’s server fleet (see Google Inc., “DRAM Errors in the Wild: A Large-Scale Field Study”).

The researchers observed that more than 8 percent of DIMMs and about one-third of the machines in the study were affected by correctable errors per year.

At the same time the annual percentage of detected uncorrected errors was 1.3 percent per machine and 0.22 percent per DIMM.

The capacity of memory modules has increased, following Moore’s law, over the last two decades. In the 80’s you could buy 2MB memory modules; 20 years later, 32GB memory modules hit the market. That is roughly a 16,000x improvement.
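For the curious, the arithmetic behind that figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes binary units (1 GB = 1024 MB) and takes the 20-year span quoted above at face value.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

old_module = 2 * MB    # module size quoted for the 80's
new_module = 32 * GB   # module size quoted 20 years later

# How many times larger the newer module is.
growth = new_module // old_module

# Number of capacity doublings, and the implied time per doubling.
doublings = math.log2(growth)
years_per_doubling = 20 / doublings

print(growth, doublings, round(years_per_doubling, 2))
```

With binary units the ratio is 16,384, i.e. 14 doublings, or one doubling roughly every year and a half over the 20-year span.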

One of the unique reliability and availability features of the bullion is its memory management and memory protection. From basic ECC up to Memory Mirroring, these protection mechanisms can provide up to 100% memory reliability on the bullion.

Let’s have a look at some of those memory protection mechanisms available in the bullion:

ECC memory

ECC memory is the traditional memory correction mechanism. It keeps a memory system effectively free from single-bit errors.
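To give a feel for how single-bit correction works, here is a toy sketch using the classic Hamming(7,4) code. It is not what a real DIMM does: ECC modules typically protect 64-bit words with 8 check bits using a wider code, and the function names below are my own, chosen for illustration only.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome spells out the 1-based position of a single flipped bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = encode([1, 0, 1, 1])
word[4] ^= 1                 # simulate a single bit flip in memory
data, position = decode(word)
print(data, position)
```

Any single flipped bit in the 7-bit word is located and corrected. Two flipped bits would be miscorrected by this toy code, which is one reason real memory systems add further check bits and stronger schemes.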

Double device Data Correction (DDDC)

Beyond ECC, the bullion provides much more sophisticated mechanisms such as Double Device Data Correction (DDDC), which corrects errors caused by up to two failing DRAM devices.

Double Device Data Correction – DDDC – Courtesy of Bull

DIMM & Rank Sparing

The commonly available DIMM Sparing is now being enhanced to provide Rank Sparing. With Rank Sparing of dual-rank DIMMs, only 12.5% of the memory capacity is set aside to enhance the reliability of the memory system. If the level of ECC-corrected errors becomes too high, the system fails over to the spares. Note that DIMM and Rank Sparing do not protect against uncorrectable memory errors.
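The failover logic can be sketched roughly as follows. This is a simplified model, not the bullion's firmware: the class name, the error threshold and the layout of eight ranks with one spare (which happens to match the 12.5% figure) are all assumptions made for illustration.

```python
class SparedChannel:
    """Toy model of rank sparing driven by corrected-error counts."""

    def __init__(self, active_ranks, spare_ranks, threshold):
        self.active = list(active_ranks)
        self.spares = list(spare_ranks)
        self.threshold = threshold  # hypothetical corrected-error limit
        self.corrected = {rank: 0 for rank in self.active}

    def report_corrected_error(self, rank):
        """Count one ECC-corrected error; return the spare used on failover."""
        if rank not in self.corrected:
            return None  # rank is not active (already retired or unknown)
        self.corrected[rank] += 1
        if self.corrected[rank] < self.threshold or not self.spares:
            return None
        # Too many corrected errors: swap the spare in for the failing rank.
        spare = self.spares.pop(0)
        self.active[self.active.index(rank)] = spare
        del self.corrected[rank]
        self.corrected[spare] = 0
        return spare


channel = SparedChannel([f"rank{i}" for i in range(7)], ["rank7"], threshold=3)
reserved = len(channel.spares) / (len(channel.active) + len(channel.spares))
for _ in range(3):
    used_spare = channel.report_corrected_error("rank2")
print(reserved, used_spare, channel.active)
```

Note that the model only reacts to corrected errors. An uncorrectable error never reaches this path, which is why sparing alone does not protect against it.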

DIMM Sparing- Rank Sparing – Courtesy of Bull

MCA Recovery

In a virtualized environment, the Virtual Machine Manager (VMM) shares the silicon platform’s resources with each virtual machine (VM) running an OS and applications.

In systems without MCA recovery, an uncorrectable data error would cause the entire system and all of its virtual machines to crash, disrupting multiple applications.

With MCA recovery, when an uncorrectable data error is detected, the system can isolate the error to only the affected VM. Here the hardware notifies the VMM (VMware vSphere 5.x supports this), which then attempts to retire the failing memory page(s) and notify the affected VMs and components.

If the failed page is in free memory then the page is retired and marked for replacement, and operation can return to normal. Otherwise, for each affected VM, if the VM can recover from the error it will continue operation; otherwise the VMM restarts the VM.

In all cases, once VM processing is done, the page is retired and marked, and operation returns to normal.

It is possible for the VM to notify its guest OS and have the OS take appropriate recovery actions, and even notify applications higher up in the software stack so that they take application-level recovery actions.
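The decision flow described above can be sketched in a few lines of Python. This is a simplified model of the logic, not VMware's or Intel's implementation: the `VM` class, the `can_recover` flag and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    can_recover: bool      # hypothetical: can the guest handle the error?
    restarted: bool = False


def handle_uncorrectable_error(page, owners, retired_pages):
    """Model the VMM's reaction to an uncorrectable error on one page.

    owners is the list of VMs using the page; it is empty for free memory.
    Returns a list of the actions taken, in order.
    """
    actions = []
    for vm in owners:
        if vm.can_recover:
            actions.append(f"{vm.name}: notified, continues")
        else:
            vm.restarted = True
            actions.append(f"{vm.name}: restarted")
    # In all cases the page is retired once VM processing is done.
    retired_pages.add(page)
    actions.append(f"page {page:#x}: retired")
    return actions


retired = set()
vms = [VM("vm-a", can_recover=True), VM("vm-b", can_recover=False)]
for line in handle_uncorrectable_error(0x1000, vms, retired):
    print(line)
```

The point of the model is the blast radius: only the VMs that own the failing page are touched, and a page in free memory is retired without any VM noticing.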

Here is a video demoing the MCA Recovery (MCAR) with VMware vSphere 5.0

Here is a diagram of the MCA recovery process:

Software-Assisted MCA Recovery Process – Courtesy of Intel

MCA Recovery is cool, but its main drawback is that it does not offer 100% memory reliability. The scrubbing process that goes through all memory pages to detect the unrecoverable error takes some time, and a few CPU cycles too.

If you’re fortunate enough, MCA Recovery detects the error and reports it to the VMM (VMware vSphere 5.x); otherwise you most probably end up with a purple screen of death.

Mirroring Mode

For 100% memory reliability, the bullion uses memory mirroring. Data are written simultaneously to two different memory modules operating in lockstep. It is the best memory protection mechanism for both reliability and availability, as it protects against both correctable and uncorrectable memory errors. On four-memory-channel systems such as the bullion, you cut the number of DIMM slots available for usable capacity in half.

The bullion can hold up to 4TB of memory, which is, surprisingly, double the maximum memory that VMware vSphere 5.1 supports so far 😉

Memory Mirroring – Courtesy of Bull

Mirroring mode offers 100% memory reliability and availability, but it costs an arm, well, two arms and maybe a leg as well… Memory performance also drops by as much as 50%.
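As a rough illustration of the trade-off, here is a toy model of mirrored memory in Python. It is only a sketch of the idea: the class and method names are hypothetical, errors are injected by hand, and real mirroring is done by the memory controller in hardware.

```python
class MirroredMemory:
    """Toy model: every write lands in two copies, reads can fail over."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        self.bad_primary = set()  # addresses with a simulated hard error

    def write(self, addr, value):
        # Each write goes to both copies.
        self.primary[addr] = value
        self.mirror[addr] = value

    def inject_uncorrectable_error(self, addr):
        self.bad_primary.add(addr)

    def read(self, addr):
        # Fall back to the mirror when the primary copy is unreadable.
        if addr in self.bad_primary:
            return self.mirror[addr]
        return self.primary[addr]


def usable_capacity(installed):
    """Mirroring keeps two copies, so only half the capacity is usable."""
    return installed // 2


mem = MirroredMemory(8)
mem.write(3, 42)
mem.inject_uncorrectable_error(3)
print(mem.read(3), usable_capacity(4096))
```

The read still returns the right value after the simulated failure, which is the whole point. The price shows up in `usable_capacity`: a hypothetical 4096 GB of installed memory would leave 2048 GB usable.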

I’ve gone through a small subset of the many features that make up RAS. Below is a full list of the Intel Xeon processor E7 family advanced RAS features.

Intel Xeon processor E7 family advanced RAS features – Courtesy of Intel

I’ve set up a little poll about the memory protection mechanism you rely on in your production environments. Thank you for taking the time to answer!

In the next post I will address some other RAS features available in the bullion. Stay tuned!

Source: Bull, Intel, Wikipedia

3 Responses to Bull’s BCS Architecture – Deep Dive – Part 3

Mike says:

September 3, 2013 at 18:18

This is very interesting! Yes, we all know that ECC RAM is required because of random bit flips in RAM. But don’t these bit flips also occur on hard disks, and on a wider scale? How do you resolve hard disk bit flips?

- PiroNet says:
  
  September 3, 2013 at 19:25
  
  Hi Mike and thanks for your comment.
  Good question indeed. Hard disks also use ECC technologies to correct bit flips, with RAID protection added on top.
  Now I don’t know for sure whether, for the same capacity, more bit flips occur on hard disks than in memory. I’ll leave that to statisticians 🙂
  
  [UPDATE]
  
  About statistics, have a read at http://www.zdnet.com/blog/storage/dram-error-rates-nightmare-on-dimm-street/638
  And here http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/disk_failures.pdf
