In this post I want to focus on the intrinsic connection between DIMMs and the Intel Nehalem memory architecture. I was reading some good papers on this topic and discovered some interesting details that I want to share with you in this blog post.
At the end of the exercise, by picking the right combination of memory model, memory size, channel population and processor model, you can make substantial cost savings, especially at scale.
Beforehand we need to take a closer look at these two core server items: memory and processors. Let’s start with the UDIMM and RDIMM memory architectures, then I’ll go through the Intel Nehalem/Westmere memory architecture. Finally I will walk through a couple of scenarios to exercise what we have learned here. For both scenarios I will pick what I think to be the right memory and processor combinations. Feel free to comment and share your experience.
This is quite a long post so bear with me 😉
UDIMMs versus RDIMMs
There are some differences between UDIMMs and RDIMMs that are important when choosing the best options for memory performance. To make a long story short, here is a summary of the comparison between UDIMMs and RDIMMs:
- Typically UDIMMs are a bit cheaper than RDIMMs
- For one DIMM per memory channel UDIMMs have slightly better memory bandwidth than RDIMMs.
- For two DIMMs per memory channel RDIMMs have better memory bandwidth than UDIMMs.
- For the same capacity, RDIMMs require slightly more power per DIMM than UDIMMs.
- RDIMMs also provide an extra measure of RAS:
- Address / control signal parity detection.
- RDIMMs can use x4 DRAMs so SDDC can correct all DRAM device errors even in independent channel mode.
- UDIMMs are currently limited to 4GB in a Dual Rank mode.
- UDIMMs are limited to two DIMMs per memory channel.
So you could go for UDIMMs because they are a bit cheaper, a bit faster and require less power than RDIMMs for the same capacity.
On the other hand you would go for RDIMMs if you need higher capacity per memory module, or more reliable error control and data correction than UDIMMs can offer.
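If you like seeing a rule of thumb as something executable, here is a toy Python sketch, entirely my own illustration, that just condenses the bullets above:

```python
# Toy rule of thumb condensing the UDIMM/RDIMM bullets above.

def pick_dimm_type(gb_per_dimm, dimms_per_channel, need_extra_ras):
    # UDIMMs top out at 4GB per module and two DIMMs per channel,
    # and they lack the parity/SDDC RAS features of RDIMMs.
    if gb_per_dimm > 4 or dimms_per_channel > 2 or need_extra_ras:
        return "RDIMM"
    return "UDIMM"  # cheaper, slightly faster at 1 DPC, less power

print(pick_dimm_type(4, 1, False))  # UDIMM
print(pick_dimm_type(8, 2, False))  # RDIMM
print(pick_dimm_type(4, 2, True))   # RDIMM
```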
So we have defined the pros and cons for these two memory models. Keep this in mind; now let’s have a closer look at the Intel Nehalem/Westmere memory architecture and the processor models available.
[UPDATE] The LRDIMM case. This is a new type of memory that stands for Load Reduced DIMM. It allows massive memory expansion without sacrificing performance. Remember that as soon as you populate the third DIMM slot of a channel, the memory speed drops to 800MHz. LRDIMM increases capacity whilst maintaining high memory speed by fooling the memory controller. The LR buffer makes a quad-rank DIMM look like a dual-rank DIMM to the memory controller and therefore allows up to three DIMMs per channel, and since that is still below the eight-rank-per-channel limit the memory speed remains at 1333MHz. How cool is that 🙂 Obviously you can’t mix LRDIMMs with either RDIMMs or UDIMMs. If you are looking for maximum capacity and increased performance, look at LRDIMMs. More in DDR3 for Dummies – 2nd edition
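To make the rank arithmetic concrete, here is a minimal sketch of the rank accounting as I read it; the function is mine and purely illustrative, the eight-rank limit and the buffer behaviour are as described above:

```python
# Sketch of the rank accounting behind the LRDIMM trick, as I read it.
# The LR buffer makes a quad-rank module look dual-rank to the controller.

RANK_LIMIT_PER_CHANNEL = 8

def ranks_seen_by_controller(dimms_per_channel, ranks_per_dimm, lrdimm=False):
    seen_per_dimm = 2 if lrdimm else ranks_per_dimm  # buffer hides the ranks
    return dimms_per_channel * seen_per_dimm

# Three quad-rank RDIMMs per channel blow past the limit...
print(ranks_seen_by_controller(3, 4))               # 12 > 8
# ...but three quad-rank LRDIMMs stay within it, keeping 1333MHz.
print(ranks_seen_by_controller(3, 4, lrdimm=True))  # 6 <= 8
```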
Intel Nehalem-DP/Westmere-DP Memory Architecture and Processor Models
There is no difference in the memory architecture between Nehalem and Westmere. Let me summarize below what is important to me about this memory architecture:
- A 2-way Xeon system (DP) has one QPI channel to connect to the other socket and one QPI channel to connect to the IOH chipset (IO Hub). Optionally you can have two IOHs.
- QPI operates at a clock rate of either 2.4 GHz(=4.8GT/s), 2.93 GHz(=5.86GT/s), or 3.2 GHz(=6.4GT/s).
- The QPI has a bi-directional maximum bandwidth of 6.4GT/s x 2 Bytes/transfer x 2 directions = ~25.6GB/s.
- GT/s is calculated with 20 bits (or 20 lanes) in mind, whilst the GB/s is calculated on the real payload of 16 bits (2 Bytes). For more information on this particular topic, read An Introduction to the Intel® QuickPath Interconnect.
- Nehalem/Westmere supports up to 18 DIMM slots with DDR3 memory.
- In general servers support DDR3 DIMMs with a maximum memory clock speed of 166MHz, which gives a data rate of 1333MT/s. This is often misleadingly advertised as the I/O clock rate by labeling the MT/s as MHz.
- The three DDR3 channels to local DRAM support a maximum bandwidth of 3 channels x 8 Bytes x 1.333GT/s = ~32GB/s. That is ~10.6GB/s per channel (see the sketch after this list).
- At 1066MT/s the maximum bandwidth is ~25.58GB/s, that is ~8.52GB/s per channel.
- At 800MT/s the maximum bandwidth is ~19.2GB/s, that is ~6.4GB/s per channel.
- The available bandwidth to access memory blocks on the other socket is bound by the QPI link speed.
- The available bandwidth through the QPI link is 12.8GB/s one way, that is approximately 40% of the bandwidth to the local DRAM.
- At the time of authoring this post, 12MB is the maximum shared L3 cache available for Intel Xeon 5000 series.
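To keep all those numbers straight, here is a small Python sketch that simply restates the theoretical bandwidth arithmetic from the list above; nothing in it is measured:

```python
# Theoretical bandwidth arithmetic for QPI and DDR3, as used in the list.

# QPI: 6.4 GT/s with a 16-bit (2 Byte) payload per transfer, per direction.
qpi_gt_s = 6.4
qpi_one_way = qpi_gt_s * 2             # 12.8 GB/s per direction
qpi_bidir = qpi_one_way * 2            # 25.6 GB/s both directions combined
print(f"QPI: {qpi_one_way} GB/s one way, {qpi_bidir} GB/s bi-directional")

# DDR3: 8 Bytes (64-bit data path) per transfer per channel, 3 channels.
def ddr3_bandwidth(mt_per_s, channels=3):
    per_channel = mt_per_s / 1000 * 8  # GB/s for one channel
    return per_channel, per_channel * channels

for rate in (1333, 1066, 800):
    per_ch, total = ddr3_bandwidth(rate)
    print(f"DDR3-{rate}: {per_ch:.2f} GB/s per channel, {total:.2f} GB/s total")

# Remote memory access is bound by the one-way QPI bandwidth:
ratio = qpi_one_way / ddr3_bandwidth(1333)[1]
print(f"QPI one way is {ratio:.0%} of local DRAM bandwidth")
```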
The diagram below shows the memory layout of a Nehalem DP Server. By the way DP stands for Dual-Processor.
Note the text in green, I will talk about that later in the post.
The next diagram lists the theoretical bandwidth for local and remote memory accesses.
Note that the remote memory access goes through the QPI link.
But that’s not the only thing you need to think about. There are other considerations that are often overlooked. For instance, the memory frequency at which the system operates is determined by the minimum of three factors, which I’ll boil down to a short sketch after walking through each one:
- DIMM frequency.
- Memory controller speed.
- Channel population scheme.
First, the memory controller speed is limited by the processor model. In general, Xeon 5600 ‘X’ series processors run at a maximum memory speed of 1333 MHz, while ‘L’ and ‘E’ series processors run at either 1066 or 800 MHz depending on the CPU clock frequency. This is not a hard rule though, and there are exceptions that I will call ‘marketing exceptions’; better to check the technical details for each processor model.
Second, the operating memory speed is dictated by the DIMM frequency. 1066 MHz DIMMs cannot run at 1333 MHz, but both 1333 MHz and 1066 MHz DIMMs can clock down to lower frequencies.
Finally, channel memory population schemes dictate that one DIMM-Per-Channel (DPC) or two DPC can run at either 1066 or 1333 MHz, depending on processor model and DIMM type. As soon as you put more than two DPC in any one memory channel, the speed of all the memory drops to 800 MHz.
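Here is a minimal sketch of that minimization function, assuming the simple ‘more than two DPC forces 800MHz’ rule described above; the input frequencies are whatever you read off the DIMM and processor spec sheets:

```python
# Effective memory frequency = min(DIMM frequency, controller max,
# limit imposed by the channel population scheme).

def effective_memory_freq(dimm_mhz, controller_max_mhz, dimms_per_channel):
    # More than two DIMMs per channel drags everything down to 800MHz;
    # one or two DPC can go as fast as the other two factors allow.
    population_limit = 800 if dimms_per_channel > 2 else 1333
    return min(dimm_mhz, controller_max_mhz, population_limit)

print(effective_memory_freq(1333, 1333, 2))  # 1333: fast DIMMs, 'X' series CPU
print(effective_memory_freq(1333, 1333, 3))  # 800: third DPC kills the speed
print(effective_memory_freq(1333, 1066, 1))  # 1066: controller is the limit
print(effective_memory_freq(1066, 1333, 1))  # 1066: DIMM is the limit
```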
The table below summarizes this topic:
The performance difference between 1333MHz and 1066MHz is about 8.5%, between 1333MHz and 800MHz about 28.5%, and between 1066MHz and 800MHz about 22%.
Below is a table grouping the different DIMM capacities and types available for an HP ProLiant BL460c G7. Note that on some servers you can drop to 800MHz simply by populating a second channel, e.g. the HP BL490 G7.
On the same topic, you also need to focus on the processor model. Intel has released many different product lines of Nehalem/Westmere processors, and each combination of processor die and package has both a separate codename and a product code.
Just for the x86 server market, Intel has four different Xeon processor families/sequences, and for each family/sequence a bunch of different processor numbers such as the X5690 or the E5502.
Let’s have a look at the dual-socket Intel Xeon 5000 Processor Sequence, and more precisely the 5500 and 5600 sequences. There you have something like 40 different processor numbers available, making your choice even more difficult.
For each processor number, you have the processor clock rate, number of cores and threads, L3 cache size, QPI bus speed, HT technology, TDP, etc. All these processor characteristics are important to make the right choice, but they also make that choice overly complicated.
The right combination and Business Requirements
With all of these options (UDIMMs, RDIMMs, various DIMM sizes and speeds, low-voltage DIMMs, processor frequency and other processor technology features, etc.) there is a vast number of possibilities, and it’s not always obvious which combination of hardware elements to interlink to get something consistent and coherent with regard to both your business requirements and the server architecture. It’s like a giant puzzle of 1000 pieces of information that you need to order logically to come up with the best combination.
Notice I’m not asking ‘which options give the highest performance‘, because companies are not always tied to a pure high-performance business requirement. Energy efficiency or high consolidation can also be your company’s number one business requirement.
Note that in these economically hard times, cost savings are mandatory for many companies and may overrule the traditional business requirements cited above. The cost-savings rule helps keep the company’s business requirements within budget boundaries.
Sure, your company can have business requirements other than the ones above; I know at least one company where end-user experience is rated number one. A business requirement list is definitely not limited to three or four items.
In many cases companies have multiple business requirements: we need high performance and high consolidation at the lowest cost… Huh! The goal is to juggle these business requirements to come up with the right combination.
Sometimes this turns into the Triangle Project, with no viable combination 🙂
Imagine the following scenario: your company’s server vendor policy is HP and for this project you have picked the HP BL460c G7. The business requirement is high consolidation, thus you need memory, plenty of memory. You load up the server with 12x32GB RDIMM memory modules for a maximum memory size of 384GB running at … 800MHz. Now what processor would you choose in this case? Would you buy the X5690 @ $1663.00 or the E5649 @ $774.00?
In this specific config the memory runs at the same speed for both processors, that is 800MHz. QPI is higher for the X5690, but you can’t use it at full throttle anyway because the memory speed is down to 800MHz. Thus between the two CPUs only the clock speed makes a difference, ~1GHz more for the X5690, but it’s also more than 2x the price of the E5649. Is it worth the $889.00 extra?
By loading up with 12x16GB memory modules for a total of 192GB, your memory frequency remains at 1333MHz, and fast processors, one in the X series, are now a valid option. But then you don’t stick to your business requirement anymore! You have gone from the highest possible consolidation ratio (100%) to half of that (50%).
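Putting rough numbers on that trade-off, here is a naive clock-per-dollar comparison; the list prices are the ones quoted above and the clocks are the published base frequencies, while the ‘value’ metric is my own toy illustration that ignores Turbo, cache and QPI differences:

```python
# Scenario 1: memory is stuck at 800MHz either way, so the CPU clock
# difference is roughly all the extra money buys you.

cpus = {
    "X5690": {"price": 1663.00, "ghz": 3.46},  # base clock, Turbo ignored
    "E5649": {"price":  774.00, "ghz": 2.53},
}

delta_price = cpus["X5690"]["price"] - cpus["E5649"]["price"]
delta_ghz = cpus["X5690"]["ghz"] - cpus["E5649"]["ghz"]
print(f"${delta_price:.2f} extra buys ~{delta_ghz:.2f} GHz more base clock")
print(f"that is about ${delta_price / delta_ghz:.0f} per extra GHz")

for name, c in cpus.items():
    print(f"{name}: {c['ghz'] * 1000 / c['price']:.2f} MHz per dollar")
```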
Another scenario: you have again picked the HP BL460c G7. This time the business requirement is energy efficiency. Remember, UDIMMs use less power than RDIMMs, thus you go for them and load up the server with 12x4GB UDIMM memory modules, two per channel. For the same capacity, an RDIMM requires 0.5 to 1.0 Watt more. Now what processor would you choose? The one with the lowest power consumption might be the good choice, like the L5609. But then you do not benefit from the UDIMMs running at 1333MHz because the CPU supports a maximum of 1066MHz… What about going for 6x32GB RDIMM LV (1.35V instead of 1.5V) running at 1066MHz, for a total capacity of 192GB (4x more than the UDIMM maximum capacity)? And choosing the L5630, also using only 40W, but with HT and Turbo Boost Technology for when you need extra power…
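And a quick sanity check of the arithmetic in that scenario; only the DIMM counts, capacities and the ~0.5-1.0W delta from the text go in, nothing here is measured:

```python
# Scenario 2 arithmetic: only figures from the text go in here.

udimm = {"dimms": 12, "gb_each": 4}      # 48GB total, the UDIMM maximum
rdimm_lv = {"dimms": 6, "gb_each": 32}   # 192GB total at 1.35V

udimm_gb = udimm["dimms"] * udimm["gb_each"]
rdimm_gb = rdimm_lv["dimms"] * rdimm_lv["gb_each"]
print(f"UDIMM: {udimm_gb}GB vs RDIMM LV: {rdimm_gb}GB "
      f"({rdimm_gb // udimm_gb}x the capacity)")

# Per the text, an RDIMM draws ~0.5-1.0W more than a UDIMM of the same
# capacity, so at equal DIMM counts the UDIMM build saves roughly:
for delta_w in (0.5, 1.0):
    print(f"at {delta_w}W per DIMM: {udimm['dimms'] * delta_w:.0f}W saved")
```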
Take these two scenarios just for what they are; they may not reflect any real case. They are just there to demonstrate the thinking process with the information we gathered today.
I don’t have a secret formula that will sort out this kind of puzzle either. But I hope I have shed some light on these little-known but important links between the memory and the Intel Nehalem architectures.
Here are some tools that I hope will help you pick the right combination:
There are two other little-known puzzle pieces I will shed some light on next time: processor clock frequency sensitive applications and memory bandwidth sensitive applications. So stay tuned 😉