Two Main Scale-Up Server Architectures – Part 1
To address increasingly demanding workloads, processor sockets are added seamlessly within a single server: you are scaling up. The sockets are connected together, along with the memory and IO boards, so applications can benefit from more compute power.
Refer to the first article in this series: Scale-Out And Scale-Up Architectures – The Business-Critical Application Point Of View.
There are two broad scale-up server architectures:
- the “glueless” architecture
- the “glued” architecture
The “glueless” architecture
The “glueless” architecture was designed by Intel. It was implemented in the Intel Xeon E7 series.
When building servers larger than 4 sockets, the processor sockets are connected directly to each other through Intel QPI links.
The Intel QPI links are used to access memory, IO, and the network, as well as the other processors.
A “glueless” socket uses one of its four Intel QPI links to connect the processor socket to IO, and the remaining three to interconnect with other processor sockets.
In an 8-socket configuration, each processor socket therefore connects directly to three other sockets, while the connections to the remaining four sockets are indirect (multiple QPI hops).
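The direct/indirect split above can be verified with a small sketch. The topology below is an illustrative assumption (a 3-D hypercube, where sockets whose binary IDs differ in one bit are linked), not Intel's exact 8-socket wiring; any 3-link-per-socket layout gives three direct and four indirect peers.

```python
from collections import deque

# Hypothetical 8-socket "glueless" topology: a 3-D hypercube. Each socket
# keeps one QPI link for IO and uses its 3 remaining links for sockets.
# Illustrative assumption only, not Intel's actual 8-socket wiring.
N = 8
links = {s: [s ^ (1 << b) for b in range(3)] for s in range(N)}

def hops(src):
    """Breadth-first search: QPI hop count from src to every socket."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in links[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

d = hops(0)
direct = sum(1 for s in d if d[s] == 1)
indirect = sum(1 for s in d if d[s] > 1)
print(f"socket 0: {direct} direct peers, {indirect} indirect peers")
# -> socket 0: 3 direct peers, 4 indirect peers
```

Every remote access to an indirect peer costs at least one extra QPI hop, which is one source of the non-linear scaling discussed below.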
The advantages of a “glueless” architecture:
- no specific development or expertise is required from the server manufacturer; any server maker can build an 8-socket server
- consequently, 4-socket and 8-socket servers cost less
The disadvantages of a “glueless” architecture:
- the TCO goes up once you are forced to scale out
- limited to 8-socket servers
- cache coherency is harder to maintain as the socket count increases
- performance does not increase linearly
- the price/performance ratio worsens
- efficiency is suboptimal when running large VMs
- up to 65% of Intel QPI link bandwidth can be consumed by the QPI source broadcast snoop protocol
What’s the issue with the Intel QPI source broadcast snoop protocol? To maintain cache coherency, every read request must be reflected to all processor caches as a snoop; you can compare this to a broadcast on an IP network. Each processor must check whether it holds the requested memory line and provide the data if it has the most up-to-date version. When the latest version sits in another cache, the source broadcast snoop protocol delivers minimum latency, since the memory line is copied directly from one cache to the next. The downside is that all reads result in snoops to all other caches, and these snoop packets consume cache cycles and link bandwidth that would otherwise be used for data transfers.
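A back-of-envelope model shows why broadcast snooping eats so much bandwidth as sockets are added. The packet sizes below are hypothetical placeholders, not measured QPI figures; the point is the trend, not the exact percentages.

```python
# Back-of-envelope model of snoop overhead in a source broadcast protocol.
# Packet sizes are illustrative assumptions, not measured QPI numbers.
def snoop_overhead(sockets, snoop_bytes=9, data_bytes=64):
    """Fraction of link traffic spent on snoops rather than data.

    Every read broadcasts a snoop to the other (sockets - 1) caches
    and collects a response from each of them, while the useful payload
    stays one cache line (data_bytes) regardless of socket count.
    """
    snoop_traffic = 2 * (sockets - 1) * snoop_bytes  # request + response
    return snoop_traffic / (snoop_traffic + data_bytes)

for n in (2, 4, 8):
    print(f"{n} sockets: {snoop_overhead(n):.0%} of traffic is snoop overhead")
```

With these assumed sizes, the overhead fraction roughly triples between 2 and 8 sockets: the payload per read is constant, but the snoop traffic grows with every additional socket.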
The primary workloads affected by the Intel QPI source broadcast snoop issue are:
- Java applications
- large databases
- latency sensitive applications
No bottleneck should result from a scale-up approach, otherwise the architecture is useless. Performance should therefore increase linearly with the added resources.
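That linearity requirement is easy to express as a simple check: divide the measured speedup by the ideal (linear) speedup for each socket count. The throughput numbers below are hypothetical placeholders for illustration.

```python
# Scaling-efficiency check against a 1-socket baseline.
# Throughput figures are hypothetical placeholders, not benchmark data.
def scaling_efficiency(throughput):
    """Measured speedup divided by ideal linear speedup, per socket count."""
    base = throughput[1]
    return {n: (t / base) / n for n, t in throughput.items()}

measured = {1: 100.0, 2: 190.0, 4: 340.0, 8: 560.0}  # hypothetical
for n, e in sorted(scaling_efficiency(measured).items()):
    print(f"{n} sockets: {e:.0%} of linear scaling")
```

In this made-up example, efficiency drops from 100% at 1 socket to 70% at 8 sockets: exactly the kind of degradation, driven by snoop traffic and indirect hops, that a well-designed scale-up architecture must avoid.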
In the next part, we will discuss the “glued” architecture and how it can address the drawbacks of the “glueless” architecture while maintaining linear performance.
Sources: Bull, Intel, Wikipedia