Skylake and newer Ring Bus

Problem Description

In Intel Skylake and newer, can each core's memory subsystem directly participate in ring bus traffic? This block diagram (and the accompanying information) would seem to suggest so.

If so, what part of the subsystem is attached?

Tags: x86, intel, cpu-architecture, cpu-cache

Solution


The block diagram you referred to is for the Skylake client processors, which contain 2 or 4 physical cores. According to the Wikipedia page on Skylake, these include all Mainstream Desktop processors, all Mobile processors, and all Xeon E3 v5 processors. All of them use a ring interconnect. Even though client-grade Skylake processors include at most 4 physical cores, newer generations such as Coffee Lake may include 6 physical cores and still use the ring topology. The other Skylake processors (the server ones) use a mesh interconnect instead; these would be the first high-end multicore Intel processors to use a mesh interconnect. Intel has a patent for it, which actually goes into some detail about how it works and what is connected to what.

The way each core is connected to the interconnect is similar irrespective of the interconnect's topology. The L1 fill buffers and the L2 are not directly connected to the interconnect. Instead, there is a component that plays the role of an interconnect agent: it knows how to create, send, and receive messages over the interconnect to and from one or more nodes. Even though this may not be mentioned explicitly in the page you referenced, it is mentioned in the page on the Skylake server processors, which does a better job of explaining how it works at a basic level. At least, the figures there are nicer than those in the patent.

Each core is connected to a common mesh stop (CMS) (1), which is part of the uncore (the stuff outside the cores, but on-chip). The CMS knows the identifiers of all nodes on the interconnect, including its own. When it receives a message not intended for its node, it forwards it to the next node on the planned route. If the message is intended for its node, it is transferred to a component of the node called the cache and home agent (CHA) (2). According to the patent, the CHA connects the core's L2 to the on-node L3 slice and to the CMS (which is essentially the gateway between the node and the interconnect). In the client-grade Skylake processors there is no CHA, only a "CA" (a term I'm coining just for the sake of this discussion); I'll come back to this below.
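
To make the forwarding step concrete, here is a minimal C++ sketch of ring stops passing a message toward its destination node. Everything in it (the names, the single-direction routing) is an assumption made for illustration; the real CMS routing logic is not publicly documented.

// Toy model of ring/mesh stops forwarding a message to its destination node.
// All names and the uni-directional routing policy are assumptions for
// illustration; this is not Intel's actual CMS implementation.
#include <cstdio>
#include <vector>

struct Message { int dest; const char* payload; };

struct Stop {
    int id;            // this node's identifier on the interconnect
    Stop* next;        // the next stop on the ring

    void deliver(const Message& m) {   // hand the message off to the node's CHA/CA
        std::printf("node %d received \"%s\"\n", id, m.payload);
    }
    void receive(const Message& m) {
        if (m.dest == id) deliver(m);  // the message is for this node
        else next->receive(m);         // otherwise forward it along the ring
    }
};

int main() {
    std::vector<Stop> ring(4);
    for (int i = 0; i < 4; ++i) { ring[i].id = i; ring[i].next = &ring[(i + 1) % 4]; }
    ring[0].receive({2, "read request"});  // injected at node 0, hops until it reaches node 2
}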

The interconnect looks something like this:

                        |
                        |
                  vertical ring
                        |
                        |
                      -----                        -----
 |node A| -- bus --   |CMS|  -- horizontal ring -- |CMS| -- bus -- |node B|
                      -----                        -----
                        |
                        |
                  vertical ring
                        |
                        |

Each node looks something like this:

-----            -----                                -----
|CMS|  -- bus -- |CHA| -- intra-node interconnect --  |L3 |
-----            -----                                -----
                   |
                   |
         -----------------------
         | (we are now in core)|
         |    L2 controller    |
         -----------------------
                   |
                   |
           the rest of the core

What does the CHA do? Well, it's called the cache and home agent. What? The home agent is on-node? Note that the home agent is responsible for translating physical memory addresses into memory channel addresses, which are passed over the interconnect to the memory controller that owns the target channel. In server-grade Skylake processors, the home agents are actually distributed over the nodes.

The cache part of the CHA means that it is also responsible for determining which LLC slice holds the cache line mapped to a given address, and for routing the memory request to that slice's controller. The CHA also implements the coherence protocol (MESIF or one of its variants), for example by providing (possibly modified) copies of cache lines to other nodes and by responding to coherence requests with help from the on-node snoop filter. Non-temporal requests also go through the CHA. Moreover, the CHA handles I/O requests by sending them to the node that can service them (a PCIe node).
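
As a rough illustration of the slice-selection step, the sketch below hashes a physical address down to an LLC slice index. Intel's actual slice hash is undocumented, so the bit mixing here is purely a placeholder for the idea.

// Toy slice-selection hash: the real function is undocumented, so the bit
// mixing below is only a placeholder to show the address -> slice mapping.
#include <cstdint>
#include <cstdio>

int llc_slice(uint64_t paddr, int num_slices) {
    uint64_t line = paddr >> 6;                      // drop the 64-byte line offset
    uint64_t h = line ^ (line >> 7) ^ (line >> 17);  // arbitrary mixing of address bits
    return static_cast<int>(h % num_slices);
}

int main() {
    std::printf("0x12345680 -> slice %d of 4\n", llc_slice(0x12345680ULL, 4));
}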

The home agents are distributed only in the server-grade Skylake processors. In the client processors, the home agent (and the memory controller) lives in the system agent (refer to the figure in the Wikichip article). In the server processors, on the other hand, each memory controller is its own node on the mesh (NUMA).
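
For a concrete (if oversimplified) picture of what the home agent's address decoding might look like, the sketch below interleaves consecutive 64-byte lines across memory channels. The granularity, channel count, and layout are assumptions for illustration; the real address map is more involved and depends on the memory configuration.

// Toy home-agent address decode: interleave consecutive 64-byte lines across
// channels. The interleaving granularity and layout are assumptions only.
#include <cstdint>
#include <cstdio>

struct ChannelAddr { int channel; uint64_t offset; };

ChannelAddr to_channel(uint64_t paddr, int num_channels) {
    uint64_t line = paddr >> 6;                                  // cache-line index
    int ch = static_cast<int>(line % num_channels);              // pick a channel
    uint64_t off = ((line / num_channels) << 6) | (paddr & 63);  // address within that channel
    return {ch, off};
}

int main() {
    for (uint64_t a = 0; a < 4 * 64; a += 64) {
        ChannelAddr c = to_channel(a, 2);
        std::printf("paddr 0x%03llx -> channel %d, offset 0x%03llx\n",
                    (unsigned long long)a, c.channel, (unsigned long long)c.offset);
    }
}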

The mesh topology and the distributed home agents significantly improve the scalability of the server processors with respect to the number of physical cores. The Intel patent discusses sharing a single CMS between multiple nodes to create hybrid topologies, which is useful for a very large number of cores. I don't think any Skylake processors use that though.


(1) The Wikichip article calls it a converged mesh stop. I don't know the origin of this term. Intel calls it a common mesh stop, a shared mesh stop, or a mesh station. I'll use the Intel term.

(2) The Wikichip article calls it the caching and home agent, but Intel calls it the cache and home agent or the cache-home agent.

