Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling
Rakesh Kumar (UCSD)
Victor Zyuban (IBM), Dean Tullsen (UCSD)
A Naive methodology for Multi-core Design
[Figure: eight cores P0-P7, each with its own private L2 bank L2_0-L2_7]
Multi-core oblivious multi-core design!
Clean, easy way to design
Holistic Design of Multi-core Architectures
The naïve methodology is inefficient
We demonstrated the inefficiency for cores and proposed alternatives:
Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
Conjoined-core Chip Multiprocessing [MICRO04]
What about interconnects?
How much can interconnects impact processor architecture?
Do they need to be co-designed with caches and cores?
Goal of this Research
Contributions
We model the implementation of several interconnection mechanisms and topologies: quantify the various overheads, highlight the tradeoffs, and study how the overheads scale
We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for
We show that one cannot design a good interconnect in isolation from the CPU cores and memory design
We propose a novel interconnection architecture which exploits behaviors identified by this research
Talk Outline
Interconnection models: Shared Bus Fabric (SBF), point-to-point links, crossbar
Modeling area, power and latency
Evaluation Methodology
SBF and Crossbar results
Novel architecture
Shared Bus Fabric (SBF)
On-chip equivalent of the system bus in snoop-based shared-memory multiprocessors
We assume a MESI-like snoopy write-invalidate protocol with write-back L2s
The SBF needs to support several kinds of coherence transactions (request, snoop, response, data transfer, invalidate, etc.)
It also needs to arbitrate access to the corresponding buses
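As background, a minimal sketch of the standard MESI transitions such a protocol relies on is shown below; this is a textbook illustration in Python, not the exact protocol the paper models.

```python
# Minimal MESI sketch for a snoopy write-invalidate protocol (textbook
# illustration; the paper models a MESI-like protocol with write-back L2s,
# not necessarily this exact table).
#
# (current_state, event) -> next_state
MESI = {
    # local processor accesses
    ("I", "proc_read"):   "S",  # or E if no other cache holds the line
    ("I", "proc_write"):  "M",  # issues a read-exclusive request on the bus
    ("S", "proc_read"):   "S",
    ("S", "proc_write"):  "M",  # issues an invalidate/upgrade on the bus
    ("E", "proc_read"):   "E",
    ("E", "proc_write"):  "M",  # silent upgrade, no bus traffic
    ("M", "proc_read"):   "M",
    ("M", "proc_write"):  "M",
    # snooped transactions from other cores
    ("M", "snoop_read"):  "S",  # supply data and write the dirty line back
    ("E", "snoop_read"):  "S",
    ("S", "snoop_read"):  "S",
    ("I", "snoop_read"):  "I",
    ("M", "snoop_rdx"):   "I",  # read-exclusive: supply data, then invalidate
    ("E", "snoop_rdx"):   "I",
    ("S", "snoop_rdx"):   "I",
    ("I", "snoop_rdx"):   "I",
}

def next_state(state, event):
    """Return the next MESI state for a cache line."""
    return MESI[(state, event)]
```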
Shared Bus Fabric (SBF)
[Figure: SBF structure. Pipelined, unidirectional buses (AB, SB, RB, DB) with address and data arbiters (A-arb, D-arb); each core (incl. I$/D$) and L2 interfaces to the buses through queues and book-keeping logic; control wires carry mux controls, flow-control and request/grant signals]
Details about latencies, overheads etc. in the paper
Point-to-point Link (P2PL)
If there are multiple SBFs in the system, a point-to-point link connects two SBFs
Needs queues and arbiters similar to an SBF
Multiple SBFs might be required in the system: to increase bandwidth, to decrease signal latencies, to ease floorplanning
Crossbar Interconnection System
If two or more cores share an L2, as many current CMPs do, a crossbar provides a high-bandwidth connection between the cores and the cache banks
Crossbar Interconnection System
[Figure: crossbar between cores and L2 banks. An address bus AB (one per core) carries loads, stores, prefetches and TLB misses; a data-out bus DoutB (one per core) carries data write-backs; a data-in bus DinB (one per bank) carries data reloads and invalidate addresses]
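As a rough illustration of why this structure is wire-hungry, the sketch below counts the parallel wire tracks implied by the per-core and per-bank buses and converts them to area for a given metal plane; the bus widths and span are assumptions for illustration, not the paper's parameters.

```python
# Back-of-envelope crossbar wiring estimate. Bus widths and span are
# assumptions for illustration, not the paper's parameters.
def crossbar_wire_tracks(n_cores, n_banks,
                         ab_width=64, doutb_width=128, dinb_width=128):
    """Parallel wire tracks: one AB and one DoutB per core, one DinB per bank."""
    return n_cores * (ab_width + doutb_width) + n_banks * dinb_width

def crossbar_wire_area_mm2(n_cores, n_banks, span_mm, pitch_um):
    """Wire-track area if the crossbar cannot be routed over the L2 arrays."""
    return crossbar_wire_tracks(n_cores, n_banks) * (pitch_um / 1000.0) * span_mm

# Example: 8 cores sharing 8 L2 banks, 2X metal (1.0 um pitch), 20 mm span.
print(f"{crossbar_wire_area_mm2(8, 8, span_mm=20, pitch_um=1.0):.1f} mm^2")
```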
Talk Outline
Interconnection models: Shared Bus Fabric (SBF), point-to-point links, crossbar
Modeling area, power and latency
Evaluation Methodology
SBF and Crossbar results
Novel architecture
Wiring Area Overhead
[Figure: repeaters and latches on a wire routed over a memory array]
Wiring Area Overhead (65nm)
Metal plane | Effective pitch (um) | Repeater spacing (mm) | Repeater width (um) | Latch spacing (mm) | Latch height (um)
1X | 0.5 | 0.4 | 0.4 | 1.5 | 120
2X | 1.0 | 0.8 | 0.8 | 3.0 | 60
4X | 2.0 | 1.6 | 1.6 | 5.0 | 30
8X | 4.0 | 3.2 | 3.2 | 8.0 | 15
Overheads change based on the metal plane(s) the interconnects are mapped to
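To make the table concrete, the sketch below estimates the wiring-track area and the repeater/latch counts for a bus of a given width and length on each plane; it is a simplified reading of the table, not the paper's full area model.

```python
# Illustrative use of the 65 nm table above: wiring-track area plus the
# repeater and latch counts a bus of a given length and width would need.
# Simplified; the paper's model accounts for the actual repeater/latch layout.
PLANES = {
    # plane: (pitch um, repeater spacing mm, repeater width um,
    #         latch spacing mm, latch height um)
    "1X": (0.5, 0.4, 0.4, 1.5, 120),
    "2X": (1.0, 0.8, 0.8, 3.0, 60),
    "4X": (2.0, 1.6, 1.6, 5.0, 30),
    "8X": (4.0, 3.2, 3.2, 8.0, 15),
}

def bus_overheads(plane, length_mm, n_wires):
    pitch_um, rep_spacing_mm, _, latch_spacing_mm, _ = PLANES[plane]
    track_area_mm2 = n_wires * (pitch_um / 1000.0) * length_mm
    # roughly one repeater / latch per wire at each listed spacing
    repeaters = n_wires * round(length_mm / rep_spacing_mm)
    latches = n_wires * round(length_mm / latch_spacing_mm)
    return track_area_mm2, repeaters, latches

# Example: a 128-bit bus, 20 mm long, on the 2X plane.
area, reps, lats = bus_overheads("2X", 20, 128)
print(f"track area {area:.1f} mm^2, {reps} repeaters, {lats} latches")
```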
Wiring Power Overhead
Dynamic dissipation in wires, repeaters and latches
Wire capacitance 0.02 pF/mm, frequency 2.5 GHz, Vdd 1.1 V
Repeater capacitance 30% of wire capacitance
Dynamic power per latch 0.05 mW
Leakage in repeaters and latches; channel and gate leakage values in the paper
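With those parameters, the dynamic component for a single repeated, latched wire follows the usual alpha*C*V^2*f form; a minimal sketch is below (the activity factor of 0.2 matches the AF used in the results slides, everything else is from this slide).

```python
# Minimal dynamic-power sketch for one repeated, latched wire, using the
# parameters on this slide; the activity factor of 0.2 matches the AF used
# in the results that follow.
C_WIRE_PF_PER_MM = 0.02      # wire capacitance
REPEATER_CAP_FRACTION = 0.3  # repeater capacitance as a fraction of wire cap
FREQ_HZ = 2.5e9
VDD = 1.1
LATCH_DYN_MW = 0.05          # dynamic power per latch

def wire_dynamic_power_mw(length_mm, n_latches, activity_factor=0.2):
    # switching power of the wire plus its repeaters: alpha * C * V^2 * f
    c_total_f = C_WIRE_PF_PER_MM * 1e-12 * length_mm * (1 + REPEATER_CAP_FRACTION)
    switching_mw = activity_factor * c_total_f * VDD ** 2 * FREQ_HZ * 1e3
    # latch clock power is charged in full (no gating), as in the results
    return switching_mw + n_latches * LATCH_DYN_MW

# Example: a 20 mm wire on the 2X plane (latch spacing 3 mm, so ~7 latches).
print(f"{wire_dynamic_power_mw(20, 7):.2f} mW")
```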
Wiring Latency Overhead
Latency from signals traveling through the pipeline latches along the wires
Latency also from the travel of control signals between a central arbiter and the interfaces corresponding to the request/data queues
Hence, latencies depend on the location of the particular core, cache, or arbiter
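A hedged sketch of how these pieces add up: one cycle per latch a signal crosses, plus a control round trip to the arbiter and the arbitration delay itself. The distances and cycle counts are placeholders; the paper derives the real latencies from the floorplan.

```python
# Rough latency sketch: a signal picks up one cycle per pipeline latch it
# crosses, plus a request/grant round trip to the arbiter before it may use
# a bus. Distances and the arbitration delay are placeholders.
import math

def wire_latency_cycles(distance_mm, latch_spacing_mm):
    return math.ceil(distance_mm / latch_spacing_mm)

def bus_access_latency_cycles(core_to_arbiter_mm, core_to_dest_mm,
                              latch_spacing_mm=3.0, arbitration_cycles=2):
    # request/grant control round trip to the central arbiter ...
    control = 2 * wire_latency_cycles(core_to_arbiter_mm, latch_spacing_mm)
    # ... plus the arbitration itself, plus the trip on the architected bus
    transfer = wire_latency_cycles(core_to_dest_mm, latch_spacing_mm)
    return control + arbitration_cycles + transfer

# Example: core 9 mm from the arbiter and 15 mm from the target L2 interface.
print(bus_access_latency_cycles(9, 15))
```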
Interconnect-Related Logic Overhead
Arbiters, muxes and queues constitute interconnect-related logic
Area and power overhead is primarily due to the queues, which are assumed to be implemented using latches
Performance overhead comes from wait time in the queues and from arbitration latencies
Arbitration overhead increases with the number of connected units
Latching is required between different stages of arbitration
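Because the queues are latch-based, their cost scales simply with entries times width; a rough sketch using the per-latch dynamic power quoted two slides back (the per-latch leakage number here is a placeholder, the real values are in the paper).

```python
# Rough queue-overhead sketch: latch-based queues cost (entries x width)
# latches. The per-latch dynamic power is from the wiring-power slide; the
# per-latch leakage here is a placeholder, not the paper's value.
LATCH_DYN_MW = 0.05
LATCH_LEAK_MW = 0.01   # assumed placeholder

def queue_power_mw(entries, width_bits, gated_fraction=0.0):
    latches = entries * width_bits
    dynamic = latches * LATCH_DYN_MW * (1.0 - gated_fraction)
    leakage = latches * LATCH_LEAK_MW
    return dynamic + leakage

# Example: a 16-entry, 64-bit request queue at each of 8 connected units.
print(f"{8 * queue_power_mw(16, 64):.0f} mW across all queues")
```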
Talk Outline
Interconnection models: Shared Bus Fabric (SBF), point-to-point links, crossbar
Modeling area, power and latency
Evaluation Methodology
SBF and Crossbar results
Novel architecture
Modeling Multi-core Architectures
Stripped-down versions of Power4-like cores, 10 mm^2 and 10 W each
Evaluated 4-, 8- and 16-core multiprocessors occupying roughly 400 mm^2
A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units (NCUs)
Weak consistency model, MESI-like coherence
All studies done at 65 nm
Floorplans for the 4-, 8- and 16-core processors [assuming private caches]
[Figure: floorplans showing each core with its L2 data and tag arrays, NCUs, memory controllers (MC), IOX blocks, the SBF(s), and the P2P link between the two SBFs of the 16-core processor]
Note that there are two SBFs for the 16-core processor
Performance Modeling
Used a combination of detailed functional simulation and queuing simulation
Functional simulator: input is SMP traces (TPC-C, TPC-W, TPC-H, Notesbench, etc.); output is coherence statistics for the modeled memory/interconnection system
Queuing simulator: input is the coherence statistics, interconnection latencies, and the CPI of the modeled core assuming an infinite L2; output is the system CPI
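A minimal sketch of how the two simulators might compose: the functional pass gives per-instruction rates for each kind of interconnect transaction, and a simple queuing step turns unloaded latencies plus bus utilization into an added CPI term. The actual queuing simulator models contention and the full transaction mix in far more detail.

```python
# Minimal sketch of composing coherence statistics with interconnect
# latencies into a system CPI estimate (illustrative only).
def system_cpi(base_cpi, events_per_instr, latency_cycles, utilization):
    """
    base_cpi:          CPI of the core assuming an infinite L2
    events_per_instr:  {event: occurrences per instruction}
    latency_cycles:    {event: unloaded interconnect + memory latency}
    utilization:       {event: offered load on the relevant bus, 0..1}
    """
    added = 0.0
    for event, rate in events_per_instr.items():
        # simple M/M/1-style inflation of the unloaded latency with contention
        loaded = latency_cycles[event] / max(1e-6, 1.0 - utilization[event])
        added += rate * loaded
    return base_cpi + added

# Example with made-up numbers (rates per instruction, latencies in cycles).
print(system_cpi(1.2,
                 {"l2_miss": 0.01, "snoop": 0.005},
                 {"l2_miss": 400, "snoop": 40},
                 {"l2_miss": 0.3, "snoop": 0.5}))
```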
Results…finally!
SBF: Wiring Area Overhead
[Chart: SBF wiring area overhead (mm^2) vs. number of cores (4, 8, 16), broken into architected buses, control wires, and total overhead]
Area overhead can be significant: 7-13% of the die area
Sufficient to place 3-5 extra cores or 4-6 MB of extra cache!
Co-design needed: more cores, more cache, or more interconnect bandwidth?
We observed a scenario where decreasing bandwidth improved performance
SBF: Wiring Area Overhead
Control wires account for 37%-63% of the total overhead
This constrains how much area can be saved with narrower buses
SBF: Wiring Area Overhead
Argues against very lightweight cores: they do not amortize the incremental cost they impose on the interconnect
SBF: Power Overhead
Power overhead can be significant for large numbers of cores
[Chart: SBF power overhead (W) vs. number of cores (4, 8, 16), broken into leakage and dynamic power of logic latches, wiring latches, repeaters, and wire capacitance; dynamic components assume AF = 0.2 and no clock gating]
SBF: Power Overhead
Power due to queues exceeds power due to wires!
A good interconnect architecture should have efficient queuing and flow control
SBF: Performance
[Chart: system CPI vs. number of cores (4, 8, 16), with and without interconnection overhead]
Interconnect overhead can be significant: 10-26%!
Interconnect accounts for over half the latency to the L2 cache
Shared Caches and Crossbar
Results for the 8-core processor
2-way, 4-way and full sharing of the L2 cache
Results are shown for two cases:
When the crossbar sits between the cores and the L2: easy interfacing, but every wiring track adds area overhead
When the crossbar is routed over the L2: interfacing is difficult, and the area overhead is only due to reduced cache density
Crossbar: Area Overhead
[Chart: crossbar area overhead (mm^2) vs. metal plane (1X, 2X, 4X) for two-way, four-way and all-sharing, plus the case where the crossbar can be routed over the L2 (4-way sharing)]
11-46% of the die for a 2X implementation! Sufficient to put in 4 more cores, even for 4-way sharing!
Crossbar: Area Overhead (contd)
What is the point of cache sharing? Cores get the effect of having more cache space
But we have to shrink the shared cache to accommodate the crossbar
Is a larger cache through sharing an illusion? Or can we really get larger caches by making them private and reclaiming the area used by the crossbar?
In other words, does sharing have any benefit in such scenarios?
Crossbar: Performance Overhead
[Chart: CPI for the all_private, two_shared, four_shared and all_shared configurations, with a 2X crossbar between cores and L2, a 2X crossbar routed over the L2, and a 4X crossbar routed over the L2]
Accompanying grain of salt
A simplified interconnection model is assumed
Systems with memory scouts etc. may have different memory-system requirements
Non-uniform cache architectures (NUCA) might improve performance
Etc. etc.
However, results do show that a shared cache is significantly less desirable for future technologies
What have we learned so far? (in terms of bottlenecks)
Interconnection bottlenecks (and possible solutions):
Long wires result in long latencies: see if wires can be shortened
Centralized arbitration: see if arbitration can be distributed
Overheads get worse with the number of modules connected to a bus: see if the number of modules connected to a bus can be decreased
A Hierarchical Interconnect
[Figure: floorplan of the hierarchical interconnect. Two clusters of four cores (each core with its NCU and L2 data arrays), each cluster attached to its own SBF, with the MCs and IOX blocks on one of the SBFs]
A Hierarchical Interconnect
A local and a remote SBF
(smaller average-case latency, longer worst-case latency)
[Figure: the same floorplan with the two SBFs connected by a point-to-point link (P2PL)]
A Hierarchical Interconnect (contd)
Threads need to be mapped intelligently, to increase the hit rate in the caches connected to the local SBF (a simple grouping heuristic is sketched after this slide)
In some cases even random mapping results in better performance, e.g. for the 8-core processor shown
More research needs to be done for hierarchical interconnects
More description in the paper
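One intelligent-mapping heuristic, purely illustrative and not the paper's algorithm, is to co-locate the threads that share the most data on cores attached to the same local SBF, so their coherence traffic rarely crosses the P2P link.

```python
# Illustrative greedy heuristic for mapping threads onto hierarchical-SBF
# clusters: co-locate the threads that share the most data so their
# coherence traffic stays on the local SBF. Not the paper's algorithm.
def map_threads(threads, sharing, cluster_size):
    """Greedily build clusters: seed with any unplaced thread, then keep
    adding the unplaced thread that shares the most with the cluster."""
    unplaced = set(threads)
    clusters = []
    while unplaced:
        cluster = [unplaced.pop()]
        while len(cluster) < cluster_size and unplaced:
            best = max(unplaced, key=lambda t: sum(
                sharing.get(frozenset((t, c)), 0) for c in cluster))
            unplaced.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters

# Example: 8 threads, two local SBFs of 4 cores each; sharing weights are
# made up for illustration.
threads = list(range(8))
sharing = {frozenset((0, 1)): 9, frozenset((0, 2)): 5, frozenset((2, 3)): 8,
           frozenset((4, 5)): 7, frozenset((6, 7)): 4}
print(map_threads(threads, sharing, cluster_size=4))
```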
Conclusions
Design choices for interconnects have a significant effect on the rest of the chip; they should be co-designed with cores and caches
Interconnection power and performance overheads can be almost as much logic-dominated as wire-dominated; don't think about wires only, since arbitration, queuing and flow control matter
Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for; we should do careful interconnect modeling in our CMP research proposals
A hierarchical bus structure can negate some of the interconnection performance cost