CS 152: Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
Lecture 14 - Cache Design and Coherence
2014-3-6
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
UC Regents Spring 2014 © UCB
Today: Shared Cache Design and Coherence
Crossbars and Rings: How to do on-chip sharing.
Concurrent requests: Interfaces that don’t stall.
CPU multi-threading: Keeps the memory system busy.
Coherency Protocols: Building coherent caches.
[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]
Multithreading
Sun Microsystems Niagara series
The case for multithreading
Some applications spend their lives waiting for memory (C = compute, M = waiting). Amdahl’s Law tells us that optimizing C is the wrong thing to do ...
Idea: Create a design that can multiplex threads onto one pipeline. Goal: Maximize throughput of a large number of threads.
Multi-threading: Assuming perfect caches
Looks like 4 CPUs, each running at 1/4 the clock (S. Cray, 1962).
[Pipeline diagram: labels T1-T4 show which thread occupies each stage in this state.]
Bypass network is no longer needed ...
[Pipeline diagram: ID (Decode), EX, MEM, and WB stages with per-stage IR registers; the forwarding paths are gone.]
Result: Critical path shortens -- can trade for speed or power.
Multi-threading: Supporting cache misses
A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss-penalty period.
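A sketch of this policy in software (the class and method names are ours; the real scheduler is hardware, not code):

from collections import deque

class ThreadScheduler:
    """Round-robin issue among ready threads; a thread that misses
    in the cache is parked until its miss-penalty period ends."""

    def __init__(self, num_threads):
        self.ready = deque(range(num_threads))   # threads eligible to issue
        self.waiting = {}                        # thread id -> cycle when miss resolves

    def on_cache_miss(self, tid, now, penalty):
        # Take the thread off the pipeline for the duration of the miss.
        self.waiting[tid] = now + penalty

    def pick(self, now):
        # Return missed threads whose penalty period has elapsed.
        for tid, ready_at in list(self.waiting.items()):
            if now >= ready_at:
                del self.waiting[tid]
                self.ready.append(tid)
        if not self.ready:
            return None                          # all threads waiting on memory
        tid = self.ready.popleft()
        self.ready.append(tid)                   # round-robin rotation
        return tid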
Sun Niagara II: # threads/core?
8 threads/core: Enough to keep one core busy, given the clock speed, memory system latency, and target application characteristics.
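A back-of-the-envelope check of that sizing, with illustrative numbers of our own (not Sun’s published figures): if a thread computes for C cycles and then waits M cycles on memory, roughly (C + M) / C threads keep the pipeline busy.

import math

def threads_to_hide_latency(compute_cycles, miss_cycles):
    # One thread occupies the pipeline for compute_cycles out of
    # every (compute_cycles + miss_cycles); cover the idle gap.
    return math.ceil((compute_cycles + miss_cycles) / compute_cycles)

# Illustrative only: ~20 cycles of work per ~130-cycle memory stall
# suggests on the order of 8 threads per core.
print(threads_to_hide_latency(20, 130))  # -> 8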
Crossbar Networks
Shared memory: CPUs share the lower levels of the memory system, and I/O. Common address space, one operating system image. Communication occurs through the memory system (100 ns latency, 20 GB/s bandwidth).
[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]
Sun’s Niagara II: Single-chip implementation ...
SPC == SPARC Core. Only DRAM is not on chip.
Crossbar: Like N ports on an N-register file
[Diagram: a 32-entry register file; read ports are 32-to-1 muxes driven by sel(rs1) and sel(rs2), and the write port is a demux driven by sel(ws), WE, and wd. R0 is the constant 0.]
Flexible, but ... reads slow down as O(N²). Why? The number of loads on each register’s Q output grows as O(N), and the wire length to each port mux also grows as O(N).
Design challenge: High-performance crossbar
Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Each DRAM channel: 50 GB/s read, 25 GB/s write BW. Crossbar BW: 270 GB/s total (read + write). Apps are locality-poor. Goal: saturate DRAM BW.
Sun Niagara II 8 x 9 Crossbar
Every cross of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded). A tri-state distributed mux, as in the microcode lecture.
Sun Niagara II 8 x 9 Crossbar
8 ports on the CPU side (one per core).
8 ports for L2 banks, plus one for I/O.
4-cycle latency (715 ps/cycle). Cycles 1-3 are for arbitration; data transmits on cycle 4.
100-200 wires/port (each way). Pipelined.
A complete switch transfer (4 epochs)
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as does the three allocation stages, for different sets of requests.
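A toy model of the overlap (stage names taken from the epochs above; this is a sketch, not the actual Niagara arbiter):

# Print which transfer occupies each pipeline stage per cycle.
STAGES = ["request", "allocate", "inform", "transfer"]

def schedule(num_transfers, num_cycles):
    for cycle in range(num_cycles):
        # Transfer t starts at cycle t, so at this cycle it is in stage (cycle - t).
        active = [(f"T{t}", STAGES[cycle - t])
                  for t in range(num_transfers) if 0 <= cycle - t < 4]
        print(f"cycle {cycle}: " + ", ".join(f"{t} in {s}" for t, s in active))

schedule(num_transfers=6, num_cycles=6)
# From cycle 3 onward, some transfer completes its data movement every cycle.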
Epoch 2: The Allocation Problem (4 x 4)
Rows are input ports (A, B, C, D); columns are output ports (W, X, Y, Z). A 1 codes that an input has data ready to send to an output.
Request matrix:
    W  X  Y  Z
A   0  0  1  0
B   1  0  0  0
C   0  0  1  0
D   1  0  0  0
The allocator returns a matrix with at most one 1 in each row and column, to set the switches:
    W  X  Y  Z
A   0  0  1  0
B   0  0  0  0
C   0  0  0  0
D   1  0  0  0
The algorithm should be “fair”, so no port always loses ... it should also “scale” to run large matrices fast.
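One simple allocator in this spirit, as a software sketch: scan the inputs in a rotating priority order and grant each its first free output. Real switches use parallel hardware algorithms (wavefront, iSLIP, and the like); the function name and rotation scheme here are our own.

def allocate(requests, start=0):
    """requests[i][j] == 1 if input i wants output j.
    Returns a grant matrix with at most one 1 per row and column."""
    n = len(requests)
    grants = [[0] * n for _ in range(n)]
    taken = set()                            # outputs already granted
    for k in range(n):
        i = (start + k) % n                  # rotate priority for rough fairness
        for j in range(n):
            if requests[i][j] and j not in taken:
                grants[i][j] = 1
                taken.add(j)
                break                        # at most one grant per input
    return grants

# The request matrix from the slide (inputs A-D, outputs W-Z).
reqs = [[0, 0, 1, 0],   # A -> Y
        [1, 0, 0, 0],   # B -> W
        [0, 0, 1, 0],   # C -> Y (conflicts with A)
        [1, 0, 0, 0]]   # D -> W (conflicts with B)
# With priority starting at input D, this reproduces the slide's grants:
# A gets Y, D gets W; B and C lose this cycle and retry later.
for row in allocate(reqs, start=3):
    print(row)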
Sun Niagara II Crossbar Notes
Crossbar defines the floorplan: all port devices should be equidistant to the crossbar. Uniform latency between all port pairs.
Low latency: 4 cycles (less than 3 ns).
Sun Niagara II Energy Facts: The crossbar is only 1% of total power.
The crossbar did not scale up for the 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.
Design alternatives to the crossbar?
CLOS Networks: From telecom world ...
Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers naturally fit between tiles. Trades latency for scalability.
CLOS Networks: An example route
Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.
Ring Networks
Intel Xeon: Data-center server chip. 20% of Intel’s revenues, 40% of its profits. Why? The cloud is growing, and Xeon is dominant.
Compiled Chips: Xeon is a chip family, varying by # of cores and L3 cache size. The chip family’s mask layouts are generated automatically, by adding core/cache slices.
Ring Bus: A bi-directional ring bus connects the cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline it is small.
Ring Stop
2.5 MB L3 cache slice from a Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring stop interface lives in the Cache Control Box (CBOX).
Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command).
[Diagram: Ring Stops #1-#3 on the ring; each interface has Data Out, Data In, and Control ports, plus an Empty indication.]
Ring Stop #2 Interface. Reading: Sense Data Out to see if the message is for Ring Stop #2. If so, latch the data, and mux Empty onto the ring. Writing: Check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
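A behavioral sketch of that ring-stop logic (Python stands in for the latch/mux hardware; the message format and names are our assumptions):

EMPTY = None  # models the "Empty" slot encoding on the ring

class RingStop:
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.rx_queue = []        # messages latched at this stop
        self.tx_queue = []        # messages waiting to enter the ring

    def cycle(self, data_out):
        """data_out: the slot arriving at this stop.
        Returns the slot this stop muxes back onto the ring (Data In)."""
        # Reading: if the arriving message is for us, latch it and
        # mux Empty onto the ring in its place.
        if data_out is not EMPTY and data_out["recipient"] == self.stop_id:
            self.rx_queue.append(data_out)
            data_out = EMPTY
        # Writing: if the slot is Empty and we have a message, send it.
        if data_out is EMPTY and self.tx_queue:
            data_out = self.tx_queue.pop(0)
        return data_out

# One message hop: stop 1 sends to stop 2 around a 3-stop ring.
stops = [RingStop(i) for i in range(3)]
stops[1].tx_queue.append({"sender": 1, "recipient": 2, "command": "read"})
slot = EMPTY
for _ in range(6):                       # circulate the slot around the ring
    for s in stops:
        slot = s.cycle(slot)
print(stops[2].rx_queue)                 # the message arrives at stop 2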
In practice: “Extreme EE” to co-optimize bandwidth and reliability.
Debugging: A “network analyzer” built into the chip captures ring messages of a particular kind, and sends them off chip via an aux port.
A derivative of this ring bus is also used on laptop and desktop chips.
Break
Hit-over-Miss Caches
Recall: A CPU-cache port that doesn’t stall on a miss
[Diagram: Queue 1 carries requests from the CPU to the cache; Queue 2 carries replies back to the CPU.]
The CPU makes a request by placing the following items in Queue 1:
CMD: Read, write, etc ...
TAG: 9-bit number identifying the request.
MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
MADDR: Memory address of first byte.
STORE-DATA: For stores, the data to store.
When the request is ready, the cache places the following items in Queue 2:
TAG: Identity of the completed command.
LOAD-DATA: For loads, the requested data.
The CPU saves info about requests, indexed by TAG. This cache is used in an ASPIRE CPU (Rocket).
Why use the TAG approach? Multiple misses can proceed in parallel. Loads can return out of order.
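The record formats, restated in code (field names come from the slide; the queue plumbing around them is our assumption):

from dataclasses import dataclass
from collections import deque
from typing import Optional

@dataclass
class Request:            # placed in Queue 1 by the CPU
    cmd: str              # CMD: "read", "write", etc.
    tag: int              # TAG: 9-bit request id (0..511)
    mtype: int            # MTYPE: access width in bits (8/16/32/64)
    maddr: int            # MADDR: memory address of first byte
    store_data: Optional[int] = None   # STORE-DATA: for stores only

@dataclass
class Reply:              # placed in Queue 2 by the cache
    tag: int              # TAG: identity of the completed command
    load_data: Optional[int] = None    # LOAD-DATA: for loads only

queue1 = deque()          # CPU -> cache
queue2 = deque()          # cache -> CPU

# The CPU keeps a side table indexed by TAG, so replies may return
# out of order and multiple misses can be in flight at once.
inflight = {}
req = Request(cmd="read", tag=7, mtype=32, maddr=0x1000)
inflight[req.tag] = req
queue1.append(req)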
Today: How a read request proceeds in the L1 D-Cache
The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1.
We do a normal cache access. If there is a hit, we place the load result in Queue 2 ... In the case of a miss, we use the Inverted Miss Status Holding Register. (“We” == the L1 D-Cache controller.)
Inverted MSHR (Miss Status Holding Register)
A 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a Valid bit, the Cache Block # (bits [42:0]), the 1st Byte in Block (bits [4:0]), and the MTYPE (bits [1:0]). A comparator on each entry matches against the block # field; the Valid bit qualifies each Hit. Assumptions: 32-byte blocks, 48-bit physical address space.
To look up a memory address ...
(1) Associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.
(2) Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags.
(3) Whenever the memory system returns data, associatively look up its block # to find all pending transactions. Place the transaction data for all hits in Queue 2, and clear their valid bits. Also update the L1 cache.
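The three steps, restated as a software model of the table (the helper names, and the linear scan standing in for the associative hardware, are ours):

BLOCK_BYTES = 32   # slide's assumption: 32-byte blocks

class InvertedMSHR:
    """512-entry table indexed by the 9-bit TAG; the block-number
    field is searched associatively (modeled here as a scan)."""

    def __init__(self):
        self.valid = [False] * 512
        self.block = [0] * 512      # cache block # of the miss address
        self.offset = [0] * 512     # 1st byte in block
        self.mtype = [0] * 512

    def pending(self, maddr):
        # Step 1: associative lookup; if no hits, start a memory request.
        blk = maddr // BLOCK_BYTES
        return any(self.valid[t] and self.block[t] == blk for t in range(512))

    def allocate(self, tag, maddr, mtype):
        # Step 2: index by TAG; always finds V=0, since the CPU
        # promises not to reuse in-flight tags.
        assert not self.valid[tag]
        self.valid[tag] = True
        self.block[tag] = maddr // BLOCK_BYTES
        self.offset[tag] = maddr % BLOCK_BYTES
        self.mtype[tag] = mtype

    def fill(self, maddr):
        # Step 3: memory returned a block; complete ALL pending
        # transactions on it, clearing their valid bits.
        blk = maddr // BLOCK_BYTES
        done = [t for t in range(512) if self.valid[t] and self.block[t] == blk]
        for t in done:
            self.valid[t] = False
        return done   # tags to answer in Queue 2 (and update the L1)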
Inverted MSHR notes
Structural hazards only occur when the TAG space is exhausted by the CPU.
High cost (# of comparators + SRAM cells). See Farkas and Jouppi on the class website for low-cost designs that are often good enough.
We will return to MSHRs to discuss CPI performance later in the semester.
Coherency Hardware
Cache Placement
Two CPUs, two caches, shared DRAM ... (write-through caches)
Initially, shared main memory holds the value 5 at address 16.
CPU0: LW R2, 16(R0) -- CPU0’s cache now holds the value 5 for address 16.
CPU1: LW R2, 16(R0) -- CPU1’s cache now holds the value 5 for address 16.
CPU1: SW R0, 16(R0) -- CPU1’s cache and main memory now hold 0 for address 16, but CPU0’s cache still holds 5.
View of memory no longer “coherent”. Loads of location 16 from CPU0 and CPU1 see different values!
Today: What to do ...
The simplest solution ... one cache!
[Diagram: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache and shared main memory.]
CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in 2 caches!
Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.
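The banking rule, sketched in code (selecting banks by low-order block-address bits is our assumption; real designs may hash address bits differently):

NUM_BANKS = 8      # illustrative bank count
BLOCK_BYTES = 32

def bank_of(addr):
    # Interleave blocks across banks using low-order block-address bits.
    return (addr // BLOCK_BYTES) % NUM_BANKS

def conflict(addr0, addr1):
    # Both CPUs can be served in the same switch epoch unless they
    # target the same bank; then one request must stall.
    return bank_of(addr0) == bank_of(addr1)

print(conflict(0x0000, 0x0020))  # adjacent blocks, different banks -> False
print(conflict(0x0000, 0x0100))  # blocks 8 apart land in the same bank -> True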
Not a complete solution ... but good for L2.
For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good.
This approach was a complete solution in the days when DRAM row access time and the CPU clock period were well matched. Sequent Systems (1980s).
Modified form: Private L1s, shared L2
[Diagram: CPU0 and CPU1 each have private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache and shared main memory.]
Advantages of a shared L2 over private L2s: Processors communicate at cache speed, not DRAM speed. Constructive interference, if both CPUs need the same data/instructions. Disadvantage: CPUs share BW to the L2 cache ...
Thus, we need to solve the cache coherency problem for the L1 caches.
Example: IBM Power 4 (2001). Dual core, shared multi-bank L2 cache, private L1 caches, off-chip L3 caches.
Cache Coherency
Cache coherency goals ...
[Diagram: CPU0’s and CPU1’s caches and the shared memory hierarchy each hold a copy of address 16.]
1. Only one processor at a time has write permission for a memory location.
2. No processor can load a stale copy of a location after a write.
Simple Implementation: Snoopy Caches
[Diagram: CPU0 and CPU1, each with a cache and a snooper, attached to a shared memory bus and the shared main memory hierarchy.]
Each cache has the ability to “snoop” on the memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate the cache lines of other CPUs.
Writes from 10,000 feet ... for write-thru L1
For write-thru caches ...
1. The writing CPU takes control of the bus.
2. The address being written is invalidated in all other caches. Reads will no longer hit in those caches and get stale data.
3. The write is sent to main memory. Reads will cache miss and retrieve the new value from main memory.
To first order, reads will “just work” if write-thru caches implement this policy. This is a “two-state” protocol (cache lines are “valid” or “invalid”).
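A behavioral model of this two-state protocol (a software sketch, not RTL; the class names are ours):

class WriteThruCache:
    def __init__(self, bus, memory):
        self.lines = {}                # addr -> value (present == "valid")
        self.bus = bus
        self.memory = memory
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:     # miss: fetch the fresh value
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # 1. take the bus; 2. invalidate other caches; 3. write memory.
        self.bus.broadcast_invalidate(sender=self, addr=addr)
        self.lines[addr] = value
        self.memory[addr] = value      # write-through

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)     # future reads must miss

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast_invalidate(self, sender, addr):
        for c in self.caches:
            if c is not sender:
                c.snoop_invalidate(addr)

# Replaying the earlier example: both CPUs load address 16, then CPU1 stores 0.
memory, bus = {16: 5}, Bus()
cpu0, cpu1 = WriteThruCache(bus, memory), WriteThruCache(bus, memory)
cpu0.read(16); cpu1.read(16)
cpu1.write(16, 0)
print(cpu0.read(16))   # 0 -- CPU0's stale copy was invalidated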
Limitations of the write-thru approach
Every write goes to the bus. The total bus write bandwidth does not support more than 2 CPUs, in modern practice. To scale further, we need to use write-back caches.
The write-back big trick: add extra states. The simplest version is MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). The state definitions are subtle ... Figure 5.5, page 358, is the best starting point.
Read misses ... for a MESI protocol ...
For write-back caches ...
1. A cache requests a cache-line fill for a read miss.
2. Another cache holding this line exclusive responds with fresh data. The read miss does not have to go to main memory and retrieve stale data.
3. The responding cache changes the line from exclusive to shared. Future writes to it will go to the bus to be snooped.
These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires working through the complete “state machine” for the protocol.
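A fragment of that state machine covering just these cases (a simplified sketch; a full MESI protocol has many more transitions and bus actions):

# States: M(odified), E(xclusive), S(hared), I(nvalid)

def on_remote_read(state):
    """Our cache snoops another cache's read miss for a line we hold."""
    if state == "M":
        return "S", "supply dirty data (and write it back)"
    if state == "E":
        return "S", "supply fresh data; no writeback needed"
    if state == "S":
        return "S", "memory (or another sharer) supplies the data"
    return "I", "no action"

def on_local_write(state):
    """Our CPU writes the line."""
    if state in ("M", "E"):
        return "M", "silent upgrade: no bus traffic needed"
    if state == "S":
        return "M", "broadcast an invalidate on the bus first"
    return "M", "read-for-ownership on the bus first"

# The slide's scenario: we hold the line Exclusive, another cache read-misses.
print(on_remote_read("E"))   # ('S', 'supply fresh data; no writeback needed')
# Future writes from state S must go to the bus to be snooped:
print(on_local_write("S"))   # ('M', 'broadcast an invalidate on the bus first')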
Snoopy mechanism doesn’t scale ...
Single-chip implementations have moved to a centralized “directory” service that tracks the status of each line of each private cache. The directories attach to the on-chip cache network.
Multi-socket systems use distributed directories.
[Diagram: a 2-socket system; each socket is a multi-core chip with private L1s and an L2, its own bank of DRAM, and a directory for that DRAM.]
Figure 5.21, page 381 ... directory message basics. Conceptually similar to snoopy caches ... but the different mechanisms require rethinking the protocol to get correct behaviors.
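A sketch of what such a directory tracks per line, and the messages it sends (the field layout and message names are our assumptions; see Figure 5.21 for the textbook’s message types):

from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "U"                            # U(ncached), S(hared), E(xclusive)
    sharers: set = field(default_factory=set)   # caches holding the line

class Directory:
    """Tracks the status of each line of each private cache, and sends
    point-to-point messages instead of relying on bus snooping."""
    def __init__(self):
        self.entries = {}            # block # -> DirectoryEntry

    def read_miss(self, block, requester):
        e = self.entries.setdefault(block, DirectoryEntry())
        msgs = []
        if e.state == "E":           # fetch fresh data from the owner
            (owner,) = e.sharers
            msgs.append(("fetch", owner, block))
        e.state = "S"
        e.sharers.add(requester)
        msgs.append(("data_reply", requester, block))
        return msgs

    def write_miss(self, block, requester):
        e = self.entries.setdefault(block, DirectoryEntry())
        # Invalidate every other sharer, then grant exclusive ownership.
        msgs = [("invalidate", c, block) for c in e.sharers if c != requester]
        e.state, e.sharers = "E", {requester}
        msgs.append(("data_reply", requester, block))
        return msgs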
Other Machine Architectures
NUMA: Non-Uniform Memory Access
[Diagram: CPU 0 ... CPU 1023, each with a cache and a local DRAM, joined by an interconnection network.]
Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network. For best results, applications take the non-uniform memory latency into account. The network uses a coherent global address space: directory protocols over fiber networking.
Clusters: The supercomputing version of a WSC
Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet). Instead of using hardware to create a shared-memory abstraction, let the application build its own memory model.
Example: University of Illinois, a 650-node 2-CPU Apple Xserve cluster, connected with Myrinet (3.5 μs ping time -- a low-latency network).
On Tuesday
We return to CPU design ...
Have a good weekend!