CS 152 Computer Architecture and Engineering - Lecture 14: Cache Design and Coherence (John Lazzaro, UC Berkeley, 2014-3-6)


Page 1:

CS 152 Computer Architecture and Engineering
Lecture 14 - Cache Design and Coherence
2014-3-6
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/

Play:

Page 2:

Today: Shared Cache Design and Coherence

Crossbars and Rings: How to do on-chip sharing.
Concurrent requests: Interfaces that don’t stall.
CPU multi-threading: Keeps the memory system busy.
Coherency Protocols: Building coherent caches.

[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

Page 3:

Multithreading

Sun Microsystems Niagara series

Page 4:

The case for multithreading

Some applications spend their lives waiting for memory. C = compute, M = waiting.

Amdahl’s Law tells us that optimizing C is the wrong thing to do ...

Idea: Create a design that can multiplex threads onto one pipeline. Goal: Maximize throughput of a large number of threads.
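To make the C/M trade concrete, here is a minimal utilization model (the cycle counts are illustrative assumptions, not Niagara measurements):

```python
# One in-order pipeline shared by N threads; each thread alternates
# C cycles of compute with M cycles stalled waiting for memory.
def utilization(n_threads, C, M):
    # A single thread keeps the pipeline busy C out of every C + M cycles;
    # adding threads adds busy fractions until the pipeline saturates at 1.0.
    return min(1.0, n_threads * C / (C + M))

for n in (1, 2, 4, 8):
    print(n, "thread(s):", round(utilization(n, C=4, M=28), 2))
# 1 thread keeps the pipeline busy only 12% of the time; 8 threads saturate it.
```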

Page 5:

Multi-threading: Assuming perfect caches

[Diagram: the multithreaded pipeline behaves like 4 CPUs, each running at 1/4 the clock rate (S. Cray, 1962). Stage labels show threads T1-T4 in flight at the same time.]

Page 6:

[Diagram: the multi-threaded pipeline datapath (ID, EX, MEM, WB stages, with IR/A/B/Y/M/R pipeline registers and WE/MemToReg control arriving from WB). Because consecutive pipeline slots now hold instructions from different threads, the bypass network is no longer needed ...]

Result: Critical path shortens -- can trade for speed or power.

Page 7:

Multi-threading: Supporting cache misses

A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period.
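A minimal sketch of such a scheduler, assuming a simple round-robin policy and a fixed miss penalty (the structure and names are illustrative, not the Niagara hardware):

```python
from collections import deque

class ThreadScheduler:
    """Round-robin issue among threads; a thread that misses in the cache
    is parked until its miss penalty has elapsed."""
    def __init__(self, n_threads, miss_penalty):
        self.ready = deque(range(n_threads))   # threads eligible to issue
        self.waiting = {}                      # thread id -> cycle it becomes ready again
        self.miss_penalty = miss_penalty

    def pick(self, cycle):
        # Return any threads whose miss penalty has expired to the ready queue.
        for tid, wake in list(self.waiting.items()):
            if cycle >= wake:
                del self.waiting[tid]
                self.ready.append(tid)
        if not self.ready:
            return None                        # pipeline bubble: every thread is waiting
        tid = self.ready.popleft()
        self.ready.append(tid)                 # round-robin rotation
        return tid

    def report_miss(self, tid, cycle):
        # Take the thread off the pipeline for the duration of the miss.
        self.ready.remove(tid)
        self.waiting[tid] = cycle + self.miss_penalty
```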

Page 8:

Sun Niagara II: # threads/core?

8 threads/core: Enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.

Page 9:

Crossbar Networks

Page 10:

Shared-memory

CPUs share the lower level of the memory system, and I/O. Common address space, one operating system image.

Communication occurs through the memory system (100 ns latency, 20 GB/s bandwidth).

[Diagram: CPUs with private caches connect through shared ports to shared caches, DRAM, and I/O.]

Page 11:

Sun’s Niagara II: Single-chip implementation ...

SPC == SPARC Core. Only DRAM is not on chip.

Page 12:

Crossbar: Like N ports on an N-register file

[Diagram: a 32-entry, 32-bit register file. R0 is the constant 0; R1-R31 are written through a DEMUX on sel(ws) gated by WE; two read ports rd1 and rd2 are built from wide muxes selected by sel(rs1) and sel(rs2).]

Flexible, but ... reads slow down as O(N²). Why? The number of loads on each register output Q grows as O(N), and the wire length to the port mux grows as O(N).
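A rough model of that counting argument, assuming (as the slide does) that read delay scales with the product of an O(N) load term and an O(N) wire term; the unit delays are arbitrary placeholders:

```python
# Each output Q drives O(N) loads, and the wire to the read-port mux is
# O(N) long, so read delay grows as their product, O(N^2).
def read_delay(n_ports, t_load=1.0, t_wire=1.0):
    return (t_load * n_ports) * (t_wire * n_ports)

for n in (2, 4, 8, 16):
    print(f"{n:2d} ports -> relative read delay {read_delay(n):6.0f}")
# Doubling the number of ports roughly quadruples the read delay.
```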

Page 13:

Design challenge: High-performance crossbar

Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW. Crossbar BW: 270 GB/s total (Read + Write).

Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor. Goal: saturate DRAM BW.

Page 14:

Sun Niagara II 8 x 9 Crossbar

Every cross of blue and purple is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded).

Tri-state distributed mux, as in the microcode talk.
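A behavioral sketch of that crosspoint structure, using the slide's 8 x 9 port counts (a software model with assumed names, not the Niagara circuit):

```python
# 8 source ports x 9 destination ports: one crosspoint enable per pair.
SRC_PORTS, DST_PORTS = 8, 9

def crossbar(inputs, enables):
    """inputs: list of 8 values; enables: dict {(src, dst): bool}.
    Models one tri-state driver per crosspoint; the allocator must assert
    at most one enable per destination column."""
    outputs = [None] * DST_PORTS
    for (src, dst), on in enables.items():
        if on:
            assert outputs[dst] is None, "bus fight: two drivers on one output"
            outputs[dst] = inputs[src]
    return outputs

print(SRC_PORTS * DST_PORTS, "independent crosspoint control signals")  # 72
```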

Page 15:

Sun Niagara II 8 x 9 Crossbar

8 ports on the CPU side (one per core). 8 ports for L2 banks, plus one for I/O.

4 cycle latency (715 ps/cycle). Cycles 1-3 are for arbitration; data transmits on cycle 4.

100-200 wires/port (each way). Pipelined.

Page 16:

A complete switch transfer (4 epochs)

Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.

Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, each for a different set of requests.

Page 17:

Epoch 3: The Allocation Problem (4 x 4)

Request matrix (rows = input ports A-D, columns = output ports W-Z). A 1 codes that an input has data ready to send to an output.

      W  X  Y  Z
  A   0  0  1  0
  B   1  0  0  0
  C   0  0  1  0
  D   1  0  0  0

The allocator returns a matrix with at most one 1 in each row and column, used to set the switches:

      W  X  Y  Z
  A   0  0  1  0
  B   0  0  0  0
  C   0  0  0  0
  D   1  0  0  0

The algorithm should be “fair”, so no port always loses ... and should also “scale” to run large matrices fast.
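For illustration, a single-pass greedy allocator with a rotating priority in this spirit (real switch allocators are more elaborate; this is only a sketch, and it may pick a different legal grant than the slide's example):

```python
def allocate(requests, start=0):
    """requests[i][j] == 1 if input i wants output j.
    Returns a grant matrix with at most one 1 per row and per column.
    Rotating the 'start' offset across cycles keeps the choice fair."""
    n = len(requests)
    grants = [[0] * n for _ in range(n)]
    taken = set()                              # output columns already granted
    for k in range(n):
        i = (start + k) % n                    # rotate which input picks first
        for j in range(n):
            if requests[i][j] and j not in taken:
                grants[i][j] = 1
                taken.add(j)
                break                          # at most one grant per input
    return grants

requests = [[0, 0, 1, 0],                      # A wants Y
            [1, 0, 0, 0],                      # B wants W
            [0, 0, 1, 0],                      # C wants Y (conflicts with A)
            [1, 0, 0, 0]]                      # D wants W (conflicts with B)
print(allocate(requests, start=0))             # A gets Y, B gets W; C and D retry next cycle
```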

Page 18:

Sun Niagara II Crossbar Notes

Crossbar defines floorplan: all port devices should be equidistant to the crossbar. Uniform latency between all port pairs.

Low latency: 4 cycles (less than 3 ns).

Page 19:

Page 20:

Sun Niagara II Energy Facts

Crossbar: only 1% of total power.

Page 21:

Sun Niagara II Crossbar Notes

Crossbar defines floorplan: all port devices should be equidistant to the crossbar. Uniform latency between all port pairs.

Low latency: 4 cycles (less than 3 ns).

Did not scale up for the 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.

Design alternatives to crossbar?

Page 22:

CLOS Networks: From telecom world ...

Build a high-port switch by tiling fixed-sized shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.

Page 23:

CLOS Networks: An example route

Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration still needed to prevent blocking.

Page 24:

Ring Networks

Page 25:

Intel Xeon: Data Center server chip. 20% of Intel’s revenues, 40% of profits. Why? Cloud is growing, and Xeon is dominant.

Page 26:

Compiled Chips

Xeon is a chip family, varying by # of cores and L3 cache size. Chip family mask layouts are generated automatically, by adding core/cache slices.

[Die photo label: Ring Bus]

Page 27:

Bi-directional Ring Bus connects: cores, cache banks, DRAM controllers, off-chip I/O.

The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to baseline, the increase is small.

[Die photo label: Ring Stop]

Page 28:

2.5 MB L3 cache slice from Xeon E5. Tiles along the x-axis are the 20 ways of the cache. The ring stop interface lives in the Cache Control Box (CBOX).

Page 29:

Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command), passing Ring Stop #1, Ring Stop #2, Ring Stop #3, ...

Ring Stop #2 Interface. Reading: sense Data Out to see if the message is for Ring Stop #2; if so, latch the data and mux Empty onto the ring. Writing: check if Data Out is Empty; if so, mux a message onto the ring via the Data In port.
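A behavioral sketch of one ring stop following that read/write rule (the slot fields and class names are assumptions for illustration):

```python
EMPTY = None

class RingStop:
    def __init__(self, my_id):
        self.my_id = my_id
        self.outbox = []          # messages waiting to be sent
        self.inbox = []           # messages received at this stop

    def clock(self, slot):
        """Called once per ring cycle with the slot arriving on Data Out.
        Returns the slot to drive on Data In (toward the next stop)."""
        # Reading: if the message is addressed to us, latch it and free the slot.
        if slot is not EMPTY and slot["recipient"] == self.my_id:
            self.inbox.append(slot)
            slot = EMPTY
        # Writing: if the slot is empty and we have traffic, mux our message on.
        if slot is EMPTY and self.outbox:
            slot = self.outbox.pop(0)
        return slot

# One message circulating on a toy three-stop ring.
stops = [RingStop(0), RingStop(1), RingStop(2)]
stops[0].outbox.append({"sender": 0, "recipient": 2, "command": "read", "addr": 0x40})
slot = EMPTY
for _ in range(4):                 # a few ring cycles
    for stop in stops:
        slot = stop.clock(slot)
print(stops[2].inbox)              # the message arrives at Ring Stop #2
```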

Page 30:

In practice: “Extreme EE” to co-optimize bandwidth, reliability.

Page 31:

Debugging: “Network analyzer” built into chip to capture ring messages of a particular kind. Sent off chip via an aux port.

Page 32:

A derivative of this ring bus is also used on laptop and desktop chips.

Page 33:

Break

Play:

Page 34:

Hit-over-Miss Caches

Page 35:

Recall: CPU-cache port that doesn’t stall on a miss

[Diagram: Queue 1 carries requests from the CPU to the cache; Queue 2 carries replies from the cache back to the CPU.]

CPU makes a request by placing the following items in Queue 1:

CMD: Read, write, etc ...
TAG: 9-bit number identifying the request.
MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
MADDR: Memory address of first byte.
STORE-DATA: For stores, the data to store.

Page 36:

This cache is used in an ASPIRE CPU (Rocket).

When the request is ready, the cache places the following items in Queue 2:

TAG: Identity of the completed command.
LOAD-DATA: For loads, the requested data.

The CPU saves info about requests, indexed by TAG.

Why use the TAG approach? Multiple misses can proceed in parallel. Loads can return out of order.
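A sketch of the CPU-side bookkeeping this tagged interface implies (a model of the idea only; the field and function names are assumptions, not the Rocket interface):

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class Request:          # goes into Queue 1
    cmd: str            # "read" or "write"
    tag: int            # 9-bit request identifier
    mtype: int          # access size in bits: 8, 16, 32, or 64
    maddr: int          # memory address of first byte
    store_data: int = 0 # for stores only

@dataclass
class Reply:            # comes back on Queue 2, possibly out of order
    tag: int
    load_data: int

queue1, queue2 = Queue(), Queue()
pending = {}                               # CPU-side table, indexed by TAG

def issue_load(tag, addr, dest_reg):
    pending[tag] = dest_reg                # remember where the data should go
    queue1.put(Request("read", tag, 32, addr))

def retire_replies(regfile):
    while not queue2.empty():
        r = queue2.get()
        regfile[pending.pop(r.tag)] = r.load_data   # tags let replies return in any order
```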

Page 37:

Today: How a read request proceeds in the L1 D-Cache

The CPU requests a read by placing MTYPE, TAG, MADDR in Queue 1.

We do a normal cache access. If there is a hit, we place the load result in Queue 2. In the case of a miss, we use the Inverted Miss Status Holding Register.

“We” == the L1 D-Cache controller.

Page 38:

Inverted MSHR (Miss Status Holding Register)

[Diagram: a 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a Valid bit, a Cache Block # (bits 42:0), the 1st Byte in Block (bits 4:0), and the MTYPE (bits 1:0), plus a hard-wired Tag ID (ROM). A comparator per entry matches the Cache Block # and raises Hit; Valid qualifies Hit.]

To look up a memory address ...

(1) Associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.

Assumptions: 32-byte blocks, 48-bit physical address space.

Page 39:

Inverted MSHR (Miss Status Holding Register)

[Same 512-entry table diagram as the previous page.]

To look up a memory address ...

(2) Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags.

Assumptions: 32-byte blocks, 48-bit physical address space.

Page 40:

Inverted MSHR (Miss Status Holding Register)

[Same 512-entry table diagram as the previous page.]

(3) Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear their valid bits. Also update the L1 cache.

Assumptions: 32-byte blocks, 48-bit physical address space.
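A software model of those three steps, under the slide's assumptions (9-bit tags, 32-byte blocks, 48-bit physical addresses); this is a sketch of the behavior, not RTL:

```python
BLOCK_BITS = 5                       # 32-byte blocks
N_TAGS = 512                         # every 9-bit TAG has an entry

# One entry per TAG: valid, block #, first byte in block, access type.
mshr = [{"valid": False, "block": 0, "offset": 0, "mtype": 0} for _ in range(N_TAGS)]

def lookup_and_allocate(tag, maddr, mtype):
    """Steps (1) and (2): on a miss, decide whether a memory request is
    needed, then record the transaction in the entry indexed by TAG."""
    block = maddr >> BLOCK_BITS
    already_pending = any(e["valid"] and e["block"] == block for e in mshr)  # step (1)
    e = mshr[tag]                    # step (2): the CPU promises this entry has valid == False
    e.update(valid=True, block=block, offset=maddr & ((1 << BLOCK_BITS) - 1), mtype=mtype)
    return not already_pending       # True -> issue a request to the memory system

def fill(block, block_data, queue2):
    """Step (3): memory returned a block; complete every pending tag that matches."""
    for tag, e in enumerate(mshr):
        if e["valid"] and e["block"] == block:
            queue2.append((tag, block_data[e["offset"]]))   # reply with data at the requested offset
            e["valid"] = False
```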

Page 41:

Inverted MSHR notes

High cost (# of comparators + SRAM cells). See Farkas and Jouppi on the class website for low-cost designs that are often good enough.

Structural hazards only occur when the TAG space is exhausted by the CPU.

We will return to MSHRs to discuss CPI performance later in the semester.

Page 42:

Coherency Hardware

Page 43:

Cache Placement

Page 44:

Two CPUs, two caches, shared DRAM ...

Both CPUs have private write-through caches; shared main memory initially holds value 5 at address 16.

CPU0: LW R2, 16(R0) -- CPU0’s cache now holds (addr 16, value 5).
CPU1: LW R2, 16(R0) -- CPU1’s cache now holds (addr 16, value 5).
CPU1: SW R0, 16(R0) -- CPU1’s cache and main memory now hold 0 at address 16, but CPU0’s cache still holds 5.

View of memory no longer “coherent”. Loads of location 16 from CPU0 and CPU1 see different values!
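The same sequence as a tiny simulation, assuming private write-through caches with no invalidation, as on the slide:

```python
memory = {16: 5}
cache = [dict(), dict()]            # private write-through caches for CPU0, CPU1

def load(cpu, addr):
    if addr not in cache[cpu]:      # miss: fill from main memory
        cache[cpu][addr] = memory[addr]
    return cache[cpu][addr]

def store(cpu, addr, value):        # write-through, but no invalidation of the other cache
    cache[cpu][addr] = value
    memory[addr] = value

load(0, 16)                         # CPU0: LW R2, 16(R0)
load(1, 16)                         # CPU1: LW R2, 16(R0)
store(1, 16, 0)                     # CPU1: SW R0, 16(R0)
print(load(0, 16), load(1, 16))     # 5 0  -> the two CPUs disagree: not coherent
```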

Today: What to do ...

Page 45:

The simplest solution ... one cache!

CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in two caches!

A memory switch connects the CPUs to a shared multi-bank cache in front of shared main memory. Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.

Page 46:

Not a complete solution ... good for L2.

For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good.

This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched. Sequent Systems (1980s).

Page 47:

Modified form: Private L1s, shared L2

Each CPU keeps private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache in front of shared main memory.

Advantages of shared L2 over private L2s: Processors communicate at cache speed, not DRAM speed. Constructive interference, if both CPUs need the same data/instructions.

Disadvantage: CPUs share BW to the L2 cache ...

Thus, we need to solve the cache coherency problem for the L1 caches.

Page 48:

IBM Power4 (2001)

Dual core. Shared, multi-bank L2 cache. Private L1 caches. Off-chip L3 caches.

Page 49:

Cache Coherency

Page 50:

Cache coherency goals ...

[Diagram: the same two-CPU, two-cache, shared-memory picture as before.]

1. Only one processor at a time has write permission for a memory location.

2. No processor can load a stale copy of a location after a write.

Page 51:

Simple Implementation: Snoopy Caches

Each cache has the ability to “snoop” on the memory bus transactions of the other CPUs.

The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs.

Page 52:

Writes from 10,000 feet ... for write-thru L1

For write-thru caches ...

1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches. Reads will no longer hit in those caches and get stale data.
3. The write is sent to main memory. Reads of that address will now miss and retrieve the new value from main memory.

To a first order, reads will “just work” if write-thru caches implement this policy. A “two-state” protocol (cache lines are “valid” or “invalid”).
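A sketch of that two-state (valid/invalid) write-thru invalidate policy, as a variant of the earlier toy model:

```python
memory = {16: 5}
cache = [dict(), dict()]            # valid lines per CPU: addr -> value

def load(cpu, addr):
    if addr not in cache[cpu]:      # invalid or absent: miss to main memory
        cache[cpu][addr] = memory[addr]
    return cache[cpu][addr]

def store(cpu, addr, value):
    # 1. The writing CPU takes the bus (implicit here).
    # 2. The address is invalidated in all other caches (their snoopers drop the line).
    for other in range(len(cache)):
        if other != cpu:
            cache[other].pop(addr, None)
    # 3. Write-through to main memory (and update the local copy).
    cache[cpu][addr] = value
    memory[addr] = value

load(0, 16); load(1, 16)
store(1, 16, 0)
print(load(0, 16), load(1, 16))     # 0 0 -> both CPUs now agree
```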

Page 53:

Limitations of the write-thru approach

Every write goes to the bus. Total bus write bandwidth does not support more than 2 CPUs, in modern practice. To scale further, we need to use write-back caches.

Write-back big trick: add extra states. Simplest version: MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). State definitions are subtle ...

Page 54:

Figure 5.5, page 358 ... the best starting point.

Page 55:

Read misses ... for a MESI protocol, with write-back caches ...

1. A cache requests a cache-line fill for a read miss.
2. Another cache holding the line exclusively responds with fresh data; the read miss does not have to retrieve a possibly stale copy from main memory.
3. The responding cache changes the line from exclusive to shared. Future writes to the line will go to the bus to be snooped.

These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires understanding the complete “state machine” for the protocol.
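A condensed sketch of just the transitions used in this example (a toy model; a real MESI controller handles many more events, see Figure 5.5 in the text):

```python
# States: 'M' modified, 'E' exclusive, 'S' shared, 'I' invalid.
def on_bus_read(state):
    """Another cache issued a read for a line we hold (we are snooping)."""
    if state in ('M', 'E'):
        return 'S', True        # supply fresh data, drop to shared
    if state == 'S':
        return 'S', False       # memory (or another sharer) supplies it
    return 'I', False

def on_cpu_write(state):
    """Our own CPU writes the line."""
    if state in ('M', 'E'):
        return 'M', False       # silent upgrade: no bus traffic needed
    # In S or I we must broadcast an invalidate / read-for-ownership first.
    return 'M', True            # True -> the write goes to the bus to be snooped

print(on_bus_read('E'))         # ('S', True): the responder supplies data, goes to shared
print(on_cpu_write('S'))        # ('M', True): writes after sharing must hit the bus
```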

Page 56:

Snoopy mechanism doesn’t scale ...

Single-chip implementations have moved to a centralized “directory” service that tracks the status of each line of each private cache.

Multi-socket systems use distributed directories.

Page 57:

Directories attached to on-chip cache network ...

Page 58:

2 socket system ... each socket a multi-core chip

Each chip has its own bank of DRAM.

Page 59:

Distributed directories for multi-socket systems

[Diagram: two chips, each with L1s, an L2, and its own bank of DRAM. Chip 0 holds the directory for Chip 0's DRAM; Chip 1 holds the directory for Chip 1's DRAM.]

Page 60:

Figure 5.21, page 381 ... directory message basics

Conceptually similar to snoopy caches ... but the different mechanisms require rethinking the protocol to get correct behaviors.
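For a sense of what a directory tracks, here is a sketch of one common organization (a state plus a sharer set per line; the names and messages are illustrative assumptions, not Figure 5.21's exact protocol):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DirectoryEntry:
    state: str = "uncached"                        # "uncached", "shared", or "modified"
    sharers: set = field(default_factory=set)      # caches holding the line
    owner: Optional[int] = None                    # meaningful only when "modified"

def read_miss(entry, requester):
    """Directory's response to a read-miss message from cache `requester`."""
    msgs = []
    if entry.state == "modified":
        msgs.append(("fetch", entry.owner))        # owner must write back / forward the line
        entry.sharers = {entry.owner, requester}
        entry.owner = None
    else:
        entry.sharers.add(requester)               # memory can supply the data directly
    entry.state = "shared"
    return msgs + [("data_reply", requester)]

print(read_miss(DirectoryEntry("modified", {2}, 2), requester=0))
# [('fetch', 2), ('data_reply', 0)]
```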

Page 61:

Other Machine Architectures

Page 62:

NUMA: Non-uniform Memory Access

[Diagram: CPU 0 through CPU 1023, each with its own cache and DRAM, connected by an interconnection network.]

Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network. For best results, applications take the non-uniform memory latency into account.

The network uses a coherent global address space. Directory protocols over fiber networking.

Page 63:

Clusters: Supercomputing version of WSC

Connect large numbers of 1-CPU or 2-CPU rack mount computers together with high-end network technology (not normal Ethernet).

Instead of using hardware to create a shared memory abstraction, let an application build its own memory model.

University of Illinois, 650 2-CPU Apple Xserve cluster, connected with Myrinet (3.5 μs ping time - low latency network).

Page 64:

On Tuesday

We return to CPU design ...

Have a good weekend!