18-600 foundations of computer systemsece600/lectures/lecture17.pdf · 18-600 foundations of...

62
Recommended Reference: “Parallel Computer Organization and Design,” by Michel Dubois, Murali Annavaram, Per Stenstrom, Chapters 5 and 7, 2012. 18 - 600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” John P. Shen October 25, 2017 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007: 90% for desktops and mobiles, 100% for servers Today: 100% multicore processors with core counts ranging from 2 to 8 cores for desktops and mobiles, 8+ cores for servers 10/25/2017 (© J.P. Shen) 18-600 Lecture #17 1

Upload: others

Post on 30-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

➢ Recommended Reference:• “Parallel Computer Organization and Design,” by Michel Dubois,

Murali Annavaram, Per Stenstrom, Chapters 5 and 7, 2012.

18-600 Foundations of Computer Systems

Lecture 17:“Multicore Cache Coherence”

John P. Shen October 25, 2017

Prevalence of multicore processors:▪ 2006: 75% for desktops, 85% for servers

▪ 2007: 90% for desktops and mobiles, 100%

for servers

▪ Today: 100% multicore processors with core

counts ranging from 2 to 8 cores for

desktops and mobiles, 8+ cores for servers

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 1

Page 2: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Lecture 17:“Multicore Cache Coherence”

A. Multicore Processors▪ The Case for Multicores▪ Programming for Multicores▪ The Cache Coherence Problem

B. Cache Coherence Protocol Categories▪ Write Update▪ Write Invalidate

C. Specific Bus-Based Snoopy Protocols▪ VI & MI Protocols▪ MSI, MESI, MOESI Protocols

18-600 Foundations of Computer Systems

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 2

Page 3: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

The Case for Multicore Processors (MCP)

Regs

L1 d-cache

L1 i-cache

L2 unified

cache

Core 0

Regs

L1 d-cache

L1 i-cache

L2 unified

cache

Core 3

L3 unified cache

(shared by all cores)

Main memory

Multicore Processor package

Stalled Scaling of Single-Core Performance

Expected continuation of Moore’s Law

Throughput Performance for Server Workloads

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 3

Page 4: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Processor Scaling Until ~2004

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 4

Page 5: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Processor Development Until ~2004

Moore’s Law: transistor count doubles every 18 months

Used to improve processor performance by 2x every 18 months

Single core, binary compatible to previous generations

Contributors to performance improvements

More ILP through OOO superscalar techniques

Wider issue, better branch prediction, better instruction scheduling, …

Better memory hierarchies, faster and larger

Clock frequency improvements with deeper pipelines

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 5

Page 6: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Problems with Single Core Performance

Moore’s Law is still doing well (for the foreseeable future…)

The Power Wall

Power ≈ CL * Vdd2 * Freq

Cannot scale transistor count and frequency without reducing Vdd

Unfortunately, voltage scaling has essentially stalled

The Complexity Wall

Designing and verifying increasingly large OOO cores is very expensive

100s of engineers for 3-5 years

Caches are easier to design but can only help so much…

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 6

Page 7: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Power & Latency (Single-Thread) Performance

15

20

25

30

0 2

Relative Performance

Rela

tive P

ow

er

Pentium 4 (Psc)

power = perf (1.74)

Pentium 4 (Wmt)

Pentium Pro

Pentiumi486

0

5

10

4 6 8

[Ed Grochowski, 2004]

❖ For comparison▪ Factor out contributions due to process

technology

▪ Keep contributions due to microarchitecture design

▪ Normalize to i486™ processor

❖ Relative to i486™ Pentium® 4 (Wmt) processor is▪ 6x faster (2X IPC at 3X frequency)

▪ 23x higher power

▪ Spending 4 units of power for every 1 unit of scalar performance

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 7

Page 8: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

From ILP to TLP

So far, we run single process, single thread

Extracting ILP from sequential instruction stream

Single-thread performance can't scale indefinitely!

Limited ILP within each thread

Power consumption & complexity of superscalar cores

We will now pursue Thread-Level Parallelism (TLP)

To increase utilization and tolerate latency in single core

To exploit multiple cores

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 8

Page 9: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Thread-Level Parallelism

Instruction-level parallelism (ILP)

Reaps performance by finding independent work in a single thread

Thread-level parallelism (TLP)

Reaps performance by finding independent work across multiple threads

Historically, requires explicitly parallel workloads

Originate from mainframe time-sharing workloads

Even then, CPU speed >> I/O speed

Had to overlap I/O latency with “something else” for the CPU to do

Hence, operating system would schedule other tasks/processes/threads that were “time-sharing” the CPU

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 9

Page 10: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Thread-Level Parallelism

Reduces effectiveness of temporal and spatial locality

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 10

Page 11: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Thread-Level Parallelism

Initially motivated by time-sharing of single CPU OS, applications written to be multithreaded

Quickly led to adoption of multiple CPUs in a single system Enabled scalable product line from entry-level single-CPU systems to high-end

multiple-CPU systems Same applications, OS, run seamlessly Adding CPUs increases throughput (performance)

More recently: Multiple threads per processor core

Coarse-grained multithreading (aka “switch-on-event”) Simultaneous multithreading (aka “hyper-threading”)

Multiple processor cores per die Chip multiprocessors (CMP) or “Muticore processors” (MCP) Chip multithreading (CMT)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 11

Page 12: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Recall: Processes and Software Threads

Process: an instance of a program executing in a system

OS supports concurrent execution of multiple processes

Each process has its own address space, set of registers, and PC

Two different processes can partially share their address spaces to communicate

Thread: an independent control stream within a process

A process can have one or more threads

Private state: PC, registers (int, FP), stack, thread-local storage

Shared state: heap, address space (VM structures)

A “parallel program” is one process but multiple threads

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 12

Page 13: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Reminder: Classic OS Context Switch

OS context-switch

Timer interrupt stops a program mid-execution (precise)

OS saves the context of the stopped thread

PC, GPRs, and more

Shared state such as physical pages are not saved

OS restores the context of a previously stopped thread (all except PC)

OS uses a “return from exception” to jump to the restarting PC

The restored thread has no idea it was interrupted, removed, later restarted

Take a few hundred cycles per switch (why?)

Amortized over the execution “quantum”

What latencies can you hide using OS context switching?

How much faster would a user-level thread switch be?

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 13

Page 14: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Multithreaded Cores (old “Multiprogramming”)

Basic idea:

CPU resources are expensive and should not be idle

1960’s: Virtual memory and multiprogramming

Virtual memory/multiprogramming invented to tolerate latency to secondary storage (disk/tape/etc.)

Processor-disk speed mismatch:

microseconds to tens of milliseconds (1:10,000 or more)

OS context switch used to bring in other useful work while waiting for page fault or explicit file read/write accesses

Cost of context switch must be much less than I/O latency (easy)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 14

Page 15: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Multithreaded Cores (new “Multithreading”)

1990’s: Memory wall and multithreading Processor-DRAM speed mismatch:

nanosecond to fractions of a microsecond (1:500)

H/W task switch used to bring in other useful work while waiting for cache miss

Cost of context switch must be much less than cache miss latency

Very attractive for applications with abundant thread-level parallelism Commercial multi-user (transaction processing) workloads

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 15

Page 16: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Processor Scaling Since ~2005

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 16

Page 17: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

The Multicore Alternative Use Moore’s law to place more cores per chip

Potentially 2x cores/chip with each CMOS generation

Without significantly compromising clock frequency

Known as Multi-Core Processors (MCP) or Chip Multiprocessors (CMP)

The good news

Continued scaling of chip-level peak (throughput) performance

Mitigate the undesirable superscalar power scaling (“wrong side of the square law”)

Facilitate design and verification, and product differentiation

The bad news

Require multithreaded workloads: multiple programs or parallel programs

Require parallelizing single applications into parallel programs

Power is still an issue as transistors shrink due to leakage current

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 17

Page 18: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Big OOO Superscalar vs. Multicore Processor

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 18

Page 19: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Power & Throughput (Multi-Thread) Performance

power = perf (1.74)

30

Rela

tive P

ow

er

10

15

20

25

0 2 4 6 8

Relative Performance

Pentium 4 (Psc)

Pentium 4 (Wmt)

Pentium Pro

Pentium

i486

0

5

Pentium M

power = perf (1.0) ?

power = perf (1.74)

Scalar/LatencyPerformance

ThroughputPerformance

❖ Assume a large-scale multicore processor (MCP) with potentially many cores

❖ Replication of cores results in nearly proportional increases to both throughput performance and power (hopefully).

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 19

[Ed Grochowski, 2004]

Page 20: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Programming for Multicore Processors (MCP)

➢ Programmers must write parallel programs using threads/processes.

➢ Spread the workload across multiple cores at run time.

➢ OS will map threads/processes to cores at run time.

Assigning Threads to Cores:

➢ Each thread/process has an affinity mask

➢ Affinity mask specifies what cores the thread is allowed to run on.

➢ Different threads can have different masks

➢ Affinities are inherited across fork()

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 20

Page 21: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Shared-Memory Multiprocessors or Multicores

All processor cores have access to unified physical memory They can communicate via the shared memory using loads and stores

Advantages Supports multi-threading (TLP) using multiple cores Requires relatively simple changes to the OS for scheduling Threads within an app can communicate implicitly without using OS

Simpler to code for and lower overhead App development: first focus on correctness, then on performance

Disadvantages Implicit communication is hard to optimize Synchronization can get tricky Higher hardware complexity for cache management

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 21

Page 22: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 22

Caches are (equally) helpful with multicores Reduce access latency, reduce bandwidth requirements For both private and shared data across cores

Advantages of private caches: They are closer to core, so faster access Reduces contention to cache by cores

Advantages of shared cache: Threads on different cores can share the same cache data More cache space available if a single (or a few) high-performance

thread runs on the system

But multiple private caches introduce the two problems of Cache Coherence (cover in this lecture) Memory Consistency (beyond this course)

Caches for Multicores (or Multicore Processors)

Page 23: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 23

• Since we have private caches:

How to keep the data consistent across caches?

• Each core should perceive the memory as a monolithic array, shared

by all the cores

The Cache Coherence Problem

Page 24: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 24

Suppose variable x initially contains 15213

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=15213

multi-core chip

The Cache Coherence Problem

Page 25: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 25

Core 1 reads x

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=15213

multi-core chip

The Cache Coherence Problem

Page 26: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 26

Core 2 reads x

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=15213

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=15213

multi-core chip

The Cache Coherence Problem

Page 27: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 27

Core 1 writes to x, setting it to 21660

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chipassuming

write-through

caches

The Cache Coherence Problem

Page 28: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 28

Core 2 attempts to read x… gets a stale copy

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chip

The Cache Coherence Problem

Page 29: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 29

• This is a general problem with shared memory

multiprocessors and multicores with private caches

• Coherence Solution:

• Use HW to ensure that loads from all cores will return the

value of the latest store to that memory location

• Use metadata to track the state for cached data

• There exist two major categories with many specific

coherence protocols.

Solutions for Cache Coherence Problem

Page 30: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 30

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

Main memory

multi-core chip

inter-core bus

Bus Based (“Snooping”) Multicore Processor

Page 31: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 31

• Invalidation:

If a core writes to a data item, all other copies of this

data item in other caches are invalidated

• Snooping:

All cores continuously “snoop” (monitor) the bus

connecting the cores.

Invalidation Protocol with Snooping

Page 32: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 32

Revisited: Cores 1 and 2 have both read x

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=15213

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=15213

multi-core chip

Invalidation Based Cache Coherence Protocol

Page 33: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 33

Core 1 writes to x, setting it to 21660

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

x=15213

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chipassuming

write-through

caches

INVALIDATEDsends

invalidation

request

inter-core bus

Invalidation Based Cache Coherence Protocol

Page 34: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 34

After invalidation:

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chip

Invalidation Based Cache Coherence Protocol

Page 35: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 35

Core 2 reads x. Cache misses, and loads the new copy.

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

x=21660

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chip

Invalidation Based Cache Coherence Protocol

Page 36: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 36

Core 1 writes x=21660:

Core 1 Core 2 Core 3 Core 4

One or more

levels of

cache

x=21660

One or more

levels of

cache

x=21660

One or more

levels of

cache

One or more

levels of

cache

Main memory

x=21660

multi-core chipassuming

write-through

caches

UPDATED

broadcasts

updated

valueinter-core bus

Update Based Cache Coherence Protocol

Page 37: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

18-600 Lecture #1710/25/2017 (© J.P. Shen) 37

• Multiple writes to the same location

– invalidation: only the first time

– update: must broadcast each write

(which includes new variable value)

• Invalidation generally performs better:

it generates less bus traffic

Invalidation vs. Update Protocols

Page 38: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Cache Coherence

Informally, with coherent caches: accesses to a memory location appear to occur simultaneously in all copies of that memory location“copies” caches + memory

Cache coherence suggests an absolute time scale -- this is not necessary What is required is the "appearance" of coherence... not

absolute coherence E.g. temporary incoherence between memory and a write-back

cache may be OK.

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 38

Page 39: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Write Update vs. Write Invalidate

Coherent caches withShared Memory

All cores see the effects of others’ writes

How/when writes are propagated

Determined by coherence protocol

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 39

Page 40: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Bus-Based Snoopy Cache Coherence

All requests broadcast on the bus

All processors (or private caches) and memory snoop and respond

Cache blocks writeable at one processor or read-only at several

Single-writer protocol

Snoops that hit dirty (i.e. modified) lines?

Flush modified data out of cache

Either write back to memory, then satisfy remote miss from memory, or

Provide dirty (modified) data directly to requestor

Big problem in shared-memory multicore processor systems

Dirty/coherence/sharing misses

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 40

Page 41: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Bus-Based Protocols

Cache

ControllerCache Data

Processor

Bus

Processor Actions

Bus Actions

State Tags

Protocol consists of states and actions (state transitions)

Actions can be invoked from processor or bus tothe cache controller

Coherence based on per cache line (block)

Main Memory

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 41

Page 42: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Minimal Coherence Protocol (Write-Back Caches)

Valid (M)

Invalid (I)

Local Read or

Local Write

Evict or Remote Read or Remote Write

Local Read or Local Write

Tag State Data

A M …

B I …

Cache

Blocks are always private or exclusive

State transitions: Local read: I->M, fetch,

invalidate other copies Local write: I->M, fetch,

invalidate other copies Evict: M->I, write back data Remote read: M->I, write

back data Remote write: M->I, write

back data

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 42

Page 43: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Invalidate Protocol Optimization

Observation: data often read shared by multiple CPUs

Add S (shared) state to protocol: MSI

State transitions:

Local read: I->S, fetch shared

Local write: I->M, fetch modified; S->M, invalidate other copies

Remote read: M->I, write back data

Remote write: M->I, write back data

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 43

Page 44: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MSI Protocol (with Write Back Cache)

Action and Next State

Current

State

Processor

Read

Processor

Write

Eviction Cache

Read

Cache

Read&M

Cache

Upgrade

I Cache Read

Acquire

Copy

→ S

Cache Read&M

Acquire Copy

→ M

No Action

→ I

No Action

→ I

No Action

→ I

S No Action

→ S

Cache Upgrade

→ M

No Action

→ I

No Action

→ S

Invalidate

Frame

→ I

Invalidate

Frame

→ I

M No Action

→ M

No Action

→ M

Cache

Write

back

→ I

Memory

inhibit;

Supply

data;

→ S

Invalidate

Frame;

Memory

inhibit;

Supply data;

→ I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 44

Page 45: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MSI ExampleThread Event Bus Action Data From Global State Local

States:

C0 C1 C2

0. Initially: <0,0,0,1> I I I

1. T0 read→ CR Memory <1,0,0,1> S I I

2. T0 write→ CU <1,0,0,0> M I I

3. T2 read→ CR C0 <1,0,1,1> S I S

4. T1 write→ CRM Memory <0,1,0,0> I M I

If line is in no other cache Read, modify, Write requires 2 bus transactions Optimization: add Exclusive state

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 45

Page 46: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MSI: A Coherence Protocol (Write Back Caches)

M

S I

M: ModifiedS: SharedI: Invalid

Each cache line has a tag

Address tag

statebits

Write miss

Other processorintent to write

Readmiss

Other processorintent to write

Read by anyprocessor

P1 readsor writes

Cache state in processor P1

Other processor readsP1 writes back

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 46

Page 47: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MSI Coherence Protocol Example with 2 Cores

M

S I

Write miss

Readmiss

P2 intent to write

P2 reads,P1 writes back

P1 readsor writes

P2 intent to write

P1

M

S I

Write miss

Readmiss

P1 intent to write

P1 reads,P2 writes back

P2 readsor writes

P1 intent to write

P2

P1 reads

P1 writes

P2 reads

P2 writes

P1 writes

P2 writes

P1 reads

P1 writes

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 47

Page 48: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Invalidate Protocol Optimizations

Observation: data can be write-private (e.g. stack frame)

Avoid invalidate messages in that case

Add E (exclusive) state to protocol: MESI

State transitions:

Local read: I->E if only copy, I->S if other copies exist

Local write: E->M silently, S->M, invalidate other copies

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 48

Page 49: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MESI Protocol (most common in industry)

Variation used in many Intel processors

4-State Protocol

Modified: <1,0,0…0>

Exclusive: <1,0,0,…,1>

Shared: <1,X,X,…,1>

Invalid: <0,X,X,…X>

Bus/Processor Actions

Same as MSI

Adds shared signal to indicate if other caches have a copy

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 49

Page 50: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MESI ProtocolAction and Next State

Current

State

Processor

Read

Processor

WriteEviction

Cache

Read

Cache

Read&M

Cache

Upgrade

I

Cache

Read

If no

sharers:

→ E

If sharers:

→ S

Cache Read&M

→ M

No Action

→ I

No Action

→ I

No Action

→ I

SNo Action

→ S

Cache Upgrade

→ M

No Action

→ I

Respond

Shared:

→ S

No Action

→ I

No Action

→ I

ENo Action

→ E

No Action

→ M

No Action

→ I

Respond

Shared;

→ S

No Action

→ I

MNo Action

→ M

No Action

→ M

Cache

Write-back

→ I

Respond

dirty;

Write back

data;

→ S

Respond

dirty;

Write back

data;

→ I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 50

Page 51: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MESI Example

Thread Event Bus

Action

Data

From

Global State Local States:

C0 C1 C2

0. Initially: <0,0,0,1> I I I

1. T0 read→ CR Memory <1,0,0,1> E I I

2. T0 write→ none <1,0,0,0> M I I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 51

Page 52: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Cache-to-Cache Transfers

Common in many workloads:

T0 writes to a block: <1,0,…,0> (block in M state in T0)

T1 reads from block: T0 must write back, then T1 reads from memory

In shared-bus system

T1 can snarf data from the bus during the writeback

Called cache-to-cache transfer or dirty miss or intervention

Without shared bus

Must explicitly send data to requestor and to memory (for writeback)

Known as the 4th C (cold, capacity, conflict, communication)

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 52

Page 53: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MESI Example 2

Thread Event Bus

Action

Data From Global State Local States:

C0 C1 C2

0. Initially: <0,0,0,1> I I I

1. T0 read→ CR Memory <1,0,0,1> E I I

2. T0 write→ none <1,0,0,0> M I I

3. T1 read→ CR C0 <1,1,0,1> S S I

4. T2 read→ CR Memory <1,1,1,1> S S S

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 53

Page 54: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MOESI Optimization (IEEE Standard)

Observation: shared ownership prevents cache-to-cache transfer, causes unnecessary memory read

Add O (owner) state to protocol: MOSI/MOESI

Last requestor becomes the owner

Avoid writeback (to memory) of dirty data

Also called shared-dirty state, since memory is stale

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 54

Page 55: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MOESI Protocol

Used in AMD Opteron

5-State Protocol

Modified: <1,0,0…0>

Exclusive: <1,0,0,…,1>

Shared: <1,X,X,…,1>

Invalid: <0,X,X,…X>

Owned: <1,X,X,X,0> ; only one owner, memory not up to date

Owner can supply data, so memory does not have to

Avoids lengthy memory access

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 55

Page 56: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MOESI ProtocolAction and Next State

Current StateProcessor

Read

Processor

WriteEviction Cache Read Cache Read&M

Cache

Upgrade

I

Cache Read

If no sharers:

→ E

If sharers:

→ S

Cache Read&M

→ M

No Action

→ I

No Action

→ I

No Action

→ I

SNo Action

→ S

Cache Upgrade

→ M

No Action

→ I

Respond shared;

→ S

No Action

→ I

No Action

→ I

ENo Action

→ E

No Action

→ M

No Action

→ I

Respond shared;

Supply data;

→ S

Respond

shared;

Supply data;

→ I

ONo Action

→ O

Cache Upgrade

→ M

Cache Write-

back

→ I

Respond shared;

Supply data;

→ O

Respond shared;

Supply data;

→ I

MNo Action

→ M

No Action

→ M

Cache Write-

back

→ I

Respond shared;

Supply data;

→ O

Respond shared;

Supply data;

→ I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 56

Page 57: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MOESI Example

Thread Event Bus Action Data From Global State local states

C0 C1 C2

0. Initially: <0,0,0,1> I I I

1. T0 read→ CR Memory <1,0,0,1> E I I

2. T0 write→ none <1,0,0,0> M I I

3. T2 read→ CR C0 <1,0,1,0> O I S

4. T1 write→ CRM C0 <0,1,0,0> I M I

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 57

Page 58: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

MOESI Coherence Protocol

A protocol that tracks validity, ownership, and exclusiveness Modified: dirty and private

Owned: dirty but shared

Avoid writeback to memory on M->S transitions

Exclusive: clean but private

Avoid upgrade misses on private data

Shared

Invalid

There are also some variations (MOSI and MESI)

What happens when 2 cores read/write different words in a cache line?

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 58

Page 59: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Snooping with Multi-level Caches

Private L2 caches If inclusive, snooping traffic checked at the L2 level first

Only accesses that refer to data cached in L1 need to be forwarded

Saves bandwidth at the L1 cache

Shared L2 or L3 caches Can act as serialization points even if there is no bus

Track state of cache line and list of sharers (bit mask)

Essentially the shared cache acts like a coherence directory

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 59

Page 60: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Scaling Coherence Protocols

The problem

Too much broadcast traffic for snooping (probing)

Solution: probe filters

Maintain info of which address ranges that are definitely not shared or definitely shared

Allows filtering of snoop traffic

Solution: directory based coherence

A directory stores all coherence info (e.g., sharers)

Consult directory before sending coherence messages

Caching/filtering schemes to avoid latency of 3-hops

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 60

Page 61: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

The Memory Consistency Problem: ExampleP1 P2

/*Assume initial value of A and flag is 0*/

A = 1; while (flag == 0); /*spin idly*/

flag = 1; print A;

Intuitively, you expect to print A=1

But can you think of a case where you will print A=0?

Even if cache coherence is available

Coherence talks about accesses to a single location

Consistency is about ordering for accesses to difference locations

Alternatively

Coherence determines what value is returned by a read

Consistency determines when a write value becomes visible

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 61

Page 62: 18-600 Foundations of Computer Systemsece600/lectures/lecture17.pdf · 18-600 Foundations of Computer Systems Lecture 17: “Multicore Cache Coherence” ... OS supports concurrent

Lecture 18:“Program Performance Optimizations”

John P. Shen & Gregory KesdenNovember 1, 2017

18-600 Foundations of Computer Systems

➢ Required Reading Assignment:• Chapter 5 of CS:APP (3rd edition) by Randy Bryant & Dave O’Hallaron.

SE18-600

PLOSCA

10/25/2017 (© J.P. Shen) 18-600 Lecture #17 62