Acoherent Shared Memory

Derek R. Hower
Ph.D. Defense
July 16, 2012

Executive Summary

Coherent view: simple abstraction?
- Complex implementation
- Hides caches (bad?!)
- High overhead

Acoherent view (explicit CO/CI): simple abstraction
- Simple implementation
- Abstracts caches
- Low overhead

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Trends

We must change
- Energy matters: dark silicon, mobile, datacenter
  • < 50% of processor powered by 2024 [1]
- Complexity matters: lower barrier to entry for accelerators
- Area matters: new tech nodes are not cheaper [2]
  • Memory: may be difficult to turn off, e.g., S-NUCA

We can change
- Compatibility doesn't matter: vertical integration is the new black

[1] Esmaeilzadeh et al., ISCA 2011
[2] ExtremeTech, 2012

The Problem With Coherence

Wrong abstraction
- Optimized for fine-grained, share-everything
  • Programs aren't!
Makes SW isolation hard
- Hypothesis: SW will want control over data placement
Impedes HW specialization
- Does your multicore ASIC need a coherence controller? Coherent GPUs?
Efficiency problems
- Directories take space; broadcasts take energy
  • e.g., 14% of cache is dedicated to the directory on a 4-core die [1]

[1] Stackhouse et al., ISSCC 2008

Rethinking Coherence: Goals

Maintain programmer sanity
- Keep shared memory
- Minimal compatibility change
Expose hardware capabilities
- Let SW guide memory management -> semantics
Simple hardware
- Lower cost of entry for accelerators

Solution: Acoherent Shared Memory



ASM Model Basics

Replace black box with simple hierarchy
- Still flat, linear address space
SW gets private storage
- Manage with CVS-like checkout/checkin

[Diagram: two processors (P), each with checkout (CO) and checkin (CI) arrows]

Checkout/Checkin

Checkout: pull data into private storage
Checkin: publish local updates globally

[Diagram: two processors (P), each with checkout (CO) and checkin (CI) arrows]

Checkout/checkin are not synchronization primitives
- Closer to a FENCE

Granularity?
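
The checkout/checkin behavior above can be sketched in software. This is only a toy model under stated assumptions: "private storage" is a per-processor copy of one segment, stores stay private until checkin (the strong flavor of acoherence), and checkin publishes only bytes written since the last checkout. The names `asm_view`, `asm_attach`, `asm_checkout`, `asm_checkin`, `asm_store`, `asm_load`, and `SEG_SIZE` are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SEG_SIZE 16  /* hypothetical segment size in bytes */

/* One processor's private view of a shared segment (illustrative only). */
typedef struct {
    unsigned char *global;          /* shared backing store */
    unsigned char local[SEG_SIZE];  /* private storage */
    bool dirty[SEG_SIZE];           /* bytes written since the last checkout */
} asm_view;

/* Checkout: pull global data into private storage, dropping local updates. */
static void asm_checkout(asm_view *v) {
    memcpy(v->local, v->global, SEG_SIZE);
    memset(v->dirty, 0, sizeof v->dirty);
}

/* Checkin: publish locally written bytes to the global copy. */
static void asm_checkin(asm_view *v) {
    for (int i = 0; i < SEG_SIZE; i++) {
        if (v->dirty[i]) {
            v->global[i] = v->local[i];
            v->dirty[i] = false;
        }
    }
}

/* Stores and loads touch only the private copy. */
static void asm_store(asm_view *v, int addr, unsigned char val) {
    assert(addr >= 0 && addr < SEG_SIZE);
    v->local[addr] = val;
    v->dirty[addr] = true;
}

static unsigned char asm_load(const asm_view *v, int addr) {
    assert(addr >= 0 && addr < SEG_SIZE);
    return v->local[addr];
}

/* Bind a view to its backing store and perform the first checkout. */
static void asm_attach(asm_view *v, unsigned char *global) {
    v->global = global;
    asm_checkout(v);
}
```

In this model a store by one view becomes visible to another view only after the writer's checkin and the reader's next checkout, which is why neither operation acts as synchronization by itself.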

Segments

Compromise: memory segments
– Linear partition of address space
– CO/CI segments at a time

Observation: programs are already segmented (Stack, Code, BSS/Data, Heap)
- Can re-use layout
- Typical CO/CI granularity in existing C code

Segment Types

Not all memory wants/needs acoherence
- Segment types give different "views"
- Communicate semantic information to HW

Available types: Private, Coherent-RW, Coherent-RO, Acoherent, Device

Example layout:
  Segment     Type          View
  Stack       Private       Private
  Code        Coherent-RO   Shared, read-only
  BSS/Data    Acoherent     Shared
  Heap        Acoherent     Shared

Managing Finite Resources

Model so far is strong acoherence
- Likely requires prohibitive HW resources
Also weak acoherence and best-effort acoherence
- Still useful to software/hardware

Weak acoherence: data visible early (before checkin)
- Synchronized => not a problem
Best-effort acoherence: spontaneous checkouts at any time
• + SW notification
- All-or-nothing
- Hybrid runtimes => not a problem

Case Study: pthreads

    #include <pthread.h>
    #include <stdlib.h>

    pthread_barrier_t barrier;
    char* shared_data;

    void* worker(void* arg);

    int main(int argc, char* argv[]) {
      int i, j, k;
      pthread_t sib;
      shared_data = malloc(PROBLEM_SIZE);
      pthread_barrier_init(&barrier, NULL, 2);
      pthread_create(&sib, NULL, worker, (void*) 1);
      worker((void*) 0);
      pthread_join(sib, NULL);
      return 0;
    }

    void* worker(void* arg) {
      while (work remains) {
        <split work>
        <do work>
        pthread_barrier_wait(&barrier);   /* communication point */
      }
      return NULL;
    }

Task: Convert to ASM

Step 1: Assign Segments (automatic: runtime)
• Global, Heap in acoherent segment (shared_data)
• Stack in private segment (argc, argv, arg, i, j, k, sib)
• Synch. in coherent-RW segment (barrier)
• Text in coherent-RO segment

The application code works as is. The library changes:

    int pthread_barrier_init(…) {
      …
      _barrier = coherent_malloc(sizeof(int));
      …
    }

    int pthread_barrier_wait(…) {
      …
      checkin(heap, data);
      <barrier>
      checkout(heap, data);
      …
    }

Step 2: Checkout/Checkin (automatic: library)
• CI/CO Global, Heap at synchronization
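
The library-side step can be sketched as a wrapper that brackets an unmodified barrier with checkin and checkout. This is a hedged illustration, not the talk's library code: `checkin_stub` and `checkout_stub` stand in for the real checkin/checkout operations and only log that they ran, and `asm_barrier_wait`, `barrier_fn`, `noop_barrier`, and the trace log are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Events recorded by the sketch so the ordering can be inspected. */
enum event { EV_CHECKIN, EV_BARRIER, EV_CHECKOUT };

#define TRACE_MAX 8
static enum event trace[TRACE_MAX];
static size_t trace_len = 0;

static void record(enum event e) {
    assert(trace_len < TRACE_MAX);
    trace[trace_len++] = e;
}

/* Stand-ins for the checkin/checkout operations; they only log the call. */
static void checkin_stub(void)  { record(EV_CHECKIN); }
static void checkout_stub(void) { record(EV_CHECKOUT); }

/* Type of the underlying, unmodified barrier implementation. */
typedef int (*barrier_fn)(void *barrier);

/* Publish local updates, synchronize, then pull in everyone else's updates. */
static int asm_barrier_wait(barrier_fn wait, void *barrier) {
    checkin_stub();
    int rc = wait(barrier);
    checkout_stub();
    return rc;
}

/* Trivial single-participant barrier used only to exercise the wrapper. */
static int noop_barrier(void *barrier) {
    (void)barrier;
    record(EV_BARRIER);
    return 0;
}
```

Calling `asm_barrier_wait(noop_barrier, NULL)` records checkin, barrier, checkout in that order, matching the checkin / <barrier> / checkout sequence on the slide.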

Memory Consistency Model

Option 1: The Details (6 slides + really ugly equations)
Option 2: The Highlights (2 slides)

Memory Consistency Model

Defined in style of SPARC TSO/RMO
- Memory order: total order of memory ops
  • Restricted by consistency model
- Processor order: local dependencies
- Value of load: defined via memory + processor order

Weak Acoherence

1. Define memory order
   (a) Load -> Load to same address
   (b) Load -> Store to same address
   (c) Store -> Store to same address
   (d) Paired CI-CO act as distributed fence
   (e) CI/CO -> CI/CO
   Rules (a)-(c): same as TSO, etc.
   Rules (d)-(e): CI-CO pair => fence; total order of CO/CI

2. Define legal value of loads

[Equations for rules (a)-(e) and for the value of a load: not recoverable]

Strong Acoherence

1. Define memory order
   (a) Load -> Load to same address
   (b) Load -> Store to same address
   (c) Store -> Store to same address
   (d) Paired CI-CO act as distributed fence
   (e) CI/CO -> CI/CO
   (f) Store not visible until CI
   Stores can be clobbered

2. Define legal value of loads
   Normally: stores not visible until CI
   Can "lose" data

[Equations for rules (a)-(f) and for the value of a load: not recoverable]

Other Segment Types

Coherent
- Like weak, but:
  • Loads implicitly paired with (atomic) CO
  • Stores implicitly paired with (atomic) CI
- SC w.r.t. each other

Private
- Like weak

Analysis

CO/CI not atomic. Subtleties (initially, A = 0 in each case):

(a) Lazy checkout
    Thread 0         Thread 1
    00: CHECKOUT     10: A = 1
    01: R0 = A       11: CHECKIN
    Strong: R0 = 0 or 1
    Weak:   R0 = 0 or 1

(b) Isolation
    Thread 0         Thread 1
    02: CHECKOUT     12: A = 1
    03: R0 = A       13: CHECKIN
    04: R1 = A
    Strong: R0 = 0, R1 = 0
    Weak:   R0 = 0, R1 = 0 or 1

(c) Leaky stores
    Thread 0         Thread 1
    05: CHECKOUT     14: A = 1
    06: R0 = A
    Strong: R0 = 0
    Weak:   R0 = 0 or 1

ASM = SC for DRF

ASM = SC for lossless and properly paired executions
- Lossless: no clobbering checkouts
- Properly paired: all conflicting store -> load pairs are separated by CI/CO

[Formal definitions of "lossless" and "properly paired": not recoverable]

Proof sketch:
- LL+PP executions are defined by CO/CI order and program order only
- CO/CI order and program order are the same in ASM and SC

CO/CI Semantics

CO/CI like fence
- Lazy checkouts
- Non-atomic, non-blocking checkins
  • Updates can interleave

Lazy checkout (initially, A = 0):
    Thread 0         Thread 1
    00: CHECKOUT     10: A = 1
    01: R0 = A       11: CHECKIN
    Finally: R0 = 0 or 1

Interleaved checkins (initially, A = 0):
    Thread 0         Thread 1
    00: A = 1        10: A = 10
    01: B = 2        11: B = 20
    02: CHECKIN      12: CHECKIN
    Finally, any combo of: A = 1 or 10, B = 2 or 20
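
The "any combo" result for two interleaved checkins can be checked by brute force. The sketch below assumes each checkin is a set of independent per-location write-backs that may reach shared memory in any order; that reading of "non-atomic" is an assumption. It enumerates all orderings of the four write-backs. The names `writeback`, `explore`, `count_final_states`, and `has_outcome` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* One write-back of a dirty location performed as part of a checkin. */
typedef struct { int loc; int val; } writeback;   /* loc: 0 = A, 1 = B */

static const writeback events[4] = {
    {0, 1},  {1, 2},    /* Thread 0 checkin: A = 1,  B = 2  */
    {0, 10}, {1, 20},   /* Thread 1 checkin: A = 10, B = 20 */
};

#define MAX_OUTCOMES 16
static int outcomes[MAX_OUTCOMES][2];
static int n_outcomes = 0;

/* Remember a final (A, B) pair if it has not been seen before. */
static void add_outcome(int a, int b) {
    for (int i = 0; i < n_outcomes; i++)
        if (outcomes[i][0] == a && outcomes[i][1] == b) return;
    assert(n_outcomes < MAX_OUTCOMES);
    outcomes[n_outcomes][0] = a;
    outcomes[n_outcomes][1] = b;
    n_outcomes++;
}

/* Try every order in which the four write-backs can reach shared memory. */
static void explore(bool used[4], int depth, int mem[2]) {
    if (depth == 4) { add_outcome(mem[0], mem[1]); return; }
    for (int e = 0; e < 4; e++) {
        if (used[e]) continue;
        int saved = mem[events[e].loc];
        used[e] = true;
        mem[events[e].loc] = events[e].val;
        explore(used, depth + 1, mem);
        mem[events[e].loc] = saved;
        used[e] = false;
    }
}

/* Number of distinct final (A, B) states over all interleavings. */
static int count_final_states(void) {
    bool used[4] = {false, false, false, false};
    int mem[2] = {0, 0};
    n_outcomes = 0;
    explore(used, 0, mem);
    return n_outcomes;
}

static bool has_outcome(int a, int b) {
    for (int i = 0; i < n_outcomes; i++)
        if (outcomes[i][0] == a && outcomes[i][1] == b) return true;
    return false;
}
```

Under these assumptions the enumeration yields exactly the four final states listed on the slide, including the mixed ones where A comes from one thread and B from the other.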

Consistency Highlights

Coherent accesses have implicit CO/CI
CO/CI are totally ordered
- Transitivity hides non-atomicity
Sequentially consistent for data-race-free
- Lossless & Properly Paired

Example (implicit CO/CI of each coherent lock access shown in braces):
    Thread 0
      ST critical
      CI critical_segment
      STsync lock   { ST lock; CI lock_segment }
    Thread 1
      LL lock       { CO lock_segment; LD lock }
      SC lock       { ST lock; CI lock_segment }
      CO critical_segment
      LD critical



ASM-CMP Overview

Based on MIPS
- + special insns, e.g., checkout, checkin
Uses segments, no paging
• Maintains flat address space
Coherence protocol -> Acoherence Engine
- DMA for caches
  • Selectively move data

Baseline

[Diagram labels: Core, L1I, L1D, L2, Switch, Memory Controller (x4)]

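Segment Types

[Diagrams, one per segment type]
- Acoherent: two processors (P), each with an L1 and an acoherence engine (AE); CO/CI to a non-inclusive L2
- Coherent-RW: two processors (P) and the L2
- Private: two processors (P), each with an L1; exclusive L2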

Acoherence Engine

Three main responsibilities:
- Checkout: invalidate all segment data (lazy flash invalidate)
- Checkin: write back all dirty segment data (track write set)
- Order: detect CI-CO pairs (timestamp based)

Built on a decoupled metastate cache
FSM like coherence, but few races, no directory

Decoupled Metastate Cache

All L1 caches
- Decouple metastate from data
- Quick access to aggregate state
- Track V/D per segment

Checkout: XOR global/segment valid
Checkin: walk segment dirty state
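
To illustrate why per-segment metastate makes checkout cheap, here is a sketch of a lazy flash invalidate. It swaps in a per-segment epoch counter, a common technique that is not necessarily the XOR-based valid scheme named on the slide. A line counts as valid only if it was filled in the segment's current epoch, so invalidating a whole segment is a single counter increment. The names `metastate`, `line_meta`, `fill`, `is_valid`, and `checkout_segment`, and the sizes, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_LINES 8   /* hypothetical number of L1 lines */
#define N_SEGS  4   /* hypothetical number of segments */

/* Metastate kept apart from the data array. */
typedef struct {
    bool     present;  /* line has been filled at least once */
    uint8_t  seg;      /* segment the line belongs to */
    uint32_t epoch;    /* segment epoch at fill time */
} line_meta;

typedef struct {
    line_meta lines[N_LINES];
    uint32_t  seg_epoch[N_SEGS];  /* current epoch of each segment */
} metastate;

/* Record that a line now holds data from the given segment. */
static void fill(metastate *m, int line, uint8_t seg) {
    assert(line >= 0 && line < N_LINES && seg < N_SEGS);
    m->lines[line].present = true;
    m->lines[line].seg = seg;
    m->lines[line].epoch = m->seg_epoch[seg];
}

/* A line is valid only if it was filled in its segment's current epoch. */
static bool is_valid(const metastate *m, int line) {
    assert(line >= 0 && line < N_LINES);
    const line_meta *l = &m->lines[line];
    return l->present && l->epoch == m->seg_epoch[l->seg];
}

/* Checkout: invalidate every line of the segment in O(1).
   Counter wrap-around is ignored in this sketch. */
static void checkout_segment(metastate *m, uint8_t seg) {
    assert(seg < N_SEGS);
    m->seg_epoch[seg]++;
}
```

Stale lines are never touched at checkout time; they simply fail the validity check on their next access, which is the "lazy" part.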

Order

Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn't completed

Timestamp algorithm (per segment): two-phase CO/CI
1. Acquire timestamp
   1. Invalidate/Flush
2. Wait for previous CO/CI to complete

Implemented in firmware
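
The two-phase ordering idea can be sketched as ticket-style bookkeeping. This is an assumption-laden illustration, not the firmware from the talk: each CO/CI on a segment takes the next timestamp in phase one, and in phase two it may complete only once every earlier timestamp has completed. The names `seg_order`, `acquire_ts`, `may_complete`, and `try_complete` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-segment ordering state (illustrative only). */
typedef struct {
    uint32_t next_ts;    /* next timestamp to hand out */
    uint32_t completed;  /* every operation with ts < completed is done */
} seg_order;

/* Phase 1: take a timestamp; invalidate/flush work may start right away. */
static uint32_t acquire_ts(seg_order *s) {
    return s->next_ts++;
}

/* An operation may finish only when all earlier CO/CI have finished. */
static bool may_complete(const seg_order *s, uint32_t ts) {
    return s->completed == ts;
}

/* Phase 2: finish if allowed; otherwise the caller has to keep waiting. */
static bool try_complete(seg_order *s, uint32_t ts) {
    assert(ts < s->next_ts);  /* timestamp must have been handed out */
    if (!may_complete(s, ts)) return false;
    s->completed = ts + 1;
    return true;
}
```

Comparing timestamps answers "does this CI precede that CO?", and a failed `try_complete` for a CO models delaying the loads that follow it until the earlier CI is done.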

Multiple Writer Support

Keep per-byte dirty bitmask in L1s
- Allows multiple writers with false sharing
- 12.5% larger L1 cache
Bitmask accompanies data to L2
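
A short sketch of how a per-byte dirty bitmask lets two writers share a line without clobbering each other, and where the 12.5% figure comes from (one mask bit per eight data bits). The 64-byte line size and the names `l1_line`, `line_store`, `writeback`, and `mask_overhead_percent` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64  /* assumed line size; one mask bit per data byte */

typedef struct {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty;    /* bit i set => data[i] was written locally */
} l1_line;

/* A store updates the byte and marks it dirty. */
static void line_store(l1_line *l, int off, uint8_t v) {
    assert(off >= 0 && off < LINE_BYTES);
    l->data[off] = v;
    l->dirty |= (uint64_t)1 << off;
}

/* Checkin of one line: the mask travels with the data, and the L2 copy
   takes only the bytes this writer actually modified. */
static void writeback(uint8_t l2[LINE_BYTES], l1_line *l) {
    for (int i = 0; i < LINE_BYTES; i++) {
        if ((l->dirty >> i) & 1) {
            l2[i] = l->data[i];
        }
    }
    l->dirty = 0;
}

/* Data-array overhead of the mask: 1 bit per 8-bit byte. */
static double mask_overhead_percent(void) {
    return 100.0 * 1.0 / 8.0;
}
```

With whole-line write-backs, the second writer would overwrite the first writer's byte with stale data; merging by mask keeps both updates.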

Simple?

[Diagram labels: Directory / L2, L1, L1, REQ, RESP, FWD, "Source of Races / Complexity", L2]

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Methodology

Simulation-based
- Enhanced User Mode
Workloads:
- Class-1: SPLASH
- Class-2: Task-Q
Three memory modules
- ASM-CMP
- CC from gem5-Ruby
  • MESI (Inclusive)
  • MOESI (Non-inclusive)

Performance

[Bar chart: runtime normalized to MOESI for moesi, mesi, and asm. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean]

Comparable performance
Callouts: checkout too much; false sharing / migratory sharing

Perfect Checkout

[Bar chart: runtime normalized to ASM baseline for asm_base and asm_ideal. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean]

Energy

[Stacked bar chart: energy normalized to MOESI, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean]

Less energy (same performance)

Checkout Characteristics

[Histogram, Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water): % of checkouts vs. # blocks invalidated, in bins of 8 from 0-7 to 120-127]

Checkouts usually small; can be large (> 25% of L1)

[Bar chart: % of checkout invalidations elided. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_fixed, mean]

Most checkout invalidations affect dead blocks

Checkin Characteristics

[Histogram, Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water): % of checkins vs. # blocks invalidated, in bins of 8 from 0-7 to 120-127]

Checkin latency is hidden
Checkins usually small; can be large (> 25% of L1)

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions/Other Work

Conclusions

Going forward:
- HW designs must find efficiency
- SW will want to see caches/control placement
ASM: viable alternative to coherent shared memory
- Semantic cooperation between HW/SW
ASM-CMP: build components w/o coherence engine
- Make custom integration easier
Practically:
- Will the next x86 core use ASM? No
- Will a heterogeneous accelerator? Maybe

Related Work

- ASM Model
- ASM-CMP
- Alternatives/Detractors

Related Work – ASM Model

Relaxed consistency models
- Release Consistency (ISCA 1990)
  • Acquire/Release ≈ CO/CI
- DRF-0 (ISCA 1990), DRF-1 (PDS 1993)
  • SC for DRF
- Weak ordering (ISCA 1986)
Semantic segmentation
- Cohesion (ISCA 2011)
- Entry consistency (CMU-TR 1991)

Related Work – ASM-CMP

Rigel: IEEE Micro 2011
- Differentiates coherent/incoherent
Treadmarks: ISCA 1992
- Twinning and diffing

Related Work – Alternatives

Reduce directory overhead
- Cuckoo directory (HPCA 2011)
- Tagless directory (MICRO 2009, PACT 2011)
- Waypoint (PACT 2010)
- Region coherence (IEEE Micro 2006)
- SW controlled coherence (…)
Simplify coherence design
- Denovo (PACT 2011)
Coherence is here to stay
- CACM 2012

Future Work

- ASM Model
- ASM Implementations
- ASM Software

Future Work – ASM Model

1. Use CO/CI for synchronization
   • Return timestamp with CO/CI
   • Blocking CO
2. Only guarantee transitivity across coherent accesses
   • Would eliminate need for timestamps
3. Hierarchical ASM
   • Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
   • Acoherent/coherent components in same system

Future Work: ASM Implementation

ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
   • e.g., HW copy-on-write support on eviction
   • Use ASM as foundation for TM/Determinism/etc.
3. Low overhead byte-diffing
   • False sharing is rare/pattern reuse is common
4. More segment control
   • Non-contiguous
   • Remap-able

Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
   • Private/shared

Future Work – ASM Software

1. Message passing on ASM
   • More efficient than coherence (think: migratory)
2. Software speculation
   • Use working memory for isolation
3. Programming language integration
   • CO/CI first-class operations
   • Work already exists: Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS

Previous Work

Rerun: ISCA 2008 and CACM 2009
- Race recorder for deterministic replay
- vs. state of the art:
  • SAME logging performance, > 10x state reduction
Calvin: HPCA 2011
- Coherence for deterministic execution
  • i.e., zero-log-size deterministic replay
- Selective determinism to match program requirements
Hobbes: WoDet 2011
- Strong acoherence in SW runtime

Backup Slides

What I would do differently

Focus on more specific target system
Stop building new infrastructure!
- Why did I?
  • gem5 wasn't ready
  • Started more radical/not clear it would have helped
Step back more often
- Easy to get sucked in to details; they usually don't matter
- Functional specification of consistency -> yuck!

Case Study: Cilk

Work-stealing task queue
- Distributed design

ASM Segments Benefit

[Stacked bar chart: energy normalized to MOESI using segments, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean]

[Bar chart: runtime normalized to MOESI using segments, for moesi_tlb_0, moesi_tlb_32, moesi_tlb_64. Benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean]

History

[Timeline: 1980 CPU Era, 2000 Multicore Era, 2010 Dark Era]
- CPU Era: Moore of the same
- Multicore Era: everything is general purpose
- Dark Era: ??

Navigating the Darkness

Solution #1: Wait for CMOS replacement
- Don't hold your breath
Solution #2: Rethink everything
- Deep integration
- HW specialization/heterogeneity
- Efficiency
- Take compatibility off its pedestal

Coherence?

Rethinking Coherence: Why Now?

Dennard scaling is over; Moore's Law continues:
- Need efficient components/reduced waste
Heterogeneity/Specialization
- Different memory access patterns
- Multicore ASICs
Important workloads don't use it
Compatibility not a show stopper
- Mobile -> fast design cycles, controlled SW stacks
- Datacenter -> economy of scale in single location
Missing opportunities

Case Study 2: Software Speculation

    begin_speculation() {
      <copy state>
      <setup>
    }

    end_speculation() {
      if (success)
        <free copies>
      else
        abort_speculation();
    }

    abort_speculation() {
      <revert to copy>
      <cleanup>
    }

Task: Convert to ASM (use private storage)
- begin_speculation: checkout(…)
- abort_speculation: checkout(…)   Multiple checkouts: "forget" updates
- end_speculation: checkin(…)      Checkin: commit updates

SW can use memory in new ways

New Software Potential

Evaluate ability to write speculation software
Microbenchmark:
- Fill array with speculative data, then commit
- Vary size of array

[Chart: normalized runtime of ASM and MESI vs. # of blocks in isolation region (16, 64, 256, 1K, 16K, 32K, 64K, 128K)]

Using Weak Acoherence

    global array;          (weak acoherent)

    func producer(…)
      checkout(array);
      array[0] = x;
      array[1] = y;
      checkin(array);
      signal(consumer);
    end func

    func consumer(…)
      waitfor(producer);
      checkout(array);
      …
    end func

- Stores may be globally visible before checkin(array)
- Synchronized -> early visibility OK
- Synch hides early checkin

Using Best-Effort Acoherence

    begin_tx
      checkout(array)
      array[0] = x
      checkout(array)      <- Exception!
      array[1] = y
      checkin(array)
    end_tx

SW handles resource limitations

Simulator Design

Two goals
- Functionally evaluate ASM system
  • Programming model, kernel management
- Performance comparison to CMP
Enhanced User Mode simulator
- Emulate non-timing-critical components (e.g., disks)
- Simulate the rest (e.g., virtual memory)

Qualitative Data

Is ASM a reasonable model? YES
- Almost no changes to application software
  • Unsynchronized flags
  • Stack sharing
- Functioning kernel, same tricks
  • Heavier use of coherent segments

Three Questions

[Diagram: hardware layout (P, PC, LLC, DRAM) with three software views: coherent view, private view, and acoherent view (CO/CI)]

1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?

ASM-CMP Segments

Uses true memory segments
- e.g., all pointers are long (segment + offset)
- BUT, address space still appears flat!

Long pointer propagation
- Segment pointers propagate through datapath
- Add lp/sp + register sidecars
- Languages/SW remain segment-oblivious
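
A small C model of the long-pointer idea may help: pointer arithmetic touches only the offset, and the segment part rides along unchanged, which is what a register sidecar would do in hardware. This is a sketch under assumptions; the field widths, the two-dimensional `memory` array, and the names `long_ptr`, `lp_add`, `lp_load`, `lp_store`, and `lp_memcpy` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N_SEGS    4   /* hypothetical number of segments */
#define SEG_BYTES 32  /* hypothetical segment size */

/* A long pointer: segment plus offset. */
typedef struct {
    uint16_t seg;
    uint32_t off;
} long_ptr;

/* Toy memory, indexed by segment and then offset. */
static uint8_t memory[N_SEGS][SEG_BYTES];

/* Arithmetic changes only the offset; the segment propagates unchanged. */
static long_ptr lp_add(long_ptr p, uint32_t delta) {
    p.off += delta;
    return p;
}

static uint8_t lp_load(long_ptr p) {
    assert(p.seg < N_SEGS && p.off < SEG_BYTES);
    return memory[p.seg][p.off];
}

static void lp_store(long_ptr p, uint8_t v) {
    assert(p.seg < N_SEGS && p.off < SEG_BYTES);
    memory[p.seg][p.off] = v;
}

/* Byte copy written without any mention of segments in the loop body,
   mirroring how the software stays segment-oblivious. */
static void lp_memcpy(long_ptr dst, long_ptr src, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        lp_store(dst, lp_load(src));
        dst = lp_add(dst, 1);
        src = lp_add(src, 1);
    }
}
```

The copy loop has the same shape as an ordinary byte-wise memcpy; only the pointer representation knows about segments.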

ASM-CMP Segments

Segment pointers propagate with the datapath

memcpy(dst, src, len);

          lp   $t0, 0(dst)
          lp   $t1, 0(src)
          mov  $t2, $a2        ; cnt <- len
    loop: beqz $t2, exit
          lb   $t3, 0($t1)     ; ld src
          sb   $t3, 0($t0)     ; st dst
          addi $t0, $t0, 1     ; inc. dst
          addi $t1, $t1, 1     ; inc. src
          subi $t2, $t2, 1     ; dec. cnt
          b    loop
    exit:

[Animation: the dst and src pointers in memory each hold a segment pointer and an offset; lp loads them into the register file ($t0, $t1) as offset plus segment; $t2 holds len and $t3 holds data; the ALU computes Offset + 1 while the segment is carried along]

Pointers are long
Segment propagates src -> dst

The Problem

[Diagram labels: Hardware Layout (P, PC, LLC, DRAM); Software View (Coherent Shared Memory)]

Hardware policy: software can't change!

All Data Are Created Equal?

    Location := 1;

[Diagram labels: P, PC, LLC, DRAM, with "?" at each level]
Assume: CMP MESI protocol, inclusive LLC

Missed Opportunities

    begin_tx
      cpLocation := Location;
      Location := 1;
    end_tx

SW makes redundant copy

All Data Are NOT Created Equal

    func foo()
      var Location;
      Location := 1;

Private data: wasting space

ASM-1 Hardware

[Diagram labels: P, AE, 32KB L1, 256KB L2, 8MB L3, Bitmask (x2), Per-line]

Baseline

[Diagram: 16 processors (P0-P15), each with an L1 and an L2; eight L3 banks (L3_0-L3_7)]

- In-order, single thread
- Ring interconnect

Storage Overhead

[Line chart: storage overhead (0%-100%) vs. # cores (2 to 1024) for ASM-1, MESI-1 Level, MESI-2 Level, MESI-3 Level]

Callouts: no indirection; more indirection -> longer latency

Rethinking Coherence: Why Now?

Dennard scaling is over; Moore's Law continues
- Need scalable, energy efficient components
Accelerators are here
- How should they see memory?
Shared-little workloads in important markets

All Data Are NOT Created Equal

    func CUDAKernel(…)
      …
      Location := 1;

[Diagram labels: GPU, P, PC, LLC, DRAM]

Not clear accelerators want/need coherence
