Acoherent Shared Memory
Derek R. Hower
Ph.D. Defense
July 16, 2012
2
Executive Summary
[Diagram: coherent view next to acoherent view; labels: P, GPU, L1, L2, CO, CI]
Simple abstraction?
Simple abstraction
- Simple implementation
- Abstracts caches
- Low overhead
- Complex implementation
- Hides caches (bad?!)
- High overhead
3
Outline
- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work
4
Trends
We must change
- Energy matters: dark silicon, mobile, datacenter
  • < 50% of processor powered by 2024 [1]
- Complexity matters
  • Lower barrier to entry for accelerators
- Area matters
  • New tech nodes are not cheaper [2]
  • Memory: may be difficult to turn off, e.g., S-NUCA
We can change
- Compatibility doesn’t matter
  • Vertical integration is the new black
[1] Esmaeilzadeh et al., ISCA 2011
[2] ExtremeTech 2012
5
The Problem With Coherence
- Wrong abstraction
  • Optimized for fine-grained, share-everything
  • Programs aren’t!
- Makes SW isolation hard
  • Hypothesis: SW will want control over data placement
- Impedes HW specialization
  • Does your multicore ASIC need a coherence controller? Coherent GPUs?
- Efficiency problems
  • Directories take space/broadcasts take energy
  • e.g., 14% of cache is dedicated to the directory on a 4-core die [1]
[1] Stackhouse et al., ISSCC 2008
6
Rethinking Coherence: Goals
- Maintain programmer sanity
  • Keep shared memory
  • Minimal compatibility change
- Expose hardware capabilities
  • Let SW guide memory management -> semantics
- Simple hardware
  • Lower cost of entry for accelerators
Solution: Acoherent Shared Memory
7
Outline
- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work
8
ASM Model Basics
- Replace black box with simple hierarchy
  • Still flat, linear address space
- SW gets private storage
  • Manage with CVS-like checkout/checkin
[Diagram: two processors (P), each using checkout (CO) and checkin (CI)]
9
Checkout/Checkin
- Checkout: pull data into private storage
- Checkin: publish local updates globally
[Diagram: two processors (P), each using checkout (CO) and checkin (CI)]
Checkout/Checkin are not synchronization primitives
- Closer to a FENCE
Granularity?
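A minimal sketch of the checkout/checkin idea in plain C, assuming one small segment, a software "private view" per processor, and per-byte dirty tracking. The names `private_view_t`, `store`, `load`, and `SEG_BYTES` are hypothetical and only illustrate the model; this is not the ASM-CMP hardware mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SEG_BYTES 16  /* hypothetical segment size, for illustration only */

/* The globally visible copy of one segment. */
unsigned char global_seg[SEG_BYTES];

/* One processor's private working copy of that segment. */
typedef struct {
    unsigned char data[SEG_BYTES];
    bool dirty[SEG_BYTES];  /* true => byte was written since the last CO/CI */
} private_view_t;

/* Checkout: pull the global data into private storage.
 * Local updates that were never checked in are forgotten. */
void checkout(private_view_t *v) {
    memcpy(v->data, global_seg, SEG_BYTES);
    memset(v->dirty, 0, sizeof v->dirty);
}

/* Stores and loads touch only the private copy. */
void store(private_view_t *v, int i, unsigned char x) {
    v->data[i] = x;
    v->dirty[i] = true;
}

unsigned char load(const private_view_t *v, int i) {
    return v->data[i];
}

/* Checkin: publish local updates (dirty bytes only) globally. */
void checkin(private_view_t *v) {
    for (int i = 0; i < SEG_BYTES; i++) {
        if (v->dirty[i]) {
            global_seg[i] = v->data[i];
            v->dirty[i] = false;
        }
    }
}
```

In this sketch a store by one view becomes visible to another view only after the writer checks in and the reader checks out again, which matches the fence-like (not synchronizing) role described above.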
10
Segments
[Diagram: address space layout with Stack, Code, BSS, Data, and Heap]
Compromise: Memory Segments
– Linear partition of address space
– CO/CI segments at a time
Observation: Programs are already segmented
– Can re-use layout
– Typical CO/CI granularity in existing C code
11
Segment Types
[Diagram: address space with a type per segment: Stack = Private; Code = Coherent-RO (shared, read-only); BSS, Data, and Heap = Acoherent (shared)]
- Not all memory wants/needs acoherence
- Segment types give different “views”
- Communicate semantic information to HW
Available Types
• Private
• Coherent-RW
• Coherent-RO
• Acoherent
• Device
12
Managing Finite Resources
- Model so far is strong acoherence
  • Likely requires prohibitive HW resources
- Also weak acoherence and best-effort acoherence
  • Still useful to software/hardware
- Weak acoherence: data visible early (before checkin)
  • Synchronized => not a problem
- Best-effort acoherence: spontaneous checkouts at any time
  • + SW notification
  • All-or-nothing
  • Hybrid runtimes => not a problem
13
Case Study: pthreads

#include <pthread.h>
#include <stdlib.h>

pthread_barrier_t barrier;
char* shared_data;

void* worker(void* arg);

int main(int argc, char* argv[]) {
    int i, j, k;
    pthread_t sib;
    shared_data = malloc(PROBLEM_SIZE);
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&sib, NULL, worker, (void*) 1);
    worker((void*) 0);
    pthread_join(sib, NULL);
    return 0;
}

void* worker(void* arg) {
    while (work remains) {
        <split work>
        <do work>
        pthread_barrier_wait(&barrier);  /* communication point */
    }
    return NULL;
}

Task: Convert to ASM

Step 1: Assign Segments (automatic: runtime)
• Global, Heap in acoherent segment (shared_data)
• Stack in private segment (argc, argv, arg, i, j, k, sib)
• Synch. in coherent-RW segment (barrier)
• Text in coherent-RO segment

Step 2: Checkout/Checkin (automatic: library)
• CI/CO Global, Heap at synchronization

Application code works as is; the library changes:

int pthread_barrier_init(…) {
    …
    _barrier = coherent_malloc(sizeof(int));
    …
}

int pthread_barrier_wait(…) {
    …
    checkin(heap, data);
    <barrier>
    checkout(heap, data);
    …
}
14
Memory Consistency Model
Option 1: The Details (6 slides + really ugly equations)
Option 2: The Highlights (2 slides)
15
Memory Consistency Model
- Defined in style of SPARC TSO/RMO
- Memory Order: total order of memory ops
  • Restricted by consistency model
- Processor Order: local dependencies
- Value of load: defined via memory + processor order
16
Weak Acoherence
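Notation: $L_i^S(a)$ and $S_i^S(a)$ are a load and a store by processor $i$ to address $a$ in segment $S$; $CX$ is a checkout ($CO$) or checkin ($CI$); $<_p$ is processor order and $<_m$ is memory order.

1. Define Memory Order (same as TSO, etc.; CI-CO pair => fence; total order of CO/CI)
(a) Load -> Load to same address: $L_i^S(a) <_p L_i^S(a)' \Rightarrow L_i^S(a) <_m L_i^S(a)'$
(b) Load -> Store to same address: $L_i^S(a) <_p S_i^S(a) \Rightarrow L_i^S(a) <_m S_i^S(a)$
(c) Store -> Store to same address: $S_i^S(a) <_p S_i^S(a)' \Rightarrow S_i^S(a) <_m S_i^S(a)'$
(d) Paired CI-CO act as distributed fence: $S_i^S(a) <_p CI_i^S <_m CO_j^S <_p L_j^S(a) \Rightarrow S_i^S(a) <_m L_j^S(a)$
(e) CI/CO -> CI/CO: $CX <_p CX' \Rightarrow CX <_m CX'$

2. Define legal value of loads
$value(L_i^S(a)) = value\left(\max_{<_m}\{\, S^S(a) \mid S^S(a) <_m L_i^S(a) \ \text{or}\ S_i^S(a) <_p L_i^S(a) \,\}\right)$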
17
Strong Acoherence
1. Define Memory Order
(a) Load -> Load to same address: $L_i^S(a) <_p L_i^S(a)' \Rightarrow L_i^S(a) <_m L_i^S(a)'$
(b) Load -> Store to same address: $L_i^S(a) <_p S_i^S(a) \Rightarrow L_i^S(a) <_m S_i^S(a)$
(c) Store -> Store to same address: $S_i^S(a) <_p S_i^S(a)' \Rightarrow S_i^S(a) <_m S_i^S(a)'$
(d) Paired CI-CO act as distributed fence: $S_i^S(a) <_p CI_i^S <_m CO_j^S <_p L_j^S(a) \Rightarrow S_i^S(a) <_m L_j^S(a)$
(e) CI/CO -> CI/CO: $CX <_p CX' \Rightarrow CX <_m CX'$
(f) Store not visible until CI: $S_i^S(a) <_p CI_i^S \Rightarrow CI_i^S <_m S_i^S(a)$
Stores can be clobbered (can “lose” data): [rule defined with next(...) and max(...) over CO/CI; equation not recoverable from the extracted slide]

2. Define legal value of loads
[Equation not recoverable from the extracted slide: the value of $L_i^S(a)$ is taken from the latest store $S_i^S(a)$ since the most recent $CO$ or, if that does not exist, from the latest store visible at that $CO$]

Normally: stores not visible until CI
Can “lose” data
18
Other Segment Types
- Coherent: like weak, but:
  • Loads implicitly paired with (atomic) CO
  • Stores implicitly paired with (atomic) CI
  • SC w.r.t. each other
- Private: like weak
19
Analysis
- CO/CI not atomic
- Subtleties:

(a) Lazy checkout. Initially, A = 0
Thread 0            Thread 1
00: CHECKOUT        10: A = 1
01: R0 = A          11: CHECKIN
Strong: R0 = 0 or 1
Weak: R0 = 0 or 1

(b) Isolation. Initially, A = 0
Thread 0            Thread 1
02: CHECKOUT        12: A = 1
03: R0 = A          13: CHECKIN
04: R1 = A
Strong: R0 = 0, R1 = 0
Weak: R0 = 0, R1 = 0 or 1

(c) Leaky stores. Initially, A = 0
Thread 0            Thread 1
05: CHECKOUT        14: A = 1
06: R0 = A
Strong: R0 = 0
Weak: R0 = 0 or 1
20
ASM = SC for DRF
ASM = SC for lossless and properly paired executions
- Lossless: no clobbering checkouts, i.e.,
  $\forall S_i^S(a):\ \text{if } \exists\, CO_i^S : S_i^S(a) <_p CO_i^S \ \text{then}\ \exists\, CI_i^S : S_i^S(a) <_p CI_i^S <_p CO_i^S$
- Properly Paired: all conflicting store->load pairs separated by CI/CO, i.e.,
  $\forall L_j^S(a), S_i^S(a) : value(L_j^S(a)) = value(S_i^S(a)),\ i \neq j \ \Rightarrow\ \exists\, CO_j^S, CI_i^S : S_i^S(a) <_p CI_i^S <_m CO_j^S <_p L_j^S(a)$
- Proof sketch:
  • LL+PP executions are defined by CO/CI order and program order only
  • CO/CI order and program order are the same in ASM and SC
21
CO/CI Semantics
- CO/CI like fence
- Lazy checkouts
- Non-atomic, non-blocking checkins
  • Updates can interleave

Lazy checkout. Initially, A = 0
Thread 0            Thread 1
00: CHECKOUT        10: A = 1
01: R0 = A          11: CHECKIN
Finally: R0 = 0 or 1

Interleaved checkins. Initially, A = 0
Thread 0            Thread 1
00: A = 1           10: A = 10
01: B = 2           11: B = 20
02: CHECKIN         12: CHECKIN
Finally, any combo of: A = 1 or 10, B = 2 or 20
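The interleaved-checkin outcome can be checked by brute force. The sketch below assumes each checkin decomposes into independent per-location writebacks that may reach memory in any order (an assumption made for illustration; the slide only says updates can interleave), and counts the distinct final (A, B) states.

```c
#include <assert.h>
#include <stdbool.h>

/* One per-location writeback performed as part of a checkin. */
typedef struct {
    int thread;  /* which thread's checkin produced it */
    char var;    /* 'A' or 'B' */
    int value;
} writeback_t;

/* Thread 0 checks in A = 1, B = 2; thread 1 checks in A = 10, B = 20. */
static const writeback_t WB[4] = {
    {0, 'A', 1}, {0, 'B', 2}, {1, 'A', 10}, {1, 'B', 20}
};

/* Try every order of the four writebacks and count distinct final states. */
int count_final_states(void) {
    bool seen[2][2] = {{false, false}, {false, false}};
    for (int a = 0; a < 4; a++)
    for (int b = 0; b < 4; b++)
    for (int c = 0; c < 4; c++)
    for (int d = 0; d < 4; d++) {
        /* Skip index tuples that are not permutations. */
        if (a == b || a == c || a == d || b == c || b == d || c == d)
            continue;
        int order[4] = {a, b, c, d};
        int A = 0, B = 0;
        for (int k = 0; k < 4; k++) {
            const writeback_t *w = &WB[order[k]];
            if (w->var == 'A') A = w->value; else B = w->value;
        }
        /* Index by which thread's value survived for each location. */
        seen[A == 10][B == 20] = true;
    }
    return seen[0][0] + seen[0][1] + seen[1][0] + seen[1][1];
}
```

Under this assumption all four combinations are reachable, because the last writer to A and the last writer to B are chosen independently.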
22
Consistency Highlights
- Coherent accesses have implicit CO/CI
- CO/CI are totally ordered
  • Transitivity hides non-atomicity
- Sequentially consistent for data-race-free
  • Lossless & Properly Paired
[Diagram: lock handoff between Thread 0 and Thread 1. Operations shown: ST critical, ST lock, CI lock_segment, CO lock_segment, LD lock, ST lock, CI lock_segment, CO critical_segment, CI critical_segment, LD critical; coherent equivalents shown: STsync lock, LL lock, SC lock]
23
Outline
- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work
24
ASM-CMP Overview
- Based on MIPS
  • + special insns, e.g., checkout, checkin
- Uses segments, no paging
  • Maintains flat address space
- Coherence protocol -> Acoherence Engine
  • DMA for caches
  • Selectively move data
25
Baseline
[Diagram: baseline tile with Core, L1I, L1D, L2, and Switch, surrounded by four memory controllers]
26
Segment Types
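[Diagram: cache organization per segment type. Acoherent: processors (P) with L1s and acoherence engines (AE) using CO/CI against a non-inclusive L2. Coherent-RW: processors (P) over the L2. Private: processors (P) with L1s and an exclusive L2]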
27
Acoherence Engine
Three main responsibilities:
- Checkout: invalidate all segment data
  • Lazy flash invalidate
- Checkin: write back all dirty segment data
  • Track write set
- Order: detect CI-CO pairs
  • Timestamp based
FSM like coherence, but few races, no directory
Decoupled Metastate Cache
28
Decoupled Metastate Cache
- In all L1 caches
- Decouple metastate from data
  • Quick access to aggregate state
- Track V/D per-segment
- Checkout: XOR global/segment valid
- Checkin: walk segment dirty state
29
Order
Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn’t completed
Timestamp algorithm (per segment): two-phase CO/CI
1. Acquire timestamp; invalidate/flush
2. Wait for previous CO/CI to complete
Implemented in firmware
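A rough sketch of the two-phase idea, assuming a single per-segment counter pair that hands out timestamps and retires CO/CI operations strictly in timestamp order. The type `segment_order_t` and the function names are hypothetical; the actual firmware algorithm is not shown on the slide and is likely distributed rather than centralized like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-segment ordering state (hypothetical layout). */
typedef struct {
    unsigned next_ts;    /* next timestamp to hand out */
    unsigned completed;  /* number of CO/CI operations fully completed */
} segment_order_t;

/* Phase 1: acquire a timestamp. The invalidate (CO) or flush (CI)
 * would be started here. */
unsigned co_ci_begin(segment_order_t *s) {
    return s->next_ts++;
}

/* Phase 2: an operation may complete only after every operation with an
 * earlier timestamp has completed. Returns false if it must keep waiting. */
bool co_ci_try_complete(segment_order_t *s, unsigned ts) {
    if (s->completed != ts)
        return false;
    s->completed++;
    return true;
}

/* A load that follows a CO is delayed until that CO, and therefore every
 * earlier CI on the segment, has completed. */
bool load_may_proceed(const segment_order_t *s, unsigned co_ts) {
    return s->completed > co_ts;
}
```

Because timestamps are totally ordered per segment, a CI that acquired an earlier timestamp than a CO is treated as preceding it, which is the pairing the Order slide asks the hardware to detect.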
30
Multiple Writer Support
- Keep per-byte dirty bitmask in L1s
  • Allows multiple writers with false sharing
  • 12.5% larger L1 cache
- Bitmask accompanies data to L2
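A small sketch of how a per-byte dirty bitmask lets two writers share a line without clobbering each other, and where the 12.5% figure comes from (one mask bit per eight data bits). The 64-byte line size, `l1_line_t`, and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64  /* assumed line size; one mask bit per byte fits in 64 bits */

/* An L1 line with its per-byte dirty bitmask. */
typedef struct {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty;  /* bit i set => byte i was written locally */
} l1_line_t;

/* A local store marks exactly the byte it touches. */
void l1_store_byte(l1_line_t *line, unsigned off, uint8_t v) {
    line->data[off] = v;
    line->dirty |= (uint64_t)1 << off;
}

/* On checkin the mask travels with the data: only dirty bytes are merged
 * into the L2 copy, so clean bytes cannot overwrite another writer's update. */
void l2_merge(uint8_t l2[LINE_BYTES], l1_line_t *line) {
    for (unsigned i = 0; i < LINE_BYTES; i++) {
        if (line->dirty & ((uint64_t)1 << i))
            l2[i] = line->data[i];
    }
    line->dirty = 0;
}

/* Mask bits as a percentage of data bits: 64 / 512 = 12.5%. */
double bitmask_overhead_percent(void) {
    return 100.0 * LINE_BYTES / (LINE_BYTES * 8);
}
```

With two L1s falsely sharing a line (one writes byte 0, the other byte 1), merging both lines into L2 in either order preserves both updates.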
31
Simple?
[Diagram: two L1s exchanging REQ, FWD, and RESP messages with a Directory/L2, labeled "Source of Races / Complexity", next to a plain L2]
32
Outline
- Motivation and Goals
- ASM Model
- ASM-1 Prototype
- Evaluation and Results
- Conclusions and Future Work
33
Methodology
- Simulation-based
  • Enhanced-User Mode
- Workloads:
  • Class-1: SPLASH
  • Class-2: Task-Q
- Three memory modules
  • ASM-CMP
  • CC from gem5-Ruby: MESI (Inclusive), MOESI (Non-inclusive)
34
Performance
[Figure: runtime normalized to MOESI for moesi, mesi, and asm; workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean; y-axis 0 to 1.4]
- Comparable performance
- Annotations: checkout too much; false sharing/migratory sharing
35
Perfect Checkout
[Figure: runtime normalized to ASM baseline for asm_base and asm_ideal; workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean; y-axis 0 to 1.2]
36
Energy
[Figure: energy normalized to MOESI, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb; workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean; y-axis 0 to 1.2]
Less energy (same performance)
37
Checkout Characteristics
[Figure: % of checkouts (0% to 80%) vs. # blocks invalidated (bins 0-7 through 120-127) for Class-1 workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water]
- Checkouts usually small; can be large (> 25% of L1)
[Figure: % of checkout invalidations elided per workload: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_fixed, mean; y-axis 0 to 1.2]
- Most checkout invalidations affect dead blocks
38
Checkin Characteristics
[Figure: % of checkins (0% to 40%) vs. # blocks invalidated (bins 0-7 through 120-127) for Class-1 workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water]
- Checkin latency is hidden
- Checkins usually small; can be large (> 25% of L1)
39
Outline
- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions/Other Work
40
Conclusions
- Going forward:
  • HW designs must find efficiency
  • SW will want to see caches/control placement
- ASM: viable alternative to coherent shared memory
  • Semantic cooperation between HW/SW
- ASM-CMP: build components w/o coherence engine
  • Make custom integration easier
- Practically:
  • Will the next x86 core use ASM? No
  • Will a heterogeneous accelerator? Maybe
41
Related Work
ASM Model
ASM-CMP
Alternatives/Detractors
42
Related Work – ASM Model
- Relaxed consistency models
  • Release Consistency (ISCA 1990): Acquire/Release ≈ CO/CI
  • DRF-0 (ISCA 1990), DRF-1 (PDS 1993): SC for DRF
  • Weak ordering (ISCA 1998)
- Semantic segmentation
  • Cohesion (ISCA 2011)
  • Entry consistency (CMU-TR 1991)
43
Related Work – ASM-CMP
- Rigel: IEEE Micro 2011
  • Differentiates coherent/incoherent
- Treadmarks: ISCA 1992
  • Twinning and diffing
44
Related Work - Alternatives
- Reduce directory overhead
  • Cuckoo directory (HPCA 2011)
  • Tagless directory (MICRO 2009, PACT 2011)
  • Waypoint (PACT 2010)
  • Region coherence (IEEE Micro 2006)
  • SW controlled coherence (…)
- Simplify coherence design
  • Denovo (PACT 2011)
- Coherence is here to stay
  • CACM 2012
45
Future Work
ASM Model
ASM Implementations
ASM Software
46
Future Work – ASM Model
1. Use CO/CI for synchronization
   • Return timestamp with CO/CI
   • Blocking CO
2. Only guarantee transitivity across coherent accesses
   • Would eliminate need for timestamps
3. Hierarchical ASM
   • Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
   • Acoherent/coherent components in same system
47
Future Work: ASM Implementation
ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
   • e.g., HW copy-on-write support on eviction
   • Use ASM as foundation for TM/Determinism/etc.
3. Low overhead byte-diffing
   • False sharing is rare/pattern reuse is common
4. More segment control
   • Non-contiguous
   • Remap-able
Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
   • Private/shared
48
Future Work – ASM Software
1. Message passing on ASM
   • More efficient than coherence (think: migratory)
2. Software speculation
   • Use working memory for isolation
3. Programming language integration
   • CO/CI first-class operations
   • Work already exists: Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS
49
Previous Work
- Rerun: ISCA 2008 and CACM 2009
  • Race recorder for deterministic replay
  • vs. state of the art: SAME logging performance, > 10x state reduction
- Calvin: HPCA 2011
  • Coherence for deterministic execution, i.e., zero-log-size deterministic replay
  • Selective determinism to match program requirements
- Hobbes: WoDet 2011
  • Strong acoherence in SW runtime
Phew!
Backup Slides
52
What I would do differently
- Focus on more specific target system
- Stop building new infrastructure!
  • Why did I? gem5 wasn’t ready; started more radical/not clear it would have helped
- Step back more often
  • Easy to get sucked into details, which usually don’t matter
- Functional specification of consistency -> yuck!
53
Case Study: Cilk
- Work-stealing task queue
- Distributed design
54
ASM Segments Benefit
[Figure: energy normalized to MOESI using segments, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb; workloads: barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, gmean; y-axis 0 to 2]
[Figure: runtime normalized to MOESI using segments for moesi_tlb_0, moesi_tlb_32, and moesi_tlb_64 across the same workloads; y-axis 0 to 2]
55
History
[Timeline: 1980 to 2000: CPU Era ("Moore of the same"); 2000 to 2010: Multicore Era ("Everything is general purpose"); 2010 onward: Dark Era ("??")]
56
Navigating the Darkness
- Solution #1: Wait for CMOS replacement
  • Don’t hold your breath
- Solution #2: Rethink everything
  • Deep integration
  • HW specialization/heterogeneity
  • Efficiency
  • Take compatibility off its pedestal
Coherence?
57
Rethinking Coherence: Why Now?
- Dennard Scaling is over; Moore’s Law continues:
  • Need efficient components/reduced waste
- Heterogeneity/Specialization
  • Different memory access patterns
  • Multicore ASICs
- Important workloads don’t use it
- Compatibility not a show stopper
  • Mobile -> fast design cycles, controlled SW stacks
  • Datacenter -> economy of scale in single location
- Missing opportunities
58
Case Study 2: Software Speculation

begin_speculation() {
    <copy state>
    <setup>
}

end_speculation() {
    if (success)
        <free copies>
    else
        abort_speculation();
}

abort_speculation() {
    <revert to copy>
    <cleanup>
}

Task: Convert to ASM
- begin_speculation(): checkout(…) replaces <copy state>; use private storage
- end_speculation(): checkin(…) replaces <free copies>; checkin: commit updates
- abort_speculation(): checkout(…) replaces <revert to copy>; multiple checkouts: “forget” updates
SW can use memory in new ways
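A single-threaded sketch of the converted speculation routines, assuming strong acoherence (stores stay private until checkin) and modeling the region as two arrays. `committed`, `working`, `checkout_region`, and `checkin_region` are hypothetical stand-ins for the acoherent segment and the CO/CI instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define REGION_BYTES 8  /* hypothetical size of the isolation region */

unsigned char committed[REGION_BYTES];  /* globally visible state */
unsigned char working[REGION_BYTES];    /* private storage used while speculating */

/* Stand-ins for CO/CI on the segment that holds the region. */
void checkout_region(void) { memcpy(working, committed, REGION_BYTES); }
void checkin_region(void)  { memcpy(committed, working, REGION_BYTES); }

/* Begin: a checkout replaces the explicit <copy state> step. */
void begin_speculation(void) { checkout_region(); }

/* Abort: a second checkout "forgets" the uncommitted updates. */
void abort_speculation(void) { checkout_region(); }

/* End: a checkin commits the updates; failure falls back to abort. */
void end_speculation(bool success) {
    if (success)
        checkin_region();
    else
        abort_speculation();
}
```

Speculative writes go to `working`; nothing reaches `committed` unless `end_speculation(true)` runs, so no separate software copy of the state is needed.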
59
New Software Potential
- Evaluate ability to write speculation software
- Microbenchmark: fill array with speculative data, then commit
  • Vary size of array
[Figure: normalized runtime of ASM and MESI vs. # of blocks in isolation region (16, 64, 256, 1K, 16K, 32K, 64K, 128K); y-axis 0 to 1.2]
60
Using Weak Acoherence
global array;    (weak acoherent)

func producer(…)
    checkout(array);
    array[0] = x;
    array[1] = y;
    checkin(array);
    signal(consumer);
end func

func consumer(…)
    waitfor(producer);
    checkout(array);
    …
end func

- Weak acoherence may act like an early checkin(array): data becomes globally visible before the explicit checkin
- Synch hides early checkin
- Synchronized -> early visibility OK
61
Using Best-Effort Acoherence
begin_tx
    checkout(array)
    array[0] = x
    checkout(array)      (spontaneous checkout: Exception!)
    array[1] = y
    checkin(array)
end_tx

SW handles resource limitations
62
Simulator Design
- Two goals
  • Functionally evaluate ASM system: programming model, kernel management
  • Performance comparison to CMP
- Enhanced User Mode simulator
  • Emulate non-timing critical components (e.g., disks)
  • Simulate the rest (e.g., virtual memory)
63
Qualitative Data
Is ASM a reasonable model? YES
- Almost no changes to application software
  • Unsynchronized flags
  • Stack sharing
- Functioning kernel, same tricks
  • Heavier use of coherent segments
64
Three Questions
[Diagram: hardware layout (processors P with caches C, LLC, DRAM) alongside three software views: coherent view, private view, and acoherent view (processors with CO/CI)]
1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?
65
ASM-CMP Segments
- Uses true memory segments
  • e.g., all pointers are long (segment + offset)
  • BUT, address space still appears flat!
- Long Pointer Propagation
  • Segment pointers propagate through datapath
  • Add lp/sp + register sidecars
  • Languages/SW remain segment-oblivious
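A software sketch of the long-pointer idea, assuming a pointer is a (segment, offset) pair and that pointer arithmetic changes only the offset while the segment part rides along, like a register sidecar. The field widths, the toy `memory` array, and the `lp_*` helper names are all hypothetical; no bounds checking is done.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS  4    /* toy sizes, for illustration only */
#define SEGMENT_BYTES 256

/* A long pointer: segment identifier plus offset within the segment. */
typedef struct {
    uint32_t segment;
    uint32_t offset;
} long_ptr_t;

/* Toy segmented memory. */
uint8_t memory[NUM_SEGMENTS][SEGMENT_BYTES];

/* Pointer arithmetic touches only the offset; the segment propagates unchanged. */
long_ptr_t lp_add(long_ptr_t p, uint32_t delta) {
    p.offset += delta;
    return p;
}

uint8_t lp_load(long_ptr_t p)             { return memory[p.segment][p.offset]; }
void    lp_store(long_ptr_t p, uint8_t v) { memory[p.segment][p.offset] = v; }

/* Byte-wise copy written without any knowledge of which segments are involved. */
void lp_memcpy(long_ptr_t dst, long_ptr_t src, uint32_t len) {
    while (len != 0) {
        lp_store(dst, lp_load(src));
        dst = lp_add(dst, 1);
        src = lp_add(src, 1);
        len--;
    }
}
```

The copy loop never inspects the segment field, which is the sense in which software can stay segment-oblivious while the hardware still knows which segment each access targets.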
66
ASM-CMP Segments
Segment pointers propagate with datapath

memcpy(dst, src, len);

      lp   $t0, 0(dst)
      lp   $t1, 0(src)
      mov  $t2, $a2        ; cnt <- len
loop: beqz $t2, exit
      lb   $t3, 0($t1)     ; ld src
      sb   $t3, 0($t0)     ; st dst
      addi $t0, $t0, 1     ; inc. dst
      addi $t1, $t1, 1     ; inc. src
      subi $t2, $t2, 1     ; dec. cnt
      b    loop
exit:

[Diagram: animation of memory, register file, and ALU. lp loads the dst and src long pointers (offset + segment pointer) from memory into $t0 and $t1; $t2 holds len and $t3 holds data; the ALU computes Offset+1 while the Seg part of the register is carried along]
- Segment propagates src -> dst
- Pointers are long
67
The Problem
[Diagram: hardware layout (processors P with caches C, LLC, DRAM) next to the software view (processors P over coherent shared memory), joined by a "?"]
68
The Problem
[Diagram: hardware layout (processors P with caches C, LLC, DRAM) next to the software view (coherent shared memory)]
Hardware Policy – Software Can’t Change!
69
All Data Are Created Equal?
Location := 1;
[Diagram: processors P with caches C, LLC, and DRAM; "?" marks at each level ask where Location ends up]
Assume: CMP MESI protocol, inclusive LLC
70
Missed Opportunities
begin_tx
    cpLocation := Location;
    Location := 1;
end_tx

[Diagram: processors P with caches C, LLC, and DRAM]
Assume: CMP MESI protocol, inclusive LLC
SW makes redundant copy
71
All Data Are NOT Created Equal
func foo()
    var Location;
    Location := 1;

[Diagram: processors P with caches C, LLC, and DRAM; Location is private]
Assume: CMP MESI protocol, inclusive LLC
Wasting space
72
ASM-1 Hardware
[Diagram: processor (P) with acoherence engine (AE), 32KB L1, 256KB L2, and 8MB L3; per-line bitmasks shown alongside the caches]
73
Baseline
[Diagram: 16 processors (P0 to P15), each with its own L1 and L2, and eight L3 banks (L3_0 to L3_7)]
- In-order, single thread
- Ring interconnect
74
Storage Overhead
[Figure: storage overhead (0% to 100%) vs. # cores (2 to 1024) for ASM-1, MESI-1 Level, MESI-2 Level, and MESI-3 Level]
- More indirection -> longer latency
- No indirection
75
Rethinking Coherence: Why Now?
- Dennard Scaling is over; Moore’s Law continues
  • Need scalable, energy efficient components
- Accelerators are here
  • How should they see memory?
- Shared-little workloads in important markets
76
All Data Are NOT Created Equal
func CUDAKernel(…)
    …
    Location := 1;

[Diagram: processor P and GPU with caches C, LLC, and DRAM]
Assume: CMP MESI protocol, inclusive LLC
Not clear accelerators want/need coherence