Memory Models: A Case for Rethinking Parallel Languages and Hardware†

Sarita V. Adve, University of Illinois
[email protected]

Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh, Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir, Byn Choi, Rakesh Komuravelli, Hyojin Sung

† Also a paper by S. V. Adve & H.-J. Boehm, to appear in CACM


TRANSCRIPT

Page 1: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Memory Models: A Case for Rethinking Parallel Languages and Hardware†

Sarita V. Adve, University of Illinois

[email protected]

Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh, Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir,

Byn Choi, Rakesh Komuravelli, Hyojin Sung

† Also a paper by S. V. Adve & H.-J. Boehm, To appear in CACM

Page 2: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Memory Consistency Models

Parallelism for the masses! Shared-memory most common

Memory model = Legal values for reads


Page 7: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

20 Years of Memory Models

• Memory model is at the heart of concurrency semantics
  – 20-year journey from confusion to convergence at last!
  – Hard lessons learned
  – Implications for the future
• Current way to specify concurrency semantics is too hard
  – Fundamentally broken
• Must rethink parallel languages and hardware
  – E.g., Illinois Deterministic Parallel Java, DeNovo architecture

Page 8: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

What is a Memory Model?

• Memory model defines what values a read can return

Initially A = B = C = Flag = 0

Thread 1        Thread 2
A = 26          while (Flag != 1) {;}
B = 90          r1 = B
Flag = 1        r2 = A

Under SC: r1 = 90, r2 = 26 (weaker models may also allow 0)
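The slide's example can be run directly in Java, provided Flag is declared volatile so the spin loop is a synchronization access rather than a data race (the class and field names below are mine, not from the slides):

```java
// Sketch of the slide's example. Because flag is volatile, thread 2's
// loop synchronizes with thread 1's write of flag = 1, so under the
// Java memory model the plain writes to a and b must be visible:
// r1 is guaranteed 90 and r2 is guaranteed 26.
class MessagePassing {
    static int a = 0, b = 0;
    static volatile int flag = 0;

    static int[] run() throws InterruptedException {
        a = 0; b = 0; flag = 0;
        final int[] r = new int[2];
        Thread t1 = new Thread(() -> { a = 26; b = 90; flag = 1; });
        Thread t2 = new Thread(() -> {
            while (flag != 1) { /* spin until thread 1 sets the flag */ }
            r[0] = b;  // r1: must see 90
            r[1] = a;  // r2: must see 26
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println("r1=" + r[0] + " r2=" + r[1]);
    }
}
```

If flag were a plain int, the loop would be a data race and r1/r2 could legally observe 0.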

Page 9: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Memory Model is Key to Concurrency Semantics

• Interface between program and transformers of program
  – Defines what values a read can return

[Figure: C++ program transformed by Compiler, Dynamic optimizer, and Hardware, with assembly as the intermediate interface]

• Weakest system component exposed to the programmer
  – Language-level model has implications for hardware
• Interface must last beyond trends

Page 10: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Desirable Properties of a Memory Model

• 3 Ps
  – Programmability
  – Performance
  – Portability
• Challenge: hard to satisfy all 3 Ps
  – Late 1980s to 90s: largely driven by hardware
    • Lots of models, little consensus
  – 2000 onwards: largely driven by languages/compilers
    • Consensus model for Java, C++ (C, others ongoing)
    • Had to deal with mismatches in hardware models

Path to convergence has lessons for the future

Page 11: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Programmability – SC [Lamport79]

• Programmability: Sequential consistency (SC) most intuitive
  – Operations of a single thread in program order
  – All operations in a total order or atomic
• But performance?
  – Recent (complex) hardware techniques boost performance with SC
  – But compiler transformations still inhibited
• But portability?
  – Almost all h/w and compilers violate SC today

SC not practical, but…

Page 12: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Next Best Thing – SC Almost Always

• Parallel programming too hard even with SC
  – Programmers (want to) write well-structured code
  – Explicit synchronization, no data races

Thread 1        Thread 2
Lock(L)         Lock(L)
Read Data1      Read Data2
Write Data2     Write Data1
…               …
Unlock(L)       Unlock(L)

  – SC for such programs much easier: can reorder data accesses

Data-race-free model [AdveHill90]
  – SC for data-race-free programs
  – No guarantees for programs with data races
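The lock pattern on the slide can be sketched in runnable Java (class and field names are my own). Both threads acquire the same lock, so the Data1/Data2 accesses never race, and only the two SC interleavings of the critical sections are possible:

```java
// Both critical sections use the same lock L, so the program is
// data-race-free and behaves as if the two sections ran in some
// sequential order: either t1's section first, or t2's.
class LockedCopy {
    static final Object L = new Object();
    static int data1, data2;

    static int[] run() throws InterruptedException {
        data1 = 1; data2 = 2;
        Thread t1 = new Thread(() -> {
            synchronized (L) {      // Lock(L)
                data2 = data1;      // Read Data1, Write Data2
            }                       // Unlock(L)
        });
        Thread t2 = new Thread(() -> {
            synchronized (L) {
                data1 = data2;      // Read Data2, Write Data1
            }
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return new int[] { data1, data2 };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        // Whichever critical section runs first, both fields end up equal.
        System.out.println(r[0] + " " + r[1]);
    }
}
```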

Page 13: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Definition of a Data Race

• Distinguish between data and non-data (synchronization) accesses
• Only need to define for SC executions (a total order)
• Two memory accesses form a race if
  – They are from different threads, to the same location, and at least one is a write
  – They occur one after another

Thread 1            Thread 2
Write, A, 26
Write, B, 90        Read, Flag, 0
Write, Flag, 1
                    Read, Flag, 1
                    Read, B, 90
                    Read, A, 26

• A race with a data access is a data race
• Data-race-free program = no data race in any SC execution

Page 14: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Data-Race-Free Model

• Data-race-free model = SC for data-race-free programs
  – Does not preclude races for wait-free constructs, etc.
• Requires races to be explicitly identified as synchronization
  – E.g., use volatile variables in Java, atomics in C++
  – Dekker's algorithm:

Initially Flag1 = Flag2 = 0
volatile Flag1, Flag2

Thread 1              Thread 2
Flag1 = 1             Flag2 = 1
if (Flag2 == 0)       if (Flag1 == 0)
  // critical section   // critical section

SC prohibits both loads from returning 0
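Dekker's flags translate directly to Java volatiles (a sketch; names are mine). Because volatile accesses behave sequentially consistently, at most one of the two loads can return 0:

```java
// Each thread publishes its own flag, then reads the other's.
// Volatile accesses form a single total order consistent with each
// thread's program order, so the two loads cannot both return 0:
// whichever flag-write comes second in that order is visible to the
// other thread's read.
class Dekker {
    static volatile int flag1 = 0, flag2 = 0;

    static int[] run() throws InterruptedException {
        flag1 = 0; flag2 = 0;
        final int[] r = new int[2];
        Thread t1 = new Thread(() -> { flag1 = 1; r[0] = flag2; });
        Thread t2 = new Thread(() -> { flag2 = 1; r[1] = flag1; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println("t1 read " + r[0] + ", t2 read " + r[1]);
    }
}
```

With plain (non-volatile) fields, both reads returning 0 is allowed: compilers and hardware may reorder each thread's store past its load.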

Page 15: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Data-Race-Free Approach

• Programmer's model: SC for data-race-free programs
• Programmability
  – Simplicity of SC, for data-race-free programs
• Performance
  – Specifies minimal constraints (for an SC-centric view)
• Portability
  – Language must provide a way to identify races
  – Hardware must provide a way to preserve ordering on races
  – Compiler must translate correctly

Page 16: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

1990s in Practice (The Memory Models Mess)

• Hardware
  – Implementation/performance-centric view
  – Different vendors had different models – most non-SC
    • Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
  – Various ordering guarantees + fences to impose other orders
  – Many ambiguities – due to complexity, by design(?), …
• High-level languages
  – Most shared-memory programming with Pthreads, OpenMP
    • Incomplete, ambiguous model specs
    • Memory model is a property of the language, not a library [Boehm05]
  – Java – commercially successful language with threads
    • Chapter 17 of the Java language spec defines the memory model
    • But hard to interpret, badly broken

[Figure: sequences of loads (LD) and stores (ST) reordered around a Fence]

Page 17: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

2000 – 2004: Java Memory Model

• ~ 2000: Bill Pugh publicized fatal flaws in Java model

• Lobbied Sun to form expert group to revise Java model

• Open process via mailing list
  – Diverse participants
  – Took 5 years of intense, spirited debates
  – Many competing models
  – Final consensus model approved in 2005 for Java 5.0

[MansonPughAdve POPL 2005]

Page 18: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Java Memory Model Highlights

• Quick agreement that SC for data-race-free was required

• Missing piece: semantics for programs with data races
  – Java cannot have undefined semantics for ANY program
  – Must ensure safety/security guarantees
  – Limit damage from data races in untrusted code
• Goal: satisfy security/safety with maximum system flexibility
  – Problem: "safety/security, limited damage" with threads is very vague

Page 19: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Java Memory Model Highlights

Initially X = Y = 0

Thread 1        Thread 2
r1 = X          r2 = Y
Y = r1          X = r2

Is r1 = r2 = 42 allowed?

Data races produce a causality loop!

Defining a causality loop was surprisingly hard
Common compiler optimizations seem to violate "causality"
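The racy program above can be written in plain Java (a sketch; names are mine). The Java model's causality rules forbid out-of-thin-air values such as 42: every value written traces back to the initial 0, so 0 is the only value either thread can observe.

```java
// x and y are plain fields, so all four accesses race. The Java
// memory model still forbids r1 = r2 = 42 "out of thin air":
// each thread only writes back a value it read, and the only value
// anywhere in the system at the start is 0.
class Causality {
    static int x = 0, y = 0;

    static int[] run() throws InterruptedException {
        x = 0; y = 0;
        final int[] r = new int[2];
        Thread t1 = new Thread(() -> { r[0] = x; y = r[0]; });
        Thread t2 = new Thread(() -> { r[1] = y; x = r[1]; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return r;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println("r1=" + r[0] + " r2=" + r[1]);
    }
}
```

Pinning down precisely why 42 is forbidden here, while still permitting common compiler optimizations, is the hard part the slide alludes to.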

Page 20: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Java Memory Model Highlights

• Final model based on consensus, but complex
  – Programmers can (must) use "SC for data-race-free"
  – But system designers must deal with the complexity
  – Correctness tools, racy programs, debuggers, …??
  – Recent discovery of bugs [SevcikAspinall08]

Page 21: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

2005 – : C++, Microsoft Prism, Multicore

• ~2005: Hans Boehm initiated the C++ concurrency model
  – Prior status: no threads in C++, most concurrency with Pthreads
• Microsoft concurrently started its own internal effort
• C++ easier than Java because it is unsafe
  – Data-race-free is a plausible model
• BUT multicore brought new h/w optimizations, more scrutiny
  – Mismatched h/w and programming views became painfully obvious
  – Debate that SC for data-race-free is inefficient with hardware models

Page 22: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Hardware Implications of Data-Race-Free

• Synchronization (volatiles/atomics) must appear SC
  – Each thread's synch accesses must appear in program order

synch Flag1, Flag2

Thread 1            Thread 2
Flag1 = 1           Flag2 = 1
Fence               Fence
if (Flag2 == 0)     if (Flag1 == 0)
  critical section    critical section

SC ⇒ both reads cannot return 0

  – Requires efficient fences between synch stores/loads
  – All synch accesses must appear in a total order (atomic)

Page 23: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Implications of Atomic Synch Writes

Independent reads, independent writes (IRIW):

Initially X = Y = 0

T1          T2          T3            T4
X = 1       Y = 1       … = Y  (1)   … = X  (1)
                        fence         fence
                        … = X  (0)   … = Y  (0)

SC ⇒ no thread sees a new value until old copies are invalidated, so the outcome shown is prohibited
  – Shared caches with hyperthreading/multicore make this harder
  – Programmers don't usually use IRIW
  – Why pay the cost of SC in h/w if it is not useful to s/w?
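IRIW can be sketched in Java (names are mine). With X and Y volatile, the readers get SC semantics in place of explicit fences, so they cannot disagree about the order of the two independent writes:

```java
// Independent reads of independent writes. Because x and y are
// volatile, all accesses fall in one total order, so the two reader
// threads must agree on which write happened first: the outcome
// where T3 sees y=1,x=0 AND T4 sees x=1,y=0 is forbidden.
class Iriw {
    static volatile int x = 0, y = 0;

    // Returns {T3's read of y, T3's read of x, T4's read of x, T4's read of y}.
    static int[] run() throws InterruptedException {
        x = 0; y = 0;
        final int[] r = new int[4];
        Thread w1 = new Thread(() -> { x = 1; });
        Thread w2 = new Thread(() -> { y = 1; });
        Thread t3 = new Thread(() -> { r[0] = y; r[1] = x; });
        Thread t4 = new Thread(() -> { r[2] = x; r[3] = y; });
        Thread[] all = { w1, w2, t3, t4 };
        for (Thread t : all) t.start();
        for (Thread t : all) t.join();
        return r;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

Enforcing this write atomicity in hardware is exactly the cost the slide questions, since real programs rarely depend on IRIW.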

Page 24: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

C++ Challenges

• 2006: Pressure to change Java/C++ to remove the SC baseline
  – To accommodate some hardware vendors
• But what is the alternative?
  – Must allow some hardware optimizations
  – But must be teachable to undergrads
• Showed such an alternative (probably) does not exist

Page 25: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

C++ Compromise

• Default C++ model is data-race-free
• AMD, Intel, … on board
• But
  – Some systems need an expensive fence for SC
  – Some programmers really want more flexibility
• C++ specifies a low-level model only for experts
  – Complicates the spec, but only for experts
  – "We are not advertising this part"
  – [BoehmAdve PLDI 2008]

Page 26: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Summary of Current Status

• Convergence to "SC for data-race-free" as the baseline
• For programs with data races
  – Minimal but complex semantics for safe languages
  – No semantics for unsafe languages

Page 27: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Lessons Learned

• SC for data-race-free is the minimal baseline
• Specifying semantics for programs with data races is HARD
  – But "no semantics for data races" also has problems
    • Not an option for safe languages; debugging; correctness-checking tools
• Hardware-software mismatch for some code
  – "Simple" optimizations have unintended consequences

State-of-the-art is fundamentally broken

Page 28: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Lessons Learned

• SC for data-race-free is the minimal baseline
• Specifying semantics for programs with data races is HARD
  – But "no semantics for data races" also has problems
    • Not an option for safe languages; debugging; correctness-checking tools
• Hardware-software mismatch for some code
  – "Simple" optimizations have unintended consequences

State-of-the-art is fundamentally broken

Banish shared-memory?

Page 29: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Lessons Learned

• SC for data-race-free is the minimal baseline
• Specifying semantics for programs with data races is HARD
  – But "no semantics for data races" also has problems
    • Not an option for safe languages; debugging; correctness-checking tools
• Hardware-software mismatch for some code
  – "Simple" optimizations have unintended consequences

State-of-the-art is fundamentally broken
• We need
  – Higher-level disciplined models that enforce discipline
  – Hardware co-designed with high-level models

Banish wild shared-memory!

Page 30: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Lessons Learned

• SC for data-race-free is the minimal baseline
• Specifying semantics for programs with data races is HARD
  – But "no semantics for data races" also has problems
    • Not an option for safe languages; debugging; correctness-checking tools
• Hardware-software mismatch for some code
  – "Simple" optimizations have unintended consequences

State-of-the-art is fundamentally broken
• We need
  – Higher-level disciplined models that enforce discipline
  – Hardware co-designed with high-level models

Banish wild shared-memory!

Deterministic Parallel Java [V. Adve et al.]
DeNovo hardware

Page 31: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Research Agenda for Languages

• Disciplined shared-memory models
  – Simple
  – Enforceable
  – Expressive
  – Performance

Key: What discipline? How to enforce it?

Page 32: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Data-Race-Free

• A near-term discipline: data-race-free
• Enforcement
  – Ideally, the language prohibits data races by design
  – Else, the runtime catches them as exceptions
• But data-race-free is still not sufficiently high level

Page 33: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Deterministic-by-Default Parallel Programming

• Even data-race-free parallel programs are too hard
  – Multiple interleavings due to unordered synchronization (or races)
  – Makes reasoning and testing hard
• But many algorithms are deterministic
  – Fixed input gives fixed output
  – Standard model for sequential programs
  – Also holds for many transformative parallel programs
  – Parallelism is not part of the problem specification, only for performance

Why write such an algorithm in a non-deterministic style, then struggle to understand and control its behavior?

Page 34: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Deterministic-by-Default Model

• Parallel programs should be deterministic by default
  – Sequential semantics (easier than SC!)
• If non-determinism is needed
  – It should be explicitly requested
  – It should be isolated from the deterministic parts
• Enforcement:
  – Ideally, the language prohibits by design
  – Else, the runtime

Page 35: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

State of the Art

• Many deterministic languages today
  – Functional, pure data-parallel, some domain-specific, …
  – Much recent work on runtime, library-based approaches
• Our work: a language approach for modern O-O methods
  – Deterministic Parallel Java (DPJ) [V. Adve et al.]

Page 36: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Deterministic Parallel Java (DPJ)

• Object-oriented type and effect system
  – Use "named" regions to partition the heap
  – Annotate methods with effect summaries: regions read or written
  – If the program type-checks, it is guaranteed deterministic
    * Simple, modular compiler checking
    * No run-time checks today; may add them in the future
  – Side benefit: regions and effects are valuable documentation
• Extended sequential subset of Java (DPC++ ongoing)
  – Initial evaluation for expressivity, performance [OOPSLA09]
  – Integrating disciplined non-determinism
  – Encapsulating frameworks and unchecked code
  – Semi-automatic tool for effect annotations [ASE09]
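The region and effect annotations look roughly like this (a hand-written sketch in DPJ-style syntax, not taken from the slides; consult the DPJ documentation for the exact forms):

```
// DPJ-style sketch: regions partition the heap, and each method
// declares which regions it reads or writes. The compiler can then
// verify that the two branches of the cobegin touch disjoint
// regions, so running them in parallel is guaranteed deterministic.
class Point {
    region X, Y;          // named regions partitioning this object's fields
    double x in X;
    double y in Y;

    void setX(double vx) writes X { x = vx; }
    void setY(double vy) writes Y { y = vy; }

    void setXY(double vx, double vy) writes X, Y {
        cobegin {
            setX(vx);     // effect: writes X
            setY(vy);     // effect: writes Y -- disjoint from X, so safe
        }
    }
}
```

If the two branches had overlapping effects, the program would simply fail to type-check; no run-time instrumentation is involved.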

Page 37: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Research Agenda for Hardware

• Current hardware is not matched even to the current model
• Near term: ISA changes, speculation
• Long term: co-design hardware with new software models

Page 38: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Illinois DeNovo Project

• Design hardware to exploit disciplined parallelism
  – Simpler hardware
  – Scalable performance
  – Power/energy efficiency
• Working with DPJ as an example disciplined model
  – Exploit data-race-freedom and region/effect information
    * Simpler coherence
    * Efficient communication: point-to-point, bulk, …
    * Efficient data layout: region- vs. cache-line-centric memory
    * New hardware/software interface

Page 39: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Cache Coherence

• Commonly accepted definition (software-oblivious)
  – All writes to the same location appear in the same order
  – Source of much complexity
  – Can coherence protocols scale to 1000 cores?
• What do we really need (software-aware)?
  – Get the right data to the right task at the right time
  – Disciplined models make it easier to determine what is "right"

(Assume only for-each loops)
• A read must return the value of
  – the last write in its own task, or
  – the last write in the previous for-each loop

Page 40: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Today's Coherence Protocols

• Snooping
  – Broadcast, ordered networks
• Directory – avoids broadcast through a level of indirection
  – Complexity: races in the protocol
  – Performance: level of indirection
  – Overhead: sharer list

Page 41: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Today's Coherence Protocols

• Snooping
  – Broadcast, ordered networks
• Directory – avoids broadcast through a level of indirection
  – Complexity: races in the protocol
    • Race-free software ⇒ race-free coherence protocol
  – Performance: level of indirection
    • But the definition of coherence no longer requires serialization
  – Overhead: sharer list
    • Region effects enable self-invalidations

+ No false sharing, flexible communication granularity, region-based data layout

⇒ A simpler, more efficient DeNovo protocol

Page 42: Memory Models: A Case for  Rethinking Parallel Languages and Hardware †

Conclusions

• Current way to specify concurrency semantics is fundamentally broken
  – Best we can do is SC for data-race-free
    * But cannot hide from programs with data races
  – Mismatched hardware and software
    * Simple optimizations give unintended consequences
• Need
  – High-level disciplined models that enforce discipline
  – Hardware co-designed with the high-level model
    * DPJ – deterministic-by-default parallel programming
    * DeNovo – hardware for disciplined parallel programming
• Previous memory-models convergence came from a similar process
  – But this time, let's co-design s/w and h/w