Memory Models: A Case for Rethinking Parallel Languages and Hardware COE 502/CSE 661 Fall 2011 Parallel Processing Architectures Professor: Muhamed Mudawar, KFUPM Presented by: Ahmad Hammad


Page 1: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Models: A Case for Rethinking Parallel Languages and Hardware

COE 502/CSE 661 Fall 2011, Parallel Processing Architectures

Professor: Muhamed Mudawar, KFUPM

Presented by: Ahmad Hammad

Page 2: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Outline

• Abstract
• Introduction
• Sequential Consistency
• Data-Race-Free
• Industry Practice and Evolution
• Lessons Learned
• Implications for Languages and Hardware
• Conclusion

Page 3: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Abstract

• Writing correct parallel programs remains far more difficult than writing sequential programs.

• Most parallel programs are written using a shared-memory approach.

• The memory model specifies the meaning of shared variables.
– It involves a tradeoff between programmability and performance.
– It is one of the most challenging areas in both hardware architecture and programming language specification.

Page 4: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Abstract

• Recent broad community-scale efforts have finally led to a convergence in this debate,
– with popular languages such as Java and C++,
– and most hardware vendors, publishing compatible memory model specifications.

• Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings that prevent achieving the vision of structured and safe parallel programming.

Page 5: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Abstract

• This paper discusses the path to the above convergence, the hard lessons learned, and their implications.

• A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system:
– if the programmer writes disciplined (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance.

Page 6: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Abstract

• We discuss why this view is the best we can do with current popular languages, and why it is inadequate moving forward.

• In particular, we argue that parallel languages should not only promote high-level disciplined models, but should also enforce the discipline.

• For scalable and efficient performance, hardware should be co-designed to take advantage of and support such disciplined models.

Page 7: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Introduction

• Most parallel programs today are written using threads and shared variables

• There are a number of reasons why threads remain popular.
– Even before the dominance of multicore, threads were already widely supported by most operating systems because they are also useful for other purposes.

Page 8: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Introduction

• Direct hardware support for shared memory provides a performance advantage.

• The ability to pass memory references among threads makes it easier to share complex data structures.

• Shared memory makes it easier to selectively parallelize application hot spots without a complete redesign of data structures.

Page 9: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• The memory model is at the heart of the concurrency semantics of a shared-memory program.

• It defines the set of values a read in a program can return.

• It answers questions such as:
– Is there enough synchronization to ensure a thread’s write will occur before another’s read?
– Can two threads write to adjacent fields in a memory location at the same time?

Page 10: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• We should have an unambiguous memory model.
• It defines an interface between a program and any hardware or software that may transform that program:
– the compiler, the virtual machine, or any dynamic optimizer.
• A complex memory model makes:
– parallel programs difficult to write,
– parallel programming difficult to teach.

Page 11: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• Since it is an interface property, the memory model:
– has a long-lasting impact, affecting the portability and maintainability of programs;
– an overly constraining one may limit hardware and compiler optimizations, reducing performance.

Page 12: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• The central role of the memory model has often been downplayed.
– Specifying a model that balances all desirable properties has proven surprisingly complex:
• programmability
• performance
• portability

Page 13: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• In the 1980s and 1990s, the area received attention primarily in the hardware community,
– which explored many approaches, with little consensus.
– The challenge for hardware architects was the lack of a clear memory model at the programming language level:
– it was unclear what programmers expected hardware to do.

Page 14: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• Hardware researchers proposed approaches to bridge this gap, but there was no wide consensus from the software community.

• Before 2000:
– few programming environments addressed the issue clearly;
– widely used environments had unclear and questionable specifications;
– even when specifications were relatively clear, they were often violated to obtain sufficient performance;
– they tended to be misunderstood even by experts, and were difficult to teach.

Page 15: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• Since 2000, there have been efforts to cleanly specify programming-language-level memory models,
– first for Java and C++, with efforts now underway to adopt similar models for C and other languages.
– In the process, these efforts exposed issues created by hardware that had evolved in ways that made it difficult to reconcile the need for a simple and usable programming model with that for adequate performance on existing hardware.

Page 16: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• Today, the above languages and most hardware vendors have published (or plan to publish) compatible memory model specifications.

• Although this convergence is a dramatic improvement over the past, it has exposed fundamental shortcomings in our parallel languages and their interplay with hardware.

Page 17: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Memory Model

• After decades of research, it is still unacceptably difficult to describe what value a load can return without compromising modern safety guarantees or implementation methodologies.

• To us, this experience has made it clear that solving the memory model problem will require a significantly new and cross-disciplinary research direction for parallel computing languages, hardware, and environments as a whole.

Page 18: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Sequential Consistency

• The execution of a multithreaded program with shared variables proceeds as follows.
– Each step in the execution consists of choosing one of the threads to execute,
– then performing the next step in that thread’s execution according to program order.
– This process is repeated until the program as a whole terminates.

Page 19: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Sequential Consistency

• An execution can be viewed as taking all steps executed by each thread and interleaving them.

• The possible results are determined by trying all possible interleavings of the instructions.

• Instructions in a single thread must occur in the same order in the interleaving as in the thread.

• Following Lamport, such an execution is called sequentially consistent.

Page 20: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Sequential Consistency

Fig. 1 Core of Dekker’s algorithm. Initially X = Y = 0. Can r1 = r2 = 0?

Red Thread: X = 1; r1 = Y;
Blue Thread: Y = 1; r2 = X;

Fig. 2 Some executions for Fig. 1:

Exec. 1: X = 1; r1 = Y; Y = 1; r2 = X; → r1 = 0, r2 = 1
Exec. 2: Y = 1; r2 = X; X = 1; r1 = Y; → r1 = 1, r2 = 0
Exec. 3: X = 1; Y = 1; r1 = Y; r2 = X; → r1 = 1, r2 = 1

Fig. 2 gives three possible interleavings that together illustrate all possible final values of the non-shared variables r1 and r2.
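The claim that these interleavings cover every sequentially consistent outcome can be checked mechanically. The sketch below (illustrative, not part of the original paper) enumerates all six interleavings of the two threads in Fig. 1 that preserve program order and collects the possible (r1, r2) values; (0, 0) never appears.

```python
from itertools import combinations

# Fig. 1, as (destination, source) operations: source is a constant or a variable.
red  = [("X", 1), ("r1", "Y")]   # Red thread:  X = 1; r1 = Y;
blue = [("Y", 1), ("r2", "X")]   # Blue thread: Y = 1; r2 = X;

def sc_outcomes(t1, t2):
    """Possible (r1, r2) under sequential consistency: every interleaving
    that keeps each thread's operations in program order."""
    n = len(t1) + len(t2)
    results = set()
    # Choose which slots of the merged sequence belong to thread 1.
    for pos in combinations(range(n), len(t1)):
        state = {"X": 0, "Y": 0, "r1": None, "r2": None}  # initially X = Y = 0
        i1 = i2 = 0
        for k in range(n):
            if k in pos:
                dst, src = t1[i1]; i1 += 1
            else:
                dst, src = t2[i2]; i2 += 1
            state[dst] = src if isinstance(src, int) else state[src]
        results.add((state["r1"], state["r2"]))
    return results

print(sorted(sc_outcomes(red, blue)))  # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```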

Page 21: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Sequential Consistency

• Sequential consistency gives the simplest possible meaning for shared variables, but suffers from several related flaws:
– sequential consistency can be expensive to implement;
– almost all hardware and compilers violate SC today.

• Compilers reorder the two independent assignments, since scheduling loads early tends to hide the load latency.

• Modern processors use a store buffer to avoid waiting for stores to complete.

• Both compiler and hardware optimizations make the outcome r1 = r2 = 0 possible: a non-sequentially-consistent execution.
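The effect of these reorderings can be seen with a variant of the interleaving enumeration (an illustrative sketch, not from the paper): each thread's two accesses touch different variables, so a compiler or store buffer may effectively swap them; once per-thread order is also varied, the outcome r1 = r2 = 0 becomes reachable.

```python
from itertools import combinations, permutations

# Each operation is (destination, source): source is a constant or a variable name.
red  = [("X", 1), ("r1", "Y")]   # Red thread:  X = 1; r1 = Y;
blue = [("Y", 1), ("r2", "X")]   # Blue thread: Y = 1; r2 = X;

def step(state, op):
    dst, src = op
    state[dst] = src if isinstance(src, int) else state[src]

def outcomes(t1, t2):
    """All interleavings preserving the given per-thread order."""
    n = len(t1) + len(t2)
    results = set()
    for pos in combinations(range(n), len(t1)):
        state = {"X": 0, "Y": 0, "r1": None, "r2": None}
        i1 = i2 = 0
        for k in range(n):
            if k in pos:
                step(state, t1[i1]); i1 += 1
            else:
                step(state, t2[i2]); i2 += 1
        results.add((state["r1"], state["r2"]))
    return results

# A compiler or store buffer may effectively swap each thread's two
# independent accesses, so also try every per-thread order.
relaxed = set()
for r in permutations(red):
    for b in permutations(blue):
        relaxed |= outcomes(list(r), list(b))

print(sorted(relaxed))  # [(0, 0), (0, 1), (1, 0), (1, 1)] -- (0, 0) now possible
```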

Page 22: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Data-Race-Free

• Reordering accesses to unrelated variables in a thread:
– preserves the meaning of single-threaded programs,
– but can affect multithreaded programs.

• Two memory accesses form a race if:
– they are from different threads, access the same location, and at least one is a write,
– and they occur one after another.

• A data-race-free program has no data race in any SC execution.
– This requires that races be explicitly identified as synchronization:
– e.g., locks, volatile variables in Java, atomics in C++.
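As a concrete illustration of explicitly identified synchronization (a Python sketch using a lock in place of the Java volatiles or C++ atomics named above; the `Counter` class is ours, not the paper's), every access to the shared field goes through one lock, so no two conflicting accesses are unordered:

```python
import threading

class Counter:
    """Every access to the shared field is protected by one lock, so any two
    conflicting accesses are ordered by synchronization: no data race."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:      # explicitly identified synchronization access
            self.value += 1

    def read(self):
        with self._lock:
            return self.value

def run(n_threads, n_increments):
    c = Counter()
    def work():
        for _ in range(n_increments):
            c.increment()
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c.read()

print(run(4, 10_000))  # 40000 -- the data-race-free program behaves predictably
```

Removing the lock would turn the two accesses to `self.value` into an unidentified race; under the data-race-free model, such a program has no guarantees.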

Page 23: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Data-Race-Free

• A program that does not allow a data race is said to be data-race-free.

• The data-race-free model guarantees sequential consistency only for data-race-free programs.

• For programs that allow data races, the model does not provide any guarantees.

Page 24: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Data-Race-Free Approach

• Programmability
– Simply identify the shared variables as synchronization variables.
• Performance
– Specifies minimal constraints.
• Portability
– The language must provide a way to identify races.
– The hardware must provide a way to preserve ordering on races.
– The compiler must translate correctly.
– Synchronization accesses are performed indivisibly,
– ensuring sequential consistency in spite of simultaneous accesses.

Page 25: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Data-Race-Free Approach

• The most important hardware and compiler optimizations are allowed for ordinary accesses.

• Care must be taken at the explicitly identified synchronization accesses.

• Although data-race-free was formally proposed in 1990,
– it did not see widespread adoption as a formal model in industry until recently.

Page 26: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Industry Practice and Evolution

• Hardware Memory Models
• High-Level Language Memory Models
– The Java Memory Model
– The C++ Memory Model

Page 27: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Hardware Memory Models

• Most hardware supports relaxed models that are weaker than sequential consistency.
– These models take an implementation- or performance-centric view:
– desirable hardware optimizations drive the model specification.
• Different vendors had different models:
– Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray;
– various ordering guarantees plus fences to impose other orders;
– many ambiguities, due to complexity by design and unclear documentation.

Page 28: Memory Models: A Case for Rethinking Parallel Languages and Hardware

High-Level Language Memory Models

• Ada was the first widely used high-level programming language to provide first-class support for shared-memory parallel programming.
– It was advanced in its treatment of memory semantics.
– It used a style similar to data-race-free, requiring legal programs to be well-synchronized;
– it did not, however, fully formalize the notion of well-synchronized.

Page 29: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Pthreads and OpenMP

• Most shared-memory programming is enabled through libraries and APIs such as POSIX threads and OpenMP.
– Incomplete, ambiguous model specifications.
– The memory model is a property of the language, not the library.
– It is unclear what compiler transformations are legal,
– and what the programmer is allowed to assume.

Page 30: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Pthreads and OpenMP

• The POSIX threads specification:
– indicates a model similar to data-race-free;
– has several inconsistent aspects,
– with varying interpretations even among experts.

• The OpenMP model:
– is unclear and largely based on a flush operation that is similar to fence instructions in hardware models.

Page 31: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Java

• Java provided first-class support for threads,
– with a chapter specifically devoted to its memory model.

• Bill Pugh publicized fatal flaws in the Java model:
– the model was hard to interpret,
– and badly broken:
• common compiler optimizations were prohibited;
• in many cases the model gave ambiguous or unexpected behavior.

Page 32: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Java Memory Model Highlights

• In 2000, Sun appointed an expert group to revise the model through the Java Community Process.
– Coordination was through an open mailing list,
– which attracted a variety of participants representing hardware, software, researchers, and practitioners.
– It took 5 years of debates and many competing models;
– the final consensus model was approved in 2005 for Java 5.0.
– The group quickly decided that the JMM must provide SC for data-race-free programs.
– Unfortunately, the model is very complex,
– has some surprising behaviors,
– and has recently been shown to have a bug.

Page 33: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Java Memory Model Highlights

• Missing piece: semantics for programs with data races.
– Java is meant to be a safe and secure language;
– it cannot allow arbitrary behavior for data races.
– Java cannot have undefined semantics for ANY program:
– it must ensure safety/security guarantees to limit damage from data races in untrusted code.
– Problem: “safety/security” and “limited damage” in a multithreaded context were not clearly defined.

Page 34: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Java Memory Model Highlights

• The final model was based on consensus, but is complex.
– Programmers (must) use SC for data-race-free;
– but system designers must deal with the complexity.
– Bugs were discovered as recently as 2008.

Page 35: Memory Models: A Case for Rethinking Parallel Languages and Hardware

The C++ Memory Model

• The situation in C++ was significantly different from Java.

• The language itself provided no support for threads:
– most concurrency used Pthreads,
– or the corresponding Microsoft Windows facilities.

• The combination of the C++ standard with the Pthreads standard
– made it unclear to compiler writers precisely what they needed to implement.

Page 36: Memory Models: A Case for Rethinking Parallel Languages and Hardware

The C++ Memory Model

• In 2005, an effort to define a concurrency model for C++ was initiated.
• The resulting effort expanded to include the definition of atomic operations and the threads API itself.
• Microsoft concurrently started its own internal effort.
• C++ is easier than Java because it is unsafe:
– pure data-race-free is a possible model.
• BUT multicore brought new hardware optimizations and more research:
– the mismatch between hardware and programming views became painful;
– there was debate over whether SC for data-race-free is inefficient with new hardware models.

Page 37: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Lessons Learned

• Data-race-free provides a simple and consistent model for threads and shared variables.
• The current approach has three significant weaknesses:
– Debugging: accidental introduction of a data race results in “undefined behavior.”
– Synchronization variable performance: ensuring SC in Java or C++ on current hardware complicates the model for experts.
– Untrusted code: there is no way to ensure data-race-freedom in untrusted code (Java).

Page 38: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Lessons Learned

• There is a need for:
– higher-level disciplined models that enforce discipline
• e.g., Deterministic Parallel Java;
– hardware co-designed with high-level models
• e.g., DeNovo hardware.

Page 39: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Implications for Languages

• Modern software needs to be both parallel and secure.

• A choice between the two should not be acceptable.

• Disciplined shared-memory models should offer:
– simplicity,
– enforceability,
– expressivity,
– and performance.

Page 40: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Implications for Languages

• Simple enough to be easily teachable to undergraduates,
– minimally providing sequential consistency to programs that obey the required discipline.
• Enable enforcement of the discipline:
– violations of the discipline should not have undefined or complex semantics;
– they should be caught and reported back to the programmer as illegal.
• General-purpose enough to express important parallel algorithms and patterns.
• Enable high and scalable performance.

Page 41: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Data-race-free

• A near-term discipline: continue with data-race-free.

• Focus research on its enforcement:
– ideally, the language prohibits and eliminates data races by design;
– otherwise, the runtime catches a race as an exception.

Page 42: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Deterministic-by-Default Parallel Programming

• Many algorithms are deterministic:
– a fixed input gives a fixed output;
– this is the standard model for sequential programs,
– and it also holds for many transformative parallel programs.
• Parallelism is not part of the problem specification, only a means to performance.

• A language-based approach with such goals: Deterministic Parallel Java (DPJ).

Page 43: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Implications for Hardware

• Current hardware memory models are an imperfect match for even current software memory models.

• Instruction set architecture changes
– that identify individual loads and stores as synchronization
– can alleviate some short-term problems.

• An established ISA is hard to change:
– existing code works mostly adequately,
– and there is not enough experience to document the benefits of the change.

Page 44: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Implications for Hardware

• From a longer-term perspective, a more fundamental solution to the problem will emerge from a co-designed approach,
– where future multicore hardware research evolves in concert with software-model research.
• The DeNovo project at Illinois, working with DPJ as an example disciplined model:
– designs hardware to exploit disciplined parallelism, aiming for
– simpler hardware,
– scalable performance,
– and power/energy efficiency.

Page 45: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Conclusion

• This paper gives a perspective based on work collectively spanning about thirty years.

• The current way to specify concurrency semantics is fundamentally broken:
– the best we can do is SC for data-race-free;
– hardware and software are mismatched:
• simple optimizations give unintended consequences.

• We need:
– high-level disciplined models that enforce discipline;
– hardware co-designed with high-level models.

Page 46: Memory Models: A Case for Rethinking Parallel Languages and Hardware

Authors

• Sarita V. Adve ([email protected]) is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

• Hans-J. Boehm ([email protected]) is a member of Hewlett-Packard's Exascale Computing Lab, Palo Alto, CA.