
6.172 Performance Engineering of Software Systems
Lecture 11: Multicore Programming
Charles E. Leiserson
October 22, 2009

© 2009 Charles E. Leiserson

Moore's Law

[Figure: Intel CPU introductions, 1971-2007, plotting transistor count (in thousands) and clock speed (MHz) on a logarithmic scale.]

Transistor count is still rising, but clock speed is bounded at ~5 GHz.

Figure by MIT OpenCourseWare. Source: Herb Sutter, "The free lunch is over: a fundamental turn toward concurrency in software," Dr. Dobb's Journal, 30(3), March 2005.

Power Density

[Figure: power-density trend for Intel processors.]

Source: Patrick Gelsinger, Intel Developer's Forum, Intel Corporation, 2004. Reprinted with permission of Intel Corporation.

Vendor Solution

∙ To scale performance, put many processing cores on the microprocessor chip.
∙ Each generation of Moore's Law potentially doubles the number of cores.

[Photo: Intel Core i7 processor. Reprinted with permission of Intel Corporation.]

Multicore Architecture

[Diagram: a chip multiprocessor (CMP): processors P, P, …, P, each with its own cache $, connected by a network to memory and I/O.]

OUTLINE

∙ Shared-Memory Hardware
∙ Concurrency Platforms
  - Pthreads (and WinAPI Threads)
  - Threading Building Blocks
  - OpenMP
  - Cilk++
∙ Race Conditions

Cache Coherence

[Animation: processors P, P, …, P, each with a private cache, share a memory that initially holds x=3.]
∙ One processor loads x and caches x=3; two more processors load x, so several caches hold x=3.
∙ One of those processors then stores x=5, updating its own cache, while the other caches (and main memory) still hold the stale value x=3.
∙ When another processor subsequently loads x, it must not observe the stale x=3; keeping the cached copies consistent is the job of the coherence protocol described next.

MSI Protocol

Each cache line is labeled with a state:
∙ M: cache block has been modified. No other caches contain this block in M or S states.
∙ S: other caches may be sharing this block.
∙ I: cache block is invalid (same as not there).

[Diagram: three caches holding lines in various states, e.g. {M: x=13, S: y=17, I: z=8} in one cache, {I: x=4, S: y=17, M: z=7} in another, and {I: x=12, S: y=17, I: z=3} in the third.]

Before a cache modifies a location, the hardware first invalidates all other copies.
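
Not from the slides: to make the state transitions concrete, here is a toy C++ model of the MSI states for a single cache line across a few processors. Everything here (State, Line, on_read, on_write, NPROC) is an illustrative sketch of the protocol's bookkeeping, not how the hardware actually implements it.

  #include <cassert>
  #include <cstdio>

  enum class State { M, S, I };            // Modified, Shared, Invalid
  constexpr int NPROC = 3;

  struct Line {
    State state[NPROC] = {State::I, State::I, State::I};

    // Processor p reads the line: any Modified copy elsewhere is written back
    // and downgraded to Shared; p ends up with at least a Shared copy.
    void on_read(int p) {
      if (state[p] != State::I) return;    // read hit in M or S
      for (int q = 0; q < NPROC; ++q)
        if (q != p && state[q] == State::M)
          state[q] = State::S;             // writeback + downgrade
      state[p] = State::S;
    }

    // Processor p writes the line: all other copies are invalidated first,
    // as the slide says, and p holds the only copy, in state M.
    void on_write(int p) {
      for (int q = 0; q < NPROC; ++q)
        if (q != p) state[q] = State::I;
      state[p] = State::M;
    }

    void check_invariant() const {
      int modified = 0, valid = 0;
      for (State s : state) {
        if (s == State::M) ++modified;
        if (s != State::I) ++valid;
      }
      // At most one M copy; if an M copy exists, it is the only valid copy.
      assert(modified <= 1 && (modified == 0 || valid == 1));
    }
  };

  int main() {
    Line x;
    x.on_read(0);   // P0 loads x: S
    x.on_read(1);   // P1 loads x: both caches Shared
    x.on_write(0);  // P0 stores x=5: P0 -> M, P1 -> I
    x.check_invariant();
    x.on_read(1);   // P1 reloads x: P0 writes back, both become S
    x.check_invariant();
    std::puts("MSI invariants hold");
    return 0;
  }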


Concurrency Platforms

∙ Programming directly on processor cores is painful and error-prone.
∙ A concurrency platform abstracts processor cores, handles synchronization and communication protocols, and performs load balancing.
∙ Examples:
  - Pthreads and WinAPI threads
  - Threading Building Blocks (TBB)
  - OpenMP
  - Cilk++

Fibonacci Numbers

The Fibonacci numbers are the sequence ⟨0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …⟩, where each number is the sum of the previous two.

Recurrence:
  F(0) = 0,
  F(1) = 1,
  F(n) = F(n-1) + F(n-2) for n > 1.

The sequence is named after Leonardo di Pisa (1170–1250 A.D.), also known as Fibonacci, a contraction of filius Bonaccii, "son of Bonaccio." Fibonacci's 1202 book Liber Abaci introduced the sequence to Western mathematics, although it had previously been discovered by Indian mathematicians.

Fibonacci Program

  #include <stdio.h>
  #include <stdlib.h>

  int fib(int n) {
    if (n < 2) return n;
    else {
      int x = fib(n-1);
      int y = fib(n-2);
      return x + y;
    }
  }

  int main(int argc, char *argv[]) {
    int n = atoi(argv[1]);
    int result = fib(n);
    printf("Fibonacci of %d is %d.\n", n, result);
    return 0;
  }

Disclaimer: This recursive program is a poor way to compute the nth Fibonacci number, but it provides a good didactic example.

Fibonacci Execution

[Diagram: recursion tree for fib(4): fib(4) calls fib(3) and fib(2); fib(3) calls fib(2) and fib(1); each fib(2) calls fib(1) and fib(0).]

Key idea for parallelization: The calculations of fib(n-1) and fib(n-2) can be executed simultaneously without mutual interference.


Pthreads

∙ Standard API for threading, specified by ANSI/IEEE POSIX 1003.1-1995, §1003.1c.
∙ Do-it-yourself concurrency platform.
∙ Built as a library of functions with "special" non-C++ semantics.
∙ Each thread implements an abstraction of a processor; these abstractions are multiplexed onto machine resources.
∙ Threads communicate through shared memory.
∙ Library functions mask the protocols involved in interthread coordination.

WinAPI threads provide similar functionality.

Key Pthread Functions

  int pthread_create(
    pthread_t *thread,          // returned identifier for the new thread
    const pthread_attr_t *attr, // object to set thread attributes (NULL for default)
    void *(*func)(void *),      // routine executed after creation
    void *arg                   // a single argument passed to func
  );                            // returns error status

  int pthread_join(
    pthread_t thread,           // identifier of thread to wait for
    void **status               // terminating thread's status (NULL to ignore)
  );                            // returns error status

Pthread Implementation

  #include <stdio.h>
  #include <stdlib.h>
  #include <pthread.h>

  /* Original code. */
  int fib(int n) {
    if (n < 2) return n;
    else {
      int x = fib(n-1);
      int y = fib(n-2);
      return x + y;
    }
  }

  /* Structure for thread arguments. */
  typedef struct {
    int input;
    int output;
  } thread_args;

  /* Function called when the thread is created. */
  void *thread_func(void *ptr) {
    int i = ((thread_args *) ptr)->input;
    ((thread_args *) ptr)->output = fib(i);
    return NULL;
  }

  int main(int argc, char *argv[]) {
    pthread_t thread;
    thread_args args;
    int status;
    int result;
    if (argc < 2) return 1;
    int n = atoi(argv[1]);
    if (n < 30) {
      /* No point in creating a thread if there isn't enough work to do. */
      result = fib(n);
    } else {
      /* Marshal the input argument to the thread. */
      args.input = n-1;
      /* Create a thread to execute fib(n-1). */
      status = pthread_create(&thread, NULL, thread_func, (void*) &args);
      /* The main program can continue executing: it computes fib(n-2) in parallel. */
      result = fib(n-2);
      /* Block until the auxiliary thread finishes. */
      pthread_join(thread, NULL);
      /* Add the results together to produce the final output. */
      result += args.output;
    }
    printf("Fibonacci of %d is %d.\n", n, result);
    return 0;
  }

Issues with Pthreads

Overhead: The cost of creating a thread is >10^4 cycles ⇒ coarse-grained concurrency. (Thread pools can help; see the sketch below.)
Scalability: The Fibonacci code gets about a 1.5x speedup for 2 cores. Need a rewrite for more cores.
Modularity: The Fibonacci logic is no longer neatly encapsulated in the fib() function.
Code Simplicity: Programmers must marshal arguments (shades of 1958!) and engage in error-prone protocols in order to load-balance.
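
The slide only notes in passing that thread pools can help amortize the thread-creation cost. As a rough illustration only (not from the lecture, using C++11 std::thread rather than Pthreads; ThreadPool, submit, and the worker logic are invented names), a minimal pool that creates its workers once and reuses them might look like this:

  #include <condition_variable>
  #include <cstdio>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  class ThreadPool {
   public:
    explicit ThreadPool(unsigned nthreads) {
      // Pay the thread-creation cost once, up front.
      for (unsigned i = 0; i < nthreads; ++i)
        workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
      { std::lock_guard<std::mutex> lk(m_); done_ = true; }
      cv_.notify_all();
      for (auto& t : workers_) t.join();   // drains remaining tasks, then joins
    }
    void submit(std::function<void()> task) {
      { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
      cv_.notify_one();
    }
   private:
    void run() {
      for (;;) {
        std::function<void()> task;
        {
          std::unique_lock<std::mutex> lk(m_);
          cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
          if (done_ && tasks_.empty()) return;
          task = std::move(tasks_.front());
          tasks_.pop();
        }
        task();   // run the task outside the lock
      }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
  };

  int main() {
    ThreadPool pool(4);
    for (int i = 0; i < 8; ++i)
      pool.submit([i] { std::printf("task %d\n", i); });
    return 0;   // destructor drains the queue and joins the workers
  }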


Threading Building Blocks

∙ Developed by Intel.
∙ Implemented as a C++ library that runs on top of native threads.
∙ Programmer specifies tasks rather than threads.
∙ Tasks are automatically load balanced across the threads using work-stealing.
∙ Focus on performance.

[Image of book cover removed due to copyright restrictions.]
Reinders, James. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.

Fibonacci in TBB

The computation is organized as explicit tasks. FibTask has an input parameter n and an output parameter sum.

  class FibTask : public task {
  public:
    const long n;
    long* const sum;
    FibTask( long n_, long* sum_ )
      : n(n_), sum(sum_) {}
    // The execute() function performs the computation when the task is started.
    task* execute() {
      if( n < 2 ) {
        *sum = n;
      } else {
        long x, y;
        // Recursively create two child tasks, a and b.
        FibTask& a = *new( allocate_child() ) FibTask(n-1, &x);
        FibTask& b = *new( allocate_child() ) FibTask(n-2, &y);
        // Set the number of tasks to wait for (2 children + 1 implicit for bookkeeping).
        set_ref_count(3);
        // Start task b.
        spawn( b );
        // Start task a, and wait for both a and b to finish.
        spawn_and_wait_for_all( a );
        *sum = x+y;
      }
      return NULL;
    }
  };

Other TBB Features

∙ TBB provides many C++ templates to express common patterns simply (see the sketch below), such as
  - parallel_for for loop parallelism,
  - parallel_reduce for data aggregation,
  - pipeline and filter for software pipelining.
∙ TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently.
∙ TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates.
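
Not from the slides: a small sketch of how the parallel_for and parallel_reduce templates mentioned above are typically used. It assumes the lambda-based overloads available in later TBB releases (the 2009 lecture predates them); scale_and_sum, v, and c are invented names for illustration.

  #include <vector>
  #include <tbb/blocked_range.h>
  #include <tbb/parallel_for.h>
  #include <tbb/parallel_reduce.h>

  double scale_and_sum(std::vector<double>& v, double c) {
    // parallel_for: scale every element; iterations must be independent.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                        for (size_t i = r.begin(); i != r.end(); ++i)
                          v[i] *= c;
                      });
    // parallel_reduce: aggregate the elements into a single sum.
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()),
        0.0,
        [&](const tbb::blocked_range<size_t>& r, double partial) {
          for (size_t i = r.begin(); i != r.end(); ++i)
            partial += v[i];
          return partial;
        },
        [](double a, double b) { return a + b; });
  }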


OpenMP

∙ Specification produced by an industry consortium.
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
∙ Runs on top of native threads.
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism.

Fibonacci in OpenMP 3.0

  int fib(int n) {
    if (n < 2) return n;
    int x, y;
    // Compiler directive: the following statement is an independent task.
    // Sharing of memory is managed explicitly.
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    // Wait for the two tasks to complete before continuing.
    #pragma omp taskwait
    return x+y;
  }

Other OpenMP Features

∙ OpenMP provides many pragma directives to express common patterns (see the sketch below), such as
  - parallel for for loop parallelism,
  - reduction for data aggregation,
  - directives for scheduling and data sharing.
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
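
Not from the slides: a minimal sketch of the parallel for and reduction directives mentioned above, assuming a compiler with OpenMP enabled (e.g. a -fopenmp style flag); sum_of_squares and the array a are invented for illustration.

  #include <stdio.h>

  double sum_of_squares(const double *a, int n) {
    double sum = 0.0;
    // Each thread gets a chunk of iterations and a private partial sum;
    // the reduction clause combines the partial sums when the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
      sum += a[i] * a[i];
    }
    return sum;
  }

  int main(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    printf("sum of squares = %f\n", sum_of_squares(a, 4));
    return 0;
  }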


Cilk++

∙ Small set of linguistic extensions to C++ to support fork-join parallelism.
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009.
∙ Based on the award-winning Cilk multithreaded language developed at MIT.
∙ Features a provably efficient work-stealing scheduler.
∙ Provides a hyperobject library for parallelizing code with global variables.
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer.

Nested Parallelism in Cilk++

  int fib(int n) {
    if (n < 2) return n;
    int x, y;
    // The named child function may execute in parallel with the parent caller.
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    // Control cannot pass this point until all spawned children have returned.
    cilk_sync;
    return x+y;
  }

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

Loop Parallelism in Cilk++

Example: in-place matrix transpose. The n-by-n matrix A, with entries a11 through ann, is transposed into A^T by swapping each A[i][j] with A[j][i]. (Array indices run from 0, not 1.)

  // The iterations of a cilk_for loop execute in parallel.
  cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }

Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics.

Cilk++ source:
  int fib (int n) {
    if (n<2) return (n);
    else {
      int x,y;
      x = cilk_spawn fib(n-1);
      y = fib(n-2);
      cilk_sync;
      return (x+y);
    }
  }

Serialization:
  int fib (int n) {
    if (n<2) return (n);
    else {
      int x,y;
      x = fib(n-1);
      y = fib(n-2);
      return (x+y);
    }
  }

Remember: Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

To obtain the serialization:
  #define cilk_for for
  #define cilk_spawn
  #define cilk_sync
Or specify a switch to the Cilk++ compiler.

Scheduling

∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.

[Diagram: the fib() code with cilk_spawn and cilk_sync being scheduled onto a chip multiprocessor (processors P with caches $, a network, memory, and I/O).]

Cilk++ Platform

[Diagram: Cilk++ source is compiled by the Cilk++ compiler (built on a conventional compiler), linked with the hyperobject library, and run on the runtime system; the resulting binary is exercised by parallel regression tests and by the Cilkscreen race detector and Cilkview scalability analyzer, yielding reliable multi-threaded code and exceptional performance. The serialization of the same source, compiled conventionally and run through conventional regression tests, yields reliable single-threaded code.]


Race Bugs

Definition: A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:
  int x = 0;
  cilk_for(int i=0; i<2; ++i) {
    x++;
  }
  assert(x == 2);

[Dependency graph: A (int x = 0) forks into B (x++) and C (x++), which join at D (assert(x == 2)). B and C are logically parallel, and both write x.]

A Closer Look

The increment x++ is really a load, an increment, and a store, so the two parallel branches B and C execute:

  A:  int x = 0;
  B:  r1 = x;  r1++;  x = r1;
  C:  r2 = x;  r2++;  x = r2;
  D:  assert(x == 2);

[Execution trace: in an interleaving such as r1 = x; r2 = x; r1++; r2++; x = r1; x = r2, both registers read x = 0 and both store 1, so x ends up 1 and the assertion fails.]

Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A      B      Race Type
  read   read   none
  read   write  read race
  write  read   read race
  write  write  write race

Two sections of code are independent if they have no determinacy races between them.

Avoiding Races

∙ Iterations of a cilk_for should be independent.
∙ Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
  Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.
∙ Machine word size matters. Watch out for races in packed data structures.

  struct {
    char a;
    char b;
  } x;

  Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because it may depend on the compiler optimization level. (Safe on Intel.)
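
The lecture's own remedy for shared updates like the x++ example is Cilk++'s hyperobject (reducer) library, or restructuring the code so the parallel strands are independent. Purely as a generic illustration (not from the slides, using C++11 std::thread and std::atomic), the sketch below shows the racy increment and one conventional fix that makes each increment an indivisible read-modify-write:

  #include <atomic>
  #include <cassert>
  #include <thread>

  int main() {
    // Racy version: two logically parallel increments of a plain int.
    // This is a data race (undefined behavior in C++); the load/increment/store
    // sequences can interleave and lose an update, leaving x == 1.
    int x = 0;
    std::thread b1([&] { x++; });   // analogous to branch B
    std::thread c1([&] { x++; });   // analogous to branch C
    b1.join();
    c1.join();

    // Fixed version: std::atomic makes each increment atomic, so no
    // update is lost and the assertion always holds.
    std::atomic<int> y{0};
    std::thread b2([&] { y++; });
    std::thread c2([&] { y++; });
    b2.join();
    c2.join();
    assert(y == 2);
    return 0;
  }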

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit http://ocw.mit.edu/terms.

Page 2: Performance Multicore Programming - DSpace@MIT: · PDF filePerformance Multicore ... access and update items in the container concurrently ... and proprietary, including gcc and Visual

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 2

Memory IO

Multicore Architecture

Network

y

copy 2009 Charles E Leiserson 5

hellip$ $ $

Chip Multiprocessor (CMP)

PPP

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 6

Cilk++

bullRace Conditions

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 7

Cilk++

bullRace Conditions

x=3

Cache Coherence

copy 2009 Charles E Leiserson 8

hellipP P P

Load x x=3

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 9

hellipP P P

Load x x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 10

hellipP P P

Load x x=3 x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 11

hellipP P P

Store x=5 x=3 x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 12

hellipP P P

Store x=5 x=3 x=5 x=3

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 4

x=3

Cache Coherence

copy 2009 Charles E Leiserson 13

hellipP P P

Load x x=3 x=5 x=3

MSI Protocol

Each cache line is labeled with a statebull M cache block has been modified No other

caches contain this block in M or S states

M x=13S y=17

bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)

S y=17I x=4 I x=12

S y=17

copy 2009 Charles E Leiserson 14

I z=8 M z=7 I z=3

Before a cache modifies a location the hardware first invalidates all other copies

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 15

Cilk++

bullRace Conditions

Concurrency Platforms

bull Programming directly on processor cores is painful and error-prone

bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing

bull Examplesbull Pthreads and WinAPI threads

copy 2009 Charles E Leiserson 16

Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 5

Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two

The sequence is named after Leonardo di Pisa (1170ndash

RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1

copy 2009 Charles E Leiserson 17

1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians

Fibonacci Program

include ltstdiohgtinclude ltstdlibhgt

int fib(int n)

DisclaimerThis recursive program is a poor

if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y

int main(int argc char argv[])

way to compute the nth Fibonacci number but it provides a good didactic example

copy 2009 Charles E Leiserson 18

int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0

Fibonacci Executionfib(4)

fib(3) fib(2)

fib(2)

fib(1) fib(0)

fib(1) fib(1) fib(0)

Key idea for parallelization

int fib(int n)if (n lt 2) return nelse

copy 2009 Charles E Leiserson 19

Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference

else int x = fib(n-1)int y = fib(n-2)return x + y

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 20

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 6

Pthreads∙Standard API for threading specified by

ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform

B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of

a processor which are multiplexed onto machine resources∙Threads communicate though shared

copy 2009 Charles E Leiserson 21

Threads communicate though shared memory∙Library functions mask the protocols

involved in interthread coordinationWinAPI threads provide similar functionality

Key Pthread Functions

int pthread_create(pthread_t thread

returned identifier for the new thread const pthread_attr_t attr

b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )

routine executed after creation void arg

a single argument passed to func) returns error status

i h d j i (

copy 2009 Charles E Leiserson 22

int pthread_join (pthread_t thread

identifier of thread to wait forvoid status

terminating threadrsquos status (NULL to ignore)) returns error status

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args argselse

int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

copy 2009 Charles E Leiserson 23

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args args

Original code

Structure

No point in creating thread if there isnrsquot

enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

Structure for thread

argumentsMarshal input argument to

thread

Main program executes

fib(nndash2) in parallel

Block until auxiliary thread

copy 2009 Charles E Leiserson 24

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Function called when

thread is created

Create thread to execute fib(nndash1)

parallelfinishes

Add results together to produce final output

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 7

Issues with Pthreads

Overhead The cost of creating a thread gt104

cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)

Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores

Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function

copy 2009 Charles E Leiserson 25

Code Simplicity

Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 26

Cilk++

bullRace Conditions

Threading Building Blocks

∙Developed by Intel∙ Implemented as a C++p

library that runs on top of native threads∙Programmer specifies

tasks rather than threads∙Tasks are automatically

l d b l d th

Image of book cover removed due to copyright restrictions

Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor

copy 2009 Charles E Leiserson 27

load balanced across the threads using work-stealing∙Focus on performance

Multi Core Processor Parallelism OReilly 2007

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum ) n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

copy 2009 Charles E Leiserson 28

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 8

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

A computation organized as explicit tasks

FibTask has an input parameter n and an output parameter sum

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

The execute() function performs the

computation when the task is started

Recursively create two child tasks

copy 2009 Charles E Leiserson 29

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

child tasks a and b

Set the number of tasks to wait for (2 children + 1

implicit for bookkeeping)Start task b

Start task a and wait for both a and b to finish

Other TBB Features

∙TBB provides many C++ templates to express common patterns simply such as

parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining

∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently

copy 2009 Charles E Leiserson 30

concurrently∙TBB also provides a variety of mutual-

exclusion library functions including locks and atomic updates

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 31

Cilk++

bullRace Conditions

OpenMP

∙Specification produced by an industry consortium∙Several compilers available both open-source

and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in

the form of compiler pragmas∙Runs on top of native threads

Supports loop parallelism and more recently

copy 2009 Charles E Leiserson 32

∙Supports loop parallelism and more recently in Version 30 task parallelism

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 9

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x yint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

copy 2009 Charles E Leiserson 33

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x y

Compiler directiveint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

The following statement is an

independent task

Sharing of memory is

copy 2009 Charles E Leiserson 34

managed explicitly

Wait for the two tasks to complete before continuing

Other OpenMP Features

∙OpenMP provides many pragma directivesto express common patterns such as

parallel for for loop parallelismparallel for for loop parallelismreduction for data aggregationdirectives for scheduling and data sharing

∙OpenMP provides a variety of synchroni-zation constructs such as barriers atomic updates and mutual-exclusion locks

copy 2009 Charles E Leiserson 35

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 36

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 10

Cilk++

∙Small set of linguistic extensions to C++ to support fork-join parallelism∙Developed by CILK ARTS an MIT spin-off∙Developed by CILK ARTS an MIT spin-off

which was acquired by Intel in July 2009 ∙Based on the award-winning Cilk

multithreaded language developed at MIT∙Features a provably efficient work-stealing

scheduler

copy 2009 Charles E Leiserson 37

∙Provides a hyperobject library for parallelizing code with global variables∙ Includes the Cilkscreen race detector and

Cilkview scalability analyzer

Nested Parallelism in Cilk++

int fib(int n)

if (n lt 2) return n

int fib(int n)

if (n lt 2) return n

The named childfunction may execute in parallel with the

llif (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

if (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

parent caller

Control cannot pass this Control cannot pass this point until all spawned h ld h d

copy 2009 Charles E Leiserson 38

children have returned

Cilk++ keywords grant permission for parallel execution They do not command parallel execution

Loop Parallelism in Cilk++

ExampleIn-place matrix

a11 a12 ⋯ a1n

a21 a22 ⋯ a2n

⋮ ⋮ ⋱ ⋮

a11 a21 ⋯ an1

a12 a22 ⋯ an2

⋮ ⋮ ⋱ ⋮

The iterations of a cilk for

indices run from 0 not 1cilk_for (int i=1 iltn ++i)

matrix transpose

⋮ ⋮ ⋱ ⋮an1 an2 ⋯ ann

⋮ ⋮ ⋱ ⋮a1n a2n ⋯ ann

A AT

copy 2009 Charles E Leiserson 39

of a cilk_forloop execute in parallel

for (int j=0 jlti ++j) double temp = A[i][j]A[i][j] = A[j][i]A[j][i] = temp

Serial Semantics

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_sync

( )

The C++ serialization of a Cilk++ program is always a legal interpretation of the

Cilk++ source

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

return (x+y)

gprogramrsquos semantics

Remember Cilk++ keywords grant permission for parallel execution They do not command parallel execution

copy 2009 Charles E Leiserson 40

To obtain the serializationdefine cilk_for fordefine cilk_spawndefine cilk_sync

Or specify a switch to the Cilk++ compiler

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 11

Scheduling

∙The Cilk++ concurrency platform allows the programmer to express

t ti l ll li i

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)potential parallelism in

an application∙The Cilk++ scheduler

maps the executing program onto the processor cores d i ll i

Memory IO

cilk_syncreturn (x+y)

copy 2009 Charles E Leiserson 41

dynamically at runtime∙Cilk++rsquos work-stealing

scheduler is provably efficient

Network

hellipPP P P$ $ $

Cilk++Compiler

Conventional Compiler

HyperobjectLibrary1

2 3int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn (x+y)

Cilk++ Platform

Cilk++source

BinaryBinary CilkscreenRace Detector

Linker

5

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

CilkviewScalability Analyzer

6

copy 2009 Charles E Leiserson 42

Conventional Regression Tests

Reliable Single-Threaded Code

Exceptional Performance

Reliable Multi-Threaded Code

Parallel Regression Tests

Runtime System

4

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 43

Cilk++

bullRace Conditions

Race Bugs

Definition A determinacy race occurs when two logically parallel instructions access the same memory location and at least one ofsame memory location and at least one of the instructions performs a write

int x = 0cilk_for(int i=0 ilt2 ++i)

x++

A

B C x++

int x = 0

x++

A

B C

Example

copy 2009 Charles E Leiserson 44

x++assert(x == 2)

B C

D

x++

assert(x == 2)

x++B C

D

Dependency Graph

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 12

A Closer Look

1 2

x = 01

2 4int x = 0

A

r1 = x

r1++

x = r1

r2 = x

r2++

x = r2

2

3

4

5

67

x++

int x 0

assert(x == 2)

x++B C

D

copy 2009 Charles E Leiserson 45

assert(x == 2)8

x

r1

r2

0001 0 01111 1

Types of Races

Suppose that instruction A and instruction Bboth access a location x and suppose that A∥B (A is parallel to B)

A B Race Typeread read noneread write read racewrite read read race

A∥B (A is parallel to B)

copy 2009 Charles E Leiserson 46

write read read racewrite write write race

Two sections of code are independent if they have no determinacy races between them

Avoiding Racesbull Iterations of a cilk_for should be independentbull Between a cilk_spawn and the corresponding cilk_sync the code of the spawned child should be independent of the code of the parent includingbe independent of the code of the parent including code executed by additional spawned or called childrenNote The arguments to a spawned function are evaluated in the parent before the spawn occurs

bull Machine word size matters Watch out for races in packed data structures

copy 2009 Charles E Leiserson 47

packed data structures

structchar achar b

x

structchar achar b

x

Ex Updating xa and xb in parallel may cause a race Nasty because it may depend on the compiler optimization level (Safe on Intel)

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization the race detector guarantees to report and localize thedetector guarantees to report and localize the offending race

∙ Employs a regression-test methodology where the programmer provides test inputs

∙ Identifies filenames lines and variables involved in races including stack traces

copy 2009 Charles E Leiserson 48

g∙ Runs off the binary executable using dynamic

instrumentation ∙ Runs about 20 times slower than real-time

MIT OpenCourseWarehttpocwmitedu

6172 Performance Engineering of Software Systems Fall 2009

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

Page 3: Performance Multicore Programming - DSpace@MIT: · PDF filePerformance Multicore ... access and update items in the container concurrently ... and proprietary, including gcc and Visual

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 9

hellipP P P

Load x x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 10

hellipP P P

Load x x=3 x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 11

hellipP P P

Store x=5 x=3 x=3 x=3

x=3

Cache Coherence

copy 2009 Charles E Leiserson 12

hellipP P P

Store x=5 x=3 x=5 x=3

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 4

x=3

Cache Coherence

copy 2009 Charles E Leiserson 13

hellipP P P

Load x x=3 x=5 x=3

MSI Protocol

Each cache line is labeled with a statebull M cache block has been modified No other

caches contain this block in M or S states

M x=13S y=17

bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)

S y=17I x=4 I x=12

S y=17

copy 2009 Charles E Leiserson 14

I z=8 M z=7 I z=3

Before a cache modifies a location the hardware first invalidates all other copies

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 15

Cilk++

bullRace Conditions

Concurrency Platforms

bull Programming directly on processor cores is painful and error-prone

bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing

bull Examplesbull Pthreads and WinAPI threads

copy 2009 Charles E Leiserson 16

Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 5

Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two

The sequence is named after Leonardo di Pisa (1170ndash

RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1

copy 2009 Charles E Leiserson 17

1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians

Fibonacci Program

include ltstdiohgtinclude ltstdlibhgt

int fib(int n)

DisclaimerThis recursive program is a poor

if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y

int main(int argc char argv[])

way to compute the nth Fibonacci number but it provides a good didactic example

copy 2009 Charles E Leiserson 18

int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0

Fibonacci Executionfib(4)

fib(3) fib(2)

fib(2)

fib(1) fib(0)

fib(1) fib(1) fib(0)

Key idea for parallelization

int fib(int n)if (n lt 2) return nelse

copy 2009 Charles E Leiserson 19

Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference

else int x = fib(n-1)int y = fib(n-2)return x + y

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 20

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 6

Pthreads∙Standard API for threading specified by

ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform

B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of

a processor which are multiplexed onto machine resources∙Threads communicate though shared

copy 2009 Charles E Leiserson 21

Threads communicate though shared memory∙Library functions mask the protocols

involved in interthread coordinationWinAPI threads provide similar functionality

Key Pthread Functions

int pthread_create(pthread_t thread

returned identifier for the new thread const pthread_attr_t attr

b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )

routine executed after creation void arg

a single argument passed to func) returns error status

i h d j i (

copy 2009 Charles E Leiserson 22

int pthread_join (pthread_t thread

identifier of thread to wait forvoid status

terminating threadrsquos status (NULL to ignore)) returns error status

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args argselse

int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

copy 2009 Charles E Leiserson 23

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args args

Original code

Structure

No point in creating thread if there isnrsquot

enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

Structure for thread

argumentsMarshal input argument to

thread

Main program executes

fib(nndash2) in parallel

Block until auxiliary thread

copy 2009 Charles E Leiserson 24

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Function called when

thread is created

Create thread to execute fib(nndash1)

parallelfinishes

Add results together to produce final output

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 7

Issues with Pthreads

Overhead The cost of creating a thread gt104

cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)

Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores

Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function

copy 2009 Charles E Leiserson 25

Code Simplicity

Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 26

Cilk++

bullRace Conditions

Threading Building Blocks

∙Developed by Intel∙ Implemented as a C++p

library that runs on top of native threads∙Programmer specifies

tasks rather than threads∙Tasks are automatically

l d b l d th

Image of book cover removed due to copyright restrictions

Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor

copy 2009 Charles E Leiserson 27

load balanced across the threads using work-stealing∙Focus on performance

Multi Core Processor Parallelism OReilly 2007

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum ) n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

copy 2009 Charles E Leiserson 28

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 8

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

A computation organized as explicit tasks

FibTask has an input parameter n and an output parameter sum

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

The execute() function performs the

computation when the task is started

Recursively create two child tasks

copy 2009 Charles E Leiserson 29

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

child tasks a and b

Set the number of tasks to wait for (2 children + 1

implicit for bookkeeping)Start task b

Start task a and wait for both a and b to finish

Other TBB Features

∙TBB provides many C++ templates to express common patterns simply such as

parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining

∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently

copy 2009 Charles E Leiserson 30

concurrently∙TBB also provides a variety of mutual-

exclusion library functions including locks and atomic updates

OUTLINE

• Shared-Memory Hardware
• Concurrency Platforms
  – Pthreads (and WinAPI Threads)
  – Threading Building Blocks
  – OpenMP
  – Cilk++
• Race Conditions

OpenMP

∙ Specification produced by an industry consortium.
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
∙ Runs on top of native threads.
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism.


Fibonacci in OpenMP 3.0

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)
  x = fib(n - 1);
  #pragma omp task shared(y)
  y = fib(n - 2);
  #pragma omp taskwait
  return x+y;
}

Fibonacci in OpenMP 3.0, annotated:
∙ #pragma omp task is a compiler directive stating that the following statement is an independent task.
∙ Sharing of memory is managed explicitly with the shared(...) clauses.
∙ #pragma omp taskwait waits for the two tasks to complete before continuing.
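OpenMP tasks are created inside a parallel region, typically from a single thread of the team. A minimal driver sketch, not from the slides, that calls the fib() above:

#include <stdio.h>
#include <stdlib.h>

int fib(int n);   /* the task-parallel version defined above */

/* Compile with OpenMP enabled, e.g. gcc -fopenmp. */
int main(int argc, char *argv[]) {
  if (argc < 2) return 1;
  int n = atoi(argv[1]);
  int result;
  #pragma omp parallel
  {
    #pragma omp single   /* one thread creates the root tasks */
    result = fib(n);
  }                      /* implicit barrier: all tasks are done here */
  printf("Fibonacci of %d is %d\n", n, result);
  return 0;
}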

Other OpenMP Features

∙ OpenMP provides many pragma directives to express common patterns, such as:
  – parallel for for loop parallelism,
  – reduction for data aggregation,
  – directives for scheduling and data sharing.
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
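For example, the loop and reduction directives combine to sum an array without a data race; this small sketch is not from the lecture:

/* Each thread accumulates a private partial sum; the reduction
   clause combines the partial sums when the loop finishes. */
double sum_array(const double *a, int n) {
  double sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}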

OUTLINE

• Shared-Memory Hardware
• Concurrency Platforms
  – Pthreads (and WinAPI Threads)
  – Threading Building Blocks
  – OpenMP
  – Cilk++
• Race Conditions


Cilk++

∙ Small set of linguistic extensions to C++ to support fork-join parallelism.
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009.
∙ Based on the award-winning Cilk multithreaded language developed at MIT.
∙ Features a provably efficient work-stealing scheduler.
∙ Provides a hyperobject library for parallelizing code with global variables.
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer.

Nested Parallelism in Cilk++

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);   // the named child function may execute in
                             // parallel with the parent caller
  y = fib(n-2);
  cilk_sync;                 // control cannot pass this point until all
                             // spawned children have returned
  return x+y;
}

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

Loop Parallelism in Cilk++

Example: in-place matrix transpose, turning A = (aij) into AT. (The slide's matrix figure is omitted.) The iterations of a cilk_for loop execute in parallel; array indices run from 0, not 1, unlike the mathematical subscripts.

cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}
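For context, the loop can be wrapped as a self-contained function, sketched below under the assumption of a Cilk++-style compiler where cilk_for is a keyword; the function name transpose is illustrative:

// Transposes the n-by-n matrix A in place.  Outer iteration i touches
// only A[i][j] and A[j][i] for j < i, so distinct iterations access
// disjoint elements and are independent, as cilk_for requires.
void transpose(double **A, int n) {
  cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}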

Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics. Remember: Cilk++ keywords grant permission for parallel execution; they do not command parallel execution.

Cilk++ source:
int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

Serialization:
int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

To obtain the serialization:
#define cilk_for for
#define cilk_spawn
#define cilk_sync
Or specify a switch to the Cilk++ compiler.
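Those three definitions can live in a small stub header so the same source compiles as ordinary serial C++; a minimal sketch follows (the file name cilk_stub.h is illustrative):

/* cilk_stub.h -- include this, or pass the Cilk++ compiler's
   serialization switch, to elide the Cilk++ keywords. */
#ifndef CILK_STUB_H
#define CILK_STUB_H
#define cilk_spawn
#define cilk_sync
#define cilk_for for
#endif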


Scheduling

∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.

[The slide pairs the fib code above with the chip-multiprocessor diagram: processors with private caches connected by a network to memory and I/O.]

Cilk++ Platform

[Figure: the Cilk++ tool flow. It connects the Cilk++ source, the Cilk++ compiler (built on a conventional compiler), the hyperobject library, the linker, the runtime system, and the resulting binary on one path, and the serialization of the same source compiled by a conventional compiler on the other. Conventional regression tests yield reliable single-threaded code; parallel regression tests, together with the Cilkscreen race detector and the Cilkview scalability analyzer, yield reliable multi-threaded code and exceptional performance.]

OUTLINE

• Shared-Memory Hardware
• Concurrency Platforms
  – Pthreads (and WinAPI Threads)
  – Threading Building Blocks
  – OpenMP
  – Cilk++
• Race Conditions

Race Bugs

Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: A (int x = 0) precedes the two logically parallel increments B and C (each x++), both of which precede D (assert(x == 2)). B and C race on x.


A Closer Look

Each x++ is really a load, an increment, and a store, so the example expands to:

A:  int x = 0;
B:  r1 = x;  r1++;  x = r1;
C:  r2 = x;  r2++;  x = r2;
D:  assert(x == 2);

B and C are logically parallel. In the interleaving where both loads run before either store (r1 = x; r2 = x; r1++; r2++; x = r1; x = r2), both registers read 0 and both store 1, so x ends up 1 and the assertion fails.

Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

A      B      Race Type
read   read   none
read   write  read race
write  read   read race
write  write  write race

Two sections of code are independent if they have no determinacy races between them.

Avoiding Races
• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. (Note: the arguments to a spawned function are evaluated in the parent before the spawn occurs.)
• Machine word size matters. Watch out for races in packed data structures, e.g.:

struct {
  char a;
  char b;
} x;

Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because it may depend on the compiler optimization level. (Safe on Intel.)
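A related remedy for the racy summation pattern in the earlier cilk_for example is the hyperobject library mentioned above: a reducer gives each strand its own view of the accumulator and combines the views deterministically. The sketch below is hedged: the header path and the exact names shown (cilk::reducer_opadd, get_value()) follow my recollection of the Cilk++/Cilk Plus documentation and may differ between releases.

#include <reducer_opadd.h>   // <cilk/reducer_opadd.h> in later Cilk Plus releases

int count_in_parallel(int n) {
  cilk::reducer_opadd<int> x;    // each strand accumulates a private view
  cilk_for (int i = 0; i < n; ++i)
    x += 1;                      // no determinacy race on x
  return x.get_value();          // views combined as in the serial order
}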

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 4: Performance Multicore Programming - DSpace@MIT: · PDF filePerformance Multicore ... access and update items in the container concurrently ... and proprietary, including gcc and Visual

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 4

x=3

Cache Coherence

copy 2009 Charles E Leiserson 13

hellipP P P

Load x x=3 x=5 x=3

MSI Protocol

Each cache line is labeled with a statebull M cache block has been modified No other

caches contain this block in M or S states

M x=13S y=17

bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)

S y=17I x=4 I x=12

S y=17

copy 2009 Charles E Leiserson 14

I z=8 M z=7 I z=3

Before a cache modifies a location the hardware first invalidates all other copies

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 15

Cilk++

bullRace Conditions

Concurrency Platforms

bull Programming directly on processor cores is painful and error-prone

bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing

bull Examplesbull Pthreads and WinAPI threads

copy 2009 Charles E Leiserson 16

Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 5

Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two

The sequence is named after Leonardo di Pisa (1170ndash

RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1

copy 2009 Charles E Leiserson 17

1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians

Fibonacci Program

include ltstdiohgtinclude ltstdlibhgt

int fib(int n)

DisclaimerThis recursive program is a poor

if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y

int main(int argc char argv[])

way to compute the nth Fibonacci number but it provides a good didactic example

copy 2009 Charles E Leiserson 18

int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0

Fibonacci Executionfib(4)

fib(3) fib(2)

fib(2)

fib(1) fib(0)

fib(1) fib(1) fib(0)

Key idea for parallelization

int fib(int n)if (n lt 2) return nelse

copy 2009 Charles E Leiserson 19

Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference

else int x = fib(n-1)int y = fib(n-2)return x + y

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 20

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 6

Pthreads∙Standard API for threading specified by

ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform

B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of

a processor which are multiplexed onto machine resources∙Threads communicate though shared

copy 2009 Charles E Leiserson 21

Threads communicate though shared memory∙Library functions mask the protocols

involved in interthread coordinationWinAPI threads provide similar functionality

Key Pthread Functions

int pthread_create(pthread_t thread

returned identifier for the new thread const pthread_attr_t attr

b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )

routine executed after creation void arg

a single argument passed to func) returns error status

i h d j i (

copy 2009 Charles E Leiserson 22

int pthread_join (pthread_t thread

identifier of thread to wait forvoid status

terminating threadrsquos status (NULL to ignore)) returns error status

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args argselse

int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

copy 2009 Charles E Leiserson 23

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args args

Original code

Structure

No point in creating thread if there isnrsquot

enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

Structure for thread

argumentsMarshal input argument to

thread

Main program executes

fib(nndash2) in parallel

Block until auxiliary thread

copy 2009 Charles E Leiserson 24

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Function called when

thread is created

Create thread to execute fib(nndash1)

parallelfinishes

Add results together to produce final output

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 7

Issues with Pthreads

Overhead The cost of creating a thread gt104

cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)

Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores

Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function

copy 2009 Charles E Leiserson 25

Code Simplicity

Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 26

Cilk++

bullRace Conditions

Threading Building Blocks

∙Developed by Intel∙ Implemented as a C++p

library that runs on top of native threads∙Programmer specifies

tasks rather than threads∙Tasks are automatically

l d b l d th

Image of book cover removed due to copyright restrictions

Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor

copy 2009 Charles E Leiserson 27

load balanced across the threads using work-stealing∙Focus on performance

Multi Core Processor Parallelism OReilly 2007

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum ) n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

copy 2009 Charles E Leiserson 28

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 8

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

A computation organized as explicit tasks

FibTask has an input parameter n and an output parameter sum

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

The execute() function performs the

computation when the task is started

Recursively create two child tasks

copy 2009 Charles E Leiserson 29

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

child tasks a and b

Set the number of tasks to wait for (2 children + 1

implicit for bookkeeping)Start task b

Start task a and wait for both a and b to finish

Other TBB Features

∙TBB provides many C++ templates to express common patterns simply such as

parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining

∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently

copy 2009 Charles E Leiserson 30

concurrently∙TBB also provides a variety of mutual-

exclusion library functions including locks and atomic updates

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 31

Cilk++

bullRace Conditions

OpenMP

∙Specification produced by an industry consortium∙Several compilers available both open-source

and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in

the form of compiler pragmas∙Runs on top of native threads

Supports loop parallelism and more recently

copy 2009 Charles E Leiserson 32

∙Supports loop parallelism and more recently in Version 30 task parallelism

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 9

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x yint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

copy 2009 Charles E Leiserson 33

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x y

Compiler directiveint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

The following statement is an

independent task

Sharing of memory is

copy 2009 Charles E Leiserson 34

managed explicitly

Wait for the two tasks to complete before continuing

Other OpenMP Features

∙OpenMP provides many pragma directivesto express common patterns such as

parallel for for loop parallelismparallel for for loop parallelismreduction for data aggregationdirectives for scheduling and data sharing

∙OpenMP provides a variety of synchroni-zation constructs such as barriers atomic updates and mutual-exclusion locks

copy 2009 Charles E Leiserson 35

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 36

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 10

Cilk++

∙Small set of linguistic extensions to C++ to support fork-join parallelism∙Developed by CILK ARTS an MIT spin-off∙Developed by CILK ARTS an MIT spin-off

which was acquired by Intel in July 2009 ∙Based on the award-winning Cilk

multithreaded language developed at MIT∙Features a provably efficient work-stealing

scheduler

copy 2009 Charles E Leiserson 37

∙Provides a hyperobject library for parallelizing code with global variables∙ Includes the Cilkscreen race detector and

Cilkview scalability analyzer

Nested Parallelism in Cilk++

int fib(int n)

if (n lt 2) return n

int fib(int n)

if (n lt 2) return n

The named childfunction may execute in parallel with the

llif (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

if (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

parent caller

Control cannot pass this Control cannot pass this point until all spawned h ld h d

copy 2009 Charles E Leiserson 38

children have returned

Cilk++ keywords grant permission for parallel execution They do not command parallel execution

Loop Parallelism in Cilk++

ExampleIn-place matrix

a11 a12 ⋯ a1n

a21 a22 ⋯ a2n

⋮ ⋮ ⋱ ⋮

a11 a21 ⋯ an1

a12 a22 ⋯ an2

⋮ ⋮ ⋱ ⋮

The iterations of a cilk for

indices run from 0 not 1cilk_for (int i=1 iltn ++i)

matrix transpose

⋮ ⋮ ⋱ ⋮an1 an2 ⋯ ann

⋮ ⋮ ⋱ ⋮a1n a2n ⋯ ann

A AT

copy 2009 Charles E Leiserson 39

of a cilk_forloop execute in parallel

for (int j=0 jlti ++j) double temp = A[i][j]A[i][j] = A[j][i]A[j][i] = temp

Serial Semantics

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_sync

( )

The C++ serialization of a Cilk++ program is always a legal interpretation of the

Cilk++ source

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

return (x+y)

gprogramrsquos semantics

Remember Cilk++ keywords grant permission for parallel execution They do not command parallel execution

copy 2009 Charles E Leiserson 40

To obtain the serializationdefine cilk_for fordefine cilk_spawndefine cilk_sync

Or specify a switch to the Cilk++ compiler

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 11

Scheduling

∙The Cilk++ concurrency platform allows the programmer to express

t ti l ll li i

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)potential parallelism in

an application∙The Cilk++ scheduler

maps the executing program onto the processor cores d i ll i

Memory IO

cilk_syncreturn (x+y)

copy 2009 Charles E Leiserson 41

dynamically at runtime∙Cilk++rsquos work-stealing

scheduler is provably efficient

Network

hellipPP P P$ $ $

Cilk++Compiler

Conventional Compiler

HyperobjectLibrary1

2 3int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn (x+y)

Cilk++ Platform

Cilk++source

BinaryBinary CilkscreenRace Detector

Linker

5

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

CilkviewScalability Analyzer

6

copy 2009 Charles E Leiserson 42

Conventional Regression Tests

Reliable Single-Threaded Code

Exceptional Performance

Reliable Multi-Threaded Code

Parallel Regression Tests

Runtime System

4

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 43

Cilk++

bullRace Conditions

Race Bugs

Definition A determinacy race occurs when two logically parallel instructions access the same memory location and at least one ofsame memory location and at least one of the instructions performs a write

int x = 0cilk_for(int i=0 ilt2 ++i)

x++

A

B C x++

int x = 0

x++

A

B C

Example

copy 2009 Charles E Leiserson 44

x++assert(x == 2)

B C

D

x++

assert(x == 2)

x++B C

D

Dependency Graph

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 12

A Closer Look

1 2

x = 01

2 4int x = 0

A

r1 = x

r1++

x = r1

r2 = x

r2++

x = r2

2

3

4

5

67

x++

int x 0

assert(x == 2)

x++B C

D

copy 2009 Charles E Leiserson 45

assert(x == 2)8

x

r1

r2

0001 0 01111 1

Types of Races

Suppose that instruction A and instruction Bboth access a location x and suppose that A∥B (A is parallel to B)

A B Race Typeread read noneread write read racewrite read read race

A∥B (A is parallel to B)

copy 2009 Charles E Leiserson 46

write read read racewrite write write race

Two sections of code are independent if they have no determinacy races between them

Avoiding Racesbull Iterations of a cilk_for should be independentbull Between a cilk_spawn and the corresponding cilk_sync the code of the spawned child should be independent of the code of the parent includingbe independent of the code of the parent including code executed by additional spawned or called childrenNote The arguments to a spawned function are evaluated in the parent before the spawn occurs

bull Machine word size matters Watch out for races in packed data structures

copy 2009 Charles E Leiserson 47

packed data structures

structchar achar b

x

structchar achar b

x

Ex Updating xa and xb in parallel may cause a race Nasty because it may depend on the compiler optimization level (Safe on Intel)

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization the race detector guarantees to report and localize thedetector guarantees to report and localize the offending race

∙ Employs a regression-test methodology where the programmer provides test inputs

∙ Identifies filenames lines and variables involved in races including stack traces

copy 2009 Charles E Leiserson 48

g∙ Runs off the binary executable using dynamic

instrumentation ∙ Runs about 20 times slower than real-time

MIT OpenCourseWarehttpocwmitedu

6172 Performance Engineering of Software Systems Fall 2009

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

Page 5: Performance Multicore Programming - DSpace@MIT: · PDF filePerformance Multicore ... access and update items in the container concurrently ... and proprietary, including gcc and Visual

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 5

Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two

The sequence is named after Leonardo di Pisa (1170ndash

RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1

copy 2009 Charles E Leiserson 17

1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians

Fibonacci Program

include ltstdiohgtinclude ltstdlibhgt

int fib(int n)

DisclaimerThis recursive program is a poor

if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y

int main(int argc char argv[])

way to compute the nth Fibonacci number but it provides a good didactic example

copy 2009 Charles E Leiserson 18

int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0

Fibonacci Executionfib(4)

fib(3) fib(2)

fib(2)

fib(1) fib(0)

fib(1) fib(1) fib(0)

Key idea for parallelization

int fib(int n)if (n lt 2) return nelse

copy 2009 Charles E Leiserson 19

Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference

else int x = fib(n-1)int y = fib(n-2)return x + y

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 20

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 6

Pthreads∙Standard API for threading specified by

ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform

B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of

a processor which are multiplexed onto machine resources∙Threads communicate though shared

copy 2009 Charles E Leiserson 21

Threads communicate though shared memory∙Library functions mask the protocols

involved in interthread coordinationWinAPI threads provide similar functionality

Key Pthread Functions

int pthread_create(pthread_t thread

returned identifier for the new thread const pthread_attr_t attr

b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )

routine executed after creation void arg

a single argument passed to func) returns error status

i h d j i (

copy 2009 Charles E Leiserson 22

int pthread_join (pthread_t thread

identifier of thread to wait forvoid status

terminating threadrsquos status (NULL to ignore)) returns error status

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args argselse

int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

copy 2009 Charles E Leiserson 23

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args args

Original code

Structure

No point in creating thread if there isnrsquot

enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

Structure for thread

argumentsMarshal input argument to

thread

Main program executes

fib(nndash2) in parallel

Block until auxiliary thread

copy 2009 Charles E Leiserson 24

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Function called when

thread is created

Create thread to execute fib(nndash1)

parallelfinishes

Add results together to produce final output

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 7

Issues with Pthreads

Overhead The cost of creating a thread gt104

cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)

Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores

Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function

copy 2009 Charles E Leiserson 25

Code Simplicity

Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 26

Cilk++

bullRace Conditions

Threading Building Blocks

∙Developed by Intel∙ Implemented as a C++p

library that runs on top of native threads∙Programmer specifies

tasks rather than threads∙Tasks are automatically

l d b l d th

Image of book cover removed due to copyright restrictions

Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor

copy 2009 Charles E Leiserson 27

load balanced across the threads using work-stealing∙Focus on performance

Multi Core Processor Parallelism OReilly 2007

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum ) n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

copy 2009 Charles E Leiserson 28

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 8

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

A computation organized as explicit tasks

FibTask has an input parameter n and an output parameter sum

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

The execute() function performs the

computation when the task is started

Recursively create two child tasks

copy 2009 Charles E Leiserson 29

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

child tasks a and b

Set the number of tasks to wait for (2 children + 1

implicit for bookkeeping)Start task b

Start task a and wait for both a and b to finish

Other TBB Features

∙TBB provides many C++ templates to express common patterns simply such as

parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining

∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently

copy 2009 Charles E Leiserson 30

concurrently∙TBB also provides a variety of mutual-

exclusion library functions including locks and atomic updates

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 31

Cilk++

bullRace Conditions

OpenMP

∙Specification produced by an industry consortium∙Several compilers available both open-source

and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in

the form of compiler pragmas∙Runs on top of native threads

Supports loop parallelism and more recently

copy 2009 Charles E Leiserson 32

∙Supports loop parallelism and more recently in Version 30 task parallelism

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 9

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x yint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

copy 2009 Charles E Leiserson 33

Fibonacci in OpenMP 30

int fib(int n)

if (n lt 2) return nint x y

int fib(int n)

if (n lt 2) return nint x y

Compiler directiveint x y

pragma omp task shared(x)x = fib(n - 1)

pragma omp task shared(y)y = fib(n - 2)

pragma omp taskwaitreturn x+y

int x ypragma omp task shared(x)

x = fib(n - 1)pragma omp task shared(y)

y = fib(n - 2)pragma omp taskwait

return x+y

The following statement is an

independent task

Sharing of memory is

copy 2009 Charles E Leiserson 34

managed explicitly

Wait for the two tasks to complete before continuing

Other OpenMP Features

∙OpenMP provides many pragma directivesto express common patterns such as

parallel for for loop parallelismparallel for for loop parallelismreduction for data aggregationdirectives for scheduling and data sharing

∙OpenMP provides a variety of synchroni-zation constructs such as barriers atomic updates and mutual-exclusion locks

copy 2009 Charles E Leiserson 35

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 36

Cilk++

bullRace Conditions

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 10

Cilk++

∙Small set of linguistic extensions to C++ to support fork-join parallelism∙Developed by CILK ARTS an MIT spin-off∙Developed by CILK ARTS an MIT spin-off

which was acquired by Intel in July 2009 ∙Based on the award-winning Cilk

multithreaded language developed at MIT∙Features a provably efficient work-stealing

scheduler

copy 2009 Charles E Leiserson 37

∙Provides a hyperobject library for parallelizing code with global variables∙ Includes the Cilkscreen race detector and

Cilkview scalability analyzer

Nested Parallelism in Cilk++

int fib(int n)

if (n lt 2) return n

int fib(int n)

if (n lt 2) return n

The named childfunction may execute in parallel with the

llif (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

if (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y

parent caller

Control cannot pass this Control cannot pass this point until all spawned h ld h d

copy 2009 Charles E Leiserson 38

children have returned

Cilk++ keywords grant permission for parallel execution They do not command parallel execution

Loop Parallelism in Cilk++

ExampleIn-place matrix

a11 a12 ⋯ a1n

a21 a22 ⋯ a2n

⋮ ⋮ ⋱ ⋮

a11 a21 ⋯ an1

a12 a22 ⋯ an2

⋮ ⋮ ⋱ ⋮

The iterations of a cilk for

indices run from 0 not 1cilk_for (int i=1 iltn ++i)

matrix transpose

⋮ ⋮ ⋱ ⋮an1 an2 ⋯ ann

⋮ ⋮ ⋱ ⋮a1n a2n ⋯ ann

A AT

copy 2009 Charles E Leiserson 39

of a cilk_forloop execute in parallel

for (int j=0 jlti ++j) double temp = A[i][j]A[i][j] = A[j][i]A[j][i] = temp

Serial Semantics

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_sync

( )

The C++ serialization of a Cilk++ program is always a legal interpretation of the

Cilk++ source

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

return (x+y)

gprogramrsquos semantics

Remember Cilk++ keywords grant permission for parallel execution They do not command parallel execution

copy 2009 Charles E Leiserson 40

To obtain the serializationdefine cilk_for fordefine cilk_spawndefine cilk_sync

Or specify a switch to the Cilk++ compiler

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 11

Scheduling

∙The Cilk++ concurrency platform allows the programmer to express

t ti l ll li i

int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)potential parallelism in

an application∙The Cilk++ scheduler

maps the executing program onto the processor cores d i ll i

Memory IO

cilk_syncreturn (x+y)

copy 2009 Charles E Leiserson 41

dynamically at runtime∙Cilk++rsquos work-stealing

scheduler is provably efficient

Network

hellipPP P P$ $ $

Cilk++Compiler

Conventional Compiler

HyperobjectLibrary1

2 3int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn (x+y)

Cilk++ Platform

Cilk++source

BinaryBinary CilkscreenRace Detector

Linker

5

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)

Serialization

CilkviewScalability Analyzer

6

copy 2009 Charles E Leiserson 42

Conventional Regression Tests

Reliable Single-Threaded Code

Exceptional Performance

Reliable Multi-Threaded Code

Parallel Regression Tests

Runtime System

4

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 43

Cilk++

bullRace Conditions

Race Bugs

Definition A determinacy race occurs when two logically parallel instructions access the same memory location and at least one ofsame memory location and at least one of the instructions performs a write

int x = 0cilk_for(int i=0 ilt2 ++i)

x++

A

B C x++

int x = 0

x++

A

B C

Example

copy 2009 Charles E Leiserson 44

x++assert(x == 2)

B C

D

x++

assert(x == 2)

x++B C

D

Dependency Graph

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 12

A Closer Look

1 2

x = 01

2 4int x = 0

A

r1 = x

r1++

x = r1

r2 = x

r2++

x = r2

2

3

4

5

67

x++

int x 0

assert(x == 2)

x++B C

D

copy 2009 Charles E Leiserson 45

assert(x == 2)8

x

r1

r2

0001 0 01111 1

Types of Races

Suppose that instruction A and instruction Bboth access a location x and suppose that A∥B (A is parallel to B)

A B Race Typeread read noneread write read racewrite read read race

A∥B (A is parallel to B)

copy 2009 Charles E Leiserson 46

write read read racewrite write write race

Two sections of code are independent if they have no determinacy races between them

Avoiding Racesbull Iterations of a cilk_for should be independentbull Between a cilk_spawn and the corresponding cilk_sync the code of the spawned child should be independent of the code of the parent includingbe independent of the code of the parent including code executed by additional spawned or called childrenNote The arguments to a spawned function are evaluated in the parent before the spawn occurs

bull Machine word size matters Watch out for races in packed data structures

copy 2009 Charles E Leiserson 47

packed data structures

structchar achar b

x

structchar achar b

x

Ex Updating xa and xb in parallel may cause a race Nasty because it may depend on the compiler optimization level (Safe on Intel)

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization the race detector guarantees to report and localize thedetector guarantees to report and localize the offending race

∙ Employs a regression-test methodology where the programmer provides test inputs

∙ Identifies filenames lines and variables involved in races including stack traces

copy 2009 Charles E Leiserson 48

g∙ Runs off the binary executable using dynamic

instrumentation ∙ Runs about 20 times slower than real-time

MIT OpenCourseWarehttpocwmitedu

6172 Performance Engineering of Software Systems Fall 2009

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

Page 6: Performance Multicore Programming - DSpace@MIT: · PDF filePerformance Multicore ... access and update items in the container concurrently ... and proprietary, including gcc and Visual

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 6

Pthreads∙Standard API for threading specified by

ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform

B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of

a processor which are multiplexed onto machine resources∙Threads communicate though shared

copy 2009 Charles E Leiserson 21

Threads communicate though shared memory∙Library functions mask the protocols

involved in interthread coordinationWinAPI threads provide similar functionality

Key Pthread Functions

int pthread_create(pthread_t thread

returned identifier for the new thread const pthread_attr_t attr

b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )

routine executed after creation void arg

a single argument passed to func) returns error status

i h d j i (

copy 2009 Charles E Leiserson 22

int pthread_join (pthread_t thread

identifier of thread to wait forvoid status

terminating threadrsquos status (NULL to ignore)) returns error status

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args argselse

int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

copy 2009 Charles E Leiserson 23

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt

int fib(int n)if (n lt 2) return nelse

int main(int argc char argv[])pthread_t threadthread args args

int main(int argc char argv[])pthread_t threadthread args args

Original code

Structure

No point in creating thread if there isnrsquot

enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

else int x = fib(n-1)int y = fib(n-2)return x + y

typedef struct int inputint output

thread_args

void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread

NULL thread_func (void) ampargs )

Structure for thread

argumentsMarshal input argument to

thread

Main program executes

fib(nndash2) in parallel

Block until auxiliary thread

copy 2009 Charles E Leiserson 24

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput

printf(Fibonacci of d is dn n result)return 0

Function called when

thread is created

Create thread to execute fib(nndash1)

parallelfinishes

Add results together to produce final output

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 7

Issues with Pthreads

Overhead The cost of creating a thread gt104

cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)

Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores

Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function

copy 2009 Charles E Leiserson 25

Code Simplicity

Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 26

Cilk++

bullRace Conditions

Threading Building Blocks

∙Developed by Intel∙ Implemented as a C++p

library that runs on top of native threads∙Programmer specifies

tasks rather than threads∙Tasks are automatically

l d b l d th

Image of book cover removed due to copyright restrictions

Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor

copy 2009 Charles E Leiserson 27

load balanced across the threads using work-stealing∙Focus on performance

Multi Core Processor Parallelism OReilly 2007

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum ) n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

copy 2009 Charles E Leiserson 28

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 8

Fibonacci in TBB

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

class FibTask public task public

const long n long const sum FibTask( long n_ long sum_ )

n(n ) sum(sum )

A computation organized as explicit tasks

FibTask has an input parameter n and an output parameter sum

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

n(n_) sum(sum_)

task execute() if( n lt 2 )

sum = n else

long x y FibTaskamp a = new( allocate_child() )

FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )

FibTask(n-2ampy) f (3)

The execute() function performs the

computation when the task is started

Recursively create two child tasks

copy 2009 Charles E Leiserson 29

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y

return NULL

child tasks a and b

Set the number of tasks to wait for (2 children + 1

implicit for bookkeeping)Start task b

Start task a and wait for both a and b to finish

Other TBB Features

∙TBB provides many C++ templates to express common patterns simply such as

parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining

∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently

copy 2009 Charles E Leiserson 30

concurrently∙TBB also provides a variety of mutual-

exclusion library functions including locks and atomic updates

OUTLINE

bullShared-Memory HardwarebullConcurrency Platforms

Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++

copy 2009 Charles E Leiserson 31

Cilk++

bullRace Conditions

OpenMP

∙Specification produced by an industry consortium∙Several compilers available both open-source

and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in

the form of compiler pragmas∙Runs on top of native threads

Supports loop parallelism and more recently

copy 2009 Charles E Leiserson 32

∙Supports loop parallelism and more recently in Version 30 task parallelism

Parallelism and Performance 2009-10-22

MIT 6172 Lecture 11 9

Fibonacci in OpenMP 3.0

int fib(int n)
{
  if (n < 2) return n;
  int x, y;
  // Compiler directive: the following statement is an independent task.
  #pragma omp task shared(x)
  x = fib(n - 1);
  // Sharing of memory is managed explicitly.
  #pragma omp task shared(y)
  y = fib(n - 2);
  // Wait for the two tasks to complete before continuing.
  #pragma omp taskwait
  return x+y;
}

© 2009 Charles E. Leiserson 33-34

Other OpenMP Features

∙ OpenMP provides many pragma directives to express common patterns (a parallel for/reduction sketch follows this slide), such as
    parallel for for loop parallelism
    reduction for data aggregation
    directives for scheduling and data sharing
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks

© 2009 Charles E. Leiserson 35
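For instance, loop parallelism and data aggregation combine in a single directive. The dot-product routine below is a sketch (the function name and arguments are invented for the example), using the standard parallel for and reduction clauses.

double dot(const double* a, const double* b, int n) {
  double sum = 0.0;
  // Split the iterations across the thread team; each thread accumulates a
  // private partial sum, and reduction(+:sum) combines the partials at the end.
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}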

OUTLINE

• Shared-Memory Hardware
• Concurrency Platforms
    Pthreads (and WinAPI Threads)
    Threading Building Blocks
    OpenMP
    Cilk++
• Race Conditions

© 2009 Charles E. Leiserson 36


Cilk++

∙ Small set of linguistic extensions to C++ to support fork-join parallelism
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009
∙ Based on the award-winning Cilk multithreaded language developed at MIT
∙ Features a provably efficient work-stealing scheduler
∙ Provides a hyperobject library for parallelizing code with global variables
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer

© 2009 Charles E. Leiserson 37

Nested Parallelism in Cilk++

int fib(int n)
{
  if (n < 2) return n;
  int x, y;
  // The named child function may execute in parallel with the parent caller.
  x = cilk_spawn fib(n-1);
  y = fib(n-2);
  // Control cannot pass this point until all spawned children have returned.
  cilk_sync;
  return x+y;
}

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

© 2009 Charles E. Leiserson 38

Loop Parallelism in Cilk++

Example: In-place matrix transpose. The n×n matrix A = (aij) is transformed into AT = (aji); only the elements below the diagonal need to be swapped with their mirror images above it.

// indices run from 0, not 1
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

The iterations of a cilk_for loop execute in parallel.

© 2009 Charles E. Leiserson 39

Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics.

Cilk++ source:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

Serialization:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Remember, Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

To obtain the serialization:
  #define cilk_for for
  #define cilk_spawn
  #define cilk_sync
Or specify a switch to the Cilk++ compiler.

© 2009 Charles E. Leiserson 40


Scheduling

∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

[Figure: the parallel fib code being mapped onto a chip multiprocessor with several processors P, each with a private cache ($), connected by a network to memory and I/O.]

© 2009 Charles E. Leiserson 41

Cilk++ Platform

[Figure: the Cilk++ platform tool flow. Cilk++ source is compiled by the Cilk++ compiler with the hyperobject library and linked into a binary that runs on the Cilk++ runtime system; the serialization of the same source goes through a conventional compiler and linker. Conventional regression tests on the serialization yield reliable single-threaded code, while parallel regression tests, the Cilkscreen race detector, and the Cilkview scalability analyzer applied to the parallel binary yield reliable multi-threaded code with exceptional performance.]

© 2009 Charles E. Leiserson 42

OUTLINE

• Shared-Memory Hardware
• Concurrency Platforms
    Pthreads (and WinAPI Threads)
    Threading Building Blocks
    OpenMP
    Cilk++
• Race Conditions

© 2009 Charles E. Leiserson 43

Race Bugs

Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: A (int x = 0) precedes the two logically parallel increments B and C (x++), both of which precede D (assert(x == 2)). B and C race on x.

© 2009 Charles E. Leiserson 44
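One way to repair this particular race without locks is the hyperobject library mentioned earlier: an opadd reducer gives every strand its own view of the sum and combines the views deterministically by the time the loop completes. The sketch below assumes the reducer_opadd header and interface shipped with the Cilk++ SDK (later Intel Cilk Plus); exact header paths vary between releases.

#include <reducer_opadd.h>   // assumed header name; Cilk Plus spells it <cilk/reducer_opadd.h>

int sum_without_a_race() {
  cilk::reducer_opadd<int> x;     // each strand updates a private view of x
  cilk_for (int i = 0; i < 2; ++i) {
    x += 1;                       // no shared read-modify-write, hence no race
  }
  // The per-strand views have been added together once the cilk_for is done.
  return x.get_value();           // always 2
}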


A Closer Look

The increment x++ is not atomic: it compiles into a load of x into a register, an increment of the register, and a store back to x. With B and C running in parallel, the hardware may interleave these steps.

A:  int x = 0;
B:  r1 = x;  r1++;  x = r1;
C:  r2 = x;  r2++;  x = r2;
D:  assert(x == 2);

One possible schedule: x is set to 0; B loads r1 = 0 and C loads r2 = 0 before either stores; both registers are incremented to 1; both stores write x = 1. The assertion then fails, because the final value of x is 1, not 2: one of the updates is lost.

© 2009 Charles E. Leiserson 45

Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A      B      Race Type
  read   read   none
  read   write  read race
  write  read   read race
  write  write  write race

Two sections of code are independent if they have no determinacy races between them.

© 2009 Charles E. Leiserson 46

Avoiding Races

• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
  Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.
• Machine word size matters. Watch out for races in packed data structures.
  Ex. Updating x.a and x.b in parallel may cause a race:

  struct {
    char a;
    char b;
  } x;

  Nasty, because it may depend on the compiler optimization level. (Safe on Intel.) One way around this is sketched below.

© 2009 Charles E. Leiserson 47
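The packed-struct hazard arises because an update to one char field may be compiled as a read-modify-write of the whole machine word that holds both fields. A simple, if space-hungry, way to rule it out is to give each concurrently updated field its own word; the sketch below is illustrative and the type name is invented.

struct SafePair {
  int a;   // each field now occupies at least a full machine word on typical targets,
  int b;   // so parallel updates to a and b touch disjoint memory locations
};

Alternatively, arrange for both fields to be updated by the same strand.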

Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.

© 2009 Charles E. Leiserson 48

MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

