TRANSCRIPT
Parallelism and Performance 2009-10-22
Lecture 11
6.172 Performance Engineering of Software Systems
Multicore Programming
Charles E. Leiserson, October 22, 2009
Moore's Law
[Figure: Intel CPU introductions, 1971–2007, plotting transistor counts (thousands) and clock speed (MHz) on a logarithmic scale up to 1,000,000.]
Figure by MIT OpenCourseWare. Source: Herb Sutter, "The free lunch is over: a fundamental turn toward concurrency in software," Dr. Dobb's Journal 30(3), March 2005.
© 2009 Charles E. Leiserson
Power Density
[Figure: power density of Intel processors over time.]
Source: Patrick Gelsinger, Intel Developer's Forum, Intel Corporation, 2004. Reprinted with permission of Intel Corporation.
Vendor Solution
[Image: Intel Core i7 processor. Reprinted with permission of Intel Corporation.]
∙ To scale performance, put many processing cores on the microprocessor chip.
∙ Each generation of Moore's Law potentially doubles the number of cores.
MIT 6172 Lecture 11 1
Transistor count is still rising …
but clock speed is bounded at ~5GHz.
Multicore Architecture
[Diagram: Chip Multiprocessor (CMP) — processors P, P, …, P, each with a private cache ($), connected by a network to memory and I/O.]
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Cache Coherence
[Animation across several slides: processors P, P, …, P, each with a private cache; main memory initially holds x=3.]
∙ A processor executes Load x: the value x=3 is brought into its cache.
∙ Two more processors execute Load x: now three caches each hold x=3.
∙ One processor executes Store x=5: its cache now holds x=5, but the other caches and main memory still hold the stale value x=3.
∙ Another processor executes Load x: without a coherence protocol, it would read the stale value x=3 from its own cache.
MSI Protocol
Each cache line is labeled with a state:
• M: cache block has been modified. No other caches contain this block in M or S states.
• S: other caches may be sharing this block.
• I: cache block is invalid (same as not there).

[Diagram: three caches, e.g.
  cache 1: M x=13, S y=17, I z=8
  cache 2: I x=4,  S y=17, M z=7
  cache 3: I x=12, S y=17, I z=3 ]

Before a cache modifies a location, the hardware first invalidates all other copies.
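These transitions can be sketched as a toy state machine (an illustrative simplification invented here: one cache line, instantaneous invalidations, no bus or memory modeling):

```cpp
#include <cstddef>
#include <vector>

// Toy model of MSI invalidation for a single cache line.
// States: M(odified), S(hared), I(nvalid).
enum class MSI { M, S, I };

struct Line { MSI state = MSI::I; };

// On a load by cache i: any modified copy elsewhere is downgraded to S
// (conceptually written back), and cache i obtains the block in S.
void load(std::vector<Line>& caches, std::size_t i) {
    for (auto& c : caches)
        if (c.state == MSI::M) c.state = MSI::S;
    caches[i].state = MSI::S;
}

// On a store by cache i: the hardware first invalidates all other
// copies; cache i then holds the block in M.
void store(std::vector<Line>& caches, std::size_t i) {
    for (std::size_t j = 0; j < caches.size(); ++j)
        if (j != i) caches[j].state = MSI::I;
    caches[i].state = MSI::M;
}

// MSI invariant: if any cache is in M, no other cache is in M or S.
bool coherent(const std::vector<Line>& caches) {
    int m = 0, s = 0;
    for (const auto& c : caches) {
        if (c.state == MSI::M) ++m;
        if (c.state == MSI::S) ++s;
    }
    return m <= 1 && (m == 0 || s == 0);
}
```

Running the slide's scenario through this model, the store invalidates the other copies, so the later load cannot observe a stale value.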
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Concurrency Platforms
• Programming directly on processor cores is painful and error-prone.
• A concurrency platform abstracts processor cores, handles synchronization and communication protocols, and performs load balancing.
• Examples:
  ‣ Pthreads and WinAPI threads
  ‣ Threading Building Blocks (TBB)
  ‣ OpenMP
  ‣ Cilk++
Fibonacci Numbers
The Fibonacci numbers are the sequence ⟨0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …⟩, where each number is the sum of the previous two.

Recurrence:
  F0 = 0,
  F1 = 1,
  Fn = Fn–1 + Fn–2 for n > 1.

The sequence is named after Leonardo di Pisa (1170–1250 A.D.), also known as Fibonacci, a contraction of filius Bonaccii — "son of Bonaccio." Fibonacci's 1202 book Liber Abaci introduced the sequence to Western mathematics, although it had previously been discovered by Indian mathematicians.
Fibonacci Program

  #include <stdio.h>
  #include <stdlib.h>

  int fib(int n) {
    if (n < 2) return n;
    else {
      int x = fib(n-1);
      int y = fib(n-2);
      return x + y;
    }
  }

  int main(int argc, char *argv[]) {
    int n = atoi(argv[1]);
    int result = fib(n);
    printf("Fibonacci of %d is %d.\n", n, result);
    return 0;
  }

Disclaimer: This recursive program is a poor way to compute the nth Fibonacci number, but it provides a good didactic example.
Fibonacci Execution

  fib(4)
  ├── fib(3)
  │   ├── fib(2)
  │   │   ├── fib(1)
  │   │   └── fib(0)
  │   └── fib(1)
  └── fib(2)
      ├── fib(1)
      └── fib(0)
Key idea for parallelization
The calculations of fib(n-1) and fib(n-2) can be executed simultaneously without mutual interference.

  int fib(int n) {
    if (n < 2) return n;
    else {
      int x = fib(n-1);
      int y = fib(n-2);
      return x + y;
    }
  }
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Pthreads
∙ Standard API for threading, specified by ANSI/IEEE POSIX 1003.1-1995 (§1003.1c).
∙ A "do-it-yourself" concurrency platform.
∙ Built as a library of functions with "special" non-C++ semantics.
∙ Each thread implements an abstraction of a processor; threads are multiplexed onto machine resources.
∙ Threads communicate through shared memory.
∙ Library functions mask the protocols involved in interthread coordination.
(WinAPI threads provide similar functionality.)
Key Pthread Functions

  int pthread_create(
    pthread_t *thread,           // returned identifier for the new thread
    const pthread_attr_t *attr,  // object to set thread attributes (NULL for default)
    void *(*func)(void *),       // routine executed after creation
    void *arg);                  // a single argument passed to func
  // returns error status

  int pthread_join(
    pthread_t thread,            // identifier of thread to wait for
    void **status);              // terminating thread's status (NULL to ignore)
  // returns error status
Pthread Implementation

  #include <stdio.h>
  #include <stdlib.h>
  #include <pthread.h>

  /* Original code. */
  int fib(int n) {
    if (n < 2) return n;
    else {
      int x = fib(n-1);
      int y = fib(n-2);
      return x + y;
    }
  }

  /* Structure for thread arguments. */
  typedef struct {
    int input;
    int output;
  } thread_args;

  /* Function called when thread is created. */
  void *thread_func(void *ptr) {
    int i = ((thread_args *) ptr)->input;
    ((thread_args *) ptr)->output = fib(i);
    return NULL;
  }

  int main(int argc, char *argv[]) {
    pthread_t thread;
    thread_args args;
    int status;
    int result;
    if (argc < 2) return 1;
    int n = atoi(argv[1]);
    if (n < 30) {                /* No point in creating a thread if there
                                    isn't enough to do. */
      result = fib(n);
    } else {
      args.input = n-1;          /* Marshal input argument to thread. */
      /* Create thread to execute fib(n-1). */
      status = pthread_create(&thread,
                              NULL,
                              thread_func,
                              (void*) &args);
      /* main can continue executing: the main program
         executes fib(n-2) in parallel. */
      result = fib(n-2);
      /* Block until the auxiliary thread finishes. */
      pthread_join(thread, NULL);
      /* Add results together to produce final output. */
      result += args.output;
    }
    printf("Fibonacci of %d is %d.\n", n, result);
    return 0;
  }
Issues with Pthreads
• Overhead: The cost of creating a thread is >10⁴ cycles ⇒ coarse-grained concurrency. (Thread pools can help.)
• Scalability: The Fibonacci code gets about a 1.5× speedup for 2 cores. Need a rewrite for more cores.
• Modularity: The Fibonacci logic is no longer neatly encapsulated in the fib() function.
• Code Simplicity: Programmers must marshal arguments (shades of 1958!) and engage in error-prone protocols in order to load-balance.
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Threading Building Blocks
∙ Developed by Intel.
∙ Implemented as a C++ library that runs on top of native threads.
∙ Programmer specifies tasks rather than threads.
∙ Tasks are automatically load balanced across the threads using work-stealing.
∙ Focus on performance.
[Image of book cover removed due to copyright restrictions: Reinders, James. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.]
Fibonacci in TBB

  class FibTask : public task {
  public:
    const long n;
    long* const sum;
    FibTask( long n_, long* sum_ ) :
      n(n_), sum(sum_)
    {}
    task* execute() {
      if( n < 2 ) {
        *sum = n;
      } else {
        long x, y;
        FibTask& a = *new( allocate_child() )
          FibTask(n-1, &x);
        FibTask& b = *new( allocate_child() )
          FibTask(n-2, &y);
        set_ref_count(3);
        spawn( b );
        spawn_and_wait_for_all( a );
        *sum = x+y;
      }
      return NULL;
    }
  };
Fibonacci in TBB
∙ A computation organized as explicit tasks.
∙ FibTask has an input parameter n and an output parameter sum.
∙ The execute() function performs the computation when the task is started.
∙ Recursively create two child tasks, a and b.
∙ Set the number of tasks to wait for (2 children + 1 implicit for bookkeeping).
∙ Start task b; then start task a and wait for both a and b to finish.
Other TBB Features
∙ TBB provides many C++ templates to express common patterns simply, such as
  ‣ parallel_for for loop parallelism,
  ‣ parallel_reduce for data aggregation,
  ‣ pipeline and filter for software pipelining.
∙ TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently.
∙ TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates.
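The fork-join divide-and-conquer pattern behind parallel_reduce can be sketched with only the C++ standard library (an illustrative analog, not TBB's API; parallel_sum and the grain size are invented for this sketch):

```cpp
#include <cstddef>
#include <future>
#include <numeric>

// Recursive fork-join reduction in the style of tbb::parallel_reduce,
// approximated with std::async. Work smaller than the grain size is
// summed serially; larger ranges are split and one half is forked.
long parallel_sum(const long* a, std::size_t n) {
    const std::size_t kGrain = 4;             // tiny cutoff, for illustration only
    if (n <= kGrain)
        return std::accumulate(a, a + n, 0L); // serial base case
    std::size_t half = n / 2;
    // Fork: sum the first half in another thread ...
    auto left = std::async(std::launch::async, parallel_sum, a, half);
    // ... while this thread recursively sums the second half.
    long right = parallel_sum(a + half, n - half);
    return left.get() + right;                // join and combine
}
```

A real TBB reduction would additionally use work-stealing and a tuned grain size rather than spawning an OS-level thread per split.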
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
OpenMP
∙ Specification produced by an industry consortium.
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
∙ Runs on top of native threads.
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism.
Fibonacci in OpenMP 3.0

  int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x+y;
  }
Fibonacci in OpenMP 3.0
∙ #pragma omp is a compiler directive.
∙ The statement following an omp task directive is an independent task.
∙ Sharing of memory is managed explicitly (the shared clauses).
∙ taskwait waits for the two tasks to complete before continuing.
Other OpenMP Features
∙ OpenMP provides many pragma directives to express common patterns, such as
  ‣ parallel for for loop parallelism,
  ‣ reduction for data aggregation,
  ‣ directives for scheduling and data sharing.
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
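As a concrete sketch of loop parallelism with a reduction clause (dot_product is an invented example, not from the lecture): compiled without an OpenMP flag such as gcc's -fopenmp, the pragma is simply ignored and the loop runs serially with the same result, which matches OpenMP's serial semantics.

```cpp
#include <cstddef>

// OpenMP "parallel for" with a reduction: iterations are divided among
// threads, each accumulating a private copy of sum that is combined at
// the end. Without -fopenmp, the pragma is ignored and the loop is serial.
double dot_product(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)n; ++i)  // OpenMP 3.0 requires a signed loop index
        sum += a[i] * b[i];
    return sum;
}
```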
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Cilk++
∙ Small set of linguistic extensions to C++ to support fork-join parallelism.
∙ Developed by Cilk Arts, an MIT spin-off, which was acquired by Intel in July 2009.
∙ Based on the award-winning Cilk multithreaded language developed at MIT.
∙ Features a provably efficient work-stealing scheduler.
∙ Provides a hyperobject library for parallelizing code with global variables.
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer.
Nested Parallelism in Cilk++

  int fib(int n) {
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x+y;
  }

∙ The named child function may execute in parallel with the parent caller.
∙ Control cannot pass the cilk_sync point until all spawned children have returned.
∙ Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
Loop Parallelism in Cilk++
Example: in-place matrix transpose, A → Aᵀ:

  ⎡a11 a12 ⋯ a1n⎤      ⎡a11 a21 ⋯ an1⎤
  ⎢a21 a22 ⋯ a2n⎥  →   ⎢a12 a22 ⋯ an2⎥
  ⎢ ⋮   ⋮  ⋱  ⋮ ⎥      ⎢ ⋮   ⋮  ⋱  ⋮ ⎥
  ⎣an1 an2 ⋯ ann⎦      ⎣a1n a2n ⋯ ann⎦

  // indices run from 0, not 1
  cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }

The iterations of a cilk_for loop execute in parallel.
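As a sanity check on the loop's logic, here is its serial elision (cilk_for replaced by an ordinary for, with A as a hypothetical nested-vector matrix rather than the raw array on the slide):

```cpp
#include <vector>

// Serial elision of the cilk_for transpose: replacing cilk_for with
// for yields plain C++ with identical semantics.
void transpose(std::vector<std::vector<double>>& A) {
    int n = (int)A.size();
    for (int i = 1; i < n; ++i)        // indices run from 0, not 1:
        for (int j = 0; j < i; ++j) {  // the diagonal needs no swap
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
}
```

Because each iteration i touches only row i and column i below the diagonal, the iterations are independent, which is what makes the cilk_for version race-free.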
Serial Semantics

  int fib (int n) {
    if (n<2) return (n);
    else {
      int x,y;
      x = cilk_spawn fib(n-1);
      y = fib(n-2);
      cilk_sync;
      return (x+y);
    }
  }

Serialization:

  int fib (int n) {
    if (n<2) return (n);
    else {
      int x,y;
      x = fib(n-1);
      y = fib(n-2);
      return (x+y);
    }
  }

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics.
Remember: Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
To obtain the serialization:
  #define cilk_for for
  #define cilk_spawn
  #define cilk_sync
Or, specify a switch to the Cilk++ compiler.
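The #define serialization can be exercised directly: with the three macros in place, Cilk++ source such as fib compiles as ordinary serial C++ (a sketch of the mechanism; no Cilk++ compiler is involved):

```cpp
// The serialization trick: with these macros, Cilk++ keywords vanish
// textually and the function below is ordinary serial C++.
#define cilk_for for
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n - 1);  // expands to: x = fib(n - 1);
    y = fib(n - 2);
    cilk_sync;                  // expands to an empty statement: ;
    return x + y;
}
```

This is why the serialization is always a legal interpretation: deleting the keywords leaves a correct sequential program.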
Scheduling
∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.
[Diagram: the fib code being mapped onto a multicore machine — processors P … P with caches $, connected by a network to memory and I/O.]
Cilk++ Platform
[Tool-flow diagram: (1) Cilk++ source is fed to the Cilk++ compiler, which (2) works with a conventional compiler and (3) the hyperobject library; the linker produces a binary that runs on (4) the runtime system, delivering reliable multi-threaded code with exceptional performance. (5) The Cilkscreen race detector checks the binary against parallel regression tests. (6) The Cilkview scalability analyzer measures scalability. The serialization of the same source, compiled conventionally and run through conventional regression tests, yields reliable single-threaded code.]
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  ‣ Pthreads (and WinAPI Threads)
  ‣ Threading Building Blocks
  ‣ OpenMP
  ‣ Cilk++
• Race Conditions
Race Bugs
Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:
  int x = 0;
  cilk_for(int i=0; i<2; ++i) {
    x++;
  }
  assert(x == 2);

Dependency graph: A (int x = 0) branches to two parallel nodes B (x++) and C (x++), which both precede D (assert(x == 2)).
A Closer Look
Each x++ actually executes as three instructions: a load (r = x), an increment (r++), and a store (x = r). One possible interleaving of the parallel branches B and C:

  1. x = 0;            x=0
  2. r1 = x;           r1=0
  3. r2 = x;           r2=0
  4. r1++;             r1=1
  5. r2++;             r2=1
  6. x = r1;           x=1
  7. x = r2;           x=1
  8. assert(x == 2);   fails!

Both increments read x=0 before either store, so the final value of x is 1, not 2.
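The racy schedule can be replayed deterministically in plain serial C++ by performing the load/increment/store steps by hand (a simulation of the interleaving, not actual parallel code; lost_update is an invented name):

```cpp
// Replaying the lost-update interleaving: both "branches" load x
// before either stores, so one increment is lost.
int lost_update() {
    int x = 0;      // step 1: x = 0
    int r1 = x;     // step 2: B loads x (reads 0)
    int r2 = x;     // step 3: C loads x (reads 0)
    r1++;           // step 4: B increments its register
    r2++;           // step 5: C increments its register
    x = r1;         // step 6: B stores 1
    x = r2;         // step 7: C stores 1, overwriting B's update
    return x;       // step 8: x is 1, not 2
}
```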
Types of Races
Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A     | B     | Race Type
  read  | read  | none
  read  | write | read race
  write | read  | read race
  write | write | write race

Two sections of code are independent if they have no determinacy races between them.
Avoiding Races
• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children.
  Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.
• Machine word size matters. Watch out for races in packed data structures.

    struct {
      char a;
      char b;
    } x;

  Ex: Updating x.a and x.b in parallel may cause a race. Nasty, because it may depend on the compiler optimization level. (Safe on Intel.)
Cilkscreen Race Detector
∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.
MIT OpenCourseWare
http://ocw.mit.edu
6.172 Performance Engineering of Software Systems, Fall 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 2
Memory IO
Multicore Architecture
Network
y
copy 2009 Charles E Leiserson 5
hellip$ $ $
Chip Multiprocessor (CMP)
PPP
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 6
Cilk++
bullRace Conditions
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 7
Cilk++
bullRace Conditions
x=3
Cache Coherence
copy 2009 Charles E Leiserson 8
hellipP P P
Load x x=3
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 9
hellipP P P
Load x x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 10
hellipP P P
Load x x=3 x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 11
hellipP P P
Store x=5 x=3 x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 12
hellipP P P
Store x=5 x=3 x=5 x=3
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 4
x=3
Cache Coherence
copy 2009 Charles E Leiserson 13
hellipP P P
Load x x=3 x=5 x=3
MSI Protocol
Each cache line is labeled with a statebull M cache block has been modified No other
caches contain this block in M or S states
M x=13S y=17
bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)
S y=17I x=4 I x=12
S y=17
copy 2009 Charles E Leiserson 14
I z=8 M z=7 I z=3
Before a cache modifies a location the hardware first invalidates all other copies
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 15
Cilk++
bullRace Conditions
Concurrency Platforms
bull Programming directly on processor cores is painful and error-prone
bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing
bull Examplesbull Pthreads and WinAPI threads
copy 2009 Charles E Leiserson 16
Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 5
Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two
The sequence is named after Leonardo di Pisa (1170ndash
RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1
copy 2009 Charles E Leiserson 17
1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians
Fibonacci Program
include ltstdiohgtinclude ltstdlibhgt
int fib(int n)
DisclaimerThis recursive program is a poor
if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y
int main(int argc char argv[])
way to compute the nth Fibonacci number but it provides a good didactic example
copy 2009 Charles E Leiserson 18
int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0
Fibonacci Executionfib(4)
fib(3) fib(2)
fib(2)
fib(1) fib(0)
fib(1) fib(1) fib(0)
Key idea for parallelization
int fib(int n)if (n lt 2) return nelse
copy 2009 Charles E Leiserson 19
Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference
else int x = fib(n-1)int y = fib(n-2)return x + y
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 20
Cilk++
bullRace Conditions
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 6
Pthreads∙Standard API for threading specified by
ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform
B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of
a processor which are multiplexed onto machine resources∙Threads communicate though shared
copy 2009 Charles E Leiserson 21
Threads communicate though shared memory∙Library functions mask the protocols
involved in interthread coordinationWinAPI threads provide similar functionality
Key Pthread Functions
int pthread_create(pthread_t thread
returned identifier for the new thread const pthread_attr_t attr
b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )
routine executed after creation void arg
a single argument passed to func) returns error status
i h d j i (
copy 2009 Charles E Leiserson 22
int pthread_join (pthread_t thread
identifier of thread to wait forvoid status
terminating threadrsquos status (NULL to ignore)) returns error status
Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
int main(int argc char argv[])pthread_t threadthread args args
int main(int argc char argv[])pthread_t threadthread args argselse
int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
else int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
copy 2009 Charles E Leiserson 23
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
int main(int argc char argv[])pthread_t threadthread args args
int main(int argc char argv[])pthread_t threadthread args args
Original code
Structure
No point in creating thread if there isnrsquot
enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
else int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
Structure for thread
argumentsMarshal input argument to
thread
Main program executes
fib(nndash2) in parallel
Block until auxiliary thread
copy 2009 Charles E Leiserson 24
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
Function called when
thread is created
Create thread to execute fib(nndash1)
parallelfinishes
Add results together to produce final output
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 7
Issues with Pthreads
Overhead The cost of creating a thread gt104
cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)
Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores
Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function
copy 2009 Charles E Leiserson 25
Code Simplicity
Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 26
Cilk++
bullRace Conditions
Threading Building Blocks
∙Developed by Intel∙ Implemented as a C++p
library that runs on top of native threads∙Programmer specifies
tasks rather than threads∙Tasks are automatically
l d b l d th
Image of book cover removed due to copyright restrictions
Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor
copy 2009 Charles E Leiserson 27
load balanced across the threads using work-stealing∙Focus on performance
Multi Core Processor Parallelism OReilly 2007
Fibonacci in TBB

class FibTask : public task {
public:
  const long n;
  long* const sum;
  FibTask( long n_, long* sum_ ) :
    n(n_), sum(sum_)
  {}
  task* execute() {
    if( n < 2 ) {
      *sum = n;
    } else {
      long x, y;
      FibTask& a = *new( allocate_child() )
        FibTask(n-1, &x);
      FibTask& b = *new( allocate_child() )
        FibTask(n-2, &y);
      set_ref_count(3);
      spawn( b );
      spawn_and_wait_for_all( a );
      *sum = x+y;
    }
    return NULL;
  }
};
Fibonacci in TBB (annotated)
• The computation is organized as explicit tasks.
• FibTask has an input parameter n and an output parameter *sum.
• The execute() function performs the computation when the task is started.
• allocate_child() recursively creates two child tasks, a and b.
• set_ref_count(3) sets the number of tasks to wait for (2 children + 1 implicit for bookkeeping).
• spawn( b ) starts task b.
• spawn_and_wait_for_all( a ) starts task a and waits for both a and b to finish.
Other TBB Features
• TBB provides many C++ templates to express common patterns simply, such as
  - parallel_for for loop parallelism,
  - parallel_reduce for data aggregation,
  - pipeline and filter for software pipelining.
• TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently.
• TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates.
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  - Pthreads (and WinAPI Threads)
  - Threading Building Blocks
  - OpenMP
  - Cilk++
• Race Conditions
OpenMP
• Specification produced by an industry consortium.
• Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
• Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
• Runs on top of native threads.
• Supports loop parallelism and, more recently in Version 3.0, task parallelism.
Fibonacci in OpenMP 3.0

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)
  x = fib(n - 1);
  #pragma omp task shared(y)
  y = fib(n - 2);
  #pragma omp taskwait
  return x+y;
}
Fibonacci in OpenMP 3.0 (annotated)
• #pragma omp task is a compiler directive: the following statement is an independent task.
• Sharing of memory is managed explicitly with the shared(...) clause.
• #pragma omp taskwait waits for the two tasks to complete before continuing.
Other OpenMP Features
• OpenMP provides many pragma directives to express common patterns, such as
  - parallel for for loop parallelism,
  - reduction for data aggregation,
  - directives for scheduling and data sharing.
• OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  - Pthreads (and WinAPI Threads)
  - Threading Building Blocks
  - OpenMP
  - Cilk++
• Race Conditions
Cilk++
• Small set of linguistic extensions to C++ to support fork-join parallelism.
• Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009.
• Based on the award-winning Cilk multithreaded language developed at MIT.
• Features a provably efficient work-stealing scheduler.
• Provides a hyperobject library for parallelizing code with global variables.
• Includes the Cilkscreen race detector and the Cilkview scalability analyzer.
Nested Parallelism in Cilk++

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);
  y = fib(n-2);
  cilk_sync;
  return x+y;
}

• cilk_spawn: the named child function may execute in parallel with the parent caller.
• cilk_sync: control cannot pass this point until all spawned children have returned.
• Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
Loop Parallelism in Cilk++

Example: in-place matrix transpose, turning A into Aᵀ, so that each entry a(i,j) of A becomes entry a(j,i) of Aᵀ. The iterations of a cilk_for loop execute in parallel.

// indices run from 0, not 1
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}
Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics. Remember: Cilk++ keywords grant permission for parallel execution; they do not command parallel execution.

Cilk++ source:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

Serialization:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

To obtain the serialization:
#define cilk_for for
#define cilk_spawn
#define cilk_sync
Or specify a switch to the Cilk++ compiler.
Scheduling
• The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
• The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
• Cilk++'s work-stealing scheduler is provably efficient.

[Diagram: the fib code running on a chip multiprocessor with processors P, caches $, a network, memory, and I/O.]
Cilk++ Platform

[Diagram: Cilk++ source is compiled by the Cilk++ compiler and linked, with the hyperobject library and runtime system, into a binary that yields reliable multi-threaded code with exceptional performance, checked by parallel regression tests, the Cilkscreen race detector, and the Cilkview scalability analyzer. The serialization of the same source, compiled by a conventional compiler, yields reliable single-threaded code, checked by conventional regression tests.]
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  - Pthreads (and WinAPI Threads)
  - Threading Building Blocks
  - OpenMP
  - Cilk++
• Race Conditions
Race Bugs

Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: A (int x = 0) precedes the two parallel increments B and C (each x++), which precede D (assert(x == 2)).
A Closer Look

Each x++ compiles into a load, an increment, and a store. One racy interleaving of the two parallel branches B and C:

1. x = 0;           (A)
2. r1 = x;          (B reads 0)
3. r1++;            (r1 = 1)
4. r2 = x;          (C reads 0, before B's store)
5. r2++;            (r2 = 1)
6. x = r1;          (x = 1)
7. x = r2;          (x = 1; B's update is lost)
8. assert(x == 2);  (D fails: x == 1)
Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A       B       Race Type
  read    read    none
  read    write   read race
  write   read    read race
  write   write   write race

Two sections of code are independent if they have no determinacy races between them.
Avoiding Races
• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. (Note: the arguments to a spawned function are evaluated in the parent before the spawn occurs.)
• Machine word size matters. Watch out for races in packed data structures:

  struct {
    char a;
    char b;
  } x;

  Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because whether it does may depend on the compiler optimization level. (Safe on Intel.)
Cilkscreen Race Detector
• If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
• Employs a regression-test methodology, where the programmer provides test inputs.
• Identifies filenames, lines, and variables involved in races, including stack traces.
• Runs off the binary executable, using dynamic instrumentation.
• Runs about 20 times slower than real time.
MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 9
hellipP P P
Load x x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 10
hellipP P P
Load x x=3 x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 11
hellipP P P
Store x=5 x=3 x=3 x=3
x=3
Cache Coherence
copy 2009 Charles E Leiserson 12
hellipP P P
Store x=5 x=3 x=5 x=3
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 4
x=3
Cache Coherence
copy 2009 Charles E Leiserson 13
hellipP P P
Load x x=3 x=5 x=3
MSI Protocol
Each cache line is labeled with a statebull M cache block has been modified No other
caches contain this block in M or S states
M x=13S y=17
bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)
S y=17I x=4 I x=12
S y=17
copy 2009 Charles E Leiserson 14
I z=8 M z=7 I z=3
Before a cache modifies a location the hardware first invalidates all other copies
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 15
Cilk++
bullRace Conditions
Concurrency Platforms
bull Programming directly on processor cores is painful and error-prone
bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing
bull Examplesbull Pthreads and WinAPI threads
copy 2009 Charles E Leiserson 16
Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 5
Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two
The sequence is named after Leonardo di Pisa (1170ndash
RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1
copy 2009 Charles E Leiserson 17
1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians
Fibonacci Program
include ltstdiohgtinclude ltstdlibhgt
int fib(int n)
DisclaimerThis recursive program is a poor
if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y
int main(int argc char argv[])
way to compute the nth Fibonacci number but it provides a good didactic example
copy 2009 Charles E Leiserson 18
int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0
Fibonacci Executionfib(4)
fib(3) fib(2)
fib(2)
fib(1) fib(0)
fib(1) fib(1) fib(0)
Key idea for parallelization
int fib(int n)if (n lt 2) return nelse
copy 2009 Charles E Leiserson 19
Key idea for parallelizationThe calculations of fib(n-1)and fib(n-2) can be executed simultaneously without mutual interference
else int x = fib(n-1)int y = fib(n-2)return x + y
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 20
Cilk++
bullRace Conditions
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 6
Pthreads∙Standard API for threading specified by
ANSIIEEE POSIX 10031-1995 sect10031c∙Do-it-yourself concurrency platform
B il lib f f i i h ldquo i lrdquo∙Built as a library of functions with ldquospecialrdquo non-C++ semantics∙Each thread implements an abstraction of
a processor which are multiplexed onto machine resources∙Threads communicate though shared
copy 2009 Charles E Leiserson 21
Threads communicate though shared memory∙Library functions mask the protocols
involved in interthread coordinationWinAPI threads provide similar functionality
Key Pthread Functions
int pthread_create(pthread_t thread
returned identifier for the new thread const pthread_attr_t attr
b h d b ( f d f l )object to set thread attributes (NULL for default)void (func)(void )
routine executed after creation void arg
a single argument passed to func) returns error status
i h d j i (
copy 2009 Charles E Leiserson 22
int pthread_join (pthread_t thread
identifier of thread to wait forvoid status
terminating threadrsquos status (NULL to ignore)) returns error status
Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
int main(int argc char argv[])pthread_t threadthread args args
int main(int argc char argv[])pthread_t threadthread args argselse
int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
else int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
copy 2009 Charles E Leiserson 23
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
Pthread Implementationinclude ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
include ltstdiohgtinclude ltstdlibhgtinclude ltpthreadhgt
int fib(int n)if (n lt 2) return nelse
int main(int argc char argv[])pthread_t threadthread args args
int main(int argc char argv[])pthread_t threadthread args args
Original code
Structure
No point in creating thread if there isnrsquot
enough to doelse int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
else int x = fib(n-1)int y = fib(n-2)return x + y
typedef struct int inputint output
thread_args
void thread_func ( void ptr )int i = ((thread args ) ptr)-gtinput
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
thread_args argsint statusint resultint thread_resultif (argc lt 2) return 1int n = atoi(argv[1])if (n lt 30) result = fib(n)else argsinput = n-1status = pthread_create(ampthread
NULL thread_func (void) ampargs )
Structure for thread
argumentsMarshal input argument to
thread
Main program executes
fib(nndash2) in parallel
Block until auxiliary thread
copy 2009 Charles E Leiserson 24
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
int i = ((thread_args ) ptr)-gtinput((thread_args ) ptr)-gtoutput = fib(i)return NULL
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
main can continue executingresult = fib(n-2) Wait for the thread to terminatepthread_join(thread NULL)result += argsoutput
printf(Fibonacci of d is dn n result)return 0
Function called when
thread is created
Create thread to execute fib(nndash1)
parallelfinishes
Add results together to produce final output
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 7
Issues with Pthreads
Overhead The cost of creating a thread gt104
cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)
Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores
Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function
copy 2009 Charles E Leiserson 25
Code Simplicity
Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 26
Cilk++
bullRace Conditions
Threading Building Blocks
∙Developed by Intel∙ Implemented as a C++p
library that runs on top of native threads∙Programmer specifies
tasks rather than threads∙Tasks are automatically
l d b l d th
Image of book cover removed due to copyright restrictions
Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor
copy 2009 Charles E Leiserson 27
load balanced across the threads using work-stealing∙Focus on performance
Multi Core Processor Parallelism OReilly 2007
Fibonacci in TBB
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum ) n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
copy 2009 Charles E Leiserson 28
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 8
Fibonacci in TBB
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
A computation organized as explicit tasks
FibTask has an input parameter n and an output parameter sum
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
The execute() function performs the
computation when the task is started
Recursively create two child tasks
copy 2009 Charles E Leiserson 29
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
child tasks a and b
Set the number of tasks to wait for (2 children + 1
implicit for bookkeeping)Start task b
Start task a and wait for both a and b to finish
Other TBB Features
∙TBB provides many C++ templates to express common patterns simply such as
parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining
∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently
copy 2009 Charles E Leiserson 30
concurrently∙TBB also provides a variety of mutual-
exclusion library functions including locks and atomic updates
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 31
Cilk++
bullRace Conditions
OpenMP
∙Specification produced by an industry consortium∙Several compilers available both open-source
and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in
the form of compiler pragmas∙Runs on top of native threads
Supports loop parallelism and more recently
copy 2009 Charles E Leiserson 32
∙Supports loop parallelism and more recently in Version 30 task parallelism
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 9
Fibonacci in OpenMP 30
int fib(int n)
if (n lt 2) return nint x y
int fib(int n)
if (n lt 2) return nint x yint x y
pragma omp task shared(x)x = fib(n - 1)
pragma omp task shared(y)y = fib(n - 2)
pragma omp taskwaitreturn x+y
int x ypragma omp task shared(x)
x = fib(n - 1)pragma omp task shared(y)
y = fib(n - 2)pragma omp taskwait
return x+y
copy 2009 Charles E Leiserson 33
Fibonacci in OpenMP 30
int fib(int n)
if (n lt 2) return nint x y
int fib(int n)
if (n lt 2) return nint x y
Compiler directiveint x y
pragma omp task shared(x)x = fib(n - 1)
pragma omp task shared(y)y = fib(n - 2)
pragma omp taskwaitreturn x+y
int x ypragma omp task shared(x)
x = fib(n - 1)pragma omp task shared(y)
y = fib(n - 2)pragma omp taskwait
return x+y
The following statement is an
independent task
Sharing of memory is
copy 2009 Charles E Leiserson 34
managed explicitly
Wait for the two tasks to complete before continuing
Other OpenMP Features
∙OpenMP provides many pragma directivesto express common patterns such as
parallel for for loop parallelismparallel for for loop parallelismreduction for data aggregationdirectives for scheduling and data sharing
∙OpenMP provides a variety of synchroni-zation constructs such as barriers atomic updates and mutual-exclusion locks
copy 2009 Charles E Leiserson 35
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 36
Cilk++
bullRace Conditions
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 10
Cilk++
∙Small set of linguistic extensions to C++ to support fork-join parallelism∙Developed by CILK ARTS an MIT spin-off∙Developed by CILK ARTS an MIT spin-off
which was acquired by Intel in July 2009 ∙Based on the award-winning Cilk
multithreaded language developed at MIT∙Features a provably efficient work-stealing
scheduler
copy 2009 Charles E Leiserson 37
∙Provides a hyperobject library for parallelizing code with global variables∙ Includes the Cilkscreen race detector and
Cilkview scalability analyzer
Nested Parallelism in Cilk++
int fib(int n)
if (n lt 2) return n
int fib(int n)
if (n lt 2) return n
The named childfunction may execute in parallel with the
llif (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y
if (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y
parent caller
Control cannot pass this Control cannot pass this point until all spawned h ld h d
copy 2009 Charles E Leiserson 38
children have returned
Cilk++ keywords grant permission for parallel execution They do not command parallel execution
Loop Parallelism in Cilk++
ExampleIn-place matrix
a11 a12 ⋯ a1n
a21 a22 ⋯ a2n
⋮ ⋮ ⋱ ⋮
a11 a21 ⋯ an1
a12 a22 ⋯ an2
⋮ ⋮ ⋱ ⋮
The iterations of a cilk for
indices run from 0 not 1cilk_for (int i=1 iltn ++i)
matrix transpose
⋮ ⋮ ⋱ ⋮an1 an2 ⋯ ann
⋮ ⋮ ⋱ ⋮a1n a2n ⋯ ann
A AT
copy 2009 Charles E Leiserson 39
of a cilk_forloop execute in parallel
for (int j=0 jlti ++j) double temp = A[i][j]A[i][j] = A[j][i]A[j][i] = temp
Serial Semantics
int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_sync
( )
The C++ serialization of a Cilk++ program is always a legal interpretation of the
Cilk++ source
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
Serialization
return (x+y)
gprogramrsquos semantics
Remember Cilk++ keywords grant permission for parallel execution They do not command parallel execution
copy 2009 Charles E Leiserson 40
To obtain the serializationdefine cilk_for fordefine cilk_spawndefine cilk_sync
Or specify a switch to the Cilk++ compiler
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 11
Scheduling
∙The Cilk++ concurrency platform allows the programmer to express
t ti l ll li i
int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)potential parallelism in
an application∙The Cilk++ scheduler
maps the executing program onto the processor cores d i ll i
Memory IO
cilk_syncreturn (x+y)
copy 2009 Charles E Leiserson 41
dynamically at runtime∙Cilk++rsquos work-stealing
scheduler is provably efficient
Network
hellipPP P P$ $ $
Cilk++Compiler
Conventional Compiler
HyperobjectLibrary1
2 3int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn (x+y)
Cilk++ Platform
Cilk++source
BinaryBinary CilkscreenRace Detector
Linker
5
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
Serialization
CilkviewScalability Analyzer
6
copy 2009 Charles E Leiserson 42
Conventional Regression Tests
Reliable Single-Threaded Code
Exceptional Performance
Reliable Multi-Threaded Code
Parallel Regression Tests
Runtime System
4
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 43
Cilk++
bullRace Conditions
Race Bugs
Definition A determinacy race occurs when two logically parallel instructions access the same memory location and at least one ofsame memory location and at least one of the instructions performs a write
int x = 0cilk_for(int i=0 ilt2 ++i)
x++
A
B C x++
int x = 0
x++
A
B C
Example
copy 2009 Charles E Leiserson 44
x++assert(x == 2)
B C
D
x++
assert(x == 2)
x++B C
D
Dependency Graph
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 12
A Closer Look
1 2
x = 01
2 4int x = 0
A
r1 = x
r1++
x = r1
r2 = x
r2++
x = r2
2
3
4
5
67
x++
int x 0
assert(x == 2)
x++B C
D
copy 2009 Charles E Leiserson 45
assert(x == 2)8
x
r1
r2
0001 0 01111 1
Types of Races
Suppose that instruction A and instruction Bboth access a location x and suppose that A∥B (A is parallel to B)
A B Race Typeread read noneread write read racewrite read read race
A∥B (A is parallel to B)
copy 2009 Charles E Leiserson 46
write read read racewrite write write race
Two sections of code are independent if they have no determinacy races between them
Avoiding Racesbull Iterations of a cilk_for should be independentbull Between a cilk_spawn and the corresponding cilk_sync the code of the spawned child should be independent of the code of the parent includingbe independent of the code of the parent including code executed by additional spawned or called childrenNote The arguments to a spawned function are evaluated in the parent before the spawn occurs
bull Machine word size matters Watch out for races in packed data structures
copy 2009 Charles E Leiserson 47
packed data structures
structchar achar b
x
structchar achar b
x
Ex Updating xa and xb in parallel may cause a race Nasty because it may depend on the compiler optimization level (Safe on Intel)
Cilkscreen Race Detector
∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization the race detector guarantees to report and localize thedetector guarantees to report and localize the offending race
∙ Employs a regression-test methodology where the programmer provides test inputs
∙ Identifies filenames lines and variables involved in races including stack traces
copy 2009 Charles E Leiserson 48
g∙ Runs off the binary executable using dynamic
instrumentation ∙ Runs about 20 times slower than real-time
MIT OpenCourseWarehttpocwmitedu
6172 Performance Engineering of Software Systems Fall 2009
For information about citing these materials or our Terms of Use visit httpocwmiteduterms
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 4
x=3
Cache Coherence
copy 2009 Charles E Leiserson 13
hellipP P P
Load x x=3 x=5 x=3
MSI Protocol
Each cache line is labeled with a statebull M cache block has been modified No other
caches contain this block in M or S states
M x=13S y=17
bull S other caches may be sharing this blockbull I cache block is invalid (same as not there)
S y=17I x=4 I x=12
S y=17
copy 2009 Charles E Leiserson 14
I z=8 M z=7 I z=3
Before a cache modifies a location the hardware first invalidates all other copies
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 15
Cilk++
bullRace Conditions
Concurrency Platforms
bull Programming directly on processor cores is painful and error-prone
bull A concurrency platform abstracts processor cores handles synchronization and communication protocols and performs load balancing
bull Examplesbull Pthreads and WinAPI threads
copy 2009 Charles E Leiserson 16
Pthreads and WinAPI threadsbull Threading Building Blocks (TBB)bull OpenMPbull Cilk++
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 5
Fibonacci NumbersThe Fibonacci numbers are the sequence lang0 1 1 2 3 5 8 13 21 34 helliprang where each number is the sum of the previous two
The sequence is named after Leonardo di Pisa (1170ndash
RecurrenceF0 = 0F1 = 1Fn = Fnndash1 + Fnndash2 for n gt 1
copy 2009 Charles E Leiserson 17
1250 AD) also known as Fibonacci a contraction of filius Bonaccii mdashldquoson of Bonacciordquo Fibonaccirsquos 1202 book Liber Abaci introduced the sequence to Western mathematics although it had previously been discovered by Indian mathematicians
Fibonacci Program
include ltstdiohgtinclude ltstdlibhgt
int fib(int n)
DisclaimerThis recursive program is a poor
if (n lt 2) return nelse int x = fib(n-1)int y = fib(n-2)return x + y
int main(int argc char argv[])
way to compute the nth Fibonacci number but it provides a good didactic example
copy 2009 Charles E Leiserson 18
int main(int argc char argv[])int n = atoi(argv[1])int result = fib(n)printf(Fibonacci of d is dn n result)return 0
Fibonacci Execution

The call tree for fib(4):

fib(4)
├── fib(3)
│   ├── fib(2)
│   │   ├── fib(1)
│   │   └── fib(0)
│   └── fib(1)
└── fib(2)
    ├── fib(1)
    └── fib(0)

Key idea for parallelization: The calculations of fib(n-1) and fib(n-2) can be executed simultaneously without mutual interference.
Pthreads
∙ Standard API for threading, specified by ANSI/IEEE POSIX 1003.1c-1995.
∙ A do-it-yourself concurrency platform.
∙ Built as a library of functions with "special" non-C/C++ semantics.
∙ Each thread implements an abstraction of a processor; the threads are multiplexed onto the machine's resources.
∙ Threads communicate through shared memory.
∙ Library functions mask the protocols involved in interthread coordination.

(WinAPI threads provide similar functionality.)
Key Pthread Functions

int pthread_create(
  pthread_t *thread,           // returned identifier for the new thread
  const pthread_attr_t *attr,  // object to set thread attributes
                               //   (NULL for default)
  void *(*func)(void *),       // routine executed after creation
  void *arg                    // a single argument passed to func
);                             // returns error status

int pthread_join(
  pthread_t thread,            // identifier of thread to wait for
  void **status                // terminating thread's status
                               //   (NULL to ignore)
);                             // returns error status
Pthread Implementation

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int fib(int n) {                    // original code, unchanged
  if (n < 2) return n;
  else {
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
  }
}

typedef struct {                    // structure for thread arguments
  int input;
  int output;
} thread_args;

void *thread_func(void *ptr) {      // function called when thread is created
  int i = ((thread_args *) ptr)->input;
  ((thread_args *) ptr)->output = fib(i);
  return NULL;
}

int main(int argc, char *argv[]) {
  pthread_t thread;
  thread_args args;
  int status;
  int result;
  if (argc < 2) return 1;
  int n = atoi(argv[1]);
  if (n < 30) {                     // no point in creating a thread if
    result = fib(n);                //   there isn't enough work to do
  } else {
    args.input = n-1;               // marshal input argument to thread
    status = pthread_create(&thread,
                            NULL,   // default thread attributes
                            thread_func,
                            (void *) &args);  // create thread to execute fib(n-1)
    if (status != 0) return 1;
    result = fib(n-2);              // main executes fib(n-2) in parallel
    pthread_join(thread, NULL);     // block until auxiliary thread finishes
    result += args.output;          // add results to produce final output
  }
  printf("Fibonacci of %d is %d.\n", n, result);
  return 0;
}
Issues with Pthreads

Overhead: The cost of creating a thread exceeds 10^4 cycles, which forces coarse-grained concurrency. (Thread pools can help.)

Scalability: The Fibonacci code gets only about a 1.5x speedup on 2 cores, and it needs a rewrite to exploit more cores.

Modularity: The Fibonacci logic is no longer neatly encapsulated in the fib() function.

Code Simplicity: Programmers must marshal arguments (shades of 1958!) and engage in error-prone protocols in order to load-balance.
Threading Building Blocks
∙ Developed by Intel.
∙ Implemented as a C++ library that runs on top of native threads.
∙ The programmer specifies tasks rather than threads.
∙ Tasks are automatically load-balanced across the threads using work-stealing.
∙ Focus on performance.

[Image of book cover removed due to copyright restrictions.]
Reinders, James. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.
Fibonacci in TBB

A computation organized as explicit tasks. FibTask has an input parameter n and an output parameter sum.

class FibTask : public task {
public:
  const long n;
  long* const sum;
  FibTask( long n_, long* sum_ ) :
    n(n_), sum(sum_)
  {}
  task* execute() {     // performs the computation when the task starts
    if( n < 2 ) {
      *sum = n;
    } else {
      long x, y;
      // Recursively create two child tasks, a and b.
      FibTask& a = *new( allocate_child() ) FibTask(n-1, &x);
      FibTask& b = *new( allocate_child() ) FibTask(n-2, &y);
      // Set the number of tasks to wait for
      // (2 children + 1 implicit for bookkeeping).
      set_ref_count(3);
      spawn( b );                   // start task b
      spawn_and_wait_for_all( a );  // start task a; wait for both to finish
      *sum = x+y;
    }
    return NULL;
  }
};
Other TBB Features
∙ TBB provides many C++ templates to express common patterns simply, such as
  – parallel_for for loop parallelism,
  – parallel_reduce for data aggregation,
  – pipeline and filter for software pipelining.
∙ TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently.
∙ TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates.
OpenMP
∙ Specification produced by an industry consortium.
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
∙ Runs on top of native threads.
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism.
Fibonacci in OpenMP 3.0

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)  // compiler directive: the following
  x = fib(n - 1);             //   statement is an independent task
  #pragma omp task shared(y)  // sharing of memory is managed explicitly
  y = fib(n - 2);
  #pragma omp taskwait        // wait for the two tasks to complete
  return x+y;                 //   before continuing
}
Other OpenMP Features
∙ OpenMP provides many pragma directives to express common patterns, such as
  – parallel for for loop parallelism,
  – reduction for data aggregation,
  – directives for scheduling and data sharing.
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
Cilk++
∙ Small set of linguistic extensions to C++ to support fork-join parallelism.
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009.
∙ Based on the award-winning Cilk multithreaded language developed at MIT.
∙ Features a provably efficient work-stealing scheduler.
∙ Provides a hyperobject library for parallelizing code with global variables.
∙ Includes the Cilkscreen race detector and the Cilkview scalability analyzer.
Nested Parallelism in Cilk++

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);  // the named child function may execute
                            //   in parallel with the parent caller
  y = fib(n-2);
  cilk_sync;                // control cannot pass this point until
                            //   all spawned children have returned
  return x+y;
}

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
Loop Parallelism in Cilk++

Example: In-place matrix transpose. The n x n matrix A = (aij) is replaced by its transpose, whose (i,j) entry is aji. The indices run from 0, not 1.

cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

The iterations of a cilk_for loop execute in parallel.
Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics. Remember: Cilk++ keywords grant permission for parallel execution; they do not command parallel execution.

Cilk++ source:

int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x+y;
  }
}

Serialization:

int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = fib(n-1);
    y = fib(n-2);
    return x+y;
  }
}

To obtain the serialization:
#define cilk_for for
#define cilk_spawn
#define cilk_sync

Or, specify a switch to the Cilk++ compiler.
Scheduling
∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores of the chip multiprocessor dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.
Cilk++ Platform

∙ Cilk++ source is translated by the Cilk++ compiler, which uses a conventional compiler underneath, and linked with the hyperobject library and the runtime system to produce a binary.
∙ Running the binary on parallel regression tests, with the Cilkscreen race detector checking for races, yields reliable multithreaded code, and the Cilkview scalability analyzer measures how its performance will scale.
∙ The serialization of the same source, compiled conventionally and run through conventional regression tests, yields reliable single-threaded code.
Race Bugs

Definition: A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: statement A (int x = 0) precedes the two logically parallel increments B and C (x++), which both precede D (assert(x == 2)). B and C race on x.
A Closer Look

The statement x++ is not atomic. Each of the two parallel strands compiles it into a load, an increment, and a store:

  Strand B:  r1 = x;  r1++;  x = r1;
  Strand C:  r2 = x;  r2++;  x = r2;

One possible interleaving, after x = 0:

  1. r1 = x;   // r1 = 0
  2. r2 = x;   // r2 = 0
  3. r1++;     // r1 = 1
  4. r2++;     // r2 = 1
  5. x = r1;   // x = 1
  6. x = r2;   // x = 1

Both strands read x = 0 before either writes it back, so one increment is lost: the final value is x == 1, and the assert(x == 2) fails.
Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A      B      Race Type
  read   read   none
  read   write  read race
  write  read   read race
  write  write  write race

Two sections of code are independent if they have no determinacy races between them.
Avoiding Races
∙ Iterations of a cilk_for should be independent.
∙ Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. (Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.)
∙ Machine word size matters! Watch out for races in packed data structures.

  struct {
    char a;
    char b;
  } x;

  Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because whether a race occurs may depend on the compiler optimization level. (Safe on Intel x86, which can update the adjacent bytes independently.)
Cilkscreen Race Detector
∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real time.
MIT OpenCourseWare: http://ocw.mit.edu
6.172 Performance Engineering of Software Systems, Fall 2009
For information about citing these materials or our Terms of Use, visit http://ocw.mit.edu/terms.
Avoiding Racesbull Iterations of a cilk_for should be independentbull Between a cilk_spawn and the corresponding cilk_sync the code of the spawned child should be independent of the code of the parent includingbe independent of the code of the parent including code executed by additional spawned or called childrenNote The arguments to a spawned function are evaluated in the parent before the spawn occurs
bull Machine word size matters Watch out for races in packed data structures
copy 2009 Charles E Leiserson 47
packed data structures
structchar achar b
x
structchar achar b
x
Ex Updating xa and xb in parallel may cause a race Nasty because it may depend on the compiler optimization level (Safe on Intel)
Cilkscreen Race Detector
∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization the race detector guarantees to report and localize thedetector guarantees to report and localize the offending race
∙ Employs a regression-test methodology where the programmer provides test inputs
∙ Identifies filenames lines and variables involved in races including stack traces
copy 2009 Charles E Leiserson 48
g∙ Runs off the binary executable using dynamic
instrumentation ∙ Runs about 20 times slower than real-time
MIT OpenCourseWarehttpocwmitedu
6172 Performance Engineering of Software Systems Fall 2009
For information about citing these materials or our Terms of Use visit httpocwmiteduterms
Parallelism and Performance 2009-10-22
MIT 6.172 Lecture 11
Pthreads
∙ Standard API for threading, specified by ANSI/IEEE POSIX 1003.1-1995 (§1003.1c)
∙ Do-it-yourself concurrency platform
∙ Built as a library of functions with "special" non-C++ semantics
∙ Each thread implements an abstraction of a processor, which are multiplexed onto machine resources
∙ Threads communicate through shared memory
∙ Library functions mask the protocols involved in interthread coordination
* WinAPI threads provide similar functionality

© 2009 Charles E. Leiserson
Key Pthread Functions

int pthread_create(
  pthread_t *thread,           // returned identifier for the new thread
  const pthread_attr_t *attr,  // object to set thread attributes (NULL for default)
  void *(*func)(void *),       // routine executed after creation
  void *arg                    // a single argument passed to func
);                             // returns error status

int pthread_join(
  pthread_t thread,            // identifier of thread to wait for
  void **status                // terminating thread's status (NULL to ignore)
);                             // returns error status
Pthread Implementation

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Original code */
int fib(int n) {
  if (n < 2) return n;
  else {
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
  }
}

/* Structure for thread arguments */
typedef struct {
  int input;
  int output;
} thread_args;

/* Function called when thread is created */
void *thread_func(void *ptr) {
  int i = ((thread_args *) ptr)->input;
  ((thread_args *) ptr)->output = fib(i);
  return NULL;
}

int main(int argc, char *argv[]) {
  pthread_t thread;
  thread_args args;
  int status;
  int result;
  if (argc < 2) return 1;
  int n = atoi(argv[1]);
  if (n < 30) {               /* no point in creating a thread if there isn't enough to do */
    result = fib(n);
  } else {
    args.input = n-1;         /* marshal input argument to thread */
    /* create thread to execute fib(n-1) */
    status = pthread_create(&thread, NULL, thread_func, (void *) &args);
    /* main can continue executing: the main program executes fib(n-2) in parallel */
    result = fib(n-2);
    /* block until the auxiliary thread finishes */
    pthread_join(thread, NULL);
    /* add the results together to produce the final output */
    result += args.output;
  }
  printf("Fibonacci of %d is %d\n", n, result);
  return 0;
}
Issues with Pthreads

Overhead: The cost of creating a thread is >10^4 cycles ⇒ coarse-grained concurrency. (Thread pools can help.)
Scalability: The Fibonacci code gets about 1.5× speedup for 2 cores. Need a rewrite for more cores.
Modularity: The Fibonacci logic is no longer neatly encapsulated in the fib() function.
Code Simplicity: Programmers must marshal arguments (shades of 1958!) and engage in error-prone protocols in order to load-balance.
OUTLINE
• Shared-Memory Hardware
• Concurrency Platforms
  - Pthreads (and WinAPI Threads)
  - Threading Building Blocks
  - OpenMP
  - Cilk++
• Race Conditions
Threading Building Blocks
∙ Developed by Intel
∙ Implemented as a C++ library that runs on top of native threads
∙ Programmer specifies tasks rather than threads
∙ Tasks are automatically load-balanced across the threads using work-stealing
∙ Focus on performance

[Image of book cover removed due to copyright restrictions: Reinders, James. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.]
Fibonacci in TBB

/* A computation organized as explicit tasks. FibTask has an input parameter n
   and an output parameter sum. */
class FibTask : public task {
public:
  const long n;
  long* const sum;
  FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
  /* The execute() function performs the computation when the task is started. */
  task* execute() {
    if( n < 2 ) {
      *sum = n;
    } else {
      long x, y;
      /* Recursively create two child tasks, a and b. */
      FibTask& a = *new( allocate_child() ) FibTask(n-1, &x);
      FibTask& b = *new( allocate_child() ) FibTask(n-2, &y);
      /* Set the number of tasks to wait for (2 children + 1 implicit for bookkeeping). */
      set_ref_count(3);
      spawn( b );                   /* start task b */
      spawn_and_wait_for_all( a );  /* start task a, and wait for both a and b to finish */
      *sum = x+y;
    }
    return NULL;
  }
};
Other TBB Features
∙ TBB provides many C++ templates to express common patterns simply, such as
  - parallel_for: for loop parallelism
  - parallel_reduce: for data aggregation
  - pipeline and filter: for software pipelining
∙ TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently
∙ TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates
OpenMP
∙ Specification produced by an industry consortium
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas
∙ Runs on top of native threads
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism
Fibonacci in OpenMP 3.0

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)  /* compiler directive: the following statement is an independent task */
  x = fib(n - 1);
  #pragma omp task shared(y)  /* sharing of memory is managed explicitly */
  y = fib(n - 2);
  #pragma omp taskwait        /* wait for the two tasks to complete before continuing */
  return x+y;
}
Other OpenMP Features
∙ OpenMP provides many pragma directives to express common patterns, such as
  - parallel for: for loop parallelism
  - reduction: for data aggregation
  - directives for scheduling and data sharing
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks
Cilk++
∙ Small set of linguistic extensions to C++ to support fork-join parallelism
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009
∙ Based on the award-winning Cilk multithreaded language developed at MIT
∙ Features a provably efficient work-stealing scheduler
∙ Provides a hyperobject library for parallelizing code with global variables
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer
Nested Parallelism in Cilk++

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);  /* the named child function may execute in parallel with the parent caller */
  y = fib(n-2);
  cilk_sync;                /* control cannot pass this point until all spawned children have returned */
  return x+y;
}

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
Loop Parallelism in Cilk++

Example: in-place matrix transpose, A → Aᵀ. [Figure: an n×n matrix A with entries a11 … ann and its transpose with entries aji.]

// indices run from 0, not 1
cilk_for (int i=1; i<n; ++i) {
  for (int j=0; j<i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

The iterations of a cilk_for loop execute in parallel.
Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics. Remember: Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

Cilk++ source:

int fib(int n) {
  if (n<2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x+y;
  }
}

Serialization:

int fib(int n) {
  if (n<2) return n;
  else {
    int x, y;
    x = fib(n-1);
    y = fib(n-2);
    return x+y;
  }
}

To obtain the serialization:
#define cilk_for for
#define cilk_spawn
#define cilk_sync
Or specify a switch to the Cilk++ compiler.
Scheduling
∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.

[Figure: the fib code executing on a chip multiprocessor: processors with private caches connected by a network to memory and I/O.]
Cilk++ Platform

[Figure: The Cilk++ platform. Cilk++ source is processed by the Cilk++ compiler and linker, together with the hyperobject library, into a binary that runs on the runtime system; the Cilkscreen race detector and parallel regression tests yield reliable multi-threaded code with exceptional performance, and the Cilkview scalability analyzer measures scalability. The serialization, built with a conventional compiler and run through conventional regression tests, yields reliable single-threaded code.]
Race Bugs

Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

int x = 0;
cilk_for(int i=0; i<2; ++i) {
  x++;
}
assert(x == 2);

Dependency graph: A (int x = 0) is followed by two logically parallel strands B and C (each executing x++), which join before D (assert(x == 2)).
A Closer Look

The increment x++ is not atomic: it compiles into a load, an increment, and a store. Strands B and C each execute

  r = x;  r++;  x = r;

using private registers r1 and r2. One possible interleaving of the eight steps:

  1. x = 0;           (A)
  2. r1 = x;          (B loads 0)
  3. r2 = x;          (C loads 0, before B stores)
  4. r1++;            (r1 = 1)
  5. r2++;            (r2 = 1)
  6. x = r1;          (x = 1)
  7. x = r2;          (x = 1, overwriting B's store)
  8. assert(x == 2);  fails: the final value of x is 1, because one increment was lost.
Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

  A      B      Race Type
  read   read   none
  read   write  read race
  write  read   read race
  write  write  write race

Two sections of code are independent if they have no determinacy races between them.
Avoiding Races
• Iterations of a cilk_for should be independent.
• Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. (Note: The arguments to a spawned function are evaluated in the parent before the spawn occurs.)
• Machine word size matters. Watch out for races in packed data structures:

  struct {
    char a;
    char b;
  } x;

  Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because it may depend on the compiler optimization level. (Safe on Intel.)
Cilkscreen Race Detector
∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.
MIT OpenCourseWare
http://ocw.mit.edu

6.172 Performance Engineering of Software Systems, Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 7
Issues with Pthreads
Overhead The cost of creating a thread gt104
cycles rArr coarse-grained concurrency (Thread pools can help )(Thread pools can help)
Scalability Fibonacci code gets about 15 speedup for 2 cores Need a rewrite for more cores
Modularity The Fibonacci logic is no longer neatly encapsulated in the fib() function
copy 2009 Charles E Leiserson 25
Code Simplicity
Programmers must marshal arguments(shades of 1958 ) and engage in error-prone protocols in order to load-balance
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 26
Cilk++
bullRace Conditions
Threading Building Blocks
∙Developed by Intel∙ Implemented as a C++p
library that runs on top of native threads∙Programmer specifies
tasks rather than threads∙Tasks are automatically
l d b l d th
Image of book cover removed due to copyright restrictions
Reinders James Intel Threading Building Blocks Outfitting C++ for Multi-Core Processor
copy 2009 Charles E Leiserson 27
load balanced across the threads using work-stealing∙Focus on performance
Multi Core Processor Parallelism OReilly 2007
Fibonacci in TBB
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum ) n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
copy 2009 Charles E Leiserson 28
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 8
Fibonacci in TBB
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
class FibTask public task public
const long n long const sum FibTask( long n_ long sum_ )
n(n ) sum(sum )
A computation organized as explicit tasks
FibTask has an input parameter n and an output parameter sum
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
n(n_) sum(sum_)
task execute() if( n lt 2 )
sum = n else
long x y FibTaskamp a = new( allocate_child() )
FibTask(n-1ampx) FibTaskamp b = new( allocate_child() )
FibTask(n-2ampy) f (3)
The execute() function performs the
computation when the task is started
Recursively create two child tasks
copy 2009 Charles E Leiserson 29
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
set_ref_count(3) spawn( b ) spawn_and_wait_for_all( a ) sum = x+y
return NULL
child tasks a and b
Set the number of tasks to wait for (2 children + 1
implicit for bookkeeping)Start task b
Start task a and wait for both a and b to finish
Other TBB Features
∙TBB provides many C++ templates to express common patterns simply such as
parallel for for loop parallelismparallel_for for loop parallelismparallel_reduce for data aggregationpipeline and filter for software pipelining
∙TBB provides concurrent container classes which allow multiple threads to safely access and update items in the container concurrently
copy 2009 Charles E Leiserson 30
concurrently∙TBB also provides a variety of mutual-
exclusion library functions including locks and atomic updates
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 31
Cilk++
bullRace Conditions
OpenMP
∙Specification produced by an industry consortium∙Several compilers available both open-source
and proprietary including gcc and Visual Studio∙Linguistic extensions to CC++ or Fortran in
the form of compiler pragmas∙Runs on top of native threads
Supports loop parallelism and more recently
copy 2009 Charles E Leiserson 32
∙Supports loop parallelism and more recently in Version 30 task parallelism
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 9
Fibonacci in OpenMP 30
int fib(int n)
if (n lt 2) return nint x y
int fib(int n)
if (n lt 2) return nint x yint x y
pragma omp task shared(x)x = fib(n - 1)
pragma omp task shared(y)y = fib(n - 2)
pragma omp taskwaitreturn x+y
int x ypragma omp task shared(x)
x = fib(n - 1)pragma omp task shared(y)
y = fib(n - 2)pragma omp taskwait
return x+y
copy 2009 Charles E Leiserson 33
Fibonacci in OpenMP 30
int fib(int n)
if (n lt 2) return nint x y
int fib(int n)
if (n lt 2) return nint x y
Compiler directiveint x y
pragma omp task shared(x)x = fib(n - 1)
pragma omp task shared(y)y = fib(n - 2)
pragma omp taskwaitreturn x+y
int x ypragma omp task shared(x)
x = fib(n - 1)pragma omp task shared(y)
y = fib(n - 2)pragma omp taskwait
return x+y
The following statement is an
independent task
Sharing of memory is
copy 2009 Charles E Leiserson 34
managed explicitly
Wait for the two tasks to complete before continuing
Other OpenMP Features
∙OpenMP provides many pragma directivesto express common patterns such as
parallel for for loop parallelismparallel for for loop parallelismreduction for data aggregationdirectives for scheduling and data sharing
∙OpenMP provides a variety of synchroni-zation constructs such as barriers atomic updates and mutual-exclusion locks
copy 2009 Charles E Leiserson 35
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 36
Cilk++
bullRace Conditions
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 10
Cilk++
∙Small set of linguistic extensions to C++ to support fork-join parallelism∙Developed by CILK ARTS an MIT spin-off∙Developed by CILK ARTS an MIT spin-off
which was acquired by Intel in July 2009 ∙Based on the award-winning Cilk
multithreaded language developed at MIT∙Features a provably efficient work-stealing
scheduler
copy 2009 Charles E Leiserson 37
∙Provides a hyperobject library for parallelizing code with global variables∙ Includes the Cilkscreen race detector and
Cilkview scalability analyzer
Nested Parallelism in Cilk++
int fib(int n)
if (n lt 2) return n
int fib(int n)
if (n lt 2) return n
The named childfunction may execute in parallel with the
llif (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y
if (n lt 2) return nint x yx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn x+y
parent caller
Control cannot pass this Control cannot pass this point until all spawned h ld h d
copy 2009 Charles E Leiserson 38
children have returned
Cilk++ keywords grant permission for parallel execution They do not command parallel execution
Loop Parallelism in Cilk++
ExampleIn-place matrix
a11 a12 ⋯ a1n
a21 a22 ⋯ a2n
⋮ ⋮ ⋱ ⋮
a11 a21 ⋯ an1
a12 a22 ⋯ an2
⋮ ⋮ ⋱ ⋮
The iterations of a cilk for
indices run from 0 not 1cilk_for (int i=1 iltn ++i)
matrix transpose
⋮ ⋮ ⋱ ⋮an1 an2 ⋯ ann
⋮ ⋮ ⋱ ⋮a1n a2n ⋯ ann
A AT
copy 2009 Charles E Leiserson 39
of a cilk_forloop execute in parallel
for (int j=0 jlti ++j) double temp = A[i][j]A[i][j] = A[j][i]A[j][i] = temp
Serial Semantics
int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_sync
( )
The C++ serialization of a Cilk++ program is always a legal interpretation of the
Cilk++ source
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
Serialization
return (x+y)
gprogramrsquos semantics
Remember Cilk++ keywords grant permission for parallel execution They do not command parallel execution
copy 2009 Charles E Leiserson 40
To obtain the serializationdefine cilk_for fordefine cilk_spawndefine cilk_sync
Or specify a switch to the Cilk++ compiler
Parallelism and Performance 2009-10-22
MIT 6172 Lecture 11 11
Scheduling
∙The Cilk++ concurrency platform allows the programmer to express
t ti l ll li i
int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)potential parallelism in
an application∙The Cilk++ scheduler
maps the executing program onto the processor cores d i ll i
Memory IO
cilk_syncreturn (x+y)
copy 2009 Charles E Leiserson 41
dynamically at runtime∙Cilk++rsquos work-stealing
scheduler is provably efficient
Network
hellipPP P P$ $ $
Cilk++Compiler
Conventional Compiler
HyperobjectLibrary1
2 3int fib (int n) if (nlt2) return (n)else int xyx = cilk_spawn fib(n-1)y = fib(n-2)cilk_syncreturn (x+y)
Cilk++ Platform
Cilk++source
BinaryBinary CilkscreenRace Detector
Linker
5
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
int fib (int n) if (nlt2) return (n)else int xyx = fib(n-1)y = fib(n-2)return (x+y)
Serialization
CilkviewScalability Analyzer
6
copy 2009 Charles E Leiserson 42
Conventional Regression Tests
Reliable Single-Threaded Code
Exceptional Performance
Reliable Multi-Threaded Code
Parallel Regression Tests
Runtime System
4
OUTLINE
bullShared-Memory HardwarebullConcurrency Platforms
Pthreads (and WinAPI Threads)Threading Building BlocksOpenMPCilk++
copy 2009 Charles E Leiserson 43
Cilk++
bullRace Conditions
Race Bugs
Definition A determinacy race occurs when two logically parallel instructions access the same memory location and at least one ofsame memory location and at least one of the instructions performs a write
int x = 0cilk_for(int i=0 ilt2 ++i)
x++
A
B C x++
int x = 0
x++
A
B C
Example
copy 2009 Charles E Leiserson 44
x++assert(x == 2)
B C
D
x++
assert(x == 2)
x++B C
D
Dependency Graph
Parallelism and Performance (2009-10-22)
MIT 6.172, Lecture 11
Fibonacci in TBB

A computation organized as explicit tasks. FibTask has an input parameter n and an output parameter sum.

    class FibTask : public task {
    public:
        const long n;
        long* const sum;
        FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
        // The execute() function performs the computation when the task is started.
        task* execute() {
            if ( n < 2 ) {
                *sum = n;
            } else {
                long x, y;
                // Recursively create two child tasks, a and b.
                FibTask& a = *new( allocate_child() ) FibTask(n-1, &x);
                FibTask& b = *new( allocate_child() ) FibTask(n-2, &y);
                // Set the number of tasks to wait for (2 children + 1 implicit for bookkeeping).
                set_ref_count(3);
                spawn( b );                   // start task b
                spawn_and_wait_for_all( a );  // start task a and wait for both a and b to finish
                *sum = x + y;
            }
            return NULL;
        }
    };

© 2009 Charles E. Leiserson
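The same fork-join shape can be sketched in portable C++ with std::async. This is an analogy to the TBB tasks above, not TBB itself: one branch runs as a potentially parallel task while the caller computes the other branch directly.

```cpp
#include <future>

// Fork-join Fibonacci sketched with std::async (an analogy to the TBB
// FibTask above, not TBB). One branch is handed off as a potentially
// parallel task while the caller computes the other branch itself.
long fib(long n) {
    if (n < 2) return n;
    // "Spawn" fib(n-1); the default launch policy lets the runtime decide
    // whether to run it on another thread or defer it to the get() call.
    std::future<long> x = std::async(fib, n - 1);
    long y = fib(n - 2);   // the parent does the other half of the work
    return x.get() + y;    // get() plays the role of spawn_and_wait_for_all()
}
```

Unlike this sketch, TBB multiplexes tasks onto a fixed thread pool, which is why the slide's explicit reference counts exist.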
Other TBB Features

∙ TBB provides many C++ templates to express common patterns simply, such as
    parallel_for for loop parallelism,
    parallel_reduce for data aggregation,
    pipeline and filter for software pipelining.
∙ TBB provides concurrent container classes, which allow multiple threads to safely access and update items in the container concurrently.
∙ TBB also provides a variety of mutual-exclusion library functions, including locks and atomic updates.
OUTLINE

∙ Shared-Memory Hardware
∙ Concurrency Platforms
    Pthreads (and WinAPI Threads)
    Threading Building Blocks
    OpenMP
    Cilk++
∙ Race Conditions
OpenMP

∙ Specification produced by an industry consortium.
∙ Several compilers available, both open-source and proprietary, including gcc and Visual Studio.
∙ Linguistic extensions to C/C++ or Fortran in the form of compiler pragmas.
∙ Runs on top of native threads.
∙ Supports loop parallelism and, more recently in Version 3.0, task parallelism.
Fibonacci in OpenMP 3.0

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        #pragma omp task shared(x)   // compiler directive: the following statement is an independent task
        x = fib(n - 1);
        #pragma omp task shared(y)   // sharing of memory is managed explicitly
        y = fib(n - 2);
        #pragma omp taskwait         // wait for the two tasks to complete before continuing
        return x + y;
    }
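One consequence of the pragma-based design is worth noting: a compiler that does not enable OpenMP simply ignores the omp pragmas, so the task-parallel fib is also a correct serial program. A minimal sketch of that property (built without OpenMP support, it still computes Fibonacci):

```cpp
// Task-parallel Fibonacci as on the slide. Built WITHOUT OpenMP enabled,
// the pragmas below are ignored and the function runs serially with the
// same results, which makes serial debugging straightforward.
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)   // ignored by a non-OpenMP build
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait         // ignored as well; x and y are already set
    return x + y;
}
```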
Other OpenMP Features

∙ OpenMP provides many pragma directives to express common patterns, such as
    parallel for for loop parallelism,
    reduction for data aggregation,
    directives for scheduling and data sharing.
∙ OpenMP provides a variety of synchronization constructs, such as barriers, atomic updates, and mutual-exclusion locks.
Cilk++

∙ Small set of linguistic extensions to C++ to support fork-join parallelism.
∙ Developed by CILK ARTS, an MIT spin-off, which was acquired by Intel in July 2009.
∙ Based on the award-winning Cilk multithreaded language developed at MIT.
∙ Features a provably efficient work-stealing scheduler.
∙ Provides a hyperobject library for parallelizing code with global variables.
∙ Includes the Cilkscreen race detector and Cilkview scalability analyzer.
Nested Parallelism in Cilk++

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        x = cilk_spawn fib(n-1);  // the named child function may execute in parallel with the parent caller
        y = fib(n-2);
        cilk_sync;                // control cannot pass this point until all spawned children have returned
        return x + y;
    }

Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.
Loop Parallelism in Cilk++

Example: in-place matrix transpose A → Aᵀ, i.e., each entry a_ij is swapped with a_ji. The iterations of a cilk_for loop execute in parallel.

    // indices run from 0, not 1
    cilk_for (int i = 1; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }
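A quick way to check the transpose logic is to compile its serialization: with cilk_for defined as a plain for, the loop runs serially but computes the same in-place result. A sketch with a fixed 3×3 matrix (N and transpose_ok are names introduced here for illustration):

```cpp
#include <cstddef>

// Serialization of the cilk_for transpose: a plain for loop computes the
// same in-place result. N and transpose_ok are illustrative names.
#define cilk_for for

const std::size_t N = 3;

void transpose(double A[N][N]) {
    // indices run from 0, not 1: row 0 has no entries below the diagonal
    cilk_for (std::size_t i = 1; i < N; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }
}

// Self-check: transpose a known matrix and verify a few entries.
bool transpose_ok() {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    transpose(A);
    return A[0][1] == 4 && A[1][0] == 2 && A[0][2] == 7 && A[2][1] == 6;
}
```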
Serial Semantics

The C++ serialization of a Cilk++ program is always a legal interpretation of the program's semantics. Remember: Cilk++ keywords grant permission for parallel execution. They do not command parallel execution.

Cilk++ source:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Serialization:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

To obtain the serialization:

    #define cilk_for for
    #define cilk_spawn
    #define cilk_sync

Or specify a switch to the Cilk++ compiler.
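The preprocessor trick can be checked directly: with the empty definitions below, the Cilk++ fib compiles as ordinary serial C++ and computes the same values.

```cpp
// Serialization via the preprocessor, as on the slide: cilk_spawn vanishes
// and "cilk_sync;" becomes an empty statement, leaving plain serial C++.
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n - 1);  // preprocesses to: x = fib(n - 1);
    y = fib(n - 2);
    cilk_sync;                  // preprocesses to the empty statement ;
    return x + y;
}
```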
Scheduling

∙ The Cilk++ concurrency platform allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps the executing program onto the processor cores dynamically at runtime.
∙ Cilk++'s work-stealing scheduler is provably efficient.

[Diagram: the fib code mapped onto a chip multiprocessor: processors P with caches $, connected by a network to memory and I/O.]
Cilk++ Platform

[Diagram: the Cilk++ tool flow, with fib as the running example.]

∙ Cilk++ source is translated by the Cilk++ compiler (built on a conventional compiler) and linked with the hyperobject library and runtime system to produce a binary.
∙ Parallel regression tests and the Cilkscreen race detector check the binary, yielding reliable multi-threaded code.
∙ The serialization of the same source, compiled conventionally, is checked by conventional regression tests, yielding reliable single-threaded code.
∙ The Cilkview scalability analyzer assesses whether the code will deliver exceptional performance.
Race Bugs

Definition. A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

Example:

    int x = 0;
    cilk_for (int i = 0; i < 2; ++i) {
        x++;
    }
    assert(x == 2);

Dependency graph: A (int x = 0) precedes the two parallel strands B and C (each executing x++), which both precede D (assert(x == 2)).
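For contrast, one standard repair, sketched here with plain C++11 threads rather than Cilk++, is to make the conflicting update atomic, so the two logically parallel increments serialize at the hardware level:

```cpp
#include <atomic>
#include <thread>

// The racy example repaired with std::atomic (a plain-threads sketch,
// not Cilk++): each increment is an atomic read-modify-write, so the
// final value is 2 under every possible scheduling.
int parallel_increments() {
    std::atomic<int> x(0);          // A: x = 0
    std::thread b([&x] { x++; });   // strand B
    std::thread c([&x] { x++; });   // strand C
    b.join();
    c.join();
    return x.load();                // D expects 2, and now always gets it
}
```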
A Closer Look

The statement x++ is not a single instruction. Each strand loads x into a register, increments the register, and stores the result back:

    A:  int x = 0;
    B:  r1 = x;   r1++;   x = r1;
    C:  r2 = x;   r2++;   x = r2;
    D:  assert(x == 2);

One interleaving that breaks the assertion: B loads x (r1 = 0), C loads x (r2 = 0), each strand increments its register to 1, and both store 1 back, so x ends up 1 rather than 2.
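The claim that only 1 and 2 can result is easy to verify exhaustively: there are C(6,3) = 20 interleavings of the two load/increment/store sequences that preserve each strand's order, and simulating all of them yields exactly the final values {1, 2}. A sketch (possible_finals is a name introduced here):

```cpp
#include <set>

// Enumerate every interleaving of the two strands' load/increment/store
// sequences (order within each strand preserved) and collect the final
// values of x. Only 1 and 2 are reachable.
std::set<int> possible_finals() {
    std::set<int> finals;
    for (int mask = 0; mask < 64; ++mask) {   // bit k set => step k belongs to strand B
        int bits = 0;
        for (int k = 0; k < 6; ++k) bits += (mask >> k) & 1;
        if (bits != 3) continue;              // each strand takes exactly 3 steps
        int x = 0, r1 = 0, r2 = 0, ib = 0, ic = 0;
        for (int k = 0; k < 6; ++k) {
            if (mask & (1 << k)) {            // strand B: r1 = x; r1++; x = r1;
                if (ib == 0) r1 = x; else if (ib == 1) r1++; else x = r1;
                ++ib;
            } else {                          // strand C: r2 = x; r2++; x = r2;
                if (ic == 0) r2 = x; else if (ic == 1) r2++; else x = r2;
                ++ic;
            }
        }
        finals.insert(x);
    }
    return finals;
}
```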
Types of Races

Suppose that instruction A and instruction B both access a location x, and suppose that A∥B (A is parallel to B).

    A      B      Race Type
    read   read   none
    read   write  read race
    write  read   read race
    write  write  write race

Two sections of code are independent if they have no determinacy races between them.
Avoiding Races

∙ Iterations of a cilk_for should be independent.
∙ Between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including code executed by additional spawned or called children. Note: the arguments to a spawned function are evaluated in the parent before the spawn occurs.
∙ Machine word size matters. Watch out for races in packed data structures:

    struct {
        char a;
        char b;
    } x;

Ex. Updating x.a and x.b in parallel may cause a race. Nasty, because it may depend on the compiler optimization level. (Safe on Intel.)
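The danger comes from layout: a and b occupy adjacent bytes of the same machine word, so a compiler targeting hardware without byte-granularity stores may update x.a with a read-modify-write of the whole word, silently touching x.b. A sketch that just demonstrates the adjacency (Packed and field_distance are illustrative names):

```cpp
#include <cstddef>

// x.a and x.b are adjacent bytes sharing one machine word, which is why
// a word-sized read-modify-write of one field can clobber the other.
// Packed and field_distance are illustrative names.
struct Packed {
    char a;
    char b;
};

std::size_t field_distance() {
    // offsetof is well-defined for this standard-layout struct.
    return offsetof(Packed, b) - offsetof(Packed, a);
}
```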
Cilkscreen Race Detector

∙ If an ostensibly deterministic Cilk++ program run on a given input could possibly behave any differently than its serialization, the race detector guarantees to report and localize the offending race.
∙ Employs a regression-test methodology, where the programmer provides test inputs.
∙ Identifies filenames, lines, and variables involved in races, including stack traces.
∙ Runs off the binary executable, using dynamic instrumentation.
∙ Runs about 20 times slower than real-time.
MIT OpenCourseWare, http://ocw.mit.edu
6.172 Performance Engineering of Software Systems, Fall 2009
For information about citing these materials or our Terms of Use, visit http://ocw.mit.edu/terms.