TRANSCRIPT
![Page 1](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/1.jpg)
Final Exam
12:00 – 13:20, December 14 (Monday), 2009
#330110 (odd student id), #330118 (even student id)
Scope:
• Everything
Closed-book exam
Final exam scores will be posted on the lecture homepage
ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])
![Page 2](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/2.jpg)
Parallel Programming
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
![Page 3](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/3.jpg)
Challenges
Difficult to write parallel programs
• Most programmers think sequentially
• Performance vs. correctness tradeoffs
• Missing good parallel abstractions
Automatic parallelization by compilers
• Works with some applications: loop parallelism, reduction
• Unclear how to apply it to other complex applications
![Page 4](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/4.jpg)
Concurrent vs. Parallel
Concurrent program
• A program containing two or more processes (threads)
Parallel program
• A concurrent program in which each process (thread) executes on its own processor, and hence the processes (threads) execute in parallel
Concurrent programming has been around for a while, so why do we bother?
• Logical parallelism (GUIs, asynchronous events, …) vs. physical parallelism (for performance)
![Page 5](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/5.jpg)
Think Parallel or Perish
The free lunch is over!
![Page 6](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/6.jpg)
Parallel Prog. Models (1)
Shared address space programming model
• Single address space for all CPUs
• Communication through regular load/store (implicit)
• Synchronization using locks and barriers (explicit)
• Ease of programming
• Complex hardware for cache coherence
• Thread APIs (Pthreads, …)
• OpenMP
• Intel TBB (Threading Building Blocks)
![Page 7](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/7.jpg)
Parallel Prog. Models (2)
Message passing programming model
• Private address space per CPU
• Communication through message send/receive over a network interface (explicit)
• Synchronization using blocking messages (implicit)
• Need to program explicit communication
• Simple hardware (no cache-coherence support needed in hardware)
• RPC (Remote Procedure Calls)
• PVM (obsolete)
• MPI (de facto standard)
![Page 8](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/8.jpg)
Parallel Prog. Models (3)
Parallel programming models vs. parallel computer architectures

| Prog. Models \ Architectures | Shared-memory | Distributed-memory |
| --- | --- | --- |
| Single Address Space | Pthreads, OpenMP | CC-NUMA, Software DSM |
| Message Passing | Multi-processes | MPI |
![Page 9](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/9.jpg)
Speedup (1)
Amdahl's Law

    Speedup = 1 / ((1 - P) + P/n)

• S = (1 - P): the fraction of time spent executing the serial portion
• n: the number of processor cores
• A theoretical basis by which the speedup of parallel computations can be estimated.
• The theoretical upper limit of speedup is limited by the serial portion of the code.
![Page 10](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/10.jpg)
Speedup (2)
Compare two improvements, starting from a 20% serial portion (P = 0.8) on 8 cores (speedup 3.33):
• Reducing the serial portion (20% → 10%): speedup = 1 / (0.1 + 0.9/8) ≈ 4.71
• Doubling the number of cores (8 → 16): speedup = 1 / (0.2 + 0.8/16) = 4.0
Reducing the serial portion helps more than doubling the number of cores.
![Page 11](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/11.jpg)
Speedup (3)
Speedup in reality

    Practical Speedup = 1 / ((1 - P) + P/n + H(n))

• H(n): overhead
  – Thread management & scheduling
  – Communication & synchronization
  – Load balancing
  – Extra computation
  – Operating system overhead
For better speedup,
• Parallelize more (increase P)
• Parallelize effectively (reduce H(n))
![Page 12](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/12.jpg)
Calculating Pi (1)

    tan(π/4) = 1, so atan(1) = π/4

    ∫₀¹ 4 / (1 + x²) dx = 4 · (atan(1) - atan(0)) = 4 · (π/4 - 0) = π

(Figure: the curves y = atan(x) and y = 1 / (1 + x²) on [0, 1].)
![Page 13](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/13.jpg)
Calculating Pi (2)
Pi: Sequential version

```c
#define N    20000000
#define STEP (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}
```

Each term samples 4 / (1 + x²) at the interval midpoint x = (i + 0.5)/N, so the sum approximates π ≈ (1/N) · Σ (i = 0 .. N-1) 4 / (1 + ((i + 0.5)/N)²).
![Page 14](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/14.jpg)
Threads API (1)
Threads are the natural unit of execution for parallel programs on shared-memory hardware
• Windows threading API
• POSIX threads (Pthreads)
Threads have their own
• Execution context
• Stack
Threads share
• Address space (code, data, heap)
• Resources (PID, open files, etc.)
![Page 15](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/15.jpg)
Threads API (2)
Pi: Pthreads version

```c
#include <pthread.h>

#define N        20000000
#define STEP     (1.0 / (double) N)
#define NTHREADS 8

double pi = 0.0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *compute (void *arg)
{
    int i;
    double x;
    double sum = 0.0;
    int id = (int) (long) arg;    /* thread index passed through the arg */

    for (i = id; i < N; i += NTHREADS) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    pthread_mutex_lock (&lock);   /* pi is shared: update under the lock */
    pi += (sum * STEP);
    pthread_mutex_unlock (&lock);
    return NULL;
}

int main ()
{
    int i;
    pthread_t tid[NTHREADS];

    for (i = 0; i < NTHREADS; i++)
        pthread_create (&tid[i], NULL, compute, (void *) (long) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join (tid[i], NULL);
    return 0;
}
```
![Page 16](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/16.jpg)
Threads API (3)
Pros
• Flexible and widely available
• The thread library gives you detailed control over the threads
• Performance can be good
Cons
• YOU have to take detailed control over the threads
• Relatively low level of abstraction
• No easy code migration path from a sequential program
• Lack of structure makes it error-prone
![Page 17](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/17.jpg)
OpenMP (1)
What is OpenMP?
• A set of compiler directives and library routines for portable parallel programming on shared-memory systems
  – Fortran 77, Fortran 90, C, and C++
  – Multi-vendor support, for both Unix and Windows
• Standardizes loop-level (data) parallelism
• Combines serial and parallel code in a single source
• Incremental approach to parallelization
• http://www.openmp.org, http://www.compunity.org
  – Standard documents, tutorials, sample codes
![Page 18](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/18.jpg)
OpenMP (2)
Fork-join model
• Master thread spawns a team of threads as needed
• Synchronize when leaving a parallel region (join)
• Only the master executes the sequential part
![Page 19](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/19.jpg)
OpenMP (3)
Parallel construct

```c
#pragma omp parallel
{
    task();   /* executed by every thread in the team */
}
```
![Page 20](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/20.jpg)
OpenMP (4)
Work-sharing construct: sections

```c
#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}
```
![Page 21](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/21.jpg)
OpenMP (5)
Work-sharing construct: loop

```c
#pragma omp parallel for
for (i = 0; i < 12; i++)
    c[i] = a[i] + b[i];
```

Equivalently, a `for` construct inside a parallel region:

```c
#pragma omp parallel
{
    task ();
    #pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];
}
```
![Page 22](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/22.jpg)
OpenMP (6)
Pi: OpenMP version

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N    20000000
#define STEP (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

int main ()
{
    double pi;
    pi = compute();
}
```
![Page 23](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/23.jpg)
OpenMP (7)
Environment
• Dual quad-core Xeon (8 cores), 2.4 GHz
• Intel OpenMP compiler
(Figure: speedup of Pi (OpenMP) vs. the number of cores (1-8) for N = 2e6, N = 2e7, and N = 2e8, plotted against the ideal speedup.)
![Page 24](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/24.jpg)
OpenMP (8)
Pros
• Simple and portable
• Single source code for both serial and parallel versions
• Incremental parallelization
Cons
• Primarily for bounded loops over built-in types
  – Only for Fortran-style C/C++ programs
  – Pointer-chasing loops?
• Performance may be degraded due to
  – Synchronization overhead
  – Load imbalance
  – Thread management overhead
![Page 25](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/25.jpg)
MPI (1)
Message Passing Interface
• Library standard defined by a committee of vendors, implementers, and parallel programmers
• Used to create parallel programs based on message passing
• Available on almost all parallel machines in C/Fortran
• Two phases
  – MPI-1: traditional message passing
  – MPI-2: remote memory, parallel I/O, dynamic processes
• Tasks are typically created when the jobs are launched
• http://www.mpi-forum.org
![Page 26](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/26.jpg)
MPI (2)
Point-to-point communication
• Sending and receiving messages

Process 0 sends the four ints in A[4] to process 1 with tag 1000:

```c
MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);
```

Process 1 receives them into B[4]:

```c
MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status);
```
![Page 27](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/27.jpg)
MPI (3)
Collective communication
• Broadcast: the root's array is copied to every process's array
  – MPI_Bcast (array, 100, MPI_INT, 0, comm);
• Reduce: each process's localsum is combined (here, summed) into sum on the root
  – MPI_Reduce (&localsum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
![Page 28](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/28.jpg)
MPI (4)
Pi: MPI version

```c
#include "mpi.h"
#include <stdio.h>

#define N    20000000
#define STEP (1.0 / (double) N)
#define f(x) (4.0 / (1.0 + (x)*(x)))

int main (int argc, char *argv[])
{
    int i, numprocs, myid, start, end;
    double sum = 0.0, mypi, pi;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);

    /* split the 1..N index range evenly across the processes */
    start = (N / numprocs) * myid + 1;
    end = (myid == numprocs - 1) ? N : (N / numprocs) * (myid + 1);

    for (i = start; i <= end; i++)
        sum += f(STEP * ((double) i - 0.5));
    mypi = STEP * sum;

    MPI_Reduce (&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize ();
}
```
![Page 29](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/29.jpg)
MPI (5)
Pros
• Runs on either shared- or distributed-memory architectures
• Can be used on a wider range of problems
• Distributed-memory computers are less expensive than large shared-memory computers
Cons
• Requires more programming changes to go from a serial to a parallel version
• Can be harder to debug
• Performance is limited by the communication network between the nodes
![Page 30](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/30.jpg)
Parallel Benchmarks (1)
Linpack
• Matrix linear algebra
• Basis for measuring the "Top500 Supercomputing Sites" (http://www.top500.org)
SPECrate
• Parallel run of SPEC CPU programs
• Job-level (or task-level) parallelism
SPLASH
• Stanford Parallel Applications for Shared Memory
• Mix of kernels and applications
• Strong scaling: keep the problem size fixed
![Page 31](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/31.jpg)
Parallel Benchmarks (2)
NAS parallel benchmark suite
• By NASA Advanced Supercomputing (NAS)
• Computational fluid dynamics kernels
PARSEC suite
• Princeton Application Repository for Shared-Memory Computers
• Multithreaded applications using Pthreads and OpenMP
![Page 32](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/32.jpg)
Cache Coherence (1)
Examples
• CPU0 reads X (= 0 in memory) and so does CPU1; CPU0 then writes X = 1 in its cache. Should CPU1's next read of X return its stale cached 0 or the new 1?
  → Invalidate all copies before allowing a write to proceed.
• CPU0 writes X = 1 in its cache while CPU1 writes X = 2 in its cache; both copies are modified. Which value should be flushed to memory?
  → Disallow more than one modified copy.
![Page 33](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/33.jpg)
Cache Coherence (2)
Cache coherence protocol
• A set of rules to maintain the consistency of data stored in the local caches as well as in main memory
• Ensures multiple read-only copies and an exclusive modified copy
• Types
  – Snooping-based (or "snoopy")
  – Directory-based
• Write strategies
  – Invalidation-based
  – Update-based
![Page 34](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/34.jpg)
Cache Coherence (3)
Snooping protocol
• All cache controllers monitor (or snoop) on the bus
  – Send all requests for data to all processors
  – Processors snoop to see if they have a shared block
  – Requires broadcast, since caching info resides at the processors
  – Works well with a bus (natural broadcast)
  – Dominates for small-scale machines
• Cache coherence unit
  – The cache block (line) is the unit of management
  – False sharing is possible: two processors share the same cache line but not the actual word
  – Coherence miss: an invalidation can cause a miss for data read before
![Page 35](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/35.jpg)
Cache Coherence (4)
MESI protocol: invalidation-based
• Modified (M)
  – Cache has the only copy and the copy is modified
  – Memory is not up-to-date
• Exclusive (E)
  – Cache has the only copy and the copy is unmodified (clean)
  – Memory is up-to-date
• Shared (S)
  – Copies may exist in other caches and all are unmodified
  – Memory is up-to-date
• Invalid (I)
  – Not in cache
![Page 36](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/36.jpg)
Cache Coherence (5)
MESI protocol state machine (transitions listed as "event / bus action"):
• I → E: CPU Read / Bus Read, S-signal off
• I → S: CPU Read / Bus Read, S-signal on
• I → M: CPU Write / Bus ReadX
• E → M: CPU Write / - (no bus traffic; invalidate is not needed)
• S → M: CPU Write / Bus ReadX
• E → S: Bus Read / Bus S-signal on
• M → S: Bus Read / Bus S-signal on (flush)
• M → I: Bus ReadX / Bus WriteBack (flush)
• E → I, S → I: Bus ReadX / -
• M, E, S → same state: CPU Read / - (no bus traffic)
• M → M: CPU Write / - (no bus traffic)
If a cache miss occurs, the cache will write back the modified block.
![Page 37](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/37.jpg)
Cache Coherence (6)
Examples revisited
• CPU0 reads X: X = 0 (E). CPU1 reads X: both copies become X = 0 (S). CPU0 writes X = 1: CPU0 holds X = 1 (M) and CPU1's copy becomes (I). CPU1 reads X again: CPU0 flushes X, and both hold X = 1 (S).
  → Invalidate all copies before allowing a write to proceed.
• CPU0 writes X = 1: X = 1 (M) in CPU0. CPU1 writes X = 2: CPU0 flushes X and its copy becomes (I), while CPU1 holds X = 2 (M); the final flush writes X = 2 to memory.
  → Disallow more than one modified copy.
![Page 38](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/38.jpg)
Cache Coherence (7)
False sharing

```c
struct thread_stat {
    int  count;
    char pad[PADS];
} stat[8];

double compute ()
{
    int i;
    double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        int id = omp_get_thread_num();
        stat[id].count++;          /* per-thread counter, but adjacent in memory */
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}
```

The per-thread counters stat[0] … stat[7] sit next to each other, so without padding several of them share a cache line.
(Figure: speedup of Pi (OpenMP, N = 2e7) vs. the number of cores (1-8) for Pad = 0, 28, 60, and 124 bytes; larger padding avoids false sharing and scales better.)
![Page 39](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/39.jpg)
Cache-Aware Programming (1)
Pack data more tightly
• Usually, compilers do not pack structures

```c
struct Loose {        /* layout: s (pad) | i | c (pad) | p */
    short s;
    int i;
    char c;
    Foo* p;
};

struct Tight {        /* layout: p | i | s | c (pad) */
    Foo* p;
    int i;
    short s;
    char c;
};
```
![Page 40](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/40.jpg)
Cache-Aware Programming (2)
Work in the cache
• Reduce the working set
• Divide a problem into smaller subproblems
• Reorder steps in the code
Read-only data
• Separate read-only data from read-write data
• Annotate constants with the const keyword
• Read-only data are stored in separate sections
• If possible, separate read-mostly variables as well
![Page 41](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/41.jpg)
Cache-Aware Programming (3)
Read-write data
• Group read-write variables which are used together into a structure
• Move read-write variables which are often written to by different threads onto their own cache lines
  – Reduces false sharing
  – May require padding
• If a variable is used by multiple threads, but every use is independent, move it into Thread Local Storage (TLS)
![Page 42](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/42.jpg)
Cache-Aware Programming (4)
Lock variables
• Keep a lock and the data it protects on distinct cache lines.
• If a lock protects data that is frequently uncontended, try to keep the lock and the data on the same cache line.
![Page 43](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/43.jpg)
Cache-Aware Programming (5)
Thread affinity
• Avoid moving a thread from one core to another
  – Reduces context switching
  – Reduces cache misses
  – Reduces TLB misses
• Bind a process
  – sched_setaffinity() / sched_getaffinity()
• Bind a thread
  – pthread_setaffinity_np() / pthread_getaffinity_np()
  – pthread_attr_setaffinity_np() / pthread_attr_getaffinity_np()
![Page 44](https://reader035.vdocuments.us/reader035/viewer/2022062416/60fa0117a3074714d165de9c/html5/thumbnails/44.jpg)
Summary: Why Hard?
Impediments to parallel computing
• Parallelization
  – Parallel algorithms which maximize concurrency
  – Eliminating the serial portion as much as possible
  – Lack of standardized APIs, environments, and tools
• Correct parallelization
  – Shared resource identification
  – Difficult to debug (data races, deadlock, …)
  – Memory consistency
• Effective parallelization
  – Communication & synchronization overhead
  – Problem decomposition, granularity, load imbalance, affinity, …
  – Architecture dependency