Final Exam / Parallel Programming - csl.skku.edu/uploads/ICE3003F09/23-pprog.pdf


Page 1: Final Exam

• 12:00 – 13:20, December 14 (Monday), 2009
• #330110 (odd student id), #330118 (even student id)
• Scope: everything
• Closed-book exam
• Final exam scores will be posted on the lecture homepage

ICE3003: Computer Architecture | Fall 2009 | Jin-Soo Kim ([email protected])

Page 2: Parallel Programming

Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu

Page 3: Challenges

Difficult to write parallel programs
• Most programmers think sequentially
• Performance vs. correctness tradeoffs
• Missing good parallel abstractions

Automatic parallelization by compilers
• Works with some applications: loop parallelism, reduction
• Unclear how it can be applied to other, more complex applications

Page 4: Concurrent vs. Parallel

Concurrent program
• A program containing two or more processes (threads)

Parallel program
• A concurrent program in which each process (thread) executes on its own processor, and hence the processes (threads) execute in parallel

Concurrent programming has been around for a while, so why do we bother?
• Logical parallelism (GUIs, asynchronous events, …) vs. physical parallelism (for performance)

Page 5: Think Parallel or Perish

The free lunch is over!

Page 6: Parallel Prog. Models (1)

Shared address space programming model
• Single address space for all CPUs
• Communication through regular load/store (implicit)
• Synchronization using locks and barriers (explicit); see the sketch below
• Ease of programming
• Complex hardware for cache coherence

• Thread APIs (Pthreads, …)
• OpenMP
• Intel TBB (Thread Building Blocks)
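As a concrete illustration (mine, not from the slides): two threads communicate through an ordinary shared variable via plain loads/stores, while a mutex and a barrier supply the explicit synchronization. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared: single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker (void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock (&lock);      /* explicit lock */
        counter++;                       /* implicit communication */
        pthread_mutex_unlock (&lock);
    }
    pthread_barrier_wait (&barrier);     /* explicit barrier */
    return NULL;
}

int main (void)
{
    pthread_t t[2];
    pthread_barrier_init (&barrier, NULL, 2);
    for (int i = 0; i < 2; i++)
        pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join (t[i], NULL);
    printf ("counter = %ld\n", counter); /* always 2000000 */
    return 0;
}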


Page 7: Parallel Prog. Models (2)

Message passing programming model
• Private address space per CPU
• Communication through message send/receive over a network interface (explicit)
• Synchronization using blocking messages (implicit)
• Need to program explicit communication
• Simple hardware
  – No cache-coherence supporting hardware

• RPC (Remote Procedure Calls)
• PVM (obsolete)
• MPI (de facto standard)

Page 8: Parallel Prog. Models (3)

Parallel programming models vs. parallel computer architectures:

                                  Parallel Architectures
  Parallel Prog. Models       Shared-memory       Distributed-memory
  Single Address Space        Pthreads, OpenMP    (CC-)NUMA, Software DSM
  Message Passing             Multi-processes     MPI

Page 9: Speedup (1)

Amdahl's Law

  Speedup = 1 / ((1 - P) + P/n)

• S = (1 - P): the fraction of time spent executing the serial portion
• n: the number of processor cores
• A theoretical basis by which the speedup of parallel computations can be estimated
• The theoretical upper limit of speedup is set by the serial portion of the code: even as n → ∞, the speedup approaches only 1 / (1 - P)

Page 10: Speedup (2)

Reducing the serial portion (20% → 10%)
Doubling the number of cores (8 → 16)
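Plugging numbers into Amdahl's Law makes the comparison concrete (a worked example; the baseline of P = 0.8 on n = 8 cores is an assumption, not from the slide):

  Baseline (P = 0.8, n = 8):        1 / (0.2 + 0.8/8)  = 1 / 0.3    ≈ 3.33
  Double the cores (n = 16):        1 / (0.2 + 0.8/16) = 1 / 0.25   = 4.00
  Halve the serial part (P = 0.9):  1 / (0.1 + 0.9/8)  = 1 / 0.2125 ≈ 4.71

Under these assumptions, shrinking the serial portion buys more speedup than doubling the core count.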


Page 11: Speedup (3)

Speedup in reality

  Practical Speedup = 1 / ((1 - P) + P/n + H(n))

• H(n): overhead
  – Thread management & scheduling
  – Communication & synchronization
  – Load balancing
  – Extra computation
  – Operating system overhead

For better speedup:
• Parallelize more (increase P)
• Parallelize effectively (reduce H(n))

Page 12: Calculating Pi (1)

Since tan(π/4) = 1, we have atan(1) = π/4. Hence

  4 ∫_0^1 1 / (1 + x²) dx = 4 (atan(1) - atan(0)) = 4 (π/4 - 0) = π

[Figure: plots of y = 1/(1+x²) and its antiderivative y = atan(x) on [0, 1].]

Page 13: Calculating Pi (2)

Pi: Sequential version

The loop evaluates the midpoint Riemann sum  π ≈ (1/N) Σ 4 / (1 + ((i+0.5)/N)²)  for i = 0 … N-1:

#define N     20000000
#define STEP  (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

Page 14: Threads API (1)

Threads are the natural unit of execution for parallel programs on shared-memory hardware
• Windows threading API
• POSIX threads (Pthreads)

Threads have their own
• Execution context
• Stack

Threads share
• Address space (code, data, heap)
• Resources (PID, open files, etc.)

Page 15: Threads API (2)

Pi: Pthreads version

#include <pthread.h>

#define N         20000000
#define STEP      (1.0 / (double) N)
#define NTHREADS  8

double pi = 0.0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *compute (void *arg)
{
    int i;
    double x;
    double sum = 0.0;
    int id = (int) arg;              /* thread index, passed by value */

    for (i = id; i < N; i += NTHREADS) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    pthread_mutex_lock (&lock);      /* protect the shared accumulator */
    pi += (sum * STEP);
    pthread_mutex_unlock (&lock);
    return NULL;
}

int main ()
{
    int i;
    pthread_t tid[NTHREADS];

    for (i = 0; i < NTHREADS; i++)
        pthread_create (&tid[i], NULL, compute, (void *) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join (tid[i], NULL);
    return 0;
}

Page 16: Threads API (3)

Pros
• Flexible and widely available
• The thread library gives you detailed control over the threads
• Performance can be good

Cons
• YOU have to take detailed control over the threads
• Relatively low level of abstraction
• No easy code migration path from a sequential program
• Lack of structure makes it error-prone

Page 17: OpenMP (1)

What is OpenMP?
• A set of compiler directives and library routines for portable parallel programming on shared-memory systems
  – Fortran 77, Fortran 90, C, and C++
  – Multi-vendor support, for both Unix and Windows
• Standardizes loop-level (data) parallelism
• Combines serial and parallel code in a single source
• Incremental approach to parallelization
• http://www.openmp.org, http://www.compunity.org
  – Standard documents, tutorials, sample codes

Page 18: OpenMP (2)

Fork-join model
• Master thread spawns a team of threads as needed
• Synchronize when leaving a parallel region (join)
• Only the master executes the sequential part (see the sketch below)
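A minimal fork-join sketch in C (mine, not from the slides); the thread count and output order depend on the machine:

#include <stdio.h>
#include <omp.h>

int main ()
{
    printf ("sequential part: master thread only\n");

#pragma omp parallel                /* fork: master spawns a team */
    {
        printf ("parallel part: thread %d of %d\n",
                omp_get_thread_num (), omp_get_num_threads ());
    }                               /* join: implicit barrier */

    printf ("sequential part again: master thread only\n");
    return 0;
}

Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.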


Page 19: OpenMP (3)

Parallel construct

#pragma omp parallel
{
    task ();
}

Every thread in the team executes the enclosed block, so task() runs once per thread.

Page 20: OpenMP (4)

Work-sharing construct: sections

#pragma omp parallel sections
{
#pragma omp section
    phase1 ();
#pragma omp section
    phase2 ();
#pragma omp section
    phase3 ();
}

Each section is executed exactly once, by one of the threads in the team.

Page 21: OpenMP (5)

Work-sharing construct: loop

#pragma omp parallel for
for (i = 0; i < 12; i++)
    c[i] = a[i] + b[i];

#pragma omp parallel
{
    task ();
#pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];
}

The loop iterations are divided among the threads of the team.

Page 22: OpenMP (6)

Pi: OpenMP version

#ifdef _OPENMP
#include <omp.h>
#endif

#define N     20000000
#define STEP  (1.0 / (double) N)

double compute ()
{
    int i;
    double x;
    double sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

int main ()
{
    double pi;

    pi = compute ();
    return 0;
}

Page 23: OpenMP (7)

Environment
• Dual quad-core Xeon (8 cores), 2.4 GHz
• Intel OpenMP compiler

[Figure: speedup of Pi (OpenMP) versus the number of cores (1-8) for N = 2e6, 2e7, and 2e8, compared against ideal speedup.]

Page 24: OpenMP (8)

Pros
• Simple and portable
• Single source code for both serial and parallel versions
• Incremental parallelization

Cons
• Primarily for bounded loops over built-in types
  – Only for Fortran-style C/C++ programs
  – Pointer-chasing loops?
• Performance may be degraded due to
  – Synchronization overhead
  – Load imbalance
  – Thread management overhead

Page 25: MPI (1)

Message Passing Interface
• Library standard defined by a committee of vendors, implementers, and parallel programmers
• Used to create parallel programs based on message passing
• Available on almost all parallel machines in C/Fortran
• Two phases
  – MPI-1: traditional message passing
  – MPI-2: remote memory, parallel I/O, dynamic processes
• Tasks are typically created when the jobs are launched
• http://www.mpi-forum.org

Page 26: MPI (2)

Point-to-point communication
• Sending and receiving messages

Process 0 sends the four-element array A; process 1 receives it into B:

/* process 0 */
MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);

/* process 1 */
MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
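For context, a complete (hypothetical) program around those two calls might look as follows; the tag 1000 and the two-process setup mirror the slide:

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int rank;
    int A[4] = { 1, 2, 3, 4 }, B[4];

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf ("rank 1 received %d %d %d %d\n",
                B[0], B[1], B[2], B[3]);
    }
    MPI_Finalize ();
    return 0;
}

Build and run with the usual wrappers, e.g. mpicc recv.c && mpirun -np 2 ./a.out.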


Page 27: MPI (3)

Collective communication
• Broadcast
  – MPI_Bcast (array, 100, MPI_INT, 0, comm);
  – The root (rank 0) sends its array to every process in the communicator
• Reduce
  – MPI_Reduce (&localsum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
  – Each process contributes its localsum; their sum arrives in sum at the root

Page 28: MPI (4)

Pi: MPI version

#include "mpi.h"
#include <stdio.h>

#define N     20000000
#define STEP  (1.0 / (double) N)
#define f(x)  (4.0 / (1.0 + (x)*(x)))

int main (int argc, char *argv[])
{
    int i, numprocs, myid, start, end;
    double sum = 0.0, mypi, pi;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank (MPI_COMM_WORLD, &myid);

    /* each rank integrates its contiguous share of the N intervals */
    start = (N / numprocs) * myid + 1;
    end = (myid == numprocs - 1) ? N : (N / numprocs) * (myid + 1);
    for (i = start; i <= end; i++)
        sum += f(STEP * ((double) i - 0.5));   /* midpoint of interval i */
    mypi = STEP * sum;

    MPI_Reduce (&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf ("pi = %.15f\n", pi);
    MPI_Finalize ();
    return 0;
}

Page 29: MPI (5)

Pros
• Runs on either shared- or distributed-memory architectures
• Can be used on a wider range of problems
• Distributed-memory computers are less expensive than large shared-memory computers

Cons
• Requires more programming changes to go from a serial to a parallel version
• Can be harder to debug
• Performance is limited by the communication network between the nodes

Page 30: Parallel Benchmarks (1)

Linpack
• Matrix linear algebra
• Basis for measuring the “Top500 Supercomputing Sites” (http://www.top500.org)

SPECrate
• Parallel run of SPEC CPU programs
• Job-level (or task-level) parallelism

SPLASH
• Stanford Parallel Applications for Shared Memory
• Mix of kernels and applications
• Strong scaling: keep the problem size fixed

Page 31: Parallel Benchmarks (2)

NAS parallel benchmark suite
• By NASA Advanced Supercomputing (NAS)
• Computational fluid dynamics kernels

PARSEC suite
• Princeton Application Repository for Shared-Memory Computers
• Multithreaded applications using Pthreads and OpenMP

Page 32: Cache Coherence (1)

Examples

[Figure: two CPUs with private caches above a shared memory, shown in two scenarios. In the first, CPU0 reads X (caching X = 0), CPU1 reads X, and CPU0 then writes X = 1: without coherence, CPU1 would keep reading the stale X = 0. In the second, CPU0 writes X = 1 and CPU1 writes X = 2, and both eventually flush: memory (initially X = 0) would receive conflicting values, ending at X = 2.]

• Invalidate all copies before allowing a write to proceed
• Disallow more than one modified copy

Page 33: Cache Coherence (2)

Cache coherence protocol
• A set of rules to maintain the consistency of data stored in the local caches as well as in main memory
• Ensures multiple read-only copies and an exclusive modified copy
• Types
  – Snooping-based (or “snoopy”)
  – Directory-based
• Write strategies
  – Invalidation-based
  – Update-based

Page 34: Cache Coherence (3)

Snooping protocol
• All cache controllers monitor (or snoop) on the bus
  – Send all requests for data to all processors
  – Processors snoop to see if they have a shared block
  – Requires broadcast, since caching info resides at the processors
  – Works well with a bus (natural broadcast)
  – Dominates for small-scale machines
• Cache coherence unit
  – Cache block (line) is the unit of management
  – False sharing is possible: two processors share the same cache line but not the actual word
  – Coherence miss: an invalidate can cause a miss for data that was read before

Page 35: Cache Coherence (4)

MESI protocol: invalidation-based
• Modified (M)
  – Cache has the only copy and the copy is modified
  – Memory is not up-to-date
• Exclusive (E)
  – Cache has the only copy and the copy is unmodified (clean)
  – Memory is up-to-date
• Shared (S)
  – Copies may exist in other caches and all are unmodified
  – Memory is up-to-date
• Invalid (I)
  – Not in cache

Page 36: Cache Coherence (5)

MESI protocol state machine

[Figure: state diagram over Invalid, Shared (read-only), Exclusive (read-only), and Modified (read/write). Recoverable transitions: on a CPU read from Invalid, the cache issues a Bus Read and moves to Shared if another cache asserts the S-signal, otherwise to Exclusive; on a CPU write, it issues a Bus ReadX (invalidating other copies) and moves to Modified; a CPU write in Exclusive moves to Modified with no bus traffic, since no invalidate is needed; CPU reads in Shared, Exclusive, and Modified hit with no bus traffic; an observed Bus Read in Modified triggers a write-back (flush), while an observed Bus ReadX invalidates the copy. If a cache miss evicts a modified block, the cache writes the block back.]

Page 37: Cache Coherence (6)

Examples revisited

[Figure: the Page 32 examples annotated with MESI states. First scenario: CPU0 reads X and caches X = 0 (E); CPU1's read moves both copies to X = 0 (S); CPU0's write X = 1 invalidates CPU1's copy (I) and leaves CPU0 with X = 1 (M); when CPU1 reads X again, CPU0 flushes and both hold X = 1 (S). Second scenario: CPU0 writes X = 1 (M) while CPU1 is invalid (I); CPU1's write X = 2 makes CPU0 flush and invalidate, leaving CPU1 with X = 2 (M); memory is updated to X = 2 on the flush.]

• Invalidate all copies before allowing a write to proceed
• Disallow more than one modified copy

Page 38: Cache Coherence (7)

False sharing

struct thread_stat {
    int count;
    char pad[PADS];
} stat[8];

double compute ()
{
    int i;
    double x, sum = 0.0;

#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < N; i++) {
        int id = omp_get_thread_num ();
        stat[id].count++;      /* each thread writes only its own counter */
        x = (i + 0.5) * STEP;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * STEP;
}

With little or no padding, the stat[] elements of different threads share a cache line, so every count++ invalidates the other cores' copies even though no word is actually shared.

[Figure: speedup of Pi (OpenMP, N = 2e7) versus the number of cores (1-8) for Pad = 0, 28, 60, and 124; memory layout sketch shows stat[0] … stat[7] laid out contiguously, each count followed by its pad.]

Page 39: Cache-Aware Programming (1)

Pack data more tightly
• Usually, compilers do not pack structures

struct Loose {
    short s;
    int i;
    char c;
    Foo* p;
};

struct Tight {
    Foo* p;
    int i;
    short s;
    char c;
};

Ordering members from largest to smallest minimizes alignment padding: on a typical LP64 system, struct Loose occupies 24 bytes while struct Tight occupies 16.

Page 40: Cache-Aware Programming (2)

Work in the cache
• Reduce the working set
• Divide a problem into smaller subproblems
• Reorder steps in the code

Read-only data
• Separate read-only data from read-write data
• Annotate constants with the const keyword (see the sketch below)
• Read-only data are stored in separate sections
• If possible, separate read-mostly variables
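A tiny illustration (mine, not from the slides): with const, the table typically lands in a read-only section (.rodata on ELF systems), physically separated from frequently written variables:

/* read-only: shared by all threads, never invalidated by writes */
static const double coeffs[4] = { 0.5, 0.25, 0.125, 0.0625 };

/* read-write: placed in a different, writable section */
static long counter;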


Page 41: Cache-Aware Programming (3)

Read-write data
• Group read-write variables which are used together into a structure
• Move read-write variables which are often written to by different threads onto their own cache line
  – Reduces false sharing
  – May require padding
• If a variable is used by multiple threads, but every use is independent, move it into Thread-Local Storage (TLS), as in the sketch below
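A minimal TLS sketch (mine, not from the slides), using the common __thread extension (C11 spells it _Thread_local):

#include <pthread.h>
#include <stdio.h>

static __thread long local_hits = 0;    /* one instance per thread */

static void *worker (void *arg)
{
    for (int i = 0; i < 1000000; i++)
        local_hits++;                    /* private: no lock, no false sharing */
    printf ("this thread counted %ld\n", local_hits);
    return NULL;
}

int main (void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join (t[i], NULL);
    return 0;
}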


Page 42: Cache-Aware Programming (4)

Lock variables
• Keep the lock and the associated data on distinct cache lines (see the sketch below)
• If a lock protects data that is frequently uncontended, try to keep the lock and the data on the same cache line
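A sketch of the first rule (mine, not from the slides), assuming 64-byte cache lines; __attribute__((aligned(...))) is the GCC/Clang spelling, and C11 offers _Alignas:

#include <pthread.h>

#define CACHE_LINE 64                   /* assumed line size */

struct counter {
    /* the lock gets a cache line of its own ... */
    pthread_mutex_t lock __attribute__ ((aligned (CACHE_LINE)));
    /* ... and the contended data starts on the next line */
    long value __attribute__ ((aligned (CACHE_LINE)));
};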


Page 43: Cache-Aware Programming (5)

Thread affinity
• Avoid moving a thread from one core to another
  – Reduce context switching
  – Reduce cache misses
  – Reduce TLB misses
• Bind a process
  – sched_setaffinity() / sched_getaffinity()
• Bind a thread (see the sketch below)
  – pthread_setaffinity_np()
  – pthread_getaffinity_np()
  – pthread_attr_setaffinity_np()
  – pthread_attr_getaffinity_np()
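A minimal sketch (mine, not from the slides) that pins the calling thread to one core on Linux; core 0 is an arbitrary choice:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_to_core (int core)
{
    cpu_set_t set;
    CPU_ZERO (&set);
    CPU_SET (core, &set);               /* allow only this core */
    return pthread_setaffinity_np (pthread_self (),
                                   sizeof (cpu_set_t), &set);
}

int main (void)
{
    int err = pin_to_core (0);
    if (err != 0)
        fprintf (stderr, "pin failed: %s\n", strerror (err));
    /* the thread now stays on core 0: warm caches and TLB */
    return 0;
}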

Page 44: Summary: Why Hard?

Impediments to parallel computing
• Parallelization
  – Parallel algorithms which maximize concurrency
  – Eliminating the serial portion as much as possible
  – Lack of standardized APIs, environments, and tools
• Correct parallelization
  – Shared resource identification
  – Difficult to debug (data races, deadlock, …)
  – Memory consistency
• Effective parallelization
  – Communication & synchronization overhead
  – Problem decomposition, granularity, load imbalance, affinity, …
  – Architecture dependency