
Page 1:

Parallel Computing on Multi-Core Systems

Instructor:

Arash Tavakkol

Department of Computer Engineering, Sharif University of Technology

Spring 2016

Page 2:

Optimization Techniques in OpenMP programs

Some slides come from Dr. Cristina Amza (http://www.eecg.toronto.edu/~amza/) and Professor Daniel Etiemble (http://www.lri.fr/~de/).

Page 3:

Considerations in Using OpenMP: Optimize Barrier Use

Barriers are expensive operations; reduce their use.

#pragma omp parallel
{
   .........
   #pragma omp for
   for (i=0; i<n; i++)
      .........

   #pragma omp for nowait
   for (i=0; i<n; i++)
      .........
} /*-- End of parallel region - barrier is implied --*/


Page 4:

Optimize Barrier Use

#pragma omp parallel default(none) \
        shared(n,a,b,c,d,sum) private(i)
{
   #pragma omp for nowait
   for (i=0; i<n; i++)
      a[i] += b[i];

   #pragma omp for nowait
   for (i=0; i<n; i++)
      c[i] += d[i];

   /*-- a[] and c[] must be ready before the reduction --*/
   #pragma omp barrier

   #pragma omp for nowait reduction(+:sum)
   for (i=0; i<n; i++)
      sum += a[i] + c[i];
} /*-- End of parallel region --*/


Page 5:

Avoid Large Critical Regions

#pragma omp parallel shared(a,b) private(c,d)
{
   ......
   #pragma omp critical
   {
      a += 2 * c;
      c = b * d;   /* c and d are private -- this assignment does not need protection */
   }
} /*-- End of parallel region --*/
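A minimal sketch of how the region above could be trimmed (illustrative, not from the original slides): only the update of the shared variable a needs protection, so the assignment to the private variable c can move outside the critical region.

#pragma omp parallel shared(a,b) private(c,d)
{
   ......
   #pragma omp critical
   {
      a += 2 * c;   /* only the shared update is serialized */
   }
   c = b * d;       /* purely private work, now outside the critical region */
} /*-- End of parallel region --*/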


Page 6:

Considerations in Using OpenMP (Cont'd)

Maximize Parallel Regions
- Overheads are associated with starting and terminating a parallel region.
- Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations.
- Minimize the number of parallel regions.


Page 7:

Maximize Parallel Regions

/*-- N separate parallel regions --*/
#pragma omp parallel for
for (.....)
{ /*-- Work-sharing loop 1 --*/ }

#pragma omp parallel for
for (.....)
{ /*-- Work-sharing loop 2 --*/ }
.........
#pragma omp parallel for
for (.....)
{ /*-- Work-sharing loop N --*/ }

/*-- A single parallel region enclosing all work-sharing loops --*/
#pragma omp parallel
{
   #pragma omp for /*-- Work-sharing loop 1 --*/
   { ...... }

   #pragma omp for /*-- Work-sharing loop 2 --*/
   { ...... }
   .........
   #pragma omp for /*-- Work-sharing loop N --*/
   { ...... }
}


Page 8:

Avoid Parallel Regions in Inner Loops

for (i=0; i<n; i++)for (j=0; j<n; j++)

#pragma omp parallel forfor (k=0; k<n; k++){ .........}

#pragma omp parallelfor (i=0; i<n; i++)

for (j=0; j<n; j++)#pragma omp for

for (k=0; k<n; k++){ .........}


Page 9:

Address Poor Load Balance
- Threads might have different amounts of work to do.
- The threads wait at the next synchronization point until the slowest one completes.
- Use the schedule clause.

for (i=0; i<N; i++) {ReadFromFile(i,...);

for (j=0; j<ProcessingNum; j++ )ProcessData(); /* lots of work here */WriteResultsToFile(i);

}
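As a minimal illustration of the schedule clause (not from the original slides; work() is a hypothetical routine), dynamic scheduling hands iterations to threads as they finish, which helps when iteration costs vary. The next page shows the full restructuring of the loop above.

/* work(j) stands in for a routine whose cost varies strongly from iteration to iteration. */
#pragma omp parallel for schedule(dynamic)
for (int j = 0; j < ProcessingNum; j++)
   work(j);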


Page 10:

Address Poor Load Balance

#pragma omp parallel
{
   /* preload data to be used in first iteration of the i-loop */
   #pragma omp single
   { ReadFromFile(0,...); }

   for (i=0; i<N; i++) {
      /* preload data for next iteration of the i-loop */
      #pragma omp single nowait
      { ReadFromFile(i+1,...); }

      #pragma omp for schedule(dynamic)
      for (j=0; j<ProcessingNum; j++)
         ProcessChunkOfData(); /* here is the work */
      /* there is a barrier at the end of this loop */

      #pragma omp single nowait
      { WriteResultsToFile(i); }
   } /* threads immediately move on to next iteration of i-loop */
} /* one parallel region encloses all the work */

Page 11:

False Sharing
- In systems with distributed coherent caches, when multiple threads access the same cache line and at least one of them writes to it, costly invalidation misses and upgrades result.
- When the threads actually communicate by accessing the same data, this is a necessary overhead.


Page 12:

False Sharing
- The threads do not actually communicate.
- They may be accessing unrelated data that just happens to be allocated in the same cache line.
- The costly invalidation misses and upgrades are completely unnecessary.

Solution
- Split the data accessed by the different threads onto different cache lines.


Page 13:

False Sharing
Since sum1 and sum2 are defined next to each other, the compiler is likely to allocate them next to each other in memory, in the same cache line.


int sum1;
int sum2;

void thread1(int v[], int v_count) {
   sum1 = 0;
   for (int i = 0; i < v_count; i++)
      sum1 += v[i];
}

void thread2(int v[], int v_count) {
   sum2 = 0;
   for (int i = 0; i < v_count; i++)
      sum2 += v[i];
}

Page 14:

False Sharing
thread1 reads sum1 into its cache. Since the line is not present in any other cache, thread1 gets it in exclusive state.


Page 15:

False Sharing
thread2 now reads sum2. Since thread1 already had the cache line in exclusive state, this causes a downgrade of the line in thread1's cache, and the line is now in shared state in both caches.


Page 16:

False Sharing
thread1 now writes its updated sum to sum1. Since it only has the line in shared state, it must upgrade the line and invalidate the line in thread2's cache.


Page 17:

False Sharing
thread2 now writes its updated sum to sum2. Since thread1 has invalidated the cache line in thread2's cache, thread2 gets a coherence miss and must invalidate the line in thread1's cache, forcing thread1 to do a coherence write-back.


Page 18:

False Sharing
The next iteration of the loops now starts, and thread1 again reads sum1. Since thread2 just invalidated the cache line in thread1's cache, it gets a coherence miss. It must also downgrade the line in thread2's cache, forcing thread2 to do a coherence write-back.


Page 19:

False Sharing
thread2 finally reads sum2. Since it has the cache line in shared state, it can read it without any coherence activity, and we are back in the same situation as after step 2.


Page 20:

False Sharing
For each iteration of the loops, steps 3 to 6 repeat, each time with costly upgrades, coherence misses, and coherence write-backs.

To fix a false sharing problem, you need to make sure that the data accessed by the different threads is allocated to different cache lines, as sketched below.
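A minimal sketch of one way to apply this fix to the earlier sum1/sum2 example (illustrative, not from the original slides; a 64-byte cache line is assumed): pad each counter so that it occupies its own cache line.

/* Padding keeps sum1 and sum2 at least one cache line (64 bytes) apart,  */
/* so a write to one no longer invalidates the line holding the other.    */
struct padded_sum {
   int value;
   char pad[64 - sizeof(int)];
};

struct padded_sum sum1, sum2;

void thread1(int v[], int v_count) {
   sum1.value = 0;
   for (int i = 0; i < v_count; i++)
      sum1.value += v[i];
}

void thread2(int v[], int v_count) {
   sum2.value = 0;
   for (int i = 0; i < v_count; i++)
      sum2.value += v[i];
}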


Page 21:

Avoid False Sharing

#pragma omp parallel for shared(Nthreads,a) schedule(static,1)
for (int i=0; i<Nthreads; i++)
   a[i] += i;

- With Nthreads = 8, a[1] through a[7] are in the same cache line.
- Array padding can be used to eliminate the problem: dimensioning the array as a[n][8] and changing the indexing from a[i] to a[i][0] eliminates the false sharing, as sketched below.
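A minimal sketch of that padding (illustrative, not from the original slides; it assumes 8-byte array elements and a 64-byte cache line, so one padded row fills a whole line):

/* Each padded row a[i] occupies its own 64-byte cache line, so threads    */
/* updating a[i][0] with schedule(static,1) no longer share a line.        */
#define N 8                 /* hypothetical thread/element count */
double a[N][8];

#pragma omp parallel for shared(a) schedule(static,1)
for (int i = 0; i < N; i++)
   a[i][0] += i;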


Page 22:

Optimizing FFT function

void fft_c(int n, COMPLEX *x, COMPLEX *w)
{
   COMPLEX u, temp, tm, *xi, *xip, *wptr;
   int i, j, le, windex;

   windex = 1;
   for (le = n/2; le > 0; le /= 2) {
      wptr = w;
      for (j = 0; j < le; j++) {
         u = *wptr;
         for (i = j; i < n; i = i + 2*le) {
            xi = x + i;
            xip = xi + le;
            temp.real = xi->real + xip->real;
            temp.imag = xi->imag + xip->imag;
            tm.real = xi->real - xip->real;
            tm.imag = xi->imag - xip->imag;
            xip->real = tm.real*u.real - tm.imag*u.imag;
            xip->imag = tm.real*u.imag + tm.imag*u.real;
            *xi = temp;
         }
         wptr = wptr + windex;
      }
      windex = 2*windex;
   }
}

Slides come from Professor Daniel Etiemble


Page 23:

Optimizing FFT function

void fft_c(int n, COMPLEX *x, COMPLEX *w)
{
   COMPLEX tm;
   int i, j, k, le, windex;

   windex = 1;
   for (le = n/2; le > 0; le /= 2) {
      k = 0;
      for (j = 0; j < le; j++) {
         for (i = j; i < n; i = i + 2*le) {
            tm.real = x[i].real - x[i+le].real;
            tm.imag = x[i].imag - x[i+le].imag;
            x[i].real = x[i].real + x[i+le].real;
            x[i].imag = x[i].imag + x[i+le].imag;
            x[i+le].real = tm.real*w[k].real - tm.imag*w[k].imag;
            x[i+le].imag = tm.real*w[k].imag + tm.imag*w[k].real;
         }
         k += windex;
      }
      windex = 2*windex;
   }
}


Page 24:

Optimizing FFT function

void fft_c(int n, float x_r[], float x_i[], float w_r[], float w_i[])
{
   register float tm_r, tm_i;
   int i, j, le, windex;

   windex = 1;
   for (le = n/2; le > 0; le /= 2) {
      for (i = 0; i < n; i = i + 2*le) {
         for (j = 0; j < le; j++) {
            tm_r = x_r[i+j] - x_r[i+j+le];
            tm_i = x_i[i+j] - x_i[i+j+le];
            x_r[i+j] = x_r[i+j] + x_r[i+j+le];
            x_i[i+j] = x_i[i+j] + x_i[i+j+le];
            x_r[i+j+le] = tm_r*w_r[j*windex] - tm_i*w_i[j*windex];
            x_i[i+j+le] = tm_r*w_i[j*windex] + tm_i*w_r[j*windex];
         }
      }
      windex = 2*windex;
   }
}


Page 25:

Optimizing FFT function

void fft_c(int n, float x_r[], float x_i[], float w_r[], float w_i[])
{
   register float t_r, t_i;
   int i, j, k, le, windex;

   windex = 1;
   k = 0;
   for (le = n/2; le > 0; le /= 2, k++) {
      for (i = 0; i < n; i = i + 2*le) {
         for (j = 0; j < le; j++) {
            t_r = x_r[i+j] - x_r[i+j+le];
            t_i = x_i[i+j] - x_i[i+j+le];
            x_r[i+j] = x_r[i+j] + x_r[i+j+le];
            x_i[i+j] = x_i[i+j] + x_i[i+j+le];
            x_r[i+j+le] = t_r*w_r[(n/2)*k+j] - t_i*w_i[(n/2)*k+j];
            x_i[i+j+le] = t_r*w_i[(n/2)*k+j] + t_i*w_r[(n/2)*k+j];
         }
      }
      windex = 2*windex;
   }
}
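The slides stop at this sequential restructuring. A minimal sketch of how the inner loops could then be parallelized with OpenMP (illustrative only; the collapse strategy and the name fft_c_omp are not from the original slides):

/* All butterflies within one stage (one value of le) touch disjoint elements,
   so the two inner loops can be collapsed and shared among threads. The stages
   themselves stay sequential, since each stage reads the previous stage's output. */
void fft_c_omp(int n, float x_r[], float x_i[], float w_r[], float w_i[])
{
   int k = 0;
   for (int le = n/2; le > 0; le /= 2, k++) {
      #pragma omp parallel for collapse(2)
      for (int i = 0; i < n; i = i + 2*le) {
         for (int j = 0; j < le; j++) {
            float t_r = x_r[i+j] - x_r[i+j+le];
            float t_i = x_i[i+j] - x_i[i+j+le];
            x_r[i+j] = x_r[i+j] + x_r[i+j+le];
            x_i[i+j] = x_i[i+j] + x_i[i+j+le];
            x_r[i+j+le] = t_r*w_r[(n/2)*k+j] - t_i*w_i[(n/2)*k+j];
            x_i[i+j+le] = t_r*w_i[(n/2)*k+j] + t_i*w_r[(n/2)*k+j];
         }
      }
   }
}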


Page 26:

QUESTIONS?
