introduction to parallel processing
![Page 1: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/1.jpg)
Introduction to Parallel Processing
Dr. Guy Tel-Zur
Lecture 10
![Page 2: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/2.jpg)
Agenda
• Administration
• Final presentations
• Demos
• Theory
• Next week plan
• Home assignment #4 (last)
![Page 3: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/3.jpg)
Final Projects
• Next Sunday: Groups 1-16 will present
• Next Monday: Groups 17+ will present
• 10-minute presentation per group
• All group members should present
• Send your presentation to [email protected] by midnight of the previous day
Attendance is mandatory
![Page 4: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/4.jpg)
Final Presentations
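• The division into groups is fixed
• A group that does not present will lose 5 points from its grade
• Rehearse and make sure you keep to the time limit
• The presentation should include: the project name, its goal, the challenge the problem poses for parallel computation, and approaches to a solution.
• Presentations will not be accepted during class! Be sure to send them to the lecturer in advance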
![Page 5: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/5.jpg)
The Course Roadmap
• Introduction
• HPC
• HTC
• Message Passing: MPI
• Shared Memory: OpenMP, Cilk++
• Condor
• Grid Computing
• Cloud Computing
• GPU Computing (New!)
(Three topics in the diagram are marked "Today".)
![Page 6: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/6.jpg)
Advanced Parallel Computing and Distributed Computing course
• A new course at the department: Distributed Computing: Advanced Parallel Processing course + Grid Computing + Cloud Computing
Course Number: 361-1-4691
• If you are interested in this course please send me an email
![Page 7: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/7.jpg)
Today
• Algorithms – Numerical Algorithms (“slides11.ppt”)
• Introduction to Grid Computing
• Some demos
• Home assignment #4
![Page 8: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/8.jpg)
Futuristic Asymmetric Multi-Core Chip
SACC: Sequential Accelerator
![Page 9: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/9.jpg)
Theory
• Numerical Algorithms – slides from:
UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE Department of Computer Science
ITCS 4145/5145 Parallel Programming Spring 2009
Dr. Barry Wilkinson
Matrix multiplication, solving a system of linear equations, iterative methods
URL is Here
![Page 10: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/10.jpg)
Demos
• Hybrid Parallel Programming – MPI + OpenMP
• Cloud Computing
  – Setting up an HPC cluster
  – Setting up a Condor machine (a separate presentation)
• StarHPC
• Cilk++
• GPU Computing (a separate presentation)
• Eclipse PTP
• Kepler workflow
![Page 11: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/11.jpg)
Hybrid MPI + OpenMP Demo
Machine File:
hobbit1
hobbit2
hobbit3
hobbit4
Each hobbit has 8 cores
mpicc -o mpi_out mpi_test.c -fopenmp
MPI
OpenMP
An Idea for a final project!!!
cd ~/mpi
program name: hybridpi.c
![Page 12: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/12.jpg)
MPI is not installed yet on the hobbits; in the meantime, use:
vdwarf5
vdwarf6
vdwarf7
vdwarf8
![Page 13: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/13.jpg)
top -u tel-zur -H -d 0.05
-H: show threads, -d: delay between refreshes (seconds), -u: show only this user's processes
![Page 14: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/14.jpg)
Hybrid MPI+OpenMP continued
![Page 15: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/15.jpg)
Hybrid Pi (MPI+OpenMP)
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define NBIN 100000
#define MAX_THREADS 8

int main(int argc,char **argv) {
  int nbin,myid,nproc,nthreads,tid;
  double step,sum[MAX_THREADS]={0.0},pi=0.0,pig;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);
  MPI_Comm_size(MPI_COMM_WORLD,&nproc);
  nbin = NBIN/nproc;        /* number of bins handled by each MPI rank */
  step = 1.0/(nbin*nproc);  /* width of one bin */
![Page 16: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/16.jpg)
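  /* num_threads keeps tid within the bounds of sum[MAX_THREADS] */
  #pragma omp parallel private(tid) num_threads(MAX_THREADS)
  {
    int i;
    double x;
    /* one thread records the team size; the implicit barrier at the end
       of single makes it visible to all threads before the loop */
    #pragma omp single
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    /* cyclic distribution of this rank's bins over its threads */
    for (i=nbin*myid+tid; i<nbin*(myid+1); i+=nthreads) {
      x = (i+0.5)*step;
      sum[tid] += 4.0/(1.0+x*x);
    }
    printf("rank tid sum = %d %d %e\n",myid,tid,sum[tid]);
  }
  /* combine the per-thread partial sums of this rank */
  for(tid=0; tid<nthreads; tid++)
    pi += sum[tid]*step;

  /* combine the per-rank results across all MPI processes */
  MPI_Allreduce(&pi,&pig,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
  if (myid==0) printf("PI = %f\n",pig);
  MPI_Finalize();
  return 0;
}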
![Page 17: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/17.jpg)
Cilk++
Simple, powerful expression of task parallelism:
cilk_for – Parallelize for loops
cilk_spawn – Specify the start of parallel execution
cilk_sync – Specify the end of parallel execution
http://software.intel.com/en-us/articles/intel-cilk-plus/
![Page 18: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/18.jpg)
17/8/2011
![Page 19: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/19.jpg)
![Page 20: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/20.jpg)
Fibonacci
Try: http://www.wolframalpha.com/input/?i=fibonacci+number
![Page 21: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/21.jpg)
Fibonacci Numbers: serial version
// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
// Serial version
// Credit: http://myxman.org/dp/node/182
long fib_serial(long n) {
  if (n < 2) return n;
  return fib_serial(n-1) + fib_serial(n-2);
}
![Page 22: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/22.jpg)
Cilk++ Fibonacci
#include <cilk.h>
#include <stdio.h>

long fib_parallel(long n)
{
  long x, y;
  if (n < 2) return n;
  x = cilk_spawn fib_parallel(n-1);  // child may run in parallel with the next line
  y = fib_parallel(n-2);
  cilk_sync;                         // wait for the spawned child before using x
  return (x+y);
}

int cilk_main()
{
  int N=50;
  long result;
  result = fib_parallel(N);
  printf("fib of %d is %ld\n",N,result);  // %ld because result is a long
  return 0;
}
![Page 23: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/23.jpg)
Cilk_spawn
ADD PARALLELISM USING CILK_SPAWN
We are now ready to introduce parallelism into our qsort program. The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows but does not require parallel operation. The Cilk++ scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available.
The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.
![Page 24: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/24.jpg)
Cilkview Fn(30)
![Page 25: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/25.jpg)
Strands and Knots
A Cilk++ program fragment:
...
do_stuff_1();          // execute strand 1
cilk_spawn func_3();   // spawn strand 3 at knot A
do_stuff_2();          // execute strand 2
cilk_sync;             // sync at knot B
do_stuff_4();          // execute strand 4
...
DAG with two spawns (labeled A and B) and one sync (labeled C)
![Page 26: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/26.jpg)
A more complex Cilk++ program (DAG):
Let's add labels to the strands to indicate the number of milliseconds it takes to execute each strand.
In ideal circumstances (e.g., with no scheduling overhead), if an unlimited number of processors is available, this program should run for 68 milliseconds.
![Page 27: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/27.jpg)
Work and Span
Work
The total amount of processor time required to complete the program is the sum of all the numbers. We call this the work. In this DAG, the work is 181 milliseconds for the 25 strands shown, and if the program is run on a single processor, the program should run for 181 milliseconds.
Span
Another useful concept is the span, sometimes called the critical path length. The span is the most expensive path that goes from the beginning to the end of the program. In this DAG, the span is 68 milliseconds, as shown below:
![Page 28: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/28.jpg)
cilk_for: divide-and-conquer strategy
Shown here: 8 threads and 8 iterations
Here is the DAG for a serial loop that spawns each iteration. In this case, the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync.
![Page 29: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/29.jpg)
Race conditions
Check the “qsort-race” program with cilkscreen:
![Page 30: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/30.jpg)
StarHPC on the Cloud
Will be ready for PP201X?
![Page 31: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/31.jpg)
Eclipse PTP
Parallel Tools Platform
http://www.eclipse.org/ptp/
Will be ready for PP201X?
![Page 32: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/32.jpg)
Recursion in OpenMP
long fib_parallel(long n)
{
  long x, y;
  if (n < 2) return n;
  #pragma omp task default(none) shared(x,n)
  {
    x = fib_parallel(n-1);
  }
  y = fib_parallel(n-2);
  #pragma omp taskwait
  return (x+y);
}

#pragma omp parallel
#pragma omp single
{
  r = fib_parallel(n);
}
Reference: http://myxman.org/dp/node/182
Use the taskwait pragma to specify a wait for child tasks to be completed that are generated by the current task.
The task pragma can be useful for parallelizing irregular algorithms such as recursive algorithms for which other OpenMP workshare constructs are inadequate.
![Page 33: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/33.jpg)
Intel® Parallel Studio
• Use Parallel Composer to create and compile a parallel application
• Use Parallel Inspector to improve reliability by finding memory and threading errors
• Use Parallel Amplifier to improve parallel performance by tuning threaded code
![Page 34: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/34.jpg)
Intel® Parallel Studio
![Page 35: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/35.jpg)
![Page 36: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/36.jpg)
![Page 37: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/37.jpg)
![Page 38: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/38.jpg)
![Page 39: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/39.jpg)
![Page 40: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/40.jpg)
![Page 41: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/41.jpg)
Parallel Studio adds new features to Visual Studio
![Page 42: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/42.jpg)
Intel’s Parallel Amplifier – Execution Bottlenecks
![Page 43: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/43.jpg)
![Page 44: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/44.jpg)
Intel’s Parallel Inspector – Threading Errors
![Page 45: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/45.jpg)
Intel’s Parallel Inspector – Threading Errors
![Page 46: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/46.jpg)
![Page 47: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/47.jpg)
Error – Data Race
![Page 48: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/48.jpg)
![Page 49: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/49.jpg)
![Page 50: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/50.jpg)
Intel Parallel Studio - Composer
The installation of this part failed for me, probably because I had not installed Intel’s C++ compiler first. Sorry, I can’t make a demo here…
![Page 51: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/51.jpg)
![Page 52: Introduction to Parallel Processing](https://reader036.vdocuments.us/reader036/viewer/2022062501/568161e8550346895dd2182f/html5/thumbnails/52.jpg)