Parallel Programming on the SGI Origin2000
With thanks to Moshe Goldberg, TCC, and Igor Zacharov, SGI
Taub Computer Center, Technion
Mar 2005
Anne Weill-Zrahia
Parallel Programming on the SGI Origin2000
1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI

4) Parallel Programming - OpenMP
[Figure: a joint bank account as a race condition. Initial amount: IL 500. Limor in Haifa reads IL 500 and takes IL 150 (writes IL 350); at the same time Shimon in Tel Aviv reads IL 500 and takes IL 400 (writes IL 100). One withdrawal is lost: the final amount is IL 350 or IL 100, depending on which write lands last. Is this your joint bank account?]
Introduction
- Parallelization instructions to the compiler:
      f77 -o prog -mp prog.f
  or:
      f77 -o prog -pfa prog.f
- Now let us try to understand what a compiler has to determine when deciding how to parallelize
- Note that when we talk loosely about parallelization, what is meant is: “Is the program, as presented here, parallelizable?”
- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see…
Data dependency types

1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):

      do i=2,n
        a(i) = a(i-1)          <-- cannot be parallelized
      enddo

2) Data dependence within a single iteration (non-loop-carried dependence):

      do i=2,n
        c = . . . .
        a(i) = . . . c . . .   <-- parallelizable
      enddo

3) Reduction:

      do i=1,n
        s = s + x              <-- parallelizable
      enddo
All data dependencies in programs are variations on these fundamental types.
Data dependency analysis
Question: Are the following loops parallelizable?
      do i=2,n
        a(i) = b(i-1)
      enddo
                              YES!

      do i=2,n
        a(i) = a(i-1)
      enddo
                              NO!

Why?
Data dependency analysis
      do i=2,n
        a(i) = b(i-1)
      enddo
                              YES!

                 CPU1        CPU2        CPU3
      cycle 1:   A(2)=B(1)   A(3)=B(2)   A(4)=B(3)
      cycle 2:   A(5)=B(4)   A(6)=B(5)   A(7)=B(6)
Data dependency analysis
      do i=2,n
        a(i) = a(i-1)
      enddo

Scalar (non-parallel) run:

                 CPU1
      cycle 1:   A(2)=A(1)
      cycle 2:   A(3)=A(2)
      cycle 3:   A(4)=A(3)
      cycle 4:   A(5)=A(4)

In each cycle, NEW data from the previous cycle is read.
Data dependency analysis
      do i=2,n
        a(i) = a(i-1)
      enddo
                              NO!

                 CPU1        CPU2        CPU3
      cycle 1:   A(2)=A(1)   A(3)=A(2)   A(4)=A(3)

Will probably read OLD data.
Data dependency analysis
      do i=2,n
        a(i) = a(i-1)
      enddo
                              NO!

                 CPU1        CPU2        CPU3
      cycle 1:   A(2)=A(1)   A(3)=A(2)   A(4)=A(3)
      cycle 2:   A(5)=A(4)   A(6)=A(5)   A(7)=A(6)

Cycle 1 will probably read OLD data; cycle 2 may read NEW data written in cycle 1.
Data dependency analysis
Another question: Are the following loops parallelizable?

      do i=3,n,2
        a(i) = a(i-1)
      enddo
                              YES!

      do i=1,n
        s = s + a(i)
      enddo
                              Depends!
Data dependency analysis
      do i=3,n,2
        a(i) = a(i-1)
      enddo
                              YES!

                 CPU1         CPU2          CPU3
      cycle 1:   A(3)=A(2)    A(5)=A(4)     A(7)=A(6)
      cycle 2:   A(9)=A(8)    A(11)=A(10)   A(13)=A(12)
Data dependency analysis
      do i=1,n
        s = s + a(i)
      enddo
                              Depends!

                 CPU1        CPU2        CPU3
      cycle 1:   S=S+A(1)    S=S+A(2)    S=S+A(3)
      cycle 2:   S=S+A(4)    S=S+A(5)    S=S+A(6)
- The value of S will be undetermined, and it will typically vary from one run to the next
- This bug in parallel programming is called a “race condition”
Data dependency analysis
What is the principle involved here?
The examples shown fall into two categories:
1) Data being read is independent of the data that is written:
      a(i) = b(i-1)     i=2,3,4. . .
      a(i) = a(i-1)     i=3,5,7. . .

2) Data being read depends on the data that is written:
      a(i) = a(i-1)     i=2,3,4. . .
      s = s + a(i)      i=1,2,3. . .
Data dependency analysis
Here is a typical situation:
Is there a data dependency in the following loop?
      do i = 1,n
        a(i) = sin(x(i))
        result = a(i) + b(i)
        c(i) = result * c(i)
      enddo
Clearly, “result” is a temporary variable that is reassigned for every iteration.
Note: “result” must be a “private” variable (this will be discussed later).
No!
Data dependency analysis
Here is a (slightly different) typical situation:
Is there a data dependency in the following loop?
      do i = 1,n
        a(i) = sin(result)
        result = a(i) + b(i)
        c(i) = result * c(i)
      enddo
Yes!
The value of “result” is carried over from one iteration to the next.
This is the classical read/write situation but now it is somewhat hidden.
Data dependency analysis
The loop could (symbolically) be rewritten:

      do i = 1,n
        a(i) = sin(result(i-1))
        result(i) = a(i) + b(i)
        c(i) = result(i) * c(i)
      enddo

Now substitute the expression for a(i):

      do i = 1,n
        a(i) = sin(result(i-1))
        result(i) = sin(result(i-1)) + b(i)
        c(i) = result(i) * c(i)
      enddo
This is really of the type “a(i)=a(i-1)” !
Data dependency analysis
One more: Can the following loop be parallelized?
      do i = 3,n
        a(i) = a(i-2)
      enddo

If this is parallelized, there will probably be different answers from one run to another.
Why?
Data dependency analysis
      do i = 3,n
        a(i) = a(i-2)
      enddo

                 CPU1        CPU2
      cycle 1:   A(3)=A(1)   A(4)=A(2)
      cycle 2:   A(5)=A(3)   A(6)=A(4)

With 2 CPUs, this looks like it will be safe.
Data dependency analysis
      do i = 3,n
        a(i) = a(i-2)
      enddo

HOWEVER: what if there are 3 CPUs and not 2?

                 CPU1        CPU2        CPU3
      cycle 1:   A(3)=A(1)   A(4)=A(2)   A(5)=A(3)

In this case, a(3) is read and written in two threads at once.
RISC memory levels

[Figure: single CPU - the CPU is connected to main memory through a cache.]

[Figure: multiple CPUs - CPU 0 and CPU 1 each have their own cache (cache 0, cache 1) and share the same main memory.]
Definition of OpenMP
- Application Program Interface (API) for Shared Memory Parallel Programming
- Directive based approach with library support
- Targets existing applications and widely used languages:
  * Fortran API first released October 1997
  * C, C++ API first released October 1998
- Multi-vendor/platform support
Why was OpenMP developed?
- Parallel programming before OpenMP
  * Standards for distributed memory (MPI and PVM)
  * No standard for shared memory programming
- Vendors had different directive-based APIs for SMP
  * SGI, Cray, Kuck & Assoc, DEC
  * Vendor proprietary; similar, but not the same
  * Most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory
  * This sacrifices built-in SMP hardware benefits
  * Requires major effort
The Spread of OpenMP
Organization: Architecture Review Board
Web site: www.openmp.org

Hardware: HP/DEC, IBM, Intel, SGI, Sun

Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft
OpenMP Interface model
Directives and pragmas:
  * control structures
  * work sharing
  * data scope attributes: private, firstprivate, lastprivate, shared, reduction

Runtime library routines:
  * control and query: number of threads, nested parallel?, throughput mode
  * lock API

Environment variables (runtime environment):
  * schedule type
  * max number of threads
  * nested parallelism
  * throughput mode
OpenMP execution model
- OpenMP programs start in a single thread, in sequential mode
- To create additional threads, the user opens a parallel region
  * additional slave threads are launched
  * the master thread is part of the team
  * slave threads “disappear” at the end of the parallel region
- This model is repeated as needed
[Figure: fork-join execution. The master thread forks a parallel region with 4 threads, returns to serial execution, forks again with 2 threads, and then again with 3 threads.]
Creating parallel threads

Fortran:
      c$omp parallel [clause,clause]
        code to run in parallel
      c$omp end parallel

C/C++:
      #pragma omp parallel [clause,clause]
      {
        code to run in parallel
      }
Replicate execution:

      i=0
c$omp parallel
      call foo(i,a,b)
c$omp end parallel
      print*,i

[Figure: the master thread sets i=0; inside the parallel region, every thread in the team calls foo; after the region, the master thread alone executes print*,i.]

The number of threads is set by a library call or by an environment variable.
OpenMP on the Origin 2000
Switches and formats:
      f77 -mp

Directives:
      c$omp parallel do
      c$omp+ shared(a,b,c)
OR:
      c$omp parallel do shared(a,b,c)

Conditional compilation:
      c$    iam = omp_get_thread_num()+1
OpenMP on the Origin 2000 -C
Switches and formats:
      cc -mp

      #pragma omp parallel for \
          shared(a,b,c)
OR:
      #pragma omp parallel for shared(a,b,c)
OpenMP on the Origin 2000
Parallel Do Directive
c$omp parallel do private(i)
      do i=1,n
        a(i) = i+1
      enddo
c$omp end parallel do        --> optional

Topics: Clauses, Detailed construct
OpenMP on the Origin 2000
Parallel Do Directive - Clauses
shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
schedule(type[,chunk])
if(scalar_logical_expression)
ordered
copyin(var)
Allocating private and shared variables

[Figure: a shared variable S is visible in the single-thread regions before and after the parallel region and to all threads inside it; private variables exist only inside the parallel region, one copy per thread. S = shared variable, P = private variable.]
Clauses in OpenMP - 1
Clauses for the “parallel” directive specify data association rules and conditional computation:

shared (list) - data accessible by all threads; all threads refer to the same storage

private (list) - data private to each thread; a new storage location is created with that name for each thread, and the contents of that storage are not available outside the parallel region

default (private | shared | none) - default association for variables not otherwise mentioned

firstprivate (list) - same as private(list), but the contents are given an initial value from the variable with the same name outside the parallel region

lastprivate (list) - available only for work-sharing constructs; a shared variable with that name is set to the last computed value of a thread-private variable in the work-sharing construct
Clauses in OpenMP - 2

reduction ({op|intrinsic}:list) - variables in the list are named scalars of intrinsic type
- a private copy of each variable is made in each thread and initialized according to the intended operation
- at the end of the parallel region, or at another synchronization point, all private copies are combined
- the operation must be of one of the forms:
      x = x op expr
      x = intrinsic(x,expr)
      if (x.LT.expr) x = expr
      x++; x--; ++x; --x;
  where expr does not contain x
C operators and initialization values:

      Op       Init
      + or -   0
      *        1
      &        ~0
      |        0
      ^        0
      &&       1
      ||       0

Fortran operators/intrinsics and initialization values:

      Op/intrinsic   Init
      + or -         0
      *              1
      .AND.          .TRUE.
      .OR.           .FALSE.
      .EQV.          .TRUE.
      .NEQV.         .FALSE.
      MAX            smallest number
      MIN            largest number
      IAND           all bits on
      IOR or IEOR    0
- example: c$omp parallel do reduction(+:a,y) reduction (.OR.:s)
Clauses in OpenMP - 3
copyin (list) - the list must contain common block (or global) names that have been declared threadprivate; data in the master thread's copy of that common block is copied to the thread-private storage at the beginning of the parallel region. There is no “copyout” clause - data in a private common block is not available outside of that thread

if (scalar_logical_expression) - when an “if” clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.

ordered - only for do/for work-sharing constructs; the code in the ORDERED block is executed in the same sequence as in sequential execution

schedule (kind[,chunk]) - only for do/for work-sharing constructs; specifies the scheduling discipline for loop iterations

nowait - the end of a work-sharing construct and the SINGLE directive imply a synchronization point unless “nowait” is specified
OpenMP on the Origin 2000
Parallel Sections Directive
c$omp parallel sections private(i)
c$omp section
      block1
c$omp section
      block2
c$omp end parallel sections
Topics: Clauses, Detailed construct
OpenMP on the Origin 2000
Parallel Sections Directive - Clauses
shared
private
default(private|shared|none)
firstprivate
lastprivate
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
copyin(var)
OpenMP on the Origin 2000
Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
        a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
        b(k)=k
      enddo
c$omp end do
c$omp end parallel
OpenMP on the Origin 2000
Defining a Parallel Region - Explicit Sections
c$omp parallel shared(a,b)
c$omp section
      block1
c$omp single
      block2
c$omp section
      block3
c$omp end parallel
OpenMP on the Origin 2000
Synchronization Constructs
master / end master
critical / end critical
barrier
atomic
flush
ordered / end ordered
OpenMP on the Origin 2000
Run-Time Library Routines
Execution environment
omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_in_parallel
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
OpenMP on the Origin 2000
Run-Time Library Routines
Lock routines
omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
OpenMP on the Origin 2000
Environment Variables
OMP_NUM_THREADS (or MP_SET_NUMTHREADS)
OMP_DYNAMIC
OMP_NESTED
Exercise 5 – OpenMP to parallelize a loop
[Exercise code: set the initial values, then parallelize the main loop.]
Enhancing Performance
• Ensuring sufficient work: running a loop in parallel adds runtime costs
• Scheduling loops for load balancing
The SCHEDULE clause
SCHEDULE (TYPE[,CHUNK])
Static: iterations are divided into chunks (of size CHUNK, or equally sized if no chunk is given) and assigned to threads in a fixed order at the start of the loop
Dynamic: at runtime, chunks are assigned to threads dynamically, as each thread becomes free
OpenMP summary
- A small number of compiler directives to set up parallel execution of code, plus a runtime library system for locking functions
- Portable directives (supported by different vendors in the same way)
- Parallelization is for the SMP programming model - the machine should have a global address space
- The number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads, nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent - that is, the results of the computation should be within rounding accuracy of the sequential program
- On SGI, OpenMP programming can be mixed with the MPI library, so it is possible to have “hierarchical parallelism”:
  * OpenMP parallelism within a single node (global address space)
  * MPI parallelism between nodes in a cluster (network connection)