TRANSCRIPT
1
ISCM-10
Taub Computing Center
High Performance Computing for Computational Mechanics
Moshe Goldberg, March 29, 2001
2
High Performance Computing for CM
Agenda:
1) Overview
2) Alternative Architectures
3) Message Passing
4) "Shared Memory"
5) Case Study
4
Some Important Points
* Understanding HPC concepts
* Why should programmers care about the architecture?
* Do compilers make the right choices?
* Nowadays, there are alternatives
5
Trends in computer development
* Speed of calculation is steadily increasing
* Memory may not be in balance with high calculation speeds
* Workstations are approaching speeds of especially efficient designs
* Are we approaching the limit of the speed of light?
* To get an answer faster, we must perform calculations in parallel
6
Some HPC concepts
* HPC
* HPF / Fortran90
* cc-NUMA
* Compiler directives
* OpenMP
* Message passing
* PVM / MPI
* Beowulf
7
MFLOPS for "parix" (Origin2000), Ax=b
[Chart: MFLOPS vs. number of processors (1-12); curves for n=2001, n=3501, n=5001]
8
Ideal parallel speedup
[Chart: ideal speedup vs. number of processors (1-12)]

speedup = (time for 1 cpu) / (time for n cpu's)
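As an illustration (numbers chosen for this example, not from the measurements that follow): if a job takes 120 seconds on 1 CPU and 15 seconds on 10 CPUs, the measured speedup is 120/15 = 8, while the ideal speedup for 10 CPUs is 10.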
9
Speedup for "parix" (Origin2000), Ax=b
[Chart: speedup vs. number of processors (1-12); curves for ideal, n=2001, n=3501, n=5001]
10
"or" - MFLOPS for matrix multiply (n=3001)
0.0
2000.0
4000.0
6000.0
8000.0
10000.0
12000.0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33
processors
MF
LO
PS
source
blas
11
"or" - Speedup for Matrix multiply (n=3001)
1.0
3.0
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
25.0
27.0
29.0
31.0
33.0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33
processors
sp
ee
du
p
ideal
source
blas
12
"or" - solve linear equations
0.0
1000.0
2000.0
3000.0
4000.0
5000.0
6000.0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33
processors
MF
LO
PS
n=2001
n=3501
n=5001
13
"or" - solve linear equations
1.0
3.0
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
23.0
25.0
27.0
29.0
31.0
33.0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33
processors
spe
ed
up
ideal
n=2001
n=3501
n=5001
15
Units Shipped -- All Vectors
[Chart: vector systems shipped per year, 1990-2000; vendors: Cray, Fujitsu, NEC, Other]
Source: IDC, 2001
16
Units Shipped -- Capability Vector
[Chart: capability vector systems shipped per year, 1990-2000; vendors: Cray, Fujitsu, NEC, Other]
Source: IDC, 2001
18
IUCC (Machba) computers
Cray J90 -- 32 CPUs, memory 4 GB (500 MW)
Origin2000 -- 112 CPUs (R12000, 400 MHz), 28.7 GB total memory
PC cluster -- 64 CPUs (Pentium III, 550 MHz), total memory 9 GB
(March 2001)
22
Symmetric Multiprocessors (SMP)
[Diagram: several CPUs sharing a single memory over a common memory bus]
Examples: SGI Power Challenge, Cray J90/T90
23
Distributed Parallel Computing
[Diagram: each CPU has its own local memory; the nodes are connected by a network]
Examples: SP2, Beowulf
28
MPI commands -- examples

call MPI_SEND(sum,1,MPI_REAL,ito,itag,MPI_COMM_WORLD,ierror)
call MPI_RECV(sum,1,MPI_REAL,ifrom,itag,MPI_COMM_WORLD,istatus,ierror)
29
Some basic MPI functions
Setup: mpi_init, mpi_finalize
Environment: mpi_comm_size, mpi_comm_rank
Communication: mpi_send, mpi_recv
Synchronization: mpi_barrier
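Put together, these calls form a complete (if trivial) MPI program. The following Fortran sketch is only an illustration, not code from the talk: rank 0 sends one real to rank 1 using the basic calls listed above.

      program demo
c     Illustrative MPI sketch: rank 0 sends one real to rank 1.
      include 'mpif.h'
      integer ierror, rank, nprocs, itag
      integer istatus(MPI_STATUS_SIZE)
      real sum
      itag = 1
      call mpi_init(ierror)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)
      if (rank .eq. 0) then
         sum = 3.14
         call mpi_send(sum,1,MPI_REAL,1,itag,MPI_COMM_WORLD,ierror)
      else if (rank .eq. 1) then
         call mpi_recv(sum,1,MPI_REAL,0,itag,MPI_COMM_WORLD,
     &                 istatus,ierror)
         print *, 'node 1 received', sum
      endif
      call mpi_barrier(MPI_COMM_WORLD, ierror)
      call mpi_finalize(ierror)
      end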
30
Other important MPI functions
Asynchronous communication: mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait / mpi_waitall
Collective communication: mpi_barrier, mpi_bcast, mpi_gather, mpi_scatter, mpi_reduce, mpi_allreduce
Derived data types: mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_pack, mpi_type_commit, mpi_type_free
Creating communicators: mpi_comm_dup, mpi_comm_split, mpi_intercomm_create, mpi_comm_free
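Of the collective calls, mpi_reduce is probably the most common. A minimal sketch (illustrative values, not from the talk): each process contributes a partial sum and rank 0 receives the global total.

      program reduce_demo
c     Illustrative sketch: combine one partial sum per process
c     into a global sum on rank 0 with mpi_reduce.
      include 'mpif.h'
      integer ierror, rank, nprocs
      real part, total
      call mpi_init(ierror)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)
      part = rank + 1.0
      call mpi_reduce(part, total, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierror)
      if (rank .eq. 0) print *, 'global sum =', total
      call mpi_finalize(ierror)
      end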
32
Fortran directives -- examples

CRAY:
CMIC$ DO ALL
      do i=1,n
         a(i)=i
      enddo

SGI:
C$DOACROSS
      do i=1,n
         a(i)=i
      enddo

OpenMP:
C$OMP parallel do
      do i=1,n
         a(i)=i
      enddo
33
OpenMP Summary
OpenMP standard – first published Oct 1997
Directives
Run-time Library Routines
Environment Variables
Versions for f77, f90, c, c++
34
OpenMP Summary
Parallel Do Directive
c$omp parallel do private(I) shared(a)
      do I=1,n
         a(I) = I+1
      enddo
c$omp end parallel do
(the end parallel do directive is optional)
35
OpenMP Summary
Defining a Parallel Region - Individual Do Loops

c$omp parallel shared(a,b)
c$omp do private(j)
      do j=1,n
         a(j)=j
      enddo
c$omp end do nowait
c$omp do private(k)
      do k=1,n
         b(k)=k
      enddo
c$omp end do
c$omp end parallel
36
OpenMP Summary
Parallel Do Directive - Clauses
shared
private
default(private|shared|none)
reduction({operator|intrinsic}:var)
if(scalar_logical_expression)
ordered
copyin(var)
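For example, the reduction clause lets each thread accumulate into a private copy that is combined when the loop ends. A minimal sketch (assumed array and size, not from the talk):

      real a(1000), s
      integer i, n
c     Illustrative use of the reduction clause from the list above.
      n = 1000
      do i = 1, n
         a(i) = i
      enddo
      s = 0.0
c$omp parallel do private(i) shared(a) reduction(+:s)
      do i = 1, n
         s = s + a(i)
      enddo
      print *, 'sum =', s
      end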
37
OpenMP Summary
Run-Time Library Routines
Execution environment
omp_set_num_threads
omp_get_num_threads
omp_get_max_threads
omp_get_thread_num
omp_get_num_procs
omp_set_dynamic / omp_get_dynamic
omp_set_nested / omp_get_nested
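A minimal sketch (illustrative, not from the talk) of the execution-environment routines: each thread reports its own number and the size of the team.

      integer omp_get_thread_num, omp_get_num_threads
      integer me, nt
c     Illustrative use of the run-time library routines listed above.
      call omp_set_num_threads(4)
c$omp parallel private(me, nt)
      me = omp_get_thread_num()
      nt = omp_get_num_threads()
      print *, 'thread', me, ' of', nt
c$omp end parallel
      end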
38
OpenMP Summary
Run-Time Library Routines
Lock routines
omp_init_lock
omp_destroy_lock
omp_set_lock
omp_unset_lock
omp_test_lock
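A minimal sketch (illustrative, not from the talk) of the lock routines: a lock protects a shared counter that several threads update. The lock variable is assumed here to be a 64-bit integer; the OpenMP include file defines the proper kind.

      integer*8 lck
      integer count, i
c     Illustrative use of the lock routines listed above; a critical
c     or atomic directive would also work for this simple update.
      count = 0
      call omp_init_lock(lck)
c$omp parallel do private(i) shared(count, lck)
      do i = 1, 100
         call omp_set_lock(lck)
         count = count + 1
         call omp_unset_lock(lck)
      enddo
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end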
45
A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end
46
A sample program

      subroutine xmult (x1,x2,y1,y2,z1,z2,n)
      real x1(n),x2(n),y1(n),y2(n),z1(n),z2(n)
      real a,b,c,d
c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo
      end
47
A sample program

Run on Technion Origin2000
Vector length = 1,000,000
Loop repeated 50 times
Compiler optimization: low (-O1)

Elapsed time, sec:
  Compile          1 thread   2 threads   4 threads
  No parallel        15.0       15.3
  Parallel           16.0       26.0        26.8

Is this running in parallel?
48
A sample program

Run on Technion Origin2000; vector length = 1,000,000; loop repeated 50 times; compiler optimization: low (-O1)

Elapsed time, sec:
  Compile          1 thread   2 threads   4 threads
  No parallel        15.0       15.3
  Parallel           16.0       26.0        26.8

Is this running in parallel? WHY NOT?
49
A sample program

c$omp parallel do
      do i=1,n
         a=x1(i)*x2(i); b=y1(i)*y2(i)
         c=x1(i)*y2(i); d=x2(i)*y1(i)
         z1(i)=a-b; z2(i)=c+d
      enddo

Is this running in parallel? WHY NOT?
Answer: by default, the variables a, b, c, d are defined as SHARED
50
A sample program

Solution: define a, b, c, d as PRIVATE:
c$omp parallel do private(a,b,c,d)

Elapsed time, sec:
  Compile          1 thread   2 threads   4 threads
  No parallel        15.0       15.3
  Parallel           16.0        8.5         4.6

This is now running in parallel.
52
HPC in the Technion
SGI Origin2000 -- 22 CPUs (R10000, 250 MHz), total memory 5.6 GB
PC cluster (Linux Red Hat 6.1) -- 6 CPUs (Pentium II, 400 MHz), memory 500 MB/cpu
53
Fluent test case -- Stability of a subsonic turbulent jet
Source: Viktoria Suponitsky, Faculty of Aerospace Engineering, Technion
55
Reading "Case25unstead.cas"...
10000 quadrilateral cells, zone 1, binary.
19800 2D interior faces, zone 9, binary.
50 2D wall faces, zone 3, binary.
100 2D pressure-inlet faces, zone 7, binary.
50 2D pressure-outlet faces, zone 5, binary.
50 2D pressure-outlet faces, zone 6, binary.
50 2D velocity-inlet faces, zone 2, binary.
100 2D axis faces, zone 4, binary.
10201 nodes, binary.
10201 node flags, binary.
Fluent test case
10 time steps, 20 iterations per time step
58
Fluent test case

SMP command: fluent 2d -t8 -psmpi -g < inp

Host spawning Node 0 on machine "parix".
ID    Comm.  Hostname  O.S.   PID    Mach ID  HW ID  Name
-------------------------------------------------------------
host  net    parix     irix   19732  0        7      Fluent Host
n7    smpi   parix     irix   19776  0        7      Fluent Node
n6    smpi   parix     irix   19775  0        6      Fluent Node
n5    smpi   parix     irix   19771  0        5      Fluent Node
n4    smpi   parix     irix   19770  0        4      Fluent Node
n3    smpi   parix     irix   19772  0        3      Fluent Node
n2    smpi   parix     irix   19769  0        2      Fluent Node
n1    smpi   parix     irix   19768  0        1      Fluent Node
n0*   smpi   parix     irix   19767  0        0      Fluent Node
59
Fluent test case

Cluster command:
fluent 2d -cnf=clinux1,clinux2,clinux3,clinux4,clinux5,clinux6 -t6 -pnet -g < inp

Node 0 spawning Node 5 on machine "clinux6".
ID    Comm.  Hostname  O.S.        PID    Mach ID  HW ID  Name
-----------------------------------------------------------
n5    net    clinux6   linux-ia32  3560   5        9      Fluent Node
n4    net    clinux5   linux-ia32  19645  4        8      Fluent Node
n3    net    clinux4   linux-ia32  16696  3        7      Fluent Node
n2    net    clinux3   linux-ia32  17259  2        6      Fluent Node
n1    net    clinux2   linux-ia32  18328  1        5      Fluent Node
host  net    clinux1   linux-ia32  10358  0        3      Fluent Host
n0*   net    clinux1   linux-ia32  10400  0       -1      Fluent Node
60
Fluent test - time for multiple CPUs
[Chart: total run time (sec) vs. number of CPUs (1-8); curves for Origin2000 and PC cluster]
61
Fluent test - speedup by CPUs
[Chart: speedup vs. number of CPUs (1-8); curves for ideal, Origin2000, and PC cluster]