Programming the IBM Power3 SP
Eric Aubanel
Advanced Computational Research Laboratory
Faculty of Computer Science, UNB
Advanced Computational Research Laboratory
• High Performance Computational Problem-Solving and Visualization Environment
• Computational Experiments in multiple disciplines: CS, Science and Eng.
• 16-Processor IBM SP3
• Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory
www.cs.unb.ca/acrl
• Virendra Bhavsar, Director
• Eric Aubanel, Research Associate & Scientific Computing Support
• Sean Seeley, System Administrator
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
POWER chip: 1990 to 2003
1990: POWER
– Performance Optimized With Enhanced RISC
– Reduced Instruction Set Computer
– Superscalar: combined floating-point multiply-add (FMA) unit, which allowed a peak MFLOPS rate of 2 × MHz
– Initially: 25 MHz (50 MFLOPS) and 64 KB data cache
POWER chip: 1990 to 2003
1991: SP1
– IBM's first SP (Scalable POWERparallel)
– Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
– Parallel Environment & system software
POWER chip: 1990 to 2003
1993: POWER2
– 2 FMAs
– Increased data cache size
– 66.5 MHz (254 MFLOPS)
– Improved instruction set (incl. hardware square root)
– SP2: POWER2 + higher-bandwidth switch for larger systems
POWER chip: 1990 to 2003
1993: PowerPC
– Support for SMP

1996: P2SC
– POWER2 Super Chip: clock speeds up to 160 MHz
POWER chip: 1990 to 2003
Feb. '99: POWER3
– Combined P2SC & PowerPC
– 64-bit architecture
– Initially 2-way SMP, 200 MHz
– Cache improvements, including an L2 cache of 1–16 MB
– Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II - 375 MHz
– 4-way SMP
– 2 MULT/ADD units: 1500 MFLOPS per processor
– 64 KB Level 1 cache: 5 ns / 3.2 GB/s
– 8 MB Level 2 cache: 45 ns / 6.4 GB/s
– 1.6 GB/s memory bandwidth
– 6 GFLOPS/node
• Nighthawk II - 375 MHz
– 16-way SMP
– 2 MULT/ADD units: 1500 MFLOPS per processor
– 64 KB Level 1 cache: 5 ns / 3.2 GB/s
– 8 MB Level 2 cache: 45 ns / 6.4 GB/s
– 14 GB/s memory bandwidth
– 24 GFLOPS/node
The Clustered SMP
ACRL’s SP: Four 4-way SMPs
Each node has its own copy of the O/S
Processors on the same node are closer than those on different nodes
Power3 Architecture
Power4 - 32-way
• Logical UMA
• SP High Node
• L3 cache shared between all processors on node - 32 MB
• Up to 32 GB main memory
• Each processor: 1.1 GHz
• 140 Gflops total peak
[Diagram: sixteen 2-processor units, each with private L1 and L2 caches, connected in groups of four by GX buses]
Going to NUMA
• 32-way GP High node
• Own copy of AIX
• 128+ GFLOPS per high node
• Multiple Federation Adapters for scalable inter-node bandwidth
• NUMA up to 256 processors
[Diagram: two SP GP nodes, each running AIX with memory, processors / intra-node interconnect, and Federation Adapters providing up to 16 links, connected through the Federation Switch]

NUMA up to 256 processors: 1.1 TFLOPS
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Uni-processor Optimization
• Compiler options:
– start with -O3 -qstrict, then -O3, -qarch=pwr3
• Cache re-use
• Take advantage of superscalar architecture – give enough operations per load/store
• Use ESSL - optimization already maximally exploited
Memory Access Times
|         | Memory to L2 or L1  | L2 to L1       | L1 to Registers   |
|---------|---------------------|----------------|-------------------|
| Width   | 16 bytes / 2 cycles | 32 bytes/cycle | 2 × 8 bytes/cycle |
| Rate    | 1.6 GB/s            | 6.4 GB/s       | 3.2 GB/s          |
| Latency | ~35 cycles          | 6–7 cycles     | 1 cycle           |
Cache
• 128-byte cache line
• L1 cache: 128-way set-associative, 64 KB
• L2 cache: 4-way set-associative, 8 MB total (4 × 2 MB)
How to Monitor Performance?
• IBM's hardware monitor: HPMCOUNT
– Uses hardware counters on the chip
– Cache & TLB misses, FP ops, load/stores, …
– Beta version
– Available soon on ACRL's SP
HPMCOUNT sample output
real*8 a(256,256),b(256,256),c(256,256)
common a,b,c
do j=1,256
do i=1,256
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores : 0.525 M
Instructions per load/store : 2.749
Cycles per instruction : 2.378
Instructions per cycle : 0.420
Total floating point operations : 0.066 M
Hardware float point rate : 2.749 Mflop/sec
HPMCOUNT sample output
real*8 a(257,256),b(257,256),c(257,256)
common a,b,c
do j=1,256
do i=1,257
a(i,j)=b(i,j)+c(i,j)
end do
end do
end
PM_TLB_MISS (TLB misses) : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores : 0.527 M
Instructions per load/store : 2.749
Cycles per instruction : 1.271
Instructions per cycle : 0.787
Total floating point operations : 0.066 M
Hardware float point rate : 3.525 Mflop/sec
ESSL
• Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers
• Fast!
– 560×560 real*8 matrix multiply:
• Hand coding: 19 Mflops
• dgemm: 1.2 GFlops
• Parallel (threaded and distributed) versions
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
ACRL’s IBM SP
• 4 Winterhawk II nodes (16 processors)
• Each node has:
– 1 GB RAM
– 9 GB (mirrored) disk
– Switch adapter
• High Performance Switch
• Gigabit Ethernet (1 node)
• Control workstation
• Disk: SSA tower with 6 18.2 GB disks
IBM Power3 SP Switch
• Bidirectional multistage interconnection network (MIN)
• 300 MB/sec bi-directional
• 1.2 μs latency
General Parallel File System
[Diagram: four nodes connected by the SP Switch; each runs an Application, with a GPFS Client over RVSD/VSD on Nodes 2–4 and the GPFS Server over RVSD/VSD on Node 1]
ACRL Software
• Operating System: AIX 4.3.3
• Compilers
– IBM XL Fortran 7.1 (HPF not yet installed)
– VisualAge C for AIX, Version 5.0.1.0
– VisualAge C++ Professional for AIX, Version 5.0.0.0
– IBM VisualAge Java (not yet installed)
• Job Scheduler: LoadLeveler 2.2
• Parallel Programming Tools
– IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O
• Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 )
• Visualization: OpenDX (not yet installed)
• E-Commerce software (not yet installed)
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Parallel Computing?
• Solve large problems in reasonable time
• Many algorithms are inherently parallel
– image processing, Monte Carlo
– simulations (e.g. CFD)
• High performance computers have parallel architectures
– Commercial off-the-shelf (COTS) components
• Beowulf clusters
• SMP nodes
– Improvements in network technology
NRL Layered Ocean Model at Naval Research Laboratory
IBM Winterhawk II SP
Parallel Computational Models
• Data Parallelism
– Parallel program looks like a serial program
• the parallelism is in the data
– Vector processors
– HPF
Parallel Computational Models
• Message Passing (MPI)
– Processes have only local memory but can communicate with other processes by sending & receiving messages
– Data transfer between processes requires operations to be performed by both processes
– The communication network is not part of the computational model (hypercube, torus, …)
Parallel Computational Models
• Shared Memory (threads)
– P(osix) threads
– OpenMP: a higher-level standard
Parallel Computational Models
• Remote Memory Operations
– "One-sided" communication
• MPI-2, IBM’s LAPI
– One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory
Parallel Computational Models
• Combined: Message Passing & Threads
– Driven by clusters of SMPs
– Leads to software complexity!
[Diagram: several processes, each with its own address space and threads, connected by a network]
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
![Page 40: Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB](https://reader036.vdocuments.us/reader036/viewer/2022081518/551acc5655034656628b5ef0/html5/thumbnails/40.jpg)
Message Passing Interface
• MPI 1.0 standard in 1994
• MPI 1.1 in 1995 - IBM support
• MPI 2.0 in 1997
– Includes 1.1 but adds new features
• MPI-IO
• One-sided communication
• Dynamic processes
Advantages of MPI
• Universality
• Expressivity
– Well suited to formulating parallel algorithms
• Ease of debugging
– Memory is local
• Performance
– Explicit association of data with a process allows good use of cache
MPI Functionality
• Several modes of point-to-point message passing
– blocking (e.g. MPI_SEND)
– non-blocking (e.g. MPI_ISEND)
– synchronous (e.g. MPI_SSEND)
– buffered (e.g. MPI_BSEND)
• Collective communication and synchronization
– e.g. MPI_REDUCE, MPI_BARRIER
• User-defined datatypes
• Logically distinct communicator spaces
• Application-level or virtual topologies
Simple MPI Example
[Diagram: two processes, My_Id 0 and My_Id 1; rank 0 prints "This is from MPI process number 0" and the other prints "This is from MPI processes other than 0"]
Simple MPI Example

Program Trivial
implicit none
include "mpif.h" ! MPI header file
integer My_Id, Numb_of_Procs, Ierr
call MPI_INIT ( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
if ( My_Id .eq. 0 ) then
print *, ' This is from MPI process number ',My_Id
else
print *, ' This is from MPI processes other than 0 ', My_Id
end if
call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
stop
end
MPI Example with send/recv
[Diagram: processes My_Id 0 and My_Id 1; each sends to and receives from the other]
MPI Example with send/recv

Program Simple
implicit none
Include "mpif.h"
Integer My_Id, Other_Id, Nx, Ierr
Parameter ( Nx = 100 )
Real A ( Nx ), B ( Nx )
call MPI_INIT ( Ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
Other_Id = Mod ( My_Id + 1, 2 )
A = My_Id
call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Ierr )
call MPI_FINALIZE ( Ierr )
stop
end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, &status);
MPI Message Passing Modes
• Ready mode → ready protocol
• Standard mode → eager protocol (message <= eager limit) or rendezvous protocol (message > eager limit)
• Synchronous mode → rendezvous protocol
• Buffered mode → buffered protocol

The default eager limit on the SP is 4 KB (it can be raised to 64 KB)
MPI Performance Visualization
• ParaGraph
– Developed at the University of Illinois
– A graphical display system for visualizing the behaviour and performance of MPI programs
Message Passing on SMP
[Diagram: MPI_SEND copies the data to send into a buffer; the data crosses the memory crossbar or switch into the receiver's buffer and becomes the received data at MPI_RECEIVE]
export MP_SHARED_MEMORY=yes|no
Shared Memory MPI
MP_SHARED_MEMORY=<yes|no>

|                 | Latency (μs) | Bandwidth (MB/s) |
|-----------------|--------------|------------------|
| between 2 nodes | 24           | 133              |
| same node (no)  | 30           | 80               |
| same node (yes) | 10           | 270              |
Message Passing off Node
MPI Across all the processors
Many more messages going through the fabric
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
OpenMP
• 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
• www.openmp.org
• OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
OpenMP
• All processors can access all the memory in the parallel system
• Parallel execution is achieved by generating threads which execute in parallel
• Overhead for SMP parallelization is large (100–200 μs); the parallel work construct must be significant enough to overcome the overhead
OpenMP
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements are executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
OpenMP
How is OpenMP typically used?
• OpenMP is usually used to parallelize loops:
– Find your most time-consuming loops.
– Split them up between threads.
• Better scaling can be obtained using OpenMP parallel regions, but can be tricky!
OpenMP Loop Parallelization

!$OMP PARALLEL DO
do i=0,ilong
do k=1,kshort
...
end do
end do
#pragma omp parallel for
for(i=0; i <= ilong; i++)
for(k=1; k <= kshort; k++) {
...
}
Variable Scoping
• Most difficult part of shared memory parallelization
– What memory is shared
– What memory is private (each processor has its own copy)
• Compare MPI: all variables are private
• Variables are shared by default, except:
– loop indices
– scalars that are set and then used in the loop
How Does Sharing Work?
THREAD 1: increment(x)
{ x = x + 1; }

THREAD 1:
10 LOAD A, (x address)
20 ADD A, 1
30 STORE A, (x address)

THREAD 2: increment(x)
{ x = x + 1; }

THREAD 2:
10 LOAD A, (x address)
20 ADD A, 1
30 STORE A, (x address)
Shared X initially 0
Result could be 1 or 2
Need synchronization
False Sharing

[Diagram: elements 7 down to 0 of one cache line, split between Processor 1 and Processor 2; block in cache, cache line, address tag]
Say A(1:5) starts on a cache line; then some of A(6:10) will also sit on that first cache line, so it won't be accessible until the first thread has finished.

!$OMP PARALLEL DO
do I=1,20
  A(I)= ...
enddo
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
Why Hybrid MPI-OpenMP?
• To optimize performance on “mixed-mode” hardware like the SP
• MPI is used for "inter-node" communication, and OpenMP is used for "intra-node" communication
– threads have lower latency
– threads can alleviate the network contention of a pure MPI implementation
Hybrid MPI-OpenMP?
• Unless you are forced against your will, for the hybrid model to be worthwhile:
– There has to be obvious parallelism to exploit
– The code has to be easy to program and maintain (it is easy to write bad OpenMP code)
– It has to promise to perform at least as well as the equivalent all-MPI program
• Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
– especially true of applications with a single level of parallelism
Hybrid Scenario
• Thread the computational portions of the code that exist between MPI calls
• MPI calls are "single-threaded" and therefore use only a single CPU
• Assumes:
– the application has two natural levels of parallelism
– or that, in breaking up an MPI code with one level of parallelism, communication between the resulting threads is little or none
Programming the IBM Power3 SP
• History and future of POWER chip
• Uni-processor optimization
• Description of ACRL’s IBM SP
• Parallel Processing
– MPI
– OpenMP
• Hybrid MPI/OpenMP
• MPI-I/O (one slide)
MPI-IO
• Part of MPI-2
• Resulted from work at IBM Research exploring the analogy between I/O and message passing
• See "Using MPI-2" by Gropp et al. (MIT Press)

[Diagram: processes transferring data between their memory and a shared file]
Conclusion
• Don't forget uni-processor optimization
• If you choose one parallel programming API, choose MPI
• Mixed MPI-OpenMP may be appropriate in certain cases
– More work needed here
• Remote memory access model may be the answer