NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
March 17, 2003
Libraries and Their Performance
Frank V. Hale
Thomas M. DeBoni
NERSC User Services Group
Part I: Single Node Performance Measurement
• Use of hpmcount for measurement of total code performance
• Use of HPM Toolkit for measurement of code section performance
• Vector operations generally give better performance than scalar (indexed) operations
• Shared-memory, SMP parallelism can be very effective and easy to use
Demonstration Problem
• Compute pi using random points in the unit square (the ratio of points inside the inscribed circle to all points in the square approaches pi/4)
• Use an input file with a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1; unformatted, 8-byte floating point numbers (1 gigabyte of data)
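A minimal Python sketch of this estimator (illustrative only; the slides read pregenerated random numbers from a 1 GB file rather than generating them in memory):

```python
import random

def estimate_pi(points, seed=0):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the inscribed circle approaches pi/4."""
    rng = random.Random(seed)
    circle = 0
    for _ in range(points):
        x, y = rng.random(), rng.random()
        # same test as the Fortran code: distance from (0.5, 0.5) <= 0.5
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            circle += 1
    return 4.0 * circle / points

print(estimate_pi(100000))  # close to 3.14
```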
A first Fortran code
% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
         read(10)x
         read(10)y
         if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .          ((4.*circle)/points)
      end
Compile and Run with hpmcount
% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit
Performance of first code
Points       Pi        Wall Clock (sec.)  Mflip/s
10           3.56000   0.055              0.007
100          3.36000   0.030              0.033
1,000        3.19600   0.038              0.189
10,000       3.15000   0.120              0.587
100,000      3.14700   0.936              0.748
1,000,000    3.14099   8.979              0.780
10,000,000   3.14199   89.194             0.785
Performance of first code
[Figure: wall clock (sec.) versus number of points, log-log scale, 10 to 10^7 points]
Some Observations
• Performance is very poor: less than 1 Mflip/s (peak is 1,500 Mflip/s per processor)
• Scalar approach to computation
• Scalar I/O mixed with scalar computation
Suggestions:
• Separate I/O from computation
• Use vector operations on dynamically allocated vector data structures
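Both suggestions can be sketched in Python (not the authors' code; the one-value-per-record file layout is simplified here to a raw byte stream): read all the data in one bulk operation, then compute, keeping I/O and computation separate.

```python
import struct, io

def estimate_pi_batched(stream, points):
    """Read all 2*points doubles in one call, then compute in one pass,
    keeping I/O separate from computation."""
    raw = stream.read(16 * points)                 # one bulk read of (x,y) pairs
    vals = struct.unpack("<%dd" % (2 * points), raw)
    xs, ys = vals[0::2], vals[1::2]                # deinterleave x and y
    circle = sum(1 for x, y in zip(xs, ys)
                 if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25)
    return 4.0 * circle / points

# Tiny in-memory "file" of 4 points, all at the center of the square
data = struct.pack("<8d", *([0.5, 0.5] * 4))
print(estimate_pi_batched(io.BytesIO(data), 4))  # -> 4.0 (all inside the circle)
```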
A second code, Fortran 90
% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
!     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
         read(10)x(i)
         read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ", &
                ((4.*circle)/points)
      end
Performance of second code
Points       Pi        Wall Clock (sec.)  Mflip/s
10           3.56000   0.090              0.004
100          3.36000   0.030              0.034
1,000        3.19000   0.039              0.197
10,000       3.15000   0.120              0.612
100,000      3.14700   0.967              0.755
1,000,000    3.14099   9.152              0.798
10,000,000   3.14199   91.170             0.801
Performance of second code
[Figure: wall clock (sec.) versus number of points, log-log scale, 10 to 10^7 points]
Observations on Second Code
• Operations on whole vectors should be faster, but
• No real improvement in performance of total code was observed.
• Suspect that most time is being spent on I/O.
• I/O is now separate from computation, so the code is easy to instrument in sections
Instrument code sections with HPM Toolkit
Four sections to be separately measured:
• Data structure initialization
• Read data
• Estimate pi
• Write output
Calls to f_hpmstart and f_hpmstop around each section.
Instrumented Code (1 of 2)
% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
Instrumented Code (2 of 2)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
         read(10)x(i)
         read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ", &
                ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end
Notes on Instrumented Code
• Entire executable code enclosed between f_hpminit and f_hpmterminate
• Code sections enclosed between f_hpmstart and f_hpmstop
• Descriptive text labels appear in output file(s)
Compile and Run with HPM Toolkit
% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit
Notes on Use of HPM Toolkit
• Must load module hpmtoolkit
• Need to include header file f_hpm.h in Fortran code, and give preprocessor directions to the compiler with -qsuffix
• Performance output is written to a file named perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id
• Message from sample executable: libHPM output in perfhpm0000.21410
Comparison of Code Sections
Section            Wall Clock (sec.)  % Time   Mflip/s
Init Data Structs  0.248              0.27     0.000
Read Data          89.933             99.02    0.000
Estimate Pi        0.641              0.71     114.327
Write Output       0.001              0.00     0.381
Total              90.823             100.00   0.801
10,000,000 points
Observations on Sections
• Optimization of the estimation of pi has little effect because
• The code spends 99% of its time reading the data
• Can the I/O be optimized?
Reworking the I/O
• Whole array I/O versus scalar I/O
• The scalar I/O file (one number per record) is twice as big (8 bytes for the number, 8 bytes for the end-of-record marker)
• The whole-array I/O file has only one end-of-record marker
• Only one call to the Fortran read routine for whole-array I/O:
  read(10)xy
• Need some fancy array footwork to sort out x(1), y(1), x(2), y(2), ... x(n), y(n) from the xy array:
  x = xy(1::2)
  y = xy(2::2)
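The stride-2 slices have direct analogues in most languages; a Python sketch of the same deinterleave (Fortran's xy(1::2) is 1-based, Python's xy[0::2] is 0-based):

```python
xy = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]   # interleaved x(1),y(1),x(2),y(2),...
x = xy[0::2]   # every other element starting at the first  -> x values
y = xy[1::2]   # every other element starting at the second -> y values
print(x)  # -> [0.1, 0.2, 0.3]
print(y)  # -> [0.9, 0.8, 0.7]
```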
Revised Data Structures and I/O
% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)
Vector I/O Code Sections
Section            Wall Clock (sec.)  % Time   Mflip/s
Init Data Structs  0.252              6.00     0.000
Read Data          3.162              75.34    0.000
Estimate Pi        0.771              18.37    94.053
Write Output       0.001              0.02     0.393
Total              4.197              100.00   15.4
10,000,000 points
Observations on New Sections
• Reading the data as one whole array instead of one scalar per record cut the read time from 89.9 to 3.16 seconds, a 96% reduction in I/O time.
• There was no performance penalty for the additional data structure complexity.
• I/O design can have very significant performance impacts!
• Total code performance with hpmcount is now 15.4 Mflip/s, 20 times improved from the 0.801 Mflip/s of the scalar I/O code.
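The figures quoted above are internally consistent; a quick check of the arithmetic using the timings from the two instrumented-section tables:

```python
# Timings (seconds) from the two instrumented runs, 10,000,000 points
scalar_io_total = 90.823   # scalar-read version
vector_io_total = 4.197    # whole-array-read version

io_before, io_after = 89.933, 3.162
io_reduction = 1 - io_after / io_before
speedup = scalar_io_total / vector_io_total

print(f"I/O time reduced by {io_reduction:.0%}")   # ~96%
print(f"Whole-code speedup: {speedup:.1f}x")       # ~21.6x
print(f"Mflip/s ratio: {15.4 / 0.801:.1f}")        # ~19x, the '20 times' improvement
```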
Automatic Shared-Memory (SMP) Parallelization
• IBM Fortran provides a –qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node.
• Default number of threads is 16; the number of threads is controlled by OMP_NUM_THREADS environment variable
• Allows use of the SMP version of the ESSL library,
-lesslsmp
Compiler Options
• The source code is the same as the previous, vector operation example, estpi4.f
• The -qsmp compiler option and the -lesslsmp link option enable automatic shared-memory parallelism (SMP)
• Compiler command line:
  xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f
SMP Code Sections
Section            Wall Clock (sec.)  % Time   Mflip/s
Init Data Structs  0.534              10.87    0.000
Read Data          4.311              87.78    0.000
Estimate Pi        0.064              1.30     1100. (up from 94)
Write Output       0.002              0.04     0.117
Total              4.911              100.00   15.4
10,000,000 points
Observations on SMP Code
• Computational section is now showing 1,100 Mflip/sec, or 4.6% of theoretical peak of 24,000 Mflip/sec on 16 processor node.
• Computational section is now 12 times faster, with no changes to source code
• Recommendation: always use thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise.
• There are no explicit parallelism directives in the source code; all threading is within the library.
Too Many Threads Can Spoil Performance
• Each node has 16 processors, and usually having more threads than processors will not improve performance
[Figure: computation Mflip/s versus number of threads, 0 to 28 threads]
Sidebar: Cost of Misaligned Common Block
• User code with Fortran77 style common blocks may receive an innocuous warning:
1514-008 (W) Variable … is misaligned. This may affect the efficiency of the code.
• How much can this affect the efficiency of the code?
• Test: put arrays x and y in misaligned common, with a 1-byte character in front of them
Potential Cost of Misaligned Common Blocks
• 10,000,000 points used for computing pi
• Properly aligned, dynamically allocated x and y used 0.064 seconds at 1,100 Mflip/s
• Misaligned, statically allocated x and y in a common block used 0.834 seconds at 88.4 Mflip/s
• Common block misalignment slowed computation by a factor of 12
Part I Conclusion
• hpmcount can be used to measure the performance of the total code
• HPM Toolkit can be used to measure the performance of discrete code sections
• Optimization effort must be focused effectively
• Fortran90 vector operations are generally faster than Fortran77 scalar operations
• Use of automatic SMP parallelization may provide an easy performance boost
• I/O may be the largest factor in “whole code” performance
• Misaligned common blocks can be very expensive
Part II: Comparing Libraries
• In the rich user environment on seaborg, there are many alternative ways to do the same computation
• The HPM Toolkit provides the tools to compare alternative approaches to the same computation
Dot Product Functions
• User coded scalar computation
• User coded vector computation
• Single processor ESSL ddot
• Multi-threaded SMP ESSL ddot
• Single processor IMSL ddot
• Single processor NAG f06eaf
• Multi-threaded SMP NAG f06eaf
Sample Problem
• Test the Cauchy-Schwarz inequality for N vectors of length N:
  (X•Y)^2 <= (X•X)(Y•Y)
• Generate 2N random numbers (array x2)
• Use the first N for X; (X•X) is computed once
• Vary vector Y:
  for i=1,n
     y = 2.0*x2(i:n+(i-1))
  First Y is 2X, second Y is 2(x2(2:N+1)), etc.
• Compute (2*N)+1 dot products of length N
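The quantity diffs(i) = (X•X)(Y•Y) - (X•Y)^2 should never be negative. A pure-Python sketch of the same check (the library ddot is replaced by a plain loop; exact zero cases like Y=2X can round to tiny negatives, hence the tolerance):

```python
import random

def ddot(x, y):
    """Plain dot product standing in for the BLAS ddot routine."""
    return sum(a * b for a, b in zip(x, y))

n = 1000
rng = random.Random(42)
x2 = [rng.random() for _ in range(2 * n)]   # 2N random numbers
x = x2[:n]                                  # first N are X
xx = ddot(x, x)                             # (X.X) computed once
diffs = []
for i in range(n):
    y = [2.0 * v for v in x2[i:i + n]]      # sliding, scaled window for Y
    diffs.append(xx * ddot(y, y) - ddot(x, y) ** 2)

print(all(d >= -1e-6 for d in diffs))  # True (Cauchy-Schwarz, up to rounding)
```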
Instrumented Code Section for Dot Products
      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
         y = 2.0*x2(i:n+(i-1))
         yy = ddot(n,y,1,y,1)
         xy = ddot(n,x,1,y,1)
         diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
Two User Coded Functions
      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
         dp = dp + x(i)*y(i)     ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)          ! User vector computation
      return
      end
Compile and Run User Functions
module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat
Compile and Run ESSL Versions
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -lessl"
$FC -o libs1 libs1.f
./libs1 <libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp"
$FC -o libs1smp libs1.f
./libs1smp <libs.dat
Compile and Run IMSL Version
module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl
Compile and Run NAG Versions
module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag
module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave"
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64
First Comparison of Dot Product (N=100,000)
Version      Wall Clock (sec)  Mflip/s  Scaled Time (1=Fastest)
User Scalar  246               203      1.72
User Vector  249               201      1.74
ESSL         145               346      1.01
ESSL-SMP     408               123      2.85  (Slowest)
IMSL         143               351      1.00  (Fastest)
NAG          250               200      1.75
NAG-SMP      180               278      1.26
Comments on First Comparisons
• The best results, by just a little, were obtained using the IMSL library, with ESSL a close second
• Third best was the NAG-SMP routine, with benefits from multi-threaded computation
• The user coded routines and NAG were about 75% slower than the ESSL and IMSL routines. In general, library routines are highly optimized and better than user coded routines.
• The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (default is 16).
ESSL-SMP Performance vs. Number of Threads
• All for N=100,000
• Number of threads controlled by environment variable OMP_NUM_THREADS
[Figure: ddot Mflip/s versus number of threads, 0 to 20 threads]
Revised First Comparison of Dot Product (N=100,000)
Version      Wall Clock (sec)  Mflip/s  Scaled Time (1=Fastest)
User Scalar  246               203      4.9
User Vector  249               201      5.0
ESSL         145               346      2.9
ESSL-SMP     50                1000     1.0  (Fastest, 4 threads)
IMSL         143               351      2.9
NAG          250               200      5.0  (Slowest)
NAG-SMP      180               278      3.6
Tuning the number of threads is very, very important for SMP codes!
Scaling up the Problem
• The first comparisons were for N=100,000 computing 200,001 dot products of vectors of length 100,000
• Second comparison for N=200,000 computes 400,001 dot products of vectors of length 200,000
• Increase computational complexity by a factor of 4.
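The factor of 4 follows from the work estimate: each length-N dot product costs about 2N flops, and the driver performs 2N+1 of them, so the work grows roughly as 4N^2. A quick check:

```python
def dot_product_work(n):
    """Approximate flop count: (2n+1) dot products of ~2n flops each."""
    return (2 * n + 1) * 2 * n

ratio = dot_product_work(200_000) / dot_product_work(100_000)
print(round(ratio, 3))  # -> 4.0
```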
Second Comparison of Dot Product (N=200,000)
Version      Wall Clock (sec)  Mflip/s  Scaled Time (1=Fastest)
User Scalar  1090              183      2.17
User Vector  1180              169      2.35  (Slowest)
ESSL         739               271      1.47
ESSL-SMP     503               398      1.00  (Fastest)
IMSL         725               276      1.44
NAG          1120              179      2.23
NAG-SMP      864               231      1.72
Comments on Second Comparisons (N=200,000)
• Now the best results are from the ESSL-SMP library, with the default 16 threads
• The next best group is ESSL, IMSL and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine.
• The worst results were seen from NAG (single thread) and the user code routines.
What is the impact of the number of threads on ESSL-SMP performance, given that it is already the best?
ESSL-SMP Performance vs. Number of Threads
• All for N=200,000
• Number of threads controlled by environment variable OMP_NUM_THREADS
[Figure: ddot Mflip/s versus number of threads, 0 to 20 threads]
Revised Second Comparison of Dot Product (N=200,000)
Version      Wall Clock (sec)  Mflip/s  Scaled Time (1=Fastest)
User Scalar  1090              183      7.5
User Vector  1180              169      8.1  (Slowest)
ESSL         739               271      5.1
ESSL-SMP     146               1370     1.0  (Fastest, 6 threads)
IMSL         725               276      5.0
NAG          1120              179      7.7
NAG-SMP      864               231      5.9
Scaling with Problem Size? (N1=100,000; N2=200,000; complexity ratio approx. 4)
Version      N2/N1 Wall Clock  N2/N1 Mflip/s
User Scalar  4.45              0.90
User Vector  4.75              0.84
ESSL         5.10              0.78
ESSL-SMP     2.92              1.37  (4 threads for N1; 6 threads for N2)
IMSL         5.07              0.79
NAG          4.48              0.90
NAG-SMP      4.80              0.83
Comments on Scaling Problem Size
• The ESSL-SMP performance, when tuned for the optimal number of threads, increased by almost 40% with the increased problem size.
• The untuned ESSL-SMP performance increased by a factor of 3.2 with the increased problem size.
• The user codes, ESSL, IMSL, NAG and NAG-SMP routines all showed 10%-22% decreases in performance with the larger problem size.
• It is not possible to determine, a priori, how the performance of different, functionally equivalent routines will scale with problem size.
Matrix Multiplication
• User coded scalar computation
• Fortran intrinsic matmul
• Single processor ESSL dgemm
• Multi-threaded SMP ESSL dgemm
• Single processor IMSL dmrrrr (32-bit)
• Single processor NAG f01ckf
• Multi-threaded SMP NAG f01ckf
Sample Problem
• Multiply two dense N by N matrices, A and B
• A(i,j) = i + j
• B(i,j) = j – i
• Output C(N,N) to verify result
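The test matrices and the triple-loop product can be sketched in Python (not the benchmark code; a tiny N makes the result easy to check by hand):

```python
def build_and_multiply(n):
    """Build the test matrices A(i,j)=i+j and B(i,j)=j-i (1-based indices,
    as in the Fortran code) and multiply them with the same triple loop."""
    a = [[i + j for j in range(1, n + 1)] for i in range(1, n + 1)]
    b = [[j - i for j in range(1, n + 1)] for i in range(1, n + 1)]
    c = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

print(build_and_multiply(2))  # -> [[-3, 2], [-4, 3]]
```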
Kernel of user matrix multiply
      do i=1,n
         do j=1,n
            a(i,j) = real(i+j)
            b(i,j) = real(j-i)
         enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j=1,n
         do k=1,n
            do i=1,n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
         enddo
      enddo
      call f_hpmstop(1)
Comparison of Matrix Multiply (N1=5,000)
Version      Wall Clock (sec)  Mflip/s  Scaled Time (1=Fastest)
User Scalar  1,490             168      106  (Slowest)
Intrinsic    1,477             169      106  (Slowest)
ESSL         195               1,280    13.9
ESSL-SMP     14                17,800   1.0  (Fastest)
IMSL         194               1,290    13.8
NAG          195               1,280    13.9
NAG-SMP      14                17,800   1.0  (Fastest)
Observations on Matrix Multiply
• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance
• All the single processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor
• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries
Comparison of Matrix Multiply (N2=10,000)
Version    Wall Clock (sec)  Mflip/s  Scaled Time
ESSL-SMP   101               19,800   1.01
NAG-SMP    100               19,900   1.00

• Scaling with Problem Size (complexity increase approx. 8 times)

Version    Wall Clock (N2/N1)  Mflip/s (N2/N1)
ESSL-SMP   7.2                 1.10
NAG-SMP    7.1                 1.12
Both ESSL-SMP and NAG-SMP showed 10% performance gains with the larger problem size.
Observations on Scaling
• Scaling of problem size was only done for the SMP libraries, to fit into reasonable times.
• Doubling N results in 8 times increase of computational complexity for dense matrix multiplication
• Performance actually increased for both routines for larger problem size.
ESSL-SMP Performance vs. Number of Threads
• All for N=10,000
• Number of threads controlled by environment variable OMP_NUM_THREADS
[Figure: dgemm Mflip/s versus number of threads, 0 to 36 threads]
Part II Conclusion
• The NERSC user environment provides a rich variety of mathematical libraries
• Performance can vary widely for the same computation, sometimes even for the same function name, from library to library; performance also varies with problem size and, for the SMP libraries, the number of threads
• It is not possible to know, a priori, which library will provide the best performance for a given function and problem size
• The HPM Toolkit provides a way to compare library routine performance and make informed choices
Part III: Moving to Multi-node Parallelism
• The examples so far have all been of single processor or multi-processor, shared-memory (SMP style) parallelism on a single 16 processor node
• The poe+ command is the multi-node equivalent of hpmcount, and poe+ can be used with MPI codes or multi-node, distributed memory parallel libraries such as PESSL and ScaLAPACK.
• poe+ is a perl script developed by David Skinner of the NERSC User Services Group which aggregates hpmcount results for each distributed-memory process
Kernel of PESSL/ScaLAPACK matrix multiply
! Call PESSL library routine
call f_hpminit((me+1),"Instrumented code")
call f_hpmstart((me+1),"Matrix multiply")
call pdgemm('T','T',n,n,n,1.0d0, myA,1,1,ides_a, &
myB,1,1,ides_b,0.d0, &
myC,1,1,ides_c )
call f_hpmstop(me+1)
call f_hpmterminate(me+1)
Comments on PESSL/ScaLAPACK Code
• Although the kernel on the previous slide looks like a simple progression from the ESSL version, actually there is a lot of work involved in understanding PESSL/ScaLAPACK for new users
• There are a number of data structure complexities which do not exist for the single-node libraries
• The “complete” matrix does not exist on any processor, but is block-cyclic distributed among processors
• There are added parameters of processor geometry and data distribution parameters.
• New users should study the ScaLAPACK tutorial on the Web at http://www.netlib.org/scalapack/tutorial/
Prolog for PESSL/ScaLAPACK matrix multiply
! Initialize blacs processor grid
call blacs_pinfo (me,procs)
call blacs_get (0, 0, icontxt)
call blacs_gridinit(icontxt, 'R', prow, pcol)
call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)
More Prolog for PESSL/ScaLAPACK
! Construct local arrays
      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)
! Initialize local arrays
      allocate(myA(myArows,myAcols))
      allocate(myB(myArows,myAcols))
      allocate(myC(myArows,myAcols))
      do i=1,n
         call g2l(i,n,prow,nb,iproc,myi)
         if (myrow==iproc) then
            do j=1,n
               call g2l(j,n,pcol,nb,jproc,myj)
               if (mycol==jproc) then
                  myA(myi,myj) = real(i+j)
                  myB(myi,myj) = real(i-j)
                  myC(myi,myj) = 0.d0
               endif
            enddo
         endif
      enddo
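The g2l call above maps a global index to an owning-process/local-index pair under the block-cyclic layout. The slide does not show its implementation; a Python sketch of the standard mapping (0-based indices, unlike the 1-based Fortran helper):

```python
def g2l(g, nb, nprocs):
    """Global index -> (owning process, local index) for a block-cyclic
    distribution with block size nb over nprocs processes (0-based)."""
    block = g // nb                           # which block the index falls in
    proc = block % nprocs                     # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb   # position within the local array
    return proc, local

# nb=2 over 2 processes: globals 0..7 fall in blocks [0,1][2,3][4,5][6,7]
print([g2l(g, 2, 2) for g in range(8)])
# -> [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```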
Still More Prolog for PESSL/ScaLAPACK
! Prepare array descriptors for PESSL (ScaLAPACK style)
ides_a(1) = 1 ! descriptor type
ides_a(2) = icontxt ! blacs context
ides_a(3) = n ! global number of rows
ides_a(4) = n ! global number of columns
ides_a(5) = nb ! row block size
ides_a(6) = nb ! column block size
ides_a(7) = 0 ! initial process row
ides_a(8) = 0 ! initial process column
ides_a(9) = myArows ! leading dimension of local array
do i=1,9
ides_b(i) = ides_a(i)
ides_c(i) = ides_a(i)
enddo
Compile Uninstrumented Codes and Run with poe+
setenv FC "mpxlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -bmaxdata:0x80000000 -bmaxstack:0x80000000 "
$FC -o ABCp -lblacs -lpessl ABCp.f
module load scalapack
$FC -o ABCs -qfree $PBLAS $BLACS $SCALAPACK -lessl ABCp.f
poe+ ./ABCp     ! PESSL version
poe+ ./ABCs     ! ScaLAPACK version
Four Runs for PESSL and ScaLAPACK Codes
• N=5000, 16 processors (one node) in 4x4 processor array
• N=10,000, 16 processors (one node) in 4x4 processor array
• N=5000, 64 processors (four nodes) in 8x8 processor array
• N=10000, 64 processors (four nodes) in 8x8 processor array
• Compare “whole code” performance using poe+ with “whole code” results for single-node ESSL-SMP routine using hpmcount.
• poe+ returns average wall clock time across all processes, and aggregate Mflip/s of all processes
Comparison of PESSL/ScaLAPACK dgemm (n=5000, 16 processors, "whole code" performance)
Section     Wall Clock (sec.)  Mflip/s  Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL       28.3               8,850    1.30
ScaLAPACK   30.4               8,240    1.40
ESSL-SMP achieved 47% of theoretical peak performance for one node
PESSL achieved 37%, and ScaLAPACK achieved 34%.
Comparison of PESSL/ScaLAPACK dgemm (n=10000, 16 processors, "whole code")
Section     Wall Clock (sec.)  Mflip/s  Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL       141.               14,230   1.20
ScaLAPACK   160.               12,500   1.30
ESSL-SMP achieved 70% of theoretical peak performance for one node
PESSL achieved 59%, and ScaLAPACK achieved 52%.
Comparison of PESSL/ScaLAPACK dgemm (n=5000, 64 processors, "whole code")
Section     Wall Clock (sec.)  Mflip/s  Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL       15.3               16,400   0.70
ScaLAPACK   14.2               17,600   0.65
PESSL achieved 17% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 18%.
Comparison of PESSL/ScaLAPACK dgemm (n=10000, 64 processors, "whole code")
Section     Wall Clock (sec.)  Mflip/s  Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL       51.5               38,900   0.43
ScaLAPACK   58.3               34,400   0.49
PESSL achieved 41% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 36%.
Comments on PESSL and ScaLAPACK Codes
• For problem sizes that fit within one node, the shared-memory, SMP libraries may give better performance than the distributed-memory, parallel libraries because of differences in data communication costs
• As the number of nodes and processors is increased, wall-clock time for distributed-memory libraries may drop below shared-memory SMP libraries for the same problem size, but per-processor efficiency may also drop.
• For problems which cannot fit in a node, the distributed-memory parallel libraries provide the best solution
Comments on using HPM Toolkit with PESSL and ScaLAPACK Codes
• HPM Toolkit generates two output files per task (one for statistics, one for visualization).
• Performance statistics for each task are found in files with names perfhpmNNNN.PPPPP where NNNN is the task id (or processor number), and PPPPP is the AIX process id
• Performance variations between processors and nodes can be observed.
PESSL dgemm results for Small Instrumented Section
• For N=5,000, 16 processors (one node), PESSL pdgemm:
  - average time of 16.9 seconds
  - aggregate 14,800 Mflip/s
  - 62% of the theoretical peak performance for a node
• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  - average time of 40.1 seconds
  - aggregate 50,000 Mflip/s
  - 52% of the theoretical peak performance for four nodes
Variability in PESSL dgemm Small Instrumented Section
• For N=5,000, 16 processors (one node), PESSL pdgemm:
  - wall clock per processor varies from 16.4 to 17.4 sec
  - Mflip/s per processor varies from 850 to 1000
• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  - wall clock per processor varies from 39.25 to 40.75 sec
  - Mflip/s per processor varies from 730 to 830
PESSL dgemm Task Variation (n=5000, 16 processors)
[Figure: per-task Mflip/s versus wall clock (s); Mflip/s range 840-1020, wall clock range 16.2-17.6 s]
PESSL dgemm Task Variation (n=10000, 64 processors)
[Figure: per-task Mflip/s versus wall clock (s); Mflip/s range 720-840, wall clock range 39-41 s]
Part III Conclusion
• NERSC provides a variety of distributed-memory, multi-node mathematical libraries (PESSL, ScaLAPACK and NAG Parallel).
• Performance of these libraries can be measured using “whole code” approaches with poe+, similar to hpmcount for single node codes
• The HPM Toolkit can be used to instrument small sections of codes for more detailed analysis, including variation between tasks; but a number of output files are produced and must be analyzed by the user.
References
• Information on hpmcount and poe+ for whole code performance measurement is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/
• Detailed information about the HPM Toolkit for measuring performance of discrete code sections is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_2_4_2.html
• The list of mathematical libraries available on seaborg can be found on the NERSC Website at http://hpcf.nersc.gov/software/ibm/#mathlibs