www.cineca.it optimization techniques carlo cavazzoni, hpc department, cineca
TRANSCRIPT
![Page 1: Www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA](https://reader036.vdocuments.us/reader036/viewer/2022062511/5514bd5f550346b0478b45d4/html5/thumbnails/1.jpg)
www.cineca.it
Optimization techniques
Carlo Cavazzoni, HPC department, CINECA
Modern node architecture
Diagram: CPU with instruction (I) and data (D) caches (small and fast), RAM, disk
Cache
Hierarchy: registers, L1, L2, L3, RAM
L1: instruction and data
Size: grows from L1 to Ln
Speed: decreases from L1 to Ln
The CPU looks for data in L1 first. If the data is there, it is an L1 cache hit; if not, it is an L1 cache miss and the CPU looks in L2, and so on.
Each cache miss carries a penalty measured in clock cycles.
CACHE Direct Mapped
Figure: a 32 Kbyte direct-mapped cache. Main memory (addresses 0, 32 K, 64 K, ..., 128 K) is drawn as consecutive 32 Kbyte blocks, each mapping onto the same cache.
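Set-associative cache
Figure: example of a 2-way set-associative cache. The 32 Kbyte cache is split into two 16 Kbyte ways; main memory addresses are marked at 0, 16 K, 32 K, 48 K, 64 K and 128 K.
Replacement policies: Least Recently Used (LRU), Round Robin, Random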
Loop optimization
Loop fusion: locality in time
Two separate loops:
do i=1, n
  a(i) = b(i) + 1.0
enddo
do i=2, n
  c(i) = sqrt(a(i-1))
enddo
If n is big enough, a is loaded into cache, evicted, and loaded again.
Fused loop:
a(1) = b(1) + 1.0
do i=2, n
  a(i) = b(i) + 1.0
  c(i) = sqrt(a(i-1))
enddo
The fused loop reuses the a(i) already loaded into cache.
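The same transformation can be sketched in C. This is only an illustration under stated assumptions: the function names two_loops and fused_loop are hypothetical, indices are 0-based, and sqrt is replaced by a multiplication by 0.5 so that the sketch needs no math library.

```c
#include <stddef.h>

/* Two separate passes: a[] is streamed through the cache twice. */
void two_loops(size_t n, const double *b, double *a, double *c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
    for (size_t i = 1; i < n; i++)
        c[i] = 0.5 * a[i - 1];        /* stands in for sqrt(a(i-1)) */
}

/* Fused pass: a[i-1] was written one iteration earlier and is
   very likely still in cache when c[i] needs it. */
void fused_loop(size_t n, const double *b, double *a, double *c)
{
    if (n == 0)
        return;
    a[0] = b[0] + 1.0;                /* peeled first iteration */
    for (size_t i = 1; i < n; i++) {
        a[i] = b[i] + 1.0;
        c[i] = 0.5 * a[i - 1];
    }
}
```

Both functions leave c[0] untouched and produce identical a and c; only the memory access pattern differs.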
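Loop interchange: locality in space
Fortran stores arrays column by column, so elements that differ only in the first index are adjacent in memory.
do i=1, n
  do j=1, n
    a(i,j) = b(i,j) + 1.0
  enddo
enddo
With j innermost, elements are loaded into cache lines and only one of them is used before the line is replaced with new elements.
do j=1, n
  do i=1, n
    a(i,j) = b(i,j) + 1.0
  enddo
enddo
With i innermost, elements are loaded into cache and all of them are used before they are replaced with new elements.
Figure: memory layout of a and b (addresses 0x00, 0x01, 0x02, 0x03 run along index i within a column j).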
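A C sketch of the same idea follows. Note the assumption that differs from the Fortran slide: C arrays are row-major, so the contiguous index is the last one and the favourable loop order is the opposite of the Fortran one. The function names are hypothetical and the matrices are stored as flat n x n arrays.

```c
#include <stddef.h>

/* Strided traversal: consecutive inner iterations are n doubles apart,
   so each cache line is touched once and then likely evicted. */
void add_one_strided(size_t n, const double *b, double *a)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            a[i * n + j] = b[i * n + j] + 1.0;
}

/* Contiguous traversal: the inner loop walks along a row, so every
   element of a loaded cache line is used before moving on. */
void add_one_contiguous(size_t n, const double *b, double *a)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] = b[i * n + j] + 1.0;
}
```

The two functions compute the same result; timing them on a matrix much larger than the cache is one way to observe the effect.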
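Cache thrashing
real, dimension (1024) :: a,b
COMMON /my_com/ a, b
do i=1, 1024
  a(i) = b(i) + 1.0
enddo
Cache size = 4*1024 bytes, direct mapped, a and b contiguous: cache thrashing. Each array is 1024 x 4 bytes = 4 Kbyte, exactly the cache size, so a(i) and b(i) always map to the same cache location.
An array size that is a multiple of the cache size is a possible source of cache thrashing.
Set-associative caches help reduce thrashing problems.
Padding
! pseudocode: cache line size in bytes divided by the size of a REAL in bytes
integer offset = (linea_cache)/SIZE(REAL)
real, dimension (1024+offset) :: a,b
COMMON /my_com/ a, b
do i=1, 1024
  a(i) = b(i) + 1.0
enddo
The offset shifts the arrays with respect to the cache, and the problem disappears.
Avoid powers of 2 for matrix dimensions.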
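The effect of padding can be checked with a small model of a direct-mapped cache. The sketch below is an assumption-laden illustration, not a description of any real processor: the function names are hypothetical, and the 32-byte cache line used in the usage note is an assumed value (the slides only give the 4*1024 byte cache size).

```c
#include <stddef.h>

/* Line index in a direct-mapped cache for a given byte offset. */
size_t cache_line_index(size_t byte_offset, size_t cache_bytes,
                        size_t line_bytes)
{
    return (byte_offset % cache_bytes) / line_bytes;
}

/* Count the indices i in [0, n) for which a[i] and b[i] map to the
   same cache line, with a and b stored back to back and each array
   padded by pad extra elements. */
size_t count_conflicts(size_t n, size_t pad, size_t elem_bytes,
                       size_t cache_bytes, size_t line_bytes)
{
    size_t b_start = (n + pad) * elem_bytes;   /* b follows a in memory */
    size_t conflicts = 0;
    for (size_t i = 0; i < n; i++) {
        size_t ia = cache_line_index(i * elem_bytes,
                                     cache_bytes, line_bytes);
        size_t ib = cache_line_index(b_start + i * elem_bytes,
                                     cache_bytes, line_bytes);
        if (ia == ib)
            conflicts++;
    }
    return conflicts;
}
```

With n = 1024, 4-byte elements, a 4096-byte cache and 32-byte lines, count_conflicts reports a conflict for every index when pad is 0, and none when pad is 8 (one cache line of 4-byte elements).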
Loop unrolling
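Original loop:
do j=1, n
  do i=1, (n-1)
    a(i,j) = b(i,j)+b(i+1,j)+1.0
  enddo
enddo
Unrolled by 2:
do j=1, n
  do i=1, (n-1), 2
    a(i,j)   = b(i,j)  +b(i+1,j)+1.0
    a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0
  enddo
enddo
The loops are equivalent when n-1 is even; otherwise the last iteration must be handled separately, because the unrolled body would access b(n+1,j).
Fewer jumps.
Fewer dependencies.
Fills pipelines and vector units.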
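When the trip count is not a multiple of the unrolling factor, a clean-up loop is needed. The following C sketch shows one way to do it for a one-dimensional version of the loop above; the function names are hypothetical and indices are 0-based.

```c
#include <stddef.h>

/* Reference: a[i] = b[i] + b[i+1] + 1 for i = 0 .. n-2. */
void stencil_plain(size_t n, const double *b, double *a)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = b[i] + b[i + 1] + 1.0;
}

/* Unrolled by 2, with a clean-up loop when the trip count is odd. */
void stencil_unrolled(size_t n, const double *b, double *a)
{
    size_t i = 0;
    for (; i + 2 < n; i += 2) {          /* b[i+2] must be valid */
        a[i]     = b[i]     + b[i + 1] + 1.0;
        a[i + 1] = b[i + 1] + b[i + 2] + 1.0;
    }
    for (; i + 1 < n; i++)               /* remainder: at most one pass */
        a[i] = b[i] + b[i + 1] + 1.0;
}
```

In practice an optimizing compiler often performs this transformation itself (for example with an option such as -funroll-loops in gcc), so manual unrolling is worth measuring before adopting it.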
Optimize with numerical libraries
Less coding
Tested and (almost) bug free
Standard
Efficient implementation
Optimized
BLAS
Basic Linear Algebra Subprograms, Parallel BLAS and Basic Linear Algebra Communication Subprograms (www.netlib.org)
• Level 1 BLAS: vector-vector operations (scalar only).
• Level 2 BLAS, PBLAS: matrix-vector operations (scalar and parallel).
• Level 3 BLAS, PBLAS: matrix-matrix operations (scalar and parallel).
• Level 1 and 2 BLACS: vector reduction, vector and matrix communications.
LAPACK and ScaLAPACK
Linear Algebra Package and Scalable LAPACK (www.netlib.org)
Matrix decomposition. Solution of linear systems. Eigenvalues and eigenvectors. Linear least squares solutions.
MKL, ESSL, ACML, CUBLAS, MAGMA, PLASMA
MASS (IBM)
• Accelerated versions of SQRT, SIN, COS, EXP, LOG, etc. Scalar and vector.
VML
Equivalent to MASS (vector version only), for Intel processors.
Accelerated versions of: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
VML
Scalar loop:
do i = 1, n
  r = r + sin( a(i) )
end do
With VML:
call vdsin( n, a, y )
do i = 1, n
  r = r + y( i )
end do
General form: CALL vml_subroutine( n, a, y )
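BLAS: matrix multiplication
DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)
c = alpha * op( a ) * op( b ) + beta * c
Here op( a ) is a or its transpose, selected by transa (likewise transb for b). With alpha = beta = 1 the four combinations are:
C_{lm} = \sum_n A_{ln} B_{nm} + C_{lm}
C_{lm} = \sum_n A^T_{ln} B_{nm} + C_{lm}
C_{lm} = \sum_n A_{ln} B^T_{nm} + C_{lm}
C_{lm} = \sum_n A^T_{ln} B^T_{nm} + C_{lm}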
real*8 a(lda,*), b(ldb,*), c(ldc,*)
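To make the meaning of the arguments concrete, here is a plain C reference for the non-transposed case only. It is a readability sketch, not the BLAS implementation, and it will be far slower than a tuned DGEMM. The name ref_gemm_nn is hypothetical, and the dimension names m, n, k follow the usual BLAS documentation (c is m x n, a is m x k, b is k x n) rather than the l, n, m used on the slide. Arrays are stored column by column, as in Fortran.

```c
#include <stddef.h>

/* Reference for c = alpha*a*b + beta*c, no transposition.
   lda, ldb, ldc are the leading dimensions (column strides). */
void ref_gemm_nn(size_t m, size_t n, size_t k, double alpha,
                 const double *a, size_t lda,
                 const double *b, size_t ldb,
                 double beta, double *c, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        /* Scale column j of c; with beta == 0 the old c is not read. */
        for (size_t i = 0; i < m; i++)
            c[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
        for (size_t p = 0; p < k; p++) {
            double bpj = alpha * b[p + j * ldb];
            for (size_t i = 0; i < m; i++)   /* unit stride down a column */
                c[i + j * ldc] += a[i + p * lda] * bpj;
        }
    }
}
```

The leading dimension is the distance in elements between the starts of two consecutive columns; it equals the number of rows unless the matrix is a sub-block of a larger array.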
Profiling with gprof
Compiler flag "-pg" or "-p" (depends on the compiler)
gcc -pg mio.c
./a.out
The run writes gmon.out, which gprof then analyzes:
gprof ./a.out gmon.out
gcc -pg -funroll-loops -O2 dotprod.c -static
[cineca@rfxoff1 Carlo]$ ./a.out
d = 1000000.000000
gprof output:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 68.57      0.05     0.05        2 23437.50 23437.50  set_vector
 31.43      0.07     0.02        1 21484.38 21484.38  dot_product
  0.00      0.07     0.00        1     0.00 68359.38  main
Profiling "by hand"
Find the hot spots in your application
Use timing functions
CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)
CALL critical_subroutine( …… )
CALL SYSTEM_CLOCK(iclk2)
PRINT *,REAL(iclk2-iclk1)/nclk
t1 = cclock()
CALL critical_subroutine( …… )
t2 = cclock()
PRINT *, (t2-t1)
CALL CPU_TIME( t3 )
CALL critical_subroutine( …… )
CALL CPU_TIME( t4 )
PRINT *, (t4-t3)
Measure performance
#include <stdio.h>
#include <time.h>
#include <ctype.h>
#include <sys/types.h>
#include <sys/time.h>

double cclock_()
{
  /* Returns the value of the system clock in seconds */
  struct timeval tmp;
  double sec;
  gettimeofday( &tmp, (struct timezone *)0 );
  sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
  return sec;
}
PROGRAM test_dgemm
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 1000
  REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
  INTEGER :: i,j,k
  REAL*8 :: t1, t2
  REAL*8 :: cclock
  EXTERNAL :: cclock
  ALLOCATE( x( dim, dim ), y( dim, dim ) )
  ALLOCATE( z( dim, dim ) )
  y = 1.0d0
  z = 1.0d0 / DBLE( dim )
  x = 0.0d0
  t1 = cclock( )
  ! hand-written matrix product
  do j = 1, dim
    do i = 1, dim
      do k = 1, dim
        x(i,j) = x(i,j) + y(i,k) * z(k,j)
      end do
    end do
  end do
  t2 = cclock()
  write(*,*) ' Matrix sum = ', sum(x)
  write(*,*) ' time (seconds) ', t2-t1
  DEALLOCATE( x, y, z )
END PROGRAM
PROGRAM test_dgemm
  IMPLICIT NONE
  INTEGER, PARAMETER :: dim = 1000
  REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
  INTEGER :: i,j,k
  REAL*8 :: t1, t2
  REAL*8 :: cclock
  EXTERNAL :: cclock
  ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )
  y = 1.0d0
  z = 1.0d0 / DBLE( dim )
  x = 0.0d0
  t1 = cclock()
  ! same product computed by the library; equivalent to x = matmul( y, z )
  call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, &
             dim, z, dim, 0.0d0, x, dim)
  t2 = cclock()
  write(*,*) ' Matrix sum = ', sum(x)
  write(*,*) ' time (seconds) ', t2-t1
  DEALLOCATE( x, y, z )
END PROGRAM
ATLAS
BLAS compatible
Automatically Tuned Linear Algebra Software
http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/
http://www.fftw.org
Fast Fourier Transform
FFT complex to complex
FFT complex to real
Parallel FFT
Multi-threaded FFT
Advanced techniques
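Case study: matrix transposition
do i=1,n
  do j=1,m
    y(j,i) = x(i,j)
  enddo
enddo
Think Fortran: consecutive elements of a column are contiguous in memory, so in the inner loop y is written contiguously while x is read with a stride of n elements.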
What happens with the cache
Suppose a 2-way set-associative cache.
Figure: x and y data mapped in cache; y "allocates" the 1st way.
For each value of x, a whole cache line must be loaded into cache.
1) x "allocates" the 2nd way.
2) Risk of thrashing.
3) When the cache is full, the processor starts to overwrite cache lines.
What happens with the cache, cont.
Suppose a 2-way set-associative cache.
Figure: x and y data mapped in cache; y "allocates" the 1st way.
As before, for each value of x a whole cache line must be loaded into cache.
A lot of data is loaded into the cache but never used!
Block algorithm
Suppose a 2-way set-associative cache.
Figure: a block of x and a block of y mapped in cache.
Load a block of data into cache
Swap data in cache
Write data back to memory
Solution: block algorithm
Straightforward implementation:
do i=1,n
  do j=1,m
    y(j,i) = x(i,j)
  enddo
enddo
Block implementation:
bsiz = block size
nb = n / bsiz
mb = m / bsiz
do ib = 1, nb
  ioff = (ib-1) * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    ! load a block of x into the buffer
    do j = 1, bsiz
      do i = 1, bsiz
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    ! transpose the buffer in place
    do j = 1, bsiz
      do i = 1, j-1
        bswp = buf(i,j)
        buf(i,j) = buf(j,i)
        buf(j,i) = bswp
      enddo
    enddo
    ! write the transposed block to y
    do i=1,bsiz
      do j=1,bsiz
        y(j+joff, i+ioff) = buf(j,i)
      enddo
    enddo
  enddo
enddo
You need to handle the cases MOD(n, bsiz) /= 0 or MOD(m, bsiz) /= 0.
Whole block transpose
do ib = 1, nb
  ioff = (ib-1) * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    do j = 1, bsiz
      do i = 1, bsiz
        buf(i,j) = x(i+ioff,j+joff)
      enddo
    enddo
    do j = 1, bsiz
      do i = 1, j-1
        bswp = buf(i,j)
        buf(i,j) = buf(j,i)
        buf(j,i) = bswp
      enddo
    enddo
    do i=1,bsiz
      do j=1,bsiz
        y(j+joff,i+ioff) = buf(j,i)
      enddo
    enddo
  enddo
enddo
Edge case 1: n is not a multiple of bsiz (leftover rows of x)
IF( MIN( 1, MOD(n,bsiz) ) .GT. 0 ) THEN
  ioff = nb * bsiz
  do jb = 1, mb
    joff = (jb-1) * bsiz
    do j = 1, bsiz
      do i = 1, MIN(bsiz, n-ioff)
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    do i = 1, MIN(bsiz, n-ioff)
      do j = 1, bsiz
        y(j+joff,i+ioff) = buf(i,j)
      enddo
    enddo
  enddo
END IF
Edge case 2: m is not a multiple of bsiz (leftover columns of x)
IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN
  joff = mb * bsiz
  do ib = 1, nb
    ioff = (ib-1) * bsiz
    do j = 1, MIN(bsiz, m-joff)
      do i = 1, bsiz
        buf(i,j) = x(i+ioff, j+joff)
      enddo
    enddo
    do i = 1, bsiz
      do j = 1, MIN(bsiz, m-joff)
        y(j+joff,i+ioff) = buf(i,j)
      enddo
    enddo
  enddo
END IF
Edge case 3: neither n nor m is a multiple of bsiz (corner block)
IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &
    MIN(1,MOD(m,bsiz)).GT.0 ) THEN
  joff = mb * bsiz
  ioff = nb * bsiz
  do j = 1, MIN(bsiz, m-joff)
    do i = 1, MIN(bsiz, n-ioff)
      buf(i,j) = x(i+ioff, j+joff)
    enddo
  enddo
  do i = 1, MIN(bsiz, n-ioff)
    do j = 1, MIN(bsiz, m-joff)
      y(j+joff,i+ioff) = buf(i,j)
    enddo
  enddo
END IF
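A more compact variant of the same idea can be sketched in C. It is not the slide's algorithm: instead of a separate buffer and three explicit edge cases, it clips the upper bound of each tile, which handles sizes that are not multiples of the block size in a single loop nest. The function name is hypothetical, the matrices are flat row-major arrays (x is n x m, y is m x n), and bsiz is assumed to be greater than zero.

```c
#include <stddef.h>

/* y = transpose of x, processed one bsiz x bsiz tile at a time.
   Tile bounds are clipped, so n and m need not be multiples of bsiz. */
void transpose_blocked(size_t n, size_t m, size_t bsiz,
                       const double *x, double *y)
{
    for (size_t ib = 0; ib < n; ib += bsiz) {
        size_t imax = (ib + bsiz < n) ? ib + bsiz : n;
        for (size_t jb = 0; jb < m; jb += bsiz) {
            size_t jmax = (jb + bsiz < m) ? jb + bsiz : m;
            /* Both the tile of x and the tile of y are small enough
               to stay in cache while they are being used. */
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    y[j * n + i] = x[i * m + j];
        }
    }
}
```

The best block size depends on the cache and is usually found by measurement, for instance by timing the routine for a range of bsiz values.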
Performance tuning and analysis: user codes
Matrix transposition, matrix size: 2048x2048
Figure: execution time (axis from 0.00 to 0.50) versus block size (axis from 0 to 120), comparing the straightforward implementation with the block implementation.
Parameter Dependent Code & Unrolling
For a given set of (input) parameters, every loop can be eliminated, which optimizes both cache use and the execution pipeline.
General code:
DO l=1,nphase
  IF(au1(l,l) /= 0.D0) THEN
    lp1=l+1
    div=1.D0/au1(l,l)
    DO lj=lp1,nphase
      au1(l,lj)=au1(l,lj)*div
    END DO
    bu1(l)=bu1(l)*div
    au1(l,l)=0.D0
    DO li=1,nphase
      amul=au1(li,l)
      DO lj=lp1,nphase
        au1(li,lj)=au1(li,lj)-amul*au1(l,lj)
      END DO
      bu1(li)=bu1(li)-amul*bu1(l)
    END DO
  END IF
END DO
Fully unrolled code for nphase = 3:
IF( a(1,1) /= 0.D0 ) THEN
  div = 1.D0 / a(1,1)
  a(1,2) = a(1,2) * div
  a(1,3) = a(1,3) * div
  b(1) = b(1) * div
  a(1,1) = 0.D0
  !li=2
  amul = a(2,1)
  a(2,2) = a(2,2) - amul * a(1,2)
  a(2,3) = a(2,3) - amul * a(1,3)
  b(2) = b(2) - amul * b(1)
  !li=3
  amul = a(3,1)
  a(3,2) = a(3,2) - amul * a(1,2)
  a(3,3) = a(3,3) - amul * a(1,3)
  b(3) = b(3) - amul * b(1)
END IF
IF( a(2,2) /= 0.D0 ) THEN
  div=1.D0/a(2,2)
  a(2,3)=a(2,3)*div
  b(2)=b(2)*div
  a(2,2)=0.D0
  !li=1
  amul=a(1,2)
  a(1,3)=a(1,3)-amul*a(2,3)
  b(1)=b(1)-amul*b(2)
  !li=3
  amul=a(3,2)
  a(3,3)=a(3,3)-amul*a(2,3)
  b(3)=b(3)-amul*b(2)
END IF
IF( a(3,3) /= 0.D0 ) THEN
  div=1.D0/a(3,3)
  b(3)=b(3)*div
  a(3,3)=0.D0
  !li=1
  amul=a(1,3)
  b(1)=b(1)-amul*b(3)
  !li=2
  amul=a(2,3)
  b(2)=b(2)-amul*b(3)
END IF
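The same pattern, reduced to its essentials, can be sketched in C: keep the general loop, and dispatch to a loop-free version when the parameter has a known common value. The example uses a dot product rather than the elimination code above, purely to stay short; all function names are hypothetical.

```c
#include <stddef.h>

/* General version: works for any n. */
double dot_generic(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Specialized version for n == 3: no loop, no branch. */
double dot3(const double *x, const double *y)
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2];
}

/* Parameter-dependent dispatch: use the unrolled code when it applies. */
double dot_dispatch(size_t n, const double *x, const double *y)
{
    if (n == 3)
        return dot3(x, y);
    return dot_generic(n, x, y);
}
```

The price of this technique is code duplication, so it tends to pay off only for small, frequently used parameter values in a true hot spot.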
Debugging (post mortem)
program hello_bug
  real(kind=8) :: a( 10 )
  call clearv( a, 10000 )
  print *, SUM( a )
end program

subroutine clearv( a, n)
  real(kind=8) :: a( * )
  integer :: n
  integer :: i
  do i = 1, n
    a( n ) = 0.0   ! deliberate bug: index n = 10000 is outside a(1:10)
  end do
end subroutine
gfortran -g hello_bug.f90
Remove the core size limit:
ulimit -c unlimited
./a.out
Segmentation fault (core dumped)
gdb ./a.out core
Debugging (in vivo)
gfortran -g hello_bug.f90
gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /plx/userinternal/acv0/a.out...done.
(gdb) run
Starting program: /plx/userinternal/acv0/a.out
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12
12          a( n ) = 0.0
Link Fortran and C
cclock.c:
#include <sys/types.h>
#include <sys/time.h>

double cclock_()
{
  /* Returns the value of the system clock in seconds */
  struct timeval tmp;
  double sec;
  gettimeofday( &tmp, (struct timezone *)0 );
  sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
  return sec;
}
matmul_prof.f90:
program mat_mul
  integer, parameter :: n = 100
  real*8 :: a(n,n), b(n,n), c(n,n)
  real*8 :: t1, t2
  real*8 :: cclock
  external cclock
  a = 1.0d0
  b = 1.0d0
  t1 = cclock()
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
  t2 = cclock()
  write(*,*) SUM(c), t2-t1
end program
1) gcc -c cclock.c
2) f95 matmul_prof.f90 cclock.o -L. -lblas
The Fortran compiler appends an underscore to external names, which is why the C function is called cclock_ while the Fortran code calls cclock.
Link Fortran and C
Link a C subroutine with a Fortran program
rand.f90:
program rand
  real(kind=8) :: a
  external crand
  call crand( a )
  print *,'this is random ', a
end program
crand.c:
#include <stdlib.h>
#include <time.h>

void crand_( double * x )
{
  (*x) = ( (double)random()/(double)RAND_MAX );
}
Fortran passes arguments by reference, C passes them by value
Link Fortran and C
Link a Fortran subroutine with a C program
cvec.c:
#include <stdio.h>

/* Fortran subroutine setv; all arguments are passed by reference */
void setv_( double *a, double *d, int *n );

int main()
{
  int n;
  double a[10], d;

  n = 10;
  d = 1.0;
  setv_( a, &d, &n );
  printf("%lf\n", a[0]);
}
vset.f90:
subroutine setv( a, d, n )
  real(kind=8) :: a( * )
  real(kind=8) :: d
  integer :: n
  integer :: i
  do i = 1, n
    a( i ) = d
  end do
end subroutine
Make Command
If a code is large, or shares subroutines with other codes, it is useful to split the source into many files, which may be placed in different directories.
In F90 there are dependencies among program units: modules must be compiled before any program unit that uses them. Therefore there is a well-defined order for compiling the source files.
To avoid compiling the sources by hand in the proper order, the make command can be used.
Make Command
The make command can be programmed to do the job for you, using a file containing instructions and directives.
By default the make command looks in the current directory for a file called Makefile or makefile.
A simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
	f90 -o myprog.x modules.o main.o
modules.o : modules.f90
	f90 -c modules.f90
main.o : modules.o main.f90
	f90 -c main.f90
The first rule tells the make command that myprog.x depends on modules.o and main.o.
Make executes the link command only when modules.o and main.o have been built.
To compile the code, the programmer issues the following command from the console:
> make
A less simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
	f90 -o myprog.x modules.o main.o
main.o : modules.o
# declare the suffixes used by the implicit rule below
.SUFFIXES: .f90 .o
.f90.o:
	f90 -c $<
The .f90.o rule is an implicit dependency: it states that every ".o" file depends on, and should be generated from, the corresponding ".f90" file.
$< is a make macro; it is expanded to the proper ".f90" file name.
In the above example, make tries to build myprog.x but realizes that main.o and modules.o must be generated first. It then looks for a rule to make the ".o" files and finds that main.o depends on modules.o, so it builds an internal compilation hierarchy in which modules.o comes before main.o. Finally, make finds the implicit rule and starts compiling the sources.