
Page 1:

Scientific Computing on Graphics Processor Units: An Application in Time-Domain Electromagnetic Simulations

Veysel Demir, Ph.D., vdemir@niu.edu

Department of Electrical Engineering, Northern Illinois University, DeKalb, IL 60115

Page 2:

2010 IEEE AP-S URSI, Toronto, Ontario, Canada

Outline

The Finite-Difference Time-Domain (FDTD) Method

Compute Unified Device Architecture (CUDA)

FDTD Using CUDA

Page 3:

Bachelor of Science, Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey, 1997.

System Analyst and Programmer, Pamukbank, Software Development Department, Istanbul, Turkey, July 1997 – August 2000.

Master of Science, Electrical Engineering, Syracuse University, Syracuse, NY, 2002.

Doctor of Philosophy, Electrical Engineering, Syracuse University, Syracuse, NY, 2004.

Research Assistant, Sonnet Software, Inc. Liverpool, NY, August 2000 – July 2004.

Visiting research scholar, University of Mississippi, Electrical Engineering Department, University, MS, July 2004 – Present.

Assistant Professor, Department of Electrical Engineering, Northern Illinois University, DeKalb, IL, August 2007 – present.

Veysel Demir

Page 4:

Computational Electromagnetics

Maxwell's equations can be given in differential or integral form.

Differential equation methods: finite-difference time-domain (FDTD), finite-difference frequency-domain (FDFD), finite element method (FEM), transmission line matrix (TLM)

Integral equation methods: method of moments (MoM), fast multipole method (FMM)

Page 5:

Computational Electromagnetics

Maxwell's equations can be given in the time domain or the frequency domain.

Time-domain methods: finite-difference time-domain (FDTD), transmission line matrix (TLM)

Frequency-domain methods: finite-difference frequency-domain (FDFD), method of moments (MoM), fast multipole method (FMM), finite element method (FEM)

Page 6:

Commercial Software Packages

Commercial software packages exist for each of these methods: finite-difference time-domain (FDTD), method of moments (MoM), finite element method (FEM), and transmission line matrix (TLM). Examples include HFSS (FEM), ADS Momentum (MoM), and CST Microstripes (TLM).

Page 7:

The Finite-Difference Time-Domain Method

Page 8:

FDTD Books

Page 9:

Yearly FDTD Publications

The most popular method in computational electromagnetics

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 10:

Maxwell’s Equations

The basic set of equations describing the electromagnetic world

Faraday's law: $\nabla \times \vec{E} = -\dfrac{\partial \vec{B}}{\partial t}$

Ampere's law: $\nabla \times \vec{H} = \dfrac{\partial \vec{D}}{\partial t} + \vec{J}$

Gauss's law: $\nabla \cdot \vec{D} = \rho_v$

Gauss's law for magnetism: $\nabla \cdot \vec{B} = 0$

Constitutive relations: $\vec{D} = \varepsilon \vec{E}$ and $\vec{B} = \mu \vec{H}$

Page 11:

FDTD Overview – Finite Differences

Represent the derivatives in Maxwell's curl equations by finite differences. We use the second-order accurate central difference formula:

$$\frac{df(x)}{dx} \approx \frac{f(x+\Delta x) - f(x-\Delta x)}{2\,\Delta x}$$
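As a quick illustration (a minimal sketch of my own, not part of the slides), the second-order behavior of this formula can be checked numerically; the test function f(x) = sin(x) is an arbitrary choice:

/* Minimal sketch (not from the slides): the error of the central difference
   applied to f(x) = sin(x) shrinks roughly as (dx)^2. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0;
    for (double dx = 1e-1; dx >= 1e-4; dx /= 10.0) {
        double approx = (sin(x + dx) - sin(x - dx)) / (2.0 * dx);
        printf("dx = %.0e   error = %.3e\n", dx, fabs(approx - cos(x)));
    }
    return 0;
}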

Page 12:

FDTD Overview – Cells

A three-dimensional problem space is composed of cells

Page 13:

FDTD Overview – The Yee Cell

The FDTD (Finite Difference Time Domain) algorithm was first established by Yee as a three dimensional solution of Maxwell's curl equations.

K. Yee, IEEE Transactions on Antennas and Propagation, May 1966.

Page 14:

FDTD Overview – Material grid

A three-dimensional problem space is composed of cells

Page 15:

FDTD Overview – Updating Equations

Three scalar equations can be obtained from one vector curl equation.

$$\epsilon \frac{\partial \vec{E}}{\partial t} = \nabla \times \vec{H}
\quad\Longrightarrow\quad
\epsilon_x \frac{\partial E_x}{\partial t} = \frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z}, \quad
\epsilon_y \frac{\partial E_y}{\partial t} = \frac{\partial H_x}{\partial z} - \frac{\partial H_z}{\partial x}, \quad
\epsilon_z \frac{\partial E_z}{\partial t} = \frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}$$

$$\mu \frac{\partial \vec{H}}{\partial t} = -\nabla \times \vec{E}
\quad\Longrightarrow\quad
\mu_x \frac{\partial H_x}{\partial t} = \frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y}, \quad
\mu_y \frac{\partial H_y}{\partial t} = \frac{\partial E_z}{\partial x} - \frac{\partial E_x}{\partial z}, \quad
\mu_z \frac{\partial H_z}{\partial t} = \frac{\partial E_x}{\partial y} - \frac{\partial E_y}{\partial x}$$

Page 16:

FDTD Overview – Updating Equations

Represent derivatives by finite-differences

$$\epsilon_x \frac{\partial E_x}{\partial t} = \frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z}$$

$$\epsilon_x(i,j,k)\,\frac{E_x^{n+1}(i,j,k) - E_x^{n}(i,j,k)}{\Delta t} =
\frac{H_z^{n+0.5}(i,j,k) - H_z^{n+0.5}(i,j-1,k)}{\Delta y} -
\frac{H_y^{n+0.5}(i,j,k) - H_y^{n+0.5}(i,j,k-1)}{\Delta z}$$

Page 17:

FDTD Overview – Updating Equations

Represent derivatives by finite-differences

$$\mu_x \frac{\partial H_x}{\partial t} = \frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y}$$

$$\mu_x(i,j,k)\,\frac{H_x^{n+0.5}(i,j,k) - H_x^{n-0.5}(i,j,k)}{\Delta t} =
\frac{E_y^{n}(i,j,k+1) - E_y^{n}(i,j,k)}{\Delta z} -
\frac{E_z^{n}(i,j+1,k) - E_z^{n}(i,j,k)}{\Delta y}$$

Page 18:

FDTD Overview – Updating Equations

Express the future components in terms of the past components

$$E_x^{n+1}(i,j,k) = E_x^{n}(i,j,k) + \frac{\Delta t}{\epsilon_x(i,j,k)}
\left[ \frac{H_z^{n+0.5}(i,j,k) - H_z^{n+0.5}(i,j-1,k)}{\Delta y}
     - \frac{H_y^{n+0.5}(i,j,k) - H_y^{n+0.5}(i,j,k-1)}{\Delta z} \right]$$

$$H_x^{n+0.5}(i,j,k) = H_x^{n-0.5}(i,j,k) + \frac{\Delta t}{\mu_x(i,j,k)}
\left[ \frac{E_y^{n}(i,j,k+1) - E_y^{n}(i,j,k)}{\Delta z}
     - \frac{E_z^{n}(i,j+1,k) - E_z^{n}(i,j,k)}{\Delta y} \right]$$
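To make the step from updating equations to code concrete, here is a minimal CPU sketch (my own illustration, not the slides' code) of the E_x update as a triple loop; the IDX macro, the array names, and the folded coefficients Cexe, Cexhz, Cexhy are assumptions:

/* Illustrative CPU sketch of the Ex updating equation; array layout, names,
   and coefficient definitions are assumptions, not the original code. */
#define IDX(i, j, k) ((i) + (j) * nx + (k) * nx * ny)

void update_Ex(int nx, int ny, int nz, float *Ex,
               const float *Hy, const float *Hz,
               const float *Cexe, const float *Cexhz, const float *Cexhy)
{
    for (int k = 1; k < nz; k++)
        for (int j = 1; j < ny; j++)
            for (int i = 0; i < nx; i++) {
                int c = IDX(i, j, k);
                Ex[c] = Cexe[c]  * Ex[c]
                      + Cexhz[c] * (Hz[c] - Hz[IDX(i, j - 1, k)])   /* dHz/dy term */
                      + Cexhy[c] * (Hy[c] - Hy[IDX(i, j, k - 1)]);  /* dHy/dz term */
            }
}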

Page 19:

FDTD Overview – Leap-frog Algorithm

Page 20:

FDTD Overview – Updating Coefficients

FDTD code in FORTRAN

The number of coefficients is smaller if a nonconductive medium is assumed.

$$H_x^{n+0.5}(i,j,k) = C_{hxh}(i,j,k)\,H_x^{n-0.5}(i,j,k)
 + C_{hxey}(i,j,k)\left[E_y^{n}(i,j,k+1) - E_y^{n}(i,j,k)\right]
 + C_{hxez}(i,j,k)\left[E_z^{n}(i,j+1,k) - E_z^{n}(i,j,k)\right]$$

subroutine update_magnetic_fields
! nx, ny, nz: number of cells in x, y, z directions
Hx = Chxh * Hx &
   + Chxey * (Ey(:,:,2:nz+1) - Ey(:,:,1:nz)) &
   + Chxez * (Ez(:,2:ny+1,:) - Ez(:,1:ny,:))
Hy = Chyh * Hy &
   + Chyez * (Ez(2:nx+1,:,:) - Ez(1:nx,:,:)) &
   + Chyex * (Ex(:,:,2:nz+1) - Ex(:,:,1:nz))
Hz = Chzh * Hz &
   + Chzex * (Ex(:,2:ny+1,:) - Ex(:,1:ny,:)) &
   + Chzey * (Ey(2:nx+1,:,:) - Ey(1:nx,:,:))
end subroutine update_magnetic_fields

General form of the corresponding curl equation (including the magnetic loss and impressed magnetic current terms):

$$\mu_x \frac{\partial H_x}{\partial t} = \frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y} - \sigma_x^{m} H_x - M_{ix}$$

Page 21:

Absorbing Boundary Conditions

The three-dimensional problem space is truncated by absorbing boundaries

The most popular absorbing boundary is the perfectly matched layer (PML).

Page 22:

Active and Passive Lumped Elements

Active and passive lumped elements can be modeled in FDTD

$$\epsilon \frac{\partial \vec{E}}{\partial t} = \nabla \times \vec{H} - \sigma^{e} \vec{E} - \vec{J}$$

Voltage source Current source

Page 23:

Active and Passive Lumped Elements

Resistor Capacitor Inductor Diode

Page 24:

Active and Passive Lumped Elements

A diode circuit

Page 25:

Transformation from Time-Domain to Frequency-Domain

Frequency-domain results can be obtained from the time-domain results using the Fourier transform.

Example: a low-pass filter (S11 and S21).

Page 26:

Near-Field to Far-field Transformations

An inverted-F antenna

Page 27:

Modeling fine geometries

It is possible to model fine structures using appropriate formulations

A wire loop antenna

Page 28:

Incident plane wave

Page 29:

Scattering Problems

$$\nabla \times \left(\vec{H}^{inc} + \vec{H}^{scat}\right) = \epsilon \frac{\partial}{\partial t}\left(\vec{E}^{inc} + \vec{E}^{scat}\right),
\qquad
\nabla \times \vec{H}^{inc} = \epsilon_0 \frac{\partial \vec{E}^{inc}}{\partial t}$$

A dielectric sphere

Page 30:

Scattering from a Dielectric Sphere

Page 31:

Earth / Ionosphere Models in Geophysics

Snapshots of FDTD-Computed Global Propagation of ELF Electromagnetic Pulse Generated by Vertical Lightning Strike off South America Coast

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 32:

Wireless Personal Communications Devices

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 33:

Phantom Head Validation at 1.8 GHz

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 34:

Ultrawideband Microwave Detection of Early-Stage Breast Cancer

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 35:

Focusing Plasmonic Lens

Source: Allen Taflove, “A Perspective on the 40-Year History of FDTD Computational Electrodynamics,” Applied Computational Electromagnetics Society (ACES) Conference, Miami, Florida, March 15, 2006. Available at http://www.ece.northwestern.edu/ecefaculty/Allen1.html

Page 36:

Image from: http://www.nvidia.com/object/GPU_Computing.html

Graphics Processing Unit (GPU) Computing

GPU computing is the use of a GPU (graphics processing unit) to do general-purpose scientific and engineering computing.

NVIDIA® Tesla™ C10 series / GeForce GTX 285 (240 cores): processor clock = 1.4 GHz, bandwidth = 159 GB/s, memory = 2 GB (GDDR3)

NVIDIA® Tesla™ C20 series (Fermi) / GeForce GTX 480 (480 cores): processor clock = 1.4 GHz, bandwidth = 177.4 GB/s, memory = 1.5 GB (GDDR5)

Intel® Quad Core processor: ~3 GHz

Page 37:

CPU vs GPU

The GPU Devotes More Transistors to Data Processing

Page 38:

CPU vs GPU

http://www.nvidia.com

Floating-Point Operations per Second

Page 39:

Programming FDTD for GPU Platform

OpenGL:
S. E. Krakiwsky, L. E. Turner, and M. M. Okoniewski, “Graphics Processor Unit (GPU) Acceleration of Finite-Difference Time-Domain (FDTD) Algorithm,” Proc. 2004 International Symposium on Circuits and Systems, vol. 5, pp. V-265–V-268, May 2004.
S. E. Krakiwsky, L. E. Turner, and M. M. Okoniewski, “Acceleration of Finite-Difference Time-Domain (FDTD) Using Graphics Processor Units (GPU),” 2004 IEEE MTT-S International Microwave Symposium Digest, vol. 2, pp. 1033–1036, Jun. 2004.
S. Adams, J. Payne, and R. Boppana, “Finite Difference Time Domain (FDTD) Simulations Using Graphics Processors,” Proceedings of the 2007 DoD High Performance Computing Modernization Program Users Group (HPCMP) Conference, pp. 334–338, 2007.

High Level Shader Language (HLSL):
N. Takada, N. Masuda, T. Tanaka, Y. Abe, and T. Ito, “A GPU Implementation of the 2-D Finite-Difference Time-Domain Code Using High Level Shader Language,” Applied Computational Electromagnetics Society Journal, vol. 23, no. 4, pp. 309–316, 2008.

Brook:
M. J. Inman, A. Z. Elsherbeni, and C. E. Smith, “GPU Programming for FDTD Calculations,” The Applied Computational Electromagnetics Society (ACES) Conference, Honolulu, Hawaii, 2005.
M. J. Inman and A. Z. Elsherbeni, “Programming Video Cards for Computational Electromagnetics Applications,” IEEE Antennas and Propagation Magazine, vol. 47, no. 6, pp. 71–78, December 2005.

Compute Unified Device Architecture (CUDA)

OpenCL

Page 40:

Compute Unified Device Architecture (CUDA)

NVIDIA® CUDA™ is a general-purpose parallel computing architecture.

Advantages:
Standard C language for parallel application development on the GPU
Standard numerical libraries for FFT and BLAS (Basic Linear Algebra Subroutines)
Scattered reads – code can read from arbitrary addresses in memory
Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads; this can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations, including integer texture lookups

Steep learning curve

Source: http://en.wikipedia.org/wiki/CUDA

Page 41:

Compute Unified Device Architecture (CUDA)

Disadvantages:
Texture rendering is not supported.
For double precision there are some deviations from the IEEE 754 standard.
In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported, and the precision of division/square root is slightly lower than in single precision.
The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
Threads should be run in groups of at least 32 for best performance, with the total number of threads numbering in the thousands.
Unlike OpenCL, CUDA-enabled GPUs are only available from NVIDIA.

Source: http://en.wikipedia.org/wiki/CUDA

Page 42:

Grid of thread blocks

Page 43:

Memory hierarchy

Page 44:

Heterogeneous programming
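The accompanying figure is not reproduced in the transcript. As a hedged host-side sketch of what heterogeneous (host plus device) programming looks like for FDTD, the host allocates device memory, copies the field arrays once, launches the update kernels inside the time-marching loop, and copies the results back; the function names and the simplified signatures below are my assumptions, not the author's code:

// Hedged host-side sketch (assumption): the CUDA heterogeneous model applied
// to FDTD time marching. Kernel signatures are simplified; field pointers and
// coefficients are omitted for brevity.
#include <cuda_runtime.h>

__global__ void update_magnetic_fields_on_kernel();   // device code (see Pages 54-55)
__global__ void update_electric_fields_on_kernel();   // assumed electric-field counterpart

void run_fdtd(float *h_fields, size_t bytes, int n_steps, dim3 grid, dim3 block)
{
    float *d_fields = 0;
    cudaMalloc((void **)&d_fields, bytes);                          // device global memory
    cudaMemcpy(d_fields, h_fields, bytes, cudaMemcpyHostToDevice);  // host -> device, once

    for (int n = 0; n < n_steps; n++) {                             // leap-frog time loop
        update_magnetic_fields_on_kernel<<<grid, block>>>();
        update_electric_fields_on_kernel<<<grid, block>>>();
    }

    cudaMemcpy(h_fields, d_fields, bytes, cudaMemcpyDeviceToHost);  // results back to host
    cudaFree(d_fields);
}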

Page 45:

FDTD using CUDA

Performance Optimization Strategies in CUDA

Structure the algorithm in a way that exposes as much data parallelism as possible.

Ensure global memory accesses are coalesced whenever possible.

Minimize the use of global memory.

Use shared memory to avoid redundant transfers from global memory.

Maintain a high level of occupancy.

Use a multiple of 32 threads for the number of threads per block as this provides optimal computing efficiency and facilitates coalescing.

CUDA Best Practices Guide, http://www.nvidia.com/object/cuda_develop.html.

Page 46:

Achieving Parallelism (xyz-mapping)

Mapping of threads to cells of an FDTD domain using the xyz-mapping: each thread is mapped to one cell.

block_dim_x = number_of_threads;
block_dim_y = 1;
n_blocks_y = nz;
n_blocks_x = (nx*ny)/number_of_threads + (((nx*ny)%number_of_threads == 0) ? 0 : 1);
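A hedged sketch of the index arithmetic a kernel performs under this xyz-mapping (it mirrors the cell-index computation in the kernel shown on Page 54; the variable names are illustrative):

// Sketch of the xyz-mapping inside a kernel: each thread owns one (i, j, k) cell.
__global__ void xyz_mapping_example(int nx, int ny /*, field arrays ... */)
{
    int ci = blockIdx.x * blockDim.x + threadIdx.x;   // cell index within one xy plane
    int j  = ci / nx;                                 // y index of the cell
    int i  = ci - j * nx;                             // x index of the cell
    int k  = blockIdx.y;                              // z index: one block row per xy plane
    ci    += k * nx * ny;                             // linear index into the 3-D field arrays
    // ... update the field components of cell (i, j, k) through index ci ...
}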

Page 47:

Achieving Parallelism (xy-mapping with for loop)

Mapping of threads to cells of an FDTD domain using the xy-mapping: each thread is mapped to a cell in an xy-plane cut; the thread then processes all of the cells in the same column in a for loop.

block_dim_x = number_of_threads;
block_dim_y = 1;
n_blocks_y = 1;
n_blocks_x = (nx*ny)/number_of_threads + (((nx*ny)%number_of_threads == 0) ? 0 : 1);

Page 48:

Coalesced Global Memory Access

Global memory should be viewed in terms of aligned segments of 16 and 32 words

[Figure: threads of a half-warp accessing consecutive locations within one aligned memory segment – coalesced access]

Page 49:

Uncoalesced Global Memory Access

[Figure: threads accessing misaligned or scattered memory locations – uncoalesced access]

Page 50:

Coalesced Global Memory Access

Unfortunately, in FDTD updates the operations are dominated by memory accesses rather than arithmetic instructions.

Memory access inefficiency is the bottleneck for the efficiency of FDTD on the GPU.

Global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads can be coalesced.

The FDTD domain is extended by padded cells such that the number of cells in the x and y directions is an integer multiple of 16.
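A minimal sketch (my assumption of the implied computation, not the author's code) of this padding step: round the x and y cell counts up to the next multiple of 16 before allocating the field arrays; nxx and nyy echo the padded-dimension parameters of the Page 55 kernel:

/* Sketch (assumption): round a cell count up to the next multiple of 16. */
static int pad_to_16(int n) { return (n + 15) & ~15; }

/* e.g. for the 228 x 228 x 43 example on Page 57:
     nxx = pad_to_16(228);   // 240
     nyy = pad_to_16(228);   // 240
   and each field array is then allocated with nxx * nyy * nz elements. */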

Page 51:

Use of shared memory

Because it is on-chip, access to shared memory is much faster than to local and global memory.

Shared memory is especially useful when threads need to access unaligned data.

subroutine update_magnetic_fields
...
Hy = Chyh * Hy &
   + Chyez * (Ez(2:nx+1,:,:) - Ez(1:nx,:,:)) &
   + Chyex * (Ex(:,:,2:nz+1) - Ex(:,:,1:nz))
...
end subroutine update_magnetic_fields

Updating $H_y(i,j,k)$ requires $E_z(i+1,j,k)$ and $E_x(i,j,k+1)$: the $E_x(i,j,k+1)$ read is a coalesced access, while the $E_z(i+1,j,k)$ read is an uncoalesced access (use shared memory).
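A condensed illustration of the shared-memory staging this refers to (the complete kernels appear on Pages 54 and 55; this fragment is my own simplification):

// One row of Ez plus a 16-element halo is staged in shared memory so that
// Ez(i+1, j, k) is read on-chip rather than through an uncoalesced global load.
__global__ void hy_update_fragment(const float *Ez /*, other fields ... */)
{
    extern __shared__ float sEz[];                    // blockDim.x + 16 floats at launch
    int ci = blockIdx.x * blockDim.x + threadIdx.x;   // cell index (as in the xyz-mapping)
    int si = threadIdx.x;

    sEz[si] = Ez[ci];                                 // coalesced load of this block's row
    if (threadIdx.x < 16)                             // the first 16 threads fetch the halo
        sEz[blockDim.x + threadIdx.x] = Ez[ci + blockDim.x];
    __syncthreads();

    // The Hy update can now use sEz[si + 1] in place of a global read of Ez(i+1, j, k).
}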

Page 52:

Data reuse

Data transfers from and to the global memory should be avoided as much as possible.

If some data is already transferred from the global memory and it is available, it is better to use it as many times as possible.

Therefore, a kernel function can be constructed based on the xy-mapping in which each thread traverses in the z direction in a for loop by incrementing the k index of the cells.

subroutine update_magnetic_fields
...
Hy = Chyh * Hy &
   + Chyez * (Ez(2:nx+1,:,:) - Ez(1:nx,:,:)) &
   + Chyex * (Ex(:,:,2:nz+1) - Ex(:,:,1:nz))
...
end subroutine update_magnetic_fields

Updating $H_y(i,j,k)$ uses $E_x(i,j,k)$ and $E_x(i,j,k+1)$; $E_x(i,j,k+1)$ is already in memory (a register) and can be reused as $E_x(i,j,k)$ during the next iteration in $k$.

Page 53:

Optimization of number of threads

The CUDA Visual Profiler is used to determine the optimum number of threads.

For this test, an FDTD domain with a size of 8 million cells is used.

CPU time versus the number of threads per block.

Page 54:

Code using the xyz-mapping

__global__ void update_magnetic_fields_on_kernel(int nx, int ny, int nz,
                                                 float *Ex, float *Ey, float *Ez, ...)
{
    extern __shared__ float sEyz[];
    float *sEy = (float *) sEyz;
    float *sEz = (float *) &sEy[blockDim.x + 16];
    // ci: cell index
    // si: index in the shared memory array
    int ci = blockIdx.x * blockDim.x + threadIdx.x;
    int j = ci / nx;
    int i = ci - j * nx;
    int si = threadIdx.x;
    int sip1 = si + 1;
    int nxxyy = nx * ny;
    ci = ci + blockIdx.y * nxxyy;
    float ex = Ex[ci];   // ex caches Ex[ci] for reuse in the Hy and Hz updates
    sEz[si] = Ez[ci];
    sEy[si] = Ey[ci];
    if (threadIdx.x < 16) {
        sEz[blockDim.x + threadIdx.x] = Ez[ci + blockDim.x];
        sEy[blockDim.x + threadIdx.x] = Ey[ci + blockDim.x];
    }
    __syncthreads();
    Hx[ci] = Chxh[ci] * Hx[ci] + Chxey[ci] * (Ey[ci + nxxyy] - Ey[ci]) + Chxez[ci] * (Ez[ci + nx] - sEz[si]);
    Hy[ci] = Chyh[ci] * Hy[ci] + Chyez[ci] * (sEz[sip1] - sEz[si])     + Chyex[ci] * (Ex[ci + nxxyy] - ex);
    Hz[ci] = Chzh[ci] * Hz[ci] + Chzex[ci] * (Ex[ci + nx] - ex)        + Chzey[ci] * (sEy[sip1] - sEy[si]);
}

Page 55:

Code using the xy-mapping

__global__ void update_magnetic_fields_on_kernel(int nxx, int nyy, int nx, int ny, int nz, ...)
{
    ...
    ey = Ey[ci];
    ex = Ex[ci];
    for (int k = 0; k < nz; k++) {
        cizp = ci + nxxyy;
        exzp = Ex[cizp];
        eyzp = Ey[cizp];
        sEz[si] = Ez[ci];
        if (threadIdx.x < 16) {
            sEz[blockDim.x + threadIdx.x] = Ez[ci + blockDim.x];
        }
        __syncthreads();
        Hx[ci] = Chxh[ci] * Hx[ci] + Chxey[ci] * (eyzp - ey) + Chxez[ci] * (Ez[ci + nxx] - sEz[si]);
        Hy[ci] = Chyh[ci] * Hy[ci] + Chyez[ci] * (sEz[sip1] - sEz[si]) + Chyex[ci] * (exzp - ex);
        sEy[si] = ey;
        if (threadIdx.x < 16) {
            sEy[blockDim.x + threadIdx.x] = Ey[ci + blockDim.x];
        }
        __syncthreads();
        Hz[ci] = Chzh[ci] * Hz[ci] + Chzex[ci] * (Ex[ci + nxx] - ex) + Chzey[ci] * (sEy[sip1] - sEy[si]);
        ci = cizp;
        ey = eyzp;
        ex = exzp;
    }
}

Page 56:

FDTD CUDA Performance

Algorithm speed versus problem size.

$$\text{NMCPS} = \frac{\text{number of time steps} \times N_x \times N_y \times N_z}{\text{total simulation time} \times 10^{6}} \quad \text{(number of million cells per second)}$$

xy-mapping: about 450 million cells per second
xyz-mapping: about 400 million cells per second

(Tesla computing card, single precision, cubic problem domain, PEC boundaries)

Page 57:

An Example

Microstrip-Fed Circularly Polarized Square-Ring Patch Antenna [1].

A highly resonant structure. Problem size: 228 × 228 × 43 = 2,235,312 cells.

CPML boundaries

[1] Hua-Ming Chen, Yang-Kai Wang, Yi-Fang Lin, Che-Yen Lin, and Shan-Cheng Pan, “Microstrip-Fed Circularly Polarized Square-Ring Patch Antenna for GPS Applications,” IEEE Transactions on Antennas and Propagation, vol. 57, no. 4, April 2009.

Page 58:

An Example - Transients at 50,000 time steps

SP starts to deviate from DP at very large time steps

Page 59:

An Example – Simulation Time

$$\frac{t_{cpu}^{DP}}{t_{gpu}^{DP}} = 22.84, \qquad
\frac{t_{cpu}^{SP}}{t_{gpu}^{SP}} = 23.17, \qquad
\frac{t_{gpu}^{DP}}{t_{gpu}^{SP}} = 1.94, \qquad
\frac{t_{cpu}^{DP}}{t_{cpu}^{SP}} = 1.91$$

platform                   simulation time (min)*   NMCPS
single precision, CPU      105.66                   17.63
single precision, GPU      4.56                     408.50
double precision, CPU      201.67                   9.24
double precision, GPU      8.83                     210.96

* simulation time for 50,000 time steps
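As a quick consistency check (my own arithmetic), the single-precision GPU row agrees with the NMCPS metric defined on Page 56:

$$\text{NMCPS} = \frac{50{,}000 \times 2{,}235{,}312}{(4.56 \times 60\ \text{s}) \times 10^{6}} \approx 408.5$$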

Tesla floating-point peak performance: 933 GFLOP/s in single precision, 78 GFLOP/s in double precision.

Arithmetic operations are hidden by the more dominant memory operations.

CPU: Intel Xeon @ 2 GHz
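A rough back-of-the-envelope estimate (my own, based on the H_x update of Page 20 and the GTX 285 figures of Page 36, ignoring caching and shared-memory reuse) shows why: each cell update performs about 7 floating-point operations against roughly 9 four-byte memory accesses,

$$\frac{7\ \text{flops}}{9 \times 4\ \text{bytes}} \approx 0.2\ \text{flop/byte},
\qquad 0.2\ \text{flop/byte} \times 159\ \text{GB/s} \approx 31\ \text{GFLOP/s} \ll 933\ \text{GFLOP/s},$$

so memory bandwidth, not arithmetic throughput, bounds the achievable speed.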

Page 60:

Conclusions

FDTD is a very suitable algorithm to program on a GPU platform.

CUDA is an easy-to-learn and efficient architecture for programming compatible GPU cards.

Using CUDA, roughly 20 times faster computation is achieved on a Tesla GPU card compared with a conventional CPU.

Computational speed can be improved even further on multi-GPU platforms.

Page 61:

Thank You

Page 62:

Single Precision vs Double Precision in CUDA

GPUs with compute capability 1.3 or higher support double precision.

Tesla floating-point peak performance: 933 GFLOP/s in single precision, 78 GFLOP/s in double precision.

An Example

Microstrip-Fed Circularly Polarized Square-Ring Patch Antenna [1]. A highly resonant structure. Problem size: 228 × 228 × 43 = 2,235,312 cells.

CPML boundaries

[1] Hua-Ming Chen, Yang-Kai Wang, Yi-Fang Lin, Che-Yen Lin, and Shan-Cheng Pan, “Microstrip-Fed Circularly Polarized Square-Ring Patch Antenna for GPS Applications,” IEEE Transactions on Antennas and Propagation, vol. 57, no. 4, April 2009.


Page 63:

An Example - Transients at 50,000 time steps

SP starts to deviate from DP at very large time steps (time > 14 ns).


Page 64:

An Example - Transients at 400,000 time steps


Page 65:

Single Precision vs Double Precision in CUDA

Single precision is sufficient for most applications.

Double precision is not necessary unless a high level of accuracy is required.

The relative efficiency of double precision versus single precision on the GPU is comparable to that on the CPU.
