TAIPEI | SEP. 21-22, 2016
Tony W. H. Sheu, Neo Shih-Chao Kao, Maxim Solovchuk, Cheng-Tao Wu, Yu-Wei Chang
National Taiwan University
RECENT PROGRESS IN SCCS ON GPU SIMULATION OF BIOMEDICAL AND HYDRODYNAMIC PROBLEMS
Acknowledgement: SCCS (Scientific Computing and Cardiovascular Simulation) team working on GPU simulation (NVIDIA) (Aug. 8, 2016)
OBJECTIVE
Migrate the in-house CPU codes* to Nvidia CUDA to exploit GPU acceleration when simulating large-sized problems
* In-house codes:
1. 3D finite element code for the incompressible Navier-Stokes equations
2. 3D finite difference code for the incompressible Navier-Stokes equations
3. 3D finite difference code for Maxwell's equations
4. 3D finite difference code for the Westervelt equation for ultrasound wave propagation
CONTENT OF THE PRESENTATION: Undergraduate students
Cheng-Tao Wu (吳政道), CUDA programming on Frontal matrix solver for accelerating finite element calculation of incompressible Navier-Stokes solutions
Yu-Wei Chang (張育維), GPU acceleration of patient-specific airway image segmentation
CONTENT OF THE PRESENTATION: Research scientists
Neo Shih-Chao Kao (高仕超), OpenACC acceleration of the three-dimensional incompressible Navier-Stokes equations
Maxim Solovchuk, Acceleration of HIFU (High Intensity Focused Ultrasound) ablation of liver tumor on K80 (x4) GPUs
GTC-Taipei; Sep. 21, 2016
Cheng-Tao Wu (吳政道), Department of Engineering Science and Ocean Engineering, National Taiwan University
CUDA PROGRAMMING ON FRONTAL MATRIX SOLVER FOR ACCELERATING FINITE ELEMENT CALCULATION OF INCOMPRESSIBLE NAVIER-STOKES EQUATIONS
AGENDA
Motivation and Objective
CPU-GPU computing environment
One important CUDA API feature - CUDA stream
Computational results
Future work
MOTIVATION AND OBJECTIVE: Finite Element Method
The finite element method (FEM) is a global integral method that seeks the minimum-energy solution over the entire physical domain. It leads to a large matrix equation whose size equals the total number of unknowns.
A GPU is well suited to the computationally intensive tasks of an FEM calculation.
The finite element matrix equation is assembled from the local element matrix equations, all of which are derived from the same weak (integral) formulation.
1. With its many cores, a GPU can process the element-level computations in parallel.
2. A GPU can store a large number of individual element matrix equations in blocks of shared memory.
MOTIVATION AND OBJECTIVE: Finite Element Method
The data structure is key to successful parallelization:
1. Element numbering
2. Global nodal numbering
3. Local nodal numbering
MOTIVATION AND OBJECTIVE: Finite Element Method
In the current incompressible Navier-Stokes finite element formulation, one element contains 22 unknowns:
• 9 u and 9 v velocity components
• 4 p pressure components
Each element involves a 22x22 matrix equation.
Two elements sharing an edge involve a 36x36 matrix:
22*2 (elements) - 3*2 (shared u, v velocity nodes) - 2 (shared pressure nodes) = 36 unknowns
MOTIVATION AND OBJECTIVE: Finite Element Method
| Elements | 1 | 100 | 400 | 900 | 1600 | 2500 |
|---|---|---|---|---|---|---|
| Matrix size | 22x22 | 1003x1003 | 3803x3803 | 8403x8403 | 14803x14803 | 23003x23003 |
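The matrix sizes in the table can be reproduced with a short count of unknowns. The sketch below assumes a square mesh of n x n elements in which each element carries 9 velocity nodes and 4 pressure nodes; the mesh layout is not stated on the slide, and the function name `unknowns` is only illustrative.

```python
def unknowns(nx, ny):
    """Unknowns on an nx-by-ny mesh of 9-node velocity / 4-node pressure elements."""
    velocity_nodes = (2 * nx + 1) * (2 * ny + 1)  # corner, mid-side and centre nodes
    pressure_nodes = (nx + 1) * (ny + 1)          # corner nodes only
    # u and v live on every velocity node, p on every pressure node
    return 2 * velocity_nodes + pressure_nodes

sizes = {n * n: unknowns(n, n) for n in (1, 10, 20, 30, 40, 50)}
for elements, size in sizes.items():
    print(f"{elements} elements -> {size}x{size} matrix")
```

Under this assumption the count grows roughly as 9 unknowns per element, which is why the matrix dimension reaches tens of thousands for a few thousand elements.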
MOTIVATION AND OBJECTIVE: Solution method
There are two kinds of matrix solvers.
Iterative solver:
Pro: lower memory and computing cost
Con: no theory guarantees that a convergent solution can be computed
MOTIVATION AND OBJECTIVE: Solution method
Direct solver:
Based on Gaussian elimination
Pro: a solution can be computed for any matrix equation that is not ill-conditioned
Con: memory and computing costs are very high
For the sake of parallelization, the element-by-element frontal solver is chosen
MOTIVATION AND OBJECTIVE: Frontal solver
Interim conclusion: an efficient matrix solver is essential in finite element flow calculations
MOTIVATION AND OBJECTIVE: Evolution of computer chips
| Date | System | Performance | Price | Power |
|---|---|---|---|---|
| 11/2001 | #1 | 7.2 TFlop/s (DP) | $110 million | 3 MW |
| 5/2016 | GTX 1080 | 9 TFlop/s (SP) | $699 | 180 W |
Interim conclusion: for HPC tasks, a cost-effective GPU turns out to be a smart choice
MOTIVATION AND OBJECTIVE: Evolution of computer chips
| | June 2015 | June 2016 |
|---|---|---|
| Nvidia GPU share of accelerator systems | 54% | 67% |
THEREFORE, MIGRATING THE ORIGINAL CPU CODE TO NVIDIA CUDA CAN BRING A TREMENDOUS BENEFIT.
CPU-GPU COMPUTATIONAL SYSTEM
COMPUTING SYSTEM
| | CPU | GPU |
|---|---|---|
| Name | Intel Core i7 930 | Nvidia K20c |
| Architecture | Bloomfield | Kepler |
| Number of cores | 4 cores | 2496 CUDA cores, 13 SMs |
| Memory bandwidth | 25.6 GB/s | 208 GB/s |
| DP performance | ~100 GFlop/s | 1170 GFlop/s |
COMPUTING SYSTEM: Computing aspect
[Figure: thread hierarchy showing one thread, one block and one grid]
COMPUTING SYSTEM: Computing aspect
One thread: has private memory
One block: data can be synchronised by setting a barrier, after which the threads share memory
One grid: memory is only updated after execution finishes or when a data conflict is encountered
[Figure: thread, block and grid hierarchy]
COMPUTING SYSTEM: Communication concern
Think of the CPU as a manager and the GPU as the manager's employees.
To fully utilize the GPU, one should minimize the amount of communication between the CPU and the GPU.
COMPUTING SYSTEM: Nvidia Kepler architecture
The K20 has 13 streaming multiprocessors (SMXs) and a global scheduler, the GigaThread engine.
[Figure: Kepler SM]
ONE IMPORTANT CUDA API FEATURE: CUDA STREAM
CUDA STREAM
A CUDA stream is a work queue on the GPU. Operations in different streams may overlap.
The GPU scheduler manages the queued kernels automatically; programmers need not schedule them explicitly when executing a stream.
After the CPU places a request in a stream, it can continue its own work until the CUDA streams need to be synchronized.
CUDA STREAM
[Figure: execution timeline of Kernels 1 to 4 along the time axis, without CUDA stream versus with CUDA stream]
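CUDA STREAM
#include <cuda_runtime.h>
// cu_Func, blocks and threads are assumed to be defined elsewhere.
cudaStream_t stream[4];
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    cudaStreamCreate(&stream[i]);                  // one stream per OpenMP thread
    cu_Func<<<blocks, threads, 0, stream[i]>>>();  // asynchronous launch in stream i
    // CPU task (overlaps with the kernel)
    cudaStreamSynchronize(stream[i]);              // wait for the work queued in stream i
    cudaStreamDestroy(stream[i]);
}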
[Figure: CPU threads 1 to 4, each paired with its own stream (Stream 1 to Stream 4)]
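The launch, overlap and synchronize pattern above can be mimicked on the host with a thread pool. This is only an analogy, not CUDA: `fake_kernel` is a hypothetical stand-in that sleeps instead of running on a GPU, and waiting on a future plays the role of `cudaStreamSynchronize`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(stream_id, duration=0.05):
    """Hypothetical stand-in for a GPU kernel: sleeps, then returns its stream id."""
    time.sleep(duration)
    return stream_id

def run_with_streams(n_streams=4):
    """Queue one task per 'stream', do host work meanwhile, then synchronize."""
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        # asynchronous "launch": submit returns immediately
        futures = [pool.submit(fake_kernel, i) for i in range(n_streams)]
        # CPU task that overlaps with the queued work
        host_result = sum(range(1000))
        # analogue of cudaStreamSynchronize: block until each task is done
        results = [f.result() for f in futures]
    return host_result, results

host_result, results = run_with_streams()
print(host_result, results)
```

Because the four tasks wait concurrently, the wall time is close to one task's duration rather than four, which is the effect the stream timeline illustrates.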
COMPUTING RESULTS
COMPUTING RESULTS: Lid-driven cavity flow problem
[*] U. Ghia, K. N. Ghia, and C. T. Shin, High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method.
COMPUTING RESULTS: Improvement
[Figure: execution time versus number of elements (100, 400, 900) for C, CUDA and CUDA with stream; annotated speedups 3.6x and 3.9x]
COMPUTING RESULTS: Improvement
[Figure: CPU % (0 to 100) for each solver stage (prefrontal, assembly, forward elimination, backward substitution), C versus CUDA]
FUTURE WORK
In the future, a multi-frontal direct solver will replace the frontal solver in the finite element flow code, providing a better-parallelized algorithm and reducing the computing time.
Our aim in the near future, at NTU, is to solve the incompressible Navier-Stokes equations in a domain with a mesh of 2560*2560*2560 nodal points.
THANK YOU
GTC-Taipei ; Sep. 21, 2016
Neo Shih-Chao Kao (高仕超)
OPENACC ACCELERATION OF THE CALCULATION OF THREE-DIMENSIONAL INCOMPRESSIBLE NAVIER-STOKES EQUATIONS
Acknowledgement:
Department of Engineering Science and Ocean Engineering, National Taiwan University
Scientific Computing and Cardiovascular Simulation laboratory (SCCS), National Taiwan University
(NVIDIA)
AGENDA
1. Why is a GPU needed?
2. How is the GPU used?
3. What does the GPU do for me?
4. Concluding remarks
WHY IS A GPU NEEDED?
Computational Fluid Dynamics (CFD) for the incompressible flow equations: two major tasks
1. Discretization scheme
   Objective: to derive a finite difference model with minimized phase error in the convection terms
2. High performance computing
   Objective: to obtain a convergent solution faster for 3D problems (< 8 hours!)
http://homepage.ntu.edu.tw/~twhsheu/index.htm
• The non-dimensional three-dimensional incompressible Navier-Stokes equations
$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\nabla p + \frac{1}{Re}\nabla^2\mathbf{u} + \mathbf{f}$$
$$\nabla\cdot\mathbf{u} = 0$$
where $\mathbf{u} = (u, v, w)$ denotes the velocity vector, $p$ the pressure field, $Re$ the Reynolds number and $\mathbf{f}$ the force term.
• Finite difference method (FDM)
• The fractional-step algorithm of Kim and Moin* is adopted
• Features of the CPU code:
  - Compiler: PGI Workstation v13.10
  - Column-major ordering (Fortran)
*J. Kim, P. Moin, Application of a fractional-step method to incompressible Navier-Stokes equations, Journal of Computational Physics, Vol. 59, pp. 308-323, 1985.
WHY IS A GPU NEEDED?
• 3D benchmark flow problem (cavity flow)
• Solution resolution requirement
  - Fine grid distribution (h << 1)
• Computational setting
  - Uniform mesh sizes: h = 1/96, 1/128, 1/150
  - Reynolds numbers: Re = 400, 1000
[Figure: schematic of the problem]
INEFFECTIVE COMPUTING (CPU+OPENMP)
• The applicability of the proposed CPU code to predicting high-Re incompressible flow is confirmed
[Figure: comparison of velocity profiles u(x,0.5,z) and w(0.5,y,z) with H. Ding et al., Comput. Methods. Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.]
[Figure: streamlines at Re = 1000]
• Time-consuming tasks
• OpenMP (8 threads, Intel i7-4820K)
| Mesh length | Re = 400 | Re = 1000 |
|---|---|---|
| 1/96 | 15250.4 s | 24007.7 s |
| 1/128 | 37689.7 s | 116439.0 s |
| 1/150 | 196114.2 s | 400228.2 s (4.6 days) |
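To put the OpenMP run times into perspective, the short sketch below converts them to days and counts the cases that exceed eight hours. Treating the "< 8 hours" remark from the earlier slide as the target is an assumption, and the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
TARGET_HOURS = 8  # assumed target, taken from the "< 8 hours" remark

# OpenMP (8 threads) run times in seconds, keyed by (mesh length, Re)
openmp_seconds = {
    ("1/96", 400): 15250.4,
    ("1/128", 400): 37689.7,
    ("1/150", 400): 196114.2,
    ("1/96", 1000): 24007.7,
    ("1/128", 1000): 116439.0,
    ("1/150", 1000): 400228.2,
}

def to_days(seconds):
    """Convert a run time in seconds to days."""
    return seconds / SECONDS_PER_DAY

over_target = sorted(
    case for case, s in openmp_seconds.items()
    if s / SECONDS_PER_HOUR > TARGET_HOURS
)
longest_days = to_days(max(openmp_seconds.values()))
print(f"longest run: {longest_days:.1f} days")
print("cases above the target:", over_target)
```

Only the two coarsest-mesh runs stay under the assumed target, which motivates moving the computation to the GPU.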
THIS IS WHY GPU IS NEEDED!!
GPU (GRAPHICS PROCESSING UNIT)
PC game: GTA5 (GeForce GTX), http://www.geforce.com.tw/whats-new/articles/grand-theft-auto-v-nvidia-gameworks-and-technology
Movie: Deadpool (Quadro M6000), https://blogs.nvidia.com.tw/2016/02/deadpool-movie/
WHY IS A GPU NEEDED?
CPU architecture
• Multi-core structure
• Sophisticated control logic unit
• Large cache to reduce access latencies
GPU architecture
• Many-core structure
• Minimized control logic unit
• Large number of threads
• High peak performance / memory bandwidth
[Figure: CPU versus GPU layout; ALU: arithmetic logic unit. Acknowledgement: CUDA programming guide]
• GPU programming:
  - Before 2007: OpenGL
  - 2007: CUDA
  - 2011: OpenACC
WHY IS A GPU NEEDED?
Heterogeneous CPU/GPU computing platform
[Figure: programming and running tasks (Task 1, Task 2, ...) on a CPU (Intel i7-4820K) plus GPU platform; CPU code handles the non-computing-intensive tasks and GPU code handles the computing-intensive tasks]
OPENACC
• It was developed by Nvidia, PGI, Cray and CAPS
• Similar to the OpenMP programming model
• Directives are added to the serial source code to
  - manage loop parallelization
  - manage data copies between CPU and GPU
• The existing original source code (C/C++/Fortran) is reused
• Ideally, no modification of the original code is necessary
[Figure: OpenACC API]
EXAMPLE: C = A + B

CPU
program code_CPU
  ...
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  ...
end program

OpenACC
program code_GPU_Acc
  ...
  ! Data copy CPU --> GPU
  ...
  !$acc parallel loop          ! distribute the loop iterations over the GPU
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  !$acc end parallel loop
  ...
  ! Data copy GPU --> CPU
  ...
end program

CUDA Fortran
module cuda_lib
  use cudafor
contains
  attributes(global) subroutine add(C, A, B, N)
    integer :: i
    integer, value :: N
    real(kind=8) :: A(N), B(N), C(N)
    i = (blockidx%x-1)*blockdim%x + threadidx%x   ! 1-based global thread index
    if (i <= N) then                              ! guard includes the last element
      C(i) = A(i) + B(i)
    end if
    call syncthreads()
  end subroutine
end module

program code_CUDA_Fortran
  use cuda_lib
  ...
  call add<<<NB,NT>>>(C, A, B, N)
  ...
end program
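The index arithmetic of the CUDA Fortran kernel can be checked on the host. The sketch below emulates the 1-based block and thread indices in plain Python; `launch_add` is a hypothetical helper, not part of any CUDA API. With 1-based indices the guard has to admit i = N, otherwise the last element is never computed.

```python
def launch_add(a, b, n_blocks, threads_per_block):
    """Emulate the 1-based CUDA Fortran index mapping for C = A + B."""
    n = len(a)
    c = [None] * n
    for block_idx in range(1, n_blocks + 1):                # blockidx%x starts at 1
        for thread_idx in range(1, threads_per_block + 1):  # threadidx%x starts at 1
            i = (block_idx - 1) * threads_per_block + thread_idx
            if i <= n:                                      # surplus threads do nothing
                c[i - 1] = a[i - 1] + b[i - 1]              # Python lists are 0-based
    return c

n = 10
threads_per_block = 4
n_blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
a = list(range(n))
b = [10 * x for x in a]
c = launch_add(a, b, n_blocks, threads_per_block)
print(n_blocks, c)
```

Rounding the block count up guarantees every element is covered, and the guard keeps the extra threads of the last block from writing out of bounds.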
HOW IS THE GPU USED?
| CUDA model | OpenACC model |
|---|---|
| Grid | Parallel region |
| Block | Gang |
| Warp | Worker |
| Thread | Vector |
HOW IS THE GPU USED?
• AOS
• Four degrees of freedom (u, v, w, p) for each node
• N nodes
GPU memory (global), Array of Structs (AOS):
| U Node 1 | V Node 1 | W Node 1 | P Node 1 | U Node 2 | V Node 2 | W Node 2 | P Node 2 | ... | U Node N | V Node N | W Node N | P Node N |
Non-contiguous access
• Performance deteriorates owing to ineffective memory access
HOW IS THE GPU USED?
• The data must be reordered following the SOA format given below
GPU memory (global), Struct of Arrays (SOA):
| U Node 1 | U Node 2 | ... | U Node N | V Node 1 | V Node 2 | ... | V Node N | W Node 1 | W Node 2 | ... | W Node N | P Node 1 | P Node 2 | ... | P Node N |
Contiguous access within each of the U, V, W and P arrays
• The SOA data format is effective for SIMD hardware (GPU)
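The difference between the two layouts comes down to index arithmetic. The sketch below is a minimal illustration with 0-based node numbers and hypothetical helper names; it shows that reading u for consecutive nodes has stride 4 in AOS and stride 1 in SOA.

```python
FIELDS = ("u", "v", "w", "p")

def aos_index(node, field):
    """Flat index when the four unknowns of a node are stored together."""
    return node * len(FIELDS) + FIELDS.index(field)

def soa_index(node, field, n_nodes):
    """Flat index when each unknown is stored in its own contiguous array."""
    return FIELDS.index(field) * n_nodes + node

def aos_to_soa(flat, n_nodes):
    """Reorder a flat AOS buffer into SOA order."""
    return [flat[aos_index(node, field)] for field in FIELDS for node in range(n_nodes)]

n_nodes = 5
u_in_aos = [aos_index(node, "u") for node in range(n_nodes)]
u_in_soa = [soa_index(node, "u", n_nodes) for node in range(n_nodes)]
print("AOS positions of u:", u_in_aos)
print("SOA positions of u:", u_in_soa)
print("reordered buffer:", aos_to_soa(list(range(8)), 2))
```

Neighbouring GPU threads that each read u for their own node touch adjacent addresses only in the SOA layout, which is what allows the accesses to be combined.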
HARDWARE ARCHITECTURE
| | CPU | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| Architecture | Intel i7-4820K | Nvidia K20 | Nvidia K40 | Nvidia K80 |
| Cores | 8 | 2496 (SP), 832 (DP) | 2880 (SP), 960 (DP) | 4992 (SP), 1664 (DP) |
| Memory | 32 GB | 5 GB | 12 GB | 24 GB |
| Memory bandwidth | 59.7 GB/s | 208 GB/s | 288 GB/s | 480 GB/s |
| Peak performance (DP) | 59.2 GFlop/s | 1.17 TFlop/s | 1.43 TFlop/s | 1.87 TFlop/s |
| IEEE 754 SP/DP | Yes | Yes | Yes | Yes |
SP/DP: single/double precision
http://www.nvidia.com.tw/object/tesla_product_literature_tw.html
[Figure: portability across K20, K40 and K80]
NUMERICAL RESULTS (GPU)
[Figure: velocity profiles u(x,0.5,z) and w(0.5,y,z) compared with H. Ding et al., Comput. Methods. Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.]
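WHAT DOES THE GPU DO FOR ME?
Re = 400, elapsed time (s)
| Mesh length | OpenMP | OpenACC (K20) | OpenACC (K40) | OpenACC (K80) |
|---|---|---|---|---|
| 1/96 | 15250.4 | 2693.4 | 2252.1 | 1846.5 |
| 1/128 | 37689.7 | 9660.7 | 7937.3 | 6711.4 |
| 1/150 | 196114.5 | 29838.2 | 23713.8 | 21456.7 |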
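WHAT DOES THE GPU DO FOR ME?
Re = 1000, elapsed time (s)
| Mesh length | OpenMP | OpenACC (K20) | OpenACC (K40) | OpenACC (K80) |
|---|---|---|---|---|
| 1/96 | 24007.2 | 4280.1 | 3626.6 | 2974.5 |
| 1/128 | 116439.0 | 15145.0 | 12334.4 | 10426.5 |
| 1/150 | 400228.2 | 44865.0 | 33732.7 | 30505.2 |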
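The speedups implied by the two timing tables can be worked out directly. The sketch below copies the tabulated times and divides the OpenMP time by each GPU time; the dictionary layout and the function name `speedup` are illustrative choices.

```python
# Elapsed times in seconds, copied from the Re = 400 and Re = 1000 tables
times = {
    400: {
        "1/96": {"OpenMP": 15250.4, "K20": 2693.4, "K40": 2252.1, "K80": 1846.5},
        "1/128": {"OpenMP": 37689.7, "K20": 9660.7, "K40": 7937.3, "K80": 6711.4},
        "1/150": {"OpenMP": 196114.5, "K20": 29838.2, "K40": 23713.8, "K80": 21456.7},
    },
    1000: {
        "1/96": {"OpenMP": 24007.2, "K20": 4280.1, "K40": 3626.6, "K80": 2974.5},
        "1/128": {"OpenMP": 116439.0, "K20": 15145.0, "K40": 12334.4, "K80": 10426.5},
        "1/150": {"OpenMP": 400228.2, "K20": 44865.0, "K40": 33732.7, "K80": 30505.2},
    },
}

def speedup(reynolds, mesh, gpu):
    """OpenMP time divided by the OpenACC time on the given GPU."""
    row = times[reynolds][mesh]
    return row["OpenMP"] / row[gpu]

for reynolds in times:
    for mesh in times[reynolds]:
        gains = ", ".join(
            f"{gpu} {speedup(reynolds, mesh, gpu):.1f}x" for gpu in ("K20", "K40", "K80")
        )
        print(f"Re = {reynolds}, h = {mesh}: {gains}")
```

On these numbers the K80 gain ranges from about 5.6x to about 13.1x, with the largest gain on the finest mesh at Re = 1000.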
CONCLUDING REMARKS
1. We have successfully ported, tested and benchmarked a complete 3D finite difference code using OpenACC.
2. The code is portable across different GPU architectures.
3. With OpenACC, the original source code remains almost unchanged.
4. Computing time was greatly reduced by executing the computational tasks on the GPU (by a factor of about 5.6 to 13 on the K80 relative to 8-thread OpenMP).
Acknowledgement: Computer Center, National Taiwan University
THANK YOU
GTC-Taipei; Sep. 21, 2016
Yu-Wei Chang
Engineering Science and Ocean Engineering
National Taiwan University
GPU ACCELERATION OF PATIENT-SPECIFIC AIRWAY IMAGE SEGMENTATION
OUTLINE
Introduction
Motivation and objective
Hardware environment
Application example
Concluding remarks
INTRODUCTION: Importance of image segmentation
Source: http://www.vision.ee.ethz.ch/~rhayko/
INTRODUCTION: Desirable goals for practical application to patients
Applying machine learning to radiotherapy planning for head and neck cancer
The length of the procedure may be reduced to 1/4
Source: https://deepmind.com/health (30th August 2016)
INTRODUCTION: Building blocks of the 3D airway reconstruction
Acknowledgement: personal chest CT provided by senior labmate Shih-Chao Kao (高仕超)
INTRODUCTION: With an Intel® Xeon® Processor E5-2620, the segmentation block takes 85% of the time
| Step | Time (s) |
|---|---|
| Preprocessing | 145.568 |
| Segmentation | 1424.284 |
| Representation | 106.904 |
MOTIVATION AND OBJECTIVE: Amdahl's law implies that the percentage of the code that benefits from parallelization is important
$$S = \frac{1}{1 - p + \frac{p}{i}} = \frac{1}{1 - 0.85 + \frac{0.85}{i}} < 6.67$$
S: the overall speedup
p: the fraction of the execution time that benefits from parallelization
i: the speedup in latency of the part p
| p | Maximum speedup |
|---|---|
| 0.85 (current) | 6.67 |
| 0.95 | 20 |
| 0.99 | 100 |
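Amdahl's law is easy to evaluate numerically. The sketch below uses illustrative function names; the value i = 61.8 is the two-GPU segmentation speedup reported later in the concluding remarks.

```python
def amdahl_speedup(p, i):
    """Overall speedup when a fraction p of the run time is accelerated i times."""
    return 1.0 / ((1.0 - p) + p / i)

def max_speedup(p):
    """Limit of amdahl_speedup as i grows without bound."""
    return 1.0 / (1.0 - p)

for p in (0.85, 0.95, 0.99):
    print(f"p = {p}: maximum speedup {max_speedup(p):.2f}")
print(f"p = 0.85, i = 61.8: overall speedup {amdahl_speedup(0.85, 61.8):.1f}")
```

Even a very large speedup of the segmentation step is capped by the remaining 15% of serial work, so raising p matters more than raising i once i is large.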
MOTIVATION AND OBJECTIVE: In a cloud computing environment, time is money
| Type | CPU | GPU |
|---|---|---|
| Card | Intel Xeon | NVIDIA Kepler |
| Price (USD/core/hour) | 0.03 | 0.4 |
| Cost for 1000 patients (USD) | 14 | To be announced |
| Time for 1000 patients (hours) | 465 | To be announced |
HARDWARE ENVIRONMENT: The GPU has about 30 times better FLOPS performance
| Card | Nvidia Tesla K40c* | Nvidia Tesla K20c* | Intel Xeon E5-2630** |
|---|---|---|---|
| Cores | 2880 | 2496 | 6 |
| Peak single-precision floating-point performance | 4.29 TFlops | 3.52 TFlops | 0.134 TFlops |
*http://www.nvidia.com/object/tesla-workstations.html
**http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI
HARDWARE ENVIRONMENT: The GPU does a great job using only global memory
| Processor | Parallelization language | Time usage (s) | Speedup gain | Note |
|---|---|---|---|---|
| Intel Xeon | NA | 158.08 | | Sequential code |
| Intel Xeon | OpenMP | 14.335 | 1 | Double-checked locking |
| Nvidia Tesla K20c | CUDA | 3 | 4.8 | Global memory only |
YOU CANNOT MAKE BRICKS WITHOUT STRAW (to do a good job, one must first sharpen one's tools)
HARDWARE ENVIRONMENT: Block and grid structure affect the usage of shared memory
| Grid size, block size | 32, 32 | 128, 128 | 256, 256 | 512, 512 |
|---|---|---|---|---|
| Tesla K40c | 3714.5 | 335.5 | 388.1 | 474.9 |
Tasks per thread: 2/ 20 21 < 21
Tune the block and thread numbers to optimize performance. Let each thread do less work.
HARDWARE ENVIRONMENT: Performance benefits from the use of shared and texture memory
| Memory usage on Tesla K20c | Time usage (s) | Speedup gain |
|---|---|---|
| Global memory only | 3 | 1 |
| Shared and global memory | 0.471 | 6.36 |
| Texture and global memory | 0.321 | 9.34 |
Although it is faster, texture memory gives lower accuracy. Because accuracy is of great importance in computational science, shared memory is preferable.
Tesla K20c memory: 5120 MB
APPLICATION EXAMPLE
APPLICATION EXAMPLE: Acquisition and pre-processing
APPLICATION EXAMPLE: Mathematical morphology, lung filter, and segmentation
Source: https://en.wikipedia.org/wiki/Mathematical_morphology
[Figure panels: opening operation, lung mask, segmentation]
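The opening operation named above is an erosion followed by a dilation. The sketch below is a minimal pure-Python version on a small binary image; the 3x3 square structuring element and the toy image are assumptions made for illustration and are not taken from the authors' pipeline.

```python
# 3x3 square structuring element, given as (row, column) offsets.
# It is symmetric, so erosion and dilation can share the same offsets.
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def erode(img, offsets=SQUARE_3X3):
    """Keep a pixel only if every pixel under the structuring element is set."""
    h, w = len(img), len(img[0])
    return [[int(all(0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)] for r in range(h)]

def dilate(img, offsets=SQUARE_3X3):
    """Set a pixel if any pixel under the structuring element is set."""
    h, w = len(img), len(img[0])
    return [[int(any(0 <= r + dr < h and 0 <= c + dc < w and img[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)] for r in range(h)]

def opening(img, offsets=SQUARE_3X3):
    """Opening = erosion followed by dilation."""
    return dilate(erode(img, offsets), offsets)

image = [
    [1, 0, 0, 0, 0, 0],  # isolated noise pixel in the corner
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
opened = opening(image)
for row in opened:
    print(row)
```

The isolated pixel is smaller than the structuring element and disappears, while the 3x3 block survives unchanged. Each output pixel depends only on a small neighbourhood, which is why such filters map well onto one GPU thread per pixel.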
APPLICATION EXAMPLE: Full view of the lung from the top
Acknowledgement: personal chest CT provided by senior labmate Shih-Chao Kao (高仕超)
[Figure panels: left airway, right airway]
CONCLUDING REMARKS: Memory, thread and block settings are important
| Platform | Time usage (s) | Speedup gain |
|---|---|---|
| CPU | 14.335 | 1 |
| CPU + 1 GPU | 0.335 | 43 |
| CPU + 2 GPU | 0.232 | 61.8 |
Tune the block and thread numbers to optimize performance. Let each thread do less work.
Although it is faster, texture memory gives lower accuracy. Because accuracy is of great importance in computational science, shared memory is preferable.
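The platform timings also show how well the second GPU is used. The sketch below recomputes the speedup gains and derives a two-GPU scaling efficiency; that efficiency figure is not on the slide and is only a derived illustration.

```python
# Segmentation times in seconds, copied from the platform table above
cpu_time = 14.335
one_gpu_time = 0.335
two_gpu_time = 0.232

speedup_one_gpu = cpu_time / one_gpu_time
speedup_two_gpu = cpu_time / two_gpu_time

# Gain from adding the second GPU, and the fraction of ideal (2x) scaling achieved
gain_second_gpu = one_gpu_time / two_gpu_time
scaling_efficiency = gain_second_gpu / 2

print(f"1 GPU: {speedup_one_gpu:.0f}x, 2 GPUs: {speedup_two_gpu:.1f}x")
print(f"second GPU adds {gain_second_gpu:.2f}x (efficiency {scaling_efficiency:.0%})")
```

Doubling the number of GPUs gives about 1.44x rather than 2x, so part of the run time at this problem size does not scale with the GPU count.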
CONCLUDING REMARKS: Amdahl's law implies that the percentage of the code that benefits from parallelization is important
$$S = \frac{1}{1 - p + \frac{p}{i}} = \frac{1}{1 - 0.85 + \frac{0.85}{61.8}} = \mathbf{6.1}$$
| p | Maximum speedup |
|---|---|
| 0.85 (current) | 6.67 |
| 0.95 | 20 |
| 0.99 | 100 |
CONCLUDING REMARKS: In a cloud computing environment, time is money
| Type | CPU | 2 GPU + CPU |
|---|---|---|
| Card | Intel Xeon | NVIDIA Kepler + Intel Xeon |
| Price (USD/core/hour) | 0.03 | 0.4 / 0.03 |
| Cost for 1000 patients (USD) | 14 | 7 |
| Time for 1000 patients (hours) | 465 | 77 |
FUTURE WORK
Increase the resolution of CT from 64-slice to 256-slice
Use deep learning to classify tumor type from CT images and accelerate the whole process
Acknowledgement: Professors 許文翰 and 張恆華
Senior labmate Shih-Chao Kao (高仕超) for providing the chest CT (NVIDIA)
Computer Center, National Taiwan University