Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Problems

TAIPEI | SEP. 21-22, 2016

Tony W. H. Sheu, Neo Shih-Chao Kao, Maxim Solovchuk, Cheng-Tao Wu, Yu-Wei Chang

National Taiwan University

RECENT PROGRESS IN SCCS ON GPU SIMULATION OF BIOMEDICAL AND HYDRODYNAMIC PROBLEMS

Acknowledgement: SCCS (Scientific Computing and Cardiovascular Simulation) team working on GPU simulation, Nvidia (輝達) (Aug. 8, 2016)

OBJECTIVE

Migrate the in-house CPU codes* to Nvidia CUDA codes to exploit GPU acceleration when simulating large-sized problems

* In-house CPU codes:

1. 3D finite element code to simulate the incompressible Navier-Stokes equations

2. 3D finite difference code to simulate the incompressible Navier-Stokes equations

3. 3D finite difference code to simulate Maxwell's equations

4. 3D finite difference code to simulate the Westervelt equation for ultrasound wave propagation

CONTENT OF THE PRESENTATION

Undergraduate students

Cheng-Tao Wu (吳政道), CUDA programming on frontal matrix solver for accelerating finite element calculation of incompressible Navier-Stokes solutions

Yu-Wei Chang (張育維), GPU acceleration of patient-specific airway image segmentation

Research scientists

Neo Shih-Chao Kao (高仕超), OpenACC acceleration of the three-dimensional incompressible Navier-Stokes equations

Maxim Solovchuk, Acceleration of HIFU (High Intensity Focused Ultrasound) ablation of liver tumor on K80 (*4) GPUs

GTC-Taipei ; Sep. 21, 2016

Cheng-Tao Wu (吳政道), Department of Engineering Science and Ocean Engineering, National Taiwan University

CUDA PROGRAMMING ON FRONTAL MATRIX SOLVER FOR ACCELERATING FINITE ELEMENT CALCULATION OF INCOMPRESSIBLE NAVIER-STOKES EQUATIONS


AGENDA

Motivation and Objective

CPU-GPU computing environment

One important CUDA API feature - CUDA stream

Computational results

Future work

MOTIVATION AND OBJECTIVE

Finite Element Method

The finite element method (FEM) is a global integral method that minimizes energy over the entire physical domain. It leads to a large matrix equation whose size equals the total number of unknowns.

A GPU is an excellent choice for the computationally intensive tasks in FEM calculations.

The finite element matrix equation results from assembling all local element matrix equations, which share the same weak formulation and are derived from the same integral equations.

1. A GPU parallelizes well because it contains many core processors.

2. A GPU can store a very large number of individual element matrix equations in blocks of shared memory.

Data structure is key to successful parallelization

1. Element numbering

2. Global nodal numbering

3. Local nodal numbering
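To make the three numbering levels concrete, the sketch below builds a local-to-global node map for a small mesh of 9-node quadrilateral elements. The row-by-row numbering convention and the function names `global_node` and `element_nodes` are assumptions for illustration only; the in-house code may number nodes differently.

```python
# Hypothetical numbering: nodes are numbered row by row on a (2*nx+1)-wide grid,
# and local nodes 0..8 of each element are listed row by row as well.

def global_node(ix, iy, nx):
    """Global number of the node at grid position (ix, iy)."""
    return iy * (2 * nx + 1) + ix

def element_nodes(e, nx):
    """Global node numbers of element e, indexed by local node number 0..8."""
    ex, ey = e % nx, e // nx  # element column and row
    return [global_node(2 * ex + i, 2 * ey + j, nx)
            for j in range(3) for i in range(3)]

nx = 2  # two elements side by side in a single row
conn = [element_nodes(e, nx) for e in range(nx)]
shared = sorted(set(conn[0]) & set(conn[1]))

print(conn[0])  # [0, 1, 2, 5, 6, 7, 10, 11, 12]
print(conn[1])  # [2, 3, 4, 7, 8, 9, 12, 13, 14]
print(shared)   # [2, 7, 12]: the three nodes on the common edge
```

The connectivity list is what the assembly step reads: each local element matrix entry is added to the global row and column given by this map.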

Each element of the current incompressible Navier-Stokes finite element formulation contains 22 unknowns.

• 9 nodes with u, v velocity components (18 unknowns)

• 4 nodes with p pressure components (4 unknowns)

Each element involves a 22x22 matrix equation.

Two adjacent elements involve a 36x36 matrix.

22*2 (elements) - 3*2 (shared u, v velocity) - 2 (shared pressure) = 36 unknowns

Elements   Matrix size
1          22x22
100        1003x1003
400        3803x3803
900        8403x8403
1600       14803x14803
2500       23003x23003
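The matrix sizes in the table can be reproduced with a short count. The sketch assumes a square n-by-n mesh of elements with u, v stored on all 9 nodes and p on the 4 corner nodes; the function name `unknowns` is hypothetical.

```python
def unknowns(n):
    """Unknowns for an n-by-n mesh: u and v on (2n+1)^2 nodes, p on (n+1)^2 corner nodes."""
    velocity_nodes = (2 * n + 1) ** 2
    pressure_nodes = (n + 1) ** 2
    return 2 * velocity_nodes + pressure_nodes

for n in (1, 10, 20, 30, 40, 50):
    size = unknowns(n)
    print(f"{n * n:5d} elements -> {size}x{size}")
```

Under this assumption the count grows roughly as 9 times the number of elements, which is why the matrix quickly becomes too large to handle without an efficient solver.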

MOTIVATION AND OBJECTIVE

Solution method

There are two kinds of matrix solvers.

Iterative solver:

Pro: memory and computing are less intensive

Con: no theory guarantees that a convergent solution can be computed

MOTIVATION AND OBJECTIVE

Solution method

Direct solver:

Based on the Gaussian elimination method

Pro: a solution can be computed for any matrix equation that is not ill-conditioned

Con: memory and computing are very intensive

For the sake of parallelization, the element-by-element frontal solver is chosen
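As a reminder of what a direct solver does, here is a minimal dense Gaussian elimination with partial pivoting. It is only a serial sketch of the forward elimination and backward substitution stages, not the frontal solver used in the talk; the function name `solve` and the 3x3 test system are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (dense, serial)."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    # Forward elimination: reduce M to upper-triangular form.
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]

    # Backward substitution: solve from the last unknown upwards.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

x = solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
print(x)  # close to [2.0, 3.0, -1.0]
```

The triple loop in the elimination stage is the O(n^3) cost behind "memory and computing are very intensive".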

MOTIVATION AND OBJECTIVE

Frontal Solver

Interim conclusion: an efficient matrix solver is essential in finite element flow calculation

MOTIVATION AND OBJECTIVE

Evolution of computer chips

Date      System     Performance        Cost           Power
5/2016    GTX 1080   9 TFlop/s (SP)     $699           180 W
11/2001   #1         7.2 TFlop/s (DP)   $110 million   3 MW

Interim conclusion: for HPC tasks, a cost-effective GPU turns out to be a smart choice
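The cost-effectiveness claim can be put into numbers with the figures quoted on this slide. This is a rough sketch only: the GTX 1080 figure is single precision and the 11/2001 figure is double precision, so the ratios are indicative rather than a like-for-like comparison. The dictionary keys and helper names are hypothetical.

```python
# Figures as quoted on the slide (GFlop/s, US dollars, watts).
systems = {
    "GTX 1080 (5/2016)": {"gflops": 9000.0, "usd": 699.0, "watts": 180.0},
    "#1 system (11/2001)": {"gflops": 7200.0, "usd": 110e6, "watts": 3e6},
}

def gflops_per_dollar(s):
    return s["gflops"] / s["usd"]

def gflops_per_watt(s):
    return s["gflops"] / s["watts"]

for name, s in systems.items():
    print(f"{name}: {gflops_per_dollar(s):.3g} GFlop/s per USD, "
          f"{gflops_per_watt(s):.3g} GFlop/s per W")

gpu, old = systems["GTX 1080 (5/2016)"], systems["#1 system (11/2001)"]
print(f"per-dollar ratio: {gflops_per_dollar(gpu) / gflops_per_dollar(old):.3g}")
print(f"per-watt ratio:   {gflops_per_watt(gpu) / gflops_per_watt(old):.3g}")
```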

MOTIVATION AND OBJECTIVE

Evolution of computer chips

June 2015: Nvidia GPU accelerator systems share 54%

June 2016: Nvidia GPU accelerator systems share 67%

THEREFORE, MIGRATING THE ORIGINAL CPU CODE TO NVIDIA CUDA CODE CAN BRING A TREMENDOUS BENEFIT.

CPU-GPU COMPUTATIONAL SYSTEM

COMPUTING SYSTEM

                   CPU                 GPU
Name               Intel Core i7 930   Nvidia K20c
Architecture       Bloomfield          Kepler
Number of cores    4 cores             2496 CUDA cores, 13 SMs
Memory bandwidth   25.6 GB/s           208 GB/s
DP Flops/s         ~100 GFlops         1170 GFlops

COMPUTING SYSTEM

Computing Aspect

[Figure: CUDA execution hierarchy showing one thread, one block, and one grid]

COMPUTING SYSTEM

Computing Aspect

One thread: has private memory

One block: threads can be synchronised by setting a barrier and then share memory

One grid: memory is only updated after the execution finishes or a data conflict is encountered

COMPUTING SYSTEM

Communication concern

Think of the CPU as a manager and the GPU as his/her employees.

To fully utilize the GPU, one should reduce the amount of communication between the CPU and the GPU.


COMPUTING SYSTEM

Nvidia Kepler Architecture

The K20 has 13 streaming multiprocessors (SMXs) and a GigaThread scheduler

ONE IMPORTANT CUDA API FEATURE - CUDA STREAM

CUDA STREAM

A CUDA stream is a work queue on the GPU. Operations in different streams may overlap.

The GPU scheduler manages the kernels automatically; programmers need not specify the scheduling when executing a stream.

After the CPU places a request in a stream, it can keep working until the CUDA streams need to be synchronized.

CUDA STREAM

[Figure: timelines of Kernels 1 to 4 without CUDA stream and with CUDA stream; horizontal axis: time]

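CUDA STREAM

cudaStream_t stream[4];

// One OpenMP thread per stream: each CPU thread drives its own stream.
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    cudaStreamCreate(&stream[i]);
    // Asynchronous kernel launch in stream i; control returns to the CPU at once.
    cu_Func<<<blocks, threads, 0, stream[i]>>>();
    // CPU task (runs while the kernel executes)
    cudaStreamSynchronize(stream[i]);  // wait for the work queued in stream i
    cudaStreamDestroy(stream[i]);
}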

[Figure: CPU threads 1 to 4, each paired with its own stream (Stream 1 to Stream 4)]


COMPUTING RESULTS

COMPUTING RESULTS

Lid-driven cavity flow problem

[*] U. Ghia, K. N. Ghia, and C. T. Shin, High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method.

COMPUTING RESULTS

Improvement

[Figure: execution time versus number of elements (100, 400, 900) for C, CUDA, and CUDA with stream; annotated speedups: 3.6x and 3.9x]

COMPUTING RESULTS

Improvement

[Figure: CPU% per stage (Prefrontal, Assembly, Forward elimination, Backward substitution) for C and CUDA]

FUTURE WORK

In the future, a multi-frontal direct solver will replace the frontal solver in the finite element flow code, providing a better parallelized algorithm and reducing the computing time.

Our aim in the near future, on the NTU campus, is to solve the incompressible Navier-Stokes equations in a domain with a mesh of 2560*2560*2560 nodal points.

April 4-7, 2016 | Silicon Valley

THANK YOU

GTC-Taipei ; Sep. 21, 2016

Neo Shih-Chao Kao (高仕超)

OPENACC ACCELERATION OF THE CALCULATION OF THREE-DIMENSIONAL INCOMPRESSIBLE NAVIER-STOKES EQUATIONS

Acknowledgement:

Department of Engineering Science and Ocean Engineering, National Taiwan University

Scientific Computing and Cardiovascular Simulation laboratory (SCCS), National Taiwan University

Nvidia (輝達)

AGENDA

1. Why GPU is needed ?

2. How GPU is used ?

3. What GPU helps me ?

4. Concluding remarks

WHY GPU IS NEEDED ?

Computational Fluid Dynamics (CFD) (incompressible flow equations): two major tasks

Discretization scheme

Objective: to derive a finite difference model that minimizes the phase error in the convection terms

High performance computing

Objective: to obtain a convergent solution FASTER (3D problem), in < 8 hours!

http://homepage.ntu.edu.tw/~twhsheu/index.htm

• The non-dimensional three-dimensional incompressible Navier-Stokes equations

$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} = -\nabla p + \frac{1}{Re} \nabla^2 \mathbf{u} + \mathbf{f}$$

$$\nabla \cdot \mathbf{u} = 0$$

where u = {u, v, w} denotes the velocity vector, p the pressure field, Re the Reynolds number, and f the force term.

• Finite difference method (FDM)

• The fractional-step algorithm of Kim and Moin* is adopted

• Features of CPU code:

  - Compiler: PGI Workstation v13.10

  - Column-major ordering (Fortran)

*J. Kim, P. Moin, Application of a Fractional-Step method to incompressible Navier-Stokes equations, Journal of Computational Physics, Vol. 59, pp. 308-323, 1985.

WHY GPU IS NEEDED ?

• 3D benchmark flow problem (cavity flow)

[Figure: schematic of the problem]

• Solution resolution requirement

  - Fine grid distribution (h << 1)

• Computational setting

  - Uniform mesh sizes: h = 1/96, 1/128, 1/150

  - Reynolds numbers: Re = 400, 1000

INEFFECTIVE COMPUTING (CPU+OPENMP)

• The applicability of the proposed CPU code to predict high-Re incompressible flow is confirmed

[Figure: comparison of velocity profiles u(x,0.5,z) and w(0.5,y,z); streamlines at Re = 1000]

H. Ding et al., Comput. Methods Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.

• Time-consuming tasks

• OpenMP (8 threads) on Intel i7-4820K, computing time in seconds:

Mesh length   Re = 400   Re = 1000
1/96          15250.4    24007.7
1/128         37689.7    116439.0
1/150         196114.2   400228.2 (4.6 days)


THIS IS WHY GPU IS NEEDED!!


GPU (GRAPHICS PROCESSING UNIT)

PC game: GTA5 (GeForce GTX)

http://www.geforce.com.tw/whats-new/articles/grand-theft-auto-v-nvidia-gameworks-and-technology

Movie: Deadpool (Quadro M6000)

https://blogs.nvidia.com.tw/2016/02/deadpool-movie/

WHY GPU IS NEEDED ?

• GPU programming:

  - Before 2007: OpenGL

  - 2007: CUDA

  - 2011: OpenACC

CPU architecture

  - Multi-core structure

  - Sophisticated control logic unit

  - Large cache to reduce access latencies

GPU architecture

  - Many-core structure

  - Minimized control logic unit

  - Large number of threads

  - High peak performance / memory bandwidth

[Figure: CPU and GPU chip layouts; ALU: arithmetic logic unit]

Acknowledgement: CUDA programming guide

WHY GPU IS NEEDED ?

Heterogeneous CPU/GPU computing platform

[Figure: programming and running tasks on the heterogeneous platform. Non-computing-intensive tasks (CPU code) run on the CPU (Intel i7-4820K); computing-intensive tasks (GPU code) run on the GPU.]

OPENACC

• It was developed by Nvidia, PGI, Cray and CAPS

• Similar to the OpenMP programming model

• Directives are added to the serial source code to

  - Manage loop parallelization

  - Manage data copy between CPU and GPU

• The existing original source code (C/C++/Fortran) is reused

• Ideally, no modification of the original code is necessary

OpenACC API

EXAMPLE

C = A + B

CPU

Program code_CPU
  ...
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  ...
end program

OpenACC

Program code_GPU_Acc
  ...
  ! Data copy CPU --> GPU
  ...
  !$acc parallel
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  !$acc end parallel
  ...
  ! Data copy GPU --> CPU
  ...
end program

CUDA Fortran

Module cuda_lib
  use cudafor
contains
  attributes(global) subroutine add(C, A, B, N)
    integer :: i
    integer, value :: N
    real(kind=8) :: A(N), B(N), C(N)
    ! One thread per array entry (1-based global index)
    i = (blockidx%x - 1)*blockdim%x + threadidx%x
    if (i <= N) then
      C(i) = A(i) + B(i)
    end if
    call syncthreads()
  end subroutine
end module

Program code_CUDA_Fortran
  use cuda_lib
  ...
  call add<<<NB,NT>>>(C, A, B, N)
  ...
end program

HOW GPU IS USED ?

[Figure: CUDA model versus OpenACC model. CUDA model: a grid contains blocks, each block contains warps, each warp contains threads. OpenACC model: a parallel region contains gangs, each gang contains workers, each worker contains vectors.]

HOW GPU IS USED ?

• Four degrees of freedom (u, v, w, p) for each node

• N nodes

• AOS

GPU memory (global), Array of Structs (AOS), non-contiguous access:

U Node 1 | V Node 1 | W Node 1 | P Node 1 | U Node 2 | V Node 2 | W Node 2 | P Node 2 | …… | U Node N | V Node N | W Node N | P Node N

• The performance deteriorates owing to ineffective memory access

HOW GPU IS USED ?

• The data must be reordered following the SOA format given below

GPU memory (global), Struct of Arrays (SOA), contiguous access within each array:

U Node 1 | U Node 2 | …… | U Node N | V Node 1 | V Node 2 | …… | V Node N | W Node 1 | W Node 2 | …… | W Node N | P Node 1 | P Node 2 | …… | P Node N

• The SOA data format is effective for SIMD hardware (GPU)
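The difference between the two layouts comes down to index arithmetic. The sketch below uses 0-based node numbers and a flat array; the helper names `aos_index`, `soa_index` and `to_soa`, and the field order (u, v, w, p), are assumptions for illustration rather than the layout of the actual code.

```python
FIELDS = ("u", "v", "w", "p")  # four degrees of freedom per node

def aos_index(node, field):
    """Flat index in Array-of-Structs order: all fields of a node are adjacent."""
    return node * len(FIELDS) + FIELDS.index(field)

def soa_index(node, field, n_nodes):
    """Flat index in Struct-of-Arrays order: all nodes of a field are adjacent."""
    return FIELDS.index(field) * n_nodes + node

def to_soa(aos, n_nodes):
    """Reorder a flat AOS array into SOA order."""
    return [aos[aos_index(i, f)] for f in FIELDS for i in range(n_nodes)]

N = 5
aos_u = [aos_index(i, "u") for i in range(N)]
soa_u = [soa_index(i, "u", N) for i in range(N)]
print(aos_u)  # [0, 4, 8, 12, 16]: stride 4, neighbouring threads touch scattered addresses
print(soa_u)  # [0, 1, 2, 3, 4]: stride 1, neighbouring threads touch adjacent addresses
print(to_soa(list(range(8)), 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

When thread i processes node i, the stride-1 pattern of SOA lets adjacent threads read adjacent memory, which is the access pattern GPU global memory handles most efficiently.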

HARDWARE ARCHITECTURE

                   CPU                 GPU 1               GPU 2               GPU 3
Architecture       Intel i7 4820K      Nvidia K20          Nvidia K40          Nvidia K80
Cores              8                   2496 (SP)           2880 (SP)           4992 (SP)
                                       832 (DP)            960 (DP)            1664 (DP)
Memory             32 GB               5 GB                12 GB               24 GB
Memory bandwidth   59.7 GB/s           208 GB/s            288 GB/s            480 GB/s
Peak performance   59.2 GFlop/s (DP)   1.17 TFlop/s (DP)   1.43 TFlop/s (DP)   1.87 TFlop/s (DP)
IEEE 754 SP/DP     Yes                 Yes                 Yes                 Yes

SP/DP: single/double precision

http://www.nvidia.com.tw/object/tesla_product_literature_tw.html

Portability: K20, K40, K80

NUMERICAL RESULTS (GPU)

[Figure: velocity profiles u(x,0.5,z) and w(0.5,y,z)]

H. Ding et al., Comput. Methods Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.

WHAT GPU HELPS ME ?

Re = 400

Mesh length   OpenMP     OpenACC (K20)   OpenACC (K40)   OpenACC (K80)
1/96          15250.4    2693.4          2252.1          1846.5
1/128         37689.7    9660.7          7937.3          6711.4
1/150         196114.5   29838.2         23713.8         21456.7

WHAT GPU HELPS ME ?

Re = 1000

Mesh length   OpenMP     OpenACC (K20)   OpenACC (K40)   OpenACC (K80)
1/96          24007.2    4280.1          3626.6          2974.5
1/128         116439.0   15145.0         12334.4         10426.5
1/150         400228.2   44865.0         33732.7         30505.2
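The two tables above give raw timings only; the speedup of each GPU run over the 8-thread OpenMP run follows by division. A small sketch, with the numbers copied from the tables and treated as seconds (the unit used on the earlier OpenMP slide); the names `TIMES`, `GPUS` and `speedups` are hypothetical.

```python
# Column order: OpenMP (8 threads), OpenACC on K20, K40, K80.
TIMES = {
    400: {"1/96": (15250.4, 2693.4, 2252.1, 1846.5),
          "1/128": (37689.7, 9660.7, 7937.3, 6711.4),
          "1/150": (196114.5, 29838.2, 23713.8, 21456.7)},
    1000: {"1/96": (24007.2, 4280.1, 3626.6, 2974.5),
           "1/128": (116439.0, 15145.0, 12334.4, 10426.5),
           "1/150": (400228.2, 44865.0, 33732.7, 30505.2)},
}
GPUS = ("K20", "K40", "K80")

def speedups(row):
    """Speedup of each GPU run relative to the OpenMP run in the same row."""
    openmp, *gpu_times = row
    return {gpu: openmp / t for gpu, t in zip(GPUS, gpu_times)}

for reynolds, table in TIMES.items():
    for h, row in table.items():
        text = ", ".join(f"{g} {v:.1f}x" for g, v in speedups(row).items())
        print(f"Re = {reynolds}, h = {h}: {text}")
```

With these numbers the largest gain is for the finest mesh at Re = 1000 on the K80, roughly 13 times faster than the OpenMP run.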

CONCLUDING REMARKS

1. We have successfully ported, tested and benchmarked a complete 3D finite difference code using OpenACC.

2. The code is portable across different GPU architectures.

3. Using OpenACC, the original source code remains almost unchanged.

4. A large amount of computing time is saved when a computational task is executed on a GPU architecture.

Acknowledgement: Computer Center, National Taiwan University

THANK YOU

GTC-Taipei; Sep. 21, 2016

Yu-Wei Chang

Engineering Science and Ocean Engineering

National Taiwan University

GPU ACCELERATION OF PATIENT-SPECIFIC AIRWAY IMAGE SEGMENTATION


OUTLINE

Introduction

Motivation and objective

Hardware environment

Application example

Concluding remarks

INTRODUCTION

Importance of image segmentation

Source: http://www.vision.ee.ethz.ch/~rhayko/

INTRODUCTION

Desirable goals to achieve for practical application to patients

Applying machine learning to radiotherapy planning for head and neck cancer

The length of the procedure may be reduced to 1/4

Source: https://deepmind.com/health

30th August 2016

INTRODUCTION

Building blocks of the 3D airway reconstruction

Acknowledgement: personal chest CT provided by senior lab member Neo Shih-Chao Kao (高仕超)

INTRODUCTION

With the Intel® Xeon® Processor E5-2620, the segmentation block takes 85% of the time

Step             Time (s)
Preprocessing    145.568
Segmentation     1424.284
Representation   106.904

MOTIVATION AND OBJECTIVE

Amdahl's law implies that the percentage of the code that benefits from parallelization is important

$$S = \frac{1}{1 - p + \frac{p}{i}} = \frac{1}{1 - 0.85 + \frac{0.85}{i}} \le 6.67$$

S: the overall speedup

p: the fraction of the execution time that benefits from parallelization

i: the speedup in latency of the part p

p                Maximum speedup
0.85 (current)   6.67
0.95             20
0.99             100
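The entries of this table follow directly from the formula by letting i grow without bound, and the current value of p can be recovered from the step timings reported earlier in the talk. A minimal sketch; the function names `amdahl` and `max_speedup` are hypothetical.

```python
def amdahl(p, i):
    """Overall speedup when a fraction p of the run time is accelerated i times."""
    return 1.0 / ((1.0 - p) + p / i)

def max_speedup(p):
    """Limit of amdahl(p, i) as i goes to infinity."""
    return 1.0 / (1.0 - p)

# Step timings in seconds from the Xeon E5-2620 measurement.
steps = {"Preprocessing": 145.568, "Segmentation": 1424.284, "Representation": 106.904}
p_measured = steps["Segmentation"] / sum(steps.values())
print(f"measured p = {p_measured:.3f}")

for p in (0.85, 0.95, 0.99):
    print(f"p = {p}: maximum speedup = {max_speedup(p):.2f}")
```

The limit for p = 0.85 is 1/0.15, about 6.67, so even an infinitely fast segmentation kernel cannot push the whole pipeline beyond that.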

MOTIVATION AND OBJECTIVE

In a cloud computing environment, time is money

Type                            CPU          GPU
Card                            Intel Xeon   NVIDIA Kepler
Price (USD/core/hour)           0.03         0.4
Cost for 1000 patients (USD)    14           To be announced
Time for 1000 patients (hour)   465          To be announced
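The CPU column can be reproduced from the step timings shown earlier, assuming that one patient costs the sum of the three steps (preprocessing, segmentation, representation), that a single core is billed, and that the price is in USD per core-hour. These assumptions and the variable names are illustrative only.

```python
# Per-patient step timings in seconds (Xeon E5-2620 measurement).
seconds_per_patient = 145.568 + 1424.284 + 106.904
price_per_core_hour = 0.03  # assumed to be USD per core-hour
patients = 1000

hours = patients * seconds_per_patient / 3600.0
cost = hours * price_per_core_hour

print(f"time for {patients} patients: {hours:.1f} hours")
print(f"cost for {patients} patients: {cost:.2f} USD")
```

Under these assumptions the result is about 465.8 hours and just under 14 USD, in line with the 465 hours and 14 USD quoted in the table.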

HARDWARE ENVIRONMENT

GPU has a 30 times better FLOPs performance

Card                   Nvidia Tesla K40c*   Nvidia Tesla K20c*   Intel Xeon E5-2630**
Cores                  2880                 2496                 6
Peak single precision  4.29 Tflops          3.52 Tflops          0.134 Tflops
floating point
performance

*http://www.nvidia.com/object/tesla-workstations.html

**http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI

HARDWARE ENVIRONMENT

GPU does a great job using only global memory

Processor           Parallelization language   Time usage (sec)   Speedup gain   Note
Intel Xeon          NA                         158.08                            Sequential code
Intel Xeon          OpenMP                     14.335             1              Double-checked locking
Nvidia Tesla K20c   CUDA                       3                  4.8            Global memory only

YOU CANNOT MAKE BRICKS WITHOUT STRAW (工欲善其事,必先利其器: to do a good job, one must first sharpen one's tools)

HARDWARE ENVIRONMENT

Block and grid structure would affect the usage of shared memory

Grid size / block size   32x32    128x128   256x256   512x512
Tesla K40c               3714.5   335.5     388.1     474.9
Tasks per thread         2/ 20 21 < 21

Tune the number of blocks and threads to optimize the performance. Let each thread do less work.

HARDWARE ENVIRONMENT

The performance would benefit from the usage of shared and texture memory

Memory usage on Tesla K20c   Time usage (sec)   Speedup gain
Global memory only           3                  1
Shared and global memory     0.471              6.36
Texture and global memory    0.321              9.34

Although it is faster, texture memory renders a lower accuracy. Because accuracy is of great importance in computational science, shared memory is preferable.

5120 MB


APPLICATION EXAMPLE

APPLICATION EXAMPLE

Acquisition and pre-processing

APPLICATION EXAMPLE

Mathematical morphology, lung filter, and segmentation

[Figure panels: opening operation, lung mask, segmentation]

Source: https://en.wikipedia.org/wiki/Mathematical_morphology

APPLICATION EXAMPLE

Full view of the lung from the top

[Figure panels: left airway, right airway]

Acknowledgement: personal chest CT provided by senior lab member Neo Shih-Chao Kao (高仕超)

CONCLUDING REMARKS

Memory, thread and block settings are important

Platform      Time usage (sec)   Speedup gain
CPU           14.335             1
CPU + 1 GPU   0.335              43
CPU + 2 GPU   0.232              61.8

Tune the number of blocks and threads to optimize the performance. Let each thread do less work.

Although it is faster, texture memory renders a lower accuracy. Because accuracy is of great importance in computational science, shared memory is preferable.

CONCLUDING REMARKS

Amdahl's law implies that the percentage of the code that benefits from parallelization is important

$$S = \frac{1}{1 - p + \frac{p}{i}} = \frac{1}{1 - 0.85 + \frac{0.85}{61.8}} = \mathbf{6.1}$$

p                Maximum speedup
0.85 (current)   6.67
0.95             20
0.99             100
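The value 6.1 can be checked by plugging the measured two-GPU kernel speedup into Amdahl's law. The sketch assumes that i is the ratio of the OpenMP CPU time (14.335 s) to the CPU + 2 GPU time (0.232 s) and that p = 0.85 as stated; the variable names are hypothetical.

```python
p = 0.85               # fraction of run time spent in segmentation
i = 14.335 / 0.232     # measured speedup of the segmentation step on CPU + 2 GPUs

overall = 1.0 / ((1.0 - p) + p / i)
limit = 1.0 / (1.0 - p)

print(f"i = {i:.1f}")
print(f"overall speedup S = {overall:.1f}")
print(f"fraction of the Amdahl limit reached = {overall / limit:.0%}")
```

The remaining gap to the limit is small, so further gains have to come from parallelizing the other 15% of the pipeline rather than from a faster segmentation kernel.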

CONCLUDING REMARKS

In a cloud computing environment, time is money

Type                            CPU          2 GPU + CPU
Card                            Intel Xeon   NVIDIA Kepler + Intel Xeon
Price (USD/core/hour)           0.03         0.4 / 0.03
Cost for 1000 patients (USD)    14           7
Time for 1000 patients (hour)   465          77


FUTURE WORK

Increase the resolution of CT from 64-slice to 256-slice

Use deep learning to classify tumor type from CT images and accelerate the whole process


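Acknowledgement: Prof. Tony W. H. Sheu (許文翰) and Prof. 張恆華

Neo Shih-Chao Kao (高仕超), who provided the chest CT; Nvidia (輝達)

Computer Center, National Taiwan University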
