
Computer Science X - System Simulation Group Harald Köstler ([email protected])

Accelerating image registration on GPUs

Harald Köstler, Sunil Ramgopal Tatavarty

SIAM Conference on Imaging Science (IS10), 13.4.2010

Contents

Motivation: Image registration with FAIR

GPU Programming

Combining Matlab and CUDA

Future Work

MOTIVATION: Image registration in FAIR

Image Registration in FAIR

Variational model for image registration

Find a mapping or deformation field

$u : \mathbb{R}^d \to \mathbb{R}^d, \quad d \in \{2, 3\}$

such that the energy functional

$E(u) = D[u, J] + \alpha \cdot S[u]$

with data term D and regularizer S is minimized.

FAIR: Flexible Algorithms for Image Registration

Framework for image registration in Matlab

Different transformations, data terms, regularizers

Different (multi-level) numerical optimization techniques

Constrained image registration

FAIR, J. Modersitzki, SIAM, 2009

Image Registration cycle

Runtime distribution for rigid registration cycle

[Pie chart: runtime shares 4%, 15%, 59%, 22%; legend: Functional, Optimization, Interpolation, Matlab plots]

COMPUTER ARCHITECTURE: GPUs

Why GPUs?

Graphics Processing Units (GPUs) are a massively parallel computer architecture

GPUs offer high computational performance at low cost

GPGPU: General-Purpose computing on a GPU

Using graphics hardware for non-graphics computations

Programming tools: CUDA or OpenCL

Without explicitly parallel algorithms, the performance potential of current hardware architectures can no longer be exploited

CPU vs. GPU

CPUs are great for task parallelism: fast caches, branching adaptability

GPUs are great for data parallelism: multiple ALUs, fast onboard memory, high throughput on parallel tasks

Executes program on each fragment

Think of the GPU (device) as a massively-threaded co-processor

Memory Architecture

Constant Memory

Shared Memory

Texture Memory

Device Memory

Data-parallel Programming

What is CUDA?

Compute Unified Device Architecture

Software architecture for managing data-parallel programming

Write kernels (functions) that execute on the device and process multiple data elements in parallel

Massive threading

Local memory

Paradigm of the CUDA toolkit I

[Figure: elements 0 to 5 distributed over Thread 0, Thread 1, Thread 2]

OpenMP

Divide domain into huge chunks

Distribute equally to threads

Parallelize outermost loop to minimize overhead

Paradigm of the CUDA toolkit II

But! Data mapping differs from the en-bloc distribution of OpenMP

Alignment constraints must be met

No cache, so no cache lines need to be considered

CUDA

Divide domain into small pieces

[Figure: elements 0 to 5 split across Block 1 and Block 2, each with Thread 0, Thread 1, Thread 2]
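
The contrast between the two mappings can be made concrete with a small sketch. It is not taken from the talk: `chunk_owner`, `record_owner` and `cuda_owners` are hypothetical names, and the code assumes the figure assigns element i to thread (i mod 3) of block (i div 3), with 0-based block indices standing in for Block 1 and Block 2.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// OpenMP-style en-bloc mapping: each thread owns one contiguous chunk.
// Returns the thread that owns element i of n (hypothetical helper).
inline int chunk_owner(int i, int n, int num_threads)
{
    assert(n > 0 && num_threads > 0);
    int chunk = (n + num_threads - 1) / num_threads;  // chunk length, rounded up
    return i / chunk;
}

// CUDA-style mapping: one element per thread, so neighbouring threads of a
// block touch neighbouring elements. Each thread records its own indices.
__global__ void record_owner(int* block_of, int* thread_of, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        block_of[i]  = blockIdx.x;
        thread_of[i] = threadIdx.x;
    }
}

// Runs record_owner for n elements and copies the result back to the host.
void cuda_owners(int n, int block_size,
                 std::vector<int>& block_of, std::vector<int>& thread_of)
{
    int *d_block, *d_thread;
    size_t bytes = sizeof(int) * n;
    cudaMalloc((void**)&d_block, bytes);
    cudaMalloc((void**)&d_thread, bytes);

    int num_blocks = (n + block_size - 1) / block_size;
    record_owner<<<num_blocks, block_size>>>(d_block, d_thread, n);

    block_of.resize(n);
    thread_of.resize(n);
    cudaMemcpy(block_of.data(), d_block, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(thread_of.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block);
    cudaFree(d_thread);
}
```

For six elements and a block size of 3, elements 0, 1, 2 go to threads 0, 1, 2 of the first block, whereas the chunked mapping gives elements 0 and 1 to thread 0.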

Grid Size and Block Size

Grid size: The size and shape of the data that the program will be working on

Block size: The block size indicates the sub-area of the original grid that will be assigned to a multiprocessor
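
In the CUDA API the grid dimensions are given in blocks rather than in data elements, so the number of blocks follows from the data size and the block size. A minimal sketch, assuming a `width` x `height` image processed with one thread per pixel; `div_up`, `grid_for` and `scale` are illustrative names, not from the talk:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Integer division rounded up, so that partial blocks are counted too.
inline unsigned int div_up(unsigned int a, unsigned int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of blocks needed to cover a width x height image (hypothetical helper).
inline dim3 grid_for(unsigned int width, unsigned int height, dim3 block)
{
    return dim3(div_up(width, block.x), div_up(height, block.y));
}

// Example kernel: one thread per pixel. The guard skips the threads of
// partial blocks that fall outside the image.
__global__ void scale(float* img, unsigned int width, unsigned int height, float s)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        img[y * width + x] *= s;
    }
}
```

With a 16 x 16 block, a 2048 x 1024 image needs a grid of 128 x 64 blocks. A launch would then read `scale<<<grid_for(w, h, block), block>>>(d_img, w, h, 0.5f);` for a device pointer `d_img`.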

CUDA Example

GPU:

__global__ void sum(const float* a, const float* b, float* c, unsigned int length)
{
    // compute array index for current thread
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // add array entries (global memory); skip threads past the end of the array
    if (idx < length) c[idx] = a[idx] + b[idx];
}

CPU:

void vector_add(const float* in_a, const float* in_b, float* out, unsigned int length)
{
    // Declare data
    float *a_GPU, *b_GPU, *out_GPU;
    size_t bytes = sizeof(float) * length;

    // Allocate data on GPU
    cudaMalloc((void**)&a_GPU, bytes);
    cudaMalloc((void**)&b_GPU, bytes);
    cudaMalloc((void**)&out_GPU, bytes);

    // Copy data to GPU
    cudaMemcpy(a_GPU, in_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_GPU, in_b, bytes, cudaMemcpyHostToDevice);

    // Define block/grid sizes (rounded up so that every element is covered)
    dim3 dimblock(256, 1);
    dim3 dimgrid((length + dimblock.x - 1) / dimblock.x, 1);

    // Kernel execution
    sum<<<dimgrid, dimblock>>>(a_GPU, b_GPU, out_GPU, length);

    // Write back data to CPU
    cudaMemcpy(out, out_GPU, bytes, cudaMemcpyDeviceToHost);

    // Free data on GPU
    cudaFree(a_GPU);
    cudaFree(b_GPU);
    cudaFree(out_GPU);
}

OpenCL

OpenCL (Open Computing Language) is an open standard for heterogeneous parallel computing

Managed by the non-profit technology consortium Khronos Group (www.khronos.org/opencl)

Analogous to the open industry standard OpenGL for 3D graphics

OpenCL exploits task-based and data-based parallelism

Similar to CUDA, OpenCL includes a language (based on C99) for writing kernels that execute on devices such as GPUs, Cell, or multi-core processors

Performance of high level heterogeneous tools?

COMBINING MATLAB AND CUDA: nvmex

MATLAB MEX interface

CUDA MEX environment

MATLAB script nvmex.m compiles CUDA code to create MATLAB function files

1. Allocate memory on the GPU

2. Transfer the data from the host to the GPU

3. Perform computation on GPU (library, custom code)

4. Transfer results from the GPU to the host

Basic interpolation schemes: Host side

Basic interpolation schemes: GPU kernel

B-Spline Interpolation [Sigg, C. and Hadwiger, M.]

Combine 2 linear interpolations to obtain B-spline interpolation
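
The idea can be illustrated in 1D: the four weighted taps of a cubic B-spline are grouped into two pairs, and each pair is replaced by one linear interpolation at a shifted position. The sketch below is an assumption-laden stand-in, not the kernel from the talk: `lerp_fetch` emulates a linearly filtered texture fetch on a plain array with clamped borders, and all function names are hypothetical.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>

// Clamp an index to the valid range [0, n-1].
__host__ __device__ inline int clampi(int k, int n)
{
    return k < 0 ? 0 : (k >= n ? n - 1 : k);
}

// Linear interpolation at continuous position p (stand-in for a texture fetch).
__host__ __device__ inline float lerp_fetch(const float* f, int n, float p)
{
    float fl = floorf(p);
    int k = (int)fl;
    float t = p - fl;
    return (1.0f - t) * f[clampi(k, n)] + t * f[clampi(k + 1, n)];
}

// Cubic B-spline weights for the fractional offset a in [0, 1].
__host__ __device__ inline void bspline_weights(float a, float w[4])
{
    assert(a >= 0.0f && a <= 1.0f);
    float a2 = a * a, a3 = a2 * a;
    w[0] = (1.0f - 3.0f * a + 3.0f * a2 - a3) / 6.0f;
    w[1] = (4.0f - 6.0f * a2 + 3.0f * a3) / 6.0f;
    w[2] = (1.0f + 3.0f * a + 3.0f * a2 - 3.0f * a3) / 6.0f;
    w[3] = a3 / 6.0f;
}

// Reference: direct evaluation with four taps f[i-1] .. f[i+2].
__host__ __device__ inline float bspline_direct(const float* f, int n, float x)
{
    float fl = floorf(x);
    int i = (int)fl;
    float w[4];
    bspline_weights(x - fl, w);
    return w[0] * f[clampi(i - 1, n)] + w[1] * f[clampi(i, n)]
         + w[2] * f[clampi(i + 1, n)] + w[3] * f[clampi(i + 2, n)];
}

// Same value from two linear fetches:
// w0*f[i-1] + w1*f[i]   = (w0+w1) * lerp(i-1 + w1/(w0+w1))
// w2*f[i+1] + w3*f[i+2] = (w2+w3) * lerp(i+1 + w3/(w2+w3))
__host__ __device__ inline float bspline_two_fetches(const float* f, int n, float x)
{
    float fl = floorf(x);
    int i = (int)fl;
    float w[4];
    bspline_weights(x - fl, w);
    float g0 = w[0] + w[1];
    float g1 = w[2] + w[3];
    float p0 = (float)(i - 1) + w[1] / g0;
    float p1 = (float)(i + 1) + w[3] / g1;
    return g0 * lerp_fetch(f, n, p0) + g1 * lerp_fetch(f, n, p1);
}
```

Both functions agree up to floating-point rounding. On a GPU the two `lerp_fetch` calls would presumably map to hardware-filtered texture reads, which is where the saving comes from.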

Runtime Results Interpolation I

Runtime Results Interpolation II

[Plot: time (ms), axis ticks from 0.0001 to 1, versus problem sizes 2048, 8192, 32768, 131072; legend: SplineInter2D (bilinear texture), theoretical best-case runtime, theoretical worst-case runtime]

Discretization Error Accuracy

[Plot: |f - f(h)| and |f + h f' - f(h)| versus h]

Persistent Memory and Hybrid Memory

Goal: avoid allocation and deallocation of memory for each call to a GPU kernel through Matlab

Static variables

Routine to clear CUDA MEX persistent variables

Saves up to 17% runtime for bigger images (2048x1024)
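
A minimal sketch of how such a persistent buffer might look, assuming a single device buffer held in static variables. The names `persistent_buffer`, `clear_persistent_buffer` and `alloc_count` are hypothetical, and the MEX gateway itself is left out:

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Static variables survive between calls as long as the module stays loaded.
static float* d_buffer    = nullptr;  // persistent device buffer
static size_t d_capacity  = 0;        // its size in bytes
static int    alloc_count = 0;        // number of real allocations (illustration only)

// Returns a device buffer of at least `bytes` bytes.
// cudaMalloc is only called when the existing buffer is too small.
float* persistent_buffer(size_t bytes)
{
    assert(bytes > 0);
    if (bytes > d_capacity) {
        if (d_buffer) cudaFree(d_buffer);
        d_buffer = nullptr;
        d_capacity = 0;
        if (cudaMalloc((void**)&d_buffer, bytes) != cudaSuccess) return nullptr;
        d_capacity = bytes;
        ++alloc_count;
    }
    return d_buffer;
}

// Routine to clear the persistent buffer explicitly.
void clear_persistent_buffer()
{
    if (d_buffer) cudaFree(d_buffer);
    d_buffer = nullptr;
    d_capacity = 0;
}
```

Repeated calls with the same or a smaller image size reuse the buffer. In a MEX file, a routine like `clear_persistent_buffer` could be registered with `mexAtExit` so that the device memory is released when the MEX function is cleared.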

CUDA MEX Registration cycle

Runtime summary for rigid registration

Time includes data transfers and Matlab visualization (plots about 20%)

GPU pays off for large images (speedup factor about 2x, pure GPU kernels about 20x)

A lot more potential to move more parts to the GPU

xSize   ySize   Matlab   Matlab+GPU
128     64      15 s     14 s
256     128     45 s     33 s
512     256     202 s    92 s

MULTIGRID ON GPU: A Multigrid Solver for Diffusion-based PDEs

What is Multigrid?

May consume a lot of time during the optimization step to find an updated transformation

A very fast and efficient iterative solver for huge sparse linear (and nonlinear) systems A u = f

Geometric multigrid on structured grids is well suited for GPU implementation because of regular data access patterns

Multigrid exhibits a convergence rate that is independent of the number of unknowns N, i.e. a multigrid solver has complexity O(N)

Multigrid Idea

Multigrid methods are based on two principles:

1. Smoothing property: smooth error on fine grid

2. Coarse grid principle: approximate smooth error on coarser grids
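
A smoother can be sketched as follows. The deck does not show its smoother, so this is an assumption: a red-black Gauss-Seidel sweep (red-black splitting is named on a later slide) for the 5-point discretization of -Δu = f on a square n x n grid with fixed boundary values. `rb_half_sweep` and `rb_smooth` are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One half-sweep of red-black Gauss-Seidel on an n x n grid, h2 = h * h.
// Only interior points with (i + j) % 2 == color are updated. Points of one
// color depend only on points of the other color, so they can run in parallel.
__global__ void rb_half_sweep(float* u, const float* f, int n, float h2, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < 1 || j < 1 || i > n - 2 || j > n - 2) return;  // boundary stays fixed
    if (((i + j) & 1) != color) return;
    int c = j * n + i;
    u[c] = 0.25f * (u[c - 1] + u[c + 1] + u[c - n] + u[c + n] + h2 * f[c]);
}

// Applies `sweeps` red-black sweeps to the host arrays u and f (row-major, n * n).
void rb_smooth(std::vector<float>& u, const std::vector<float>& f,
               int n, float h, int sweeps)
{
    assert(n >= 3 && u.size() == (size_t)n * n && f.size() == u.size());
    size_t bytes = sizeof(float) * u.size();
    float *d_u, *d_f;
    cudaMalloc((void**)&d_u, bytes);
    cudaMalloc((void**)&d_f, bytes);
    cudaMemcpy(d_u, u.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_f, f.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    for (int s = 0; s < sweeps; ++s) {
        rb_half_sweep<<<grid, block>>>(d_u, d_f, n, h * h, 0);  // "red" points
        rb_half_sweep<<<grid, block>>>(d_u, d_f, n, h * h, 1);  // "black" points
    }

    cudaMemcpy(u.data(), d_u, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_u);
    cudaFree(d_f);
}
```

A few such sweeps remove the oscillatory part of the error quickly but reduce the smooth part only slowly, which is why the remaining error is handed to a coarser grid.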

Complex Diffusion on Nvidia GeForce GTX 295

[Plot: runtime of V(2,2) in ms (axis 0 to 80) versus image size: 1024x1024, 1024x2048, 2048x2048, 2048x4096, 4096x4096]

Runtime on 2 x C2 Penryn (8 cores) @ 2.8 GHz, 8 x 22.4 GFLOPs, is 1100 ms for 4096 x 4096 → speedup factor 14

Summary and Future Work

When to use GPGPU

Suitable algorithms?

Get the promised performance?

Get it at low effort and cost?

Future Work

Multi-GPU

OpenCL

Adapt GPU multigrid to image registration

Questions?

Acknowledgements

Nvidia Corporation: developer.nvidia.com/CUDA

Technical Brief – Architecture Overview

CUDA Programming Guide

Supercomputing 2008 Education Program: CUDA Introduction, Christian Trefftz / Greg Wolffe (Grand Valley State University)

AMD/ATI: http://developer.amd.com/gpu_assets/Stream_Computing_Overview.pdf

ATI Stream SDK User Guide

Possible Improvements

Runtime Results Interpolation II

Full Multigrid Cycle

Exact solution

Interpolation of error and correction of solution u

Restriction of residual r = f - Au

Smoothing

Interpolation of solution u

V-cycle
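
The labels above correspond to the standard coarse-grid correction, written out here for two grid levels. The notation is an assumption, not taken from the slides: subscripts h and H mark the fine and the coarse grid, R is the restriction operator and P the interpolation operator.

```latex
\begin{align}
  r_h &= f_h - A_h u_h          && \text{residual on the fine grid, after pre-smoothing} \\
  r_H &= R \, r_h               && \text{restriction of the residual} \\
  A_H e_H &= r_H                && \text{coarse-grid problem for the error} \\
  u_h &\leftarrow u_h + P \, e_H && \text{interpolation of the error and correction of } u_h
\end{align}
```

A V-cycle solves the coarse-grid problem recursively by the same scheme and solves exactly only on the coarsest grid. In the usual notation, V(2,2) denotes two smoothing steps before and two after each correction.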

Multigrid on GTX 295 with red-black splitting

[Plot: runtime of V(2,2) in ms (axis 0 to 18) versus image size: 1024x1024, 1024x2048, 2048x2048, 2048x4096, 4096x4096]

Memory Bandwidth

In percent of the maximum measured (rounded) streaming bandwidth (100 GB/s)

[Plot: percent of memory bandwidth (axis 0% to 100%) versus image size: 1024x1024, 1024x2048, 2048x2048, 2048x4096, 4096x4096]

Frames per second for Image Stitching

Image size   fps (stitching GTX 295)   fps (solver GTX 295)   fps (CPU)
1024x1024    83                        143                    17
1024x2048    71                        111                    9
2048x2048    63                        83                     4
2048x4096    45                        50                     2
4096x4096    25                        28                     1

CPU: Intel Core2 Quad [email protected] with OpenMP (4 cores)

Interpolation and Restriction
