
Accelerating MATLAB Image Processing Toolbox Functions on GPUs

Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor,

Huiyang Zhou


Motivation

• With high memory bandwidth and teraflops of computing capability, Graphics Processing Units (GPUs) have become quite attractive for accelerating general-purpose applications

• Developing high-performance GPU programs, however, requires deep understanding of both application algorithms and GPU hardware architecture

• A systematic way of dealing with a generic class of applications is missing


Our Contributions

• Compare performance-critical hardware features in different GPUs

• Develop high-quality open-source library code for representative functions in the MATLAB™ Image Processing Toolbox (IPT): https://sites.google.com/site/iptatiproject/ [15]

• Reveal insights on efficiently accelerating a wide range of image processing algorithms


Presentation Outline

• Motivation
• Our Contributions
• Implication of GPU hardware on GPGPU programming
• A GPGPU library for IPT functions
  – categorization and optimization strategies
• Case Studies
  – 2D convolution
  – dither
• Conclusions


Implication of GPU Hardware on GPGPU Programming: Performance-Critical Hardware Features

• Memory access bandwidth (see the micro-benchmark sketch after this table)
  – AMD/ATI HD5870 (RV870): vector-type (float2 or float4) data access; our experiments show that using float instead reduces achieved bandwidth by at least 10%
  – NVIDIA GTX280: scalar-type (float) or vector-type (float2) data access; using float4 instead reduces achieved bandwidth by at least 16%

• Register file
  – HD5870: 256KB per SIMD engine * 20 SIMD engines = 5MB in total; the high per-core register count (1K float4 registers, i.e., 16KB) implies more computational work in each core
  – GTX280: 64KB per SM * 30 SMs = 1.875MB in total; the relatively small per-core register count (2K float registers, i.e., 8KB) implies less computational work in each core

• Shared memory / local data share
  – HD5870: 32KB per SIMD engine * 20 SIMD engines = 640KB in total; the larger size allows larger tile sizes (one tile is the workload of one thread block or work group)
  – GTX280: 16KB per SM * 30 SMs = 480KB in total; the smaller size implies smaller tile sizes

• Ratio of peak computation throughput to peak memory bandwidth
  – HD5870: (2720 GFLOPS) / (154 GB/s) = 17.7 flop/B = 70.6 flop/word, so more computation must be performed for each loaded data item
  – GTX280: (624 GFLOPS) / (141 GB/s) = 4.4 flop/B = 17.7 flop/word, so relatively little computation is needed for each loaded data item

• Stream processor pipeline
  – HD5870: 5-way VLIW; instruction-level parallelism (ILP) is needed to keep the ALUs busy
  – GTX280: scalar pipeline; less ILP is needed
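The access-width effect can be seen with a simple copy micro-benchmark that is timed once per element type. A minimal CUDA sketch, illustrative only and not the authors' benchmark code:

```cuda
// Copy kernels for comparing scalar vs. vector-width global-memory access
// (an illustrative micro-benchmark sketch, not the authors' code).
__global__ void copy_float(const float* __restrict__ in,
                           float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // one 4-byte load/store per thread
}

__global__ void copy_float4(const float4* __restrict__ in,
                            float4* __restrict__ out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];         // one 16-byte load/store per thread
}
```

Timing each kernel over a large buffer (for example with cudaEvent timers) exposes how the preferred access width differs between the two architectures.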


Summary of the Library: MATLAB Image Processing Toolbox (IPT) Function Classification

• (A) Data independent
  – intlut: convert integer values using a lookup table
  – imadjust: adjust image intensity values
  – imlincomb: linear combination of images
• (B) Data sharing
  – edge: find edges in a grayscale image
  – imregionalmax: regional maxima of an image
  – ordfilt2: 2-D order-statistic filtering
  – conv2: 2-D convolution of an image
  – mean2: average of matrix elements
  – imdilate/imerode: dilate/erode a grayscale image
• (C) Algorithm dependent
  – bwdist: Euclidean distance transform of a binary image
  – radon: Radon transform
• (D) Data dependent
  – dither: represent grayscale images in binary format

MATLAB IPT Function Classification and Optimization Strategies

• (A) Data independent
  – Characteristics: straightforward one-to-one mapping from input to output pixels; abundant parallelism
  – Strategies: utilize bandwidth effectively by packing multiple pixels per access; where possible, fuse several such lightweight tasks to amortize the CPU-GPU data transfer overhead (see the sketch after this list)
• (B) Data sharing
  – Characteristics: still a one-to-one mapping, but adjacent output pixels are computed from overlapping sets of input pixels
  – Strategies: data reuse and computation reuse
• (C) Algorithm dependent
  – Characteristics: lack of explicit parallelism
  – Strategies: rethink the algorithms to expose their inherent parallelism
• (D) Data dependent
  – Characteristics: lack of explicit parallelism; sequential in nature, with data dependencies and fine-grained communication requirements
  – Strategies: give it a try anyway; the results may surprise you

Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)

Kernel speedups over single-threaded MATLAB on the CPU:

Function category         Function            GTX 280 (CUDA)   GTX 280 (OpenCL)   HD5870 (OpenCL)
(A) Data independent      intlut                    17.7              17.5               12.7
                          imadjust                  21.4              15.7               11.9
                          imlincomb                944.6             593.7             1385.4
(B) Data sharing          edge                    3385.9            1175.2             4955.1
                          imregionalmax           2117.8             798.4             3694.0
                          ordfilt2                1199.6             171.6             1727.1
                          conv2                    345.5             156.9              649.8
                          mean2                     50.5              25.2               34.7
                          imdilate/imerode         951.5             523.3             1579.8
(C) Algorithm dependent   bwdist                   134.8             126.2              104.3
                          radon                     84.3              67.4               61.2
(D) Data dependent        dither                    10.2               6.5                7.6
Geometric mean                                      206x              110x               218x

For the most compute-intensive functions (imlincomb, edge, imregionalmax, ordfilt2, conv2, imdilate), OpenCL on the HD 5870 delivers noticeably higher speedups than CUDA on the GTX 280.

2D Convolution Overview

• Slide the filter over each pixel of the source image, multiplying and accumulating the overlapped input elements to generate each output pixel (a minimal kernel sketch follows).

[Figure: a 3 x 3 filter with weights of 1s and 2s overlaid on a 3 x 3 block of input pixels with values 1-9; the multiply-accumulate of the overlapped elements yields the output pixel value 55.]
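Written directly, this is one thread per output pixel. A minimal, unoptimized CUDA sketch (border handling and the filter flip of a true convolution are simplified; this is not the library's tuned code):

```cuda
// Naive 2D convolution: one thread per output pixel, no data reuse.
// Assumes the input is padded to (h + K - 1) x (w + K - 1) so that no
// bounds checks are needed; a true conv2 also flips the filter, which
// is omitted here for clarity.
__global__ void conv2_naive(const float* __restrict__ in, int inStride,
                            const float* __restrict__ filt, int K,
                            float* __restrict__ out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int fy = 0; fy < K; ++fy)       // multiply and accumulate over
        for (int fx = 0; fx < K; ++fx)   // the K x K window of inputs
            acc += filt[fy * K + fx] * in[(y + fy) * inStride + (x + fx)];
    out[y * w + x] = acc;
}
```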

2D Convolution: Intra-Thread Data Reuse

• Each thread computes multiple output pixels along a column.
• Intra-thread reuse: for a 7 x 7 filter, each input pixel is reused up to 7 times within one thread.

[Figure: thread i sweeping down one column of the input image, with the filter windows of consecutive output pixels overlapping.]

2D Convolution: Inter-Thread Data Reuse

• Threads in the same warp/wavefront access the same input row.
• Inter-thread reuse: the row is fetched once into the texture cache/shared memory and reused by different threads on subsequent accesses (a combined sketch follows).

[Figure: threads 0-3 reading from a reused row of the input image held in the texture cache/shared memory.]
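Both reuse patterns can be combined: the block stages a tile of input rows in shared memory (inter-thread reuse) and each thread then produces several output pixels down its column from that tile (intra-thread reuse). A simplified CUDA sketch under the same padding assumption as above; tile sizes are illustrative, not the library's tuned values:

```cuda
#define TX 32   // threads per block: one output column each (illustrative)
#define PY 4    // output pixels computed per thread, down a column
#define K  7    // filter width (the 7 x 7 example from the slides)

__global__ void conv2_tiled(const float* __restrict__ in, int inStride,
                            const float* __restrict__ filt,
                            float* __restrict__ out, int w, int h) {
    // Shared tile of (PY + K - 1) input rows by (TX + K - 1) columns;
    // assumes the padded input covers the full tile of every block.
    __shared__ float tile[PY + K - 1][TX + K - 1];
    int x0 = blockIdx.x * TX, y0 = blockIdx.y * PY;
    // Cooperative load: the TX threads stride over the whole tile once,
    // so each input row is fetched a single time per block.
    for (int r = threadIdx.x; r < (PY + K - 1) * (TX + K - 1); r += TX) {
        int ty = r / (TX + K - 1), tx = r % (TX + K - 1);
        tile[ty][tx] = in[(y0 + ty) * inStride + (x0 + tx)];
    }
    __syncthreads();
    int x = x0 + threadIdx.x;
    if (x >= w) return;
    for (int p = 0; p < PY && y0 + p < h; ++p) {   // intra-thread reuse:
        float acc = 0.0f;                          // PY outputs per thread
        for (int fy = 0; fy < K; ++fy)
            for (int fx = 0; fx < K; ++fx)         // inter-thread reuse of
                acc += filt[fy * K + fx]           // the staged rows
                     * tile[p + fy][threadIdx.x + fx];
        out[(y0 + p) * w + x] = acc;
    }
}
```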

2D Convolution Performance

For a 4096 x 4096 image with a 7 x 7 filter:

• Jacket (Accelereyes®, Jacket 1.2.2 trial version, released 1/4/2010): around 20 GFLOPS on the GTX 280
• Ours: around 350 GFLOPS on the GTX 280 and around 733 GFLOPS on the HD 5870
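To put those numbers in perspective (counting one multiply-add as two flops and ignoring borders): a 7 x 7 filter costs 98 flops per output pixel, so the 4096 x 4096 image needs about 4096 x 4096 x 98 ≈ 1.64 GFLOP of work. At 350 GFLOPS that is roughly 4.7 ms of kernel time (about 2.2 ms at 733 GFLOPS), versus about 82 ms at Jacket's 20 GFLOPS.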

Data Dependent Case Study: Dither


Dither

[Figure: binarizing one input pixel. The input value 230 is compared against the threshold 128; since 230 is not below 128, the output pixel is 1, and the resulting error, 230 - 128 = 102, is diffused to the neighboring, not-yet-processed input pixels. A scalar reference sketch follows.]
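For reference, MATLAB's dither is based on Floyd-Steinberg error diffusion. A minimal scalar (host-side) sketch of the recurrence, using the textbook form in which the error is measured against the quantized output level (the slide's example measures it against the threshold instead); constants are the standard weights, not necessarily the library's:

```cuda
// Scalar Floyd-Steinberg error diffusion (host-side reference sketch).
// err is a float working copy of the grayscale image, values in [0, 255].
void dither_reference(float* err, unsigned char* out, int w, int h) {
    for (int i = 0; i < h; ++i) {
        for (int j = 0; j < w; ++j) {
            float old = err[i * w + j];
            int bit = old >= 128.0f;                // threshold at mid-gray
            out[i * w + j] = (unsigned char)bit;
            float e = old - (bit ? 255.0f : 0.0f);  // quantization error
            // Diffuse to the four unprocessed neighbors (7, 3, 5, 1)/16.
            if (j + 1 < w)              err[i * w + j + 1]       += e * 7 / 16;
            if (i + 1 < h && j > 0)     err[(i + 1) * w + j - 1] += e * 3 / 16;
            if (i + 1 < h)              err[(i + 1) * w + j]     += e * 5 / 16;
            if (i + 1 < h && j + 1 < w) err[(i + 1) * w + j + 1] += e * 1 / 16;
        }
    }
}
```

The += updates into later pixels are exactly the data dependency that serializes the naive loop, as the next slide illustrates.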

Dither – Data Dependency

[Figure: the output at pixel (i, j) depends on error diffused from its already-processed neighbors, which serializes a naive implementation.]

Dither – Parallel Processing Schedule

[Figure: parallel schedule from P. Metaxas [8]. Pixel (i, j) is processed at time step 2i + j + 1, so each row trails the row above it by two steps (row 0: 1 2 3 4 5 6 7 8 ..., row 1: 3 4 5 6 7 8 9 10 ..., row 2: 5 6 7 8 9 10 11 12 ..., and so on). All pixels sharing a step number are independent and can be processed in parallel. A sketch of this schedule follows.]
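A hedged CUDA sketch of that schedule, driven from the host with one launch per time step. The authors' implementation synchronizes among resident thread blocks instead; this simpler variant only conveys the pixel-to-step mapping:

```cuda
// One launch per wavefront step t: all pixels with 2*i + j + 1 == t are
// mutually independent (schedule of Metaxas [8]). Because two same-step
// pixels can diffuse error into the same neighbor, the elided updates
// should use atomicAdd.
__global__ void dither_step(float* err, unsigned char* out,
                            int w, int h, int t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = t - 1 - 2 * i;                          // column from schedule
    if (i >= h || j < 0 || j >= w) return;
    // ... threshold err[i * w + j] and diffuse the error exactly as in
    //     the scalar reference sketch above ...
}

void dither_wavefront(float* err, unsigned char* out, int w, int h) {
    int steps = 2 * (h - 1) + w;          // final step hits pixel (h-1, w-1)
    for (int t = 1; t <= steps; ++t) {
        int rows = (t + 1) / 2;           // rows that can be active at step t
        if (rows > h) rows = h;
        dither_step<<<(rows + 127) / 128, 128>>>(err, out, w, h, t);
    }
    cudaDeviceSynchronize();
}
```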

Dither – Our GPU Implementation

[Figure: thread blocks following the wavefront schedule across the image; the step numbers show that only a few blocks are active at once.]

• A relatively small number of thread blocks/threads is active at any given time:
  – low resource utilization
  – synchronization overhead (among thread blocks/threads)
• We still get up to a 10.3x kernel speedup and a 3.5x overall speedup!

Conclusions

• We identify performance-critical hardware features for GPGPU programs

• We present our experience and optimization strategies in developing high-performance GPU code for functions from the MATLAB Image Processing Toolbox

Our Open-source Library Project Website

https://sites.google.com/site/iptatiproject/ [15]

You are more than welcome to contribute!

Thank you. Questions?
