"Understanding the Role of Integrated GPUs in Vision Applications," a presentation from...
TRANSCRIPT
Copyright © 2015 ARM 1
Roberto Mijat, Visual Computing Marketing Manager
12 May 2015
Understanding the Role of Integrated
GPUs in Vision Applications
• World leading semiconductor IP licensor
• Founded in 1990
• >1K processor licenses (>350 partners)
• >12bn shipments in 2014
• >50bn shipments to date
• Business model
• Designing and licensing of IP
• Not manufacturing chips
• Products
• CPUs
• Suite of integrated media IP
• Interconnect
• Physical IP
Introduction to ARM and Mali
What is GPU Compute?
Cost-effective, efficient, and high-performance parallel computation
• 2D/3D Graphics
• Image processing
• Multimedia
• Computer Vision
• OS and applications run on the CPU, with the GPU managed as an accelerator or companion processor
• The GPU is now programmable through C-like high-level languages
Comprehensive Heterogeneous Compute
Generic CPU
• Serial workloads
• Task-parallel workloads
• <10 threads
• 1-8 cores
• Short pipeline (generally <20 stages)
• Low latency
• General purpose
• SIMD engine
Generic GPU
• Data-parallel workloads
• 100s-1000s of threads
• 1-100s of cores
• Long pipeline (generally >50 stages)
• Very high latency
• High throughput
• 2D/3D graphics
• Stream processing
• Reduced power, performance and area compared to desktop/HPC
• Designed for fan-less mobile devices
• Optimized for energy efficiency
• Integrated in System-On-Chip
• Sharing physical main memory with CPU (and other processors)
• Local caches
• I/O coherency available in newer platforms
• Primary use case is 3D graphics acceleration
• Gaming and user interfaces
• Modern designs support GPU Compute (aka GPGPU)
• GPU is a standard feature in mobile devices
Key Characteristics of Integrated Mobile GPU
• Option #1: Manually optimize your code
• Study your algorithm, determine how to partition it
• Optimize using low-level NEON™ (CPU) and OpenCL (GPU)
• Manually fine tune load balance between CPU and GPU
• Option #2: Utilize GPU Compute enabled middleware, for example:
• Gesture UI middleware from eyeSight
• OpenCL enabled functionality from ArcSoft, Fotonation
• Computer Vision mobile library from ArrayFire
• But the key question is not how, but WHEN: when should you be using the GPU
for computer vision?
• This presentation will try to answer this question through examples
Enabling Computer Vision on Integrated GPU
• Set of sub-sampled images
• Each level
• Apply smoothing filter
• Sub-sample in both directions
• Widely used in computer vision
• Feature extraction
• Stereo vision
• Object detection
Image Pyramid: The Algorithm
(Diagram: each pyramid level halves the image in both dimensions: x × y → x/2 × y/2 → x/4 × y/4)
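The two per-level steps above (smooth, then sub-sample) can be sketched in plain Python. This is an illustrative scalar version, not ARM's OpenCL implementation; a real port would vectorise and restructure it as described on the next slide, and a production pyramid would use a Gaussian rather than a box filter.

```python
def smooth(img):
    """3x3 box filter with clamped borders (stand-in for the Gaussian used in practice)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            out[y][x] = acc // 9
    return out

def subsample(img):
    """Keep every second pixel in both directions: x*y -> (x/2)*(y/2)."""
    return [row[::2] for row in img[::2]]

def build_pyramid(img, levels):
    """Repeatedly smooth then sub-sample, collecting each level."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(subsample(smooth(pyramid[-1])))
    return pyramid
```

With three levels, an x × y input yields x/2 × y/2 and x/4 × y/4 images, matching the diagram.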
• In principle well suited to GPU
• Embarrassingly parallel problem
• No data dependencies
• Generally the GPU improves performance
• Architecture-specific optimizations
• Algorithm structure changes
• GPU-specific optimization stages
• Added interleaving conversions to enable planar-level operations
• Consolidated GPU kernels to improve efficiency
• Used OpenCL data structures and vector maths
Image Pyramid: Optimizing for GPU
• Popular tuneable algorithm to extract edges from images
• 4 main stages
• Gaussian filter (reduce noise)
• Sobel filter (identify candidate edges)
• Remove pixels that are not a local maximum
• Hysteresis thresholding (to form high-quality edges)
Canny Edge Detection—The Algorithm
IMAGES SOURCE: Wikipedia
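Of the four stages, the Sobel filter maps most directly onto per-pixel parallelism. A minimal scalar sketch of just that stage (illustrative only, not ARM's kernel; it uses |Gx| + |Gy| as the usual cheap magnitude approximation and leaves border pixels at zero):

```python
# Sobel kernels for horizontal and vertical gradients
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel_magnitude(img):
    """Approximate gradient magnitude |Gx| + |Gy|; border pixels left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                     for dy in range(3) for dx in range(3))
            gy = sum(SOBEL_Y[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                     for dy in range(3) for dx in range(3))
            out[y][x] = abs(gx) + abs(gy)
    return out
```

Every output pixel depends only on a 3x3 input neighbourhood, which is exactly the shape of work that parallelises and vectorises well on a GPU; the hysteresis stage, by contrast, chases connected edges and is inherently serial.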
• Canny Edge Detection overall adapts well to GPU acceleration
• Convolution stages map well to parallelism and vectorisation
• Hysteresis is very serial in nature but constitutes a minor component
• Large performance uplift of the algorithm from the CPU-only reference
implementation through an elementary port using OpenCL
Canny Edge Detection: First GPU Port
Resolution    Speed-up (*)
720 HD        x7.48
1080p HD      x7.24
4K            x8.30
(*) only kernel execution measured
• Optimization stages on GPU (OpenCL)
• Utilize vector loads to reduce pressure on the load/store pipeline
• Loop-unroll to increase performance of arithmetically bound kernels
• Trade off between branching and redundant operations
• Use padding to avoid boundary checks
• Reduce datatype sizes
Canny Edge Detection: Optimized GPU Port
Resolution    Further improvement of GPU version (*)
720 HD        x4.97
1080p HD      x5.67
4K            x6.85
(*) only kernel execution measured
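One of the listed optimizations, padding, can be illustrated in plain Python: extending the image with a replicated border up front removes the per-pixel min/max clamping (branching) from the inner filter loop. The helper names are illustrative, not from ARM's code.

```python
def pad_edge(img, n=1):
    """Replicate-pad by n pixels so a (2n+1)-tap filter needs no bounds checks."""
    padded_rows = [[row[0]] * n + row + [row[-1]] * n for row in img]
    return [padded_rows[0]] * n + padded_rows + [padded_rows[-1]] * n

def box3_unchecked(img):
    """3x3 box filter over a pre-padded image: the inner loop is branch-free."""
    src = pad_edge(img, 1)
    h, w = len(img), len(img[0])
    return [[sum(src[y + dy][x + dx] for dy in range(3) for dx in range(3)) // 9
             for x in range(w)] for y in range(h)]
```

On a GPU the same idea avoids divergent branches at image borders, at the cost of a slightly larger buffer; the trade-off between branching and redundant work noted above is exactly this kind of decision.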
The Hidden Cost of Using the GPU (1)
(Diagram: total GPU time comprises driver and kernel setup, cache-coherency maintenance before and after the kernel, and driver clean-up, in addition to the kernel execution itself; the pyramid-on-GPU bar is compared against pyramid-on-CPU. Diagram is conceptual, not to scale.)
The Hidden Cost of Using the GPU (2)
• To benefit from GPU acceleration
• Computational workload must overshadow the overheads
• Run repeated passes (multiple-frames)
• Use multiple buffers to pipeline read-backs whilst GPU moves on
(Charts: Canny edge detection timings for a single frame vs. 200 frames.)
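The amortisation effect can be shown with a deliberately simplified cost model (the numbers below are made up for illustration, not measurements from the talk): a fixed per-dispatch overhead is paid once, while the kernel runs once per frame.

```python
def effective_speedup(cpu_ms, gpu_kernel_ms, overhead_ms, frames):
    """Speed-up over CPU-only when a fixed setup/coherency/teardown overhead
    is paid once and the GPU kernel runs once per frame (simplified model)."""
    return (cpu_ms * frames) / (overhead_ms + gpu_kernel_ms * frames)

# Hypothetical numbers: 8 ms/frame on CPU, 1 ms/frame on GPU, 20 ms overhead
single = effective_speedup(8, 1, 20, 1)    # below 1x: slower than the CPU
batch  = effective_speedup(8, 1, 20, 200)  # overhead amortised across frames
```

With one frame the overhead dominates and the GPU loses; over 200 frames the same workload approaches the raw kernel speed-up, which is why repeated passes and pipelined read-backs matter.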
Complex Imaging Pipeline Example: HoG
• We examined a complex computer vision pipeline
• Histogram of Gradients is often used in image recognition pipelines
• We investigated how the GPU can improve computation
• CPU version combined many of the stages
• On GPU each stage was kept separate for simplicity
(Pipeline diagram: Greyscale Image → Derivative (Dx and Dy) → Phase and Magnitude → Orientation binning → Magnitude block calculation → Normalise → Descriptor Extractor & Classifier)
Histogram of Gradients: GPU Implementation
• We applied the same common optimizations as for the image pyramid and Canny edge detection
• Arctangent function applied to each pixel in Phase and Magnitude computation
• Default CPU atan2() library function is slow
• Approximation version 2x faster
• GPU built-in function 6x faster
• Another built-in function (sqrt) is used by the normalise stage
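The kind of approximation referred to can be sketched in Python: a polynomial fit to atan on [0, 1] plus octant reduction. The coefficient and structure here are a common textbook choice, not ARM's actual implementation; accuracy is within roughly 0.005 rad of math.atan2, which is ample for binning gradient orientations.

```python
import math

def _approx_atan(r):
    """Polynomial approximation of atan(r) for r in [0, 1] (max error ~0.004 rad)."""
    return (math.pi / 4) * r + 0.273 * r * (1 - r)

def approx_atan2(y, x):
    """atan2 built from the [0, 1] approximation via octant reduction."""
    if x == 0 and y == 0:
        return 0.0
    ax, ay = abs(x), abs(y)
    if ax >= ay:                      # angle magnitude in [0, pi/4]
        a = _approx_atan(ay / ax)
    else:                             # angle magnitude in (pi/4, pi/2]
        a = math.pi / 2 - _approx_atan(ax / ay)
    if x < 0:
        a = math.pi - a
    return -a if y < 0 else a
```

Replacing a transcendental library call with a short polynomial like this is what buys the 2x CPU figure above; on the GPU the hardware built-in does the equivalent job even more cheaply.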
Histogram of Gradients: The Results
• Significant performance
improvement on GPU
• Improvement reduced with
smaller images
• When running on the CPU at
smaller resolutions, most of
the data will be in the cache
• On CPU we have fewer
threads, which means fewer
chances to hide latency
• Can we improve further?
(Chart: measured speed-ups of 8.2x, 6.2x and 3.0x, decreasing with image resolution.)
• More efficient processing is achieved by keeping the GPU busy
Reducing CPU and GPU Serialization
Screenshots of ARM DS-5 Streamline tool:
• Interleaved CPU/GPU activity: enqueue Frame 0; enqueue Frame 1; wait for Frame 0 to complete; enqueue Frame 2; wait for Frame 1 to complete; etc.
• Serialised CPU/GPU activity: enqueue Frame 0; wait for Frame 0 to complete; enqueue Frame 1; wait for Frame 1 to complete; etc.
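The interleaved pattern can be sketched generically (no OpenCL here: `enqueue` and `wait` are illustrative stand-ins for calls such as clEnqueueNDRangeKernel and clWaitForEvents): keep one frame in flight so the CPU enqueues frame N+1 before waiting on frame N.

```python
def process_serialised(frames, enqueue, wait):
    """Enqueue a frame, then immediately wait: CPU and GPU never overlap."""
    for f in frames:
        job = enqueue(f)
        wait(job)

def process_interleaved(frames, enqueue, wait):
    """Keep one frame in flight: enqueue frame N+1 before waiting on frame N."""
    in_flight = None
    for f in frames:
        job = enqueue(f)
        if in_flight is not None:
            wait(in_flight)
        in_flight = job
    if in_flight is not None:
        wait(in_flight)

# Trace the call order with stub enqueue/wait functions
log = []
process_interleaved([0, 1, 2],
                    lambda f: (log.append(f"enqueue {f}"), f)[1],
                    lambda j: log.append(f"wait {j}"))
```

The trace reproduces the interleaved order in the screenshots: enqueue Frame 0, enqueue Frame 1, wait for Frame 0, enqueue Frame 2, wait for Frame 1, and so on.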
• www.malideveloper.com
• Download guides, papers, tools, etc.
• http://community.arm.com/welcome
• Community forums, blogs and more
• Graphics and GPU Compute developer support
• http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
• A range of video and written tutorials for GPU Compute, OpenCL and RenderScript
• http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
• ARM® Mali™-T600 series GPU user-space binary drivers available for download
• Linaro BSP now available with Mali-T600 series GPU support
• And most importantly:
• The Mali ecosystem of partners
• The Embedded Vision Alliance
Resources
• The GPU is architecturally suitable for several computer vision
algorithms
• Workload characteristics & size determine optimal CPU/GPU
balance
• Computational load must overwhelm system overheads
• Kernel & system optimization extract optimal performance
• Stable, well-understood algorithms typically evolve into dedicated hardware
• If a software solution is needed by choice (cost) or necessity (time-to-market)
• GPU can increase performance and reduce power vs. CPU-only
• Add flexibility and reduce cost for chip, sensor and ISP vendors
• Improve performance of software on existing silicon
In Conclusion: The Role of GPU Compute