
Understanding the Role of Integrated GPUs in Vision Applications

Roberto Mijat, Visual Computing Marketing Manager
12 May 2015

Copyright © 2015 ARM

Introduction to ARM and Mali

• World-leading semiconductor IP licensor
• Founded in 1990
• >1,000 processor licenses (>350 partners)
• >12bn shipments in 2014; >50bn shipments to date
• Business model
  • Designing and licensing IP
  • Not manufacturing chips
• Products
  • CPUs
  • Suite of integrated media IP
  • Interconnect
  • Physical IP

What is GPU Compute?

Cost-effective, efficient, high-performance parallel computation:
• 2D/3D graphics
• Image processing
• Multimedia
• Computer vision

The GPU is now programmable through C-like high-level languages. It is managed by the OS and applications as an accelerator or companion processor alongside the CPU.

Comprehensive Heterogeneous Compute

Generic CPU:
• Serial and task-parallel workloads
• <10 threads
• 1-8 cores
• Short pipeline (generally <20 stages)
• Low latency
• General purpose
• SIMD engine

Generic GPU:
• Data-parallel workloads
• 100s-1,000s of threads
• 1-100s of cores
• Long pipeline (generally >50 stages)
• Very high latency, high throughput
• 2D/3D graphics
• Stream processing

Key Characteristics of Integrated Mobile GPUs

• Reduced power, performance and area compared to desktop/HPC GPUs
  • Designed for fan-less mobile devices
  • Optimized for energy efficiency
• Integrated in a system-on-chip
  • Shares physical main memory with the CPU (and other processors)
  • Local caches
  • I/O coherency available on newer platforms
• Primary use case is 3D graphics acceleration
  • Gaming and user interfaces
  • Modern designs also support GPU Compute (aka GPGPU)
• The GPU is a standard feature in mobile devices

Enabling Computer Vision on Integrated GPUs

• Option #1: manually optimize your code
  • Study your algorithm and determine how to partition it
  • Optimize using low-level NEON™ (CPU) and OpenCL (GPU)
  • Manually fine-tune the load balance between CPU and GPU
• Option #2: use GPU Compute-enabled middleware, for example:
  • Gesture UI middleware from eyeSight
  • OpenCL-enabled functionality from ArcSoft and Fotonation
  • The computer vision mobile library from ArrayFire
• But the key question is not how, but WHEN you should use the GPU for computer vision
  • This presentation tries to answer that question through examples

Image Pyramid: The Algorithm

• A set of sub-sampled images
• At each level:
  • Apply a smoothing filter
  • Sub-sample by 2 in both directions
• Widely used in computer vision
  • Feature extraction
  • Stereo vision
  • Object detection

(Diagram: pyramid of levels sized x × y, x/2 × y/2, x/4 × y/4)
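The smooth-then-subsample loop can be sketched in NumPy as a CPU reference (an illustrative sketch, not the OpenCL implementation discussed here; a 3x3 box blur stands in for the smoothing filter, and the factor-of-2 subsampling matches the diagram):

```python
import numpy as np

def smooth(img):
    """3x3 box blur via edge-padded neighbour averaging (stand-in for a Gaussian)."""
    p = np.pad(img, 1, mode="edge").astype(np.float32)
    acc = np.zeros_like(img, dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            acc += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return acc / 9.0

def build_pyramid(img, levels):
    """Each level: smooth, then subsample by 2 in both directions."""
    pyramid = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyramid.append(smooth(pyramid[-1])[::2, ::2])
    return pyramid

img = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
levels = build_pyramid(img, 4)
print([l.shape for l in levels])  # → [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Each level is embarrassingly parallel (every output pixel depends only on a small input neighbourhood), which is what makes the algorithm a natural GPU candidate.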

Image Pyramid: Optimizing for GPU

• In principle well suited to the GPU
  • Embarrassingly parallel problem
  • No data dependencies
• Generally the GPU improves performance
• Architecture-specific optimizations
  • Algorithm structure changes
  • GPU-specific optimization stages
  • Added interleaving conversions to enable planar-level operations
  • Consolidated GPU kernels to improve efficiency
  • Used OpenCL data structures and vector maths

Canny Edge Detection: The Algorithm

• A popular, tuneable algorithm for extracting edges from images
• Four main stages:
  1. Gaussian filter (reduce noise)
  2. Sobel filter (identify candidate edges)
  3. Non-maximum suppression (remove pixels that are not a local maximum)
  4. Hysteresis thresholding (form high-quality edges)

IMAGES SOURCE: Wikipedia
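The four stages can be sketched in NumPy (an illustrative CPU reference, not the OpenCL port; non-maximum suppression is simplified to a 4-neighbour local-maximum test and hysteresis runs a single pass, where the real algorithm follows the gradient direction and iterates to convergence):

```python
import numpy as np

def conv2(img, k):
    """Correlate an edge-padded image with a 3x3 kernel (output same size)."""
    p = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    h, w = img.shape
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out

def canny_sketch(img, lo=20.0, hi=60.0):
    # Stage 1: Gaussian filter (reduce noise)
    g = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], np.float32) / 16.0
    img = conv2(img.astype(np.float32), g)
    # Stage 2: Sobel filter (identify candidate edges)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], np.float32)
    mag = np.hypot(conv2(img, kx), conv2(img, kx.T))
    # Stage 3: simplified non-maximum suppression -- keep only pixels that
    # are a local maximum among their 4-neighbours
    p = np.pad(mag, 1)
    keep = ((mag >= p[1:-1, :-2]) & (mag >= p[1:-1, 2:]) &
            (mag >= p[:-2, 1:-1]) & (mag >= p[2:, 1:-1]))
    nms = np.where(keep, mag, 0.0)
    # Stage 4: hysteresis thresholding -- strong pixels seed edges; weak
    # pixels survive only if they touch a strong one (single pass here)
    strong, weak = nms >= hi, nms >= lo
    sp = np.pad(strong, 1)
    near = sp[1:-1, :-2] | sp[1:-1, 2:] | sp[:-2, 1:-1] | sp[2:, 1:-1]
    return strong | (weak & near)

# A vertical step edge is localised at the step
demo = np.zeros((16, 16), np.float32)
demo[:, 8:] = 255.0
edges = canny_sketch(demo)
print(np.flatnonzero(edges.any(axis=0)))  # → [7 8]
```

Stages 1-3 are per-pixel stencil operations, which is why they map so well to GPU parallelism; only stage 4 has the serial, data-dependent character noted on the next slide.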

Canny Edge Detection: First GPU Port

• Canny edge detection overall adapts well to GPU acceleration
  • The convolution stages map well to parallelism and vectorisation
  • Hysteresis is very serial in nature but is only a minor component
• Large performance uplift over the CPU-only reference implementation from an elementary OpenCL port:

  Resolution   Speed-up (*)
  720 HD       x7.48
  1080p HD     x7.24
  4K           x8.30

  (*) only kernel execution measured

Canny Edge Detection: Optimized GPU Port

• Optimization stages on the GPU (OpenCL):
  • Use vector loads to reduce pressure on the load/store pipeline
  • Loop-unroll to increase performance of arithmetically bound kernels
  • Trade off between branching and redundant operations
  • Use padding to avoid boundary checks
  • Reduce datatype sizes

  Resolution   Further improvement of GPU version (*)
  720 HD       x4.97
  1080p HD     x5.67
  4K           x6.85

  (*) only kernel execution measured
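The padding trick can be illustrated in NumPy (a host-side stand-in; in an OpenCL kernel the same idea removes the per-pixel boundary branches):

```python
import numpy as np

def blur_with_checks(img):
    """Naive 3x3 mean filter: every neighbour access is bounds-checked."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=np.float32)
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if 0 <= y + dy < h and 0 <= x + dx < w:  # branch per access
                        acc += img[y + dy, x + dx]
                        n += 1
            out[y, x] = acc / n
    return out

def blur_padded(img):
    """Pad once up front, then the accumulation loop is branch-free."""
    p = np.pad(img, 1, mode="edge").astype(np.float32)
    out = np.zeros_like(img, dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / 9.0

demo = np.arange(36, dtype=np.float32).reshape(6, 6)
a, b = blur_with_checks(demo), blur_padded(demo)
print(np.allclose(a[1:-1, 1:-1], b[1:-1, 1:-1]))  # → True
```

Border pixels differ by design (edge replication instead of averaging fewer samples); interior results are identical, and the hot loop no longer branches.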

The Hidden Cost of Using the GPU (1)

• Total GPU time includes more than kernel execution: driver and kernel setup, cache-coherency maintenance before and after the kernel, and driver clean-up all add to the GPU compute time itself

(Diagram: CPU vs. GPU timelines comparing the pyramid on CPU against the pyramid on GPU; conceptual, not to scale)

The Hidden Cost of Using the GPU (2)

• To benefit from GPU acceleration:
  • The computational workload must overshadow the overheads
  • Run repeated passes (multiple frames)
  • Use multiple buffers to pipeline read-backs while the GPU moves on

(Charts: Canny edge detection, single frame vs. 200 frames)
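The multiple-buffer idea can be sketched with a mock in-order queue (a toy Python model; `gpu_process` is a hypothetical stand-in for an enqueued kernel plus read-back):

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_process(frame):
    """Stand-in for an enqueued OpenCL kernel: returns the processed frame."""
    return [x * 2 for x in frame]

def run_pipelined(frames):
    """Double-buffer pattern: enqueue frame N, then read back frame N-1
    while the 'GPU' works on N, so read-back overlaps compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:  # one in-order queue
        pending = None
        for frame in frames:
            future = gpu.submit(gpu_process, frame)   # enqueue next frame
            if pending is not None:
                results.append(pending.result())      # read back previous
            pending = future
        if pending is not None:
            results.append(pending.result())          # drain the last frame
    return results

frames = [[i, i + 1] for i in range(4)]
print(run_pipelined(frames))  # → [[0, 2], [2, 4], [4, 6], [6, 8]]
```

Because the wait for frame N-1 happens after frame N is already queued, the worker never goes idle between frames; a real OpenCL host would achieve the same with two buffers and non-blocking enqueues.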

Complex Imaging Pipeline Example: HoG

• We examined a complex computer vision pipeline
  • Histogram of Gradients (HoG), often used in image recognition pipelines
• We investigated how the GPU can improve computation
  • The CPU version combined many of the stages
  • On the GPU, each stage was kept separate for simplicity

(Pipeline diagram: Greyscale Image → Derivative (Dx and Dy) → Phase and Magnitude → Orientation binning → Magnitude block calculation → Normalise → Descriptor Extractor & Classifier)
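The first stages of the pipeline can be sketched in NumPy (an illustrative reference, not the presented implementation; the cell size, bin count and unsigned-gradient convention are assumptions, and block normalisation and classification are omitted):

```python
import numpy as np

def hog_cells(img, bins=9, cell=8):
    """Early HoG stages: derivatives, phase/magnitude, per-cell
    orientation binning (no block normalisation or classifier here)."""
    img = img.astype(np.float32)
    # Derivative Dx and Dy (central differences, zero at the borders)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    dy[1:-1, :] = img[2:, :] - img[:-2, :]
    # Phase and magnitude per pixel (unsigned gradients, phase in [0, pi))
    mag = np.hypot(dx, dy)
    phase = np.mod(np.arctan2(dy, dx), np.pi)
    # Orientation binning: magnitude-weighted histogram per cell
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, bins), np.float32)
    bin_idx = np.minimum((phase / np.pi * bins).astype(int), bins - 1)
    for cy in range(h // cell):
        for cx in range(w // cell):
            sl = np.s_[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            np.add.at(hist[cy, cx], bin_idx[sl].ravel(), mag[sl].ravel())
    return hist

# A horizontal ramp has purely horizontal gradients: everything lands in bin 0
demo = np.tile(np.arange(16, dtype=np.float32), (16, 1))
hist = hog_cells(demo)
print(hist.shape, int(hist.argmax(-1).max()))  # → (2, 2, 9) 0
```

Each stage is a per-pixel (or per-cell) map with no cross-stage dependencies, which is what lets the GPU version keep the stages as separate kernels.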

Histogram of Gradients: GPU Implementation

• We applied common optimizations, as for the pyramid and Canny edge examples
• An arctangent is applied to each pixel in the phase-and-magnitude computation
  • The default CPU atan2() library function is slow
  • An approximation version is 2x faster
  • The GPU built-in function is 6x faster
• Another built-in function (sqrt) is used by the normalise stage
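A generic flavour of the approximation trade-off (not ARM's actual routine): a short polynomial for atan on [-1, 1], extended to full atan2 by octant reduction, stays within roughly 0.005 rad of the library result while using only multiplies and adds:

```python
import math

def atan_approx(z):
    """Polynomial approximation of atan on [-1, 1] (max error ~0.005 rad)."""
    return z * (math.pi / 4 + 0.273 * (1 - abs(z)))

def atan2_approx(y, x):
    """atan2 built from the [-1, 1] core by octant reduction."""
    if x == 0.0 and y == 0.0:
        return 0.0
    if abs(y) <= abs(x):
        a = atan_approx(y / x)
        if x < 0:                       # fold left half-plane back in
            a += math.pi if y >= 0 else -math.pi
    else:                               # steep octants: use atan(x/y)
        a = (math.pi / 2 if y > 0 else -math.pi / 2) - atan_approx(x / y)
    return a

print(round(atan2_approx(1.0, 1.0), 3))  # → 0.785 (pi/4)
```

For per-pixel phase binning, errors of this size rarely change which orientation bin a pixel lands in, which is why an approximate (or hardware built-in) atan2 is usually acceptable in this pipeline.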

Histogram of Gradients: The Results

• Significant performance improvement on the GPU
• The improvement shrinks with smaller images
  • When running on the CPU at smaller resolutions, most of the data fits in the cache
  • On the CPU we have fewer threads, which means fewer chances to hide latency
• Can we improve further?

(Chart: speed-ups of 8.2x, 6.2x and 3.0x across the tested resolutions)

HoG: Migrate Small Tasks Back to the CPU?

CPU and GPU Work Correlation

(Screenshots of the ARM DS-5 Streamline tool)

Reducing CPU and GPU Serialization

• More efficient processing is achieved by keeping the GPU busy

Serialised CPU/GPU activity:
  enqueue Frame 0; wait for Frame 0 to complete; enqueue Frame 1; wait for Frame 1 to complete; etc.

Interleaved CPU/GPU activity:
  enqueue Frame 0; enqueue Frame 1; wait for Frame 0 to complete; enqueue Frame 2; wait for Frame 1 to complete; etc.

(Screenshots of the ARM DS-5 Streamline tool)
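The two orderings can be written down as event traces (a toy model with illustrative naming, assuming at least one frame):

```python
def serialised(n):
    """Enqueue a frame, then immediately wait for it (n >= 1):
    the CPU and GPU take turns, and the GPU idles between frames."""
    events = []
    for f in range(n):
        events += [f"enqueue Frame {f}", f"wait Frame {f}"]
    return events

def interleaved(n):
    """Keep one frame in flight (n >= 1): enqueue Frame f+1 before
    waiting on Frame f, so the GPU always has queued work."""
    events = ["enqueue Frame 0"]
    for f in range(1, n):
        events += [f"enqueue Frame {f}", f"wait Frame {f - 1}"]
    events.append(f"wait Frame {n - 1}")
    return events

print(interleaved(3))
```

In the interleaved trace every wait has a frame already sitting in the queue behind it, which is exactly the gap-free GPU timeline seen in the Streamline capture.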

Resources

• www.malideveloper.com
  • Download guides, papers, tools, etc.
• http://community.arm.com/welcome
  • Community forums, blogs and more
• [email protected]
  • Graphics and GPU Compute developer support
• http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
  • A range of video and written tutorials for GPU Compute, OpenCL and RenderScript
• http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
  • ARM® Mali™-T600 series GPU user-space binary drivers available for download
  • A Linaro BSP is now available with Mali-T600 series GPU support
• And most importantly:
  • The Mali ecosystem of partners
  • The Embedded Vision Alliance

In Conclusion: The Role of GPU Compute

• The GPU is architecturally suitable for several computer vision algorithms
• Workload characteristics and size determine the optimal CPU/GPU balance
  • The computational load must overwhelm the system overheads
  • Kernel and system optimization extract optimal performance
• Stable, well-understood algorithms typically evolve into hardware
• If a software solution is needed by choice (cost) or necessity (time-to-market):
  • The GPU can increase performance and reduce power vs. CPU-only
  • It adds flexibility and reduces cost for chip, sensor and ISP vendors
  • It improves the performance of software on existing silicon