"Understanding the Role of Integrated GPUs in Vision Applications," a presentation from...
TRANSCRIPT
Copyright © 2015 ARM 1
Roberto Mijat, Visual Computing Marketing Manager
12 May 2015
Understanding the Role of Integrated
GPUs in Vision Applications
• World leading semiconductor IP licensor
• Founded in 1990
• >1K processor licenses (>350 partners)
• >12bn shipments in 2014
• >50bn shipments to date
• Business model
• Designing and licensing of IP
• Not manufacturing chips
• Products
• CPUs
• Suite of integrated media IP
• Interconnect
• Physical IP
Introduction to ARM and Mali
What is GPU Compute?
Cost-effective, efficient, and high-performance parallel computation
• 2D/3D Graphics
• Image processing
• Multimedia
• Computer Vision
• OS and applications run on the CPU, with the GPU managed as an accelerator or companion processor
• The GPU is now programmable through C-like high-level languages
Comprehensive Heterogeneous Compute
Generic CPU
• Serial workloads
• Task-parallel workloads
• <10 threads
• 1-8 cores
• Short pipeline (generally <20 stages)
• Low latency
• General purpose
• SIMD engine
Generic GPU
• Data-parallel workloads
• 100s-1000s of threads
• 1-100s of cores
• Long pipeline (generally >50 stages)
• Very high latency
• High throughput
• 2D/3D graphics
• Stream processing
• Reduced power, performance and area compared to desktop/HPC
• Designed for fan-less mobile devices
• Optimized for energy efficiency
• Integrated in System-On-Chip
• Sharing physical main memory with CPU (and other processors)
• Local caches
• I/O coherency available in newer platforms
• Primary use case is 3D graphics acceleration
• Gaming and user interfaces
• Modern designs support GPU Compute (aka GPGPU)
• GPU is a standard feature in mobile devices
Key Characteristics of Integrated Mobile GPU
• Option #1: Manually optimize your code
• Study your algorithm, determine how to partition it
• Optimize using low-level NEON™ (CPU) and OpenCL (GPU)
• Manually fine tune load balance between CPU and GPU
• Option #2: Utilize GPU Compute enabled middleware, for example:
• Gesture UI middleware from eyeSight
• OpenCL enabled functionality from ArcSoft, Fotonation
• Computer Vision mobile library from ArrayFire
• But the key question is not how, but WHEN: when should you be using the GPU
for computer vision?
• This presentation will try to answer this question through examples
Enabling Computer Vision on Integrated GPU
• Set of sub-sampled images
• Each level
• Apply smoothing filter
• Sub-sample in both directions
• Widely used in computer vision
• Feature extraction
• Stereo vision
• Object detection
Image Pyramid: The Algorithm
(Diagram: each pyramid level halves the image in both dimensions: x × y → x/2 × y/2 → x/4 × y/4)
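The two per-level steps above (smooth, then sub-sample) can be sketched in plain Python. This is an illustrative scalar version, not ARM's OpenCL implementation; a real port would vectorise and restructure it as described on the next slide, and a production pyramid would use a Gaussian rather than a box filter.

```python
def smooth(img):
    """3x3 box filter with clamped borders (stand-in for the Gaussian used in practice)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            out[y][x] = acc // 9
    return out

def subsample(img):
    """Keep every second pixel in both directions: x*y -> (x/2)*(y/2)."""
    return [row[::2] for row in img[::2]]

def build_pyramid(img, levels):
    """Repeatedly smooth then sub-sample, collecting each level."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(subsample(smooth(pyramid[-1])))
    return pyramid
```

With three levels, an x × y input yields x/2 × y/2 and x/4 × y/4 images, matching the diagram.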
• In principle well suited to GPU
• Embarrassingly parallel problem
• No data dependencies
• Generally the GPU improves performance
• Architecture-specific optimizations
• Algorithm structure changes
• GPU-specific optimization stages
• Added interleaving conversions to enable planar-level operations
• Consolidated GPU kernels to improve efficiency
• Used OpenCL data structures and vector maths
Image Pyramid: Optimizing for GPU
• Popular tuneable algorithm to extract edges from images
• 4 main stages
• Gaussian filter (reduce noise)
• Sobel filter (identify candidate edges)
• Remove pixels that are not a local maximum
• Hysteresis thresholding (to form high-quality edges)
Canny Edge Detection—The Algorithm
IMAGES SOURCE: Wikipedia
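Of the four stages, the Sobel filter maps most directly onto per-pixel parallelism. A minimal scalar sketch of just that stage (illustrative only, not ARM's kernel; it uses |Gx| + |Gy| as the usual cheap magnitude approximation and leaves border pixels at zero):

```python
# Sobel kernels for horizontal and vertical gradients
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel_magnitude(img):
    """Approximate gradient magnitude |Gx| + |Gy|; border pixels left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                     for dy in range(3) for dx in range(3))
            gy = sum(SOBEL_Y[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                     for dy in range(3) for dx in range(3))
            out[y][x] = abs(gx) + abs(gy)
    return out
```

Every output pixel depends only on a 3x3 input neighbourhood, which is exactly the shape of work that parallelises and vectorises well on a GPU; the hysteresis stage, by contrast, chases connected edges and is inherently serial.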
• Canny Edge Detection overall adapts well to GPU acceleration
• Convolution stages map well to parallelism and vectorisation
• Hysteresis is very serial in nature but constitutes a minor component
• Large performance uplift of the algorithm from the CPU-only reference
implementation through an elementary port using OpenCL
Canny Edge Detection: First GPU Port
Resolution    Speed-up (*)
720 HD        x7.48
1080p HD      x7.24
4K            x8.30
(*) only kernel execution measured
• Optimization stages on GPU (OpenCL)
• Utilize vector loads to reduce pressure on the load/store pipeline
• Loop-unroll to increase performance of arithmetically bound kernels
• Trade off between branching and redundant operations
• Use padding to avoid boundary checks
• Reduce datatype sizes
Canny Edge Detection: Optimized GPU Port
Resolution    Further improvement of GPU version (*)
720 HD        x4.97
1080p HD      x5.67
4K            x6.85
(*) only kernel execution measured
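One of the listed optimizations, padding, can be illustrated in plain Python: extending the image with a replicated border up front removes the per-pixel min/max clamping (branching) from the inner filter loop. The helper names are illustrative, not from ARM's code.

```python
def pad_edge(img, n=1):
    """Replicate-pad by n pixels so a (2n+1)-tap filter needs no bounds checks."""
    padded_rows = [[row[0]] * n + row + [row[-1]] * n for row in img]
    return [padded_rows[0]] * n + padded_rows + [padded_rows[-1]] * n

def box3_unchecked(img):
    """3x3 box filter over a pre-padded image: the inner loop is branch-free."""
    src = pad_edge(img, 1)
    h, w = len(img), len(img[0])
    return [[sum(src[y + dy][x + dx] for dy in range(3) for dx in range(3)) // 9
             for x in range(w)] for y in range(h)]
```

On a GPU the same idea avoids divergent branches at image borders, at the cost of a slightly larger buffer; the trade-off between branching and redundant work noted above is exactly this kind of decision.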
The Hidden Cost of Using the GPU (1)
(Diagram: total GPU time comprises driver and kernel setup, cache-coherency maintenance before and after the kernel, and driver clean-up, in addition to the kernel execution itself; the pyramid-on-GPU bar is compared against pyramid-on-CPU. Diagram is conceptual, not to scale.)
The Hidden Cost of Using the GPU (2)
• To benefit from GPU acceleration
• Computational workload must overshadow the overheads
• Run repeated passes (multiple-frames)
• Use multiple buffers to pipeline read-backs whilst GPU moves on
(Charts: Canny edge detection timings for a single frame vs. 200 frames.)
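The amortisation effect can be shown with a deliberately simplified cost model (the numbers below are made up for illustration, not measurements from the talk): a fixed per-dispatch overhead is paid once, while the kernel runs once per frame.

```python
def effective_speedup(cpu_ms, gpu_kernel_ms, overhead_ms, frames):
    """Speed-up over CPU-only when a fixed setup/coherency/teardown overhead
    is paid once and the GPU kernel runs once per frame (simplified model)."""
    return (cpu_ms * frames) / (overhead_ms + gpu_kernel_ms * frames)

# Hypothetical numbers: 8 ms/frame on CPU, 1 ms/frame on GPU, 20 ms overhead
single = effective_speedup(8, 1, 20, 1)    # below 1x: slower than the CPU
batch  = effective_speedup(8, 1, 20, 200)  # overhead amortised across frames
```

With one frame the overhead dominates and the GPU loses; over 200 frames the same workload approaches the raw kernel speed-up, which is why repeated passes and pipelined read-backs matter.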
Complex Imaging Pipeline Example: HoG
• We examined a complex computer vision pipeline
• Histogram of Gradients is often used in image recognition pipelines
• We investigated how the GPU can improve computation
• CPU version combined many of the stages
• On GPU each stage was kept separate for simplicity
(Pipeline diagram: Greyscale Image → Derivative (Dx and Dy) → Phase and Magnitude → Orientation binning → Magnitude block calculation → Normalise → Descriptor Extractor & Classifier)
Histogram of Gradients: GPU Implementation
• We applied the same common optimizations as for the image pyramid and Canny edge detection
• Arctangent function applied to each pixel in Phase and Magnitude computation
• Default CPU atan2() library function is slow
• Approximation version 2x faster
• GPU built-in function 6x faster
• Another built-in function (sqrt) is used by the normalise stage
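The kind of approximation referred to can be sketched in Python: a polynomial fit to atan on [0, 1] plus octant reduction. The coefficient and structure here are a common textbook choice, not ARM's actual implementation; accuracy is within roughly 0.005 rad of math.atan2, which is ample for binning gradient orientations.

```python
import math

def _approx_atan(r):
    """Polynomial approximation of atan(r) for r in [0, 1] (max error ~0.004 rad)."""
    return (math.pi / 4) * r + 0.273 * r * (1 - r)

def approx_atan2(y, x):
    """atan2 built from the [0, 1] approximation via octant reduction."""
    if x == 0 and y == 0:
        return 0.0
    ax, ay = abs(x), abs(y)
    if ax >= ay:                      # angle magnitude in [0, pi/4]
        a = _approx_atan(ay / ax)
    else:                             # angle magnitude in (pi/4, pi/2]
        a = math.pi / 2 - _approx_atan(ax / ay)
    if x < 0:
        a = math.pi - a
    return -a if y < 0 else a
```

Replacing a transcendental library call with a short polynomial like this is what buys the 2x CPU figure above; on the GPU the hardware built-in does the equivalent job even more cheaply.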
Histogram of Gradients: The Results
• Significant performance
improvement on GPU
• Improvement reduced with
smaller images
• When running on the CPU at
smaller resolutions, most of
the data will be in the cache
• On CPU we have fewer
threads, which means fewer
chances to hide latency
• Can we improve further?
(Chart: measured speed-ups of 8.2x, 6.2x and 3.0x, decreasing with image resolution.)
• More efficient processing is achieved by keeping the GPU busy
Reducing CPU and GPU Serialization
Screenshots of ARM DS-5 Streamline tool:
• Interleaved CPU/GPU activity: enqueue Frame 0; enqueue Frame 1; wait for Frame 0 to complete; enqueue Frame 2; wait for Frame 1 to complete; etc.
• Serialised CPU/GPU activity: enqueue Frame 0; wait for Frame 0 to complete; enqueue Frame 1; wait for Frame 1 to complete; etc.
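The interleaved pattern can be sketched generically (no OpenCL here: `enqueue` and `wait` are illustrative stand-ins for calls such as clEnqueueNDRangeKernel and clWaitForEvents): keep one frame in flight so the CPU enqueues frame N+1 before waiting on frame N.

```python
def process_serialised(frames, enqueue, wait):
    """Enqueue a frame, then immediately wait: CPU and GPU never overlap."""
    for f in frames:
        job = enqueue(f)
        wait(job)

def process_interleaved(frames, enqueue, wait):
    """Keep one frame in flight: enqueue frame N+1 before waiting on frame N."""
    in_flight = None
    for f in frames:
        job = enqueue(f)
        if in_flight is not None:
            wait(in_flight)
        in_flight = job
    if in_flight is not None:
        wait(in_flight)

# Trace the call order with stub enqueue/wait functions
log = []
process_interleaved([0, 1, 2],
                    lambda f: (log.append(f"enqueue {f}"), f)[1],
                    lambda j: log.append(f"wait {j}"))
```

The trace reproduces the interleaved order in the screenshots: enqueue Frame 0, enqueue Frame 1, wait for Frame 0, enqueue Frame 2, wait for Frame 1, and so on.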
• www.malideveloper.com
• Download guides, papers, tools, etc.
• http://community.arm.com/welcome
• Community forums, blogs and more
• Graphics and GPU Compute developer support
• http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
• A range of video and written tutorials for GPU Compute, OpenCL and RenderScript
• http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
• ARM® Mali™-T600 series GPU user-space binary drivers available for download
• Linaro BSP now available with Mali-T600 series GPU support
• And most importantly:
• The Mali ecosystem of partners
• The Embedded Vision Alliance
Resources
• The GPU is architecturally suitable for several computer vision
algorithms
• Workload characteristics & size determine optimal CPU/GPU
balance
• Computational load must overwhelm system overheads
• Kernel & system optimization extract optimal performance
• Stable, well-understood algorithms typically evolve into dedicated hardware
• If a software solution is needed by choice (cost) or necessity (time-to-market)
• GPU can increase performance and reduce power vs. CPU-only
• Add flexibility and reduce cost for chip, sensor and ISP vendors
• Improve performance of software on existing silicon
In Conclusion: The Role of GPU Compute