viewdle inc. - amddeveloper.amd.com/wordpress/media/2013/06/2911_2_final.pdf · detection is only...
Viewdle Inc.
Use cases
Face tagging in photos and videos, enabling:
sharing
media editing
automatic media mashups
entertainment
Augmented reality
Games
Why does OpenCL matter?
OpenCL is going to bring such use cases to
a new level of user experience:
Faster
Lower power
Experience like never before
Typical face recognition
application workflow
Extract faces
User tags clusters
Group faces to clusters
User uploads photos
Suggest tag
User selects photos
Face detection pipeline
Media input -> Face detection -> Feature detection -> Tracking* -> Normalization -> Output
* Tracking is a part of the video pipeline only
Distribution of working time. Large photos
190 photos with 220 faces
[Pie chart: slices of 17%, 75% and 8%. Legend: Input, Face detection, Feature detection, Normalization, Output]
Distribution of working time. Small photos
1440 photos with 2040 faces
[Pie chart: slices of 3%, 73% and 24%. Legend: Input, Face detection, Feature detection, Normalization, Output]
Distribution of working time. Video 720P
Video, duration ~ 2.5 min
[Pie chart: slices of 28%, 57%, 8%, 1%, 3% and 3%. Legend: Input, Face detection, Feature detection, Tracking, Normalization, Output]
We accelerate
Face detection
Video input (UVD)
Face detection problem statement
Evaluate every rectangle inside image P
for the presence of a face
The set of evaluated rectangles is
gathered over a number of downscaled
copies P_k of the image, with downscale
factor α^k
So that in every image P_k all MxN
rectangles {R_mnk} are evaluated
Scaling
The image is scaled
with step α = 1.1
Usually about 20-30
scales per image
Linear interpolation
Several scales are
produced per call
Image scales
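The scale pyramid described above can be sketched as follows. This is a minimal illustration, not the presenters' code: the function name `pyramid_sizes`, the integer truncation of scaled sizes, and the 20x20 default window are assumptions made for the example. It lists the size of each downscaled copy P_k (factor α^k) until the copy no longer fits one detection window.

```python
def pyramid_sizes(width, height, alpha=1.1, window=(20, 20)):
    """Return (k, w_k, h_k) for every downscaled copy P_k that still
    fits at least one detection window (hypothetical helper)."""
    win_w, win_h = window
    sizes = []
    k = 0
    while True:
        factor = alpha ** k          # downscale factor for copy P_k
        w_k = int(width / factor)    # truncation is an assumption
        h_k = int(height / factor)
        if w_k < win_w or h_k < win_h:
            break                    # window no longer fits: stop
        sizes.append((k, w_k, h_k))
        k += 1
    return sizes


if __name__ == "__main__":
    for k, w_k, h_k in pyramid_sizes(640, 480):
        print(k, w_k, h_k)
```

The exact number of scales depends on the image size and on the window size, so the loop above is only a way to see how the step α drives the count.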
One scale view
Rectangle to cell mapping
Rectangles to check in all scales
Cascaded detection
• Output is an answer: “Face or not a face?”
Rectangle -> Face? -Yes-> Face? -Yes-> Face? -Yes-> Face
A “No” at any stage -> Not a face
Imbalance due to cascades
Detection runs only in the blue cells
Challenges
Parallelization by scale does not work
well on the GPU because every scale
contains a different number of rectangles
Rebalancing
Face detection steps
Image integration -> Integral image scaling -> Detection cascades -> Results merge
(the “Integral image scaling -> Detection cascades” pair appears four times in the diagram)
Image integration
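The slides do not show how the integral image is built, so the following is only a sketch of the standard summed-area table, assuming that is what "image integration" refers to. The names `integral_image` and `rect_sum` are hypothetical. Once the table exists, the sum of any rectangle costs four lookups, which is what makes evaluating many rectangles cheap.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table from a list of pixel rows.
    ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h = len(img)
    w = len(img[0]) if h else 0
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


if __name__ == "__main__":
    table = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(rect_sum(table, 1, 1, 2, 2))
```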
Detection
• Based on a cascade scheme
• Input is a rectangle of constant size
• Every cascade tries to reject non-face rectangles
• Output is an answer: “Face or not a face?”
Rectangle -> Face? -Yes-> Face? -Yes-> Face? -Yes-> Face
A “No” at any stage -> Not a face
Evaluation of rectangle
We need to check whether the rectangle contains a
face
The size of the evaluated rectangle equals the size of
the training sample image (for example, [20; 20])
Rectangle is checked by different cascades
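A cascade check with early rejection might look like the sketch below. The stage representation (a scoring function plus a threshold) and the feature names are assumptions for illustration only; the slides do not describe the classifiers. The point is that a rectangle leaves the cascade at the first stage that rejects it, so different rectangles cost different amounts of work.

```python
def evaluate_rectangle(features, stages):
    """Run a rectangle through the cascade.

    stages: list of (score_fn, threshold) pairs (hypothetical format).
    Returns (is_face, stages_run)."""
    for index, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(features) < threshold:
            return False, index      # rejected: "Not a face"
    return True, len(stages)         # passed every stage: "Face"


if __name__ == "__main__":
    # Toy stages over made-up feature values.
    stages = [
        (lambda f: f["brightness"], 0.2),
        (lambda f: f["contrast"], 0.5),
        (lambda f: f["symmetry"], 0.7),
    ]
    print(evaluate_rectangle(
        {"brightness": 0.9, "contrast": 0.1, "symmetry": 0.9}, stages))
```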
What is not effective
Parallelizing inside a single scale or
directly by scale
Running all cascades to the end on
all rectangles in parallel
Parallelizing directly by scale
Scales are stored one after another in a linear chunk
of memory
Information about scales (width, height, stride,
offset) is stored in additional structures
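The storage layout above can be illustrated with a small sketch. The `ScaleInfo` record, the `pack_scales` and `pixel` helpers, and the row alignment parameter are hypothetical names chosen for this example; only the four fields (width, height, stride, offset) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ScaleInfo:
    width: int
    height: int
    stride: int   # elements per stored row (width plus padding)
    offset: int   # index of the scale's first element in the buffer


def pack_scales(scales, align=1):
    """Store 2-D scales back to back in one flat list.
    Returns (buffer, infos). `align` pads each row (an assumption)."""
    buffer, infos = [], []
    for img in scales:
        h, w = len(img), len(img[0])
        stride = -(-w // align) * align          # round width up to align
        infos.append(ScaleInfo(w, h, stride, len(buffer)))
        for row in img:
            buffer.extend(row)
            buffer.extend([0] * (stride - w))    # row padding
    return buffer, infos


def pixel(buffer, info, x, y):
    """Read pixel (x, y) of one scale from the shared buffer."""
    return buffer[info.offset + y * info.stride + x]


if __name__ == "__main__":
    buf, infos = pack_scales([[[1, 2, 3], [4, 5, 6]], [[7, 8]]], align=4)
    print(infos[1], pixel(buf, infos[1], 1, 0))
```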
Parallel scales. Continued
Scales are introduced as an additional dimension of the OpenCL
work space
Better, because there is no call overhead for every
scale, but:
Challenges
The cascade scheme causes a lot of imbalance,
which is very bad for the GPU
The number of candidate rectangles in a scale can
vary from 1 to (image.width -
trainingSample.width) * (image.height -
trainingSample.height)
What is effective
Evaluate all rectangles from different
scales in parallel (don’t use scale
number as OCL workgroup dimension)
Split evaluation into stages, with
rebalancing among OCL work-items after
every stage or every couple of stages
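One way to evaluate rectangles from all scales in a single flat index space, instead of using the scale number as a dimension, is a prefix sum over per-scale rectangle counts. The sketch below is an assumption about how such a mapping could be done, not the presenters' kernel; `build_offsets` and `locate` are hypothetical names, and a one-pixel window step is assumed.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(scale_sizes, window=(20, 20)):
    """scale_sizes: list of (width, height) per scale.
    Returns per-scale (cols, rows) grids of window positions and the
    prefix offsets of each scale in the flat index space."""
    grids = [(w - window[0] + 1, h - window[1] + 1) for w, h in scale_sizes]
    counts = [cols * rows for cols, rows in grids]
    offsets = [0] + list(accumulate(counts))
    return grids, offsets


def locate(global_id, grids, offsets):
    """Map a flat work-item id to (scale index, x, y).
    The caller must keep global_id below offsets[-1]."""
    k = bisect_right(offsets, global_id) - 1
    local = global_id - offsets[k]
    cols, _ = grids[k]
    return k, local % cols, local // cols


if __name__ == "__main__":
    grids, offsets = build_offsets([(22, 21), (20, 20)])
    for gid in range(offsets[-1]):
        print(gid, locate(gid, grids, offsets))
```

With this mapping every work-item gets exactly one rectangle regardless of which scale it belongs to, which is the precondition for the stage-by-stage rebalancing mentioned above.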
Our solution in detail
Process several scales of the image per
iteration (all if possible; limited by available
memory size)
Split the work within scales into blocks
The number of OpenCL workgroups does not
depend on image size
Forced rebalancing of work tasks every
few stages, based on queues
Queues
Represented as a linear chunk of memory
Contain records describing the items to
process:
block of rectangles
single rectangle
Allow work to be balanced among GPU work-items
efficiently
Structure of queue
Header
Several sub-queues (the number of sub-queues
equals the number of workgroups; one sub-queue per
workgroup)
Structure of queue. Continued
All buffers for sub-queues have equal size
The header contains one record per sub-queue:
the number of items in that sub-queue
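The queue layout described on these two slides can be modelled on the host as one flat list: a header with one item count per sub-queue, followed by equally sized sub-queue buffers. The class name `WorkQueue`, its methods, and the record format are assumptions for illustration; a real implementation would live in an OpenCL buffer.

```python
class WorkQueue:
    """Flat memory: [count_0 .. count_{n-1}] + n buffers of `capacity` slots."""

    def __init__(self, num_subqueues, capacity):
        self.n = num_subqueues
        self.capacity = capacity
        self.mem = [0] * num_subqueues + [None] * (num_subqueues * capacity)

    def count(self, q):
        """Header record for sub-queue q: number of stored items."""
        return self.mem[q]

    def push(self, q, item):
        """Append an item record to sub-queue q."""
        c = self.mem[q]
        if c >= self.capacity:
            raise OverflowError("sub-queue %d is full" % q)
        self.mem[self.n + q * self.capacity + c] = item
        self.mem[q] = c + 1

    def items(self, q):
        """Return the item records currently stored in sub-queue q."""
        start = self.n + q * self.capacity
        return self.mem[start:start + self.mem[q]]


if __name__ == "__main__":
    wq = WorkQueue(num_subqueues=3, capacity=4)
    wq.push(1, ("rect", 0, 5, 7))        # hypothetical record: kind, scale, x, y
    wq.push(1, ("block", 2, 0, 0))
    print(wq.count(1), wq.items(1))
```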
Number of workgroups
Does not depend on image size, only on
hardware characteristics
We found experimentally that the best
performance is obtained with:
Number of compute units * N,
where N = 6, 8, 12
Queue rebalancing
[Two charts of sub-queues; axis label: Sub-queues, axis ticks 0 to 12]
• Separate kernel for rebalancing
• Balancing is performed every N (2, 3, 4) stage
runs
• The resulting sub-queues have almost equal length
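The effect of rebalancing can be shown with a sequential host-side sketch. The slides state that a separate GPU kernel does this; the function below is not that kernel, only an illustration of the outcome under the assumption that surviving items may be freely reassigned to any sub-queue.

```python
def rebalance(subqueues):
    """Redistribute all remaining items so that sub-queue lengths
    differ by at most one (round-robin assignment)."""
    items = [item for q in subqueues for item in q]
    n = len(subqueues)
    return [items[i::n] for i in range(n)]


if __name__ == "__main__":
    # After a few cascade stages some sub-queues are nearly empty.
    before = [[1, 2, 3, 4, 5, 6, 7], [], [8], []]
    after = rebalance(before)
    print([len(q) for q in before], "->", [len(q) for q in after])
```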
Face detection relative
performance
GPU vs CPU
[Chart: GPU speedup (axis 0 to 6, the higher the better) by size of input images: 0.08 MPx, 1 MPx, 4 MPx, 6 MPx]
Modes of face detector
Homogeneous
Parallel (CPU only)
Ocl (GPU only)
Heterogeneous
Balanced
Split
Homogeneous modes
Parallel – multi-threaded mode, everything is
processed on CPU
Ocl - single-threaded mode, everything is
processed on GPU using OpenCL
Heterogeneous modes
Balanced – mix of parallel and ocl modes.
Split – uses GPU for small scales (having
many rectangles) and CPU for large.
Goal: get maximum performance from using
both CPU and GPU
Balanced heterogeneous face
detection pipeline
Media input -> face detection (CPU, GPU and CPU detectors in parallel) -> Feature detection -> Tracking* -> Normalization -> Output
* Tracking is a part of the video pipeline only
Heterogeneous face detection
The detection filter runs in multiple threads.
Every thread acquires a detector from the pools.
The first pool contains the GPU detector (usually one). The second pool has only CPU detectors.
In balanced mode, each thread acquires one detector (a CPU detector if the GPU detector is not available).
In split mode, a thread acquires the GPU detector and then a CPU detector sequentially.
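The balanced-mode acquisition rule can be sketched with two thread-safe pools. The class `DetectorPools` and its method names are hypothetical, and detectors are stand-in strings; the sketch only shows the rule "take the GPU detector if it is free, otherwise take a CPU detector". Split mode is not modelled here.

```python
import queue


class DetectorPools:
    """Two pools of detectors: GPU (usually one entry) and CPU."""

    def __init__(self, gpu_detectors, cpu_detectors):
        self.gpu = queue.Queue()
        self.cpu = queue.Queue()
        for d in gpu_detectors:
            self.gpu.put(d)
        for d in cpu_detectors:
            self.cpu.put(d)

    def acquire_balanced(self):
        """Return (kind, detector): GPU if free, else a CPU detector."""
        try:
            return "gpu", self.gpu.get_nowait()
        except queue.Empty:
            # Blocks until some thread releases a CPU detector.
            return "cpu", self.cpu.get()

    def release(self, kind, detector):
        """Give a detector back to the pool it came from."""
        (self.gpu if kind == "gpu" else self.cpu).put(detector)


if __name__ == "__main__":
    pools = DetectorPools(["gpu0"], ["cpu0", "cpu1"])
    first = pools.acquire_balanced()
    second = pools.acquire_balanced()
    print(first, second)
    pools.release(*first)
    pools.release(*second)
```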
Split heterogeneous face
detection pipeline
Media input -> face detection (three “CPU + GPU face detection” blocks) -> Feature detection -> Tracking -> Normalization -> Output
Challenges
Make sure the GPU has work to do all the
time
TBB is not suitable for that
TBB pushes a number of items into the
pipeline and runs them almost
simultaneously in the same filters
The detection filter can be without work
for some time, and then the GPU is idle
Ways to minimize GPU idle time
Make all filters parallel (feature detection,
for example)
Balance CPU/GPU work via queuing
Make sure there is always an extra item in
the queue for the GPU
Profiling
AMD APP Profiler 2.2
Parallel Path Analyzer
Debugging
Copy data back from GPU and compare it with
“golden” results
Printf via cl_amd_printf extension
The OpenCL Emulator-Debugger from AMD
Performance results. APU
Photo set (190 images)
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz
[Bar chart: processing time (axis 0 to 250) by detector mode: Parallel, Balanced, Split. Speedup label: 1.88x]
Performance results. Mobile
platform
Photo set (190 images)
AMD Phenom II N930, Quad Core 2.00 GHz, ATI MOBILITY HD5650
[Bar chart: processing time (axis 0 to 180) by detector mode: Parallel, Ocl, Balanced, Split. Speedup label: 2.03x]
Performance results. Mobile
platform. Continued
Photo set (1440 images)
AMD Phenom II N930, Quad Core 2.00 GHz, ATI MOBILITY HD5650
[Bar chart: processing time (axis 0 to 700) by detector mode: Parallel, Ocl, Balanced, Split. Speedup label: 1.75x]
Performance results. Desktop
system
Photo set (190 images)
Intel Core i7 870 4x3GHz, AMD Radeon HD6970
[Bar chart: processing time (axis 0 to 80) by detector mode: Parallel, Balanced, Ocl, Split. Speedup label: 2.72x]
Performance results. Desktop
system. Continued
Photo set (1440 images)
Intel Core i7 870 4x3GHz, AMD Radeon HD6970
[Bar chart: processing time (axis 0 to 350) by detector mode: Parallel, Balanced, Ocl, Split. Speedup label: 2.49x]
Speedup on photos
Platform                                                      Small photos   Large photos
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz    ***            1.88
AMD Phenom II N930 Quad Core 2.00 GHz, ATI MOBILITY HD5650    1.75           2.03
Intel Core i7 870 4x3GHz, AMD Radeon HD6970                   2.49           2.72
2-3x speedup for the full face detection
pipeline!
Platform: Windows 7 x64
FBUploader
CPU / GPU comparison
UVD
UVD is a dedicated video decode processing
unit
Offloads the decoding process from the CPU
Reduces power usage
UVD. Details
On Microsoft Windows, it works via the DXVA (DirectX
Video Acceleration) API
We use it to decode H.264 video
UVD. Performance
            w/o UVD   with UVD
FPS         75        64
CPU load    95        30
• Lower FPS
• Lower CPU load
• Offloads work to GPU
*Hardware: Sabine (Torpedo), Quad Core 1.8GHz
Video performance
Video 720P
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz
[Bar chart: processing time (axis 0 to 160) by mode: CPU + no UVD, CPU + UVD, GPU + UVD. Speedup label: 1.34x]
Video performance
Video 720P
Sabine (Armorhead) APU, engineering sample, Quad Core 2.4GHz
[Bar chart: processing time (axis 0 to 140) by mode: CPU + no UVD, CPU + UVD, GPU + UVD. Speedup label: 1.29x]
Speedup on video
Platform                                                       Video 720P
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz     1.34
Sabine (Armorhead) APU, engineering sample, Quad Core 2.4GHz   1.29
Results were obtained with the Viewdle Video application
Platform: Windows 7 x64
Viewdle Video application
Viewdle Video application. Details
Analyzes and indexes video input
automatically (based on face recognition)
Organizes videos and sorts groups of frames
(aka “clusters”) and groups of clusters (aka
“clouds”) according to the faces in the video
User benefits
Processing speedup
Full utilization of user's hardware
Longer battery life
OpenCL: why we love it
Brings processing speed to a new level
Best way to utilize current GPU hardware
Protects investment
Open standard
Fast-growing community
Conclusion
OpenCL accelerates the Viewdle face recognition
solution by 1.6-3x on photos and by 1.3x on
videos using AMD GPUs and APUs
Questions?
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies,
omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including
but not limited to product and roadmap changes, component and motherboard version changes, new model and/or
product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware
upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we
reserve the right to revise this information and to make changes from time to time to the content hereof without
obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN
THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE
EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY
DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY
INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other
names used in this presentation are for informational purposes only and may be trademarks of their respective
owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The
information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions.
Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.