viewdle inc. - amddeveloper.amd.com/wordpress/media/2013/06/2911_2_final.pdf · detection is only...

Viewdle Inc.

Use cases

Face tagging in photos and video, enabling:

sharing

media editing

automatic media mashups

entertainment

Augmented reality

Games

Why does OpenCL matter?

OpenCL is going to bring such use cases to a new level of user experience:

Faster

Lower power

Experience like never before

Typical face recognition application workflow

Extract faces

User tags clusters

Group faces to clusters

User uploads photos

Suggest tag

User selects photos

Face detection pipeline

Media input -> Face detection -> Feature detection -> Normalization -> Output

Tracking*

* Tracking is a part of the video pipeline only

Distribution of working time. Large photos

190 photos with 220 faces

[Pie chart: slices of 17%, 75% and 8%. Legend: Input, Face detection, Feature detection, Normalization, Output]

Distribution of working time. Small photos

1440 photos with 2040 faces

[Pie chart: slices of 3%, 73% and 24%. Legend: Input, Face detection, Feature detection, Normalization, Output]

Distribution of working time. Video 720P

Video, duration ~ 2.5 min

[Pie chart: slices of 28%, 57%, 8%, 1%, 3% and 3%. Legend: Input, Face detection, Feature detection, Tracking, Normalization, Output]

We accelerate

Face detection

Video input (UVD)

Face detect problem statement

Evaluate every rectangle inside image P for face presence.

The set of evaluated rectangles is gathered over a number of downscaled copies P_k of the image, where copy k has downscale factor α^k.

In every image P_k, all M x N rectangles {R_mnk} are evaluated.
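
A minimal sketch of the search space in the problem statement. It assumes a 20x20 training sample, α = 1.1 and a rectangle step of one pixel; the step and the function names are illustrative assumptions, not taken from the Viewdle implementation.

```python
def pyramid_sizes(width, height, alpha=1.1, sample=(20, 20)):
    """Return (k, w_k, h_k) for every downscaled copy P_k that still fits the sample."""
    sizes = []
    k = 0
    while True:
        # Copy k is the original image shrunk by alpha**k.
        w_k = int(width / alpha ** k)
        h_k = int(height / alpha ** k)
        if w_k < sample[0] or h_k < sample[1]:
            break
        sizes.append((k, w_k, h_k))
        k += 1
    return sizes


def rectangle_count(w_k, h_k, sample=(20, 20)):
    """Number of sample-sized rectangles in one scale (assumed step: one pixel)."""
    return (w_k - sample[0] + 1) * (h_k - sample[1] + 1)
```

For a 320x240 input this gives a few dozen scales, with the rectangle count per scale falling from tens of thousands at k = 0 to a handful at the smallest scale.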

Scaling

Scaling of the image with step α = 1.1

Usually about 20-30 scales per image

Linear interpolation

Produces several scales per call
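
A CPU-side sketch of downscaling with linear (bilinear) interpolation, given only to illustrate the resampling step. The slides do not show the actual kernel; the pixel-centre mapping and the function name are assumptions.

```python
def downscale_bilinear(img, new_w, new_h):
    """Resample a grayscale image (list of rows) to new_w x new_h."""
    old_h, old_w = len(img), len(img[0])
    out = []
    for y in range(new_h):
        # Map the output pixel centre back into source coordinates, clamped to the image.
        sy = min(max((y + 0.5) * old_h / new_h - 0.5, 0.0), old_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, old_h - 1)
        fy = sy - y0
        row = []
        for x in range(new_w):
            sx = min(max((x + 0.5) * old_w / new_w - 0.5, 0.0), old_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, old_w - 1)
            fx = sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

Producing several scales per call would amount to running this for a list of target sizes inside one kernel launch instead of one launch per scale.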

Image scales

One scale view

Rectangle to cell mapping

Rectangles to check in all scales

Cascaded detection

• Output is an answer “Face or not face?”

[Diagram: Rectangle -> Face? -> Face? -> Face? -> Face. A "No" at any stage gives "Not a face".]

Unbalancing due to cascades

Detection is only in blue cells

Challenges

Parallelization by scale does not work well on the GPU because every scale contains a different number of rectangles

Rebalancing

Face detection steps

[Diagram: Image integration -> four parallel branches, each Integral image scaling -> Detection cascades -> Results merge]

Image integration
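
The slides name this step without defining it. Assuming it is the standard integral image (summed-area table), a minimal sketch looks as follows; the function names are illustrative.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels above and to the left of (x, y).
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

With this table, the sum over any rectangle costs four memory reads regardless of its size, which is what makes evaluating many rectangles per scale affordable.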

Detection

• Based on a cascade scheme

• Input is a rectangle of constant size

• Every cascade tries to reject non-face rectangles

• Output is an answer “Face or not face?”

[Diagram: Rectangle -> Face? -> Face? -> Face? -> Face. A "No" at any stage gives "Not a face".]
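
A minimal sketch of the cascade scheme, assuming each cascade can be modelled as a function that returns True to pass the rectangle on or False to reject it. The classifiers themselves are placeholders, not the ones Viewdle uses.

```python
def evaluate_cascade(stages, rect):
    """Run the stages in order and stop at the first rejection.

    Returns (is_face, stages_run) so the uneven amount of work per
    rectangle is visible to the caller.
    """
    for i, stage in enumerate(stages, start=1):
        if not stage(rect):
            return False, i
    return True, len(stages)
```

Most rectangles are rejected after one or two stages while a few run to the end, which is the source of the unbalancing discussed on the surrounding slides.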

Evaluation of rectangle

We need to check whether a rectangle contains a face

The size of the evaluated rectangle is equal to the size of the training sample image (for example, [20; 20])

The rectangle is checked by different cascades

What is not effective

Parallelizing inside a single scale, or straight by scales

Running all cascades to the end on all rectangles in parallel

If parallel straight by scales

Scales are stored one after another in a linear chunk of memory

Information about the scales (width, height, stride, offset) is stored in additional structures
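
A host-side sketch of this layout: all scales packed into one flat buffer, with a small descriptor per scale. The field names follow the slide (width, height, stride, offset); the row alignment parameter and the function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScaleInfo:
    width: int
    height: int
    stride: int  # elements per row in the packed buffer
    offset: int  # index of the first element of this scale


def pack_scales(scales, align=1):
    """Pack a list of images (lists of rows) into one linear buffer."""
    buffer = []
    infos = []
    for img in scales:
        h, w = len(img), len(img[0])
        stride = -(-w // align) * align  # round the row length up to the alignment
        infos.append(ScaleInfo(w, h, stride, len(buffer)))
        for row in img:
            buffer.extend(row)
            buffer.extend([0] * (stride - w))  # row padding
    return buffer, infos


def pixel(buffer, info, x, y):
    """Read pixel (x, y) of the scale described by info."""
    return buffer[info.offset + y * info.stride + x]
```

A kernel would receive the buffer and the descriptor array and use the same index arithmetic as `pixel`.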

Parallel scales. Continued

Scales are introduced as an additional dimension of the OpenCL work space

Better, because there is no per-scale call overhead, but:

Challenges

The cascade scheme brings a lot of unbalancing, which is very bad for the GPU

The number of candidate rectangles in a scale can vary from 1 to (image.width - trainingSample.width + 1) * (image.height - trainingSample.height + 1)

What is effective

Evaluate all rectangles from different scales in parallel (don’t use the scale number as an OCL workgroup dimension)

Split evaluation into stages, with rebalancing among OCL workitems after every stage or every couple of stages

Our solution detail

Process several scales of the image per iteration (all of them if possible; this is limited by available memory size)

Split the work in scales into blocks

The number of OpenCL workgroups does not depend on image size

Forced rebalancing of working tasks every several stages, based on queues

Queues

Presented as a linear chunk of memory

Contain records describing the items to process:

block of rectangles

single rectangle

Allow balancing work among GPU workitems efficiently

Structure of queue

Header

Several sub-queues (the number of sub-queues is equal to the number of workgroups; one sub-queue ~ one workgroup)

Structure of queue. Continued

All buffers for sub-queues have equal size

The header contains one record per sub-queue: the number of items in that sub-queue
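
A host-side sketch of such a queue: one flat block of memory holding a header of per-sub-queue counters followed by equal-size sub-queue buffers. The record encoding (a plain integer) and the class name are assumptions for illustration.

```python
class WorkQueue:
    """Flat queue: [count_0 .. count_{n-1}] followed by n buffers of `capacity` records."""

    def __init__(self, num_subqueues, capacity):
        self.n = num_subqueues
        self.capacity = capacity
        self.mem = [0] * (num_subqueues + num_subqueues * capacity)

    def push(self, sub, record):
        count = self.mem[sub]  # header entry for this sub-queue
        if count >= self.capacity:
            raise OverflowError("sub-queue is full")
        self.mem[self.n + sub * self.capacity + count] = record
        self.mem[sub] = count + 1

    def items(self, sub):
        start = self.n + sub * self.capacity
        return self.mem[start:start + self.mem[sub]]
```

Because every sub-queue buffer has the same size, a workgroup can locate its own buffer from its index alone, without reading any other workgroup's header entry.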

Number of workgroups

Does not depend on image size, only on hardware characteristics

In our case, we found experimentally that the best performance is obtained with:

number of compute units * N, where N = 6, 8, 12

Queue rebalancing

[Two bar charts of sub-queues, axis ticks 0 to 12]

• Separate kernel for rebalancing

• Balancing is performed every N (2, 3, 4) stage runs

• The resulting sub-queues have almost equal length
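
A host-side sketch of what the rebalancing step achieves: the surviving items are redistributed so that sub-queue lengths differ by at most one. The actual implementation is a separate OpenCL kernel; this version only illustrates the result and its name is an assumption.

```python
def rebalance(subqueues):
    """Redistribute items so that sub-queue lengths differ by at most one."""
    items = [item for sub in subqueues for item in sub]
    n = len(subqueues)
    base, extra = divmod(len(items), n)
    balanced = []
    pos = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # the first `extra` queues get one more
        balanced.append(items[pos:pos + size])
        pos += size
    return balanced
```

After a few cascade stages most sub-queues are nearly empty and a few are long; running this between stage groups gives every workgroup a similar amount of work again.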

Face detection relative performance

GPU vs CPU

[Bar chart: GPU speedup (y-axis, 0 to 6; the higher the better) by size of input images (x-axis: 0.08 MPx, 1 MPx, 4 MPx, 6 MPx)]

Modes of face detector

Homogeneous

Parallel (CPU only)

Ocl (GPU only)

Heterogeneous

Balanced

Split

Homogeneous modes

Parallel: multi-threaded mode, everything is processed on CPU

Ocl: single-threaded mode, everything is processed on GPU using OpenCL

Heterogeneous modes

Balanced: a mix of the Parallel and Ocl modes.

Split: uses the GPU for small scales (which have many rectangles) and the CPU for large ones.

Goal: get maximum performance from using both CPU and GPU
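
A rough sketch of how Split mode might divide the scales, assuming the decision is made on the number of rectangles per scale. The threshold value, the 20x20 sample size and the function name are hypothetical; the slides do not say how the split point is chosen.

```python
def split_scales(scale_sizes, sample=(20, 20), min_gpu_rects=10_000):
    """Send scales with many rectangles to the GPU and the rest to the CPU."""
    gpu, cpu = [], []
    for w, h in scale_sizes:
        rects = (w - sample[0] + 1) * (h - sample[1] + 1)
        # min_gpu_rects is a hypothetical cut-off, to be tuned per platform.
        (gpu if rects >= min_gpu_rects else cpu).append((w, h))
    return gpu, cpu
```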

Balanced heterogeneous face detection pipeline

[Diagram: Media input -> three parallel detection filters (CPU face detection, GPU face detection, CPU face detection) -> Feature -> Output, with Tracking*]

* Tracking is a part of the video pipeline only

Heterogeneous face detect

The detection filter runs in multiple threads.

Every thread acquires a detector from the pools.

The first pool contains the GPU detector (usually one). The second pool has only CPU detectors.

In balanced mode, each thread acquires one detector (if the GPU detector is not available, it uses a CPU detector).

In split mode, each thread acquires the GPU detector and then a CPU detector, sequentially.
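
A minimal sketch of the two pools and of balanced-mode acquisition, using Python's standard `queue` module in place of whatever pool type the application uses. Detectors are represented by strings, and all names are illustrative assumptions.

```python
import queue


def make_pools(num_gpu=1, num_cpu=3):
    """Create the GPU pool (usually one detector) and the CPU pool."""
    gpu_pool, cpu_pool = queue.Queue(), queue.Queue()
    for i in range(num_gpu):
        gpu_pool.put(f"gpu-{i}")
    for i in range(num_cpu):
        cpu_pool.put(f"cpu-{i}")
    return gpu_pool, cpu_pool


def acquire_balanced(gpu_pool, cpu_pool):
    """Balanced mode: take the GPU detector if it is free, otherwise a CPU detector."""
    try:
        return gpu_pool, gpu_pool.get_nowait()
    except queue.Empty:
        return cpu_pool, cpu_pool.get()  # blocks until a CPU detector is released


def release(pool, detector):
    """Return a detector to the pool it came from."""
    pool.put(detector)
```

Each detection thread would call `acquire_balanced`, process one item, then call `release`; split mode would instead take from the GPU pool and then from the CPU pool within the same thread.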

Split heterogeneous face detection pipeline

[Diagram: Media input -> three parallel "CPU + GPU face detection" filters -> Feature -> Output, with Tracking]

Challenges

Make sure the GPU has work to do all the time

TBB is not suitable for that

TBB pushes a number of items into pipelines and runs them almost simultaneously in the same filters

The detection filter can be without work for some time, and then the GPU is idle

Ways to minimize GPU idle time

Make all filters parallel (feature detect example)

Make CPU/GPU balancing via queuing

Make sure there is always an extra item in the queue for the GPU

Profiling AMD APP Profiler 2.2

Parallel Path Analyzer:

Debugging

Copy data back from the GPU and compare it with “golden” results

printf via the cl_amd_printf extension

The OpenCL Emulator-Debugger from AMD
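
A small sketch of the first technique, assuming the GPU output has already been copied back into a host-side list of numbers. The tolerance and the function name are illustrative.

```python
def compare_with_golden(gpu_result, golden, tol=1e-5):
    """Return the indices where the GPU output differs from the reference by more than tol."""
    if len(gpu_result) != len(golden):
        raise ValueError("length mismatch between GPU result and golden data")
    return [i for i, (a, b) in enumerate(zip(gpu_result, golden)) if abs(a - b) > tol]
```

An empty list means the kernel output matches the reference; otherwise the returned indices point at the first places to inspect.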

Performance results. APU

Photos set (190 images)

Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz

[Bar chart: processing time (y-axis, 0 to 250) by mode of detector (Parallel, Balanced, Split). Speedup: 1.88x]

Performance results. Mobile platform

AMD Phenom II N930 Quad Core 2.00 GHz, ATI MOBILITY HD5650

[Bar chart: processing time (y-axis, 0 to 180) by mode of detector (Parallel, Ocl, Balanced, Split). Speedup: 2.03x]

Performance results. Mobile platform. Continued

AMD Phenom II N930 Quad Core 2.00 GHz, ATI MOBILITY HD5650

[Bar chart: processing time (y-axis, 0 to 700) by mode of detector (Parallel, Ocl, Balanced, Split). Speedup: 1.75x]

Performance results. Desktop system

Intel Core i7 870 4x3GHz, AMD Radeon HD6970

[Bar chart: processing time (y-axis, 0 to 80) by mode of detector (Parallel, Balanced, Ocl, Split). Speedup: 2.72x]

Performance results. Desktop system. Continued

Intel Core i7 870 4x3GHz, AMD Radeon HD6970

[Bar chart: processing time (y-axis, 0 to 350) by mode of detector (Parallel, Balanced, Ocl, Split). Speedup: 2.49x]

Speedup on photos

Platform                                                      Small photos   Large photos
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz    ***            1.88
AMD Phenom II N930 Quad Core 2.00 GHz, ATI MOBILITY HD5650    1.75           2.03
Intel Core i7 870 4x3GHz, AMD Radeon HD6970                   2.49           2.72

2-3 times speedup for the full face detection pipeline!

Platform: Windows 7 x64

FBUploader

CPU / GPU comparison

UVD

UVD is a dedicated video decode processing unit

Offloads the decoding process from the CPU

Reduces power usage

UVD. Details

On Microsoft Windows it works via the DXVA (DirectX Video Acceleration) API

We are using it to decode H.264 video

UVD. Performance

            w/o UVD   with UVD
FPS         75        64
CPU load    95        30

• Lower FPS

• Lower CPU load

• Offloads work to GPU

*Hardware: Sabine (Torpedo), Quad Core 1.8GHz

Video performance

Video 720P

Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz

[Bar chart: processing time (y-axis, 0 to 160) by mode (CPU + no UVD, CPU + UVD, GPU + UVD). Speedup: 1.34x]

Video performance

Video 720P

Sabine (Armorhead) APU, engineering sample, Quad Core 2.4GHz

[Bar chart: processing time (y-axis, 0 to 140) by mode (CPU + no UVD, CPU + UVD, GPU + UVD). Speedup: 1.29x]

Speedup on video

Platform                                                        Video 720P
Sabine (Torpedo) APU, engineering sample, Quad Core 1.8GHz      1.34
Sabine (Armorhead) APU, engineering sample, Quad Core 2.4GHz    1.29

Results were obtained with the Viewdle Video application

Platform: Windows 7 x64

Viewdle Video application

Viewdle Video application. Details

Analyzes and indexes video input automatically (based on face recognition)

Organizes videos and sorts groups of frames (aka “clusters”) and groups of clusters (aka “clouds”) according to the faces in the video

User benefits

Processing speedup

Full utilization of user's hardware

Longer battery life

OpenCL: why we love it

Brings speed of processing to new level

Best way to utilize current GPU hardware

Protects investment

Open standard

Fast growing community

Conclusion

OpenCL accelerates the Viewdle face recognition solution by 1.6-3x in photos and by 1.3x in videos using AMD GPUs and APUs

Questions?

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
