Deep Learning for Articulated Human Pose Estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang Department of Electronic Engineering, The Chinese University of Hong Kong


Post on 14-Feb-2017


Page 1: Deep Learning for Articulated Human Pose Estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang

Department of Electronic Engineering, The Chinese University of Hong Kong

Page 2: Deep Learning for Articulated Human Pose Estimation

Outline

Introduction

Methodology

Experiments


Page 3: Deep Learning for Articulated Human Pose Estimation

Human Pose Estimation

What is Human Pose Estimation?

Results are generated by the proposed methods (without temporal constraints). Video credit: Peter Jasko solo - M-idzomer 2013

Page 4: Deep Learning for Articulated Human Pose Estimation

Human pose estimation recovers the joint positions of the articulated limbs of the human body from an image or a video.


Page 5: Deep Learning for Articulated Human Pose Estimation

Applications

• Activity recognition
• Games and animation
• Clothing parsing

[Yamaguchi et al. CVPR’14]

Page 6: Deep Learning for Articulated Human Pose Estimation

Challenges

• Articulation
• Foreshortening
• Clothing

Dantone et al. CVPR 2013

Page 7: Deep Learning for Articulated Human Pose Estimation

Structure Modeling

Figure Drawing (image credit: artintegrity.wordpress.com)

• Cylinders for each body part
• Join up the cylinders

Pictorial Structures (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)

• Unary templates
• Pairwise springs

Unable to handle large variations

Page 8: Deep Learning for Articulated Human Pose Estimation

Structure Modeling

Figure Drawing (image credit: artintegrity.wordpress.com)

• Cylinders for each body part
• Join up the cylinders

Pictorial Structures (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)

• Unary templates
• Pairwise springs

Unable to handle large variations

Mixtures of each part (Yang & Ramanan 2011)

• Unary template for each mixture type
• Clustering on (∆𝑥,∆𝑦)

Graph Structures

• Trees
• Latent trees
• Loopy graphs
• Fully connected graphs
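As a rough illustration of the "clustering on (∆𝑥,∆𝑦)" bullet, the sketch below assigns mixture types by running a plain k-means on the relative offsets between a part and its neighbor. The function name `cluster_mixture_types`, the number of types, and the toy offsets are assumptions made for illustration; the slides do not specify the clustering procedure.

```python
import numpy as np

def cluster_mixture_types(offsets, num_types, num_iters=20, seed=0):
    """Plain k-means on (dx, dy) offsets; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    offsets = np.asarray(offsets, dtype=float)
    # Initialize the centers from distinct training offsets.
    start = rng.choice(len(offsets), size=num_types, replace=False)
    centers = offsets[start].copy()
    for _ in range(num_iters):
        # Squared distance from every offset to every center: shape (N, K).
        d2 = ((offsets[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(num_types):
            members = offsets[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return labels, centers

# Toy data (hypothetical): elbow-minus-shoulder offsets for arms
# hanging down versus arms raised.
down = np.array([[0.0, 30.0], [2.0, 28.0], [-1.0, 31.0]])
up = np.array([[0.0, -30.0], [1.0, -29.0], [-2.0, -31.0]])
labels, centers = cluster_mixture_types(np.vstack([down, up]), num_types=2)
```

Each cluster index then plays the role of a mixture type for that part, so a raised arm and a lowered arm get different unary templates.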

Page 9: Deep Learning for Articulated Human Pose Estimation

Deep Learning Based Methods

• Benefit from better neural network architectures: VGG, GoogLeNet, ResNet
• Structures are only used for post-processing
• Fully convolutional networks
• Heat map prediction
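To make heat map prediction concrete, here is a minimal sketch of how joint locations could be read off per-part heat maps by taking the peak of each map. The function name `decode_heatmaps` and the array layout (parts, height, width) are assumptions; each part is decoded independently, with no structure between parts.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return the (x, y) peak location of each part's heat map.

    heatmaps: array of shape (num_parts, height, width).
    """
    num_parts, height, width = heatmaps.shape
    # Index of the maximum in each flattened map.
    flat_idx = heatmaps.reshape(num_parts, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([xs, ys], axis=1)

# Two toy heat maps on a 4x5 grid with one peak each.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 1.0
heatmaps[1, 2, 0] = 0.7
joints = decode_heatmaps(heatmaps)  # one (x, y) row per part
```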

Page 10: Deep Learning for Articulated Human Pose Estimation

Gap Between Deep Models and Structure Modeling


Page 11: Deep Learning for Articulated Human Pose Estimation

Gap Between Deep Models and Structure Modeling

End-to-End Learning

Page 12: Deep Learning for Articulated Human Pose Estimation

Motivation: Geometric Constraints Among Body Parts Help in Learning Better Representations

• Local appearance is ambiguous
• Global consistency helps training

[Figure: CNN with forward and backward passes]

Page 13: Deep Learning for Articulated Human Pose Estimation

Outline

Introduction

Methodology

Experiments


Page 14: Deep Learning for Articulated Human Pose Estimation

Graph Models

𝐺 = (𝑉, 𝐸)

Vertices

• Locations and mixture types of body parts
• Modeled by a front-end CNN

Edges

• Pairwise spatial relationships between body parts
• Modeled by message passing layers

Page 15: Deep Learning for Articulated Human Pose Estimation

Framework

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

Page 16: Deep Learning for Articulated Human Pose Estimation

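Framework

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$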

Page 17: Deep Learning for Articulated Human Pose Estimation

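Framework

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$

$\mathbf{l} = \{\mathbf{l}_i\} = \{(x_i, y_i)\}$: location of part $i$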

Page 18: Deep Learning for Articulated Human Pose Estimation

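Framework

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$

$\mathbf{l} = \{\mathbf{l}_i\} = \{(x_i, y_i)\}$: location of part $i$

$\mathbf{t} = \{t_i\}$: mixture type of part $i$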

Page 19: Deep Learning for Articulated Human Pose Estimation

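Front-End CNN: Part Appearance Terms

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$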

Page 20: Deep Learning for Articulated Human Pose Estimation

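Message Passing Layers: Spatial Relationship Terms

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$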

Page 21: Deep Learning for Articulated Human Pose Estimation

Front-End CNN

[Figure: front-end CNN architecture: conv+norm+pool (x2), conv (x3), and 1x1 conv layers with dropout; sizes 4096, 4096, PxK]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \theta) = \log p(\mathbf{l}_i, t_i \mid I; \theta) = \log \sigma\big(f(\mathbf{l}_i, t_i \mid I; \theta)\big)$$

Page 22: Deep Learning for Articulated Human Pose Estimation

Front-End CNN

[Figure: front-end CNN architecture: conv+norm+pool (x2), conv (x3), and 1x1 conv layers with dropout; sizes 4096, 4096, PxK]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \theta) = \log p(\mathbf{l}_i, t_i \mid I; \theta) = \log \sigma\big(f(\mathbf{l}_i, t_i \mid I; \theta)\big)$$

$f(\mathbf{l}_i, t_i \mid I; \theta)$: output of the front-end CNN

Page 23: Deep Learning for Articulated Human Pose Estimation

Front-End CNN

[Figure: front-end CNN architecture: conv+norm+pool (x2), conv (x3), and 1x1 conv layers with dropout; sizes 4096, 4096, PxK]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \theta) = \log p(\mathbf{l}_i, t_i \mid I; \theta) = \log \sigma\big(f(\mathbf{l}_i, t_i \mid I; \theta)\big)$$

Page 24: Deep Learning for Articulated Human Pose Estimation

Front-End CNN

[Figure: front-end CNN architecture: conv+norm+pool (x2), conv (x3), and 1x1 conv layers with dropout; sizes 4096, 4096, PxK]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \theta) = \log p(\mathbf{l}_i, t_i \mid I; \theta) = \log \sigma\big(f(\mathbf{l}_i, t_i \mid I; \theta)\big)$$

$\sigma$: softmax function

Page 25: Deep Learning for Articulated Human Pose Estimation

Front-End CNN

[Figure: front-end CNN architecture: conv+norm+pool (x2), conv (x3), and 1x1 conv layers with dropout; sizes 4096, 4096, PxK]

Local confidence of the appearance of part $i$ with mixture type $t_i$:

$$\phi(\mathbf{l}_i, t_i \mid I; \theta) = \log p(\mathbf{l}_i, t_i \mid I; \theta) = \log \sigma\big(f(\mathbf{l}_i, t_i \mid I; \theta)\big)$$

$\log$: logarithm
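The unary term above is a logarithm applied to a softmax of the CNN output, which can be sketched as a numerically stable log-softmax. The helper `log_softmax`, the toy tensor shape, and the choice to normalize over the channel axis (the PxK part and mixture-type combinations at each location) are assumptions; the slides do not state which axis the softmax runs over.

```python
import numpy as np

def log_softmax(scores, axis=0):
    """Numerically stable log(softmax(scores)) along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

# f: hypothetical CNN output with P*K = 6 channels on a 3x4 grid.
rng = np.random.default_rng(0)
f = rng.normal(size=(6, 3, 4))

# phi[c, y, x] plays the role of phi(l_i, t_i | I; theta), where channel c
# indexes a (part, mixture type) pair and (x, y) is the location l_i.
phi = log_softmax(f, axis=0)
```

Subtracting the per-location maximum before exponentiating leaves the result unchanged but avoids overflow for large scores.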

Page 26: Deep Learning for Articulated Human Pose Estimation

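Message Passing Layers: Spatial Relationship Terms

[Figure: CNN, softmax, and logarithm, followed by message passing layers (max operations) over parts such as head, neck, l.shou, r.shou, r.knee, r.ankle]

$$F(\mathbf{l}, \mathbf{t} \mid I; \boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{i \in V} \phi(\mathbf{l}_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j})$$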

Page 27: Deep Learning for Articulated Human Pose Estimation

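Spatial Relationship Terms

$$\psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j \mid I; \boldsymbol{\omega}_{i,j}^{t_i,t_j}) = \left\langle \boldsymbol{\omega}_{i,j}^{t_i,t_j},\, d(\mathbf{l}_i, \mathbf{l}_j) \right\rangle$$

Quadratic deformation constraints

$$d(\mathbf{l}_i, \mathbf{l}_j) = [\Delta x,\ \Delta x^2,\ \Delta y,\ \Delta y^2]$$

$$\Delta x = x_i - x_j, \qquad \Delta y = y_i - y_j$$

Yang Y, Ramanan D. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011.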
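The spatial relationship term is an inner product between a weight vector and the quadratic deformation features, which can be sketched as follows. The function names and the weight values are hypothetical; the weights are chosen here so that the score peaks at an offset of (20, 0), mimicking a spring with that rest position.

```python
import numpy as np

def deformation_features(l_i, l_j):
    """d(l_i, l_j) = [dx, dx^2, dy, dy^2] with dx = x_i - x_j, dy = y_i - y_j."""
    dx = l_i[0] - l_j[0]
    dy = l_i[1] - l_j[1]
    return np.array([dx, dx ** 2, dy, dy ** 2], dtype=float)

def spatial_term(l_i, l_j, w):
    """psi = <w, d(l_i, l_j)> for one pair of mixture types (t_i, t_j)."""
    return float(np.dot(w, deformation_features(l_i, l_j)))

# Hypothetical weights for one (t_i, t_j) pair. The negative quadratic
# weights penalize offsets far from the preferred one.
w = np.array([0.4, -0.01, 0.0, -0.01])

at_rest = spatial_term((30, 12), (10, 12), w)     # offset (20, 0)
stretched = spatial_term((35, 12), (10, 12), w)   # offset (25, 0)
```

With these weights the score is 0.4·∆x − 0.01·∆x² − 0.01·∆y², so the pair at the rest offset scores higher than the stretched pair.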

Page 28: Deep Learning for Articulated Human Pose Estimation

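Message Passing: Max-Sum Algorithm

The message passed from part $i$ to $j$:

$$m_{ij}(\mathbf{l}_j, t_j) \leftarrow \alpha_m \max_{\mathbf{l}_i, t_i} \big( u_i(\mathbf{l}_i, t_i) + \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j) \big)$$

[Figure: two parts, 1 and 2, with message $m_{21}$]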

Page 29: Deep Learning for Articulated Human Pose Estimation

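Message Passing: Max-Sum Algorithm

The message passed from part $i$ to $j$:

$$m_{ij}(\mathbf{l}_j, t_j) \leftarrow \alpha_m \max_{\mathbf{l}_i, t_i} \big( u_i(\mathbf{l}_i, t_i) + \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j) \big)$$

The belief of part $i$:

$$u_i(\mathbf{l}_i, t_i) \leftarrow \alpha_u \Big( \phi(\mathbf{l}_i, t_i) + \sum_{k \in N(i)} m_{ki}(\mathbf{l}_i, t_i) \Big)$$

[Figure: two parts, 1 and 2, with message $m_{21}$]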

Page 30: Deep Learning for Articulated Human Pose Estimation

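Message Passing: Max-Sum Algorithm

$$m_{ij}(\mathbf{l}_j, t_j) \leftarrow \alpha_m \max_{\mathbf{l}_i, t_i} \big( u_i(\mathbf{l}_i, t_i) + \psi(\mathbf{l}_i, \mathbf{l}_j, t_i, t_j) \big)$$

$$u_i(\mathbf{l}_i, t_i) \leftarrow \alpha_u \Big( \phi(\mathbf{l}_i, t_i) + \sum_{k \in N(i)} m_{ki}(\mathbf{l}_i, t_i) \Big)$$

$$(\mathbf{l}_i^*, t_i^*) = \arg\max_{\mathbf{l}_i, t_i} u_i^*(\mathbf{l}_i, t_i)$$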
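For intuition, here is a small sketch of max-sum on a three-part chain. It is the textbook max-sum for trees, in which the message from $i$ to $j$ leaves out the message that $j$ sent to $i$; the update on the slide sums over all neighbors $N(i)$ and includes the normalization factors $\alpha_m$ and $\alpha_u$, both of which are omitted here. Each state index stands for one (location, mixture type) pair, and the scores are random placeholders, not model outputs.

```python
import itertools
import numpy as np

def max_sum(phi, psi, edges, num_iters):
    """Textbook max-sum with synchronous updates.

    phi[i]: unary scores of part i, shape (S_i,).
    psi[(i, j)]: pairwise scores of shape (S_i, S_j) for each edge (i, j).
    Returns the belief u[i] of every part after num_iters rounds.
    """
    neighbors = {i: [] for i in phi}
    pair = {}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        pair[(i, j)] = psi[(i, j)]
        pair[(j, i)] = psi[(i, j)].T
    msgs = {(i, j): np.zeros_like(phi[j]) for (i, j) in pair}
    for _ in range(num_iters):
        new_msgs = {}
        for (i, j) in pair:
            # Unary score of i plus messages into i, excluding the one from j.
            partial = phi[i] + sum(msgs[(k, i)] for k in neighbors[i] if k != j)
            # Maximize over the states of i for every state of j.
            new_msgs[(i, j)] = (partial[:, None] + pair[(i, j)]).max(axis=0)
        msgs = new_msgs
    return {i: phi[i] + sum(msgs[(k, i)] for k in neighbors[i]) for i in phi}

rng = np.random.default_rng(0)
parts = ["head", "neck", "l.shou"]
edges = [("head", "neck"), ("neck", "l.shou")]
phi = {p: rng.normal(size=4) for p in parts}
psi = {e: rng.normal(size=(4, 4)) for e in edges}

# Two rounds suffice for a chain whose longest path has two edges.
beliefs = max_sum(phi, psi, edges, num_iters=2)
best = {p: int(beliefs[p].argmax()) for p in parts}

def total_score(states):
    """Full score F for one joint assignment, used as a brute-force check."""
    s = dict(zip(parts, states))
    unary = sum(phi[p][s[p]] for p in parts)
    pairwise = sum(psi[(i, j)][s[i], s[j]] for i, j in edges)
    return unary + pairwise

brute = max(itertools.product(range(4), repeat=len(parts)), key=total_score)
```

On a tree, the per-part argmax of the beliefs agrees with exhaustive search over all joint assignments, which the brute-force comparison checks on this toy chain.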

Page 31: Deep Learning for Articulated Human Pose Estimation

Diagnosis on Message Passing Layers


1st

Page 32: Deep Learning for Articulated Human Pose Estimation

Diagnosis on Message Passing Layers


2nd

Page 33: Deep Learning for Articulated Human Pose Estimation

Diagnosis on Message Passing Layers


3rd

Page 34: Deep Learning for Articulated Human Pose Estimation

Outline

Introduction

Methodology

Experiments


Page 35: Deep Learning for Articulated Human Pose Estimation

Datasets

• LSP: sports; 1000 training, 1000 testing
• FLIC: films; 3987 training, 1016 testing

Page 36: Deep Learning for Articulated Human Pose Estimation


Qualitative Results on the LSP Dataset

Page 37: Deep Learning for Articulated Human Pose Estimation


Page 38: Deep Learning for Articulated Human Pose Estimation

PCP (Percentage of Correct Parts) on the LSP Dataset
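As a reminder of what the metric measures, the sketch below computes PCP in its commonly used strict form: a limb counts as correct only if both of its predicted endpoints lie within half the ground-truth limb length of their true positions. The function name, the 0.5 threshold default, and the toy skeleton are assumptions for illustration; the slides do not spell out the evaluation code.

```python
import numpy as np

def strict_pcp(pred, gt, limbs, thresh=0.5):
    """Fraction of limbs whose two endpoints are both within
    thresh * (ground-truth limb length) of their ground-truth positions."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    correct = 0
    for a, b in limbs:
        limb_len = np.linalg.norm(gt[a] - gt[b])
        err_a = np.linalg.norm(pred[a] - gt[a])
        err_b = np.linalg.norm(pred[b] - gt[b])
        if err_a <= thresh * limb_len and err_b <= thresh * limb_len:
            correct += 1
    return correct / len(limbs)

# Toy skeleton: joints 0-1-2 form two limbs of length 10 each.
gt = [(0, 0), (10, 0), (20, 0)]
pred = [(1, 0), (10, 2), (20, 8)]  # joint 2 is off by 8, above the limit of 5
limbs = [(0, 1), (1, 2)]
score = strict_pcp(pred, gt, limbs)
```

Here the first limb is correct and the second is not, so the score is one half.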

Page 39: Deep Learning for Articulated Human Pose Estimation

[Figure: bar chart, strict PCP on the LSP dataset (axis 30 to 100) for torso, head, upper arms, lower arms, upper legs, lower legs, and mean. Methods compared: Yang&Ramanan, CVPR'11; Pishchulin et al., CVPR'13; Eichner&Ferrari, ACCV'13; Kiefel&Gehler, ECCV'14; Pose Machines, ECCV'14; Ouyang et al., CVPR'14; Pishchulin et al., ICCV'13; DeepPose, CVPR'14; Chen&Yuille, NIPS'14; Ours]

Page 40: Deep Learning for Articulated Human Pose Estimation

Percentage of Detected Joints (PDJ)

[Figure: PDJ curves on LSP and FLIC, with our method marked]
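A minimal sketch of the PDJ metric follows, assuming the common definition in which a joint counts as detected when its prediction lies within a given fraction of the torso diameter from the ground truth. The function name, the fraction value, and the toy coordinates are assumptions; the slides do not define the threshold or how the torso diameter is measured.

```python
import numpy as np

def pdj(pred, gt, torso_diameter, fraction=0.2):
    """Share of joints predicted within fraction * torso_diameter of the truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= fraction * torso_diameter))

# Four toy joints with prediction errors of 1, 3, 4, and 7 pixels.
gt = np.array([[0, 0], [10, 0], [10, 30], [0, 30]], dtype=float)
pred = gt + np.array([[1, 0], [0, 3], [4, 0], [0, -7]], dtype=float)

# Hypothetical torso diameter of 50 pixels: the threshold is 5 pixels.
rate = pdj(pred, gt, torso_diameter=50.0, fraction=0.1)
```

Sweeping `fraction` over a range of values gives one point of a PDJ curve per threshold.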

Page 41: Deep Learning for Articulated Human Pose Estimation

Datasets

• LSP: sports; 1000 training, 1000 testing
• FLIC: films; 3987 training, 1016 testing
• Image Parse: activities; 100 training, 205 testing (cross-dataset validation)

Page 42: Deep Learning for Articulated Human Pose Estimation

Generalization: Image Parse Dataset

[Figure: horizontal bar chart on the Image Parse dataset (axis 30 to 90) for torso, head, upper arms, lower arms, upper legs, lower legs, and mean. Methods compared: Ours; Ouyang et al., CVPR'14; Yang&Ramanan, TPAMI'13; Pishchulin et al., ICCV'13; Pishchulin et al., CVPR'13; Pishchulin et al., CVPR'12; Johnson&Everingham, CVPR'11; Yang&Ramanan, CVPR'11]

Page 43: Deep Learning for Articulated Human Pose Estimation

Component Analysis


Page 44: Deep Learning for Articulated Human Pose Estimation

Unary Term vs. Full Model

Strict PCP on the LSP dataset (VGG-LG)

Model                             Torso  Head  Upper arms  Lower arms  Upper legs  Lower legs  Mean
Unary term only                   83.4   69.0  53.5        34.9        72.2        63.5        60.1
Full model (separately trained)   93.0   82.1  70.6        55.4        82.1        75.3        74.2
Full model (jointly trained)      96.5   83.1  78.8        66.7        88.7        81.7        81.1
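The means in this comparison can be cross-checked with a short calculation. The sketch assumes that the mean weights each paired limb (upper arms, lower arms, upper legs, lower legs) twice and the torso and head once; the slides do not state this, but it is consistent with the reported means to one decimal place.

```python
# Per-part strict PCP from the chart, in the order:
# torso, head, upper arms, lower arms, upper legs, lower legs.
per_part = {
    "unary": [83.4, 69.0, 53.5, 34.9, 72.2, 63.5],
    "separate": [93.0, 82.1, 70.6, 55.4, 82.1, 75.3],
    "joint": [96.5, 83.1, 78.8, 66.7, 88.7, 81.7],
}
reported_mean = {"unary": 60.1, "separate": 74.2, "joint": 81.1}

# Assumed weighting: one torso, one head, two of every other limb.
weights = [1, 1, 2, 2, 2, 2]

def weighted_mean(values):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

means = {name: weighted_mean(vals) for name, vals in per_part.items()}

# Gain in mean PCP from adding structure, then from joint training.
gain_structure = reported_mean["separate"] - reported_mean["unary"]
gain_joint = reported_mean["joint"] - reported_mean["separate"]
```

Under that weighting, adding the message passing layers accounts for roughly 14 points of mean PCP, and joint training for roughly 7 more.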

Page 45: Deep Learning for Articulated Human Pose Estimation

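Tree-Structured Model vs. Loopy Model

[Figure: horizontal bar chart (axis 62 to 92) for Torso, Head, Upper Arms, Lower Arms, Upper Legs, Lower Legs, and Mean; legend: Loopy Model, Tree Model]

Values as listed in the chart, two series in the part order above:

96.2  83.4  78.7  65.8  87.9  81.1  80.7
96.5  83.1  78.8  66.7  88.7  81.7  81.1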

Page 46: Deep Learning for Articulated Human Pose Estimation

Human Pose Estimation in This CVPR


Chu et al. Structured Feature Learning for Pose Estimation.

Wei et al. Convolutional Pose Machines.

Carreira et al. Human Pose Estimation With Iterative Error Feedback.

Page 47: Deep Learning for Articulated Human Pose Estimation

Summary

End-to-End Learning

• Bridge the gap between structure modeling and deep learning
• A new message passing layer that supports both tree-structured and loopy relational graphs

Page 48: Deep Learning for Articulated Human Pose Estimation

Scan for more details

Thank you. Welcome to our poster session #5