


Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

Chao-Yeh Chen and Kristen Grauman, The University of Texas at Austin

[Teaser figure: example action snapshots labeled Reading, Walking, Riding horse, Phoning, Running, Playing instrument, Taking photo, Using computer, and Riding bike.]

Problem

Goal: learn actions from static images.

Problems of static snapshots:
- May have only a few training examples for some actions.
- Often limited to "canonical" instances of the action.

Our idea: let the system
- watch videos to learn how human poses change over time, and
- infer nearby poses to expand the sparse training snapshots.

Related Work

- Learn actions with discriminative pose and appearance features, e.g. [Maji et al. 2011; Yang et al. 2010; Yao et al. 2010; Delaitre et al. 2011].
- Expand training data by mirroring images and videos, e.g. [Papageorgiou et al. 2000; Wang et al. 2009].
- Synthesize images for action recognition and pose estimation, e.g. [Matikainen et al. 2011; Shakhnarovich et al. 2003; Grauman et al. 2003; Shotton et al. 2011].
- Ours: expand the training set for "free" via pose dynamics learned from unlabeled data.

Approach

Overview: start from a few static snapshots, synthesize pose examples from unlabeled video data, then train with both the real and the synthetic poses.

Assumptions:
- Videos cover the space of human pose dynamics.
- No action labels are given.
- People are detectable and trackable.

Representing body pose

- Poselet activation vector (PAV) by [Maji et al. 2011], sketched in code below.
- Each poselet captures part of the pose from a given viewpoint.
- Robust to occlusion and cluttered background.
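The poster does not spell out how poselet detections are pooled into the PAV. Below is a minimal sketch, assuming a detector that returns (poselet_id, score) firings inside the person's bounding box; the vocabulary size and the max-pooling step are illustrative assumptions, not necessarily the choices of [Maji et al. 2011].

    import numpy as np

    NUM_POSELETS = 150  # assumed poselet vocabulary size

    def poselet_activation_vector(detections, num_poselets=NUM_POSELETS):
        # detections: (poselet_id, score) pairs fired within the person's
        # bounding box. Each PAV entry keeps the strongest activation of
        # its poselet type (max-pooling is an assumption).
        pav = np.zeros(num_poselets)
        for poselet_id, score in detections:
            pav[poselet_id] = max(pav[poselet_id], score)
        return pav

    # Example: three firings for one detected person.
    pav = poselet_activation_vector([(3, 0.9), (3, 0.4), (17, 0.6)])

Because an entry simply stays zero when its poselet never fires, the vector degrades gracefully under occlusion and clutter, which is what the robustness bullet above refers to.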

Expand snapshots by pose dynamics learned from videos

[Overview figure: a few canonical snapshots plus an unlabeled video pool of generic human action; given a training snapshot, infer the pose before and the pose after. Detected poselets are summarized as a PAV with entries P1, P2, P3, ...]

Given a training snapshot, find its matched frame in the unlabeled video pool, then generate nearby synthetic poses. Two strategies, each sketched in code after the list:

1) Example-based strategy: match the training image's pose to a video frame at time t, then take the poses observed at frames t-T and t+T as synthetic "before" and "after" examples.

2) Manifold-based strategy: embed the video frames' poses with Locally Linear Embedding [Roweis and Saul 2000], defining neighbors by a similarity function that combines temporal nearness and pose similarity, then extrapolate the training image's pose along the manifold.
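A minimal sketch of the example-based strategy: each tracked person in the video pool is represented as an array of per-frame PAVs, the match uses Euclidean distance in PAV space, and the offset T is a free parameter (these specifics are illustrative assumptions, not the paper's exact settings).

    import numpy as np

    def synthesize_before_after(snapshot_pav, video_tracks, T=10):
        # Find the frame across all tracks whose PAV is closest to the
        # training snapshot, then return the PAVs observed T frames
        # earlier and later as synthetic "before"/"after" examples.
        best = None  # (distance, track_index, frame_index)
        for v, pavs in enumerate(video_tracks):
            dists = np.linalg.norm(pavs - snapshot_pav, axis=1)
            t = int(np.argmin(dists))
            if best is None or dists[t] < best[0]:
                best = (dists[t], v, t)
        _, v, t = best
        track = video_tracks[v]
        before = track[max(t - T, 0)]            # clamp at track boundaries
        after = track[min(t + T, len(track) - 1)]
        return before, after

For the manifold-based strategy, the similarity equation itself did not survive extraction, so the product-of-Gaussians form below is an assumed reconstruction (sigma_p and sigma_t are hypothetical bandwidths). The embedding uses scikit-learn's LocallyLinearEmbedding; its stock neighbor graph is plain k-NN in feature space, so a faithful implementation would substitute a custom graph built from this similarity.

    import numpy as np
    from sklearn.manifold import LocallyLinearEmbedding

    def similarity(pav_i, pav_j, t_i, t_j, sigma_p=1.0, sigma_t=5.0):
        # Assumed form: high when poses are alike AND frames are close in time.
        pose_term = np.exp(-np.sum((pav_i - pav_j) ** 2) / sigma_p ** 2)
        time_term = np.exp(-((t_i - t_j) ** 2) / sigma_t ** 2)
        return pose_term * time_term

    # Embed per-frame PAVs, project a snapshot onto the manifold, and take
    # its nearest embedded frames as synthetic poses (stand-in data here).
    rng = np.random.default_rng(0)
    frame_pavs = rng.random((500, 150))
    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
    embedded = lle.fit_transform(frame_pavs)

    snapshot_2d = lle.transform(rng.random((1, 150)))
    nearest = np.argsort(np.linalg.norm(embedded - snapshot_2d, axis=1))[:2]
    synthetic_pavs = frame_pavs[nearest]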

Train with real & synthetic poses

- Frames from video are the source domain; static snapshots are the target domain.
- Domain adaptation by [Daume III 2007] connects the two domains in the pose feature space, so the action classifier can train on real and synthetic poses together (see the sketch below).

[Figure: pose feature space with real positive and negative snapshots augmented by synthetic poses generated from video.]
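A minimal sketch of this training step, using the feature-augmentation trick of [Daume III 2007]: every source example x becomes (x, x, 0) and every target example becomes (x, 0, x), so the classifier can share what transfers across domains and separate what does not. The linear SVM and the toy data are assumptions for illustration, not the poster's exact setup.

    import numpy as np
    from sklearn.svm import LinearSVC

    def augment(X, domain):
        # Daume III 2007: a shared copy plus a domain-specific copy.
        Z = np.zeros_like(X)
        return np.hstack([X, X, Z]) if domain == "source" else np.hstack([X, Z, X])

    # Toy stand-ins: real snapshot PAVs (target) and synthetic PAVs
    # generated from video (source), with binary action labels.
    rng = np.random.default_rng(1)
    X_real, y_real = rng.random((17, 150)), np.arange(17) % 2
    X_syn, y_syn = rng.random((34, 150)), np.arange(34) % 2

    X_train = np.vstack([augment(X_real, "target"), augment(X_syn, "source")])
    y_train = np.concatenate([y_real, y_syn])
    clf = LinearSVC().fit(X_train, y_train)  # linear SVM is an assumed choice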

Results

Recognizing activity in images

- Labeled snapshots: PASCAL VOC 2010 (11 verbs) and Stanford 40 Actions (10 selected verbs).
- Unlabeled video pool: Hollywood Human Actions (no action labels used).

[Plots: accuracy (mAP) vs. number of training images, from 15 to 50 on PASCAL and 17 to 50 on Stanford 10, plus bar charts at the sparsest settings, PASCAL (# train = 15) and Stanford 10 (# train = 17).]
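For reference, mAP here is the mean over action classes of the per-class average precision. A minimal sketch of that evaluation with scikit-learn, using stand-in scores and labels:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(y_true, y_score):
        # y_true: (n_samples, n_classes) binary labels;
        # y_score: (n_samples, n_classes) classifier scores.
        aps = [average_precision_score(y_true[:, c], y_score[:, c])
               for c in range(y_true.shape[1])]
        return float(np.mean(aps))

    rng = np.random.default_rng(2)
    y_true = (rng.random((100, 11)) > 0.7).astype(int)  # 11 verbs, as in PASCAL
    print(mean_average_precision(y_true, rng.random((100, 11))))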

- Most benefit when the training images lack pose diversity.

[Scatter plot: our accuracy gain vs. pose diversity among the training images.]

[Figure: synthetic feature examples generated from real training snapshots.]

Recognizing activity in videos

- Substantial gains when recognizing actions in video, since our method infers intermediate poses not covered in the original snapshots.

[Bar chart: accuracy (0.2 to 0.6) of video activity recognition on 78 testing videos from HMDB51 + ASLAN + UCF data, comparing real training snapshots alone against real plus synthetic "before"/"after" poses.]

Conclusions

- Augment training data without additional labeling cost by leveraging unlabeled video.
- Simple but effective exemplar/manifold extrapolation strategies.
- Significant advantage when labeled training examples are sparse.
- Domain adaptation connects real and generated examples.