Ivan Laptev "Action recognition"

Microsoft Computer Vision School, 28 July - 3 August 2011, Moscow, Russia. Human Action Recognition. Ivan Laptev, ivan.laptev@inria.fr, INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548, Laboratoire d'Informatique, Ecole Normale Supérieure, Paris. Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman

TRANSCRIPT

Page 1: Ivan Laptev "Action recognition"

Microsoft Computer Vision School, 28 July - 3 August 2011, Moscow, Russia

Human Action Recognition

Ivan Laptev, ivan.laptev@inria.fr

INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548, Laboratoire d'Informatique, Ecole Normale Supérieure, Paris

Includes slides from: Alyosha Efros, Mark Everingham and Andrew Zisserman

Page 2: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 3: Ivan Laptev "Action recognition"
Page 4: Ivan Laptev "Action recognition"

Computer vision grand challenge: Video understanding

[Figure: video frames annotated with labels]
Objects: cars, glasses, people, etc.
Actions: drinking, running, door exit, car enter, etc.
Scene categories: indoors, outdoors, street scene, etc.
Geometry: street, wall, field, stair, etc.

Page 5: Ivan Laptev "Action recognition"

Motivation I: Artistic Representation

Early studies were motivated by human representations in Arts.

Da Vinci: "it is indispensable for a painter, to become totally familiar with the anatomy of nerves, bones, muscles, and sinews, such that he understands for their various motions and stresses, which sinews or which muscle causes a particular motion"

"I ask for the weight [pressure] of this man for every segment of motion when climbing those stairs, and for the weight he places on b and on c. Note the vertical line below the center of mass of this man."

Leonardo da Vinci (1452–1519): A man going upstairs, or up a ladder.

Page 6: Ivan Laptev "Action recognition"

Motivation II: Biomechanics

The emergence of biomechanics:
• Borelli applied to biology the analytical and geometrical methods developed by Galileo Galilei
• He was the first to understand that bones serve as levers and muscles function according to mathematical principles
• His physiological studies included muscle analysis and a mathematical discussion of movements, such as running or jumping

Giovanni Alfonso Borelli (1608–1679)

Page 7: Ivan Laptev "Action recognition"

Motivation III: Motion perception

Etienne-Jules Marey (1830–1904) made chronophotographic experiments influential for the emerging field of cinematography.

Eadweard Muybridge (1830–1904) invented a machine for displaying the recorded series of images. He pioneered motion pictures and applied his technique to movement studies.

Page 8: Ivan Laptev "Action recognition"

Motivation III: Motion perception

Gunnar Johansson [1973] pioneered studies on the use of image sequences for a programmed human motion analysis. "Moving Light Displays" (MLD) enable identification of familiar people and their gender, and inspired many works in computer vision.

Gunnar Johansson, Perception and Psychophysics, 1973

Page 9: Ivan Laptev "Action recognition"

Human actions: Historic overview

• 15th century: studies of anatomy
• 17th century: emergence of biomechanics
• 19th century: emergence of cinematography
• 1973: studies of human motion perception
• Modern computer vision

Page 10: Ivan Laptev "Action recognition"

Modern applications: Motion capture and animation

Avatar (2009)

Page 11: Ivan Laptev "Action recognition"

Modern applications: Motion capture and animation

Avatar (2009); Leonardo da Vinci (1452–1519)

Page 12: Ivan Laptev "Action recognition"

Modern applications: Video editing

Space-Time Video CompletionY. Wexler, E. Shechtman and M. Irani, CVPR 2004

Page 13: Ivan Laptev "Action recognition"

Modern applications: Video editing

Space-Time Video CompletionY. Wexler, E. Shechtman and M. Irani, CVPR 2004

Page 14: Ivan Laptev "Action recognition"

Modern applications: Video editing

Recognizing Action at a DistanceAlexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003

Page 15: Ivan Laptev "Action recognition"

Modern applications: Video editing

Recognizing Action at a DistanceAlexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik, ICCV 2003

Page 16: Ivan Laptev "Action recognition"

Why Action Recognition?

Video indexing and search is useful in TV production, entertainment, education, social studies, security, …

• Home videos: e.g. "My daughter climbing"
• TV & Web: e.g. "Fight in a parliament" (260K views in 7 days on YouTube)
• Sociology research: manually analyzed smoking actions in 900 movies
• Surveillance

Page 17: Ivan Laptev "Action recognition"

How is action recognition related to computer vision?

[Figure: street scene labeled with sky, street sign, cars, road]

Page 18: Ivan Laptev "Action recognition"

We can recognize cars and roads. What's next?

12,184,113 images, 17624 synsets

Page 19: Ivan Laptev "Action recognition"
Page 20: Ivan Laptev "Action recognition"

Airplane

A plane has crashed, the cabin is broken, somebody is likely to be injured or dead.

Page 21: Ivan Laptev "Action recognition"

cat

woman

trash bin

Page 22: Ivan Laptev "Action recognition"
Page 23: Ivan Laptev "Action recognition"

Vision is person-centric: we mostly care about things which are important to us, people.

Actions of people reveal the function of objects.

Future challenges:

- Function: What can I do with this and how?
- Prediction: What can happen if someone does that?
- Recognizing goals: What is this person trying to do?

Page 24: Ivan Laptev "Action recognition"

How many person-pixels are there?

Movies, TV, YouTube

Page 25: Ivan Laptev "Action recognition"

How many person-pixels are there?

Movies, TV, YouTube

Page 26: Ivan Laptev "Action recognition"

How many person-pixels are there?

Movies: 35%, TV: 34%, YouTube: 40%

Page 27: Ivan Laptev "Action recognition"

How much data do we have?

Huge amount of video is available and growing:
• TV channels recorded since the 60's
• >34K hours of video uploaded every day
• ~30M surveillance cameras in the US => ~700K video hours/day

If we want to interpret this data, we should better understand what person-pixels are telling us!

Page 28: Ivan Laptev "Action recognition"

What is this course about?

Page 29: Ivan Laptev "Action recognition"

Goal

Get familiar with:
• Problem formulations
• Mainstream approaches
• Particular existing techniques
• Current benchmarks
• Available baseline methods
• Promising future directions
• Hands-on experience (practical session)

Page 30: Ivan Laptev "Action recognition"

What is Action Recognition?

Terminology
• What is an "action"?

Output representation
• What do we want to say about an image/video?

Neither question has a satisfactory answer yet.

Page 31: Ivan Laptev "Action recognition"

Terminology

The terms "action recognition", "activity recognition", and "event recognition" are used inconsistently.
• Finding a common language for describing videos is an open problem.

Page 32: Ivan Laptev "Action recognition"

Terminology example

"Action" is a low-level primitive with semantic meaning
• E.g. walking, pointing, placing an object

"Activity" is a higher-level combination with some temporal relations
• E.g. taking money out from an ATM, waiting for a bus

"Event" is a combination of activities, often involving multiple individuals
• E.g. a soccer game, a traffic accident

This is contentious
• No standard, rigorous definition exists

Page 33: Ivan Laptev "Action recognition"

Output Representation

• Given this image, what is the desired output?

"This image contains a man walking" → action classification / recognition
"The man walking is here" → action detection

Page 34: Ivan Laptev "Action recognition"

Output Representation

• Given this image, what is the desired output?

"This image contains 5 men walking, 4 jogging, 2 running"
"The 5 men walking are here"
"This is a soccer game"

Page 35: Ivan Laptev "Action recognition"

Output Representation

• Given this video, what is the desired output?

• Frames 1-20: the man ran to the left; frames 21-25: he ran away from the camera
• Is this an accurate description?
• Are labels and video frames in 1-1 correspondence?

Page 36: Ivan Laptev "Action recognition"

DATASETS

Page 37: Ivan Laptev "Action recognition"

Dataset: KTH-Actions

• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Specified train, validation, test sets
• Performance measure: average accuracy over all classes

[Schuldt, Laptev, Caputo ICPR 2004]

Page 38: Ivan Laptev "Action recognition"

UCF-Sports

• 10 different action classes
• 150 video samples in total
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes

Classes shown: Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging, Golf-Swinging

[Rodriguez, Ahmed, and Shah CVPR 2008]

Page 39: Ivan Laptev "Action recognition"

UCF - YouTube Action Dataset

• 11 categories, 1168 videos
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes

[Liu, Luo and Shah CVPR 2009]

Page 40: Ivan Laptev "Action recognition"

Semantic Description of Human Activities (ICPR 2010)

3 challenges: interaction, aerial view, wide-area

Interaction challenge:
• 6 classes, 120 instances over ~20 min. video
• Classification and detection tasks (+/- bounding boxes)
• Evaluation method: leave-one-out

[Ryoo et al. ICPR 2010 challenge]

Page 41: Ivan Laptev "Action recognition"

Hollywood2

• 12 action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP) over all classes

Classes include: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar

[Marszałek, Laptev, Schmid CVPR 2009]
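The mAP measure above is the mean over classes of per-class average precision. As a sketch, one common AP definition (precision averaged at the ranks of the positive samples) can be computed as follows; the official evaluation scripts may differ in detail.

```python
def average_precision(scores, labels):
    """AP for one class: sort samples by decreasing classifier score,
    then average the precision attained at the rank of each positive.
    mAP is the mean of this value over all classes."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)
```

For example, scores [0.9, 0.8, 0.7] with labels [1, 0, 1] rank the positives at 1 and 3, giving AP = (1/1 + 2/3) / 2.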

Page 42: Ivan Laptev "Action recognition"

TRECVid Surveillance Event Detection

• 10 actions: person runs, take picture, cell to ear, …
• 5 cameras, ~100h video from LGW airport
• Detection (in time, not space); multiple detections count as false positives
• Evaluation method: specified training / test videos, evaluation at NIST
• Performance measure: statistics on DET curves

[Smeaton, Over, Kraaij, TRECVid]

Page 43: Ivan Laptev "Action recognition"

Dataset DesiderataDataset DesiderataDataset DesiderataDataset Desiderata

ClutterNot choreographed by dataset collectorsNot choreographed by dataset collectors

• Real-world variation

S lScale• Large amount of video

Rarity of actions• Detection harder than classificatione ec o a de a c ass ca o• Chance performance should be very low

Clear definition of training/test splitClear definition of training/test split• Validation set for parameter tuning?• Reproducing / comparing to other methods?

Page 44: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 45: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 46: Ivan Laptev "Action recognition"

Objective and motivation

Determine human body pose (layout).

Why? To recognize poses, gestures, actions.

Slide credit: Andrew Zisserman

Page 47: Ivan Laptev "Action recognition"

Activities characterized by a pose

Slide credit: Andrew Zisserman

Page 48: Ivan Laptev "Action recognition"

Activities characterized by a pose

Slide credit: Andrew Zisserman

Page 49: Ivan Laptev "Action recognition"

Activities characterized by a pose

Page 50: Ivan Laptev "Action recognition"

Challenges: articulations and deformations

Page 51: Ivan Laptev "Action recognition"

Challenges of (almost) unconstrained images

Varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing.

Page 52: Ivan Laptev "Action recognition"

Pictorial Structures

• Intuitive model of an object
• Model has two components:
  1. parts (2D image fragments)
  2. structure (configuration of parts)
• Dates back to Fischler & Elschlager 1973

Page 53: Ivan Laptev "Action recognition"

Long tradition of using pictorial structures for humans:

• Finding People by Sampling. Ioffe & Forsyth, ICCV 1999
• Pictorial Structure Models for Object Recognition. Felzenszwalb & Huttenlocher, 2000
• Learning to Parse Pictures of People. Ronfard, Schmid & Triggs, ECCV 2002

Page 54: Ivan Laptev "Action recognition"

Felzenszwalb & Huttenlocher

NB: requires background subtraction

Page 55: Ivan Laptev "Action recognition"

Variety of Poses

Page 56: Ivan Laptev "Action recognition"

Variety of Poses

Page 57: Ivan Laptev "Action recognition"

Objective: detect human and determine upper body pose (layout)


Page 58: Ivan Laptev "Action recognition"

Pictorial structure model – CRF


Page 59: Ivan Laptev "Action recognition"

Complexity


Page 60: Ivan Laptev "Action recognition"

Are trees the answer?

[Figure: tree-structured body model: head (He), torso (T), upper arms (UA), lower arms (LA), hands (Ha), left and right]

• With n parts and h possible discrete locations per part, exhaustive search is O(h^n)
• For a tree, using dynamic programming this reduces to O(nh^2)
• If the model is a tree and has certain edge costs, then complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]

Page 61: Ivan Laptev "Action recognition"

Are trees the answer?

[Figure: tree-structured body model: head (He), torso (T), upper arms (UA), lower arms (LA), hands (Ha), left and right]

• With n parts and h possible discrete locations per part, exhaustive search is O(h^n)
• For a tree, using dynamic programming this reduces to O(nh^2)
• If the model is a tree and has certain edge costs, then complexity reduces to O(nh) using a distance transform [Felzenszwalb & Huttenlocher, 2000, 2005]
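The O(h^n) → O(nh^2) reduction can be illustrated with a minimal dynamic program over a chain of parts (the simplest tree); the part costs here are illustrative, not from the papers.

```python
def best_pose_chain(unary, pairwise):
    """Minimum-cost placement of n parts at h candidate locations on a
    chain.  Exhaustive search over joint configurations costs O(h^n);
    passing best-cost messages part by part costs O(n * h^2)."""
    n, h = len(unary), len(unary[0])
    cost = list(unary[0])                      # best cost with part 0 placed
    for p in range(1, n):
        cost = [unary[p][j] +
                min(cost[i] + pairwise(i, j) for i in range(h))
                for j in range(h)]
    return min(cost)
```

With quadratic or absolute-difference pairwise costs, the inner min over i is what the distance transform computes in O(h), giving the O(nh) bound mentioned above.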

Page 62: Ivan Laptev "Action recognition"

Kinematic structure vs graphical (independence) structure

Graph G = (V, E)

[Figure: tree-structured body-part graph vs a graph with additional edges]

Requires more connections than a tree.

Page 63: Ivan Laptev "Action recognition"

More recent work on human pose estimation

D. Ramanan. Learning to parse images of articulated bodies. NIPS, 2007
• Learns image- and person-specific unary terms
• initial iteration: edges
• following iterations: edges & colour

V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR, 2008/2009
• (Almost) unconstrained images
• Person detector & foreground highlighting

P. Buehler, M. Everingham and A. Zisserman. Learning sign language by watching TV. In Proc. CVPR 2009
• Learns with weak textual annotation
• Multiple instance learning

Page 64: Ivan Laptev "Action recognition"

Pose estimation is a very active research area

Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. CVPR 2011
• Extension of the LSVM model of Felzenszwalb et al.

Y. Wang, D. Tran and Z. Liao. Learning Hierarchical Poselets for Human Parsing. In Proc. CVPR 2011.
• Builds on the Poselets idea of Bourdev et al.

S. Johnson and M. Everingham. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. CVPR 2011.
• Learns from lots of noisy annotations

B. Sapp, D. Weiss and B. Taskar. Parsing Human Motion with Stretchable Models. In Proc. CVPR 2011.
• Explores temporal continuity

Page 65: Ivan Laptev "Action recognition"

Pose estimation is a very active research area

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. Best paper award at CVPR 2011
• Exploits lots of synthesized depth images for training

Page 66: Ivan Laptev "Action recognition"

Pose Search

[Figure: query pose Q and retrieved frames]

V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009

Page 67: Ivan Laptev "Action recognition"

Pose Search

[Figure: query pose Q and retrieved frames]

V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR 2009

Page 68: Ivan Laptev "Action recognition"

Application

Learning sign language by watching TV (using weakly aligned subtitles)

Patrick Buehler, Mark Everingham, Andrew Zisserman

CVPR 2009

Page 69: Ivan Laptev "Action recognition"

Objective

Learn signs in British Sign Language (BSL) corresponding to text words:
• Training data from TV broadcasts with simultaneous signing
• Supervision solely from subtitles

Input: video + subtitles
Output: automatically learned signs (4x slow motion), e.g. "Office", "Government"

Use subtitles to find the video sequences containing a word. These are the positive training sequences. Use other sequences as negative training sequences.

Page 70: Ivan Laptev "Action recognition"

Given an English word, e.g. "tree", what is the corresponding British Sign Language sign?

[Figure: positive sequences and negative set]

Page 71: Ivan Laptev "Action recognition"

Use a sliding window to choose a sub-sequence of poses in one positive sequence, and determine if the same sub-sequence of poses occurs somewhere in the other positive sequences but does not occur in the negative set.

[Figure: 1st sliding window over positive sequences and negative set]

Page 72: Ivan Laptev "Action recognition"

Use a sliding window to choose a sub-sequence of poses in one positive sequence, and determine if the same sub-sequence of poses occurs somewhere in the other positive sequences but does not occur in the negative set.

[Figure: 5th sliding window over positive sequences and negative set]

Page 73: Ivan Laptev "Action recognition"

Multiple instance learning

[Figure: positive bags and a negative bag; the sign of interest occurs in every positive bag]
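The sliding-window search described above can be distilled into a toy sketch: treat each sequence as a bag of (quantized) windows and keep the candidates common to all positive bags and absent from the negative set. This is only an illustration; the real system scores windows of pose descriptors, not exact tokens.

```python
def find_common_sign(positive_bags, negative_bag):
    """Toy multiple-instance-learning intuition: the sign for a word
    is a window that occurs in every positive bag (sequences whose
    subtitles contain the word) but never in the negative bag."""
    candidates = set(positive_bags[0])
    for bag in positive_bags[1:]:
        candidates &= set(bag)        # must occur in every positive bag
    return candidates - set(negative_bag)
```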

Page 74: Ivan Laptev "Action recognition"

Example

Learn signs in British Sign Language (BSL) corresponding to text words.

Page 75: Ivan Laptev "Action recognition"

Evaluation

Good results for a variety of signs:
• Signs where hand movement is important: Navy, Prince
• Signs where hand shape is important: Lung, Garden
• Signs where both hands are together: Fungi, Golf
• Signs which are finger-spelled: Kew, Bob
• Signs which are performed in front of the face: Whale, Rose

Page 76: Ivan Laptev "Action recognition"

What is missed?

truncation is not modelled

Page 77: Ivan Laptev "Action recognition"

What is missed?

occlusion is not modelled

Page 78: Ivan Laptev "Action recognition"

Modelling person-object-pose interactions

W. Yang, Y. Wang and Greg Mori. Recognizing Human Actions from Still Images with Latent Poses. In Proc. CVPR 2010.
• Some limbs may not be important for recognizing a particular action (e.g. sitting)

B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. In Proc. CVPR 2010.
• Pose estimation helps object detection and vice versa

Page 79: Ivan Laptev "Action recognition"

Towards functional object understanding

A. Gupta, S. Satkin, A.A. Efros and M. Hebert. From 3D Scene Geometry to Human Workspace. In Proc. CVPR 2011
• Predicts the "workspace" of a human

H. Grabner, J. Gall and L. Van Gool. What Makes a Chair a Chair? In Proc. CVPR 2011

Page 80: Ivan Laptev "Action recognition"

Conclusions: Human poses

• Exciting progress in pose estimation in realistic still images and video
• Industry-strength pose estimation from depth sensors
• Pose estimation from RGB is still very challenging
• Human Poses ≠ Human Actions!

Page 81: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 82: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 83: Ivan Laptev "Action recognition"

How to recognize actions?

Page 84: Ivan Laptev "Action recognition"

Action understanding: Key components

Image measurements:
• Foreground segmentation
• Image gradients
• Optical flow
• Local space-time features

Prior knowledge:
• Deformable contour models
• 2D/3D body models
• Motion priors
• Background models

Association: automatic inference, learning the associations between image measurements and action labels from strong / weak supervision.

Page 85: Ivan Laptev "Action recognition"

Foreground segmentation

Image differencing: a simple way to measure motion/change, |I(t+1) - I(t)| > Const.

Better background / foreground separation methods exist:
• Modeling of color variation at each pixel with a Gaussian Mixture
• Dominant motion compensation for sequences with a moving camera
• Motion layer separation for scenes with non-static backgrounds
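A minimal sketch of the image-differencing step (the threshold value is illustrative):

```python
import numpy as np

def motion_mask(prev_frame, cur_frame, thresh=25):
    """Image differencing: mark a pixel as foreground when its absolute
    intensity change between consecutive frames exceeds a constant
    threshold, |I_t - I_{t-1}| > Const."""
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > thresh
```

The cast to a signed type avoids unsigned-integer wrap-around when subtracting 8-bit frames.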

Page 86: Ivan Laptev "Action recognition"

Temporal Templates

Idea: summarize motion in video in a Motion History Image (MHI).

Descriptor: Hu moments of different orders.

[A.F. Bobick and J.W. Davis, PAMI 2001]
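A rough sketch of the MHI update in the spirit of Bobick & Davis (the decay step and tau value are illustrative; the paper also gives a timestamp-based formulation):

```python
import numpy as np

def update_mhi(mhi, moving, tau=15):
    """One Motion History Image update: pixels that move are set to
    tau, all others decay by one, so recent motion is bright and older
    motion fades out across the template."""
    return np.where(moving, tau, np.maximum(mhi - 1, 0))
```

The resulting image is then summarized by Hu moments, which are invariant to translation, scale, and rotation.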

Page 87: Ivan Laptev "Action recognition"

Aerobics dataset

Nearest Neighbor classifier: 66% accuracy

Page 88: Ivan Laptev "Action recognition"

Temporal Templates: Summary

Pros:
+ Simple and fast
+ Works in controlled settings

Cons:
- Prone to errors of background subtraction (variations in light, shadows, clothing… what is the background here?)
- Does not capture interior motion and shape (the silhouette tells little about actions)

Not all shapes are valid => restrict the space of admissible silhouettes.

Page 89: Ivan Laptev "Action recognition"

Active Shape Models

Point Distribution Model:
• Represent the shape of samples by a set of corresponding points or landmarks
• Assume each shape x can be represented by a linear combination of basis shapes, x = x̄ + P b, for the mean shape x̄, basis matrix P, and some parameter vector b

[Cootes et al. 1995]
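A minimal Point Distribution Model sketch: learn the mean shape and eigen-shapes via SVD, then regularize a shape by projecting it onto the learned space (function names are illustrative, not from the paper):

```python
import numpy as np

def fit_shape_model(shapes, k):
    """Mean shape plus the top-k eigen-shapes (principal components)
    of a set of flattened landmark vectors."""
    X = np.asarray(shapes, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]                  # rows of Vt[:k] are basis shapes

def project_to_shape_space(x, mean, basis):
    """Regularization: b = P (x - mean), reconstruction mean + P^T b."""
    b = basis @ (x - mean)
    return mean + basis.T @ b
```

Components of the input that lie outside the learned shape space are discarded, which is exactly the regularization effect shown on the next slides.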

Page 90: Ivan Laptev "Action recognition"

Active Shape Models

[Figure: distribution of eigenvalues]

A small fraction of basis shapes (eigenvectors) accounts for most of the shape variation (=> landmarks are redundant).

Three main modes of lips-shape variation are shown.

Page 91: Ivan Laptev "Action recognition"

Active Shape Models: effect of regularization

Projection onto the shape space serves as a regularization.

Page 92: Ivan Laptev "Action recognition"

Active Shape Models [Cootes et al.]

Constrains shape deformation in PCA-projected space.

Example: face alignment. Illustration of face shape space.

Active Shape Models: Their Training and ApplicationActive Shape Models: Their Training and Application[T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, CVIU 1995]

Page 93: Ivan Laptev "Action recognition"

Person Tracking

Learning flexible models from image sequences[A. Baumberg and D. Hogg, ECCV 1994]

Page 94: Ivan Laptev "Action recognition"

Learning dynamic prior

Dynamic model: 2nd order Auto-Regressive Process

State: x_t
Update rule: x_t = A_1 x_{t-1} + A_2 x_{t-2} + B w_t (w_t: white noise)
Model parameters: A_1, A_2, B
Learning scheme: maximum-likelihood fit to training sequences
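A sketch of the deterministic part of such an AR(2) model (the learned prior also injects process noise through the B w_t term, omitted here):

```python
def simulate_ar2(a1, a2, x0, x1, steps):
    """Second-order auto-regressive recursion used as a motion prior:
    x_t = a1 * x_{t-1} + a2 * x_{t-2}, started from x0, x1."""
    xs = [x0, x1]
    for _ in range(steps):
        xs.append(a1 * xs[-1] + a2 * xs[-2])
    return xs
```

With a1 = 0 and a2 = -1 the recursion sustains an oscillation, which is the kind of periodic behaviour such priors capture for repetitive motions.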

Page 95: Ivan Laptev "Action recognition"

Learning dynamic prior

Learning point sequence; random simulation of the learned dynamical model.

[A. Blake, B. Bascle, M. Isard and J. MacCormick, Phil.Trans.R.Soc. 1998]

Page 96: Ivan Laptev "Action recognition"

Learning dynamic prior

Random simulation of the learned gate dynamics.

Page 97: Ivan Laptev "Action recognition"

Motion priors

• Constrain temporal evolution of shape
• Help accurate tracking
• Recognize actions

Goal: formulate motion models for different types of actions and use such models for action recognition.

Example: drawing with 3 action modes: line drawing, scribbling, idle.

[M. Isard and A. Blake, ICCV 1998]

Page 98: Ivan Laptev "Action recognition"

Lecture overview

Motivation
• Historic review
• Applications and challenges

Human Pose Estimation
• Pictorial structures
• Recent advances

Appearance-based methods
• Motion history images
• Active shape models & motion priors

Motion-based methods
• Generic and parametric optical flow
• Motion templates

Page 99: Ivan Laptev "Action recognition"


Page 100: Ivan Laptev "Action recognition"

Shape and Appearance vs. Motion

Shape and appearance in images depend on many factors: clothing, illumination contrast, image resolution, etc.


Motion field (in theory) is invariant to shape and can be used directly to describe human actions

[Efros et al. 2003]

Page 101: Ivan Laptev "Action recognition"

Shape and Appearance vs. Motion

Moving Light Displays

Gunnar Johansson, Perception and Psychophysics, 1973

Page 102: Ivan Laptev "Action recognition"

Motion estimation: Optical Flow

A classic problem of computer vision [Gibson 1955]

Goal: estimate motion field

How? We only have access to image pixels: estimate pixel-wise correspondence between frames = optical flow

Brightness change assumption: corresponding pixels preserve their intensity (color)

Useful assumption in many cases, but breaks at occlusions and illumination changes

Physical and visual motion may be different

Page 103: Ivan Laptev "Action recognition"

Generic Optical Flow

Brightness Change Constraint Equation (BCCE): I_x u + I_y v + I_t = 0, where (I_x, I_y) is the image gradient and (u, v) the optical flow.

One equation, two unknowns => cannot be solved directly

Integrate several measurements in the local neighborhood and obtain a least-squares solution [Lucas & Kanade 1981]

The second-moment matrix (the same one used to compute Harris interest points!) integrates gradient products over a spatial (or spatio-temporal) neighborhood of a point
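The least-squares solution above can be sketched directly: accumulate the second-moment matrix over a window and solve the 2x2 system. A minimal single-point sketch (assumed window size and finite-difference gradients, not the original Lucas-Kanade implementation):

```python
import numpy as np

def lucas_kanade(I0, I1, x, y, r=2):
    """Estimate flow (u, v) at pixel (x, y) from the BCCE
    Ix*u + Iy*v + It = 0, integrated over a (2r+1)^2 window."""
    Ix = np.gradient(I0, axis=1)
    Iy = np.gradient(I0, axis=0)
    It = I1 - I0
    win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    # Second-moment matrix: same structure as in Harris corner detection
    A = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(A, b)  # least-squares flow (u, v)
```

Shifting a smooth pattern by one pixel in x should yield a flow estimate close to (1, 0), up to discretization error.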

Page 104: Ivan Laptev "Action recognition"


Page 105: Ivan Laptev "Action recognition"

Generic Optical Flow

The least-squares solution assumes:

1. Brightness change constraint holds in the neighborhood
2. Sufficient variation of image gradient in the neighborhood
3. Approximately constant motion in the neighborhood

Motion estimation becomes inaccurate if any of assumptions 1-3 is violated.

Solutions:

(2) Insufficient gradient variation (known as the aperture problem): increase the integration neighborhood

(3) Non-constant motion: use more sophisticated motion models

Page 106: Ivan Laptev "Action recognition"

Recent methods for OF estimation

T. Brox, A. Bruhn, N. Papenberg, J. Weickert. High accuracy optical flow estimation based on a theory for warping. ECCV 2004.
Code on-line: http://lmb.informatik.uni-freiburg.de/resources/binaries

C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD Thesis, Massachusetts Institute of Technology, May 2009.
Code on-line: http://people.csail.mit.edu/celiu/OpticalFlow

Page 107: Ivan Laptev "Action recognition"


Page 108: Ivan Laptev "Action recognition"

Parameterized Optical Flow

Another extension of the constant motion model is to compute PCA basis flow fields from training examples:

1. Compute standard optical flow for many examples
2. Put velocity components into one vector
3. Do PCA and obtain the most informative PCA flow basis vectors

Training samples → PCA flow bases

[M.J. Black, Y. Yacoob, A.D. Jepson and D.J. Fleet, CVPR 1997]
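Steps 1-3 above can be sketched with an SVD; a minimal illustration (assumed helper names, not the Black et al. code), where each training flow field is flattened into one row:

```python
import numpy as np

def pca_flow_bases(flows, n_bases=2):
    """Stack (u, v) flow fields into vectors and extract PCA flow bases.
    flows: list of (H, W, 2) arrays -> returns mean and (n_bases x 2HW) basis."""
    X = np.stack([f.ravel() for f in flows])  # one flow field per row
    mean = X.mean(axis=0)
    # SVD of the centered data gives the principal flow directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_bases]

def flow_coefficients(flow, mean, bases):
    """Project a flow field onto the PCA bases -> low-dimensional descriptor."""
    return bases @ (flow.ravel() - mean)
```

The resulting coefficient vector is exactly the kind of compact action descriptor the next slide uses.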

Page 109: Ivan Laptev "Action recognition"

Parameterized Optical Flow

Estimated coefficients of PCA flow bases can be used as action descriptors

Frame numbers

Optical flow seems to be an interesting descriptor for motion/action recognition

Page 110: Ivan Laptev "Action recognition"

Spatial Motion Descriptor

Image frame → optical flow F = (F_x, F_y) → half-wave rectified channels F_x+, F_x-, F_y+, F_y- → blurred channels Fb_x+, Fb_x-, Fb_y+, Fb_y-

A. A. Efros, A.C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In Proc. ICCV 2003
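The channel construction above (half-wave rectification followed by blurring) can be sketched as follows; this is an assumed minimal version using a box filter as a stand-in for the Gaussian blur of Efros et al.:

```python
import numpy as np

def motion_descriptor(Fx, Fy, blur=1):
    """Split the flow into half-wave rectified channels Fx+, Fx-, Fy+, Fy-
    and blur each channel (box filter here; Efros et al. use Gaussian blur)."""
    channels = [np.maximum(Fx, 0), np.maximum(-Fx, 0),
                np.maximum(Fy, 0), np.maximum(-Fy, 0)]
    out = []
    for c in channels:
        acc = np.zeros_like(c)
        n = 0
        for dy in range(-blur, blur + 1):
            for dx in range(-blur, blur + 1):
                # average of shifted copies = box blur (wrap-around borders)
                acc += np.roll(np.roll(c, dy, axis=0), dx, axis=1)
                n += 1
        out.append(acc / n)
    return out
```

Rectification keeps opposite motion directions in separate channels, so blurring cannot cancel them against each other.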

Page 111: Ivan Laptev "Action recognition"

Spatio-Temporal Motion Descriptor

Temporal extent E: descriptors of sequences A and B are compared over time t

frame-to-frame similarity matrix → blurred and aggregated over the temporal extent E → motion-to-motion similarity matrix

Page 112: Ivan Laptev "Action recognition"

Football Actions: matching

Input sequence

Matched Frames

input matched

Page 113: Ivan Laptev "Action recognition"

Football Actions: classification

10 actions; 4500 total frames; 13-frame motion descriptor

Page 114: Ivan Laptev "Action recognition"

Classifying Ballet Actions

16 actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.

Page 115: Ivan Laptev "Action recognition"

Classifying Tennis Actions

6 actions; 4600 frames; 7-frame motion descriptor. Woman player used as training, man as testing.

Page 116: Ivan Laptev "Action recognition"

Classifying Tennis Actions

Red bars illustrate classification confidence for each action[A. A. Efros, A. C. Berg, G. Mori, J. Malik, ICCV 2003]

Page 117: Ivan Laptev "Action recognition"

Motion recognition without motion estimation

• Motion estimation from video is often noisy/unreliable
• Measure motion consistency between a template and test video

[Schechtman and Irani, PAMI 2007]

Page 118: Ivan Laptev "Action recognition"

Motion recognition without motion estimation

Test video

Template video Correlation result

[Schechtman and Irani, PAMI 2007]

Page 119: Ivan Laptev "Action recognition"

Motion recognition without motion estimation

Test video

Template video Correlation result

[Schechtman and Irani, PAMI 2007]

Page 120: Ivan Laptev "Action recognition"

Motion-based template matching

Pros:
+ Depends less on variations in appearance

Cons:
- Can be slow
- Does not model negatives

Improvements possible using discriminatively-trained template-based action classifiers

Page 121: Ivan Laptev "Action recognition"

Where are we so far?

Body-pose estimation; shape models with(out) motion priors; optical-flow-based methods; motion templates

Page 122: Ivan Laptev "Action recognition"

What’s coming next?

Local space-time features
Discriminatively-trained action templates
Bag-of-features action classification
Weakly supervised training using video scripts
Action classification in realistic videos

Page 123: Ivan Laptev "Action recognition"

Goal: interpret complex dynamic scenes

Common methods: segmentation, tracking

Common problems: complex & changing background, changing appearance

=> No global assumptions about the scene

Page 124: Ivan Laptev "Action recognition"

Space-time

No global assumptions: consider local spatio-temporal neighborhoods

boxing, hand waving

Page 125: Ivan Laptev "Action recognition"

Actions == space-time objects?

Page 126: Ivan Laptev "Action recognition"

Local approach: Bag of Visual Words

Airplanes

Motorbikes

Faces

Wild Cats

Leaves

People

Bikes

Page 127: Ivan Laptev "Action recognition"

Space-time local features

Page 128: Ivan Laptev "Action recognition"

Space-Time Interest Points: Detection

What neighborhoods to consider? Distinctive neighborhoods with high image variation in space and time: look at the distribution of the space-time gradient.

Definitions:
- original image sequence
- space-time Gaussian with covariance
- Gaussian derivative
- space-time gradient
- second-moment matrix

Page 129: Ivan Laptev "Action recognition"

Space-Time Interest Points: Detection

Properties of the second-moment matrix μ: it defines a second-order approximation of the local distribution of the space-time gradient within a neighborhood.

- 1D space-time variation, e.g. moving bar
- 2D space-time variation, e.g. moving ball
- 3D space-time variation, e.g. jumping ball

Large eigenvalues of μ can be detected by the local maxima of H over (x, y, t):

H = det(μ) - k trace³(μ)

(similar to the Harris operator [Harris and Stephens, 1988])
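The cornerness criterion H is a one-liner once the 3x3 spatio-temporal second-moment matrix is available; the value k = 0.005 used here is an assumption, not taken from the slide:

```python
import numpy as np

def harris3d_response(mu, k=0.005):
    """Space-time cornerness H = det(mu) - k * trace(mu)^3 for a 3x3
    spatio-temporal second-moment matrix mu. H is large only when all
    three eigenvalues (gradient variation along x, y and t) are large."""
    return np.linalg.det(mu) - k * np.trace(mu) ** 3
```

A point with full 3D space-time variation scores positive, while a 1D variation (e.g. a moving bar) is rejected, which is exactly the selectivity the detector is after.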

Page 130: Ivan Laptev "Action recognition"

Space-Time interest points

Velocity changes, appearance/disappearance, split/merge

Page 131: Ivan Laptev "Action recognition"

Space-Time Interest Points: Examples

Motion event detection

Page 132: Ivan Laptev "Action recognition"

Spatio-temporal scale selection

Stability to size changes, e.g. camera zoom

Page 133: Ivan Laptev "Action recognition"

SpatioSpatio--temporal scale selectiontemporal scale selectionSpatioSpatio temporal scale selectiontemporal scale selection

Selection of t l ltemporal scales captures the frequency of events

Page 134: Ivan Laptev "Action recognition"

Local features for human actions

Page 135: Ivan Laptev "Action recognition"

Local features for human actionsLocal features for human actionsboxing

walking

hand waving

Page 136: Ivan Laptev "Action recognition"

Local space-time descriptor: HOG/HOF

Multi-scale space-time patches

Histogram of oriented spatial gradients (HOG); histogram of optical flow (HOF)

3x3x2x4-bins HOG descriptor; 3x3x2x5-bins HOF descriptor

Public code available at www.irisa.fr/vista/actions

Page 137: Ivan Laptev "Action recognition"

Visual Vocabulary: K-means clustering

Group similar points in the space of image descriptors using K-means clustering

Select significant clusters c1, c2, c3, c4

Clustering → Classification
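The vocabulary construction and quantization steps above can be sketched with plain Lloyd's k-means; a minimal illustration (assumed function names, deterministic first-k initialization instead of the usual random or k-means++ seeding):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means: cluster descriptors into k visual words.
    Initializes with the first k points for determinism (k-means++ is
    a better choice in practice)."""
    centers = X[:k].astype(float).copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(descriptors, centers):
    """Quantize descriptors to the nearest visual word and return a
    normalized bag-of-words histogram."""
    d = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
    h = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return h / h.sum()
```

The normalized histogram is the fixed-length video representation that the classifier consumes, regardless of how many features a clip produced.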

Page 138: Ivan Laptev "Action recognition"


Page 139: Ivan Laptev "Action recognition"

Local space-time features: Matching

Find similar events in pairs of video sequences

Page 140: Ivan Laptev "Action recognition"

Action Classification: Overview

Bag of space-time features + multi-channel SVM [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]

Collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → multi-channel SVM classifier

Page 141: Ivan Laptev "Action recognition"

Action recognition in KTH dataset

Sample frames from the KTH actions sequences, all six classes (columns) and scenarios (rows) are presented

Page 142: Ivan Laptev "Action recognition"

Classification results on KTH dataset

Confusion matrix for KTH actions

Page 143: Ivan Laptev "Action recognition"

What about 3D?

Local motion and appearance features are not invariant to view changes

Page 144: Ivan Laptev "Action recognition"

Multi-view action recognition

Difficult to apply standard multi-view methods:

We do not want to search for multi-view point correspondence (non-rigid motion, clothing changes, …): it's hard!

We do not want to identify body parts: current methods are not reliable enough.

Yet, we want to learn actions from one view and recognize actions in very different views

Page 145: Ivan Laptev "Action recognition"

Temporal self-similarities

Idea: cross-view matching is hard, but cross-time matching (tracking) is relatively easy. Measure self-(dis)similarities across time.

Example: distance matrix / self-similarity matrix (SSM) for points P1, P2

Page 146: Ivan Laptev "Action recognition"

Temporal self-similarities: Multi-view

Side view vs. top view: the SSMs appear very similar despite the view change!

Intuition:
1. Distance between similar poses is low in any view
2. Distance among different poses is likely to be large in most views

Page 147: Ivan Laptev "Action recognition"

Temporal self-similarities: MoCap

Self-similarities can be measured from Motion Capture (MoCap) data (person 1, person 2)

Page 148: Ivan Laptev "Action recognition"

Temporal self-similarities: Video

Self-similarities can be measured directly from video: HOG or optical flow descriptors in image frames

Page 149: Ivan Laptev "Action recognition"

Self-similarity descriptor

Goal: define a quantitative measure to compare self-similarity matrices

Define a local SIFT-like histogram descriptor h_i, computed on the SSM, for each point i on the diagonal.

Sequence alignment: dynamic programming for two sequences of descriptors {h_i}, {h_j}

Action recognition: visual vocabulary for h, BoF representation of {h_i}, SVM
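The SSM itself is just a pairwise distance matrix over per-frame descriptors; a minimal sketch (assumed function name, Euclidean distance):

```python
import numpy as np

def self_similarity_matrix(frames):
    """Euclidean distance matrix between per-frame descriptors:
    SSM[i, j] = ||d_i - d_j||. Symmetric with a zero diagonal."""
    D = np.asarray(frames, dtype=float)
    sq = (D ** 2).sum(1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped for numerical safety
    ssm = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * D @ D.T, 0))
    return ssm
```

For a periodic motion the SSM shows the characteristic diagonal-stripe pattern: frames one period apart land near zero.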

Page 150: Ivan Laptev "Action recognition"

Multi-view alignment

Page 151: Ivan Laptev "Action recognition"

Multi-view action recognition: Video

SSM-based recognition vs. an alternative view-dependent method (STIP)

Page 152: Ivan Laptev "Action recognition"

What are Human Actions?

Actions in recent datasets: is it just about kinematics? Should actions be defined by their purpose?

Kinematics + Objects

What are Human Actions?

Kinematics + Objects + Scenes

Page 154: Ivan Laptev "Action recognition"
Page 155: Ivan Laptev "Action recognition"

Action recognition in realistic settings

Standard action datasets vs. actions “in the wild”

Page 156: Ivan Laptev "Action recognition"

Action Dataset and Annotation

Manual annotation of drinking actions in movies: “Coffee and Cigarettes”; “Sea of Love”

“Drinking”: 159 annotated samples; “Smoking”: 149 annotated samples

Temporal annotation: first frame, keyframe, last frame
Spatial annotation: head rectangle, torso rectangle

Page 157: Ivan Laptev "Action recognition"

“Drinking” action samples

Page 158: Ivan Laptev "Action recognition"

Action representation

Histogram of gradient; histogram of optic flow

Page 159: Ivan Laptev "Action recognition"

Action learning

pre-aligned samples → features → weak classifier → boosting → selected features

AdaBoost:
• Efficient discriminative classifier [Freund&Schapire’97]
• Good performance for face detection [Viola&Jones’01]

Weak classifiers: Haar features with an optimal threshold; histogram features with a Fisher discriminant
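The boosting loop above can be sketched with discrete AdaBoost over threshold stumps (the "optimal threshold on one feature" weak classifier); this is an assumed minimal version, not the detector used in the lecture:

```python
import numpy as np

def adaboost_stumps(X, y, rounds=5):
    """Discrete AdaBoost with threshold stumps. X: (n, d) features,
    y: labels in {-1, +1}. Returns a list of (alpha, feature, thr, sign)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):                      # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = max(err, 1e-12)                   # avoid log(0) for perfect stumps
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified samples
        w /= w.sum()
        model.append((alpha, j, thr, sign))
    return model

def predict(model, X):
    score = sum(a * np.where(s * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, s in model)
    return np.where(score >= 0, 1, -1)
```

Each round re-weights the training set, so later stumps concentrate on the samples earlier stumps got wrong.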

Page 160: Ivan Laptev "Action recognition"

KeyKey--frame action classifierframe action classifier

b ti

selected features

���

boosting

weak classifier2D HOG f

• Efficient discriminative classifier [Freund&Schapire’97]G d f f f d t ti [Vi l &J ’01]AdaBoost:

2D HOG features

• Good performance for face detection [Viola&Jones’01]AdaBoost:

Haar features

optimal thresholdpre-aligned samples featuressamples

Fisher discriminant

Histogram features

see [Laptev BMVC’06]for more details

[Laptev, Pérez 2007]

Page 161: Ivan Laptev "Action recognition"

Keyframe priming

Training: positive training sample; negative training samples = false positives of a static HOG action detector

Test

Page 162: Ivan Laptev "Action recognition"

Action detection

Test set:
• 25 min from “Coffee and Cigarettes” with ground truth: 38 drinking actions
• No overlap with the training set in subjects or scenes

Detection:
• search over all space-time locations and spatio-temporal extents

Keyframe priming vs. no keyframe priming

Page 163: Ivan Laptev "Action recognition"

Action Detection (ICCV 2007)

Test episodes from the movie “Coffee and Cigarettes”

Video available at http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

Page 164: Ivan Laptev "Action recognition"

20 most confident detections

Page 165: Ivan Laptev "Action recognition"

Learning Actions from Movies

• Realistic variation of human actions
• Many classes and many examples per class

Problems:
• Typically only a few class samples per movie
• Manual annotation is very time consuming

Page 166: Ivan Laptev "Action recognition"

Automatic video annotation with scripts

• Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment

Subtitles:

1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
lt wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.

Movie script (aligned to 01:20:17 - 01:20:23):

Rick sits down with Ilsa.

RICK: Why weren't you honest with me? Why did you keep your marriage a secret?

ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
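The time-transfer step above can be sketched as parsing subtitle timestamps and matching script lines to subtitle text; this is a crude word-overlap stand-in for the alignment actually used, with assumed function names and data layout:

```python
import re

def parse_srt_time(ts):
    """Convert an SRT timestamp like '01:20:17,240' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def align(script_lines, subtitles):
    """Assign each script dialogue line the time span of the
    best-matching subtitle, scored by word overlap."""
    def words(t):
        return set(re.findall(r"[a-z']+", t.lower()))
    out = []
    for line in script_lines:
        best = max(subtitles, key=lambda s: len(words(line) & words(s["text"])))
        out.append((line, best["start"], best["end"]))
    return out
```

Real systems align whole subtitle and script sequences monotonically (e.g. by dynamic programming) rather than matching each line independently.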

Page 167: Ivan Laptev "Action recognition"

Script-based action annotation

On the good side:
• Realistic variation of actions: subjects, views, etc.
• Many examples per class, many classes
• No extra overhead for new classes
• Actions, objects, scenes and their combinations
• Character names may be used to resolve “who is doing what?”

Problems:
• No spatial localization
• Temporal localization may be poor
• Missing actions: e.g. scripts do not always follow the movie
• Annotation is incomplete, not suitable as ground truth for testing action detection
• Large within-class variability of action classes in text

Page 168: Ivan Laptev "Action recognition"

Script alignment: Evaluation

• Annotate action samples in text
• Do automatic script-to-video alignment
• Check the correspondence of actions in scripts and movies

Example of a “visual false positive”: “A black car pulls up, two army officers get out.”

a: quality of subtitle-script matching

Page 169: Ivan Laptev "Action recognition"

Text-based action retrieval

• Large variation of action expressions in text. GetOutCar action:
“… Will gets out of the Chevrolet. …”  “… Erin exits her new truck…”

Potential false positives: “…About to sit down, he freezes…”

• => Supervised text classification approach

Page 170: Ivan Laptev "Action recognition"

Automatically annotated action samples

[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Page 171: Ivan Laptev "Action recognition"

Hollywood-2 actions dataset

Training and test samples are obtained from 33 and 36 distinct movies, respectively.

Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2

[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Page 172: Ivan Laptev "Action recognition"

Action Classification: Overview

Bag of space-time features + multi-channel SVM [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]

Collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → multi-channel SVM classifier

Page 173: Ivan Laptev "Action recognition"

Action classification (CVPR08)

Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

Page 174: Ivan Laptev "Action recognition"

Evaluation of local features for action recognition

• Local features provide a popular approach to video description for action recognition:
– ~50% of recent action recognition methods (CVPR’09, ICCV’09, BMVC’09) are based on local features
– A large variety of feature detectors and descriptors is available
– Very limited and inconsistent comparison of different features

Goal:
• Systematic evaluation of local feature-descriptor combinations
• Compare performance on common datasets
• Propose improvements

Page 175: Ivan Laptev "Action recognition"

Evaluation of local features for action recognition

• Evaluation study [Wang et al. BMVC’09]
– Common recognition framework:
• Same datasets (varying difficulty): KTH, UCF Sports, Hollywood2
• Same train/test data
• Same classification method
– Alternative local feature detectors and descriptors from recent literature
– Comparison of different detector-descriptor combinations

Page 176: Ivan Laptev "Action recognition"

Action recognition framework

Bag of space-time features + SVM [Schuldt’04, Niebles’06, Zhang’07]

Extraction of local features (space-time patches) → feature description → feature quantization (K-means clustering, k=4000) → occurrence histogram of visual words → non-linear SVM with χ² kernel
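The χ² kernel in the last stage has a simple closed form over histogram pairs; a minimal sketch (assumed function name; the bandwidth gamma is typically set from the mean χ² distance on the training set):

```python
import numpy as np

def chi2_kernel(A, B, gamma=1.0):
    """Exponential chi-square kernel between rows of histogram matrices:
    K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    X = A[:, None, :].astype(float)
    Y = B[None, :, :].astype(float)
    num = (X - Y) ** 2
    den = X + Y
    # 0/0 terms (both bins empty) contribute nothing to the distance
    d = np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0).sum(-1)
    return np.exp(-gamma * d)
```

The precomputed kernel matrix can then be fed to any kernel SVM (e.g. a solver accepting precomputed Gram matrices), which is how non-linear χ² SVMs are usually trained.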

Page 177: Ivan Laptev "Action recognition"

Local feature detectors/descriptors

• Four types of detectors:
– Harris3D [Laptev’05]
– Cuboids [Dollar’05]
– Hessian [Willems’08]
– Regular dense sampling

• Four types of descriptors:
– HoG/HoF [Laptev’08]
– Cuboids [Dollar’05]
– HoG3D [Kläser’08]
– Extended SURF [Willems’08]

Page 178: Ivan Laptev "Action recognition"

Illustration of ST detectors

Harris3D, Hessian, Cuboid, Dense

Page 179: Ivan Laptev "Action recognition"

Dataset: KTH-Actions

• 6 action classes by 25 persons in 4 different scenarios
• Total of 2391 video samples
• Performance measure: average accuracy over all classes

Page 180: Ivan Laptev "Action recognition"

UCF-Sports -- samples

• 10 different action classes
• 150 video samples in total (the dataset is extended by flipping videos)
• Evaluation method: leave-one-out
• Performance measure: average accuracy over all classes

Diving Kicking Walking

Skateboarding High-Bar-Swinging Golf-Swinging

Page 181: Ivan Laptev "Action recognition"

Dataset: Hollywood2

• 12 different action classes from 69 Hollywood movies
• 1707 video sequences in total
• Separate movies for training / testing
• Performance measure: mean average precision (mAP) over all classes

GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar

Page 182: Ivan Laptev "Action recognition"

KTH-Actions -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                   89.0%      90.0%     84.6%     85.3%
HOG/HOF                 91.8%      88.7%     88.7%     86.1%
HOG                     80.9%      82.3%     77.7%     79.0%
HOF                     92.1%      88.2%     88.6%     88.0%
Cuboids                 -          89.1%     -         -
E-SURF                  -          -         81.4%     -

• Best results for sparse Harris3D + HOF
• Good results for Harris3D and Cuboid detectors with HOG/HOF and HOG3D descriptors
• Dense features perform relatively poorly compared to sparse features

Page 183: Ivan Laptev "Action recognition"

UCF-Sports -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                   79.7%      82.9%     79.0%     85.6%
HOG/HOF                 78.1%      77.7%     79.3%     81.6%
HOG                     71.4%      72.7%     66.0%     77.4%
HOF                     75.4%      76.7%     75.3%     82.6%
Cuboids                 -          76.6%     -         -
E-SURF                  -          -         77.3%     -

• Best results for Dense + HOG3D
• Good results for Dense and HOG/HOF
• Cuboids: good performance with HOG3D

Page 184: Ivan Laptev "Action recognition"

Hollywood2 -- results

Descriptor \ Detector   Harris3D   Cuboids   Hessian   Dense
HOG3D                   43.7%      45.7%     41.3%     45.3%
HOG/HOF                 45.2%      46.2%     46.0%     47.4%
HOG                     32.8%      39.4%     36.2%     39.4%
HOF                     43.3%      42.9%     43.0%     45.5%
Cuboids                 -          45.0%     -         -
E-SURF                  -          -         38.2%     -

• Best results for Dense + HOG/HOF
• Good results for HOG/HOF

Page 185: Ivan Laptev "Action recognition"

Evaluation summary

• Dense sampling consistently outperforms all the tested sparse features in realistic settings (UCF + Hollywood2)
– Importance of realistic video data
– Limitations of current feature detectors
– Note: large number of features (15-20 times more)

• Sparse features provide more or less similar results (sparse features better than dense on KTH)

• Descriptors' performance: a combination of gradients + optical flow seems a good choice (HOG/HOF & HOG3D)

Page 186: Ivan Laptev "Action recognition"

How to improve BoF classification?

Actions are about people. Why not try to combine BoF with person detection?

Detect and track people; compute BoF on person-centered grids: 2x2, 3x2, 3x3…

Surprise!

Page 187: Ivan Laptev "Action recognition"

How to improve BoF classification?

2nd attempt:
• Do not remove background
• Improve local descriptors with region-level information

Local features → ambiguous features → visual vocabulary → features with disambiguated labels

Regions (R1, R2, R3) → histogram representation → SVM classification

Page 188: Ivan Laptev "Action recognition"

Video Segmentation

• Spatio-temporal grids (R1, R2, R3)
• Static action detectors [Felzenszwalb’08], trained from ~100 web images per class
• Object and person detectors (upper body) [Felzenszwalb’08]

Page 189: Ivan Laptev "Action recognition"

Video Segmentation

Page 190: Ivan Laptev "Action recognition"

Hollywood-2 action classification

Attributed feature                                        Performance (mean AP)
BoF                                                       48.55
Spatio-temporal grid, 24 channels                         51.83
Motion segmentation                                       50.39
Upper body                                                49.26
Object detectors                                          49.89
Action detectors                                          52.77
Spatio-temporal grid + Motion segmentation                53.20
Spatio-temporal grid + Upper body                         53.18
Spatio-temporal grid + Object detectors                   52.97
Spatio-temporal grid + Action detectors                   55.72
Spatio-temporal grid + Motion segmentation + Upper body
  + Object detectors + Action detectors                   55.33

Page 191: Ivan Laptev "Action recognition"

Hollywood-2 action classification

Page 192: Ivan Laptev "Action recognition"

Actions in Context (CVPR 2009)

Human actions are frequently correlated with particular scene classes

Reasons: physical properties and particular purposes of scenes

Eating -- kitchen Eating -- cafe

Running -- road Running -- street

Page 193: Ivan Laptev "Action recognition"

Mining scene captions

ILSA

01:22:00 - 01:22:03

I wish I didn't love you so much.

She snuggles closer to Rick.

CUT TO:

EXT. RICK'S CAFE - NIGHT

Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway.

The headlights of a speeding police car sweep toward them.

They flatten themselves against a wall to avoid detection.

The lights move past them.

01:22:15 - 01:22:17

CARL
I think we lost them.
…

Page 194: Ivan Laptev "Action recognition"

Mining scene captions

INT. TRENDY RESTAURANT - NIGHT
INT. MARSELLUS WALLACE'S DINING ROOM - MORNING
EXT. STREETS BY DORA'S HOUSE - DAY
INT. MELVIN'S APARTMENT, BATHROOM - NIGHT
EXT. NEW YORK CITY STREET NEAR CAROL'S RESTAURANT - DAY
INT. CRAIG AND LOTTE'S BATHROOM - DAY

• Maximize word frequency: street, living room, bedroom, car, …
• Merge words with similar senses using WordNet: taxi -> car, cafe -> restaurant
• Measure correlation of words with actions (in scripts) and re-sort words by the entropy of P = p(action | word)
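The entropy re-sorting step can be sketched as follows: a scene word that co-occurs with only a few actions has low entropy over p(action | word) and is therefore informative. This is an illustrative sketch with hypothetical counts, not the lecture's code; the exact estimation procedure is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def rank_scene_words(cooccurrence):
    """cooccurrence: {word: {action: count}} mined from scripts.
    Returns words sorted by entropy of P = p(action | word);
    a low-entropy word strongly predicts a few actions."""
    ranked = []
    for word, counts in cooccurrence.items():
        total = sum(counts.values())
        p = [c / total for c in counts.values()]
        ranked.append((entropy(p), word))
    return [w for _, w in sorted(ranked)]

# Toy counts (hypothetical): "kitchen" co-occurs mostly with eating,
# "street" is uninformative about the action.
cooc = {
    "kitchen": {"Eat": 9, "Run": 1},
    "street":  {"Eat": 5, "Run": 5},
}
print(rank_scene_words(cooc))  # "kitchen" ranks first (lower entropy)
```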

Page 195: Ivan Laptev "Action recognition"

Co-occurrence of actions and scenes in scripts

Page 196: Ivan Laptev "Action recognition"

Co-occurrence of actions and scenes in scripts

Page 197: Ivan Laptev "Action recognition"

Co-occurrence of actions and scenes in scripts

Page 198: Ivan Laptev "Action recognition"

Co-occurrence of actions and scenes in text vs. video

Page 199: Ivan Laptev "Action recognition"

Automatic gathering of relevant scene classes and visual samples

Source: 69 movies aligned with the scripts

Hollywood-2 dataset is on-line: http://www.irisa.fr/vista/actions/hollywood2

Page 200: Ivan Laptev "Action recognition"

Results: actions and scenes (separately)

Page 201: Ivan Laptev "Action recognition"

Classification with the help of context

New action score for a video x:
  s'_a(x) = s_a(x) + Σ_s w(a, s) · s_s(x)
where s_a(x) is the action classification score, s_s(x) is the scene classification score, and the weight w(a, s) is estimated from text (script co-occurrence).
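The combination of action and scene classifier outputs can be sketched numerically. The linear form below is an assumption reconstructed from the slide's labels (action score, scene score, text-estimated weight); the function name and toy numbers are hypothetical.

```python
import numpy as np

def contextual_action_scores(action_scores, scene_scores, W):
    """Combine per-clip classifier outputs with scene context.

    action_scores: (n_actions,) classifier scores for one clip
    scene_scores:  (n_scenes,)  classifier scores for the same clip
    W: (n_actions, n_scenes) weights w(action, scene) estimated from
       action/scene co-occurrence in movie scripts
    New score for action a: s_a + sum_s W[a, s] * s_s
    """
    return action_scores + W @ scene_scores

a = np.array([0.2, -0.1])           # e.g. scores for Eat, Run
s = np.array([0.8, -0.3])           # e.g. scores for kitchen, road
W = np.array([[0.5, 0.0],           # Eat correlates with kitchen
              [0.0, 0.5]])          # Run correlates with road
print(contextual_action_scores(a, s, W))  # [0.6, -0.25]
```

A strong kitchen score boosts the Eat score while leaving Run untouched, which is exactly the effect the context weights are meant to capture.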

Page 202: Ivan Laptev "Action recognition"

Results: actions and scenes (jointly)

[Figure: per-class results — Actions in the context of Scenes; Scenes in the context of Actions]

Page 203: Ivan Laptev "Action recognition"

Weakly-Supervised Temporal Action Annotation

Answer the questions: WHAT actions occurred, and WHEN did they happen?

Knock on the door Fight Kiss

Train visual action detectors and annotate actions with minimal manual supervision

Page 204: Ivan Laptev "Action recognition"

WHAT actions?
Automatic discovery of action classes in text (movie scripts)

-- Text processing: Part of Speech (POS) tagging; Named Entity Recognition (NER); WordNet pruning; Visual Noun filtering
-- Search for action patterns

Person+Verb                 Person+Verb+Prep.                Person+Verb+Prep+Vis.Noun
3725 /PERSON .* is          989 /PERSON .* looks .* at       41 /PERSON .* sits .* in .* chair
2644 /PERSON .* looks       384 /PERSON .* is .* in          37 /PERSON .* sits .* at .* table
1300 /PERSON .* turns       363 /PERSON .* looks .* up       31 /PERSON .* sits .* on .* bed
916 /PERSON .* takes        234 /PERSON .* is .* on          29 /PERSON .* sits .* at .* desk
840 /PERSON .* sits         215 /PERSON .* picks .* up       26 /PERSON .* picks .* up .* phone
829 /PERSON .* has          196 /PERSON .* is .* at          23 /PERSON .* gets .* out .* car
807 /PERSON .* walks        139 /PERSON .* sits .* in        23 /PERSON .* looks .* out .* window
701 /PERSON .* stands       138 /PERSON .* is .* with        21 /PERSON .* looks .* around .* room
622 /PERSON .* goes         134 /PERSON .* stares .* at      18 /PERSON .* is .* at .* desk
591 /PERSON .* starts       129 /PERSON .* is .* by          17 /PERSON .* hangs .* up .* phone
585 /PERSON .* does         126 /PERSON .* looks .* down     17 /PERSON .* is .* on .* phone
569 /PERSON .* gets         124 /PERSON .* sits .* on        17 /PERSON .* looks .* at .* watch
552 /PERSON .* pulls        122 /PERSON .* is .* of          16 /PERSON .* sits .* on .* couch
503 /PERSON .* comes        114 /PERSON .* gets .* up        15 /PERSON .* opens .* of .* door
493 /PERSON .* sees         109 /PERSON .* sits .* at        15 /PERSON .* walks .* into .* room
462 /PERSON .* are/VBP      107 /PERSON .* sits .* down      14 /PERSON .* goes .* into .* room
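Pattern search of this kind can be sketched directly with regular expressions over NER-tagged script lines (person names replaced by a /PERSON placeholder, as on the slide). This is a hypothetical mini-example, not the lecture's mining pipeline; the pattern names and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Patterns in the slide's style: /PERSON .* verb .* prep .* vis-noun
PATTERNS = {
    "SitOnChair":  re.compile(r"/PERSON .* sits .* on .* chair"),
    "PickUpPhone": re.compile(r"/PERSON .* picks .* up .* phone"),
}

def count_action_patterns(script_lines):
    """Count pattern hits over NER-tagged script lines in which
    recognized person names have been replaced by /PERSON."""
    counts = Counter()
    for line in script_lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                counts[name] += 1
    return counts

lines = [
    "/PERSON slowly sits down on the old chair",
    "/PERSON slowly picks the receiver up and answers the phone",
    "/PERSON walks to the window",
]
print(count_action_patterns(lines))
```

Ranking the matched patterns by frequency, as in the table above, then yields candidate action classes for training.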

Page 205: Ivan Laptev "Action recognition"

WHEN: Video Data and Annotation

Want to target realistic video data
Want to avoid manual video annotation for training
Use movies + scripts for automatic annotation of training samples

[Figure: script-to-video alignment around timestamps 24:25 - 24:51 — temporal uncertainty!]

Page 206: Ivan Laptev "Action recognition"

Overview

Input:
• Action type, e.g. Person Opens Door
• Videos + aligned scripts

Pipeline: automatic collection of training clips -> clustering of positive segments -> training classifier

Output:
Sliding-window-style temporal action localization

Page 207: Ivan Laptev "Action recognition"

Action clustering [Lihi Zelnik-Manor and Michal Irani, CVPR 2001]

[Figure: spectral clustering in descriptor space — clustering results vs. ground truth]

Page 208: Ivan Laptev "Action recognition"

Action clustering

Complex data: standard clustering methods do not work on this data

Page 209: Ivan Laptev "Action recognition"

Action clustering
Our view of the problem

[Figure: feature space / video space]
Nearest-neighbor solution: wrong!
Negative samples: random video samples — there are lots of them, and they have a very low chance of being positives

Page 210: Ivan Laptev "Action recognition"

Action clustering
Formulation: discriminative cost [Xu et al. NIPS'04] [Bach & Harchaoui NIPS'07]

The discriminative cost combines:
• loss on parameterized positive samples (temporal windows whose positions are unknown)
• loss on negative samples (random video windows)

Optimization: SVM solution for the classifier; coordinate descent on the positions of the positive samples
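The alternation above can be sketched on toy data. This is a simplified stand-in, not the paper's algorithm: a ridge-regression classifier replaces the SVM, the feature model is invented, and convergence handling is omitted.

```python
import numpy as np

def discriminative_clustering(pos_tracks, negatives, n_iter=10, lam=1e-2):
    """Toy coordinate descent for weakly supervised localization.

    pos_tracks: list of (T_i, d) arrays; each row is the feature of one
      candidate temporal window in clip i (exactly one is the action).
    negatives: (M, d) array of random negative windows.
    Alternates: (1) fit a regularized linear classifier (ridge stand-in
    for the SVM on the slide) on current positives vs. negatives;
    (2) reassign each clip's positive to its highest-scoring window.
    """
    d = negatives.shape[1]
    picks = [0] * len(pos_tracks)          # initial window per clip
    for _ in range(n_iter):
        X = np.vstack([t[p] for t, p in zip(pos_tracks, picks)] + [negatives])
        y = np.hstack([np.ones(len(pos_tracks)), -np.ones(len(negatives))])
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        picks = [int(np.argmax(t @ w)) for t in pos_tracks]
    return picks, w

# Toy setup: true action windows lie near (1, 1), decoy windows and
# random negatives near (-1, -1).
pos_tracks = [np.array([[-1.0, -1.0], [1.0, 1.0]]),
              np.array([[1.1, 0.9], [-0.9, -1.1]])]
negatives = np.array([[-1.0, -0.9], [-1.1, -1.0], [-0.95, -1.05]])
picks, w = discriminative_clustering(pos_tracks, negatives, n_iter=5)
print(picks)  # each clip's selected window index
```

Even when the initialization picks a decoy window, the negatives pull the classifier away from the background region, and the reassignment step recovers the true windows.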

Page 211: Ivan Laptev "Action recognition"

Clustering results
Drinking actions in "Coffee and Cigarettes"

Page 212: Ivan Laptev "Action recognition"

Detection results
Drinking actions in "Coffee and Cigarettes"

• Training Bag-of-Features classifier
• Temporal sliding-window classification
• Non-maximum suppression

Detection trained on simulated clusters

Test set: 25 min from "Coffee and Cigarettes" with 38 ground-truth drinking actions
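The temporal sliding-window + non-maximum-suppression step can be sketched as follows. The per-frame scores and the mean-score window classifier are stand-ins for the BoF+SVM window classifier of the slides; names and thresholds are hypothetical.

```python
def sliding_window_detect(frame_scores, win=25, step=5, thresh=0.0):
    """Slide a fixed-length temporal window over per-frame classifier
    scores; keep windows whose mean score exceeds a threshold."""
    dets = []
    for start in range(0, len(frame_scores) - win + 1, step):
        s = sum(frame_scores[start:start + win]) / win
        if s > thresh:
            dets.append((start, start + win, s))
    return temporal_nms(dets)

def temporal_nms(dets, max_overlap=0.3):
    """Greedy non-maximum suppression on temporal intervals:
    keep high-scoring windows, discard overlapping weaker ones."""
    dets = sorted(dets, key=lambda d: -d[2])
    keep = []
    for b0, e0, s in dets:
        ok = True
        for b1, e1, _ in keep:
            inter = max(0, min(e0, e1) - max(b0, b1))
            union = max(e0, e1) - min(b0, b1)
            if inter / union > max_overlap:
                ok = False
                break
        if ok:
            keep.append((b0, e0, s))
    return keep

# An action spanning frames 30-54 yields one dominant detection.
scores = [0.0] * 30 + [1.0] * 25 + [0.0] * 30
print(sliding_window_detect(scores, win=25, step=5, thresh=0.1))
```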

Page 213: Ivan Laptev "Action recognition"

Detection results
Drinking actions in "Coffee and Cigarettes"

• Training Bag-of-Features classifier
• Temporal sliding-window classification
• Non-maximum suppression

Detection trained on automatic clusters

Test set: 25 min from "Coffee and Cigarettes" with 38 ground-truth drinking actions

Page 214: Ivan Laptev "Action recognition"

Detection results
"Sit Down" and "Open Door" actions in ~5 hours of movies

Page 215: Ivan Laptev "Action recognition"

Temporal detection of "Sit Down" and "Open Door" actions in movies: The Graduate, The Crying Game, Living in Oblivion

Page 216: Ivan Laptev "Action recognition"

Conclusions

• Bag-of-words models are currently dominant; structure (human poses, etc.) should be integrated.
• The vocabulary of actions is not well-defined – it depends on the goal and the task.
• Actions should be used for the functional interpretation of the visual world.