learning the appearance and motion of people in video
DESCRIPTION
Learning the Appearance and Motion of People in Video. Hedvig Sidenbladh, KTH [email protected], www.nada.kth.se/~hedvig/ Michael Black, Brown University [email protected], www.cs.brown.edu/people/black/. Collaborators. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/1.jpg)
Learning the Appearance and Learning the Appearance and Motion of People in VideoMotion of People in Video
Learning the Appearance and Learning the Appearance and Motion of People in VideoMotion of People in Video
Hedvig Sidenbladh, [email protected], www.nada.kth.se/~hedvig/
Michael Black, Brown [email protected], www.cs.brown.edu/people/black/
![Page 2: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/2.jpg)
CollaboratorsCollaborators
• David Fleet, Xerox PARC [email protected]
• Dirk Ormoneit, Stanford University [email protected]
• Jan-Olof Eklundh, KTH [email protected]
![Page 3: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/3.jpg)
GoalGoalGoalGoal
• Tracking and reconstruction of human motion in 3D– Articulated 3D model– Monocular sequence – Pinhole camera model– Unknown, cluttered
environment
![Page 4: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/4.jpg)
Why is it Important?Why is it Important?
• Human-machine interaction– Robots– Intelligent rooms
• Video search
• Animation, motion capture
• Surveillance
![Page 5: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/5.jpg)
Why is it Hard?Why is it Hard?Why is it Hard?Why is it Hard?
• Structure is unobservable - inference from visible parts
• People appear in many ways - how find a model that fits them all?
![Page 6: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/6.jpg)
Why is it Hard?Why is it Hard?Why is it Hard?Why is it Hard?
• People move fast and non-linearly
• 3D to 2D projection ambiguities
• Large occlusion• Similar appearance of
different limbs• Large search space Extreme case
![Page 7: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/7.jpg)
Exploit cues in the images. Learn likelihood models: p(image cue | model)
Build models of human form and motion. Learnpriors over model parameters:
p(model)
Represent the posterior distribution: p(model | cue) p(cue | model) p(model)
BayesianBayesian InferenceInference
![Page 8: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/8.jpg)
Human ModelHuman Model
• Limbs = truncated cones in 3D
• Pose determined by parameters
![Page 9: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/9.jpg)
Bregler and Malik ‘98Bregler and Malik ‘98
State of the Art.
• Brightness constancy cue – Insensitive to appearance
• Full-body required multiple cameras
• Single hypothesis
![Page 10: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/10.jpg)
Brightness ConstancyBrightness Constancy
I(x, t+1) = I(x+u, t) +
Image motion of foreground as a function of the 3D motion of the body.
Problem: no fixed model of appearance (drift).
t1t
![Page 11: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/11.jpg)
Cham and Rehg ‘99Cham and Rehg ‘99
State of the Art.
I(x, t) = I(x+u, 0) +
• Single camera, multiple hypotheses
• 2D templates (no drift but view dependent)
![Page 12: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/12.jpg)
Multiple HypothesesMultiple Hypotheses
• Posterior distribution over model parameters often multi-modal (due to ambiguities)
• Represent whole distribution:– sampled representation– each sample is a pose– predict over time using a particle
filtering approach
![Page 13: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/13.jpg)
State of the Art.
Deutscher, North, Deutscher, North, Bascle, & Blake ‘00Bascle, & Blake ‘00
• Multiple hypotheses
• Multiple cameras
• Simplified clothing, lighting and background
![Page 14: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/14.jpg)
State of the Art.
Sidenbladh, Black, & Fleet ‘00Sidenbladh, Black, & Fleet ‘00
• Multiple hypotheses• Monocular• Brightness constancy• Activity specific prior • Significant changes in
view and depth, template-based methods will fail
![Page 15: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/15.jpg)
– Need a constraining likelihood model that is also invariant to variations in human appearance
How to Address the ProblemsHow to Address the Problems
– Need a good model of how people move
– Need an effective way to explore the model space (very high dimensional)
Bayesian formulation:
p(model | cue) p(cue | model) p(model)
![Page 16: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/16.jpg)
Edge Detection?Edge Detection?
• Probabilistic model?
• Under/over-segmentation, thresholds, …
![Page 17: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/17.jpg)
Key Idea #1Key Idea #1
• Use the 3D model to predict the location of limb boundaries in the scene.
• Compute various filter responses steered to the predicted orientation of the limb.
• Compute likelihood of filter responses using a statistical model learned from examples.
![Page 18: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/18.jpg)
“Explain” the entire image.
p(image | foreground, background)
Generic, unknown, background
Foreground person
Key Idea #2Key Idea #2
![Page 19: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/19.jpg)
p(image | foreground, background)
Do not look in parts of the image considered background
Foreground part of image
Key Idea #2Key Idea #2
p(foreground part of image | foreground)p(foreground part of image | background)
![Page 20: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/20.jpg)
Training DataTraining DataTraining DataTraining Data
Points on limbs Points on background
![Page 21: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/21.jpg)
Edge DistributionsEdge DistributionsEdge response steered to model edge:
),(cos),(sin),,( xxx yxe fff
Similar to Konishi et al., CVPR 99
![Page 22: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/22.jpg)
Ridge DistributionsRidge Distributions
|),(cossin2),(sin),(cos|
|),(cossin2),(cos),(sin|),,(22
22
xxx
xxxx
xyyyxx
xyyyxxr
fff
ffff
Ridge response steered to limb orientation
Ridge response only on certain image scales!
![Page 23: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/23.jpg)
Motion Training DataMotion Training DataMotion Training DataMotion Training Data
Motion response = I(x, t+1) - I(x+u, t)
x x+u
Motion response = temporal brightness change given model of motion = noise term in brightness constancy assumption
![Page 24: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/24.jpg)
Motion distributionsMotion distributions
Different underlying motion models
![Page 25: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/25.jpg)
Likelihood FormulationLikelihood Formulation
• Independence assumptions: – Cues: p(image | model) = p(cue1 | model) p(cue2 | model)
– Spatial: p(image | model) = p(image(x) | model)
– Scales: p(image | model) = p(image() | model)
• Combines cues and scales!
• Simplification, in reality there are dependencies
ximage
=1,...
![Page 26: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/26.jpg)
LikelihoodLikelihood
pixelsbackpixelsfore
backimagepforeimagepbackforeimagep )|()|(),|(
pixelsfore
pixelsforepixelsall
backimagep
foreimagepbackimagep
)|(
)|()|(
pixelsfore
pixelsfore
backimagep
foreimagepconst
)|(
)|(
Foreground pixels
Background pixels
![Page 27: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/27.jpg)
– Need a constraining likelihood model that is also invariant to variations in human appearance
Step One Discussed…Step One Discussed…
– Need a good model of how people move
Bayesian formulation:
p(model | cue) p(cue | model) p(model)
![Page 28: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/28.jpg)
Models of Human DynamicsModels of Human Dynamics
• Model of dynamics are used to propagate the sampled distribution in time
• Constant velocity model– All DOF in the model parameter space, ,
independent– Angles are assumed to change with constant speed– Speed and position changes are randomly sampled
from normal distribution
![Page 29: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/29.jpg)
Models of Human DynamicsModels of Human Dynamics
• Action-specific model - Walking– Training data: 3D motion capture data– From training set, learn mean cycle and
common modes of deviation (PCA)
Mean cycle Small noise Large noise
![Page 30: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/30.jpg)
– Need a constraining likelihood model that is also invariant to variations in human appearance
Step Two Also Discussed…Step Two Also Discussed…
– Need a good model of how people move
– Need an effective way to explore the model space (very high dimensional)
Bayesian formulation:
p(model | cue) p(cue | model) p(model)
![Page 31: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/31.jpg)
Particle FilterParticle Filter
samplesample
samplesample
normalizenormalize
Posterior )I|( 11 ttp
Temporal dynamics)|( 1ttp
Likelihood
)|I( ttp )I|( ttp
Posteri
orProblem: Expensive represententation of posterior! Approaces to solve problem: • Lower the number of samples. (Deutsher et al., CVPR00)• Represent the space in other ways (Choo and Fleet, ICCV01)
![Page 32: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/32.jpg)
Tracking an ArmTracking an Arm
Moving camera, constant velocity model
1500 samples~2 min/frame
![Page 33: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/33.jpg)
Self OcclusionSelf Occlusion
Constant velocity model
1500 samples~2 min/frame
![Page 34: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/34.jpg)
Walking PersonWalking Person
Walking model
2500 samples~10 min/frame
#samples from 15000 to 2500by using the learned likelihood
![Page 35: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/35.jpg)
Ongoing and Future WorkOngoing and Future Work
• Learned dynamics
• Correlation across scale
• Estimate background motion
• Statistical models of color and texture
• Automatic initialization
![Page 36: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/36.jpg)
Lessons LearnedLessons Learned
• Probabilistic (Bayesian) framework allows– Integration of information in a principled way– Modeling of priors
• Particle filtering allows– Multi-modal distributions– Tracking with ambiguities and non-linear models
• Learning image statistics and combining cues improves robustness and reduces computation
![Page 37: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/37.jpg)
ConclusionsConclusions
• Generic, learned, model of appearance– Combines multiple cues– Exploits work on image statistics
• Use the 3D model to predict features
• Model of foreground and background– Exploits the ratio between foreground and
background likelihood– Improves tracking
![Page 38: Learning the Appearance and Motion of People in Video](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814f05550346895dbc9762/html5/thumbnails/38.jpg)
Other Related WorkOther Related Work
J. Sullivan, A. Blake, M. Isard, and J.MacCormick. Object localization by Bayesian correlation. ICCV99.
J. Sullivan, A. Blake, and J.Rittscher. Statistical foreground modelling for object localisation. ECCV00.
J. Rittscher, J. Kato, S. Joga, and A. Blake. A Probabilistic Background Model for Tracking. ECCV00.
S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3), 1999.