ECCV 2014, STM
TRANSCRIPT
Spatio-Temporal Matching for Human Detection in Video
Feng Zhou, Fernando De la Torre
Good morning, everyone.
My name is Feng Zhou. I am working with Dr. Fernando de la Torre.
Today, I will present our work on the spatial and temporal alignment of human behavior.
Human Detection in Video. Challenges: view-point change, non-rigid deformation, temporal misalignment, occlusion & noise.
Matching with 2D videos / matching with 3D mocaps / proposed work.
To introduce our method, let's consider the following example.
Suppose we are given a video of people kicking a ball. To recognize the action in this video, one way is to match it against other videos in a dataset.
Previous Work
Single-frame DPM: Sapp et al., 2011; Yang & Ramanan, 2011; Ferrari et al., 2008
Multi-frame DPM / DPM + smoothing: Andriluka et al., 2012; Burgos et al., 2013
2D-3D lifting: Agarwal & Triggs, 2006; Sigal & Black, 2006
Limitations: sensitive to noise, expensive to optimize, require large data
Now let's look at the second problem.
In this problem, we are trying to align multi-modal sequences in time.
By multi-modal, we mean the sequences are captured by different sensors.
For instance, suppose that we need to align three sequences of different subjects kicking a ball. In particular, the first sequence is a video. The second one is a mocap sequence. The last one is captured by accelerometers on the joints.
The goal here is to find the correspondences between frames.
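For intuition, a generic baseline for recovering frame correspondences between two sequences is dynamic time warping. The sketch below is only an illustrative stand-in, not the multi-modal alignment method proposed in this work; it assumes both sequences are given as plain feature arrays:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping between two feature sequences.

    x: (n, d) array, y: (m, d) array. Returns the total alignment cost
    and the warping path as a list of matched frame-index pairs.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to recover the frame correspondences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# Align a sine sequence with a time-stretched version of itself.
x = np.sin(np.linspace(0, 2 * np.pi, 30))[:, None]
y = np.sin(np.linspace(0, 2 * np.pi, 45))[:, None]
total, path = dtw(x, y)
```

Because the second sequence is a stretched copy of the first, the warping path runs from the first frame pair to the last, matching each frame to its stretched counterpart.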
Spatio-Temporal Matching
Input: video and motion capture templates
Goal: many-to-one mapping, finding correspondences between trajectories
To address these issues, we propose our method as follows.
First, we extract 2D feature trajectories from the video. To do that, we can use standard KLT trackers.
Workflow
Input video: dense trajectories and joint responses
Motion capture templates: spatio-temporal shape model
Matching: many-to-one mapping
Dense Trajectories ("Dense Trajectories and Motion Boundary Descriptors for Action Recognition", IJCV, 2013)
2D trajectory coordinates: #frames per segment (e.g., 15) x #trajectories (>800)
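As a concrete illustration of this data layout, a segment can be formed by stacking the 2D coordinates of all trajectories over a short window of frames. The sketch below assumes the trajectories are already tracked and stored as a (frames x trajectories x 2) array; the segment length and stride are illustrative choices (the slide suggests roughly 15 frames per segment and over 800 trajectories):

```python
import numpy as np

def segment_trajectories(traj, seg_len=15, stride=5):
    """Chop dense 2D trajectories into fixed-length segments.

    traj: (T, N, 2) array of x/y coordinates of N trajectories over T frames.
    Returns a list of (2*N, seg_len) matrices, one per segment: each column
    stacks the x and y coordinates of all trajectories at one frame.
    """
    T, N, _ = traj.shape
    segments = []
    for s in range(0, T - seg_len + 1, stride):
        seg = traj[s:s + seg_len]                 # (seg_len, N, 2)
        seg = seg.transpose(1, 2, 0).reshape(2 * N, seg_len)
        segments.append(seg)
    return segments

traj = np.random.rand(60, 800, 2)                 # 60 frames, 800 trajectories
segs = segment_trajectories(traj)
```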
Joint Response ("Articulated pose estimation with flexible mixtures-of-parts", PAMI, 2013; "N-best maximal decoders for part models", ICCV, 2011)
Trajectory-to-joint response, #joints (=14); e.g., head, right foot
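One plausible way to attach a joint response to each trajectory (the exact pooling is not spelled out on this slide, so this is an assumption) is to sample the per-joint detection maps of a part-based pose estimator along the trajectory and average over frames:

```python
import numpy as np

def trajectory_joint_response(heatmaps, traj):
    """Pool per-frame joint detection maps along each trajectory.

    heatmaps: (T, J, H, W) response maps from a part-based pose detector,
              one map per joint and frame.
    traj:     (T, N, 2) integer pixel coordinates (x, y) of N trajectories.
    Returns an (N, J) matrix: mean response of each joint along each trajectory.
    """
    T, J, H, W = heatmaps.shape
    _, N, _ = traj.shape
    resp = np.zeros((N, J))
    for t in range(T):
        xs = np.clip(traj[t, :, 0], 0, W - 1)
        ys = np.clip(traj[t, :, 1], 0, H - 1)
        resp += heatmaps[t][:, ys, xs].T          # (J, N) -> (N, J)
    return resp / T
```

A trajectory passing through high-response regions of, say, the right-foot map ends up with a high right-foot score, which is what the matching step later exploits.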
Spatio-Temporal Shape Model
Original segments vs. after alignment (Procrustes analysis): clusters 1-4
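The Procrustes alignment step can be sketched as follows: the classical orthogonal Procrustes solution via SVD, which removes translation and rotation from one 3D joint configuration so it can be compared and clustered against another. This is a textbook version, not necessarily the paper's exact variant:

```python
import numpy as np

def procrustes_align(X, Y):
    """Rotate and translate Y (P x 3 joint positions) to best fit X (P x 3).

    Classical Procrustes analysis: center both shapes, find the optimal
    rotation from the SVD of the cross-covariance, and return aligned Y.
    """
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    # Avoid reflections: keep a proper rotation (det = +1).
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        R = U @ Vt
    return Yc @ R.T + X.mean(0)
```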
Spatio-Temporal Shape Model
Cluster 3: 3D trajectory coordinates factored by a linear operator into weights, shape bases, and trajectory bases (DCT), with a closed-form solution ("Bilinear spatiotemporal basis models", ACM Trans. Graphics, 2012)
Spatio-Temporal Shape Model
Cluster 3: reconstruction from DCT bases and shape bases
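A minimal sketch of the bilinear spatio-temporal basis idea: represent a segment's time-varying shape as a small weight matrix between a fixed truncated DCT trajectory basis and a data-driven shape basis. When both bases are orthonormal, the weights have a closed form. The dimensions and basis sizes below are illustrative, not the paper's settings:

```python
import numpy as np

def dct_basis(F, k):
    """First k columns of an orthonormal DCT-II basis for length-F signals."""
    n = np.arange(F)
    B = np.cos(np.pi * (2 * n[:, None] + 1) * np.arange(k)[None, :] / (2 * F))
    B *= np.sqrt(2.0 / F)
    B[:, 0] = np.sqrt(1.0 / F)
    return B                                      # (F, k), B.T @ B = I_k

def fit_bilinear(S, kt, ks):
    """Fit a bilinear spatio-temporal model S ~ Phi @ C @ Psi.T + mean.

    S:   (F, D) time-varying shape matrix (F frames, D stacked coordinates).
    Phi: (F, kt) truncated DCT trajectory basis (fixed, data-independent).
    Psi: (D, ks) shape basis from PCA of the centered frames.
    With orthonormal bases the weights have the closed form C = Phi.T S Psi.
    """
    Phi = dct_basis(S.shape[0], kt)
    mean = S.mean(0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    Psi = Vt[:ks].T                               # (D, ks)
    C = Phi.T @ (S - mean) @ Psi                  # closed-form weights
    recon = Phi @ C @ Psi.T + mean
    return recon, C
```

The DCT trajectory basis is what provides temporal smoothing: truncating it forces every reconstructed joint trajectory to be a low-frequency curve.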
Matching
Objective: match the joint responses of the 2D trajectory coordinates against the spatio-temporal shape model (shape bases and DCT bases combined with weights), under a 3D-2D orthographic projection and a many-to-one correspondence matrix.
Matching (continued)
The same objective with an added regularization term.
Optimization: linear programming and L1 Procrustes analysis.
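The many-to-one trajectory-to-joint assignment can be relaxed into a linear program. The sketch below is a heavily simplified stand-in for the paper's formulation: it keeps only the joint-response cost and drops the geometric, projection, and basis terms, so the constraint set and objective are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def match_trajectories(joint_response):
    """LP relaxation of many-to-one trajectory-to-joint matching.

    joint_response: (N, J), higher means trajectory i looks more like joint j.
    Variables w_ij in [0, 1]; each trajectory distributes unit mass over the
    joints (sum_j w_ij = 1). Maximizing total response = minimizing -response.
    """
    N, J = joint_response.shape
    c = -joint_response.ravel()                   # maximize response
    # One equality constraint per trajectory: its row of W sums to 1.
    A_eq = np.zeros((N, N * J))
    for i in range(N):
        A_eq[i, i * J:(i + 1) * J] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=np.ones(N), bounds=(0, 1), method="highs")
    return res.x.reshape(N, J)

resp = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.2, 0.7]])
W = match_trajectories(resp)
```

With this toy cost the LP vertex solution simply assigns each trajectory to its highest-response joint; the appeal of the LP form is that extra linear terms and constraints (geometry, regularization) can be added without changing the solver.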
Results: CMU Motion Capture Dataset
5 actions (walk, run, jump, kick, swing), 8 sequences per action, 0-200 outliers, 4 cameras, 14 joints
Example with 20 outliers: camera 1 vs. camera 2
Results: CMU Motion Capture Dataset (continued)
Input / Greedy / STM (proposed): kick, walk
Results: CMU Motion Capture Dataset (continued)
Errors (all actions) vs. #outliers, generic model vs. action-specific model
Results: Berkeley MHAD Dataset
12 persons, 2 cameras, 11 actions, 14 joints
DPM (Yang & Ramanan) vs. STM (proposed): jump
Results: Berkeley MHAD Dataset (continued)
DPM (Yang & Ramanan) vs. STM (proposed): sit down
Results: Berkeley MHAD Dataset (continued)
DPM (Yang & Ramanan) vs. STM (proposed): throw
Results: Berkeley MHAD Dataset (continued)
Generic model vs. action-specific model
Results: Human 3.6M Dataset
11 persons, 2 cameras, 17 actions, 14 joints
Walk
Results: Human 3.6M Dataset (continued)
Generic model vs. action-specific model
Conclusion
Spatio-temporal matching: temporal smoothing by trajectories, view-invariant matching, efficient linear programming solution
Future work: temporal consistency between segments, improving optimization
In this talk, I will present two methods.
In the first problem, we try to match feature points between images. This problem is called spatial matching, because we need to find the correspondences in space.
In the second category, given several video and mocap sequences, we try to temporally align the sequences.
Let's first look at the first problem of spatial matching.
Backup Slides
Linear Programming Approximation
Previous Work on 3D-2D Matching
Point set matching: ICP & RANSAC (Gold et al., 1998; David et al., 2004); branch & bound (Li & Hartley, 2007)
3D-2D rendering: motion history volume (Weinland et al., 2007)
View-invariant representation: epipolar geometry (Rao et al., 2002); self-similarity (Junejo et al., 2011)
Model recommendation: Matikainen et al., 2012
Limitations: sensitive to noise, expensive to optimize, require large data