【VISAPP2016】Activity Prediction Using a Space-Time CNN and Bayesian Framework
Activity Prediction Using a Space-Time CNN and Bayesian Framework
Hirokatsu KATAOKA, Yoshimitsu AOKI†, Kenji IWATA, Yutaka SATOH
National Institute of Advanced Industrial Science and Technology (AIST) † Keio University
http://www.hirokatsukataoka.net/
Background • Computer vision for human sensing – Detection, tracking, trajectory analysis – Posture estimation, action analysis – Action recognition can extend human sensing applications
[Figure: human sensing tasks. Labels: Tracking, Trajectory extraction, Detection, Face Recognition, Posture Estimation, Gaze Estimation, Action Recognition, Action Analysis; higher-level targets: Attention, Body Situation, Mental state. Example labels: "Shaking hands", "Look at people".]
Related work 1: Action Recognition • Action is a low-level primitive with semantic meaning – e.g. walking, running, sitting
This image contains a man walking - The classification (location is given)
Action recognition
Walking
Is action recognition enough?
[Figure: two time-series diagrams. Event detection (action tag: $A_i$) is post-detection; event prediction (prediction tag: $A_j$) is pre-estimation.]
Related work 2: Early Action Recognition • Prediction in early part of action – Integral bag-of-words – Accumulating likelihood through time-sequence
M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.
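The integral bag-of-words idea can be sketched as a running histogram of visual-word counts, so the histogram of any video prefix is available with a single lookup. This is only a minimal sketch under stated assumptions: per-frame visual-word IDs are taken as given, and the function name `integral_bow` is hypothetical rather than taken from the cited paper.

```python
from itertools import accumulate

def integral_bow(word_ids_per_frame, vocab_size):
    """Return H where H[t] is the visual-word histogram of frames 0..t (inclusive)."""
    per_frame = []
    for ids in word_ids_per_frame:
        counts = [0] * vocab_size
        for w in ids:
            counts[w] += 1          # count each visual word seen in this frame
        per_frame.append(counts)
    # Running element-wise sum over time gives the "integral" histogram.
    return list(accumulate(per_frame, lambda a, b: [x + y for x, y in zip(a, b)]))

frames = [[0, 2], [2, 2, 1], [0]]   # toy word IDs for three frames (illustrative)
H = integral_bow(frames, vocab_size=3)
```

An early-recognition system could then score the partial histogram `H[t]` against class models at every frame, which is one way to accumulate evidence through the sequence.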
Proposal • Action prediction within an ST-CNN and Bayesian framework – Action recognition – Database analysis
[Figure: time series of Daytime (Time Zone), Walking (Previous Action), Sitting (Current Action), ??? (Next Action). Given: $x_{timezone}$, $x_{previous}$, $x_{current}$. Not given: θ = “Using a PC”.]
Problem settings
• Three different works in action analysis – Action recognition
• Recognizing $A_t$ given frames 1 ~ t
– Early action recognition
• Recognizing $A_t$ given frames 1 ~ t-L
– Action Prediction
• Recognizing $A_{t+L}$ given frames 1 ~ t
Approach                   Setting
Action Recognition         $f(F^{A}_{1 \ldots t}) \rightarrow A_t$
Early Action Recognition   $f(F^{A}_{1 \ldots t-L}) \rightarrow A_t$
Action Prediction          $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$
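The three settings differ only in which frames the classifier may see and which time step's label it must output. A small sketch makes this concrete; the function name `observed_frames` and the setting strings are hypothetical, chosen for illustration.

```python
def observed_frames(frames, t, setting, L=0):
    """Return (frames visible to the classifier, time index of the target label)."""
    if setting == "recognition":        # f(F_1..t)   -> A_t
        return frames[:t], t
    if setting == "early_recognition":  # f(F_1..t-L) -> A_t
        return frames[:t - L], t
    if setting == "prediction":         # f(F_1..t)   -> A_{t+L}
        return frames[:t], t + L
    raise ValueError(f"unknown setting: {setting}")

frames = list(range(1, 11))             # frame numbers 1..10
seen, target = observed_frames(frames, t=6, setting="prediction", L=3)
early_seen, early_target = observed_frames(frames, t=6, setting="early_recognition", L=3)
```

With `t=6` and `L=3`, prediction observes frames 1 to 6 and targets the label at time 9, while early recognition observes frames 1 to 3 and targets the label at time 6.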
Process flow • Consists of (i) action recognition and (ii) action prediction
1. Action recognition 1.1 Improved dense trajectories (IDT) 1.2 Space-time convolutional neural networks (ST-CNN)
2. Action prediction 2.1 Bayesian framework 2.2 Database
[Figure: process flow. Labels: Pedestrian detection; IDT: trajectory (in t + L frames), feature extraction (HOG, HOF, MBH, Traj.), bag-of-words (BoW); ST-CNN: Oxford VGG architecture (VGGNet) with Input, five Conv-Conv-Pool blocks, FC.]
Action Recognition (1/2) • Improved Dense Trajectories (IDT) [Wang+, ICCV2013] – Pyramidal image sequences and flow tracking – Feature descriptors on trajectories – Feature representation with bag-of-words (BoW)
sitting walking
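The bag-of-words step can be sketched as nearest-codeword assignment followed by a histogram. This is a toy illustration, not the IDT implementation: the codebook is assumed to be given (in practice it is usually learned by clustering), and the L1 normalization and the name `bow_histogram` are assumptions.

```python
def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword; return an L1-normalized histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        counts[dists.index(min(dists))] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

codebook = [(0.0, 0.0), (1.0, 1.0)]                              # toy 2-word vocabulary
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.8, 0.9)]   # toy trajectory descriptors
hist = bow_histogram(descriptors, codebook)
```

The resulting fixed-length vector is what a classifier would consume, regardless of how many trajectories were extracted from the clip.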
Action Recognition (1/2) • IDT + Co-occurrence HOG [Kataoka+, ACCV2014]
CoHOG: edge-pair counting to corresponding histogram position
Extended CoHOG(ECoHOG): edge-magnitude accumulation
– PCA dimensionality reduction: from 10^3 - 10^4 dims to 10^1 - 10^2, easier to separate in feature space
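The contrast between CoHOG (pair counting) and ECoHOG (magnitude accumulation) can be illustrated with a single co-occurrence histogram for one pixel offset. This is a simplified sketch: the weighting used for the ECoHOG case (sum of the two edge magnitudes) is an assumption, since the slide only says "accumulation", and `cooccurrence_hist` is a hypothetical name.

```python
def cooccurrence_hist(orient, mag, offset, n_bins, use_magnitude=False):
    """Co-occurrence histogram of quantized edge orientations for one pixel offset.

    use_magnitude=False: CoHOG-style, each orientation pair adds 1.
    use_magnitude=True:  ECoHOG-style, each pair adds the two edge magnitudes
                         (assumed weighting, for illustration only).
    """
    dy, dx = offset
    h, w = len(orient), len(orient[0])
    hist = [[0.0] * n_bins for _ in range(n_bins)]
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:   # skip pairs that leave the patch
                weight = mag[y][x] + mag[y2][x2] if use_magnitude else 1.0
                hist[orient[y][x]][orient[y2][x2]] += weight
    return hist

orient = [[0, 1], [1, 0]]        # quantized orientation bin per pixel (toy 2x2 patch)
mag = [[2.0, 1.0], [0.5, 3.0]]   # gradient magnitude per pixel
cohog = cooccurrence_hist(orient, mag, offset=(0, 1), n_bins=2)
ecohog = cooccurrence_hist(orient, mag, offset=(0, 1), n_bins=2, use_magnitude=True)
```

Both variants fill the same histogram positions; only the amount added per edge pair differs.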
Action Recognition (2/2) • Space-time Convolutional Neural Networks (ST-CNN) – Based on VGG 16-layer architecture (VGGNet) [Simonyan+, ICLR2015] – Spatio-temporal feature concatenation (around 10 frames)
[Figure: Space-time CNN (ST-CNN) feature. CNN architecture with VGGNet: Input, five Conv-Conv-Pool blocks, three FC layers, Softmax.]
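Spatio-temporal feature concatenation over roughly 10 frames might look like the sliding-window sketch below. The slides do not say which layer the per-frame features come from or how windows are formed, so the end-to-end concatenation, the stride of one frame, and the name `spatio_temporal_features` are all assumptions.

```python
def spatio_temporal_features(frame_features, window=10):
    """Concatenate per-frame feature vectors over a sliding window of `window` frames."""
    out = []
    for start in range(len(frame_features) - window + 1):
        clip = frame_features[start:start + window]
        # Flatten the window into one long vector: [f_start, ..., f_(start+window-1)].
        out.append([v for feat in clip for v in feat])
    return out

# 12 frames with 2-D toy features standing in for per-frame CNN activations.
per_frame = [[float(i), float(i) + 0.5] for i in range(12)]
st = spatio_temporal_features(per_frame, window=10)
```

Each output vector has `window` times the per-frame dimensionality, which is why a dimensionality reduction step is often useful before classification.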
Action Prediction (1/2) • Prediction model
– Action sequence: predicting “Using a PC” from “Walk” => “Sit”
– Time zone (supplemental info.): daytime
[Figure: time series of Daytime (Time Zone), Walking (Previous Activity), Sitting (Current Activity), ??? (Next Activity). Given: $x_{timezone}$, $x_{previous}$, $x_{current}$. Not given: θ = “Using a PC”.]
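One way to read the given variables and the unknown θ, assuming a naive Bayes model (the slides say only "Bayesian framework", so the conditional-independence factorization here is an assumption), is as a maximum a posteriori choice over candidate next activities:

```latex
\hat{\theta}
  = \arg\max_{\theta} P(\theta \mid x_{timezone}, x_{previous}, x_{current})
  = \arg\max_{\theta} P(\theta)\,
      P(x_{timezone} \mid \theta)\,
      P(x_{previous} \mid \theta)\,
      P(x_{current} \mid \theta)
```

The evidence term $P(x_{timezone}, x_{previous}, x_{current})$ is dropped in the second line because it does not depend on θ.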
• Database: ST-action tags + attribute – Time zone
• “morning”, “day time”, “night”
– Previous & current action
• “walk”, “bend”, “stand”, “sit”…
– Next action (objective)
• “use a PC”, “read”, “meal”…
Action Prediction (2/2)
[Figure: Action History DB, with an example record: Daytime, Walking, Sitting, Using a PC.]
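Prediction from such an action-history database can be sketched with a small naive Bayes classifier over the three attributes. This is a hedged illustration, not the authors' implementation: the naive Bayes factorization, the Laplace smoothing, the function names `train` and `predict`, and the toy records are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(db):
    """db: list of (timezone, previous, current, next_action) tuples."""
    prior = Counter(row[3] for row in db)
    cond = defaultdict(Counter)   # cond[(slot, next_action)][value] = count
    values = defaultdict(set)     # distinct values observed per attribute slot
    for row in db:
        for slot, value in enumerate(row[:3]):
            cond[(slot, row[3])][value] += 1
            values[slot].add(value)
    return prior, cond, values

def predict(model, x, alpha=1.0):
    """Return {next_action: posterior}, using Laplace smoothing `alpha`."""
    prior, cond, values = model
    total = sum(prior.values())
    scores = {}
    for theta, n in prior.items():
        p = n / total                                   # P(theta)
        for slot, value in enumerate(x):                # product of P(x_slot | theta)
            p *= (cond[(slot, theta)][value] + alpha) / (n + alpha * len(values[slot]))
        scores[theta] = p
    z = sum(scores.values())
    return {theta: s / z for theta, s in scores.items()}

db = [  # toy records, invented for illustration only
    ("daytime", "walk", "sit", "use a PC"),
    ("daytime", "walk", "sit", "use a PC"),
    ("daytime", "stand", "sit", "read"),
    ("night", "walk", "sit", "meal"),
]
posterior = predict(train(db), ("daytime", "walk", "sit"))
best = max(posterior, key=posterior.get)
```

In a full pipeline, the previous and current actions fed to `predict` would come from the action recognition stage rather than from ground-truth tags.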
Results • Action recognition – IDT (HOG, HOF, MBH, CoHOG, ECoHOG, All) – Per-frame CNN – ST-CNN – Combined vector
Results • Action prediction
[Figure: example prediction outputs. Labels: Time, Attributes, Estimated Intention, Action, Predicted activity. Scores shown: PC (0.82), Read (0.11); Read (1.00), PC (0.00).]