Deep Learning for Computer Vision Pr. Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux Chapter 6. Temporal aspects. Applications
Chapter 6
Summary
1. Temporal aspects: RNN, LSTM
2. Applications: 3D ConvNets…
Video Analysis & Coding/Computer Vision 2
1. RNN
➔ Recurrent neural networks (RNNs) are a family of neural networks for processing sequential data.
➔ Formally, an RNN is a neural network specialized for processing a sequence of values.
➔ Advantage: parameters are shared across different parts of the model (the same parameters are applied to the observations at different times).
➔ We consider an RNN operating on a sequence of vectors $x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}$.
➔ In practice, RNNs usually operate on minibatches of such sequences.
Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge.
The idea of computational graph unfolding
➔ Consider the classical form of a dynamical system (in CV, for example, the dynamic model of a moving object in a video sequence, e.g. with constant velocity):
$s^{(t)} = f(s^{(t-1)}; \theta)$
➔ $s^{(t)}$ is called the state of the system
➔ The equation is recurrent: the state at time $t$ is defined from the state at time $t-1$
Unfolding
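➔ For a finite number of time steps $\tau$, the graph can be unfolded by applying the definition $\tau - 1$ times. For example, for $\tau = 3$:
$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta)$
➔ Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not involve recurrence.
[Figure: unfolded computational graph, $s^{(\ldots)} \to s^{(t-1)} \to s^{(t)} \to s^{(t+1)} \to s^{(\ldots)}$, each arrow applying $f$]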
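The unfolding above can be sketched in a few lines of Python. This is only an illustration: the constant-velocity transition `f`, the state layout (position, velocity) and the choice of $\theta$ as a time step are assumptions, not taken from the course.

```python
def f(state, theta):
    """One transition of a constant-velocity model: s(t) = f(s(t-1); theta)."""
    position, velocity = state
    dt = theta  # assumed meaning of theta: time between two frames
    return (position + velocity * dt, velocity)


def unfold(s1, theta, t):
    """Return s(t) by applying f to s(1) exactly t - 1 times."""
    state = s1
    for _ in range(t - 1):
        state = f(state, theta)
    return state


s1 = (0.0, 2.0)   # position 0, velocity 2 (arbitrary illustrative values)
theta = 0.5

# Recurrent definition written out by hand for t = 3 ...
s3_nested = f(f(s1, theta), theta)
# ... and the same state obtained from the unfolded loop.
s3_loop = unfold(s1, theta, 3)

print(s3_nested, s3_loop)
```

Both expressions compute the same state: the loop is simply the nested expression written without recursion.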
RNN as
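➔ The equation using external signals $x^{(t)}$ ($h$ is the state):
$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$
➔ It is possible to use the same transition function $f$ with the same parameters $\theta$ at every time step.
[Figure: unfolded graph, $h^{(\ldots)} \to h^{(t-1)} \to h^{(t)} \to h^{(t+1)} \to h^{(\ldots)}$; each state $h^{(t)}$ is computed by $f$ from $h^{(t-1)}$ and the input $x^{(t)}$]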
And finally
➔ Unfolded recurrent network with Loss
[Figure: unfolded recurrent network. Each input $x^{(t)}$ feeds the hidden state $h^{(t)}$ through $U$; successive hidden states are connected through $W$; each output $o^{(t)}$ is computed from $h^{(t)}$ through $V$; the loss $L^{(t)}$ is computed from $o^{(t)}$ and the target $y^{(t)}$.]
Equations of RNN
➔ Forward propagation:
$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$
$h^{(t)} = f(a^{(t)})$
$o^{(t)} = c + V h^{(t)}$
➔ Parameter estimation: backpropagation and gradient descent
➔ Difficult to train: gradients propagated over many time steps tend to vanish or explode
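A minimal sketch of the forward propagation equations, in plain Python. The activation $f = \tanh$, the toy dimensions and all numeric values are assumptions made for illustration; the slides do not fix them.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of vectors of equal length."""
    return [sum(components) for components in zip(*vectors)]


def rnn_forward(xs, h0, W, U, V, b, c):
    """Run the recurrence over the sequence xs; return hidden states and outputs."""
    h = h0
    hs, outputs = [], []
    for x in xs:
        a = add(b, matvec(W, h), matvec(U, x))   # a(t) = b + W h(t-1) + U x(t)
        h = [math.tanh(a_i) for a_i in a]        # h(t) = f(a(t)), with f = tanh assumed
        o = add(c, matvec(V, h))                 # o(t) = c + V h(t)
        hs.append(h)
        outputs.append(o)
    return hs, outputs


# Toy sizes: 2-dimensional input, 2 hidden units, 1 output (illustrative values).
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, -1.0]]
b = [0.0, 0.0]
c = [0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hs, outputs = rnn_forward(xs, h0=[0.0, 0.0], W=W, U=U, V=V, b=b, c=c)
print(outputs)
```

The same `W`, `U`, `V`, `b` and `c` are reused at every time step, which is the parameter sharing mentioned at the start of the chapter.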
LSTM – Long Short-Term Memory
➔ Gated RNNs
➔ The idea: creating paths through time that have derivatives that neither vanish nor explode
➔ Connection weights may change at each time step
➔ For video analysis, LSTMs have been mainly replaced by 3D convolutional neural networks
Hochreiter and Schmidhuber, 1997
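To make the gating idea concrete, here is a sketch of one step of a scalar LSTM cell in a commonly used formulation with a forget gate (not necessarily the exact 1997 variant). The parameter dictionary `p`, the gate names and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state is kept
    i = sigmoid(pre("input"))         # how much of the candidate is written
    o = sigmoid(pre("output"))        # how much of the cell state is exposed
    g = math.tanh(pre("candidate"))   # candidate value
    c = f * c_prev + i * g            # additive path through time
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget gate almost open, input gate almost closed.
p = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, p)
print(c)
```

With these values the cell state is carried almost unchanged over 50 steps. Because `f` and `i` depend on the input and on the previous state, the effective weight of the self-connection can change at each time step.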
Goal
Improve athletes' performances through tools for teachers and athletes
CBMI - September 6th, 2018 Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis
Goal
- Extract strokes in the temporal dimension
- Classify the strokes
[Figure: input, a video along the time axis t; output, a stroke label such as "Offensive Forehand Loop"]
1 - A new dataset : TTStroke-21
- 129 videos at 120 fps
- 1 387 / 1 074 annotations before / after filtering, for 20 classes
- 1 048 strokes + 272 negative samples extracted
[Figure panels: Acquisition, Annotation platform, Samples (TTStroke-21)]
2 - Related Work
- Use of Dynamic Images [1]
- Very deep 3D CNN [2]
- Long-term Temporal Convolutions [3]
[1] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, “Action recognition with dynamic image networks,” CoRR, vol. abs/1612.00738, 2016.
[2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
[3] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.
3 - Proposed method
Goal: good classification of the extracted strokes
- Use of a deep learning model
- Need for temporal and spatial segmentation
- Data augmentation
Best accuracy: 91.4% against 43.1% for the state-of-the-art method [2]
[2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
[Figure: example stroke, "Offensive Forehand Loop"]
3.a - Model Architecture
Siamese Spatio-Temporal Convolutional Neural Network
Input: (W,H,T) = (100,120,120)
Training:
- Stochastic gradient descent
- Cross-entropy loss: $\ell(x, class) = -x[class] + \log\left(\sum_j \exp(x[j])\right)$
- Learning rate: 0.001 for the Siamese network and 0.01 for one branch
- Nesterov momentum
- Epochs: 2000
- Momentum: 0.5, decreased to 0.1 and 0.05 at epochs 1000 and 1500
- Datasets: training 70%, validation 20%, test 10%
[Figure: 3D convolutions*]
* “IMAGE SUPER RESOLUTION KERAS” from impremedia.net
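The cross-entropy formula and the momentum schedule quoted above can be written out as follows. This is a sketch, not the authors' training code: the function names are hypothetical, the max-subtraction is a standard numerical-stability trick added here, and it is assumed that each momentum value takes effect from the start of the stated epoch.

```python
import math


def cross_entropy(x, target):
    """loss = -x[target] + log(sum_j exp(x[j])), computed in a stable way."""
    m = max(x)  # subtracting the max leaves the value unchanged but avoids overflow
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return -x[target] + log_sum_exp


def momentum_at(epoch):
    """Momentum schedule: 0.5, then 0.1 from epoch 1000, then 0.05 from epoch 1500."""
    if epoch >= 1500:
        return 0.05
    if epoch >= 1000:
        return 0.1
    return 0.5


scores = [2.0, 0.5, -1.0]          # hypothetical raw class scores for 3 classes
print(cross_entropy(scores, 0))    # smaller loss: class 0 has the highest score
print(cross_entropy(scores, 2))    # larger loss: class 2 has the lowest score
print([momentum_at(e) for e in (0, 999, 1000, 1500, 1999)])
```

With uniform scores over $K$ classes the loss equals $\log K$, which is a convenient sanity check at the start of training.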
3.b - Input Data
[Figure: original frame → motion estimation [4] and foreground estimation [5] → foreground motion]
[4] C. Liu, “Beyond pixels: Exploring new representations and applications for motion analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, May 2009.
[5] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, 2006.
3.b - Input Data
Spatial Segmentation using foreground motion
[Figure labels: $X_{max}$, $X_g$, $X_{roi}$; final segmentation]
Smoothing over the temporal dimension using a Gaussian kernel of size 40 and standard deviation 4.44.
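A sketch of the temporal smoothing step, assuming that the smoothed quantity is one coordinate of the region-of-interest centre per frame. The kernel size and standard deviation are the values quoted above; the function names and the border handling (repeating the edge value) are assumptions.

```python
import math


def gaussian_kernel(size=40, sigma=4.44):
    """Normalized 1-D Gaussian weights."""
    center = (size - 1) / 2.0
    weights = [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, kernel):
    """Smooth a 1-D sequence; borders repeat the edge value (an assumption)."""
    n, size = len(signal), len(kernel)
    offset = size // 2
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(t + k - offset, 0), n - 1)  # clamp the index at the borders
            acc += w * signal[idx]
        out.append(acc)
    return out


kernel = gaussian_kernel()
# Hypothetical jittery x-coordinate of the ROI centre over 120 frames.
x_roi = [100 + (5 if t % 2 == 0 else -5) for t in range(120)]
x_roi_smooth = smooth(x_roi, kernel)
print(min(x_roi_smooth), max(x_roi_smooth))
```

Away from the borders, the frame-to-frame jitter of this toy signal is averaged out, so the cropped region moves smoothly over time.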
3.c - Data Augmentation
Online augmentation, applied before spatial segmentation to avoid padding.
Spatial:
- random rotation in range ±10°
- random translation in range ±0.1 in x and y directions
- random homothety (scaling) in range 1 ± 0.1
Temporal:
- 100 successive frames, with the 50th frame selected according to a normal probability distribution along the temporal dimension of the extracted stroke
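The sampling of augmentation parameters described above could look like the sketch below. Several points are assumptions rather than facts from the slides: uniform sampling inside each spatial range, translations expressed as a fraction of the frame size, a normal distribution centred on the middle of the stroke, and the `sigma_fraction` parameter. All function and parameter names are hypothetical.

```python
import random


def sample_spatial_params(rng):
    """Draw one set of spatial augmentation parameters within the quoted ranges."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "translation_x": rng.uniform(-0.1, 0.1),   # assumed: fraction of the frame width
        "translation_y": rng.uniform(-0.1, 0.1),   # assumed: fraction of the frame height
        "scale": rng.uniform(0.9, 1.1),            # homothety factor
    }


def sample_temporal_window(stroke_start, stroke_end, video_length, rng,
                           window=100, sigma_fraction=1 / 6):
    """Pick `window` successive frame indices whose 50th frame is drawn from a
    normal distribution over the stroke (mean and sigma are assumptions)."""
    middle = (stroke_start + stroke_end) / 2.0
    sigma = (stroke_end - stroke_start) * sigma_fraction
    center = int(round(rng.gauss(middle, sigma)))
    center = min(max(center, stroke_start), stroke_end)   # keep it inside the stroke
    first = center - 49                                    # `center` becomes the 50th frame
    # Keep the window inside the video; near the video borders this shifts the
    # window, so `center` may no longer be exactly the 50th frame.
    first = min(max(first, 0), video_length - window)
    return list(range(first, first + window))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
params = sample_spatial_params(rng)
frames = sample_temporal_window(stroke_start=120, stroke_end=200,
                                video_length=300, rng=rng)
print(params)
print(frames[0], frames[49], frames[-1])
```

Drawing new parameters for every training sample is what makes the augmentation "online": no augmented video is stored on disk.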
4 - Results
[Figure: training of our SSTC model]
[Figure: training of the I3D model]
Conclusion
➔ This course is very far from being complete
➔ It was an attempt to give fundamentals and some examples from the author's research
➔ Happy adventure with Deep Learning for your visual data and your problems.
➔ Jenny Benois-Pineau