Loss-based Learning with Weak Supervision
M. Pawan Kumar
About the Talk
• Methods that use the latent structured SVM
• A little mathematical
• Work in its initial stages
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Weakly Supervised Data
Input x
Output y ∈ {-1,+1}
Hidden h
[Figure: an image x with label y = +1 and a hidden bounding box h]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector Ψ(x,y,h)
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,+1,h) = [Φ(x,h); 0]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,-1,h) = [0; Φ(x,h)]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector Ψ(x,y,h)
Score f : Ψ(x,y,h) → (-∞, +∞)
Optimize the score over all possible y and h

Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
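To make the prediction rule concrete, here is a minimal Python sketch, assuming a small discrete set of candidate hidden values; `psi`, the candidate features `phis` and the weight vector `w` are illustrative stand-ins rather than the talk's actual model.

```python
import numpy as np

def psi(phi, y):
    """Joint feature vector: Phi(x,h) stacked in the top half for y = +1,
    in the bottom half for y = -1, as on the previous slides."""
    d = len(phi)
    out = np.zeros(2 * d)
    if y == +1:
        out[:d] = phi
    else:
        out[d:] = phi
    return out

def predict(w, phis):
    """Return (y, h) maximising wT Psi(x,y,h) over all labels and hidden values."""
    score, y, h = max((w @ psi(phi, y), y, h)
                      for h, phi in enumerate(phis)
                      for y in (+1, -1))
    return y, h

# Toy usage: three candidate hidden values, each with a 4-d feature Phi(x,h).
rng = np.random.default_rng(0)
phis = [rng.normal(size=4) for _ in range(3)]
w = rng.normal(size=8)
print(predict(w, phis))
```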
Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
w* = argminw Σi Δ(yi,yi(w))
Highly non-convex in w
Cannot regularize w to prevent overfitting
Learning Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
Minimize the empirical risk specified by the loss function:
Δ(yi,yi(w)) = Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - wTΨ(xi,yi(w),hi(w))
≤ Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - maxhi wTΨ(xi,yi,hi)
≤ maxy,h {wTΨ(xi,y,h) + Δ(yi,y)} - maxhi wTΨ(xi,yi,hi)
Learning Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi
Difference-of-convex program in w
Local minimum or saddle point solution (CCCP)
Learning Latent SSVM (CCCP)
Start with an initial estimate of w
Impute the hidden variables (loss independent):
hi* = argmaxh wTΨ(xi,yi,h)
Update w (loss dependent):
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - wTΨ(xi,yi,hi*) ≤ ξi
Repeat until convergence
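The alternation can be written as a short schematic loop. This is only a sketch: it assumes the two steps are supplied as callables, `impute` solving the loss-independent argmax over h and `update` solving the convex, loss-dependent SSVM problem (e.g. with a cutting-plane solver).

```python
def cccp(w, data, impute, update, max_iters=50, tol=1e-6):
    """CCCP for the latent SSVM: alternate imputation and a convex update.

    impute(w, x, y) -> h*   loss-independent step: argmax_h wT Psi(x, y, h)
    update(data, H) -> w    loss-dependent step: solve the SSVM with hi* fixed
    """
    for _ in range(max_iters):
        H = [impute(w, x, y) for x, y in data]   # impute the hidden variables
        w_new = update(data, H)                  # update w
        if sum((a - b) ** 2 for a, b in zip(w, w_new)) < tol:
            break          # converged (to a local minimum or saddle point)
        w = w_new
    return w
```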
Recap
Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
Learning
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Aseem Behl and C. V. Jawahar
Ranking
[Figure: six images in ranks 1–6]
Average Precision = 1

Ranking
[Figure: three alternative rankings of the six images]
Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81
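The AP values above can be reproduced with a small helper. The sketch below assumes the labels of the six images are listed in ranked order (1 = relevant, 0 = irrelevant).

```python
def average_precision(ranked_labels):
    """AP = mean, over the positives, of the precision at each positive's rank."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # 0.916... (≈ 0.92)
print(average_precision([1, 0, 1, 1, 0, 0]))  # 0.805... (≈ 0.81)
```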
Ranking
During testing, AP is frequently used
During training, a surrogate loss is used
Contradictory to loss-based learning
Optimize AP directly
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Yue, Finley, Radlinski and Joachims, 2007
Supervised Learning - Input
Training images X
Bounding boxes H = {HP,HN}
[Figure: positive images P and negative images N]
Supervised Learning - Output
Ranking matrix Y
Yik =
+1 if i is better ranked than k
-1 if k is better ranked than i
0 if i and k are ranked equally
Optimal ranking Y*
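As a small illustration, the matrix can be filled in directly from each sample's rank; `rank`, a hypothetical list giving each sample's position (1 = best, ties allowed), is not part of the talk's notation.

```python
import numpy as np

def ranking_matrix(rank):
    """Build Y from per-sample rank positions, following the definition above."""
    n = len(rank)
    Y = np.zeros((n, n), dtype=int)
    for i in range(n):
        for k in range(n):
            if rank[i] < rank[k]:
                Y[i, k] = +1   # i is better ranked than k
            elif rank[i] > rank[k]:
                Y[i, k] = -1   # k is better ranked than i
    return Y                   # 0 on ties and on the diagonal

print(ranking_matrix([1, 3, 2]))
```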
SSVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using SSVM
Y(w) = argmaxY wTΨ(X,Y, {HP,HN})
Sort by value of sample score wTΦ(xi,hi)
Same as standard binary SVM
Learning SSVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of prediction
Learning SSVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP,HN}) - wTΨ(X,Y(w),{HP,HN})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP,HN}) - wTΨ(X,Y*,{HP,HN})
≤ maxY {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - wTΨ(X,Y*,{HP,HN})

minw ||w||2 + C ξ
s.t. maxY {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - wTΨ(X,Y*,{HP,HN}) ≤ ξ
The maxY term above is the loss augmented inference problem
Loss Augmented Inference
[Figure: six images in ranks 1–6]
Rank the positives according to their sample scores
Rank the negatives according to their sample scores
Slide the best negative to a higher rank; continue until the score stops increasing
Slide the next negative to a higher rank; continue until the score stops increasing
Terminate after considering the last negative
This greedy procedure is the optimal loss augmented inference
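A brute-force rendering of this greedy procedure is sketched below: each negative in turn is slid up one rank at a time and kept at the last position where the loss augmented objective wTΨ + Δ still increased. For clarity the objective is recomputed from scratch at every step (the algorithm of Yue et al. computes the changes incrementally), and the example scores are made up.

```python
def ap(ranking):
    """AP of a ranking given as a list of ('p'/'n', score) pairs."""
    hits, precisions = 0, []
    for k, (label, _) in enumerate(ranking, start=1):
        if label == 'p':
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

def objective(ranking, n_pos, n_neg):
    """Loss augmented objective wTΨ(X,Y,.) + Δ(Y*,Y) for one interleaving."""
    score = 0.0
    for a, (la, sa) in enumerate(ranking):
        for lb, sb in ranking[a + 1:]:
            if la != lb:
                score += sa - sb   # Yik (si - sk): the higher-ranked score enters with +
    return score / (n_pos * n_neg) + (1.0 - ap(ranking))

def loss_augmented_inference(pos_scores, neg_scores):
    pos = sorted(pos_scores, reverse=True)    # rank positives by sample score
    neg = sorted(neg_scores, reverse=True)    # rank negatives by sample score
    ranking = [('p', s) for s in pos] + [('n', s) for s in neg]
    n_pos, n_neg = len(pos), len(neg)
    for j in range(n_pos, n_pos + n_neg):     # consider the best negative first
        i = j
        while i > 0 and ranking[i - 1][0] == 'p':        # slide past positives only
            cand = ranking[:]
            cand[i - 1], cand[i] = cand[i], cand[i - 1]  # one rank higher
            if objective(cand, n_pos, n_neg) <= objective(ranking, n_pos, n_neg):
                break                         # the score stopped increasing
            ranking, i = cand, i - 1
    return ranking

print(loss_augmented_inference([2.0, 0.5], [1.5, -1.0]))
```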
Recap
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction
Y(w) = argmaxY wTΨ(X,Y,{HP,HN})
Learning
Using optimal loss augmented inference
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Weakly Supervised Learning - Input
Training images X

Weakly Supervised Learning - Latent
Training images X, latent bounding boxes HP
All bounding boxes in negative images are negative
Intuitive Prediction Procedure
Select the best bounding boxes in all images
[Figure: six images in ranks 1–6]
Rank them according to their sample scores
Weakly Supervised Learning - Output
Ranking matrix Y
Yik =
+1 if i is better ranked than k
-1 if k is better ranked than i
0 if i and k are ranked equally
Optimal ranking Y*
Latent SSVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using Latent SSVM
maxY,H wTΨ(X,Y,{HP,HN})
= maxY,H (1/|P||N|) ΣiP ΣkN Yik wT(Φ(xi,hi) - Φ(xk,hk))
Choose best bounding box for positives
Choose worst bounding box for negatives
Not what we wanted
Learning Latent SSVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of the prediction
Learning Latent SSVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y(w),{HP(w),HN(w)})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - maxH wTΨ(X,Y*,{HP,HN})
≤ maxY,H {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - maxH wTΨ(X,Y*,{HP,HN})

minw ||w||2 + C ξ
s.t. maxY,H {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - maxH wTΨ(X,Y*,{HP,HN}) ≤ ξ
The loss augmented inference (the maxY,H term above) cannot be solved optimally
Recap
Unintuitive prediction
Non-optimal loss augmented inference
Unintuitive objective function
Can we do better?
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Latent AP-SVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using Latent AP-SVM
Choose the best bounding box for all samples:
hi(w) = argmaxh wTΦ(xi,h)
Optimize over the ranking:
Y(w) = argmaxY wTΨ(X,Y,{HP(w),HN(w)})
Sort by sample scores
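Here is a minimal sketch of this two-step prediction rule; the per-image candidate box features are random stand-ins for Φ(x,h).

```python
import numpy as np

def predict_latent_ap_svm(w, boxes_per_image):
    """Pick the best box in every image, then rank images by that best score."""
    best_scores = []
    for boxes in boxes_per_image:          # boxes: one feature row per candidate h
        scores = boxes @ w                 # wT Phi(x, h) for every candidate box
        best_scores.append(scores.max())   # hi(w) = argmax_h wT Phi(xi, h)
    order = np.argsort(best_scores)[::-1]  # Y(w): sort images by best sample score
    return order, best_scores

# Toy usage: 3 images, 4 candidate boxes each, 5-d features.
rng = np.random.default_rng(1)
images = [rng.normal(size=(4, 5)) for _ in range(3)]
w = rng.normal(size=5)
print(predict_latent_ap_svm(w, images))
```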
Learning Latent AP-SVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of the prediction
Learning Latent AP-SVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y(w),{HP(w),HN(w)})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y*,{HP(w),HN(w)})
≤ maxY,HN {wTΨ(X,Y,{HP(w),HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP(w),HN})}
= minHP maxY,HN {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP,HN})}
(HP(w) minimizes this upper bound over HP)

minw ||w||2 + C ξ
s.t. minHP maxY,HN {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP,HN})} ≤ ξ
CCCP
Start with an initial estimate of w
Impute the hidden variables
Update w
Repeat until convergence
The above algorithm is optimal

Imputing Hidden Variables
Choose the best bounding boxes according to the sample scores
Loss Augmented Inference
Choose best bounding boxes according to sample score
Loss Augmented Inference
[Figure: six images in ranks 1–6]
Slide the best negative to a higher rank; continue until the score stops increasing
Slide the next negative to a higher rank; continue until the score stops increasing
Terminate after considering the last negative
This greedy procedure is the optimal loss augmented inference
Recap
Intuitive prediction
Optimal loss augmented inference
Intuitive objective function
Performance in practice?
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Dataset
• VOC 2011 action classification
• 10 action classes + other
• 2424 ‘trainval’ images
• 2424 ‘test’ images
  – Hidden annotations
  – Evaluated using a remote server
  – Only AP values are computed
Baselines
• Latent SSVM with 0/1 loss (latent SVM)
  – Relative loss weight C
  – Relative positive sample weight J
  – Robustness threshold K
• Latent SSVM with AP loss (latent SSVM)
  – Relative loss weight C
  – Approximate greedy inference algorithm
• 5 random initializations
• 5-fold cross-validation (80-20 split)
Cross-Validation
Statistically significant improvement
Test
[Figure: bar chart of test AP (y-axis 36–46) for Latent SVM, Latent SSVM and Latent AP-SVM]
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Wojciech Zaremba, Alexander Gramfort and Matthew Blaschko
IPMI 2013
M/EEG Data
Faster activation (subject is familiar with the task)
Slower activation (subject is bored with the task)
Classifying M/EEG Data
Statistically significant improvement
Functional Connectivity
• visual cortex → deep subcortical source
• visual cortex → higher level cognitive processing
Connected components have similar delays
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Pierre-Yves Baudin, Danny Goodman, Puneet Kumar, Nikos Paragios, Noura Azzabou and Pierre Carlier
MICCAI 2013
Training Data
Annotators provide a ‘hard’ segmentation
Random Walks provides a ‘soft’ segmentation
Best ‘soft’ segmentation?

Segmentation
Statistically significant improvement
To Conclude …
• Choice of loss function matters during training
• Many interesting latent variables
  – Computer Vision (onerous annotations)
  – Medical Imaging (impossible annotations)
• Large-scale experiments
  – Other problems
  – General loss
  – Efficient optimization
Questions?
http://www.centrale-ponts.fr/personnel/pawan
SPLENDID
Self-Paced Learning for Exploiting Noisy, Diverse or Incomplete Data
Nikos Paragios, Equipe Galen, INRIA Saclay
Daphne Koller, DAGS, Stanford
Machine Learning: Weak Annotations, Noisy Annotations
Applications: Computer Vision, Medical Imaging
Visits between INRIA Saclay and Stanford University