Loss-based Learning with Weak Supervision
M. Pawan Kumar
About the Talk
• Methods that use the latent structured SVM
• A little mathematical
• Work in its initial stages
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
Weakly Supervised Data
Input x
Output y ∈ {-1,+1}
Hidden h
[Figure: an image x with label y = +1 and a hidden bounding box h]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector Ψ(x,y,h)
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,+1,h) = [Φ(x,h); 0]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,-1,h) = [0; Φ(x,h)]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector Ψ(x,y,h)
Score f : Ψ(x,y,h) → (-∞, +∞)
Optimize the score over all possible y and h

Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
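To make the prediction rule concrete, here is a minimal Python sketch, assuming a small discrete set of candidate hidden values; `psi`, the candidate features `phis` and the weight vector `w` are illustrative stand-ins rather than the talk's actual model.

```python
import numpy as np

def psi(phi, y):
    """Joint feature vector: Phi(x,h) stacked in the top half for y = +1,
    in the bottom half for y = -1, as on the previous slides."""
    d = len(phi)
    out = np.zeros(2 * d)
    if y == +1:
        out[:d] = phi
    else:
        out[d:] = phi
    return out

def predict(w, phis):
    """Return (y, h) maximising wT Psi(x,y,h) over all labels and hidden values."""
    score, y, h = max((w @ psi(phi, y), y, h)
                      for h, phi in enumerate(phis)
                      for y in (+1, -1))
    return y, h

# Toy usage: three candidate hidden values, each with a 4-d feature Phi(x,h).
rng = np.random.default_rng(0)
phis = [rng.normal(size=4) for _ in range(3)]
w = rng.normal(size=8)
print(predict(w, phis))
```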
Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
w* = argminw Σi Δ(yi,yi(w))
Highly non-convex in w
Cannot regularize w to prevent overfitting
Learning Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
Minimize the empirical risk specified by the loss function:
Δ(yi,yi(w)) = Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - wTΨ(xi,yi(w),hi(w))
≤ Δ(yi,yi(w)) + wTΨ(xi,yi(w),hi(w)) - maxhi wTΨ(xi,yi,hi)
≤ maxy,h {wTΨ(xi,y,h) + Δ(yi,y)} - maxhi wTΨ(xi,yi,hi)
Learning Latent SSVM
Training data {(xi,yi), i = 1,2,…,n}
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi
Difference-of-convex program in w
Local minimum or saddle point solution (CCCP)
Learning Latent SSVM (CCCP)
Start with an initial estimate of w
Impute the hidden variables (loss independent):
hi* = argmaxh wTΨ(xi,yi,h)
Update w (loss dependent):
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - wTΨ(xi,yi,hi*) ≤ ξi
Repeat until convergence
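The alternation can be written as a short schematic loop. This is only a sketch: it assumes the two steps are supplied as callables, `impute` solving the loss-independent argmax over h and `update` solving the convex, loss-dependent SSVM problem (e.g. with a cutting-plane solver).

```python
def cccp(w, data, impute, update, max_iters=50, tol=1e-6):
    """CCCP for the latent SSVM: alternate imputation and a convex update.

    impute(w, x, y) -> h*   loss-independent step: argmax_h wT Psi(x, y, h)
    update(data, H) -> w    loss-dependent step: solve the SSVM with hi* fixed
    """
    for _ in range(max_iters):
        H = [impute(w, x, y) for x, y in data]   # impute the hidden variables
        w_new = update(data, H)                  # update w
        if sum((a - b) ** 2 for a, b in zip(w, w_new)) < tol:
            break          # converged (to a local minimum or saddle point)
        w = w_new
    return w
```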
Recap
Scoring function
wTΨ(x,y,h)
Prediction
y(w),h(w) = argmaxy,h wTΨ(x,y,h)
Learning
minw ||w||2 + C Σiξi
s.t. for all y, h: wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Aseem Behl and C. V. Jawahar
Ranking
[Figure: six images in ranks 1–6]
Average Precision = 1

Ranking
[Figure: three alternative rankings of the six images]
Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81
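The AP values above can be reproduced with a small helper. The sketch below assumes the labels of the six images are listed in ranked order (1 = relevant, 0 = irrelevant).

```python
def average_precision(ranked_labels):
    """AP = mean, over the positives, of the precision at each positive's rank."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
print(average_precision([1, 1, 0, 1, 0, 0]))  # 0.916... (≈ 0.92)
print(average_precision([1, 0, 1, 1, 0, 0]))  # 0.805... (≈ 0.81)
```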
Ranking
During testing, AP is frequently used
During training, a surrogate loss is used
Contradictory to loss-based learning
Optimize AP directly
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Yue, Finley, Radlinski and Joachims, 2007
Supervised Learning - Input
Training images X
Bounding boxes H = {HP,HN}
[Figure: positive images P and negative images N]
Supervised Learning - Output
Ranking matrix Y
Yik =
+1 if i is better ranked than k
-1 if k is better ranked than i
0 if i and k are ranked equally
Optimal ranking Y*
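As a small illustration, the matrix can be filled in directly from each sample's rank; `rank`, a hypothetical list giving each sample's position (1 = best, ties allowed), is not part of the talk's notation.

```python
import numpy as np

def ranking_matrix(rank):
    """Build Y from per-sample rank positions, following the definition above."""
    n = len(rank)
    Y = np.zeros((n, n), dtype=int)
    for i in range(n):
        for k in range(n):
            if rank[i] < rank[k]:
                Y[i, k] = +1   # i is better ranked than k
            elif rank[i] > rank[k]:
                Y[i, k] = -1   # k is better ranked than i
    return Y                   # 0 on ties and on the diagonal

print(ranking_matrix([1, 3, 2]))
```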
SSVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using SSVM
Y(w) = argmaxY wTΨ(X,Y, {HP,HN})
Sort by value of sample score wTΦ(xi,hi)
Same as standard binary SVM
Learning SSVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of prediction
Learning SSVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP,HN}) - wTΨ(X,Y(w),{HP,HN})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP,HN}) - wTΨ(X,Y*,{HP,HN})
≤ maxY {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - wTΨ(X,Y*,{HP,HN})

minw ||w||2 + C ξ
s.t. maxY {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - wTΨ(X,Y*,{HP,HN}) ≤ ξ
The maxY term above is the loss augmented inference problem
Loss Augmented Inference
[Figure: six images in ranks 1–6]
Rank the positives according to their sample scores
Rank the negatives according to their sample scores
Slide the best negative to a higher rank; continue until the score stops increasing
Slide the next negative to a higher rank; continue until the score stops increasing
Terminate after considering the last negative
This greedy procedure is the optimal loss augmented inference
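A brute-force rendering of this greedy procedure is sketched below: each negative in turn is slid up one rank at a time and kept at the last position where the loss augmented objective wTΨ + Δ still increased. For clarity the objective is recomputed from scratch at every step (the algorithm of Yue et al. computes the changes incrementally), and the example scores are made up.

```python
def ap(ranking):
    """AP of a ranking given as a list of ('p'/'n', score) pairs."""
    hits, precisions = 0, []
    for k, (label, _) in enumerate(ranking, start=1):
        if label == 'p':
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

def objective(ranking, n_pos, n_neg):
    """Loss augmented objective wTΨ(X,Y,.) + Δ(Y*,Y) for one interleaving."""
    score = 0.0
    for a, (la, sa) in enumerate(ranking):
        for lb, sb in ranking[a + 1:]:
            if la != lb:
                score += sa - sb   # Yik (si - sk): the higher-ranked score enters with +
    return score / (n_pos * n_neg) + (1.0 - ap(ranking))

def loss_augmented_inference(pos_scores, neg_scores):
    pos = sorted(pos_scores, reverse=True)    # rank positives by sample score
    neg = sorted(neg_scores, reverse=True)    # rank negatives by sample score
    ranking = [('p', s) for s in pos] + [('n', s) for s in neg]
    n_pos, n_neg = len(pos), len(neg)
    for j in range(n_pos, n_pos + n_neg):     # consider the best negative first
        i = j
        while i > 0 and ranking[i - 1][0] == 'p':        # slide past positives only
            cand = ranking[:]
            cand[i - 1], cand[i] = cand[i], cand[i - 1]  # one rank higher
            if objective(cand, n_pos, n_neg) <= objective(ranking, n_pos, n_neg):
                break                         # the score stopped increasing
            ranking, i = cand, i - 1
    return ranking

print(loss_augmented_inference([2.0, 0.5], [1.5, -1.0]))
```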
Recap
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction
Y(w) = argmaxY wTΨ(X,Y,{HP,HN})
Learning
Using optimal loss augmented inference
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Weakly Supervised Learning - Input
Training images X

Weakly Supervised Learning - Latent
Training images X, latent bounding boxes HP
All bounding boxes in negative images are negative
Intuitive Prediction Procedure
Select the best bounding boxes in all images
[Figure: six images in ranks 1–6]
Rank them according to their sample scores
Weakly Supervised Learning - Output
Ranking matrix Y
Yik =
+1 if i is better ranked than k
-1 if k is better ranked than i
0 if i and k are ranked equally
Optimal ranking Y*
Latent SSVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using Latent SSVM
maxY,H wTΨ(X,Y,{HP,HN})
= maxY,H (1/|P||N|) ΣiP ΣkN Yik wT(Φ(xi,hi) - Φ(xk,hk))
Choose best bounding box for positives
Choose worst bounding box for negatives
Not what we wanted
Learning Latent SSVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of the prediction
Learning Latent SSVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y(w),{HP(w),HN(w)})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - maxH wTΨ(X,Y*,{HP,HN})
≤ maxY,H {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - maxH wTΨ(X,Y*,{HP,HN})

minw ||w||2 + C ξ
s.t. maxY,H {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y)} - maxH wTΨ(X,Y*,{HP,HN}) ≤ ξ
The loss augmented inference (the maxY,H term above) cannot be solved optimally
Recap
Unintuitive prediction
Non-optimal loss augmented inference
Unintuitive objective function
Can we do better?
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Latent AP-SVM Formulation
Joint feature vector
Ψ(X,Y,{HP,HN}) = (1/|P||N|) ΣiP ΣkN Yik (Φ(xi,hi) - Φ(xk,hk))
Scoring function
wTΨ(X,Y,{HP,HN})
Prediction using Latent AP-SVM
Choose the best bounding box for all samples:
hi(w) = argmaxh wTΦ(xi,h)
Optimize over the ranking:
Y(w) = argmaxY wTΨ(X,Y,{HP(w),HN(w)})
Sort by sample scores
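Here is a minimal sketch of this two-step prediction rule; the per-image candidate box features are random stand-ins for Φ(x,h).

```python
import numpy as np

def predict_latent_ap_svm(w, boxes_per_image):
    """Pick the best box in every image, then rank images by that best score."""
    best_scores = []
    for boxes in boxes_per_image:          # boxes: one feature row per candidate h
        scores = boxes @ w                 # wT Phi(x, h) for every candidate box
        best_scores.append(scores.max())   # hi(w) = argmax_h wT Phi(xi, h)
    order = np.argsort(best_scores)[::-1]  # Y(w): sort images by best sample score
    return order, best_scores

# Toy usage: 3 images, 4 candidate boxes each, 5-d features.
rng = np.random.default_rng(1)
images = [rng.normal(size=(4, 5)) for _ in range(3)]
w = rng.normal(size=5)
print(predict_latent_ap_svm(w, images))
```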
Learning Latent AP-SVM
minw Δ(Y*,Y(w))
Loss = 1 – AP of the prediction
Learning Latent AP-SVM
Δ(Y*,Y(w)) = Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y(w),{HP(w),HN(w)})
≤ Δ(Y*,Y(w)) + wTΨ(X,Y(w),{HP(w),HN(w)}) - wTΨ(X,Y*,{HP(w),HN(w)})
≤ maxY,HN {wTΨ(X,Y,{HP(w),HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP(w),HN})}
= minHP maxY,HN {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP,HN})}
(HP(w) minimizes this upper bound over HP)

minw ||w||2 + C ξ
s.t. minHP maxY,HN {wTΨ(X,Y,{HP,HN}) + Δ(Y*,Y) - wTΨ(X,Y*,{HP,HN})} ≤ ξ
CCCP
Start with an initial estimate of w
Impute the hidden variables
Update w
Repeat until convergence
The above algorithm is optimal

Imputing Hidden Variables
Choose the best bounding boxes according to the sample scores
Loss Augmented Inference
Choose best bounding boxes according to sample score
Loss Augmented Inference
[Figure: six images in ranks 1–6]
Slide the best negative to a higher rank; continue until the score stops increasing
Slide the next negative to a higher rank; continue until the score stops increasing
Terminate after considering the last negative
This greedy procedure is the optimal loss augmented inference
Recap
Intuitive prediction
Optimal loss augmented inference
Intuitive objective function
Performance in practice?
Outline
• Latent SSVM
• Ranking
  – Supervised Learning
  – Weakly Supervised Learning
  – Latent AP-SVM
  – Experiments
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI
Dataset
• VOC 2011 action classification
• 10 action classes + other
• 2424 ‘trainval’ images
• 2424 ‘test’ images
  – Hidden annotations
  – Evaluated using a remote server
  – Only AP values are computed
Baselines
• Latent SSVM with 0/1 loss (latent SVM)
  – Relative loss weight C
  – Relative positive sample weight J
  – Robustness threshold K
• Latent SSVM with AP loss (latent SSVM)
  – Relative loss weight C
  – Approximate greedy inference algorithm
• 5 random initializations
• 5-fold cross-validation (80-20 split)
Cross-Validation
Statistically significant improvement
Test
[Figure: bar chart of test AP (y-axis 36–46) for Latent SVM, Latent SSVM and Latent AP-SVM]
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Wojciech Zaremba, Alexander Gramfort and Matthew Blaschko
IPMI 2013
M/EEG Data
Faster activation (subject is familiar with the task)
Slower activation (subject is bored with the task)
Classifying M/EEG Data
Statistically significant improvement
Functional Connectivity
• visual cortex → deep subcortical source
• visual cortex → higher level cognitive processing
Connected components have similar delays
Outline
• Latent SSVM
• Ranking
• Brain Activation Delays in M/EEG
• Probabilistic Segmentation of MRI

Joint Work with Pierre-Yves Baudin, Danny Goodman, Puneet Kumar, Nikos Paragios, Noura Azzabou and Pierre Carlier
MICCAI 2013
Training Data
Annotators provide a ‘hard’ segmentation
Random Walks provides a ‘soft’ segmentation
Best ‘soft’ segmentation?

Segmentation
Statistically significant improvement
To Conclude …
• Choice of loss function matters during training
• Many interesting latent variables
  – Computer Vision (onerous annotations)
  – Medical Imaging (impossible annotations)
• Large-scale experiments
  – Other problems
  – General loss
  – Efficient optimization
Questions?
http://www.centrale-ponts.fr/personnel/pawan
SPLENDID
Self-Paced Learning for Exploiting Noisy, Diverse or Incomplete Data
Nikos Paragios, Equipe Galen, INRIA Saclay
Daphne Koller, DAGS, Stanford
Machine Learning: Weak Annotations, Noisy Annotations
Applications: Computer Vision, Medical Imaging
Visits between INRIA Saclay and Stanford University