
Page 1:

Active learning in sequence labeling

Tomas Sabata

11. 5. 2017

Czech Technical University in Prague

Faculty of Information Technology

Department of Theoretical Computer Science

Page 2:

Table of contents

1. Introduction

2. Active learning

3. Graphical models

4. Active learning in sequence labeling

5. Semi-supervised active learning in sequence labeling

6. Experiment

7. Summary

Page 3:

Introduction

Page 4:

Sequence modeling and labeling problem definition

• Sequence of states

• Sequence of observations

Figure 1: Sequence representation

Page 5:

Sequence modeling and labeling problem definition

1. Sequence modeling

• Given a sequence of states/labels and a sequence of observations, find the model that most likely generated the sequences.

2. Sequence labeling

• Given a sequence of observations, determine an appropriate label/state for each observation.

• Errors are reduced by exploiting the relations between neighboring labels.

Page 6:

Applications

• Handwriting recognition

• Facial expression dynamic modeling

• DNA analysis

• Part-of-speech tagging

• Speech recognition

• Video analysis

Page 7:

Active learning

Page 8:

What is active learning?

• The quality of labels makes a huge difference.

Garbage in, garbage out.

• Obtaining "golden" annotation data can be really expensive.

Page 9:

What is active learning?

Figure 2: Active learning cycle

Page 10:

Scenarios

Figure 3: Active learning scenarios

Page 11:

AL algorithm

Data:

L - set of labeled examples

U - set of unlabeled examples

θ - utility function

while the stopping criterion is not met do

1. learn model M from L;

2. for all x_i ∈ U: u_i ← θ_M(x_i);

3. select the example x* ∈ U with the highest utility u_i;

4. query the annotator for the label y of example x*;

5. move ⟨y, x*⟩ to L;

end

return L

Algorithm 1: General pool-based AL framework
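A minimal Python sketch of Algorithm 1; the `train`, `utility`, `query_annotator`, and `stop` callables are hypothetical stand-ins for the model-fitting step, the utility function θ, the human annotator, and the stopping criterion:

```python
def pool_based_al(L, U, train, utility, query_annotator, stop):
    """Generic pool-based active learning loop.

    L: list of (x, y) labeled pairs; U: list of unlabeled examples.
    train, utility, query_annotator, stop are caller-supplied callables.
    """
    while not stop(L):
        model = train(L)                            # 1. learn model M from L
        scores = {x: utility(model, x) for x in U}  # 2. score every unlabeled example
        x_star = max(scores, key=scores.get)        # 3. pick the highest-utility example
        y = query_annotator(x_star)                 # 4. ask the annotator for its label
        U.remove(x_star)
        L.append((x_star, y))                       # 5. move <y, x*> to L
    return L
```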

Page 12:

Query strategy frameworks

1. Uncertainty Sampling

2. Query-By-Committee

3. Expected Model Change

4. Expected Error Reduction

5. Variance Reduction

6. Density-Weighted Methods

Page 13:

Frameworks: Uncertainty Sampling

• Simplest, most commonly used

• Intuitive for probabilistic learning models

• Binary problems: choose the instance with posterior probability closest to 0.5

• Multiclass problems:

  • Least confident: $x^*_{LC} = \arg\max_x \left(1 - P_\theta(\hat{y} \mid x)\right)$, where $\hat{y}$ is the most likely label

  • Margin sampling: $x^*_{M} = \arg\min_x \left(P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)\right)$

  • Entropy: $x^*_{H} = \arg\max_x -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$
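A small numpy sketch of the three measures above (function and variable names are illustrative only):

```python
import numpy as np

def uncertainty_scores(p):
    """Compute the three multiclass uncertainty measures.

    p: 1-D array of posterior probabilities P_theta(y_i | x), summing to 1.
    """
    p_sorted = np.sort(p)[::-1]              # descending
    least_confident = 1.0 - p_sorted[0]      # 1 - P(y_hat | x), maximized
    margin = p_sorted[0] - p_sorted[1]       # small margin = uncertain, minimized
    nz = p[p > 0]                            # avoid log(0)
    entropy = -np.sum(nz * np.log(nz))       # maximized
    return least_confident, margin, entropy

# Example: posteriors spread over three classes -> fairly uncertain instance.
print(uncertainty_scores(np.array([0.5, 0.3, 0.2])))
```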

Page 14:

Frameworks: Uncertainty Sampling

Figure 4: Uncertainty sampling for a three-class classification problem

• The choice of measure is application dependent

• Entropy: suited to minimizing log-loss

• LC and margin: suited to minimizing classification error

Page 15:

Frameworks: Query-By-Committee

• We maintain a committee $\mathcal{C} = \{\theta^{(1)}, \dots, \theta^{(C)}\}$ of models trained on $\mathcal{L}$

• The most informative query is considered to be the instance about which the committee members most disagree

• We need to ensure variability of the models in the beginning

• Measures of disagreement:

  • Vote entropy: $x^*_{VE} = \arg\max_x -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$

  • Kullback-Leibler divergence: $x^*_{KL} = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{KL}\big(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\big)$
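A numpy sketch of both disagreement measures, assuming the committee's predictions are already available (names are illustrative; the KL variant assumes strictly positive posteriors):

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Vote entropy from hard committee votes (one label per member)."""
    C = len(votes)
    frac = np.bincount(votes, minlength=n_classes) / C   # V(y_i) / C
    frac = frac[frac > 0]
    return -np.sum(frac * np.log(frac))

def mean_kl_divergence(posteriors):
    """Mean KL divergence of each member from the committee consensus.

    posteriors: (C, n_classes) array; row c is P_{theta^(c)}(y | x).
    """
    consensus = posteriors.mean(axis=0)                  # P_C(y | x)
    kl = np.sum(posteriors * np.log(posteriors / consensus), axis=1)
    return kl.mean()

print(vote_entropy([0, 0, 1, 2], n_classes=3))
print(mean_kl_divergence(np.array([[0.7, 0.2, 0.1],
                                   [0.4, 0.4, 0.2]])))
```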

Page 16:

Frameworks: Expected Model Change

• Applicable to models trained with gradient-based methods.

• Query the instance that would cause the largest change to the model.

• Use the gradient of the objective function, $\nabla \ell_\theta(\mathcal{L})$

• $x^*_{EMC} = \arg\max_x \sum_i P_\theta(y_i \mid x) \, \big\| \nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle) \big\|$

• Note: $\| \nabla \ell_\theta(\mathcal{L}) \|$ should be close to zero after training, therefore we can use the approximation $\| \nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle) \| \approx \| \nabla \ell_\theta(\langle x, y_i \rangle) \|$
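For intuition, a sketch of this score for binary logistic regression, chosen here only because its per-example gradient is simple; the slide's formula applies to any gradient-trained model:

```python
import numpy as np

def expected_gradient_length(x, w):
    """Expected model change score for binary logistic regression.

    Uses the approximation ||grad l(L + <y_i, x>)|| ~= ||grad l(<y_i, x>)||,
    justified because the gradient on L is near zero after training.
    """
    p1 = 1.0 / (1.0 + np.exp(-w @ x))     # P_theta(y = 1 | x)
    score = 0.0
    for y, p_y in ((0, 1.0 - p1), (1, p1)):
        grad = (p1 - y) * x               # gradient of the log-loss on <y, x>
        score += p_y * np.linalg.norm(grad)
    return score

print(expected_gradient_length(np.array([1.0, -2.0]), np.array([0.3, 0.1])))
```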

Page 17:

Frameworks: Expected Error Reduction

• Estimate the expected future error of a model trained on $\mathcal{L} \cup \langle x, y \rangle$

• Methods:

  • Minimizing the expected 0/1-loss:
    $x^* = \arg\min_x \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}(\hat{y} \mid x^{(u)}) \right)$

  • Minimizing the expected log-loss:
    $x^* = \arg\min_x \sum_i P_\theta(y_i \mid x) \left( -\sum_{u=1}^{U} \sum_j P_{\theta^{+\langle x, y_i \rangle}}(y_j \mid x^{(u)}) \log P_{\theta^{+\langle x, y_i \rangle}}(y_j \mid x^{(u)}) \right)$

• In most cases the most computationally expensive query framework:

  • Logistic regression: $O(ULG)$

  • CRF: $O(TM^{T+2}ULG)$

Page 18:

Frameworks: Variance Reduction

• Uses the bias-variance decomposition:
  $E_T[(\hat{y} - y)^2 \mid x] = E[(y - E[y \mid x])^2] + (E_{\mathcal{L}}[\hat{y}] - E[y \mid x])^2 + E_{\mathcal{L}}[(\hat{y} - E_{\mathcal{L}}[\hat{y}])^2]$
  (noise + bias² + variance; only the variance term can be reduced by the choice of training data)

• Model-dependent framework

Page 19:

Frameworks: Density-Weighted Methods

• Informative instances should not only be those which are uncertain, but also those which are "representative" of the underlying distribution

• Uses one of the other query strategies as a base strategy (e.g. uncertainty sampling)

• $x^*_{ID} = \arg\max_x \, \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}(x, x^{(u)}) \right)^{\beta}$

• The method is more robust to outliers in the dataset
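A numpy sketch of information density with cosine similarity standing in for `sim` (one of several reasonable choices); the pairwise similarities are precomputed in one O(U²) pass:

```python
import numpy as np

def information_density(base_scores, X, beta=1.0):
    """Re-weight base utilities phi_A(x) by average similarity to the pool.

    base_scores: (U,) utilities from the base strategy (e.g. least confident).
    X: (U, d) feature vectors of the unlabeled pool.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                          # sim(x, x^(u)) for all pairs
    density = S.mean(axis=1) ** beta       # ((1/U) sum_u sim(x, x^(u)))^beta
    return base_scores * density           # query x* = argmax of this

scores = information_density(np.array([0.9, 0.2, 0.4]),
                             np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
print(scores.argmax())
```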

Page 20:

Active learning problem variants

• Active Learning for Structured Outputs

  • The instance is not represented by a single feature vector, but rather by a structure.

  • e.g. sequences, trees, grammars

• Active Feature Acquisition

  • Selection of salient unused features

  • e.g. medical tests, sensitive information

• Active Class Selection

  • The learner is allowed to query for a known class label, and obtaining each instance incurs a cost.

• Active Clustering

  • Generate (or subsample) instances in such a way that they self-organize into groupings

  • Try to get less overlap or noise than with random sampling


Page 22:

Graphical models

Page 23:

Models: Markov model

Figure 5: Markov model/chain

Page 24:

Models: Hidden Markov model

Figure 6: Hidden Markov model

Page 25:

Models: Hidden Markov model

$\lambda = (A, B, \pi)$

• Set of hidden states $Y = \{y_1, y_2, \dots, y_N\}$, set of observable values $X = \{x_1, x_2, \dots, x_M\}$

• Sequence of states $Q = q_1 q_2 q_3 \dots q_T$, sequence of outputs $O = o_1 o_2 o_3 \dots o_T$

• Transition probability matrix $A = \{a_{ij}\}$, $a_{ij} = P(q_t = y_j \mid q_{t-1} = y_i)$, $1 \le i, j \le N$

• Emission probability distribution $B = \{b_{ij}\}$, $b_{ij} = P(o_t = x_j \mid q_t = y_i)$, $1 \le i \le N$, $1 \le j \le M$

• Initial probability distribution $\pi = \{\pi_i\}$, $\pi_i = P(q_1 = y_i)$, $1 \le i \le N$
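A toy instantiation of $\lambda = (A, B, \pi)$ in numpy, together with the forward algorithm for evaluating $P(O \mid \lambda)$; the parameter values are made up for illustration:

```python
import numpy as np

# Toy HMM with N = 2 hidden states and M = 3 observable values.
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])        # a_ij = P(q_t = y_j | q_{t-1} = y_i)
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])   # b_ij = P(o_t = x_j | q_t = y_i)
pi = np.array([0.6, 0.4])          # pi_i = P(q_1 = y_i)

def forward(O):
    """P(O | lambda) via the forward algorithm, O(T * N^2)."""
    alpha = pi * B[:, O[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]  # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()

print(forward([0, 2, 1]))   # likelihood of observing x_1, x_3, x_2 (0-based indices)
```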

Page 26:

Models: Conditional random field (linear chain)

Figure 7: Linear chain conditional random field

• Discriminative model $P(Y \mid X)$; we do not explicitly model $P(X)$.

• Performs better than HMMs when the true data distribution has higher-order dependencies than the model.

• $P(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right)$

• $Z(X) = \sum_{Y} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right)$
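To make $Z(X)$ concrete: the sum runs over all $M^T$ label sequences, but the chain structure allows a forward recursion in $O(TM^2)$. A sketch, assuming the log-potentials $\sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$ have been precomputed into an array:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(log_psi, log_psi0):
    """log Z(X) for a linear-chain CRF via the forward recursion.

    log_psi:  (T-1, M, M); log_psi[t, i, j] = sum_k theta_k f_k(y=j, y_prev=i, x_t)
    log_psi0: (M,) initial log-potentials for y_1.
    """
    log_alpha = log_psi0
    for t in range(log_psi.shape[0]):
        # log_alpha[j] = logsumexp_i(log_alpha[i] + log_psi[t, i, j])
        log_alpha = logsumexp(log_alpha[:, None] + log_psi[t], axis=0)
    return logsumexp(log_alpha)
```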

Page 27:

Models: Summary

1. HMM

  • $P(Q, O) \propto \prod_{t=1}^{T} P(q_t \mid q_{t-1}) \, P(o_t \mid q_t)$

2. CRF

  • $P(Q \mid O) = \frac{1}{Z_O} \prod_{t=1}^{T} \exp\left( \sum_j \lambda_j f_j(q_t, q_{t-1}) + \sum_k \mu_k g_k(q_t, o_t) \right)$

Page 28:

Active learning in sequence labeling

Page 29:

Uncertainty sampling

• Least confident

  • $x^*_{LC} = \arg\max_x \left(1 - P_\theta(y^* \mid x)\right)$, where $y^*$ is the Viterbi path

• Margin sampling

  • $x^*_{M} = \arg\min_x \left(P_\theta(y^*_1 \mid x) - P_\theta(y^*_2 \mid x)\right)$, computed with the N-best algorithm

• Entropy

  • Token entropy: $x^*_{TE} = \arg\max_x -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} P_\theta(y_t = m) \log P_\theta(y_t = m)$

  • Total token entropy: the same sum without the $\frac{1}{T}$ normalization

  • Sequence entropy: $x^*_{SE} = \arg\max_x -\sum_{\hat{y}} P_\theta(\hat{y} \mid x) \log P_\theta(\hat{y} \mid x)$

  • N-best sequence entropy: $x^*_{NSE} = \arg\max_x -\sum_{\hat{y} \in \mathcal{N}} P_\theta(\hat{y} \mid x) \log P_\theta(\hat{y} \mid x)$
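A numpy sketch of the entropy-based measures, assuming token marginals (e.g. from forward-backward) and N-best sequence probabilities are already available; renormalizing over the N-best set is an assumption of this sketch:

```python
import numpy as np

def token_entropy(marginals, normalize=True):
    """Token entropy from per-position marginals P_theta(y_t = m | x).

    marginals: (T, M) array, e.g. from the forward-backward algorithm.
    normalize=False gives the "total token entropy" variant.
    """
    nz = marginals[marginals > 0]          # avoid log(0)
    h = -np.sum(nz * np.log(nz))
    return h / marginals.shape[0] if normalize else h

def nbest_sequence_entropy(nbest_probs):
    """N-best approximation of sequence entropy.

    nbest_probs: P_theta(y_hat | x) for the N most likely labelings,
    renormalized here so they form a distribution over the N-best set.
    """
    p = np.asarray(nbest_probs, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p))
```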

Page 30:

Query by Committee

• Query-by-bagging (each model is trained on its own modified set $\mathcal{L}^{(c)}$)

• Vote entropy: disagreement over the Viterbi paths
  $x^*_{VE} = \arg\max_x -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} \frac{V(y_t, m)}{C} \log \frac{V(y_t, m)}{C}$

• Kullback-Leibler divergence:
  $x^*_{KL} = \arg\max_x \frac{1}{T} \sum_{t=1}^{T} \frac{1}{C} \sum_{c=1}^{C} D_{KL}\big(\theta^{(c)} \,\|\, \mathcal{C}\big)$
  $D_{KL}\big(\theta^{(c)} \,\|\, \mathcal{C}\big) = \sum_{m=1}^{M} P_{\theta^{(c)}}(y_t = m) \log \frac{P_{\theta^{(c)}}(y_t = m)}{P_{\mathcal{C}}(y_t = m)}$

• Non-normalized variants

• Sequence vote entropy:
  $x^*_{SVE} = \arg\max_x -\sum_{\hat{y} \in \mathcal{N}^{\mathcal{C}}} P(\hat{y} \mid x, \mathcal{C}) \log P(\hat{y} \mid x, \mathcal{C})$

• Sequence Kullback-Leibler:
  $x^*_{SKL} = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} \sum_{\hat{y} \in \mathcal{N}^{\mathcal{C}}} P_{\theta^{(c)}}(\hat{y} \mid x) \log \frac{P_{\theta^{(c)}}(\hat{y} \mid x)}{P_{\mathcal{C}}(\hat{y} \mid x)}$

Page 31:

Expected Gradient Length

• Expectation taken over the N-best labelings.

• $x^*_{EGL} = \arg\max_x \sum_{\hat{y} \in \mathcal{N}} P_\theta(\hat{y} \mid x) \, \big\| \nabla \ell_\theta(\mathcal{L} \cup \langle x, \hat{y} \rangle) \big\|$

Page 32:

Expected Error Reduction

• $O(TM^{T+2}ULG)$ is too expensive :(

Page 33:

Density-Weighted Methods

• Information density solves the problem that US and QBC are prone to querying outliers

• We need a distance measure for sequences:

  • Kullback-Leibler divergence

  • Euclidean distance

  • Cosine distance

• Drawback: the number of required similarity calculations grows quadratically with the number of instances in $\mathcal{U}$

• Solution: precompute them

Page 34:

Semi-supervised active learning in sequence labeling

Page 35:

Fully supervised vs. semi-supervised

• FuSAL:

  • The sequence is handled as a whole unit.

  • Sequence-wise vs. token-wise utility functions.

• SeSAL:

  • Some subsequences can be easily labeled automatically.

  • Decreases the labeling effort.

  • Uses the self-training principle.

Page 36:

Fully supervised general algorithm

Data:

B - number of examples to be selected

L - set of labeled examples

U - set of unlabeled examples

θ - utility function

while the stopping criterion is not met do

1. learn model M from L;

2. for all x_i ∈ U: u_i ← θ_M(x_i);

3. select the B examples x_i ∈ U with the highest utility u_i;

4. annotate the selected sequences using M;

5. query the annotator for the labels of the non-confident tokens;

6. move the newly annotated examples to L;

end

return L

Algorithm 2: General AL framework
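A sketch of steps 4-5: the model keeps its own labels wherever its per-token confidence clears a threshold, and only the remaining tokens are sent to the annotator. The 0.99 threshold is an arbitrary illustrative value (the summary lists automatic threshold finding as future work):

```python
import numpy as np

def split_tokens(marginals, threshold=0.99):
    """Split one selected sequence into self-labeled and queried tokens.

    marginals: (T, M) token marginals under the current model M.
    Returns the model's labels, a confidence mask, and the positions
    whose labels must be requested from the annotator.
    """
    confident = marginals.max(axis=1) >= threshold   # self-training tokens
    auto_labels = marginals.argmax(axis=1)           # model's own labels
    query_positions = np.where(~confident)[0]        # send these to the human
    return auto_labels, confident, query_positions
```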

Page 37:

Experiment

Page 38:

Problem definition

• Recognition of "handwritten" letters.

• Each letter is randomly written in one of 6 fonts.

• Downscaled to 4×4 pixels.

• Letters are organized into real sentences.

Figure 8: A   Figure 9: B   Figure 10: C

Page 39:

Model and training

• Linear chain conditional random fields.

• 3 labeled sentences in the training dataset at the beginning.

• Labeled sentences are added to the dataset iteratively (50 times), selected by:

  • Random choice

  • FuSAL (least confident)

  • SeSAL (least confident + marginal probability)

Page 40:

Results

Is it even worth it?

Page 41:

Results

Let's look at it from another point of view.

Page 42:

Results

Page 43:

Summary

Page 44:

Summary

• It does not work in all cases.

• It can add implementation overhead.

• In most cases, active learning helps.

• It can be applied to different structures such as sequences or trees.

• Combining it with semi-supervised learning can lead to rapid cost savings.

• Future work: different query costs, automatic threshold finding, CT-HMM.

Page 45:

Thank you for your attention.

Questions?
