
Page 1: Unified Expectation Maximization

Unified Expectation Maximization

Rajhans Samdani

Joint work with

Ming-Wei Chang (Microsoft Research) and Dan Roth

University of Illinois at Urbana-Champaign

NAACL 2012, Montreal

Page 2: Unified Expectation Maximization

Weakly Supervised Learning in NLP

Labeled data is scarce and difficult to obtain

A lot of work on learning with a small amount of labeled data

Expectation Maximization (EM) algorithm is the de facto standard

More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM: Constraint-driven Learning (CoDL; Chang et al., 07) and Posterior Regularization (PR; Ganchev et al., 10)

Page 3: Unified Expectation Maximization

Weakly Supervised Learning: EM and …?

Several variants of EM exist in the literature: Hard EM

Variants of constrained EM: CoDL and PR

Which version to use: EM (PR) vs. hard EM (CoDL)? Or is there something better out there?

OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM), which includes existing EM algorithms and picks the most suitable EM algorithm in a simple, adaptive, and principled way, adapting to data, initialization, and constraints

Page 4: Unified Expectation Maximization

Outline

Background: Expectation Maximization (EM); EM with constraints

Unified Expectation Maximization (UEM)

Optimization Algorithm for the E-step

Experiments


Page 5: Unified Expectation Maximization

Predicting Structures in NLP

Predict the output or dependent variable y from the space of allowed outputs Y given input variable x using parameters or weight vector w

E.g. predict POS tags given a sentence, predict word alignments given sentences in two different languages, predict the entity-relation structure from a document

Prediction expressed as y* = argmax_{y ∈ Y} P(y | x; w)

Page 6: Unified Expectation Maximization

Learning Using EM: a Quick Primer

Given unlabeled data x, estimate w; the output y is hidden. For t = 1 … T do:

E-step: estimate a posterior distribution q over y:

q^t(y) = P(y | x; w^t), i.e. q^t = argmin_q KL(q(y), P(y | x; w^t)) (Neal and Hinton, 99)

M-step: estimate the parameters w w.r.t. q:

w^{t+1} = argmax_w E_q[log P(x, y; w)]

The E-step sets q to the conditional (posterior) distribution of y given w^t.
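A minimal sketch of this loop, assuming a model with a small, enumerable output space; the log_joint and m_step interfaces are illustrative placeholders for the model-specific pieces, not part of any released code:

```python
import numpy as np

def em(log_joint, m_step, ys, x, w0, n_iters=20):
    """Generic EM loop for a model whose output space ys can be enumerated.

    log_joint(x, y, w) -> log P(x, y; w)               (model-specific)
    m_step(q, ys, x)   -> argmax_w E_q[log P(x, y; w)]  (model-specific)
    Returns the parameters w after n_iters EM iterations.
    """
    w = w0
    for _ in range(n_iters):
        # E-step: q(y) = P(y | x; w), i.e. argmin_q KL(q, P(y | x; w))
        scores = np.array([log_joint(x, y, w) for y in ys])
        q = np.exp(scores - scores.max())
        q /= q.sum()
        # M-step: re-estimate the parameters w under the posterior q
        w = m_step(q, ys, x)
    return w
```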

Page 7: Unified Expectation Maximization

Another Version of EM: Hard EM

Standard EM
E-step: q^t = argmin_q KL(q(y), P(y | x; w^t))
M-step: argmax_w E_q[log P(x, y; w)]

Hard EM
E-step: q(y) = δ(y = y*), where y* = argmax_y P(y | x; w)
M-step: argmax_w E_q[log P(x, y; w)]

Not clear which version to use!

Page 8: Unified Expectation Maximization

Constrained EM

Domain knowledge-based constraints can help a lot by guiding unsupervised learning: Constraint-driven Learning (Chang et al., 07), Posterior Regularization (Ganchev et al., 10), Generalized Expectation Criteria (Mann & McCallum, 08), Learning from Measurements (Liang et al., 09)

Constraints are imposed on y (a structured object, {y_1, y_2, …, y_n}) to specify/restrict the set of allowed structures Y

Page 9: Unified Expectation Maximization

Entity-Relation Prediction: Type Constraints

Predict entity types: Per, Loc, Org, etc.
Predict relation types: lives-in, org-based-in, works-for, etc.
Entity-relation type constraints

[Figure: the example sentence "Dole's wife, Elizabeth, is a resident of N.C." with entities E1, E2, E3 and relations R12, R23; e.g. a lives-in relation constrains its arguments to be Per and Loc]

Page 10: Unified Expectation Maximization


Bilingual Word Alignment: Agreement Constraints

Align words from sentences in EN with sentences in FR

Agreement constraints: alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al, 10)

Picture: courtesy Lacoste-Julien et al

Page 11: Unified Expectation Maximization

Structured Prediction Constraints Representation

Assume a set of linear constraints: Y = {y : Uy ≤ b}

A universal representation (Roth and Yih, 07)

Can be relaxed into expectation constraints on posterior probabilities:

E_q[Uy] ≤ b

Focus on introducing constraints during the E-step

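As a small, purely illustrative example of such an expectation constraint (the numbers and the single count constraint below are made up, not from the paper): with a distribution q over three candidate outputs, one row of U can count active labels and the corresponding entry of b can cap that count in expectation.

```python
import numpy as np

# Toy illustration of E_q[Uy] <= b for a single linear (expected-count) constraint.
ys = [np.array([1, 0, 0]), np.array([1, 1, 0]), np.array([1, 1, 1])]  # candidate outputs
q  = np.array([0.5, 0.3, 0.2])      # posterior distribution over the candidates
U  = np.array([[1.0, 1.0, 1.0]])    # one constraint row: counts the active labels in y
b  = np.array([1.8])                # bound on the expected count

expected_Uy = sum(qi * (U @ y) for qi, y in zip(q, ys))
print(expected_Uy, expected_Uy <= b)  # 1.7 <= 1.8: satisfied in expectation
```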

Page 12: Unified Expectation Maximization

Two Versions of Constrained EM

Posterior Regularization (Ganchev et al., 10)
E-step: argmin_q KL(q(y), P(y | x; w^t)) subject to E_q[Uy] ≤ b
M-step: argmax_w E_q[log P(x, y; w)]

Constraint-driven Learning (Chang et al., 07)
E-step: y* = argmax_y P(y | x; w) subject to Uy ≤ b
M-step: argmax_w E_q[log P(x, y; w)]

Not clear which version to use!

Page 13: Unified Expectation Maximization

So how do we learn…?

EM (PR) vs. hard EM (CODL): it is unclear which version of EM to use (Spitkovsky et al., 10)

This is the starting point of our research

We present a family of EM algorithms which includes these EM algorithms (and infinitely many new EM algorithms): Unified Expectation Maximization (UEM)

UEM lets us pick the best EM algorithm in a principled way


Page 14: Unified Expectation Maximization

Outline

Notation and Expectation Maximization (EM)

Unified Expectation Maximization: motivation; formulation and mathematical intuition

Optimization Algorithm for the E-step

Experiments


Page 15: Unified Expectation Maximization

Motivation: Unified Expectation Maximization (UEM)

EM (PR) and hard EM (CODL) differ mostly in the entropy of the posterior distribution

UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ

[Figure: a spectrum of posterior entropies with EM at one end and hard EM at the other]

Page 16: Unified Expectation Maximization

EM (PR) minimizes the KL divergence KL(q, P(y|x; w)), where KL(q, p) = Σ_y q(y) log q(y) − q(y) log p(y)

UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y|x; w); γ), where

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

Different γ values → different EM algorithms

Changes the entropy of the posterior

Unified EM (UEM)

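In the unconstrained case, minimizing KL(q, p; γ) over the probability simplex has a simple closed form: q(y) ∝ p(y)^{1/γ} for γ > 0, which gives q = p at γ = 1 (standard EM) and tends to the argmax distribution (hard EM) as γ → 0. A small sketch of this E-step, for illustration only:

```python
import numpy as np

def uem_estep_unconstrained(p, gamma):
    """Minimize sum_y gamma*q(y)*log q(y) - q(y)*log p(y) over the simplex.

    For gamma > 0 the minimizer is q(y) proportional to p(y)**(1/gamma);
    gamma = 1 reproduces standard EM (q = p), gamma -> 0 reproduces hard EM.
    """
    p = np.asarray(p, dtype=float)
    if gamma <= 0:                        # hard EM limit: all mass on the argmax
        q = np.zeros_like(p)
        q[np.argmax(p)] = 1.0
        return q
    logq = np.log(p) / gamma
    q = np.exp(logq - logq.max())         # exponentiate stably
    return q / q.sum()

p = [0.5, 0.3, 0.2]
for gamma in (1.0, 0.5, 0.0):
    print(gamma, uem_estep_unconstrained(p, gamma))
```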

Page 17: Unified Expectation Maximization

Effect of Changing γ

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

[Figure: bar charts of the distribution q obtained for different values of γ, compared with the original distribution p; γ = 1 reproduces p, while smaller γ makes q more peaked and larger γ makes it flatter]

Page 18: Unified Expectation Maximization

Unifying Existing EM Algorithms

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ = 1: EM (no constraints), PR (with constraints)
γ = −∞: Hard EM (no constraints), CODL (with constraints)
Intermediate γ: Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99)

Changing γ values results in different existing EM algorithms

Page 19: Unified Expectation Maximization

Range of γ

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ = 0: Hard EM (no constraints), LP approximation to CODL (new) (with constraints)
γ = 1: EM (no constraints), PR (with constraints)

We focus on tuning γ in the range [0, 1]

Infinitely many new EM algorithms

Page 20: Unified Expectation Maximization

Tuning γ in practice

γ essentially tunes the entropy of the posterior to better adapt to data, initialization, constraints, etc.

We tune γ using a small amount of development data over a grid on [0, 1]: 0, 0.1, 0.2, 0.3, …, 1

UEM for arbitrary γ in our range is very easy to implement: existing EM/PR/hard EM/CODL codes can be easily extended to implement UEM
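A sketch of what this tuning loop looks like; train_uem and evaluate_on_dev are hypothetical placeholders standing in for whatever training and evaluation routines are already available:

```python
def tune_gamma(train_uem, evaluate_on_dev, unlabeled_data, constraints, dev_data):
    """Grid-search gamma on [0, 1] using a small development set.

    train_uem and evaluate_on_dev are placeholders for existing
    model-specific training and evaluation code.
    """
    best_gamma, best_score = None, float("-inf")
    for gamma in [i / 10 for i in range(11)]:      # 0.0, 0.1, ..., 1.0
        model = train_uem(unlabeled_data, constraints, gamma=gamma)
        score = evaluate_on_dev(model, dev_data)
        if score > best_score:
            best_gamma, best_score = gamma, score
    return best_gamma
```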

Page 21: Unified Expectation Maximization

Outline

Setting up the problem

Unified Expectation Maximization

Solving the constrained E-step: Lagrange dual-based algorithm; unification of existing algorithms

Experiments


Page 22: Unified Expectation Maximization

The Constrained E-step

Minimize the γ-parameterized KL divergence KL(q, P(y | x; w); γ) over q, subject to the domain knowledge-based linear constraints E_q[Uy] ≤ b and the standard probability simplex constraints on q

For γ ≥ 0, this problem is convex

Page 23: Unified Expectation Maximization

Solving the Constrained E-step for q(y)

1. Introduce dual variables λ, one for each constraint
2. Sub-gradient ascent on the dual variables, with ∇λ ∝ E_q[Uy] − b
3. Compute q for the given λ: for γ > 0, compute q(y) ∝ P(y | x; w)^{1/γ} exp(−λᵀUy / γ); as γ → 0, this becomes unconstrained MAP inference

Iterate steps 2 and 3 until convergence
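A minimal sketch of this dual projected sub-gradient procedure, under the assumption that the output space is small enough to enumerate so q can be stored as a vector; in a structured model the q-computation would instead be carried out by inference (e.g. in an HMM). The closed form for q given λ follows from the Lagrangian of the objective above.

```python
import numpy as np

def constrained_estep(log_p, Uy, b, gamma, step=0.1, n_iters=100):
    """UEM E-step with linear expectation constraints E_q[Uy] <= b.

    log_p : log P(y | x; w) for each candidate output y        (shape [|Y|])
    Uy    : precomputed constraint values U @ y per candidate  (shape [m, |Y|])
    b     : constraint bounds                                  (shape [m])
    """
    lam = np.zeros(len(b))                     # dual variables, one per constraint
    for _ in range(n_iters):
        # Compute q for the current duals: scores are log p(y) - lambda^T (Uy).
        scores = log_p - Uy.T @ lam
        if gamma > 0:                          # q(y) proportional to exp(scores / gamma)
            logq = scores / gamma
            q = np.exp(logq - logq.max())
            q /= q.sum()
        else:                                  # gamma -> 0: (penalized) MAP inference
            q = np.zeros_like(scores)
            q[np.argmax(scores)] = 1.0
        # Projected sub-gradient step on the duals: grad wrt lambda = E_q[Uy] - b, lambda >= 0.
        lam = np.maximum(0.0, lam + step * (Uy @ q - b))
    return q
```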

Page 24: Unified Expectation Maximization

Some Properties of our E-step Optimization

We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99), which handles inequality constraints

For special instances where two (or more) "easy" problems are connected via constraints, the algorithm reduces to dual decomposition:
For γ > 0: convex dual decomposition over individual models (e.g. HMMs) connected via dual variables; γ = 1 gives the dual decomposition used in Posterior Regularization (Ganchev et al., 08)
For γ = 0: Lagrangian relaxation/dual decomposition for hard ILP inference (Koo et al., 10; Rush et al., 11)

Page 25: Unified Expectation Maximization

Outline

Setting up the problem

Introduction to Unified Expectation Maximization

Lagrange dual-based optimization algorithm for the E-step

Experiments: POS tagging, entity-relation extraction, word alignment

Page 26: Unified Expectation Maximization

Experiments: exploring the role of γ

Test whether tuning γ improves performance over the baselines

Study the relation between the quality of initialization and γ (or the "hardness" of inference)

Compare against: Posterior Regularization (PR), which corresponds to γ = 1.0, and Constraint-driven Learning (CODL), which corresponds to γ = −∞

Page 27: Unified Expectation Maximization

Unsupervised POS Tagging

Model: a first-order HMM

Try initializations of varying quality:
Uniform initialization: initialize with equal probability for all states
Supervised initialization: initialize with parameters trained on varying amounts of labeled data

Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization


Page 28: Unified Expectation Maximization

Unsupervised POS tagging: Different EM instantiations

[Figure: performance relative to EM as a function of γ (hard EM at γ = 0, EM at γ = 1), shown for uniform initialization and for initialization with 5, 10, 20, and 40-80 labeled examples]

Page 29: Unified Expectation Maximization

Experiments: Entity-Relation Extraction

Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities

Add constraints: type constraints between entities and relations; expected count constraints to regularize the counts of the 'None' relation

Semi-supervised learning with a small amount of labeled data


Page 30: Unified Expectation Maximization

Results on Relations

[Figure: macro-F1 scores on relations vs. the percentage of labeled data (5%, 10%, 20%) for no semi-supervision, CODL, PR, and UEM; UEM is statistically significantly better than PR]

Page 31: Unified Expectation Maximization

Experiments: Word Alignment

Word alignment from a language S to a language T; we try EN-FR and EN-ES pairs

We use an HMM-based model with agreement constraints for word alignment

PR with agreement constraints is known to give huge improvements over the HMM (Ganchev et al., 08; Graca et al., 08)

We use our efficient algorithm to decompose the E-step into individual HMMs

Page 32: Unified Expectation Maximization

Word Alignment: EN-FR with 10k Unlabeled Data

[Figure: alignment error rate for the EN-FR and FR-EN directions with 10k unlabeled sentence pairs, comparing EM, PR, CODL, and UEM]

Page 33: Unified Expectation Maximization

Word Alignment: EN-FR

[Figure: alignment error rate for EN-FR vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 34: Unified Expectation Maximization

Word Alignment: FR-EN

[Figure: alignment error rate for FR-EN vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 35: Unified Expectation Maximization

Word Alignment: EN-ES

[Figure: alignment error rate for EN-ES vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 36: Unified Expectation Maximization

Word Alignment: ES-EN

[Figure: alignment error rate for ES-EN vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 37: Unified Expectation Maximization

Experiments Summary

In different settings, different baselines work better:
Entity-relation extraction: CODL does better than PR
Word alignment: PR does better than CODL
Unsupervised POS tagging: depends on the initialization

UEM allows us to choose the best algorithm in all of these cases; the best version of EM is a new one with 0 < γ < 1

Page 38: Unified Expectation Maximization

Unified EM: Summary

UEM generalizes existing variations of EM/constrained EM

UEM provides new EM algorithms parameterized by a single parameter γ

An efficient dual projected sub-gradient ascent technique incorporates constraints into UEM

The best γ corresponds to neither EM (PR) nor hard EM (CODL) and is found through the UEM framework

Tuning γ adaptively changes the entropy of the posterior

UEM is easy to implement: add a few lines of code to existing EM code

Questions?