
Page 1: Unified Expectation Maximization

Unified Expectation Maximization

Rajhans Samdani

Joint work with

Ming-Wei Chang (Microsoft Research) and Dan Roth

University of Illinois at Urbana-Champaign

NAACL 2012, Montreal

Page 2: Unified Expectation Maximization

Weakly Supervised Learning in NLP

Labeled data is scarce and difficult to obtain

A lot of work on learning with a small amount of labeled data

Expectation Maximization (EM) algorithm is the de facto standard

More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM: Constraint-driven Learning (CoDL; Chang et al., 07) and Posterior Regularization (PR; Ganchev et al., 10)

Page 3: Unified Expectation Maximization

Weakly Supervised Learning: EM and …?

Several variants of EM exist in the literature: Hard EM

Variants of constrained EM: CoDL and PR

Which version to use: EM (PR) vs. hard EM (CoDL)? Or is there something better out there?

OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM), which includes existing EM algorithms and picks the most suitable EM algorithm in a simple, adaptive, and principled way, adapting to data, initialization, and constraints

Page 4: Unified Expectation Maximization

Outline

Background: Expectation Maximization (EM); EM with constraints

Unified Expectation Maximization (UEM)

Optimization Algorithm for the E-step

Experiments


Page 5: Unified Expectation Maximization

Predicting Structures in NLP

Predict the output or dependent variable y from the space of allowed outputs Y given input variable x using parameters or weight vector w

E.g. predict POS tags given a sentence, predict word alignments given sentences in two different languages, predict the entity-relation structure from a document

Prediction expressed as y* = argmax_{y ∈ Y} P(y | x; w)

Page 6: Unified Expectation Maximization

Learning Using EM: a Quick Primer

Given unlabeled data x, estimate w; the output y is hidden. For t = 1 … T do:

E-step: estimate a posterior distribution q over y:

q^t(y) = P(y | x; w^t), i.e. q^t = argmin_q KL(q(y), P(y | x; w^t)) (Neal and Hinton, 99)

M-step: estimate the parameters w w.r.t. q:

w^{t+1} = argmax_w E_q[log P(x, y; w)]

The E-step sets q to the conditional (posterior) distribution of y given w^t.
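A minimal sketch of this loop, assuming a model with a small, enumerable output space; the log_joint and m_step interfaces are illustrative placeholders for the model-specific pieces, not part of any released code:

```python
import numpy as np

def em(log_joint, m_step, ys, x, w0, n_iters=20):
    """Generic EM loop for a model whose output space ys can be enumerated.

    log_joint(x, y, w) -> log P(x, y; w)               (model-specific)
    m_step(q, ys, x)   -> argmax_w E_q[log P(x, y; w)]  (model-specific)
    Returns the parameters w after n_iters EM iterations.
    """
    w = w0
    for _ in range(n_iters):
        # E-step: q(y) = P(y | x; w), i.e. argmin_q KL(q, P(y | x; w))
        scores = np.array([log_joint(x, y, w) for y in ys])
        q = np.exp(scores - scores.max())
        q /= q.sum()
        # M-step: re-estimate the parameters w under the posterior q
        w = m_step(q, ys, x)
    return w
```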

Page 7: Unified Expectation Maximization

Another Version of EM: Hard EM

Standard EM
E-step: q^t = argmin_q KL(q(y), P(y | x; w^t))
M-step: argmax_w E_q[log P(x, y; w)]

Hard EM
E-step: q(y) = δ(y = y*), where y* = argmax_y P(y | x; w)
M-step: argmax_w E_q[log P(x, y; w)]

Not clear which version to use!

Page 8: Unified Expectation Maximization

Constrained EM

Domain knowledge-based constraints can help a lot by guiding unsupervised learning: Constraint-driven Learning (Chang et al., 07), Posterior Regularization (Ganchev et al., 10), Generalized Expectation Criteria (Mann & McCallum, 08), Learning from Measurements (Liang et al., 09)

Constraints are imposed on y (a structured object, {y_1, y_2, …, y_n}) to specify/restrict the set of allowed structures Y

Page 9: Unified Expectation Maximization

Entity-Relation Prediction: Type Constraints

Predict entity types: Per, Loc, Org, etc.
Predict relation types: lives-in, org-based-in, works-for, etc.
Entity-relation type constraints

[Figure: the example sentence "Dole's wife, Elizabeth, is a resident of N.C." with entities E1, E2, E3 and relations R12, R23; e.g. a lives-in relation constrains its arguments to be Per and Loc]

Page 10: Unified Expectation Maximization


Bilingual Word Alignment: Agreement Constraints

Align words from sentences in EN with sentences in FR

Agreement constraints: alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al, 10)

Picture: courtesy Lacoste-Julien et al

Page 11: Unified Expectation Maximization

Structured Prediction Constraints Representation

Assume a set of linear constraints: Y = {y : Uy ≤ b}

A universal representation (Roth and Yih, 07)

Can be relaxed into expectation constraints on posterior probabilities:

E_q[Uy] ≤ b

Focus on introducing constraints during the E-step

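As a small, purely illustrative example of such an expectation constraint (the numbers and the single count constraint below are made up, not from the paper): with a distribution q over three candidate outputs, one row of U can count active labels and the corresponding entry of b can cap that count in expectation.

```python
import numpy as np

# Toy illustration of E_q[Uy] <= b for a single linear (expected-count) constraint.
ys = [np.array([1, 0, 0]), np.array([1, 1, 0]), np.array([1, 1, 1])]  # candidate outputs
q  = np.array([0.5, 0.3, 0.2])      # posterior distribution over the candidates
U  = np.array([[1.0, 1.0, 1.0]])    # one constraint row: counts the active labels in y
b  = np.array([1.8])                # bound on the expected count

expected_Uy = sum(qi * (U @ y) for qi, y in zip(q, ys))
print(expected_Uy, expected_Uy <= b)  # 1.7 <= 1.8: satisfied in expectation
```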

Page 12: Unified Expectation Maximization

Two Versions of Constrained EM

Posterior Regularization (Ganchev et al., 10)
E-step: argmin_q KL(q(y), P(y | x; w^t)) subject to E_q[Uy] ≤ b
M-step: argmax_w E_q[log P(x, y; w)]

Constraint-driven Learning (Chang et al., 07)
E-step: y* = argmax_y P(y | x; w) subject to Uy ≤ b
M-step: argmax_w E_q[log P(x, y; w)]

Not clear which version to use!

Page 13: Unified Expectation Maximization

So how do we learn…?

EM (PR) vs. hard EM (CODL): it is unclear which version of EM to use (Spitkovsky et al., 10)

This is the starting point of our research

We present a family of EM algorithms which includes these EM algorithms (and infinitely many new EM algorithms): Unified Expectation Maximization (UEM)

UEM lets us pick the best EM algorithm in a principled way


Page 14: Unified Expectation Maximization

Outline

Notation and Expectation Maximization (EM)

Unified Expectation Maximization: motivation; formulation and mathematical intuition

Optimization Algorithm for the E-step

Experiments


Page 15: Unified Expectation Maximization

Motivation: Unified Expectation Maximization (UEM)

EM (PR) and hard EM (CODL) differ mostly in the entropy of the posterior distribution

UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ

[Figure: a spectrum of posterior entropies with EM at one end and hard EM at the other]

Page 16: Unified Expectation Maximization

EM (PR) minimizes the KL divergence KL(q, P(y|x; w)), where KL(q, p) = Σ_y q(y) log q(y) − q(y) log p(y)

UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y|x; w); γ), where

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

Different γ values → different EM algorithms

Changes the entropy of the posterior

Unified EM (UEM)

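In the unconstrained case, minimizing KL(q, p; γ) over the probability simplex has a simple closed form: q(y) ∝ p(y)^{1/γ} for γ > 0, which gives q = p at γ = 1 (standard EM) and tends to the argmax distribution (hard EM) as γ → 0. A small sketch of this E-step, for illustration only:

```python
import numpy as np

def uem_estep_unconstrained(p, gamma):
    """Minimize sum_y gamma*q(y)*log q(y) - q(y)*log p(y) over the simplex.

    For gamma > 0 the minimizer is q(y) proportional to p(y)**(1/gamma);
    gamma = 1 reproduces standard EM (q = p), gamma -> 0 reproduces hard EM.
    """
    p = np.asarray(p, dtype=float)
    if gamma <= 0:                        # hard EM limit: all mass on the argmax
        q = np.zeros_like(p)
        q[np.argmax(p)] = 1.0
        return q
    logq = np.log(p) / gamma
    q = np.exp(logq - logq.max())         # exponentiate stably
    return q / q.sum()

p = [0.5, 0.3, 0.2]
for gamma in (1.0, 0.5, 0.0):
    print(gamma, uem_estep_unconstrained(p, gamma))
```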

Page 17: Unified Expectation Maximization

Effect of Changing γ

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

[Figure: bar charts of the distribution q obtained for different values of γ, compared with the original distribution p; γ = 1 reproduces p, while smaller γ makes q more peaked and larger γ makes it flatter]

Page 18: Unified Expectation Maximization

Unifying Existing EM Algorithms

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ = 1: EM (no constraints), PR (with constraints)
γ = −∞: Hard EM (no constraints), CODL (with constraints)
Intermediate γ: Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99)

Changing γ values results in different existing EM algorithms

Page 19: Unified Expectation Maximization

Range of γ

KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)

γ = 0: Hard EM (no constraints), LP approximation to CODL (new) (with constraints)
γ = 1: EM (no constraints), PR (with constraints)

We focus on tuning γ in the range [0, 1]

Infinitely many new EM algorithms

Page 20: Unified Expectation Maximization

Tuning γ in practice

γ essentially tunes the entropy of the posterior to better adapt to data, initialization, constraints, etc.

We tune γ using a small amount of development data over a grid on [0, 1]: 0, 0.1, 0.2, 0.3, …, 1

UEM for arbitrary γ in our range is very easy to implement: existing EM/PR/hard EM/CODL codes can be easily extended to implement UEM
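A sketch of what this tuning loop looks like; train_uem and evaluate_on_dev are hypothetical placeholders standing in for whatever training and evaluation routines are already available:

```python
def tune_gamma(train_uem, evaluate_on_dev, unlabeled_data, constraints, dev_data):
    """Grid-search gamma on [0, 1] using a small development set.

    train_uem and evaluate_on_dev are placeholders for existing
    model-specific training and evaluation code.
    """
    best_gamma, best_score = None, float("-inf")
    for gamma in [i / 10 for i in range(11)]:      # 0.0, 0.1, ..., 1.0
        model = train_uem(unlabeled_data, constraints, gamma=gamma)
        score = evaluate_on_dev(model, dev_data)
        if score > best_score:
            best_gamma, best_score = gamma, score
    return best_gamma
```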

Page 21: Unified Expectation Maximization

Outline

Setting up the problem

Unified Expectation Maximization

Solving the constrained E-step: Lagrange dual-based algorithm; unification of existing algorithms

Experiments


Page 22: Unified Expectation Maximization

The Constrained E-step

Minimize the γ-parameterized KL divergence KL(q, P(y | x; w); γ) over q, subject to the domain knowledge-based linear constraints E_q[Uy] ≤ b and the standard probability simplex constraints on q

For γ ≥ 0, this problem is convex

Page 23: Unified Expectation Maximization

Solving the Constrained E-step for q(y)

1. Introduce dual variables λ, one for each constraint
2. Sub-gradient ascent on the dual variables, with ∇λ ∝ E_q[Uy] − b
3. Compute q for the given λ: for γ > 0, compute q(y) ∝ P(y | x; w)^{1/γ} exp(−λᵀUy / γ); as γ → 0, this becomes unconstrained MAP inference

Iterate steps 2 and 3 until convergence
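A minimal sketch of this dual projected sub-gradient procedure, under the assumption that the output space is small enough to enumerate so q can be stored as a vector; in a structured model the q-computation would instead be carried out by inference (e.g. in an HMM). The closed form for q given λ follows from the Lagrangian of the objective above.

```python
import numpy as np

def constrained_estep(log_p, Uy, b, gamma, step=0.1, n_iters=100):
    """UEM E-step with linear expectation constraints E_q[Uy] <= b.

    log_p : log P(y | x; w) for each candidate output y        (shape [|Y|])
    Uy    : precomputed constraint values U @ y per candidate  (shape [m, |Y|])
    b     : constraint bounds                                  (shape [m])
    """
    lam = np.zeros(len(b))                     # dual variables, one per constraint
    for _ in range(n_iters):
        # Compute q for the current duals: scores are log p(y) - lambda^T (Uy).
        scores = log_p - Uy.T @ lam
        if gamma > 0:                          # q(y) proportional to exp(scores / gamma)
            logq = scores / gamma
            q = np.exp(logq - logq.max())
            q /= q.sum()
        else:                                  # gamma -> 0: (penalized) MAP inference
            q = np.zeros_like(scores)
            q[np.argmax(scores)] = 1.0
        # Projected sub-gradient step on the duals: grad wrt lambda = E_q[Uy] - b, lambda >= 0.
        lam = np.maximum(0.0, lam + step * (Uy @ q - b))
    return q
```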

Page 24: Unified Expectation Maximization

Some Properties of our E-step Optimization

We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99), which handles inequality constraints

For special instances where two (or more) "easy" problems are connected via constraints, the algorithm reduces to dual decomposition:
For γ > 0: convex dual decomposition over individual models (e.g. HMMs) connected via dual variables; γ = 1 gives the dual decomposition used in Posterior Regularization (Ganchev et al., 08)
For γ = 0: Lagrangian relaxation/dual decomposition for hard ILP inference (Koo et al., 10; Rush et al., 11)

Page 25: Unified Expectation Maximization

Outline

Setting up the problem

Introduction to Unified Expectation Maximization

Lagrange dual-based optimization algorithm for the E-step

Experiments: POS tagging, entity-relation extraction, word alignment

Page 26: Unified Expectation Maximization

Experiments: exploring the role of γ

Test whether tuning γ improves performance over the baselines

Study the relation between the quality of initialization and γ (or the "hardness" of inference)

Compare against: Posterior Regularization (PR), which corresponds to γ = 1.0, and Constraint-driven Learning (CODL), which corresponds to γ = −∞

Page 27: Unified Expectation Maximization

Unsupervised POS Tagging

Model: a first-order HMM

Try initializations of varying quality:
Uniform initialization: initialize with equal probability for all states
Supervised initialization: initialize with parameters trained on varying amounts of labeled data

Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization


Page 28: Unified Expectation Maximization

Unsupervised POS tagging: Different EM instantiations

[Figure: performance relative to EM as a function of γ (hard EM at γ = 0, EM at γ = 1), shown for uniform initialization and for initialization with 5, 10, 20, and 40-80 labeled examples]

Page 29: Unified Expectation Maximization

Experiments: Entity-Relation Extraction

Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities

Add constraints: type constraints between entities and relations; expected count constraints to regularize the counts of the 'None' relation

Semi-supervised learning with a small amount of labeled data


Page 30: Unified Expectation Maximization

Results on Relations

[Figure: macro-F1 scores on relations vs. the percentage of labeled data (5%, 10%, 20%) for no semi-supervision, CODL, PR, and UEM; UEM is statistically significantly better than PR]

Page 31: Unified Expectation Maximization

Experiments: Word Alignment

Word alignment from a language S to a language T; we try EN-FR and EN-ES pairs

We use an HMM-based model with agreement constraints for word alignment

PR with agreement constraints is known to give huge improvements over the HMM (Ganchev et al., 08; Graca et al., 08)

We use our efficient algorithm to decompose the E-step into individual HMMs

Page 32: Unified Expectation Maximization

Word Alignment: EN-FR with 10k Unlabeled Data

[Figure: alignment error rate for the EN-FR and FR-EN directions with 10k unlabeled sentence pairs, comparing EM, PR, CODL, and UEM]

Page 33: Unified Expectation Maximization

Word Alignment: EN-FR

[Figure: alignment error rate for EN-FR vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 34: Unified Expectation Maximization

Word Alignment: FR-EN

[Figure: alignment error rate for FR-EN vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 35: Unified Expectation Maximization

Word Alignment: EN-ES

[Figure: alignment error rate for EN-ES vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 36: Unified Expectation Maximization

Word Alignment: ES-EN

[Figure: alignment error rate for ES-EN vs. the amount of unlabeled data (10k, 50k, 100k), comparing EM, PR, CODL, and UEM]

Page 37: Unified Expectation Maximization

Experiments Summary

In different settings, different baselines work better:
Entity-relation extraction: CODL does better than PR
Word alignment: PR does better than CODL
Unsupervised POS tagging: depends on the initialization

UEM allows us to choose the best algorithm in all of these cases; the best version of EM is a new one with 0 < γ < 1

Page 38: Unified Expectation Maximization

Unified EM: Summary

UEM generalizes existing variations of EM/constrained EM

UEM provides new EM algorithms parameterized by a single parameter γ

An efficient dual projected sub-gradient ascent technique incorporates constraints into UEM

The best γ corresponds to neither EM (PR) nor hard EM (CODL) and is found through the UEM framework

Tuning γ adaptively changes the entropy of the posterior

UEM is easy to implement: add a few lines of code to existing EM code

Questions?