1
Unsupervised Morphological Segmentation
With Log-Linear Models
Hoifung Poon, University of Washington
Joint Work with
Colin Cherry and Kristina Toutanova
2-8
Machine Learning in NLP
Unsupervised Learning + Log-Linear Models + Global Features
Little work combines unsupervised learning with log-linear models, apart from a couple of cases, and adding global features on top of that had essentially not been tried.
We developed a method for unsupervised learning of log-linear models with global features.
9
Machine Learning in NLP
We applied it to morphological segmentation and reduced F1 error by 10% to 50% compared to the state of the art.
10
Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
11-16
Morphological Segmentation
Breaks words into morphemes:
- English: governments → govern ment s
- Hebrew: lm$pxtm (according to their families) → l m$px t m
A key component in many NLP applications; particularly important for morphologically rich languages (e.g., Arabic, Hebrew, ...).
17
Why Unsupervised Learning?
- Text: unlimited supply in any language
- Segmentation labels: available only for a few languages, and expensive to acquire
18
Why Log-Linear Models?
They can incorporate arbitrary overlapping features. E.g., for Al rb (the lord):
- Morpheme features: substrings Al and rb are likely morphemes; substrings Alr and lrb are not; etc.
- Context features: substrings between Al and # are likely morphemes; substrings between lr and # are not; etc.
19
Why Global Features?
Words can inform each other's segmentation. E.g., Al rb (the lord) and l Al rb (to the lord) can share the morphemes Al and rb.
20
State of the Art in Unsupervised Morphological Segmentation
All use directed graphical models:
- Morfessor [Creutz & Lagus 2007]: a hidden Markov model (HMM)
- Goldwater et al. [2006]: based on Pitman-Yor processes
- Snyder & Barzilay [2008a, 2008b]: based on Dirichlet processes; uses bilingual information to help segmentation, via phrasal alignment and prior knowledge of phonetic correspondence (e.g., Hebrew w ↔ Arabic w, f; ...)
21
Unsupervised Learning with Log-Linear Models
Few approaches exist to date:
- Contrastive estimation [Smith & Eisner 2005]
- Sampling [Poon & Domingos 2008]
22
This Talk
The first log-linear model for unsupervised morphological segmentation:
- Combines contrastive estimation with sampling
- Achieves state-of-the-art results
- Can also be applied to semi-supervised learning
23
Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
24-25
Log-Linear Model
- State variable $x \in X$
- Features $f_i : X \to \mathbb{R}$
- Weights $\lambda_i$
Defines a probability distribution over the states:

$$P(x) = \frac{1}{Z} \exp \sum_i \lambda_i f_i(x), \qquad Z = \sum_{x' \in X} \exp \sum_i \lambda_i f_i(x')$$
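To make the definition concrete, here is a toy computation (the states, features, and weights are invented for illustration): when X is small enough to enumerate, Z and P(x) can be computed exactly. The difficulty addressed later in the talk is that X is normally far too large for this.

```python
import math

# Toy state space X = {0, 1, 2} with two invented features and weights.
states = [0, 1, 2]
features = [lambda x: float(x == 0),   # indicator feature f_1
            lambda x: float(x)]        # real-valued feature f_2
weights = [0.5, -1.0]                  # lambda_1, lambda_2

# exp(sum_i lambda_i * f_i(x)) for each state, then normalize by Z.
scores = [math.exp(sum(w * f(x) for w, f in zip(weights, features)))
          for x in states]
Z = sum(scores)
probs = [s / Z for s in scores]
print(probs, sum(probs))   # the probabilities sum to 1
```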
26
States for Unsupervised Morphological Segmentation
- Words: wvlAvwn, Alrb, ...
- Segmentation: w vlAv wn, Al rb, ...
- Induced lexicon (unique morphemes): w, vlAv, wn, Al, rb
27
Features for Unsupervised Morphological Segmentation
- Morphemes and contexts
- Exponential priors on model complexity
28
Morphemes and Contexts
Count the number of occurrences of each morpheme and each context; inspired by CCM [Klein & Manning, 2001]. E.g., for w vlAv wn (morphemes shown with their contexts):
- wvlAvwn (##_##)
- vlAv (#w_wn)
- w (##_vl)
- wn (Av_##)
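A minimal sketch (not the authors' code) of this feature extraction for one segmented word. It uses the two-character window shown on this slide, though the full model uses trigram contexts (see the Methodology slide); firing each context as its own feature is one reasonable reading of the slide's notation.

```python
def extract_features(morphemes, k=2):
    """Morpheme and context features for one segmented word.

    Each morpheme fires a morpheme feature plus a context feature built
    from the k characters on each side ('#' pads the word edges).
    """
    padded = "#" * k + "".join(morphemes) + "#" * k
    feats, pos = [], k  # pos: offset of current morpheme in padded word
    for m in morphemes:
        left = padded[pos - k:pos]
        right = padded[pos + len(m):pos + len(m) + k]
        feats.append(("morpheme", m))
        feats.append(("context", left + "_" + right))
        pos += len(m)
    return feats

print(extract_features(["w", "vlAv", "wn"]))
# [('morpheme', 'w'), ('context', '##_vl'), ('morpheme', 'vlAv'),
#  ('context', '#w_wn'), ('morpheme', 'wn'), ('context', 'Av_##')]
```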
29
Complexity-Based Priors
- Lexicon prior: on lexicon length (the total number of characters); favors fewer and shorter morpheme types
- Corpus prior: on the number of morphemes per word, normalized by word length; favors fewer morpheme tokens
E.g., for the segmentations l Al rb and Al rb: the lexicon is {l, Al, rb}, with total length 5; the corpus prior terms are 3/5 for l Al rb and 2/4 for Al rb.
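A minimal sketch (illustrative, not the authors' code) of computing the two quantities these priors penalize; the weights placed on them are model parameters set elsewhere.

```python
def prior_terms(segmentations):
    """Quantities penalized by the two priors.

    segmentations: one list of morpheme strings per word.
    Returns (lexicon_length, corpus_term): total characters across
    unique morpheme types, and the sum over words of morpheme count
    divided by word length in characters.
    """
    lexicon = {m for word in segmentations for m in word}
    lexicon_length = sum(len(m) for m in lexicon)
    # len(word) counts morpheme tokens; len("".join(word)) counts characters.
    corpus_term = sum(len(word) / len("".join(word))
                      for word in segmentations)
    return lexicon_length, corpus_term

# The slide's example: lAlrb -> l Al rb, Alrb -> Al rb
print(prior_terms([["l", "Al", "rb"], ["Al", "rb"]]))
# (5, 1.1)  -> lexicon {l, Al, rb} has 5 characters; 3/5 + 2/4 = 1.1
```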
30-34
Lexicon Prior Is a Global Feature
It renders words interdependent in segmentation. E.g., for lAlrb and Alrb, the shared lexicon pushes the two words toward consistent analyses:
- segmenting lAlrb as l Al rb favors segmenting Alrb as Al rb, whereas
- segmenting lAlrb as l Alrb favors leaving Alrb whole as Alrb
35
Probability Distribution
For corpus W and segmentation S:

$$P(W, S) = \frac{1}{Z} \exp \Bigg( \underbrace{\sum_{c \in \mathrm{Lex}(W,S)} \lambda_c n_c}_{\text{morphemes}} + \underbrace{\sum_{c \in \mathrm{Ctxt}(W,S)} \lambda_c n_c}_{\text{contexts}} - \underbrace{\alpha \, \mathrm{Len}(\mathrm{Lex}(W,S))}_{\text{lexicon prior}} - \underbrace{\beta \sum_{w \in W} \mathrm{NumMorph}(w) / \mathrm{Len}(w)}_{\text{corpus prior}} \Bigg)$$

where $n_c$ counts occurrences of morpheme or context $c$, and the signs are written so that $\alpha, \beta > 0$ penalize model complexity (the exponential priors).
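Combining the features and priors, a hedged sketch of the log of the unnormalized score (the exponent in the equation above; Z itself is never computed explicitly). It reuses extract_features and prior_terms from the earlier sketches, with alpha and beta standing for the prior weights.

```python
from collections import Counter

def log_score(segmentations, weights, alpha, beta):
    """log of the unnormalized P(W, S): weighted morpheme/context
    counts minus the lexicon and corpus priors (illustrative sketch)."""
    counts = Counter()
    for word in segmentations:
        counts.update(extract_features(word))          # earlier sketch
    feature_term = sum(weights.get(f, 0.0) * n for f, n in counts.items())
    lex_len, corpus_term = prior_terms(segmentations)  # earlier sketch
    return feature_term - alpha * lex_len - beta * corpus_term
```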
36
Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
37
Learning with Log-Linear Models
Maximizing the likelihood of the observed data moves probability mass onto the observed data. From where? From the set X that Z sums over:

$$Z = \sum_{x' \in X} \exp \sum_i \lambda_i f_i(x')$$

Normally X = {all possible states}, and the major challenge is efficient computation (or approximation) of this sum, which is particularly difficult in unsupervised learning.
38
Contrastive Estimation
Smith & Eisner [2005]: take X to be a neighborhood of the observed data. The neighborhood supplies pseudo-negative examples, and learning discriminates the observed instances from them.
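For intuition, a minimal sketch of the contrastive-estimation objective in the fully observed case. This is a simplification: the model in this talk also sums over hidden segmentations, and the neighborhood is assumed here to contain the observed item itself, as in Smith & Eisner's formulation.

```python
import math

def ce_log_likelihood(observed, neighborhood_fn, log_score_fn):
    """Sum of log-odds of each observed item against its neighborhood."""
    total = 0.0
    for x in observed:
        neigh = neighborhood_fn(x) | {x}   # assumed to include x itself
        log_z = math.log(sum(math.exp(log_score_fn(y)) for y in neigh))
        total += log_score_fn(x) - log_z   # log P(x) with Z over neigh
    return total
```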
39
Problem with Contrastive Estimation
It treats objects as independent of each other, so using global features would make inference intractable. In our case, that rules out the lexicon prior.
40
Sampling to the Rescue
Similar to Poon & Domingos [2008]: Markov chain Monte Carlo estimates the sufficient statistics from samples, and handles global features straightforwardly.
41
Our Learning Algorithm
Combines both ideas:
- Contrastive estimation creates an informative neighborhood
- Sampling enables the global feature (the lexicon prior)
42
Learning Objective
Observed: W* (the words). Hidden: S (the segmentation). We maximize the log-likelihood of observing the words:

$$L(W^*) = \log \sum_S P(W^*, S)$$
43
Neighborhood
TRANS1: transpose any pair of adjacent characters. Intuition: a transposition usually yields a non-word. E.g. (a generator sketch follows):
- lAlrb → Allrb, llArb, ...
- Alrb → lArb, Arlb, ...
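A minimal sketch of the TRANS1 generator. Two details are assumptions not settled by the slide: swaps of equal adjacent characters (which change nothing) are dropped, and whether the observed word itself belongs to the neighborhood is left to the caller.

```python
def trans1(word):
    """All strings obtained from word by transposing one pair of
    adjacent characters; no-op swaps of equal characters are skipped."""
    out = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return out

print(sorted(trans1("Alrb")))   # ['Albr', 'Arlb', 'lArb']
```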
44
Optimization
Gradient descent on the negative log-likelihood; the gradient is a difference of feature expectations:

$$\frac{\partial}{\partial \lambda_i} L(W^*) = E_{S \mid W^*}[f_i] - E_{W,S}[f_i]$$
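A hedged sketch of one weight update implied by this gradient. The learning rate and the plain-gradient form are assumptions; the two expectation vectors come from the Gibbs samplers described on the next slides.

```python
def gradient_step(weights, e_observed, e_neighborhood, lr=0.02):
    """Ascend the log-likelihood: move each weight toward the observed
    expectation and away from the neighborhood expectation."""
    for i in set(e_observed) | set(e_neighborhood):
        grad = e_observed.get(i, 0.0) - e_neighborhood.get(i, 0.0)
        weights[i] = weights.get(i, 0.0) + lr * grad
    return weights
```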
45
Supervised Learning and Semi-Supervised Learning
Readily applicable when labeled segmentations S* are available. Supervised: labels for all words. Semi-supervised: labels for some words.

$$\frac{\partial}{\partial \lambda_i} L(W^*, S^*) = E_{S \mid W^*, S^*}[f_i] - E_{W,S}[f_i]$$
46
Inference: Expectation
Gibbs sampling (one sweep is sketched below):
- For E_{S|W*}[f_i]: for each observed word in turn, sample its next segmentation, conditioning on the rest
- For E_{W,S}[f_i]: for each observed word in turn, sample a word from the neighborhood and its next segmentation, conditioning on the rest
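A minimal sketch of one Gibbs sweep for the first expectation, under the assumption (not from the slide) that words are short enough to enumerate every segmentation exactly. Expectations are then estimated by averaging feature counts over many sweeps, and the E_{W,S}[f_i] sampler additionally resamples each word from its TRANS1 neighborhood; both are omitted here.

```python
import math
import random

def all_segmentations(word):
    """All 2^(len(word)-1) ways to split word into morphemes."""
    if not word:
        return [[]]
    return [[word[:i]] + rest
            for i in range(1, len(word) + 1)
            for rest in all_segmentations(word[i:])]

def gibbs_sweep(words, segs, log_score_fn):
    """Resample each word's segmentation conditioned on all the others.

    log_score_fn scores the full corpus state: the lexicon prior couples
    the words, so a purely local score would miss that interaction.
    """
    for j, word in enumerate(words):
        candidates = all_segmentations(word)
        logps = []
        for cand in candidates:
            segs[j] = cand
            logps.append(log_score_fn(segs))
        m = max(logps)                      # stabilize the exponentials
        probs = [math.exp(lp - m) for lp in logps]
        r = random.random() * sum(probs)
        for cand, p in zip(candidates, probs):
            r -= p
            if r <= 0:
                segs[j] = cand
                break
    return segs
```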
47
Inference: MAP Segmentation
Deterministic annealing: Gibbs sampling with a temperature, gradually lowered from 10 to 0.1 (a schedule sketch follows).
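A hedged sketch of the annealed sampler, reusing gibbs_sweep from the previous sketch. Only the 10 → 0.1 temperature range comes from the slide; the geometric schedule and sweep count are assumptions.

```python
def annealed_map(words, segs, log_score_fn, t_hi=10.0, t_lo=0.1, sweeps=100):
    """Approximate the MAP segmentation: Gibbs sweeps while cooling.
    Dividing log-scores by the temperature flattens the distribution
    early (exploration) and sharpens it late (near-greedy choices)."""
    denom = max(sweeps - 1, 1)
    for k in range(sweeps):
        temp = t_hi * (t_lo / t_hi) ** (k / denom)   # geometric cooling
        segs = gibbs_sweep(words, segs, lambda s: log_score_fn(s) / temp)
    return segs
```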
48
Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
49
Dataset
- S&B [Snyder & Barzilay 2008a, 2008b]: about 7,000 parallel short phrases in Arabic and Hebrew, with gold segmentations
- Arabic Penn Treebank (ATB): 120,000 words
50
Methodology
- Development set: 500 words from S&B
- Trigram contexts in our full model
- Evaluation: precision, recall, and F1 on segmentation points
51
Experiment Objectives
- Comparison with state-of-the-art systems, both unsupervised and supervised or semi-supervised
- Relative contributions of the feature components
52
Experiment: S&B (Unsupervised)
- Snyder & Barzilay [2008b]: S&B-MONO uses monolingual features only; S&B-BEST uses bilingual information
- Our system: uses monolingual features only
53-57
Results: S&B (Unsupervised)
[Bar chart: F1 for S&B-MONO, S&B-BEST, and our system; axis 50-80]
Our system reduces F1 error by 40% relative to S&B-MONO and by 21% relative to S&B-BEST.
58
Experiment: ATB (Unsupervised)
- Morfessor Categories-MAP [Creutz & Lagus 2007]
- Our system
59
Results: ATB
[Bar chart: F1 for Morfessor Categories-MAP (MOR-MAP) and our system; axis 70-78]
Our system reduces F1 error by 11%.
60
Experiment: Ablation Tests
Conducted on the S&B dataset, changing one feature component per test:
- Priors
- Context features
61-62
Results: Ablation Tests
[Bar chart: F1 for FULL, NO-PRIOR, LEX (lexicon prior only), COR (corpus prior only), and NO-CTXT; axis 30-80]
- Removing the priors (no priors, lexicon prior only, or corpus prior only) hurts substantially: both priors are crucial
- Removing the context features also hurts: overlapping context features are important
63
Experiment: S&B (Supervised and Semi-Supervised)
- Snyder & Barzilay [2008a]: S&B-MONO-S uses monolingual features and labels; S&B-BEST-S uses bilingual information and labels
- Our system: monolingual features, with partial or all labels (25%, 50%, 75%, 100%)
64-69
Results: S&B (Supervised and Semi-Supervised)
[Bar chart: F1 for S&B-MONO-S, S&B-BEST-S, and our system with 25%, 50%, 75%, and 100% of the labels; axis 75-90]
With 100% of the labels, our system reduces F1 error by 46% compared to S&B-MONO-S and by 36% compared to S&B-BEST-S.
70
Conclusion
We developed a method for unsupervised learning of log-linear models with global features.
- Applied it to morphological segmentation
- Substantially outperforms state-of-the-art systems
- Effective for semi-supervised learning as well
- Easy to extend with additional features
71
Future Work
- Apply to other NLP tasks; study the interplay between neighborhood and features
- Morphology: apply to other languages, model internal variations of morphemes, leverage multilingual information, combine with other NLP tasks (e.g., MT)
72
Thanks to Ben Snyder ... for his most generous help with the S&B dataset.