An Information Theoretic Approach to Bilingual Word Clustering Manaal Faruqui & Chris Dyer Language Technologies Institute SCS, CMU

Post on 23-Dec-2015

Page 1

An Information Theoretic Approach to Bilingual Word Clustering

Manaal Faruqui & Chris Dyer

Language Technologies Institute

SCS, CMU

Page 2

Word Clustering

Grouping of words capturing syntactic, semantic and distributional regularities

[Figure: example words grouped into clusters: Iran, USA, India, Paris, London; 11, 13.4, 22,000, 100; play, laugh, eat, run, fight; good, nice, better, awesome, cool]

Page 3

Bilingual Word Clustering

What?

• Clustering words of two languages simultaneously

• Inducing a dependence between the two clusterings

Why?

• To obtain better clusterings (hypothesis)

How?

• By using cross-lingual information

Page 4

Bilingual Word Clustering

Assumption: Aligned words convey information about their respective clusters

Page 5

Bilingual Word Clustering

Existing: Monolingual Models

Proposed: Monolingual + Bilingual Hints

Page 6

Related Work

• Bilingual Word Clustering (Och, 1999)
  • Language model based objective for monolingual component
  • Word alignment count-based similarity function for bilingual

• Linguistic structure transfer (Täckström et al., 2012)
  • Maximize the correspondence between clusters of aligned words
  • Alternate optimization of mono & bi objective
  • Clustering of only top 1 million words

• POS tagging (Snyder & Barzilay, 2010)
• Word sense disambiguation (Diab, 2003)
• Bilingual graph based projections (Das and Petrov, 2011)

Page 7

Monolingual Objective (Brown et al., 1992)

[Figure: cluster sequence c1 c2 c3 c4 generating the word sequence S = w1 w2 w3 w4, under clustering C]

Maximize the likelihood of the word sequence given the clustering:

P(S; C) = P(c1) * P(w1|c1) * P(c2|c1) * P(w2|c2) * …

Minimize the entropy (surprisal) of the word sequence given the clustering:

H(S; C) = E [ -log P(S; C) ]
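The monolingual objective can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `class_bigram_neg_log_likelihood` and the `cluster_of` mapping are hypothetical, and all probabilities are assumed to be maximum-likelihood estimates taken from the same word sequence that is being scored.

```python
import math
from collections import Counter


def class_bigram_neg_log_likelihood(words, cluster_of):
    """Return -log P(S; C) for the class-based bigram factorization
    P(S; C) = P(c1) * P(w1|c1) * P(c2|c1) * P(w2|c2) * ...

    `cluster_of` maps each word to its cluster id (hypothetical input format).
    """
    clusters = [cluster_of[w] for w in words]
    word_count = Counter(words)
    cluster_count = Counter(clusters)
    # Counts of adjacent cluster pairs and of clusters in the "previous" slot
    bigram_count = Counter(zip(clusters, clusters[1:]))
    prev_count = Counter(clusters[:-1])

    # P(c1): relative frequency of the first cluster
    nll = -math.log(cluster_count[clusters[0]] / len(words))
    for i, (w, c) in enumerate(zip(words, clusters)):
        # Emission term P(w_i | c_i)
        nll -= math.log(word_count[w] / cluster_count[c])
        if i > 0:
            prev = clusters[i - 1]
            # Transition term P(c_i | c_{i-1})
            nll -= math.log(bigram_count[(prev, c)] / prev_count[prev])
    return nll
```

Dividing the returned value by `len(words)` gives a per-token estimate of the entropy H(S; C). On a toy sequence such as `a b a b`, putting `a` and `b` in separate clusters yields a lower value than merging them into one cluster, which is the direction the objective rewards.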

Page 8

Bilingual Objective

Maximize the information we know about one clustering given another

[Figure: clusters 1, 2, 3 of Language 1 linked to clusters 1, 2, 3 of Language 2 by word alignments]

Page 9

Bilingual Objective

Minimize the entropy of one clustering given the other

[Figure: clusters 1, 2, 3 of Language 1 linked to clusters 1, 2, 3 of Language 2 by word alignments]

Page 10

Bilingual Objective

For aligned words x in clustering C and y in clustering D, the association between Cx and Dy can be written as:

p(Cx|Dy) + p(Dy|Cx)

where

p(Dy|Cx) = a / (a + b)

[Figure: clusters Cx and Cw on one side and Dy and Dz on the other, connected by alignment edges with counts a, b and c]
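The association score above can be estimated from alignment edge counts. The sketch below is only an illustration under stated assumptions: the function `cluster_association` and its input format are hypothetical, and each conditional probability is assumed to be the share of alignment edges leaving (or entering) a cluster that land in the other cluster, in the spirit of p(Dy|Cx) = a / (a + b).

```python
from collections import Counter


def cluster_association(alignments, cluster_c, cluster_d, cx, dy):
    """Return p(Cx|Dy) + p(Dy|Cx) estimated from word alignment edges.

    alignments: list of (x, y) aligned word pairs (hypothetical format)
    cluster_c, cluster_d: word -> cluster id for each language
    """
    # Count alignment edges between every pair of clusters
    edges = Counter((cluster_c[x], cluster_d[y]) for x, y in alignments)
    edges_from_cx = sum(n for (c, _), n in edges.items() if c == cx)
    edges_into_dy = sum(n for (_, d), n in edges.items() if d == dy)
    shared = edges[(cx, dy)]
    if edges_from_cx == 0 or edges_into_dy == 0:
        return 0.0  # no evidence linking the two clusters
    return shared / edges_into_dy + shared / edges_from_cx
```

With 3 edges between Cx and Dy, 1 edge between Cx and Dz, and 2 edges between Cw and Dy, this gives p(Dy|Cx) = 3/4 and p(Cx|Dy) = 3/5.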

Page 11

Bilingual Objective

• Thus for the two clusterings,

AVI (C, D) = E(i, j) [ -log p(Ci|Dj) - log p(Dj|Ci) ]

• Aligned Variation of Information

• Captures the mutual information content of the two clusterings

• Has distance metric properties
  • Non-negative: AVI (C, D) ≥ 0
  • Symmetric: AVI (C, D) = AVI (D, C)
  • Triangle Inequality: AVI (C, E) ≤ AVI (C, D) + AVI (D, E)
  • Identity of Indiscernibles: AVI (C, D) = 0, iff C ≅ D
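A short Python sketch may make the AVI definition concrete. It assumes, as a reading of the formula rather than something the slides state, that the expectation E(i, j) is the empirical average over alignment edges; the function name and input format are hypothetical.

```python
import math
from collections import Counter


def aligned_variation_of_information(alignments, cluster_c, cluster_d):
    """AVI(C, D) = E_(i,j)[ -log p(Ci|Dj) - log p(Dj|Ci) ].

    alignments: list of (x, y) aligned word pairs (hypothetical format)
    cluster_c, cluster_d: word -> cluster id for each language
    """
    edges = Counter((cluster_c[x], cluster_d[y]) for x, y in alignments)
    total = sum(edges.values())
    # Number of alignment edges touching each cluster on either side
    c_edges, d_edges = Counter(), Counter()
    for (c, d), n in edges.items():
        c_edges[c] += n
        d_edges[d] += n

    avi = 0.0
    for (c, d), n in edges.items():
        p_c_given_d = n / d_edges[d]
        p_d_given_c = n / c_edges[c]
        # Weight each cluster pair by its share of all alignment edges
        avi -= (n / total) * (math.log(p_c_given_d) + math.log(p_d_given_c))
    return avi
```

When every cluster in one language aligns to exactly one cluster in the other, both conditionals equal 1 and the score is 0. Swapping the roles of the two languages leaves the value unchanged, which matches the symmetry property listed above.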

Page 12

Joint Objective

α [ H (C) + H (D) ] + β AVI (C, D)

• Monolingual term, H (C) + H (D): word sequence information

• Bilingual term, AVI (C, D): cross-lingual information

α, β are the weights of the mono and bi objectives resp.

Page 13

Inference

[Figure: factor graph for monolingual & bilingual word clustering, with monolingual and bilingual factors]

We want to do MAP inference on the factor graph

Page 14

Inference

• Optimization
  • Optimal solution is a hard combinatorial problem (Och, 1995)
  • Greedy hill climbing word exchange (Martin et al., 1995)
  • Transfer word to the cluster with max improvement

• Initialization
  • Round-robin based on frequency

• Termination
  • No. of words exchanged < 0.1% of (vocab1 + vocab2)
  • At least 5 complete iterations
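The procedure on this slide can be sketched as follows. This is a naive illustration under several assumptions, not the authors' code: the names `round_robin_init` and `exchange` are hypothetical, `objective` is any callable that scores a full assignment (lower is better) and is recomputed from scratch for every candidate move, the two termination conditions are assumed to be required together, and `max_iters` is an added safety cap that the slides do not mention.

```python
from collections import Counter


def round_robin_init(words, k):
    """Assign words to k clusters round-robin, in order of decreasing frequency."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: i % k for i, w in enumerate(ranked)}


def exchange(assignment, k, objective, min_iters=5, threshold=0.001, max_iters=100):
    """Greedy hill-climbing word exchange that minimizes objective(assignment).

    Stops once at least `min_iters` full passes are done and fewer than
    `threshold` (0.1%) of the vocabulary moved in the last pass.
    """
    vocab = list(assignment)
    for iteration in range(1, max_iters + 1):
        moved = 0
        for w in vocab:
            original = assignment[w]
            best_cluster, best_score = original, objective(assignment)
            # Try every other cluster and keep the one with the largest improvement
            for c in range(k):
                if c == original:
                    continue
                assignment[w] = c
                score = objective(assignment)
                if score < best_score:
                    best_cluster, best_score = c, score
            assignment[w] = best_cluster
            if best_cluster != original:
                moved += 1
        if iteration >= min_iters and moved < threshold * len(vocab):
            break
    return assignment
```

In the bilingual setting the vocabulary would span both languages, which is why the slide's threshold is stated relative to vocab1 + vocab2. A practical implementation would update counts incrementally instead of rescoring the whole assignment for each candidate move.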

Page 15

Evaluation

Named Entity Recognition (NER)

• Core information extraction task
  • Very sensitive to word representations

• Word clusters are useful for downstream tasks (Turian et al., 2010)

• Can be directly used as features for NER
  • English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010)

Page 16

Data and Tools

German NER

• Training & Test data: CoNLL 2003
  • 220,000 and 55,000 tokens resp.

• Corpora for clustering: WIT-3 (Cettolo et al., 2012)
  • Collection of TED talks
  • {Arabic, English, French, Korean, Turkish} – German
  • Around 1.5 million German tokens for each pair

• Stanford NER for training (Finkel and Manning, 2009)
  • In-built functionality to use word clusters for generalization

• cdec for unsupervised word alignments (Dyer et al., 2013)

Page 17

Experiments

General objective: α [ H (C) + H (D) ] + β AVI (C, D)

Baseline: No clusters

1. Bilingual Information Only
  • α = 0, β = 1
  • Objective: AVI (C, D)

2. Monolingual Information Only
  • α = 1, β = 0
  • Objective: H (C) + H (D)

3. Monolingual + Bilingual Information
  • α = 1, β = 0.1
  • Objective: H (C) + H (D) + 0.1 AVI (C, D)
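The three experimental settings are special cases of the same weighted sum, which a small Python sketch can make explicit. The names `joint_objective` and `SETTINGS` are hypothetical, and the entropy and AVI values passed in are assumed to have been computed elsewhere.

```python
def joint_objective(h_c, h_d, avi, alpha, beta):
    """alpha * [H(C) + H(D)] + beta * AVI(C, D)."""
    return alpha * (h_c + h_d) + beta * avi


# (alpha, beta) weights for the three settings listed on the slide
SETTINGS = {
    "bilingual_only": (0.0, 1.0),
    "monolingual_only": (1.0, 0.0),
    "mono_plus_bilingual": (1.0, 0.1),
}
```

Calling `joint_objective(h_c, h_d, avi, *SETTINGS["mono_plus_bilingual"])` reproduces H (C) + H (D) + 0.1 AVI (C, D), while the other two entries drop one of the terms entirely.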

Page 18

Alignment Edge Filtering

• Word alignments are not perfect

• We filter out alignment edges between two words (x, y) if:

2 * b / ( (a + b + c) + (b + d) ) ≤ η

[Figure: words x and y with alignment edges labelled a, b, c and d]

• Training η for different language pairs:

Language   η
English    0.1
French     0.1
Arabic     0.3
Turkish    0.5
Korean     0.7
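The filtering rule can be sketched in Python. Since the figure is not recoverable, the reading used here is an assumption: b is taken to be the number of times x and y are aligned to each other, (a + b + c) the total number of alignment edges of x, and (b + d) the total for y, which makes the score a Dice-style coefficient. The function name and input format are hypothetical.

```python
from collections import Counter


def filter_alignment_edges(alignments, eta):
    """Drop edges (x, y) whose score 2*b / (edges(x) + edges(y)) is <= eta.

    alignments: list of (x, y) aligned word pairs, one entry per alignment edge
    """
    pair_count = Counter(alignments)               # b for each (x, y)
    x_count = Counter(x for x, _ in alignments)    # assumed (a + b + c)
    y_count = Counter(y for _, y in alignments)    # assumed (b + d)
    kept = []
    for x, y in alignments:
        score = 2 * pair_count[(x, y)] / (x_count[x] + y_count[y])
        if score > eta:
            kept.append((x, y))
    return kept
```

A larger η removes more edges, which is consistent with the table assigning higher thresholds to language pairs whose automatic alignments are presumably noisier.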

Page 19

Results

F1 scores of German NER trained using different word clusters on the Training set

Page 20

Results

F1 scores of German NER trained using different word clusters on the Test set

Page 21

Ongoing Work

Multilingual Word Clustering

[Figure: multilingual word clustering with monolingual and bilingual components]

Page 22

Ongoing Work

Current work: Parallel Data

Mono1 + Parallel Data

Mono1 + Parallel Data + Mono2

Page 23

Conclusion

• Novel information theoretic model for bilingual clustering
  • The bilingual objective has an intuitive meaning
  • Joint optimization of the mono + bi objective

• Improvement in clustering quality over monolingual clustering

• Extendable to any number of languages incorporating both monolingual and parallel data

Page 24

Thank You!