An Information Theoretic Approach to Bilingual Word Clustering
Manaal Faruqui & Chris Dyer
Language Technologies Institute, SCS, CMU
Word Clustering
Grouping of words capturing syntactic, semantic and distributional regularities
[Figure: example word clusters. Places: Iran, USA, India, Paris, London. Verbs: play, laugh, eat, run, fight. Numbers: 11, 13.4, 22,000, 100. Adjectives: good, nice, better, awesome, cool]
Bilingual Word Clustering
What ?
• Clustering words of two languages simultaneously
• Inducing a dependence between the two clusterings
Why ?
• To obtain better clusterings (hypothesis)
How ?
• By using cross-lingual information
Bilingual Word Clustering
Assumption: Aligned words convey information about their respective clusters
Bilingual Word Clustering
Existing: Monolingual Models
Proposed: Monolingual + Bilingual Hints
Related Work
• Bilingual Word Clustering (Och, 1999)
  • Language model based objective for the monolingual component
  • Word alignment count-based similarity function for the bilingual component
• Linguistic structure transfer (Täckström et al., 2012)
  • Maximize the correspondence between clusters of aligned words
  • Alternating optimization of the mono & bi objectives
  • Clustering of only the top 1 million words
• POS tagging (Snyder & Barzilay, 2010)
• Word sense disambiguation (Diab, 2003)
• Bilingual graph-based projections (Das and Petrov, 2011)
Monolingual Objective
P(S; C) = P(c1) * P(w1 | c1) * P(c2 | c1) * P(w2 | c2) * …   (Brown et al., 1992)

[Figure: class-based bigram model. Cluster sequence c1 → c2 → c3 → c4, emitting words w1, w2, w3, w4]

H(S; C) = E[ -log P(S; C) ]
Maximize the likelihood of the word sequence given the clustering
Minimize the entropy (surprisal) of the word sequence given the clustering
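To make this concrete, here is a minimal Python sketch of scoring a clustering by the empirical surprisal of a toy corpus under a Brown-style class bigram model; the function name and the toy data are illustrative, not the authors' code:

```python
import math
from collections import Counter

def neg_log_likelihood(corpus, cluster_of):
    """-log P(S; C): score a corpus under a Brown-style class bigram model.
    corpus: list of token lists; cluster_of: word -> cluster id."""
    uni, bi, emit = Counter(), Counter(), Counter()
    for sent in corpus:
        prev = None
        for w in sent:
            c = cluster_of[w]
            uni[c] += 1                      # cluster unigram counts
            emit[(w, c)] += 1                # word-given-cluster counts
            if prev is not None:
                bi[(prev, c)] += 1           # cluster bigram counts
            prev = c
    total = sum(uni.values())
    nll = 0.0
    for sent in corpus:
        prev = None
        for w in sent:
            c = cluster_of[w]
            # P(c1) for the first word, P(c | c_prev) afterwards
            p_trans = uni[c] / total if prev is None else bi[(prev, c)] / uni[prev]
            p_emit = emit[(w, c)] / uni[c]   # P(w | c)
            nll -= math.log(p_trans) + math.log(p_emit)
            prev = c
    return nll

corpus = [["the", "cat", "runs"], ["the", "dog", "runs"]]
print(neg_log_likelihood(corpus, {"the": 0, "cat": 1, "dog": 1, "runs": 2}))
```

Minimizing this quantity over cluster assignments is exactly the "minimize surprisal" view stated above.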
Bilingual Objective
Maximize the information we know about one clustering given another
[Figure: clusters 1, 2, 3 of Language 1 connected to clusters 1, 2, 3 of Language 2 by word alignments]
Bilingual Objective
[Figure: the same two clusterings, linked by word alignment edges]
Minimize the entropy of one clustering given the other
Bilingual Objective
For aligned words x in clustering C and y in clustering D, the association between Cx and Dy can be written as:

p(Cx | Dy) + p(Dy | Cx)
[Figure: a alignment edges connect Cx and Dy, b edges connect Cx and Dz, and c edges connect Cw and Dy]

p(Dy | Cx) = a / (a + b)

where a and b count the alignment edges from Cx to Dy and to Dz respectively.
Bilingual Objective
• Thus, for the two clusterings:

AVI(C, D) = E(i, j) [ -log p(Ci | Dj) - log p(Dj | Ci) ]

• Aligned Variation of Information (AVI)
• Captures the mutual information content of the two clusterings
• Has distance metric properties:
  • Non-negative: AVI(C, D) ≥ 0
  • Symmetric: AVI(C, D) = AVI(D, C)
  • Triangle inequality: AVI(C, E) ≤ AVI(C, D) + AVI(D, E)
  • Identity of indiscernibles: AVI(C, D) = 0 iff C ≅ D
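A minimal sketch of AVI computed from a list of alignment edges, estimating p(Ci | Dj) from edge counts as on the previous slide; names and toy data are mine, not the authors' code:

```python
import math
from collections import Counter

def avi(alignments, cluster_c, cluster_d):
    """Aligned Variation of Information between clusterings C and D.
    alignments: list of aligned word pairs (x, y); cluster_c / cluster_d
    map each word to its cluster id."""
    joint, marg_c, marg_d = Counter(), Counter(), Counter()
    for x, y in alignments:
        i, j = cluster_c[x], cluster_d[y]
        joint[(i, j)] += 1
        marg_c[i] += 1                       # edges incident to cluster Ci
        marg_d[j] += 1                       # edges incident to cluster Dj
    total = len(alignments)
    score = 0.0
    for (i, j), n in joint.items():
        p_i_given_j = n / marg_d[j]          # p(Ci | Dj) from edge counts
        p_j_given_i = n / marg_c[i]          # p(Dj | Ci)
        score += (n / total) * (-math.log(p_i_given_j) - math.log(p_j_given_i))
    return score

pairs = [("house", "maison"), ("house", "maison"), ("dog", "chien")]
print(avi(pairs, {"house": 0, "dog": 1}, {"maison": 0, "chien": 1}))  # 0.0
```

When the two clusterings agree perfectly on the aligned pairs, every conditional probability is 1 and AVI is 0, matching the identity-of-indiscernibles property.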
Joint Objective
α [ H(C) + H(D) ] + β AVI(C, D)

Monolingual term: word sequence information. Bilingual term: cross-lingual information.

α, β are the weights of the monolingual and bilingual objectives respectively.
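Reusing neg_log_likelihood and avi from the sketches above, the joint objective is then just a weighted sum; this is illustrative, with the entropies approximated by the corpora's empirical negative log likelihoods:

```python
def joint_objective(corpus1, corpus2, cluster_c, cluster_d, alignments,
                    alpha=1.0, beta=0.1):
    """alpha * [H(C) + H(D)] + beta * AVI(C, D), to be minimized.
    neg_log_likelihood and avi are the sketches defined earlier."""
    mono = (neg_log_likelihood(corpus1, cluster_c)
            + neg_log_likelihood(corpus2, cluster_d))
    return alpha * mono + beta * avi(alignments, cluster_c, cluster_d)
```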
Inference
[Figure: factor graph for monolingual & bilingual word clustering, with monolingual and bilingual factors]
We want to do MAP inference on the factor graph
Inference
• Optimization
  • Finding the optimal solution is a hard combinatorial problem (Och, 1995)
  • Greedy hill-climbing word exchange (Martin et al., 1995); see the sketch after this list
  • Transfer each word to the cluster with the maximum improvement
• Initialization
  • Round-robin, based on frequency
• Termination
  • No. of words exchanged < 0.1% of (vocab1 + vocab2)
  • At least 5 complete iterations
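A naive Python sketch of the exchange loop; the names word_exchange and objective are mine, and a real implementation would update counts incrementally rather than re-scoring the full objective for every candidate move:

```python
def word_exchange(vocab, clusters, num_clusters, objective, min_moves,
                  max_iters=50):
    """Greedy hill-climbing word exchange: move each word to the cluster
    that most decreases the objective. Stops when fewer than min_moves
    words change cluster, after at least 5 complete iterations.
    Per the slide, min_moves = 0.001 * (|vocab1| + |vocab2|)."""
    for iteration in range(1, max_iters + 1):
        moves = 0
        for w in vocab:
            start = clusters[w]
            best_k, best_val = start, None
            for k in range(num_clusters):    # try every cluster for w
                clusters[w] = k
                val = objective(clusters)
                if best_val is None or val < best_val:
                    best_k, best_val = k, val
            clusters[w] = best_k             # keep the best move found
            if best_k != start:
                moves += 1
        if iteration >= 5 and moves < min_moves:
            break
    return clusters
```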
Evaluation
Named Entity Recognition (NER)
Evaluation
• Core information extraction task
• Very sensitive to word representations
• Word clusters are useful for downstream tasks (Turian et al., 2010)
• Can be directly used as features for NER
  • English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010)
Data and Tools
German NER
• Training & test data: CoNLL 2003
  • 220,000 and 55,000 tokens respectively
• Corpora for clustering: WIT-3 (Cettolo et al., 2012)
  • Collection of TED talks
  • {Arabic, English, French, Korean, Turkish} – German
  • Around 1.5 million German tokens for each pair
• Stanford NER for training (Finkel and Manning, 2009)
  • Built-in functionality to use word clusters for generalization
• cdec for unsupervised word alignments (Dyer et al., 2013)
Experiments
Baseline: No clusters
1. Bilingual information only
  • α = 0, β = 1
  • Objective: AVI(C, D)
2. Monolingual information only
  • α = 1, β = 0
  • Objective: H(C) + H(D)
3. Monolingual + bilingual information
  • α = 1, β = 0.1
  • Objective: H(C) + H(D) + 0.1 AVI(C, D)

(Recall the joint objective: α [ H(C) + H(D) ] + β AVI(C, D))
Alignment Edge Filtering
• Word alignments are not perfect
• We filter out alignment edges between two words (x, y) if:
[Figure: words x and y with alignment edge counts a, b, c, d; b edges connect x and y, a and c are x's other edges, d is y's other edges]

2 * b / ( (a + b + c) + (b + d) ) ≤ η
• Tuned values of η for the different language pairs:

  English: 0.1
  French: 0.1
  Arabic: 0.3
  Turkish: 0.5
  Korean: 0.7
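A small sketch of the filter under the reading that b counts the edges between x and y, while a + b + c and b + d are the total edges incident to x and to y; function and variable names are mine:

```python
def keep_edge(count_xy, deg_x, deg_y, eta):
    """Keep an alignment edge (x, y) only if its Dice-style score exceeds eta.
    count_xy: alignment edges between x and y (b in the figure);
    deg_x, deg_y: total alignment edges incident to x (a + b + c)
    and to y (b + d)."""
    return 2.0 * count_xy / (deg_x + deg_y) > eta

# With example counts a, b, c, d = 3, 5, 2, 4 and the Arabic threshold:
a, b, c, d = 3, 5, 2, 4
print(keep_edge(b, a + b + c, b + d, eta=0.3))  # True: 10/19 ≈ 0.53 > 0.3
```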
Results
[Figure: F1 scores of German NER trained using different word clusters, on the training set]
Results
[Figure: F1 scores of German NER trained using different word clusters, on the test set]
Ongoing Work
Multilingual Word Clustering

[Figure: the factor graph with monolingual and bilingual factors, extended beyond two languages]
Ongoing Work
• Current work: parallel data
• Mono1 + parallel data
• Mono1 + parallel data + Mono2
Conclusion
• Novel information theoretic model for bilingual clustering
  • The bilingual objective has an intuitive meaning
  • Joint optimization of the mono + bi objective
• Improvement in clustering quality over monolingual clustering
• Extendable to any number of languages, incorporating both monolingual and parallel data
Thank You!