Collocation Extraction Using Monolingual Word Alignment Method
Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li. EMNLP 2009.
![Page 1: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/1.jpg)
Collocation Extraction Using Monolingual Word Alignment Method
Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li
EMNLP 2009
![Page 2: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/2.jpg)
Collocation
• Two words
  – Consecutive ("by accident")
  – Interrupted ("take ... advice")
• Other examples
  – Proper noun ("New York")
  – Compound noun ("ice cream")
  – Correlative conjunction ("either ... or")
![Page 3: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/3.jpg)
Previous Work
• Co-occurring word pairs
  – Word pairs within a given window size
• Association measures
  – Frequency, log-likelihood, mutual information, ...
• Disadvantages
  – Long-span collocations are missed
    • "either ... or", "because ... so"
    • Limited by window size
  – False collocations
    • Any word pair within the window becomes a candidate
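The two disadvantages above can be reproduced on a toy corpus. The following is a minimal sketch of a window-based extractor scored with pointwise mutual information. The function names, the toy sentences, and the window sizes are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def window_pairs(tokens, window=3):
    """Yield every unordered word pair whose positions differ by at most `window`."""
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            yield tuple(sorted((left, tokens[j])))


def pmi_scores(sentences, window=3):
    """Score each co-occurring pair with pointwise mutual information (log base 2)."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(window_pairs(tokens, window))
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_pair = n / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores


# Hypothetical toy corpus: "either" and "or" are 5 and 4 positions apart.
toy_corpus = [
    "either you take my advice or you leave".split(),
    "you can either stay here tonight or go home".split(),
]
narrow = pmi_scores(toy_corpus, window=3)
wide = pmi_scores(toy_corpus, window=5)
```

With `window=3` the pair ("either", "or") is never seen, while an arbitrary neighbouring pair such as ("my", "you") is still scored. Widening the window recovers the long-span pair only by admitting even more arbitrary pairs.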
![Page 4: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/4.jpg)
Monolingual Word Alignment
• Bilingual word alignment (BWA)
  – Source-target sentence pairs
• Monolingual word alignment (MWA)
  – Source-source sentence pairs
  – Replicate the corpus, so that each sentence is paired with a copy of itself
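Replicating the corpus amounts to building "parallel" data in which both sides are the same sentence. A minimal sketch, assuming tokenized sentences held as Python lists (the helper name is hypothetical):

```python
def replicate_corpus(sentences):
    """Pair every sentence with an independent copy of itself (source-source pairs)."""
    return [(list(tokens), list(tokens)) for tokens in sentences]


corpus = [["take", "my", "advice"], ["by", "accident"]]
pairs = replicate_corpus(corpus)
```

Each element of `pairs` can then be fed to an aligner in place of a source-target sentence pair.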
![Page 5: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/5.jpg)
Monolingual Word Alignment (2)
[Figure: example word alignments, bilingual vs. monolingual]
• A word never collocates with itself
![Page 6: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/6.jpg)
MWA Model
• Sentence with $l$ words: $S = \{w_1, \ldots, w_l\}$
• Alignment: $A = \{(i, a_i) \mid i \in [1, l]\}$
  – Example: A = {(2,3), (3,2), (4,7), (6,7)...}
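The alignment set can be represented directly as a collection of 1-based position pairs. The sketch below checks the constraints implied so far: positions lie in [1, l], each position i has at most one a_i, and a word is never linked to itself. The function name and the sentence length used in the example are assumptions for illustration.

```python
def is_valid_mwa_alignment(alignment, sentence_length):
    """Check a collection of 1-based (i, a_i) links against the MWA constraints."""
    seen_positions = set()
    for i, a_i in alignment:
        in_range = 1 <= i <= sentence_length and 1 <= a_i <= sentence_length
        # Reject out-of-range links, self-links, and a second link from position i.
        if not in_range or i == a_i or i in seen_positions:
            return False
        seen_positions.add(i)
    return True


# The four links shown in the example, in a hypothetical 7-word sentence.
example = {(2, 3), (3, 2), (4, 7), (6, 7)}
ok = is_valid_mwa_alignment(example, sentence_length=7)
self_link = is_valid_mwa_alignment({(2, 2)}, sentence_length=7)
```

Note that two positions may share the same target (both 4 and 6 link to 7), which is what gives a word a fertility above 1.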
![Page 7: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/7.jpg)
MWA Model (2)
• Adapt IBM Model 3 to MWA
• EM training produces three kinds of probabilities
  – Word collocation probability
  – Position collocation probability
    • d(4|7,12)
    • Probability that the 4th word collocates with the 7th word in a 12-word sentence
  – Fertility probability
    • Probability that $w_i$ collocates with $\Phi_i$ words
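To make the roles of the three tables concrete, here is a hedged sketch that scores one alignment as a plain product of word, position, and fertility lookups, in the spirit of IBM Model 3. The exact factorization used in the paper is not given on the slides, so the product form, the table layouts `t`, `d`, `n`, and all toy values below are assumptions.

```python
import math
from collections import Counter


def alignment_log_prob(tokens, alignment, t, d, n):
    """Log-probability of an alignment as a product of three table lookups.

    alignment: dict mapping 1-based position j to its aligned position a_j.
    t[(w_j, w_aj)]   word collocation probability   (assumed layout)
    d[(j, a_j, l)]   position collocation probability, i.e. d(j | a_j, l)
    n[(phi, w_i)]    fertility probability          (assumed layout)
    """
    l = len(tokens)
    # Fertility of position i = number of positions aligned to it.
    fertility = Counter(alignment.values())
    log_p = 0.0
    for i, word in enumerate(tokens, start=1):
        log_p += math.log(n[(fertility[i], word)])
    for j, a_j in alignment.items():
        log_p += math.log(t[(tokens[j - 1], tokens[a_j - 1])])
        log_p += math.log(d[(j, a_j, l)])
    return log_p


# Toy tables with made-up values, for illustration only.
tokens = ["take", "advice"]
alignment = {1: 2, 2: 1}
t = {("take", "advice"): 0.4, ("advice", "take"): 0.2}
d = {(1, 2, 2): 1.0, (2, 1, 2): 1.0}
n = {(1, "take"): 0.5, (1, "advice"): 0.5}
prob = math.exp(alignment_log_prob(tokens, alignment, t, d, n))
```

In this layout the slide's example d(4|7,12) would be stored under the key `(4, 7, 12)`.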
![Page 8: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/8.jpg)
Collocation Extraction
• Extract and rank
  – Filter out pairs with $freq(w_i, w_j) < 5$
• Symmetric assumption
  – $(w_i, w_j) = (w_j, w_i)$
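The extraction step can be sketched as counting aligned pairs, merging the two directions under the symmetric assumption, and dropping rare pairs. The slides do not show the ranking formula, so the score used here (the average of the two directional relative frequencies) and the way pair frequency is counted are assumptions, as are the function name and the toy data.

```python
from collections import Counter


def extract_collocations(aligned_sentences, min_freq=5):
    """Rank unordered word pairs found in alignments.

    aligned_sentences: iterable of (tokens, links), links being 1-based (i, a_i) pairs.
    """
    word_freq = Counter()
    link_freq = Counter()  # directed counts: (w_i, w_{a_i})
    for tokens, links in aligned_sentences:
        word_freq.update(tokens)
        for i, a_i in links:
            link_freq[(tokens[i - 1], tokens[a_i - 1])] += 1

    # Symmetric assumption: (w_i, w_j) and (w_j, w_i) are the same candidate.
    pair_freq = Counter()
    for (w1, w2), count in link_freq.items():
        pair_freq[tuple(sorted((w1, w2)))] += count

    scores = {}
    for (w1, w2), freq in pair_freq.items():
        if freq < min_freq:
            continue  # filter when freq(w_i, w_j) < 5
        p12 = link_freq[(w1, w2)] / word_freq[w2]
        p21 = link_freq[(w2, w1)] / word_freq[w1]
        scores[(w1, w2)] = (p12 + p21) / 2  # assumed ranking score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical aligned data: one pair linked in both directions, one rare pair.
data = [(["take", "advice"], [(1, 2), (2, 1)])] * 3
data += [(["by", "accident"], [(1, 2)])]
ranked = extract_collocations(data)
```

Here ("advice", "take") reaches a pair frequency of 6 and is kept, while the single ("accident", "by") link is filtered out.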
![Page 9: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/9.jpg)
Initial Experiment
• Chinese
• Training data
  – LDC2007T03 Tagged Chinese Gigaword
  – Xinhua portion, 28M words
• Gold set
  – Handcrafted collocation dictionaries
  – 56,888 collocations
![Page 10: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/10.jpg)
Initial Experiment (2)
• Precision
• Baselines
  – Frequency, log-likelihood, mutual information
  – Log-likelihood achieves the best performance
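Precision against a gold set can be computed over the top N ranked candidates. A minimal sketch, assuming pairs are compared as unordered sets of words; the function name and the toy lists are hypothetical.

```python
def precision_at_n(ranked_pairs, gold_pairs, n):
    """Fraction of the top-n ranked pairs that appear in the gold set (order-insensitive)."""
    top = ranked_pairs[:n]
    if not top:
        return 0.0
    gold = {frozenset(pair) for pair in gold_pairs}
    hits = sum(1 for pair in top if frozenset(pair) in gold)
    return hits / len(top)


ranked = [("take", "advice"), ("he", "said"), ("ice", "cream"), ("the", "of")]
gold = [("advice", "take"), ("ice", "cream")]
p_at_1 = precision_at_n(ranked, gold, 1)
p_at_4 = precision_at_n(ranked, gold, 4)
```

Because the gold set is finite, a correct collocation that is missing from the dictionaries counts as an error, so this measure understates true precision.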
![Page 11: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/11.jpg)
Initial Experiment (3)
• Observation
  – Precision is low
    • Small gold set (57K/200K ≈ 28%)
  – Low precision when N < 20K
![Page 12: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/12.jpg)
Observation
• Frequency vs. Probability vs. Precision
• Precision curve
  – Lower freq --> lower precision
• Alignment probability curve
  – Lower freq --> higher probability
![Page 13: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/13.jpg)
Observation (2)
• Conclusion
  – What causes the lower precision of the top 20K?
  – Collocations with low frequency but high probability
![Page 14: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/14.jpg)
Improved MWA Method
• Add a penalization function $y = f(x)$, where $x = freq(w_1, w_2)$
  – When $x$ is small, $y$ approaches 0 (penalize)
  – When $x$ is large, $y$ approaches 1 (do not penalize)
• $y = e^{-b/x}$ ($b$ is tuned to 25)
• New ranking score
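The behaviour of the penalty is easy to check numerically. The sketch below implements $y = e^{-b/x}$ with $b = 25$. The slide does not show how the penalty enters the new ranking score, so multiplying it with the alignment probability is an assumption, as are the function names.

```python
import math


def frequency_penalty(freq, b=25.0):
    """y = exp(-b / x): close to 0 for rare pairs, close to 1 for frequent ones."""
    return math.exp(-b / freq)


def penalized_score(alignment_prob, freq, b=25.0):
    """Assumed combination: alignment probability scaled by the frequency penalty."""
    return alignment_prob * frequency_penalty(freq, b)


rare = penalized_score(0.9, freq=5)        # high probability, low frequency
frequent = penalized_score(0.3, freq=500)  # lower probability, high frequency
```

With these toy inputs the rare pair drops below the frequent one, which is the reordering the penalty is meant to achieve.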
![Page 15: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/15.jpg)
Further Evaluation
• Automatic evaluation
  – Greatly outperforms the best baseline
  – For the top 1K, precision is 20.6% vs. 11.7% for the best baseline
  – The exponential function plays a key role
![Page 16: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/16.jpg)
Further Evaluation (2)
• Human evaluation
  – Top 1K collocations
  – For each collocation, tag "True" or "False"
• 4 "False" cases
  – A: two semantically related words
    • (醫生, 護士) (doctor, nurse)
  – B: a part of a multi-word collocation (>= 3 words)
    • (自我, 機制) (self, mechanism) in (自我, 約束, 機制) (self, restraint, mechanism)
  – C: high-frequency bigram
    • (他, 說) (he, say), (這, 是) (this, is), (很, 好) (very, good)
  – D: two words co-occurring frequently
    • (北京, 月) (Beijing, month), (和, 為) (and, for)
![Page 17: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/17.jpg)
Further Evaluation (3)
• Far more true collocations than the baseline
• False collocations
  – A: semantically related words, which MWA cannot distinguish from collocations
  – B: only two-word collocations are extracted
    • Few collocations have >= 3 words
  – C: frequent bigrams, which MWA cannot distinguish from collocations
  – D: far fewer than the baseline
![Page 18: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/18.jpg)
Further Evaluation (3) cont.
• MWA is able to extract long-span collocations
• 48 extracted collocations with span > 6
  – 33 are tagged "True"
    • ("處於", "狀態") (be in, state), ("由於", "因此") (because, therefore)
  – 69% precision
![Page 19: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/19.jpg)
Fertility vs. Precision
• Manually label 100 sentences and observe fertility
  – 78% of words collocate with 1 word
  – 17% of words collocate with 2 words
  – 95% of words have fertility <= 2
• Limit $\Phi_{max}$
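Fertility statistics of this kind can be gathered from labeled collocation links. The sketch below counts, for each word position, how many distinct partners it has, then reports the share of collocating words at or below a fertility cap. The input format, the function names, and the choice to normalize over collocating words only are assumptions; the slide does not say how its percentages were normalized.

```python
from collections import Counter


def fertility_histogram(labeled_sentences):
    """Count word positions by number of distinct collocation partners.

    labeled_sentences: iterable of (sentence_length, links), links being
    unordered 1-based position pairs.
    """
    histogram = Counter()
    for length, links in labeled_sentences:
        partners = {pos: set() for pos in range(1, length + 1)}
        for i, j in links:
            partners[i].add(j)
            partners[j].add(i)
        histogram.update(len(p) for p in partners.values())
    return histogram


def share_within(histogram, max_fertility):
    """Share of collocating words (fertility >= 1) with fertility <= max_fertility."""
    collocating = sum(c for f, c in histogram.items() if f >= 1)
    within = sum(c for f, c in histogram.items() if 1 <= f <= max_fertility)
    return within / collocating if collocating else 0.0


# Hypothetical labeled sentence of 5 words with three collocation links.
hist = fertility_histogram([(5, {(1, 2), (2, 3), (4, 5)})])
share_one = share_within(hist, 1)
share_two = share_within(hist, 2)
```

If most words fall within a small fertility, capping the maximum fertility during training discards few genuine links.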
![Page 20: Collocation Extraction Using Monolingual Word Alignment Method](https://reader034.vdocuments.us/reader034/viewer/2022051315/56812ba4550346895d8fd633/html5/thumbnails/20.jpg)
Conclusion
• Main contributions
  – Successfully adapt BWA to MWA
  – Propose a ranking method
    • Alignment probability + exponential penalty function
• Initial failures are discussed in detail
• Future work
  – Improving Statistical Machine Translation with Monolingual Collocation, ACL 2010
  – Improve alignment, phrase table