![Page 1: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/1.jpg)
Cross-Language Name Search
Raghavendra Udupa, Microsoft Research India
Mitesh Khapra, IIT Bombay
NAACL-HLT 2010, June 3, 2010
Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search
![Page 2: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/2.jpg)
Name Search
• Searching people directories by name.
Examples: Facebook Friend Search, Outlook Address Book Search
![Page 3: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/3.jpg)
Cross-Language Name Search
• Searching people directories by name across languages.
Examples: Query in Russian, Query in Hebrew
![Page 4: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/4.jpg)
Challenges
• Script and phonetic differences
• Large Directories
– Millions of names
• Multi-word Names and Partial Matches
• Spelling Variations
![Page 5: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/5.jpg)
Naive Approach
• Transliterate and Search
– Example: Rashid, רשיד
• Limitations
– Slow, as it involves the intermediate step of transliteration generation.
– Machine Transliteration is not perfect
– Transliteration errors affect search results
• Is Transliteration Generation necessary?
![Page 6: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/6.jpg)
Our Approach
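[Figure: the names רשיד, אנטוני and Rashid are mapped to a language-independent geometric representation, and similarity is read off the angle between the resulting vectors. The figure is labelled cos θ ≈ 1 and cos θ ≈ −1.]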
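The similarity measure in the figure can be sketched in a few lines. This is a minimal illustration only: the three-dimensional vectors below are invented for the example (the slides use a 50-dimensional space), and the variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical projections, invented for illustration only.
rashid_en = [0.9, 0.1, 0.4]       # "Rashid" in English
rashid_he = [0.88, 0.12, 0.41]    # the same name written in Hebrew
other_he = [-0.9, -0.1, -0.4]     # an unrelated Hebrew name

print(cosine(rashid_en, rashid_he))  # close to 1: likely the same name
print(cosine(rashid_en, other_he))   # close to -1: different names
```

Names that are transliterations of each other should point in nearly the same direction, whatever script they were written in.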
![Page 7: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/7.jpg)
Search Overview
Names: Aaron, Bharat, Cecile, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar
Query: רשיד
[Figure: the query and the directory names are compared by geometric distance.]
Geometric Nearest Neighbor Search
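A rough sketch of the search step, assuming every directory name has already been projected to a vector. It uses an exhaustive scan as a stand-in for the approximate nearest-neighbor index mentioned in the talk, and the two-dimensional vectors are invented for illustration.

```python
import math

def nearest_names(query_vec, directory, k=3):
    """Return the k directory names whose vectors lie closest to query_vec."""
    def distance(name):
        return math.dist(query_vec, directory[name])
    return sorted(directory, key=distance)[:k]

# Hypothetical 2-dimensional projections of a few directory names.
directory = {
    "Aaron": [0.1, 0.9],
    "Rashid": [0.8, 0.2],
    "Rashmi": [0.7, 0.4],
    "Kumar": [-0.5, 0.3],
}
query = [0.82, 0.18]  # hypothetical projection of the Hebrew query

print(nearest_names(query, directory, k=2))
```

An exhaustive scan is linear in the directory size, so it only illustrates the idea; a directory of millions of names would need an index structure.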
![Page 8: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/8.jpg)
What is the advantage?
• Can scale to reasonably large name directories
– Compact geometric representation
– 50-dimensional space
– 6 M names
• Search is effective and efficient
– Geometric nearest-neighbor search using Approximate Nearest Neighbor (ANN) [Arya et al., 1998]
– ~1 s per query for searching 6 M names
– >20% improvement in MRR over Transliterate-and-Search
![Page 9: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/9.jpg)
What is the challenge?
• Language/Script Independent Representation
– Learning common geometric feature space from training data
• Multi-Word Names and Partial Matches
– Maximum Weighted Bipartite Matching
![Page 10: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/10.jpg)
Previous Work
• Language Independent Representation
– (2004) Canonical Correlation Analysis: An overview with application to learning methods. D. Hardoon et al., Neural Computation 2004.
• Transliteration Equivalence
– (2006) Named entity transliteration and discovery from multilingual comparable corpora. A. Klementiev and D. Roth, HLT-NAACL 2006.
– (2009) Learning better transliterations. J. Pasternack and D. Roth, CIKM 2009.
– (2010) Transliteration equivalence using canonical correlation analysis. R. Udupa and M. Khapra, ECIR 2010.
![Page 11: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/11.jpg)
Common Feature Space
Training Data: Parallel Names
English: Aaron, Bharat, Rick, David, Michael, Sanjay, Stuart, Daniel, Rashmi, Albert, Rashid, Kumar
Kannada: ಆರನ್, ಭರತ್, ರಿಕ್, ಡೇವಿಡ್, ಮೈಕೆಲ್, ಸಂಜಯ, ಸ್ಟುವರ್ಟ್, ಡೇನಿಯಲ್, ರಶ್ಮಿ, ಆಲ್ಬರ್ಟ್, ರಶೀದ್, ಕುಮಾರ್
[Figure: parallel names map to similar vectors in the common feature space.]
![Page 12: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/12.jpg)
Feature Vectors
^R  Ra  as  sh  hi  id  d$  ic  …
1   1   1   1   1   1   1   0   …

^ರ  ರಶ  ಶೀ  ೀದ  ದ್  ್$  ಆಲ  …
1   1   1   1   1   1   0   …
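The feature extraction shown on this slide can be reproduced with a short sketch. It assumes, as the slide suggests, that features are character bigrams with `^` and `$` as boundary markers; the function names and the tiny vocabulary are hypothetical.

```python
def char_bigrams(name):
    """Character bigrams of a name padded with ^ (start) and $ (end)."""
    padded = "^" + name + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def feature_vector(name, vocabulary):
    """Binary vector with a 1 wherever the vocabulary bigram occurs in the name."""
    present = set(char_bigrams(name))
    return [1 if bigram in present else 0 for bigram in vocabulary]

# Tiny illustrative vocabulary; a real one would cover all bigrams in the training names.
vocabulary = ["^R", "Ra", "as", "sh", "hi", "id", "d$", "ic"]

print(char_bigrams("Rashid"))
print(feature_vector("Rashid", vocabulary))
```

The same functions work on any script, because they operate on Unicode characters rather than on a particular alphabet.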
![Page 13: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/13.jpg)
Learning Common Feature Space
Canonical Correlation Analysis
![Page 14: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/14.jpg)
Canonical Correlation Analysis
(1) Aaron / אהרן
(2) Bernard / ברנאר
(3) David / דוד
(4) William / ויליאם
[Figure: points labelled 1 to 4 for each language, plotted separately and then together.]
![Page 15: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/15.jpg)
Learning Common Feature Space
Canonical Correlation Analysis (Hotelling, 1936)
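For reference, the standard CCA objective can be written as below. The notation is an assumption made for this note and does not come from the slides: x_i and y_i are the mean-centered feature vectors of the i-th parallel name pair in the two languages, N is the number of training pairs, and a and b are the projection directions being learnt.

```latex
\[
(a^{*}, b^{*}) \;=\; \arg\max_{a,\,b}\;
\frac{a^{\top} C_{xy}\, b}
     {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}},
\qquad
C_{xy} = \frac{1}{N}\sum_{i=1}^{N} x_i y_i^{\top},\quad
C_{xx} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top},\quad
C_{yy} = \frac{1}{N}\sum_{i=1}^{N} y_i y_i^{\top}
\]
```

Taking the top directions (50 in this talk) gives one projection per language, so that parallel names land close together in the shared space.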
![Page 16: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/16.jpg)
Multi-Word Names
קלי מליסה
Melissa Jane Kelly
0.97 0.91
Score = Maximum Weighted Matching / (m – n + 1)
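A small sketch of this score, assuming m is the word count of the longer name and n that of the shorter one (the slide does not define them). The matching is found by brute force over assignments, which is fine for names of a few words; only the weights 0.97 and 0.91 come from the slide, and the remaining matrix entries are invented.

```python
from itertools import permutations

def name_score(sim):
    """Score two multi-word names from a word-by-word similarity matrix.

    sim[i][j] is the similarity between word i of the shorter name and
    word j of the longer name (so len(sim) <= len(sim[0]) is assumed).
    """
    n = len(sim)       # words in the shorter name
    m = len(sim[0])    # words in the longer name
    # Maximum weighted bipartite matching by exhaustive search.
    best = max(
        sum(sim[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(m), n)
    )
    return best / (m - n + 1)

# Rows: the two Hebrew words; columns: Melissa, Jane, Kelly.
sim = [
    [0.10, 0.15, 0.91],
    [0.97, 0.20, 0.05],
]
print(name_score(sim))  # (0.97 + 0.91) / (3 - 2 + 1)
```

The denominator penalizes unmatched words, so a partial match scores lower than a full match with the same edge weights.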
![Page 17: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/17.jpg)
Experimental Setup
• Name Directory:
– English Wikipedia Titles
– 6 Million Titles, 2 Million Unique Words
• Query Languages:
– Russian, Hebrew, Kannada, Tamil, Hindi, Bengali
– 1000 multi-word names in each language
• Baseline:
– State-of-the-art Machine Transliteration (NEWS 2009)
![Page 18: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/18.jpg)
Experimental Results
[Figure: MRR scale from 0 (Very Bad) to 1 (Perfect), with Competitor and GEOM-SEARCH marked.]

Algorithm      Russian  Kannada  Tamil  Hindi
TRANS-SEARCH   0.47     0.52     0.29   0.49
GEOM-SEARCH    0.56     0.69     0.49   0.69
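MRR here is mean reciprocal rank: for each query, take 1 divided by the rank at which the correct name was returned, then average over queries. A minimal sketch follows, with invented ranks and a hypothetical function name.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over queries.

    ranks[i] is the 1-based rank of the correct name for query i,
    or None if the correct name was not retrieved (contributes 0).
    """
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# Invented ranks for four queries: found at rank 1, rank 2, missed, rank 4.
ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0 + 0.25) / 4
```

A value of 1 means the correct name was always ranked first, and a value near 0 means it was rarely retrieved near the top.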
![Page 19: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/19.jpg)
Conclusions
• Pros
– Data driven: Easy to include new languages.
– Not training data hungry: a few thousand parallel names suffice.
– Bridge languages are useful: feature space for (P,Q) can be learnt using only data in (P,R) and (Q,R) (Khapra and Udupa, AAAI 2010)
– Fast search: ~1 s for a 6 M names directory
– Applications:
• Cross-Language Wikipedia Search
• Spelling Correction of Personal Names
![Page 20: Cross-Language Name Search Raghavendra UdupaMicrosoft Research India Mitesh KhapraIIT Bombay NAACL-HLT 2010 June 3, 2010 Improving the Multilingual User](https://reader030.vdocuments.us/reader030/viewer/2022032800/56649d355503460f94a0d11b/html5/thumbnails/20.jpg)
Thank you!