Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang
Department of Information Management
Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using KATAKANA matching
Toshiba Corp. R&D Center
Outline
- Motivation
- Objective
- Introduction
- Background
- Method
- Simulations
- Discussion
- Conclusion
Motivation
Limitation of statistical approaches
Objective
Superiority of linguistic approaches
Introduction
- A tool for extracting bilingual knowledge from noisy-parallel English-Japanese text
  - Dynamic programming
  - Phonetic similarities
  - Partial matching of English-Japanese
- Extract a small reliable bilingual lexicon of anchor points
- Establish further bilingual correspondences
Introduction
- Types of bilingual knowledge acquisition from parallel corpora
  - Statistical: internal distributional evidence of bilingual word pairs
  - Linguistic: external evidence provided by bilingual lexicons to establish anchor points between pairs of bilingual phrases
Background
- The challenge of establishing a bilingual correspondence between English and Katakana
  - Information is lost going from English to Katakana: "r" and "l", or "b" and "v", are not reliably distinguished
  - Redundant vowel sounds appear going from Katakana to English: "fra" in "Frankfurt" is written フラ, which transliterates as "fura"
Background
- How previous research dealt with these problems
  - Transcribe both words into intermediate representations and match these.
  - The matching knowledge may be biased towards English pronunciation.
  - "Chirac" => "シラク": シ is pronounced "shi".
Background
- A neutral intermediate representation allows for partial matching
  - When the intermediate representations match above a certain threshold, the two words are taken to be in a translation relation.
  - "パレスチナ" matches "Palestine", "Palestinian", "Palestinians"
Method
- NPT (Nearest Phonetic Transliteration)
  - Takes each Katakana word and converts it to a phonetic string representing all English spelling combinations of the word.
  - "ブルンジ", which is "Burundi" in English
  - 'ル -> rloue'
  - "buorlouenmgesdjgiou"
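The construction can be sketched as a per-character table lookup followed by concatenation. In the minimal Python sketch below, only the entry for ル ("rloue") appears on the slide. The other three entries are assumptions, obtained by splitting the slide's example string around that entry, and the names `KANA_TO_SPELLINGS` and `to_npt` are hypothetical.

```python
# Hypothetical lookup table: each Katakana character maps to a string that
# packs together the English letters it may correspond to.
# Only the entry for ル is given in the source; the rest are inferred by
# splitting the example string "buorlouenmgesdjgiou" and are assumptions.
KANA_TO_SPELLINGS = {
    "ブ": "buo",
    "ル": "rloue",
    "ン": "nm",
    "ジ": "gesdjgiou",
}


def to_npt(katakana_word: str) -> str:
    """Concatenate the per-character spelling strings into one NPT string."""
    return "".join(KANA_TO_SPELLINGS[ch] for ch in katakana_word)


print(to_npt("ブルンジ"))  # buorlouenmgesdjgiou
```

A full table would need one entry per Katakana character (and per digraph); the four entries here cover only the Burundi example.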
Method – NPT_score
"Burundi", "buorlouenmgesdjgiou"
- npt: NPT string
- e: English string
- md: maximum depth
- d: depth count
- s: score
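The scoring procedure itself did not survive on this slide, so the following is only a simplified stand-in that reuses the legend's names (`npt`, `e`, `md`, `d`, `s`). It assumes a greedy left-to-right scan rather than the dynamic programming mentioned in the introduction: for each English letter it looks at most `md` characters ahead in the NPT string and counts a hit if the letter is found. The default `md=5` and the normalization by the length of `e` are assumptions.

```python
def npt_score(npt: str, e: str, md: int = 5) -> float:
    """Fraction of the letters of e found, in order, in npt.

    For each English letter the scan looks at most md characters ahead of
    the current position in npt (d is the look-ahead offset).
    This is an illustrative simplification, not the authors' algorithm.
    """
    pos = 0  # current position in the NPT string
    s = 0    # number of English letters matched so far
    for ch in e.lower():
        for d in range(md):
            if pos + d < len(npt) and npt[pos + d] == ch:
                s += 1
                pos += d + 1  # continue after the matched character
                break
    return s / len(e) if e else 0.0


print(npt_score("buorlouenmgesdjgiou", "Burundi"))        # 1.0
print(npt_score("buorlouenmgesdjgiou", "Burundi", md=3))  # 5 of 7 letters match
```

With `md=3` the letter "d" lies too far ahead of the current position, which shows how the maximum depth bounds the search.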
Method
- Save search time and detect substrings
- Several heuristics
  - First letter is in upper case for obtaining candidate proper nouns in the English text.
  - Limit the minimum length of Katakana words available for matching.
  - "クリスマス" (= "Christmas") and "Mass"
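The two heuristics can be sketched with regular expressions. This is a minimal illustration, not the authors' implementation: the function names, the sample sentences, and the minimum length of 3 are assumptions (the slide does not state the value used).

```python
import re

# Runs of characters from the Unicode Katakana block (U+30A0 to U+30FF).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")
# English words that start with an upper-case letter.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][A-Za-z]+\b")


def katakana_words(text: str, min_len: int = 3) -> list[str]:
    """Return Katakana runs of at least min_len characters (min_len is assumed)."""
    return [w for w in KATAKANA_RUN.findall(text) if len(w) >= min_len]


def candidate_proper_nouns(text: str) -> list[str]:
    """Return capitalized words as candidate proper nouns."""
    return CAPITALIZED_WORD.findall(text)


print(candidate_proper_nouns("President Chirac visited Burundi on Monday."))
print(katakana_words("シラク大統領はブルンジを訪問した。"))
```

The capitalization filter also keeps sentence-initial words and day names, so it only narrows the candidate set; the phonetic match still has to do the real filtering.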
Simulations
- Two corpora of English and Japanese headline newswire articles.
- The test corpus had 150 aligned articles
  - 1730 English paragraphs and 771 Japanese paragraphs
  - 871 Katakana words
  - 9742 potential English proper nouns
  - 65 comparisons for each Katakana word in each article.
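The slide does not show how the figure of 65 comparisons was obtained. One plausible reading, stated here as an assumption, is the average number of candidate English proper nouns per aligned article, which the corpus counts reproduce:

```python
english_candidates = 9742  # potential English proper nouns in the test corpus
articles = 150             # aligned article pairs

# Average number of English candidates each Katakana word is compared with
# inside a single article (assumed interpretation of the slide's figure).
per_article = english_candidates / articles
print(round(per_article))  # 65
```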
Simulations
- Baseline
  - Soundex algorithm
- K&H
  - Convert the Katakana and the English word to a simplified disjunctive phonetic form.
  - Does not allow either partial matches or matching of substrings.
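For readers unfamiliar with the baseline, here is a sketch of standard American Soundex for English words. How the experiment applied Soundex to Katakana (for example, via romanization) is not stated on the slide, so this covers only the English side, and the function name is illustrative.

```python
def soundex(word: str) -> str:
    """Standard American Soundex code: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes across vowels count
    return (result + "000")[:4]


print(soundex("Burundi"))  # B653
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because every word collapses to a four-character code, Soundex can only say that two words match or do not match, which is why it offers no notion of partial or substring matching.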
Results
F-measure: 81%, 58%, 39%
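The slide reports only F-measures, without the underlying precision and recall. As a reminder of what the figure combines, the sketch below computes the balanced F-measure (the harmonic mean of precision and recall); that the balanced variant was used is an assumption, and the input values are purely illustrative, not taken from the experiment.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only; these are not results from the slides.
print(f_measure(0.5, 1.0))
```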
Discussion
- NPT yielded the best result overall.
- Higher threshold and higher precision.
- K&H can't handle partial matches, and its intermediate form may lose information.
- Partial matching
  - Finding substrings
  - Identify cognatively connected translation pairs
  - "インドネシア" => "Indonesia", "Indonesian", "Indonesians", "Indonesias"
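Threshold-based partial matching can be illustrated with a toy example. Everything specific here is an assumption: the NPT-style string for インドネシア is invented for the illustration, `coverage` is a simplified greedy stand-in for the real scoring function, and the threshold of 0.8 is arbitrary.

```python
def coverage(npt: str, e: str, md: int = 5) -> float:
    """Fraction of the letters of e found, in order, within md-character
    look-ahead windows of npt (simplified stand-in for the NPT score)."""
    pos = matched = 0
    for ch in e.lower():
        for d in range(md):
            if pos + d < len(npt) and npt[pos + d] == ch:
                matched += 1
                pos += d + 1
                break
    return matched / len(e) if e else 0.0


NPT_INDONESIA = "iynmdoneshsia"  # hypothetical NPT-style string for インドネシア
THRESHOLD = 0.8                  # arbitrary cut-off chosen for this sketch

candidates = ["Indonesia", "Indonesian", "Indonesians", "Indonesias", "India"]
matches = [w for w in candidates if coverage(NPT_INDONESIA, w) >= THRESHOLD]
print(matches)  # ['Indonesia', 'Indonesian', 'Indonesians', 'Indonesias']
```

The inflected forms lose only their trailing letters and stay above the cut-off, while "India" covers just three of its five letters and is rejected. Raising the threshold would drop "Indonesians" first, which is the precision trade-off noted above.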
Conclusion
- Back-transliterating from Katakana to English is unexpectedly difficult.
- The set of matching rules is quite small and could be improved.
- Future research
  - Induce the rules automatically from a corpus of examples.