Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs from Bilingual Corpora. Advisor: Dr. Hsu. Graduate: Chun Kai Chen. Author: Keita Tsuji. ICCPOL (2001) 245-250.
Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs from Bilingual Corpora
Advisor : Dr. Hsu
Graduate : Chun Kai Chen
Author : Keita Tsuji
ICCPOL (2001) 245-250

Outline
Motivation
Objective
Introduction
Extraction of Translational Word Pairs
Experimental Results
Conclusions
Personal Opinion

Motivation
Bilingual lexicons are in demand in many fields
─ cross-language information retrieval
─ machine translation
Bilingual corpora
─ have become widely available, reflecting changes in publishing activities

Objective
Propose a method
─ automatically extract translational Japanese-KATAKANA and English word pairs from bilingual corpora based on transliteration rules
─ e.g. <グラフ, graph>

Introduction(1/2)
Many of the methods proposed
─ depend heavily on word frequency in the corpora
─ cannot treat low-frequency words properly
Low-frequency words
─ include newly coined words, which are especially in demand in many fields

Introduction(2/2)
Against this background
─ we propose a method to extract translational KATAKANA-English word pairs from bilingual corpora
The method
─ applies all the existing transliteration rules to each mora unit in a KATAKANA word
─ extracts, as the translation, an English word that matches or partially matches one of these transliteration candidates

Extraction of Translational Word Pairs
[Figure: flow of the extraction procedure. Pick up J → decompose J → generate candidates Ti(J) using the transliteration rules → pick up E → identify LCS → P(J, E) = max_i Dice(). Example: グラフ is decomposed into ‘グ’, ‘ラ’, ‘フ’; the candidates include graf and graph; P(グラフ, graph) = 1 and P(グラフ, library) = 0.36. Callouts: Construction of Transliteration Rule; Measure for Matching (e.g. S(‘guraf’, ‘graph’) = ‘gra’).]

Construction of Transliteration Rule(1/2)
1. Decompose the KATAKANA word into units
─ ‘ディスパッチャー’
─ ‘ディ’, ‘ス’, ‘パッ’ and ‘チャー’
2. Extract the transliteration rule for each unit manually
─ ‘ディスパッチャー’ and ‘dispatcher’
─ ‘ディ’ = ‘di’
─ ‘ス’ = ‘s’
─ ‘パッ’ = ‘pat’
─ ‘チャー’ = ‘cher’

Construction of Transliteration Rule(2/2)
3. Repeat (1) and (2) for all the word pairs in the list
─ count the frequency of each rule
─ rank them for each KATAKANA unit
4. Add Hepburn transliteration rules to the above rules
─ if the source list is large enough, this step might not be necessary
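Steps 3 and 4 amount to counting and ranking. A minimal sketch, assuming the manual alignments are already available as (unit, string) pairs; the second alignment (グラフ / graph) and the name `build_rules` are illustrative assumptions, not taken from the paper:

```python
from collections import Counter, defaultdict

# Manually aligned (KATAKANA unit, English string) pairs, one list per word pair.
# The first alignment is the one shown on the slide; the second is an
# illustrative alignment assumed for this sketch.
aligned_pairs = [
    [("ディ", "di"), ("ス", "s"), ("パッ", "pat"), ("チャー", "cher")],
    [("グ", "g"), ("ラ", "ra"), ("フ", "ph")],
]

def build_rules(pairs):
    """Count each unit -> string rule and rank the rules of each unit by frequency."""
    counts = defaultdict(Counter)
    for word in pairs:
        for unit, english in word:
            counts[unit][english] += 1
    # most_common() returns (string, frequency) tuples in descending frequency
    return {unit: counter.most_common() for unit, counter in counts.items()}

rules = build_rules(aligned_pairs)
print(rules["パッ"])  # [('pat', 1)]
```

Hepburn rules could then be appended to each unit's ranked list where they are missing.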

Assumptions
When Japanese borrows an English word
─ the word is transliterated on the basis of Japanese mora units
─ the correspondences are stable
These transliterations are free from context
─ they have no relation to the preceding or following mora units
The number of transliteration rules is small
The rules do not vary drastically over time or from domain to domain

Extraction of Translational Word Pairs
Pick up J and decompose it into units
─ according to the same framework used for TR construction
Using all the transliteration rules in TR
─ generate all the possible transliteration candidates for J
─ henceforth, the i-th transliteration candidate of J is written Ti(J)
Pick up E which co-occurs with J
─ i.e. occurs in the same aligned segment in the corpus
─ identify the longest common subsequence with each Ti(J)
If the following P(J, E) exceeds a certain threshold
─ extract the pair J and E as a translation
P(J, E) = max_i Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))
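Candidate generation can be read as a Cartesian product over the rules of each unit. A minimal sketch, assuming J is already decomposed into units; the rule table `TR` below is hypothetical and far smaller than a real one:

```python
from itertools import product

# Hypothetical rule table for illustration only; a real TR comes from the
# rule-construction step and holds many more rules per unit.
TR = {
    "グ": ["g", "gu"],
    "ラ": ["ra", "la"],
    "フ": ["f", "ph", "fu"],
}

def candidates(units, rules):
    """Return every string formed by choosing one rule per unit, in order."""
    return ["".join(parts) for parts in product(*(rules[u] for u in units))]

cands = candidates(["グ", "ラ", "フ"], TR)
print(len(cands))        # 12
print("graph" in cands)  # True
```

The candidate count is the product of the per-unit rule counts, which is why long words become expensive.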

Dice Measure
J: KATAKANA word in the corpus
E: English word in the corpus
L(w): number of characters of word w
T(J): transliteration candidate of word J
S(w1, w2): longest common subsequence of w1 and w2
─ (e.g. S(‘guraf’, ‘graph’) = ‘gra’)
Dice(k, m, n) = k*2/(m+n)
Example with Ti(J) = graf and E = graph:
Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))
= Dice(L(S(graf, graph)), L(graf), L(graph))
= Dice(3, 4, 5) = 3*2/(4+5) = 0.67
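The definitions above can be checked with a short sketch. It assumes a standard dynamic-programming longest common subsequence; the function names are chosen here for illustration:

```python
def lcs(a, b):
    """Longest common subsequence of a and b (standard dynamic programming)."""
    # table[i][j] holds one LCS of a[:i] and b[:j]
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + ca
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[-1][-1]

def dice(k, m, n):
    """Dice(k, m, n) = k*2/(m+n), as defined on the slide."""
    return 2 * k / (m + n)

s = lcs("graf", "graph")
print(s)                                                  # gra
print(round(dice(len(s), len("graf"), len("graph")), 2))  # 0.67
```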

Example
P(グラフ, graph)
= max(Dice(L(S(graf, graph)), L(graf), L(graph)),
      Dice(L(S(graph, graph)), L(graph), L(graph)),
      Dice(L(S(graff, graph)), L(graff), L(graph)),
      …,
      Dice(L(S(gulerfe, graph)), L(gulerfe), L(graph)))
= max(0.67, 1.00, 0.60, …, 0.33, 0.33)
= 1.00
P(グラフ, library) = max(0.36, 0.33, ..., 0.29, 0.29) = 0.36
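The example can be reproduced with a sketch restricted to the four candidates named on the slide (the elided ones are not reconstructed). `lcs_len` and `p_score` are names assumed here:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def p_score(cands, e):
    """P(J, E): the best Dice score over the transliteration candidates of J."""
    return max(2 * lcs_len(t, e) / (len(t) + len(e)) for t in cands)

cands = ["graf", "graph", "graff", "gulerfe"]  # only the candidates named on the slide
print(round(p_score(cands, "graph"), 2))    # 1.0
print(round(p_score(cands, "library"), 2))  # 0.36
```

A pair is extracted when this score exceeds the chosen threshold; the slides do not state the threshold value.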

Device for Time-saving(1/2)
The procedure requires much computational time when applied to actual data
─ using all rules in TR often leads to a combinatorial explosion in the number of transliteration candidates
─ identifying the longest common subsequence often requires much time

Device for Time-saving(2/2)
Apply fewer transliteration rules to longer KATAKANA words
─ at TR construction, the transliteration rules were ranked according to their frequencies
─ apply the top 12/(the number of units in J)+1 rules to each unit of J
Example: a word J with three units that have 3, 6 and 4 rules
─ all rules: 3*6*4 candidates
─ top 12/3+1 = 5 rules per unit: 3*5*4 candidates
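The effect of the cut-off on the candidate count can be sketched as follows. Integer division is an assumption of this sketch, since the slide only shows a case where 12 divides evenly:

```python
from math import prod

def rules_per_unit(num_units):
    """Top-ranked rules applied to each unit: 12/(number of units) + 1."""
    return 12 // num_units + 1  # integer division is assumed here

def candidate_count(rule_counts):
    """Number of candidates after capping each unit at the top-ranked rules."""
    limit = rules_per_unit(len(rule_counts))
    return prod(min(count, limit) for count in rule_counts)

print(rules_per_unit(3))           # 5
print(prod([3, 6, 4]))             # 72 candidates with all rules
print(candidate_count([3, 6, 4]))  # 60 candidates with the top 5 rules per unit
```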

Other Alternatives(1/2)
Transliteration Rule (TR)
─ comparing the effectiveness of TR and HR (Hepburn transliteration rule)
─ HR is a well-known rule for transliterating Japanese into alphabet strings and is easily available
─ HR gives each KATAKANA unit a unique alphabet string
TR usually generates many T(J); HR generates only one T(J)
If HR alone can produce good results, we can save computational time

Other Alternatives(2/2)
Measure for Matching
─ Our method uses the combination of Dice and NPT_score
─ Mdic was used in place of Dice:
Mdic(k, m, n) = (1 + log k)*k*2/(m+n)
─ Bgrm was used in place of the combination of Dice and NPT_score:
Bgrm(T(J), E) = |NT(J) ∩ NE| / |NT(J) ∪ NE|
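Both alternative measures are easy to sketch, under assumptions the slide does not confirm: the logarithm is taken as natural, Mdic is set to 0 when nothing matches, and NT(J) and NE are read as the sets of character bigrams of T(J) and E (suggested only by the name Bgrm):

```python
import math

def mdic(k, m, n):
    """Mdic(k, m, n) = (1 + log k)*k*2/(m+n); natural log is assumed."""
    if k == 0:
        return 0.0  # log 0 is undefined; treating "no match" as 0 is an assumption
    return (1 + math.log(k)) * 2 * k / (m + n)

def bigrams(word):
    """Set of adjacent character pairs in word (assumed meaning of N)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bgrm(t, e):
    """Size of the bigram intersection divided by the size of the bigram union."""
    nt, ne = bigrams(t), bigrams(e)
    union = nt | ne
    return len(nt & ne) / len(union) if union else 0.0

print(bgrm("graf", "graph"))  # 0.4
```

With k = 1 the logarithm vanishes and Mdic equals Dice; for larger k, Mdic rewards longer matches more than Dice does.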

Experiment Data(1/2)
Source Data for TR
─ rules extracted from 3,742 transliterational term pairs in a dictionary of artificial intelligence
─ Hepburn transliteration rules were added to them
Bilingual Corpora
─ eight bilingual corpora: Japanese-English parallel abstracts and titles of academic papers
─ the domains of these corpora are artificial intelligence (AI), forestry (FR), information processing (IP) and architecture (AC)

Experiment Data(2/2)
The purpose of using abstract and title corpora
─ examine the influence of the size and degree of parallelism of each segment on the extraction results
─ abstracts are larger and noisier than titles
The purpose of using four domains
─ examine the influence of domain difference on the results
─ the results for the other three domains do not differ significantly from those for artificial intelligence
─ we need not be overly concerned about the domain of the source data for TR construction

Results(1/3)
TR is much more effective than simple HR (Hepburn transliteration rules)
─ TR_Dice achieved 93% precision at 75% recall
─ HR_Dice precision at 75% recall remained at 4%
Dice is more effective than Mdic
─ there are many word pairs which are long and morphologically related but not translational
─ Mdic scores for these word pairs tend to be higher than those of short translational pairs, which decreases precision

Results(2/3)
The combination of Dice and NPT_score is more effective than Bgrm
Our method is more effective than using the combination of NPT and Match
─ NPT in [1] tends to generate strings which contain incorrect translations: ‘グラフ’ is transliterated into ‘ghurlaoffphu’, which contains not only the correct translation ‘graph’ but also the incorrect one ‘hop’
─ Match depends only on the length of English words and their matched parts
─ therefore, it tends to rate short English words highly (e.g. Match(‘ghurlaoffphu’, ‘hop’) = 3/3 = 1)
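The weakness of Match can be seen next to Dice in a short sketch. Match is read here as (length of the matched part) / (length of the English word), with the matched part taken to be the longest common subsequence; both readings are inferred from the 3/3 example and are assumptions:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, keeping one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def match(t, e):
    """Assumed reading of Match: matched length over English word length."""
    return lcs_len(t, e) / len(e)

def dice_score(t, e):
    """Dice over the matched length and the lengths of both strings."""
    return 2 * lcs_len(t, e) / (len(t) + len(e))

t = "ghurlaoffphu"
print(match(t, "hop"), match(t, "graph"))      # 1.0 1.0
print(dice_score(t, "hop"))                    # 0.4
print(round(dice_score(t, "graph"), 2))        # 0.59
```

Under these assumptions Match cannot separate ‘hop’ from ‘graph’, while Dice penalizes the short word because the length of the candidate enters the denominator.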

Results(3/3)
TR constructed from artificial intelligence term pairs also performed well on the other three domains
─ this indicates the domain-independence of TR
─ we do not have to construct TR for each domain
Our method achieved 96-100% precision at 75% recall on the title corpora
Our method also performed well on the abstract corpora (83-93% precision at 75% recall)

Error Analysis(1/2)
Most of the translational word pairs not extracted were those containing transliteration units which were not listed in TR
─ for instance, ‘アーキテクチャ’ and ‘architecture’ were not extracted because ‘キ’ = ‘chi’ and ‘チャ’ = ‘ture’ were not listed in TR
─ we can solve this problem by enriching TR

Error Analysis(2/2)
A few of the translational word pairs which were not extracted were acronyms and their original forms
─ by nature, these cannot be extracted based on transliteration
Many of the non-translational word pairs wrongly extracted were morphologically related pairs
─ for instance, ‘プログラマ’ (programmer) and ‘program’, ‘クラス’ (class) and ‘subclass’
─ emphasizing matches on the first and last units might be effective

Conclusion
We proposed a method
─ for extracting translational KATAKANA-English word pairs from bilingual corpora
The experiment
─ shows that our method is highly effective
By enriching TR and introducing some heuristics
─ performance is expected to improve further

Personal Opinion
Advantage
─ Extraction of Translational Word Pairs
─ Error Analysis
Disadvantage
─ Construction of Transliteration Rule
─ Assumptions and limitations