transliteration transliteration cs 626 course seminar by purva joshi 08305907 mugdha bapat 07305916...
TRANSCRIPT
TransliterationTransliteration
CS 626 CS 626 course seminar bycourse seminar by
Purva Joshi 08305907 Mugdha Bapat 07305916
Aditya Joshi 08305908 Manasi Bapat 08305906
Humans transliterate frequently for different reasons
Can a machine do this?
(Why would a machine have to do this?)
If yes, how?Picture courtesy: Snapshot of Yahoo! Messenger
Motivation
• An important component of machine translation
• When you cannot translate, transliterate• Generally used for named entities, technical
terms and out of vocabulary words (OOV)• Issues specific to sounds, scripts and
accents• Can a machine do this? If yes, how?
• Task of converting a word from one alphabetic script to another
Used for:
• Named entities
• : Gandhiji
• Out of vocabulary words
• : Bank
What is transliteration?
• Accents
: Thoda or thora?
• Mapping of sounds
• Mahaan: Kahaan:
• Back-transliteration
Linguistic issues
Arabic Chinese Hindi/ Japanese
• Arabic b -> English p or bEnglish word: Paul transliterates toArabic word: Baul (issue in Back-transliteration)
• Origin of the proper noun determines the symbol in Chinese language
• Ideographic symbols in Chinese
• Several English symbols do not mapto any Japanese symbols. So, oftenmapped to closest sounding symbolice cream aisukuriimu
Linguistic Issues : Mapping of sounds
• Symbols map to different symbolsbased on their position
America
• Difference in originRestaurantconstant
xOverview
Source String
TransliterationUnits
Target String
TransliterationUnits
Contents
Source String
TransliterationUnits
Target String
TransliterationUnits
Phoneme- based
Phoneme-based approach
Word inSource language
Pronunciationin
Source language
Word inTarget language
PronunciationIn
target language
P( ps | ws)
P ( pt | ps )
P ( wt | pt )
Note: Phoneme is the smallest linguistically distinctive unit of sound.
P(wt)
Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )
Phoneme-based approach
Step I :
Consider each character of the word
Transliterating ‘BAPAT’B A P A T
P /ə//ə/ /a://a://ə//ə/ /a://a:/B T
Source word
to phonemes
P /ə//ə/ /a://a://ə//ə/ /a://a:/BT
Source phonemes
to target phonemes
t
t
Step II : Converting to phoneme seq.Step III : Converting to target phoneme seq.
Phoneme-based approach
Step IV : Phoneme sequence to target stringB : /ə/ :/ə/ :
/a:/ :/a:/ :
P:/ə/ :/ə/ :
/a:/ :/a:/ :
T:
t:
Output :
Concerns
Word inSource language
Pronunciationin
Source language
Word inTarget language
PronunciationIn
target language
Check if the world is validIn target language
Check if environment Is noise-free
• Unknown pronunciations
• Back-transliteration can be a problem Johnson Jonson
Issues in phonetic model
sanhita
samhita
Contents
Source String
TransliterationUnits
Target String
TransliterationUnits
Phoneme- based
Spelling-based
• Maps source word sequences to target word sequences (i.e. direct word to word)
• The transliteration score:
• P(w)
Spelling-based model
Letter trigram model included
Thus, we can accommodate the words not included in the dictionary
Pronunciationin
Source language
PronunciationIn
target language
Word inSource language
Word inTarget language
Comparison of the two methods
Contents
Source String
TransliterationUnits
Target String
TransliterationUnits
Phoneme- based
Spelling-based
Joint Source
Channel
• Particularly developed for Chinese
• Chinese : Highly ideographic
• Example :
• Two main steps:
The Third Method - Why?
Image courtesy: wikimedia-commons
Modeling Decoding
Modeling Step
• A bilingual dictionary in the source and target language
• From this dictionary, the character mapping between the source and target language is learnt
The word “Geo” has two possible mappings, the “context” in which it occurs is important
John
Georgia
Geology
Geo
Geo
Modeling step
Modeling step …
• N-gram Mapping :
• < Geo, > < rge, >
• < Geo, > < lo, >
• This concludes the modeling step
Modeling step …
Decoding Step
• Consider the transliteration of the word “George”.
• Alignments of George:
• Geo rge G eo rge
• Geo rge G eo rge
Decoding step
Decision to be made between….
• The context mapping is present in the map-dictionary
• Using ……
Decoding step …
• Where do the n-gram statistics come from?
Ans.: Automatic analysis of the bilingual dictionary
• How to align this dictionary?
Ans. : Using EM-algorithm
Transliteration Alignment
EM Algorithm
Bootstrap
Expectation
Maximization
TransliterationUnits
Bootstrap initial random alignment
Update n-gram statistics to estimate probability distribution
Apply the n-gram TM to obtain new alignment
Derive a list of transliteration units from final alignment
Evaluation
E2C Error rates for n-gram tests E2C v/s C2E for TM Tests
Conclusion
• Transliteration can make use of phonemes as an intermediate layer to move from a script to another
• Spelling-based approach connects the word sequences of the two languages
• The joint source channel method integrates optimization of alignment and transliteration
• no pre-alignment needed• reduction in development efforts
( the end )
References
• For all Devnagari transliterations, www.quillpad.in/hindi/
H. Li,M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.
www.wikipedia.org
Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.
K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.
• Joint source-channel model
• Phoneme and spelling-based models