TRANSCRIPT
Machine Transliteration
T BHARGAVA REDDY
(Knowledge sharing)
What is Machine Transliteration?
It is the conversion of text from one script to another
Not every word in one language has an equivalent in another language
Such words are called Out-of-Vocabulary (OOV) words
Machine Transliteration would be a useful tool in machine translation when dealing with OOV words
Tirupati → తిరుపతి
Machine Transliteration Models
4 Machine Transliteration Models have been proposed so far:
1. Grapheme Based Transliteration Model (ψG)
2. Phoneme Based Transliteration Model (ψP)
3. Hybrid Transliteration Model (ψH)
4. Correspondence Based Transliteration Model (ψC)
Grapheme and Phoneme
Phoneme:
The smallest contrastive linguistic unit which may bring about a change of meaning. 'Kiss' and 'kill' are two contrasting words; the phonemes /s/ and /l/ are what make the difference.
Grapheme:
The smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages. A grapheme may or may not carry meaning by itself and may or may not correspond to a single phoneme.
Grapheme Based Transliteration Model (ψG)
The machine directly converts source-language graphemes to target-language graphemes (a toy sketch follows the list below)
This method requires no phonetic knowledge of the source and target languages
Four methods have been implemented for this approach:
1. Source Channel Model
2. Decision Tree model
3. Transliteration network
4. Joint source-channel model
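As a concrete illustration of the direct grapheme-to-grapheme idea, here is a minimal Python sketch. The fixed two-character chunking, the mapping table, and the Telugu targets are toy assumptions; a real system learns both the segmentation and the mapping from a parallel name corpus.

# Toy direct grapheme-to-grapheme mapping (hypothetical table).
GRAPHEME_MAP = {
    "ti": "తి",
    "ru": "రు",
    "pa": "ప",
    # a real system learns thousands of such pairs from data
}

def transliterate(word: str, chunk_size: int = 2) -> str:
    """Greedy fixed-size chunking; real models learn the segmentation."""
    out = []
    for i in range(0, len(word), chunk_size):
        chunk = word[i:i + chunk_size]
        out.append(GRAPHEME_MAP.get(chunk, chunk))  # fall back to the source chunk
    return "".join(out)

print(transliterate("tirupati"))  # -> తిరుపతి

Note that no pronunciation information is consulted anywhere, which is exactly the point of the grapheme-based model.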
Source Channel Model
First, the English word is segmented into chunks of English graphemes
Next, all possible target-language chunks corresponding to each English chunk are produced
Finally, the most probable sequence of target-language graphemes is identified
Advantage: It considers a chunk of graphemes representing a phonetic property of the source-language word
Disadvantage: Errors in the first step propagate to subsequent steps, making it difficult to produce the correct transliteration
Time complexity is also a major issue, since generating and scoring all candidate chunk sequences is expensive
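The three steps above can be sketched as noisy-channel scoring over candidate chunks. Every chunk, candidate set, and probability below is invented for illustration; a real system estimates the channel model and the target language model from a parallel corpus.

import math
from itertools import product

# Step 1 output: the English word segmented into grapheme chunks (fixed here).
chunks = ["jo", "hn"]

# Step 2: candidate target chunks with toy channel probabilities P(source|target).
candidates = {
    "jo": {"ジョ": 0.9, "ヨ": 0.1},
    "hn": {"ン": 0.7, "ンヌ": 0.3},
}

# Toy unigram target language model P(target chunk).
lm = {"ジョ": 0.4, "ヨ": 0.2, "ン": 0.3, "ンヌ": 0.1}

def best_sequence(chunks):
    """Step 3: enumerate chunk combinations; score = sum of log probabilities."""
    best, best_score = None, -math.inf
    for combo in product(*(candidates[c].items() for c in chunks)):
        score = sum(math.log(p) + math.log(lm[t]) for t, p in combo)
        if score > best_score:
            best, best_score = "".join(t for t, _ in combo), score
    return best

print(best_sequence(chunks))  # -> ジョン

Because the segmentation from step 1 is frozen before any scoring happens, a bad split can never be repaired later, which is exactly the error-propagation disadvantage noted above.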
Decision Tree Model
Decision trees that transform each source grapheme into target graphemes are learned and then directly applied to machine transliteration
Advantage: Considers a wide range of contextual information, say the left three and right three contexts
Disadvantage: Unlike the source channel model, it does not consider phonetic aspects
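A minimal sketch of the decision-tree idea, assuming scikit-learn is available: each source grapheme becomes one training example whose features are the three characters on its left and right. The aligned training pairs and the Telugu targets are hypothetical toys.

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

def context_features(word, i, window=3):
    """Features: the characters at offsets -3..+3 around position i ('_' = padding)."""
    padded = "_" * window + word + "_" * window
    j = i + window
    return {f"c{k}": padded[j + k] for k in range(-window, window + 1)}

# Toy aligned data: (word, position, target grapheme); '-' marks a consumed letter.
train = [("tip", 0, "తి"), ("tip", 1, "-"), ("tip", 2, "ప"),
         ("rip", 0, "రి"), ("rip", 1, "-"), ("rip", 2, "ప")]

vec = DictVectorizer()
X = vec.fit_transform(context_features(w, i) for w, i, _ in train)
y = [t for _, _, t in train]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(vec.transform([context_features("tip", 0)])))  # ['తి']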
Transliteration Network
The network consists of arcs and nodes
A node represents a chunk of source graphemes and its corresponding target graphemes
An arc represents a possible link between nodes and carries a weight showing the strength of that link
The method considers phonetic aspects in the formation of graphemes
Segmenting a chunk and identifying the most relevant sequence are done in a single step
This means the errors are not propagated from one step to the next
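A hedged sketch of the network idea on a small hypothetical lattice for the word 'john': segmentation and selection happen together in one best-path search, so no separate chunking step can inject errors.

import math

word = "john"
# Hypothetical arcs: (start position, end position, target chunk, log-prob weight).
arcs = [
    (0, 2, "ジョ", math.log(0.8)),
    (0, 1, "ジ", math.log(0.1)),
    (2, 4, "ン", math.log(0.7)),
    (1, 4, "ョン", math.log(0.05)),
]

def best_path(n):
    """Viterbi-style dynamic program over character positions 0..n."""
    best = {0: (0.0, "")}
    for pos in range(1, n + 1):
        cands = [(best[s][0] + w, best[s][1] + t)
                 for s, e, t, w in arcs if e == pos and s in best]
        if cands:
            best[pos] = max(cands)
    return best[n][1]

print(best_path(len(word)))  # -> ジョン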
Phoneme-based Transliteration Model
This model is basically a source grapheme – source phoneme transformation followed by a source phoneme – target grapheme transformation
This model was first proposed by Knight and Graehl in 1997
They used Weighted Finite State Transducers (WFSTs)
They modelled it for English – Japanese and Japanese – English Transliteration
Similar methods have since appeared for Arabic-English and English-Chinese transliteration
Knight and Graehl’s Work
In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)
Katakana words are words imported into Japanese from other languages (primarily English)
Japanese raises a number of issues where pronunciation is concerned
In Japanese, the sounds L and R are pronounced the same
The same goes for H and F
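Because several English sounds collapse to one Japanese sound, the Japanese-to-English direction is one-to-many. Here is a small sketch of that ambiguity, using a toy collapse map rather than a real phonology table.

from itertools import product

COLLAPSE = {"l": "r", "r": "r", "h": "h", "f": "h"}  # toy many-to-one forward map

def english_candidates(japanese_consonants):
    """Invert the map: each Japanese 'r' could be English l or r, and so on."""
    inverse = {}
    for eng, jap in COLLAPSE.items():
        inverse.setdefault(jap, []).append(eng)
    return ["".join(c) for c in product(*(inverse[j] for j in japanese_consonants))]

# The ambiguous consonants of 'golf' come back as 'r' and 'h' in Japanese;
# inverting just that pair already yields four candidate spellings.
print(english_candidates("rh"))  # ['lh', 'lf', 'rh', 'rf']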
Katakana Words
Golf bag is pronounced go-ru-hu-ba-ggu: ゴルフバッグ
Johnson is pronounced jyo-n-so-n: ジョンソン
Ice cream is pronounced a-i-su-ku-ri-i-mu: アイスクリーム
What have we observed in these transliterations?
We can say that there has been a lot of information loss in the process of conversion from English to Japanese
So when we do back-transliteration we may run into trouble
Trouble in Back-Transliteration
There are several acceptable ways of writing the word 'switch' under Japanese writing rules
But when converting from Japanese back to English we must be very strict: no word other than 'switch' is acceptable
Back-transliteration is harder than romanization. Romanizing Angela (アンジェラ) would give us 'anjera', which is nowhere near an acceptable English spelling
Words are often compressed: 'word processing' is transliterated as 'waapuro', which is not at all easy to back-transliterate
The steps to convert from English to Katakana
We do the following steps for the conversion (a toy sketch of this cascade follows the list):
1. P(w) – generates written English word sequences
2. P(e|w) – pronounces English word sequences
3. P(j|e) – converts English sounds into Japanese sounds
4. P(k|j) – converts Japanese sounds to katakana writing
5. P(o|k) – introduces misspellings caused by Optical Character Recognition (OCR)
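The five models form a generative cascade. Here is a minimal sketch with stub functions standing in for the probabilistic models; the outputs are hard-coded toy values, since the real stages are weighted transducers learned from data.

def generate_word():        return "golf bag"               # stands in for P(w)
def pronounce(w):           return "G AO L F B AE G"        # stands in for P(e|w)
def to_japanese_sounds(e):  return "g o r u h u b a g g u"  # stands in for P(j|e)
def to_katakana(j):         return "ゴルフバッグ"             # stands in for P(k|j)
def ocr_noise(k):           return k                        # stands in for P(o|k); identity here

w = generate_word()
print(ocr_noise(to_katakana(to_japanese_sounds(pronounce(w)))))  # -> ゴルフバッグ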
Fixing Back-Transliteration
We apply the five steps above to convert an English word to Japanese
Knight and Graehl used Bayes' theorem to do the reverse
For a given katakana observation 'o', they maximized P(w) · P(e|w) · P(j|e) · P(k|j) · P(o|k) over English word sequences w, summing over the intermediate sequences e, j, and k
P(w) was implemented as a WFSA (Weighted Finite State Acceptor), with weights and symbols on the transitions making some output sequences more likely than others
The other probabilities are implemented using WFST (Weighted Finite State Transducer) with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty symbol ϵ
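A toy sketch of the Bayes decoding objective on the earlier 'switch' example: the channel score favours faithful romanizations, but the prior P(w) vetoes non-words. Both probability columns are invented for illustration, with the channel models lumped into a single P(o|w).

import math

observed = "スイッチ"  # katakana for 'switch'

candidates = {
    # w: (P(w) from an English language model, lumped channel score P(o|w))
    "switch":  (1e-4, 0.90),
    "suitchi": (1e-9, 0.95),  # faithful romanization, but implausible English
}

best = max(candidates, key=lambda w: math.log(candidates[w][0])
                                     + math.log(candidates[w][1]))
print(best)  # -> switch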
Algorithms for extracting the best transliteration
Assume that we have a WFSA that contains n states and m arcs
Two algorithms were used (a toy Dijkstra sketch follows this list):
1. Finding the best transliteration: Dijkstra's shortest-path algorithm (1959). Time complexity: O(m + n log n)
2. Finding the best k transliterations: Eppstein's k-shortest-paths algorithm (1994). Time complexity: O(m + n log n + k)
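A minimal sketch of step 1, assuming arc costs on a small hypothetical WFSA are negative log probabilities, so that the shortest path is the most probable transliteration.

import heapq

graph = {  # state -> list of (next state, output symbol, cost = -log prob)
    0: [(1, "ma", 0.2), (1, "maa", 1.6)],
    1: [(2, "su", 0.1)],
    2: [(3, "taazu", 0.3), (3, "tarzu", 2.0)],
    3: [],
}

def best_path(start, goal):
    """Dijkstra: repeatedly pop the cheapest frontier state until the goal appears."""
    heap = [(0.0, start, "")]
    seen = set()
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state == goal:
            return out, cost
        if state in seen:
            continue
        seen.add(state)
        for nxt, sym, c in graph[state]:
            heapq.heappush(heap, (cost + c, nxt, out + sym))
    return None

print(best_path(0, 3))  # -> ('masutaazu', ~0.6)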
Example for Back Transliteration
Consider the Japanese word マスターズトーナメント. This is first read by an OCR, which may introduce errors
Now this is passed through the inverted P(k|j) model, which converts it to:
m a s u t a a z u t o o ch i m e n t o
Now comes the P(j|e) model. It converts the text to:
M AE S T AE AR DH UH T AO CH IH M EH N T AO
Now comes the P(e|w) model, which produces:
masters tone am ent awe
Finally, applying the P(w) model, the word becomes: Masters Tournament
BTP Work
I will be working under PhD student Arjun Atre for the project
We will try to develop machine transliteration tools for Indian languages
I will try to develop a bridging language that can be used to transliterate text from one Indian language to another
This would contribute a lot to the NLP community and would be a leading step toward handling OOV words, which are plentiful in our native languages
THANK YOU
References
Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara (2006). A Comparison of Different Machine Transliteration Models.
Kevin Knight and Jonathan Graehl (1997). Machine Transliteration. (The phoneme-based transliteration model.)
https://en.wikipedia.org/wiki/Transliteration