TRANSCRIPT
Machine Transliteration
T BHARGAVA REDDY
(Knowledge sharing)
What is Machine Transliteration?
It is the conversion of text from one script to another
Not every word in one language has an equivalent in another language
Such words are called Out-of-Vocabulary (OOV) words
Machine Transliteration would be a useful tool in machine translation when dealing with OOV words
Tirupati → తిరుపతి
Machine Transliteration Models
4 Machine Transliteration Models have been proposed so far:
1. Grapheme Based Transliteration Model (ψG)
2. Phoneme Based Transliteration Model (ψP)
3. Hybrid Transliteration Model (ψH)
4. Correspondence Based Transliteration Model (ψC)
Grapheme and Phoneme
Phoneme:
The smallest contrastive linguistic unit which may bring about a change of meaning. 'Kiss' and 'kill' are two contrasting words; the phonemes /s/ and /l/ are what make the difference.
Grapheme:
The smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages. A grapheme may or may not carry meaning by itself and may or may not correspond to a single phoneme.
Grapheme Based Transliteration Model (ψG)
The machine directly converts source-language graphemes to target-language graphemes (a toy sketch follows the list below)
This method requires no phonetic knowledge of the source and target languages
Four methods have been implemented for this approach:
1. Source Channel Model
2. Decision Tree model
3. Transliteration network
4. Joint source-channel model
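As a concrete illustration of the direct grapheme-to-grapheme idea, here is a minimal Python sketch. The fixed two-character chunking, the mapping table, and the Telugu targets are toy assumptions; a real system learns both the segmentation and the mapping from a parallel name corpus.

# Toy direct grapheme-to-grapheme mapping (hypothetical table).
GRAPHEME_MAP = {
    "ti": "తి",
    "ru": "రు",
    "pa": "ప",
    # a real system learns thousands of such pairs from data
}

def transliterate(word: str, chunk_size: int = 2) -> str:
    """Greedy fixed-size chunking; real models learn the segmentation."""
    out = []
    for i in range(0, len(word), chunk_size):
        chunk = word[i:i + chunk_size]
        out.append(GRAPHEME_MAP.get(chunk, chunk))  # fall back to the source chunk
    return "".join(out)

print(transliterate("tirupati"))  # -> తిరుపతి

Note that no pronunciation information is consulted anywhere, which is exactly the point of the grapheme-based model.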
Source Channel Model
First, the English word is segmented into chunks of English graphemes
Next, all possible target-language chunks corresponding to each English chunk are produced
Finally, the most probable sequence of target-language graphemes is identified
Advantage: It considers a chunk of graphemes representing a phonetic property of the source-language word
Disadvantage: Errors in the first step propagate to subsequent steps, making it difficult to produce the correct transliteration
Time complexity is also a major issue, since generating and scoring all candidate chunk sequences is expensive
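The three steps above can be sketched as noisy-channel scoring over candidate chunks. Every chunk, candidate set, and probability below is invented for illustration; a real system estimates the channel model and the target language model from a parallel corpus.

import math
from itertools import product

# Step 1 output: the English word segmented into grapheme chunks (fixed here).
chunks = ["jo", "hn"]

# Step 2: candidate target chunks with toy channel probabilities P(source|target).
candidates = {
    "jo": {"ジョ": 0.9, "ヨ": 0.1},
    "hn": {"ン": 0.7, "ンヌ": 0.3},
}

# Toy unigram target language model P(target chunk).
lm = {"ジョ": 0.4, "ヨ": 0.2, "ン": 0.3, "ンヌ": 0.1}

def best_sequence(chunks):
    """Step 3: enumerate chunk combinations; score = sum of log probabilities."""
    best, best_score = None, -math.inf
    for combo in product(*(candidates[c].items() for c in chunks)):
        score = sum(math.log(p) + math.log(lm[t]) for t, p in combo)
        if score > best_score:
            best, best_score = "".join(t for t, _ in combo), score
    return best

print(best_sequence(chunks))  # -> ジョン

Because the segmentation from step 1 is frozen before any scoring happens, a bad split can never be repaired later, which is exactly the error-propagation disadvantage noted above.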
Decision Tree Model
Decision trees that transform each source grapheme into target graphemes are learned and then directly applied to machine transliteration
Advantage: Considers a wide range of contextual information, say the left three and right three contexts
Disadvantage: Unlike the source channel model, it does not consider phonetic aspects
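A minimal sketch of the decision-tree idea, assuming scikit-learn is available: each source grapheme becomes one training example whose features are the three characters on its left and right. The aligned training pairs and the Telugu targets are hypothetical toys.

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

def context_features(word, i, window=3):
    """Features: the characters at offsets -3..+3 around position i ('_' = padding)."""
    padded = "_" * window + word + "_" * window
    j = i + window
    return {f"c{k}": padded[j + k] for k in range(-window, window + 1)}

# Toy aligned data: (word, position, target grapheme); '-' marks a consumed letter.
train = [("tip", 0, "తి"), ("tip", 1, "-"), ("tip", 2, "ప"),
         ("rip", 0, "రి"), ("rip", 1, "-"), ("rip", 2, "ప")]

vec = DictVectorizer()
X = vec.fit_transform(context_features(w, i) for w, i, _ in train)
y = [t for _, _, t in train]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(vec.transform([context_features("tip", 0)])))  # ['తి']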
Transliteration Network
The network consists of arcs and nodes
A node represents a chunk of source graphemes and its corresponding target graphemes
An arc represents a possible link between nodes and carries a weight showing the strength of that link
The method considers phonetic aspects in the formation of graphemes
Segmenting a chunk and identifying the most relevant sequence are done in a single step
This means the errors are not propagated from one step to the next
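A hedged sketch of the network idea on a small hypothetical lattice for the word 'john': segmentation and selection happen together in one best-path search, so no separate chunking step can inject errors.

import math

word = "john"
# Hypothetical arcs: (start position, end position, target chunk, log-prob weight).
arcs = [
    (0, 2, "ジョ", math.log(0.8)),
    (0, 1, "ジ", math.log(0.1)),
    (2, 4, "ン", math.log(0.7)),
    (1, 4, "ョン", math.log(0.05)),
]

def best_path(n):
    """Viterbi-style dynamic program over character positions 0..n."""
    best = {0: (0.0, "")}
    for pos in range(1, n + 1):
        cands = [(best[s][0] + w, best[s][1] + t)
                 for s, e, t, w in arcs if e == pos and s in best]
        if cands:
            best[pos] = max(cands)
    return best[n][1]

print(best_path(len(word)))  # -> ジョン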
Phoneme-based Transliteration Model
This model is basically a source grapheme – source phoneme transformation followed by a source phoneme – target grapheme transformation
This model was first proposed by Knight and Graehl in 1997
They used Weighted Finite State Transducers (WFSTs)
They modelled it for English – Japanese and Japanese – English Transliteration
Similar methods have since appeared for Arabic-English and English-Chinese transliteration
Knight and Graehl’s Work
In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)
Katakana words are words imported into Japanese from other languages (primarily English)
Japanese raises a number of issues where pronunciation is concerned
In Japanese, the sounds L and R are pronounced the same
The same goes for H and F
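Because several English sounds collapse to one Japanese sound, the Japanese-to-English direction is one-to-many. Here is a small sketch of that ambiguity, using a toy collapse map rather than a real phonology table.

from itertools import product

COLLAPSE = {"l": "r", "r": "r", "h": "h", "f": "h"}  # toy many-to-one forward map

def english_candidates(japanese_consonants):
    """Invert the map: each Japanese 'r' could be English l or r, and so on."""
    inverse = {}
    for eng, jap in COLLAPSE.items():
        inverse.setdefault(jap, []).append(eng)
    return ["".join(c) for c in product(*(inverse[j] for j in japanese_consonants))]

# The ambiguous consonants of 'golf' come back as 'r' and 'h' in Japanese;
# inverting just that pair already yields four candidate spellings.
print(english_candidates("rh"))  # ['lh', 'lf', 'rh', 'rf']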
Katakana Words
Golf bag is pronounced go-ru-hu-ba-ggu: ゴルフバッグ
Johnson is pronounced jyo-n-so-n: ジョンソン
Ice cream is pronounced a-i-su-ku-ri-i-mu: アイスクリーム
What have we observed in these transliterations?
We can say that there has been a lot of information loss in the process of conversion from English to Japanese
So when we do back-transliteration we may run into trouble
Trouble in Back-Transliteration
There are several acceptable ways of writing the word 'switch' under Japanese writing rules
But when converting from Japanese back to English we must be very strict: no word other than 'switch' is acceptable
Back-transliteration is harder than romanization. Romanizing Angela (アンジェラ) would give us 'anjera', which is nowhere near an acceptable English spelling
Words are often compressed: 'word processing' is transliterated as 'waapuro', which is not at all easy to back-transliterate
The steps to convert from English to Katakana
We do the following steps for the conversion (a toy sketch of this cascade follows the list):
1. P(w) – generates written English word sequences
2. P(e|w) – pronounces English word sequences
3. P(j|e) – converts English sounds into Japanese sounds
4. P(k|j) – converts Japanese sounds to katakana writing
5. P(o|k) – introduces misspellings caused by Optical Character Recognition (OCR)
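The five models form a generative cascade. Here is a minimal sketch with stub functions standing in for the probabilistic models; the outputs are hard-coded toy values, since the real stages are weighted transducers learned from data.

def generate_word():        return "golf bag"               # stands in for P(w)
def pronounce(w):           return "G AO L F B AE G"        # stands in for P(e|w)
def to_japanese_sounds(e):  return "g o r u h u b a g g u"  # stands in for P(j|e)
def to_katakana(j):         return "ゴルフバッグ"             # stands in for P(k|j)
def ocr_noise(k):           return k                        # stands in for P(o|k); identity here

w = generate_word()
print(ocr_noise(to_katakana(to_japanese_sounds(pronounce(w)))))  # -> ゴルフバッグ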
Fixing Back-Transliteration
We apply the five steps above to convert an English word to Japanese
Knight and Graehl used Bayes' theorem to do the reverse
For a given katakana observation 'o', they maximized P(w) · P(e|w) · P(j|e) · P(k|j) · P(o|k) over English word sequences w, summing over the intermediate sequences e, j, and k
P(w) was implemented as a WFSA (Weighted Finite State Acceptor), with weights and symbols on the transitions making some output sequences more likely than others
The other probabilities are implemented using WFST (Weighted Finite State Transducer) with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty symbol ϵ
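A toy sketch of the Bayes decoding objective on the earlier 'switch' example: the channel score favours faithful romanizations, but the prior P(w) vetoes non-words. Both probability columns are invented for illustration, with the channel models lumped into a single P(o|w).

import math

observed = "スイッチ"  # katakana for 'switch'

candidates = {
    # w: (P(w) from an English language model, lumped channel score P(o|w))
    "switch":  (1e-4, 0.90),
    "suitchi": (1e-9, 0.95),  # faithful romanization, but implausible English
}

best = max(candidates, key=lambda w: math.log(candidates[w][0])
                                     + math.log(candidates[w][1]))
print(best)  # -> switch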
Algorithms for extracting the best transliteration
Assume that we have a WFSA that contains n states and m arcs
Two algorithms were used (a toy Dijkstra sketch follows this list):
1. Finding the best transliteration: Dijkstra's shortest-path algorithm (1959). Time complexity: O(m + n log n)
2. Finding the best k transliterations: Eppstein's k-shortest-paths algorithm (1994). Time complexity: O(m + n log n + k)
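A minimal sketch of step 1, assuming arc costs on a small hypothetical WFSA are negative log probabilities, so that the shortest path is the most probable transliteration.

import heapq

graph = {  # state -> list of (next state, output symbol, cost = -log prob)
    0: [(1, "ma", 0.2), (1, "maa", 1.6)],
    1: [(2, "su", 0.1)],
    2: [(3, "taazu", 0.3), (3, "tarzu", 2.0)],
    3: [],
}

def best_path(start, goal):
    """Dijkstra: repeatedly pop the cheapest frontier state until the goal appears."""
    heap = [(0.0, start, "")]
    seen = set()
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state == goal:
            return out, cost
        if state in seen:
            continue
        seen.add(state)
        for nxt, sym, c in graph[state]:
            heapq.heappush(heap, (cost + c, nxt, out + sym))
    return None

print(best_path(0, 3))  # -> ('masutaazu', ~0.6)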
Example for Back Transliteration
Consider the Japanese word マスターズトーナメント. This is first read by an OCR, which may introduce errors
Now this is passed through the inverted P(k|j) model, which converts it to:
m a s u t a a z u t o o ch i m e n t o
Now comes the P(j|e) model. It converts the text to:
M AE S T AE AR DH UH T AO CH IH M EH N T AO
Now comes the P(e|w) model, which produces:
masters tone am ent awe
Finally, applying the P(w) model, the word becomes: Masters Tournament
BTP Work
I will be working under PhD student Arjun Atre for the project
We will try to develop machine transliteration tools for Indian languages
I will try to develop a bridging language that can be used to transliterate text from one Indian language to another
This would contribute a lot to the NLP community and would be a leading step toward handling OOV words, which are plentiful in our native languages
THANK YOU
References
Jong-Hoon Oh, Key-Sun Choi, Hitoshi Isahara (2006). A Comparison of Different Machine Transliteration Models.
Kevin Knight and Jonathan Graehl (1997). Machine Transliteration. (The phoneme-based transliteration model.)
https://en.wikipedia.org/wiki/Transliteration