
Literature Survey: Machine Translation

Shreya Alva, Shubham Dewangan, Nitish Joshi, Pushpak Bhattacharyya

CFILT, Indian Institute of Technology Bombay, India
{shreya, nitishjoshi, shubhamd, pb}@cse.iitb.ac.in

Abstract
Machine Translation lies at the heart of Natural Language Processing, which in turn is one of the core components of the Artificial Intelligence discipline. In this report, we take a brief look at the legacy approaches to machine translation, while also examining the path that Machine Translation is headed on with the newest approach to this classic problem, namely neural machine translation. This report is a brief introduction to the major works in the field of neural machine translation.

    1 Introduction

Artificial Intelligence is transforming the way we interact with the world as it bridges the gap between humans and machines. Machine translation (MT) is the task wherein computers transform input from one natural language into another natural language while preserving the meaning of the original input. It also serves the purpose of bridging the language divide between humans with the help of automation. The input to an MT system can be a sentence, a block of text or whole documents. The problem is fraught with ambiguities that make it very challenging for a computer that has no world knowledge or intuitive understanding of human language. To name a few, the ambiguities can be morphological (what words to use and how to combine sub-word units), syntactic (relating to grammar), semantic (inferring the meaning of words, since the same word can mean different things, e.g. river bank versus financial bank), and so on.

Traditionally there existed three approaches to machine translation (Bhattacharyya, 2015): Direct MT, Rule Based MT (abbreviated to RBMT) and Data Driven MT. Under Data Driven MT we have two approaches, Example Based and Statistical MT (abbreviated to EBMT and SMT respectively). We discuss the legacy approaches of Rule Based MT and Example Based MT in Section 2 and Statistical MT in Section 3.

However, in the past decade Neural Network based approaches have gained momentum, as they have in several other fields. This seminar report is an exploration of the journey of machine translation and where it is headed. The main challenge is to build systems that deliver adequate and fluent translations. Researchers keep endeavoring to improve translation quality, which is most commonly measured by the BLEU (bilingual evaluation understudy) score, a metric that captures how good the machine's translation is by comparing the modified precision of n-grams (n = 1, 2, 3, 4) against reference translations, combined with a brevity penalty for translations that are too short. Although there may be several valid translations that are not among the references, BLEU (and its interactive form, iBLEU) remains the most popular metric due to its simplicity of calculation and its good correlation with human judgment.
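As a concrete illustration of the metric, the following is a minimal, toy sketch of BLEU-style scoring (modified n-gram precision combined with a brevity penalty). The smoothing constant and the helper names are our own assumptions; real evaluations use standard implementations such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Toy sentence-level BLEU: modified n-gram precision x brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)  # crude smoothing to avoid log(0)
    # brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
references = ["the cat is sitting on the mat".split()]
print(round(bleu(candidate, references), 3))
```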

2 Legacy Approaches: RBMT and EBMT

2.1 Rule Based Machine Translation (RBMT)

The original approach to machine translation required critical involvement of humans and linguistic experts, as training data was not widely available, nor were the computational resources of the time up to the mark to exploit the language-transfer information that could have been provided by data. In this approach, humans explicitly state rules that specify how an input sentence is to be analysed and transferred into an intermediate representation, which is then used to generate the target language sentence.

These kinds of systems suffer from a high-precision, low-recall condition: when the rules can be correctly applied they generate the right output, but most of the time the rules cannot be aptly applied. When more than one rule is applicable to a sentence, there can be a conflict between rules, so careful ordering is necessary.

There are two approaches to RBMT: Interlingua based MT and Transfer based MT.

2.1.1 Interlingua based MT
Interlingua is an intermediate language that represents meaning without any ambiguity. To perform translation, the system converts input from the source language into the interlingua, and then from the interlingua into the target language.

Universal Networking Language (UNL) is one such interlingua. Information is represented sentence-wise. Each sentence can be represented by a hypergraph wherein the nodes are concepts and directed edges depict relations. Concepts are known as universal words (UWs) and are gathered from the lexicon (lexicon entries have syntactic and semantic attributes). They are a language-independent representation of word knowledge. For example, salad(icl > food) denotes the noun salad; the 'icl' notation captures the inclusion phenomenon and forms an is-a structure like in semantic nets. A standard set of relation labels (RLs) relates UWs and thus captures conceptual knowledge. Speech acts such as the speaker's time, aspect and view of an event are depicted via attribute labels. The reasons that establish the authority of UNL are that it is the only interlingua that does all of these together:

(1) uses semantic relations to connect words with the main predicate of the sentence,
(2) gives an unambiguous word representation, and
(3) lucidly expresses properties of the speakers' and referents' worldviews.

However, creating a UW dictionary that can link the lexemes of different languages is challenging; the two main obstacles are:

(1) the granularity of the conceptual space is not the same in all languages, and
(2) the existence of multiwords. Multiwords are combinations of words characterized by noncompositionality (the meaning of the combination cannot be derived from the words that make it up) and fixed collocation of words, in structure and order. Examples: the prepositional verb took off, which means to run away, and the phrasal-prepositional verb catch up with, which could either mean reaching a person ahead of you, or just getting to talk with someone you haven't met in a long time.

To perform translation, first, in the analysis stage (A-stage), the source sentence is converted to the interlingua, and then the interlingua is converted to the target sentence (generation or G-stage). Since the interlingual representation requires full disambiguation, the A-stage is a complete natural language analysis problem. The interlingua representation is converted into the target language in three stages: selecting the appropriate lexemes and inserting the function words, identifying the case and generating the correct morphology (verb form and so on), and planning the syntax, i.e. deciding the word order for these words.

2.1.2 Transfer based MT
Since complete disambiguation of a sentence is a very difficult task, and not even necessary for closely related languages wherein ambiguity can be preserved without affecting the meaning, Transfer based MT gained traction. The process involves taking the source sentence up to an intermediate stage in the Vauquois triangle and then performing transfer to an intermediate representation; this does not demand complete disambiguation of the sentence. The defining characteristic of transfer-based MT is the application of well-defined transfer rules that work with generalizations of lexical objects (JJ: adjectives, NP: noun phrases, V: verbs) to perform structural transformations. A transfer rule T: REP_S -> REP_T is a mapping from the source language representation REP_S to the target language representation REP_T. To capture the common operation in translating from an SVO language like English to an SOV language like Hindi, a rule of the form <V NP -> NP V>, which reverses the constituents of a verb phrase, applies.
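As a toy illustration of such a transfer rule (not a system described in this survey), the following sketch applies the <V NP -> NP V> reordering to a bracketed parse; the tree representation and function names are our own assumptions.

```python
# Apply a single transfer rule <V NP -> NP V> to a bracketed verb phrase
# when translating from an SVO language (English) to an SOV language (Hindi).
def apply_transfer(tree):
    """tree is a (label, children) tuple; leaves are (label, word) pairs."""
    label, children = tree
    if isinstance(children, str):
        return tree
    children = [apply_transfer(c) for c in children]
    if label == "VP" and len(children) == 2 and children[0][0] == "V" and children[1][0] == "NP":
        children = [children[1], children[0]]   # V NP -> NP V
    return (label, children)

def leaves(tree):
    label, children = tree
    return [children] if isinstance(children, str) else [w for c in children for w in leaves(c)]

# "Ram eats mango" -> target word order "Ram mango eats"
sent = ("S", [("NP", "Ram"), ("VP", [("V", "eats"), ("NP", "mango")])])
print(" ".join(leaves(apply_transfer(sent))))
```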

2.2 Example Based Machine Translation (EBMT)

The core idea is translation by analogy. To perform a translation, we compare the input to the parallel corpora database that the system possesses, which is done by computing text similarity. The fundamental requirements for text similarity are: (1) a valid measure of similarity, which should measure similar texts as indeed similar and dissimilar ones as dissimilar, and (2) resources like large lexical knowledge networks that enable one to capture the notion of 'similarity'. Transfer takes place through templates that are learned from data. EBMT shares stages of processing with RBMT and SMT: the processing they perform on the source sentences is common to all three, but their outputs differ.

2.2.1 How EBMT is done
The broad steps to be followed in EBMT are:
1) Compute similarity with the available templates
2) Stitch the matching fragments together
3) Smooth boundary friction
The drawback of EBMT is that it does not scale up well.

    3 Statistical Machine Translation (SMT)

Pure SMT relies wholly on data; the model is built without any human intervention, using corpora.

It was the ruling paradigm until the early 2010s. To teach a machine how to translate, it needs examples which give it this information: the translations of words (the bilingual word mappings), i.e. what word maps to what word(s), and the correct positions for the translated words in the target sentence (alignment). The underlying principle of SMT is word alignment.

3.1 Word Alignment
When a sentence in one language is translated to another, there exists a mapping between the words. A word from either side of the translation can map to zero, one or more words on the other side. The phenomenon of one word mapping to multiple words on the other side is called fertility. A word can map to multiple words due to:
1. Synonymy on the target side (e.g., 'water' in English translating to jal, paani, neer, etc., in Hindi),
2. Polysemy on the source side (e.g., 'brightened' translating to 'khil uthna', as in her face brightened with joy -> usakaa cheharaa khushi se khil uthaa),
3. Syncretism ('to run' translating to 'bhaaga', 'bhaagi', 'bhaage').
These alignments are the key to translation.

To learn these word mappings the system requires several parallel sentences; some introduce the possibility of a particular alignment and others establish its certitude. Given a sentence pair, another useful pair is of one of two types: one-same-rest-changed (only one word is common to both sentences) and one-changed-rest-same (the sentences differ by only one word). To compute these mappings, the Expectation Maximization algorithm is used.

The basic equation of SMT is:

\hat{e} = \arg\max_{e} P(e|f) = \arg\max_{e} P(e) \cdot P(f|e)

where e is a sentence from the sentence bank E of language 1 (say, English), and f is from the sentence bank F of language 2 (the foreign language, like French); P(e) is the language model and P(f|e) is the translation model. P(e) is computed from the product of N-gram probabilities obtained from L1's corpora. On the other hand, P(f|e) is the product of word translation probabilities given the input words, subject to the effect of alignment.

IBM proposed several approaches to build the translation model and alignment model, of which the three most important and basic ones were:

3.1.1 IBM Model 1
It is the most basic model and builds on the following assumptions:

1. The length m of the output is modelled by a single small constant, i.e. P(m|e) = \epsilon.

2. The probabilities of all alignments are equal, i.e. any word can map to any word with equal probability. Therefore,

\prod_{j=1}^{m} P(a_j \mid f_1^{j-1}, a_1^{j-1}, e, m) = \frac{1}{(l+1)^m}

3. The word f_j depends only on the word of e it is aligned to, which is e_{a_j}, i.e.,

\prod_{j=1}^{m} P(f_j \mid f_1^{j-1}, a_1^{j-1}, e, m) = \prod_{j=1}^{m} P(f_j \mid e_{a_j})

Combining the three assumptions and applying optimization steps, IBM Model 1 becomes:

P(f|e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)   (1)

Drawbacks of IBM Model 1:

• Its treatment of sentence length is too simplistic and limiting.

• Due to language phenomena, words usually occur in phrases, so it is unlikely that all alignments are equally probable.

• There could be zero or more alignments possible for a single input word.
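The lexical translation probabilities of Model 1 are learned with expectation maximization; a minimal sketch on a toy corpus is given below, with the initialization and data structures chosen for clarity rather than efficiency.

```python
from collections import defaultdict

def train_ibm1(parallel_corpus, iterations=10):
    """EM for IBM Model 1 translation probabilities t(f|e).
    parallel_corpus: list of (f_tokens, e_tokens) pairs; 'NULL' is added on
    the e side so that f words may remain unaligned."""
    f_vocab = {f for fs, _ in parallel_corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))        # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in parallel_corpus:
            es = ["NULL"] + es
            for f in fs:                               # E-step: expected counts
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():                # M-step: re-estimate t
            t[(f, e)] = c / total[e]
    return t

corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = train_ibm1(corpus)
print(round(t[("haus", "house")], 3))
```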

3.1.2 IBM Model 2
The assumptions of this model are as follows:

1. It takes alignment into consideration and models it separately, i.e.,

\prod_{j=1}^{m} P(a_j \mid f_1^{j-1}, a_1^{j-1}, e, m) = \prod_{j=1}^{m} P(a_j \mid j, l, m)

2. As before, f_j depends on e_{a_j}.

The resulting translation model is:

P(f|e) = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} P(i \mid j, l, m) \cdot P(f_j \mid e_i)   (2)

It suffers from drawbacks similar to Model 1, having addressed only the issue of accounting for alignment.

3.1.3 IBM Model 3
This model works with an alignment model that considers the fertility of words. Fertility \eta(\phi|e) is defined as follows: for each word of e, the number of generated words \phi = 0, 1, 2, \ldots is chosen. Note that fertility can also take care of the NULL mapping.

Translation by IBM Model 3 is a four step process:

1. Fertility: for each source sentence word e_i, a fertility \phi_i is chosen with probability \eta(\phi_i \mid e_i).

2. Null insertion: the number of target words to be generated from NULL is chosen with probability \eta(\phi_0 \mid NULL).

3. Lexical translation: each word is translated, as in IBM Model 1, with probability t(f_j \mid e_{a_j}).

4. Distortion: the probability that the translation of the word in position i ends up in position j of f is modelled by d(j \mid i, l, m).

P(f|e) = \sum_{a_1=0}^{l} \sum_{a_2=0}^{l} \cdots \sum_{a_m=0}^{l} P(f, a \mid e)
       = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \Big[ \binom{m-\phi_0}{\phi_0} \, p_0^{m-2\phi_0} \, p_1^{\phi_0} \prod_{i=1}^{l} \phi_i! \, \eta(\phi_i \mid e_i) \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \, d(j \mid a_j, l, m) \Big]   (3)

IBM Models 4 and 5 are more powerful, but they are also considerably more complex. The process of translating a new input sentence, say f_{new}, is called decoding. We apply the same noisy channel model:

e_{best} = \arg\max_{e_{new}} P(e_{new}) \, P(f_{new} \mid e_{new})
         = \arg\max_{e_{new}} P(e_{new}) \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_{new,j} \mid e_{new,i})
         = \arg\max_{e_{new}} \frac{\epsilon}{(l+1)^m} \Big( P(e_{new,0}) \prod_{i=1}^{l} P(e_{new,i} \mid e_{new,i-1}) \Big) \Big( \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_{new,j} \mid e_{new,i}) \Big)   (4)

where the language model is approximated by a bigram model. When languages have different fertilities, word based translation suffers. Phrase based translation therefore became the standard way of doing SMT, with phrases (not necessarily linguistic ones) instead of words as the translation units.

One major drawback of SMT is that it requires individual tuning of its various components. This ushered in the era of Neural Machine Translation, which comprises an end-to-end system that can be jointly tuned instead of having to adjust each of the individual components.

    4 Sata-Anuvadak

As the name Sata-Anuvadak ('hundred translators') (Kunchukuttan et al., 2014) suggests, this work is about a large number of translation systems, focusing on multiway translation of Indian languages. Indian languages from various families are considered, and the following objectives are investigated:

1. To understand the translation patterns involved when translating within the same language family and across families.

2. To leverage the shared characteristics between Indian languages. In brief, the shared characteristics are:

(a) Free word order, with the SOV (Subject-Object-Verb) form being canonical.

(b) Most of the Indian languages have scripts derived from the Brahmi script.

(c) Vocabulary and grammatical rules are largely derived from Sanskrit.

(d) Most of the Indian languages are morphologically rich in nature.

3. To investigate the effect of pre-processing and post-editing in SMT systems for Indian languages.

In general, this work built four systems:

1. Baseline phrase-based system: used to investigate the relationship between translation models while taking language families and corpus size into consideration.

2. English-Indian language with reordering: focuses on English to Indian language translation by applying reordering rules on the English side so that both sides conform to the same word order.

3. English-Indian language with Hindi-tuned source side reordering: uses better reordering rules and concludes that creating rules based on the English-Hindi reordering rules benefits the overall system.

4. Indian language-Indian language with transliteration: for the translation of unknown words between languages that do not share the same script, transliteration can be done by exploiting the Unicode ranges allotted to Indian languages.

The current focus of our project is Indian-Indian language translation, specifically Hindi-Marathi, and for that we have taken the results of System 1 described above as the baseline for all our experiments. Since Hindi and Marathi share the same Devanagari script, the transliteration technique need not be applied as a post-editing step.

The following important conclusions came out of the work:

1. Translation accuracy between Indo-Aryan languages shows the best results because of the same word order and similar case marking, whereas the Dravidian languages, being more morphologically rich, show lower accuracies.

2. Increasing the corpus size benefits the Indo-Aryan languages to a great extent. For morphologically poorer languages, an increase in corpus size helps translation accuracy improve, whereas for morphologically richer languages increasing the corpus size is not as worthwhile.

3. Translation from morphologically richer to poorer languages gives poor results because of the absence of source side phrases caused by morphological richness; morphological segmentation can help solve this issue.

Figure 1: Sata-Anuvadak System 1 results (Kunchukuttan et al., 2014)

    5 Morfessor

5.1 Introduction
5.1.1 What is a Morpheme
A morpheme is an indivisible morphological unit of a language that carries some meaning. For example, un, break and able together form 'unbreakable'; here 'un', 'break' and 'able' belong to the morpheme set of the English language.

5.1.2 Why is Learning Morphemes Important?
When considering a language with a large vocabulary, learning morphemes helps. In support of this statement, consider the task of estimating an n-gram model, which means finding the probabilities of all the words in the n-gram sequence. When the vocabulary size is large, n-gram estimation faces two major problems:

1. Since the vocabulary size is very high, and when the word itself is taken as the basic unit, there exists a large set of word representations, which increases data sparsity. The n-gram estimates are poor as a result.

2. Morphologically rich languages have a large number of word forms that are not even seen at training time, and the model fails to deal with new word forms. Learning sub-word units like morphemes may help deal with this issue.

5.2 Approaches

In the work of Creutz and Lagus (2002), two unsupervised approaches to learning morphemes are discussed. They are described below.

5.2.1 Method 1: Recursive Segmentation & MDL Cost

1. Model cost using MDL: To define the model cost, a few terms must be introduced:

• Tokens: the total number of words in a text or corpus, regardless of how often they are repeated.

• Types: the total number of distinct tokens in a text or corpus. For example, in the text "I am Indian. I love India." there are 6 tokens and 5 types.

• Codebook: the vocabulary of all the morph types, termed the morph codebook or simply the codebook.

The total cost is divided into two parts: (1) the cost of the source text and (2) the cost of the codebook, as in equation 5:

C = Cost(source text) + Cost(codebook)
  = \sum_{tokens} -\log p(m_i) + \sum_{types} k \cdot l(m_j)   (5)

The MDL cost function in equation 5 is explained below:

(a) Cost(source text): the sum of the negative log-likelihood of the morphs, summed over all the tokens in the source text.

(b) Cost(codebook): the total length in bits required to represent all the morph types in the codebook. The term l(m_j) represents the length of the morph and k represents the number of bits required to represent a character; for example, the 26 English lowercase letters can be covered by 5 bits (32 distinct codes).

The probability of m_i is calculated by the maximum-likelihood estimate, as in equation 6:

p(m_i) = \frac{count(m_i) \text{ in the source text}}{\text{total count of morph tokens}}   (6)

(c) Recursive segmentation: A recursive approach is applied to search for the best morph segmentation. In brief:

• Initially, all the tokens in the source text are considered valid morphs and are added to the codebook.

• Each token is tried with every possible split into two, and the split with the minimum total cost is chosen. The total cost is calculated by recursively performing this step on the two segments obtained after splitting.

• The segmentation can be seen as a binary tree; an example is shown in figure 2, which considers two Finnish words with some overlap in their morphs. The numbers associated with each box are in the format a:b, where a is the position at which the split happens and b is the count of that token in the source text. The count of a chunk must equal the sum of the occurrences of its parents; in the example, the morph 'auton' has count 9, the sum of the occurrences of its parents, e.g. 'linja-auton' (5) and 'autonkulijettajallakaan' (2).

• To segment a word, we search for an exact match of the word in this hierarchical tree and recursively trace the splits until the leaf nodes are reached.

(d) Adding & removing morphs: When we encounter a new word in the source text, we do not simply reuse the already seen morphs to split it, as that may lead to local optima. For each new word seen:

• if the word has been observed already, remove its instance from the tree, decrease the counts of all the children associated with that chunk, and segment the word afresh. This may result in a better segmentation and avoids the local optima issue.

• if the word is observed for the first time, segment it by considering all the possible splits and choosing the one with the minimum cost, as described above.

(e) Dreaming: Since the cost is updated only as each next token is processed, the segmentation quality of words that occurred at the very beginning and did not occur again during training might be poor, because the model learns new morphs gradually. To resolve this, a step called dreaming is performed: no new words are processed; instead, the words already seen are re-processed in random order. Dreaming continues for some specific amount of time or until there is a reasonable decrease in the cost. Figure 3 shows how dreaming leads to a decrease in the cost.

Figure 2: Hierarchical segmentation of two words from Finnish (Creutz and Lagus, 2002)
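A toy, one-level sketch of this segmentation idea is given below: each word is either kept whole or split once, whichever choice gives the lower MDL-style cost (source-text term plus codebook term). The real method applies the split recursively and maintains the hierarchical tree described above; the constant of 5 bits per character follows the running example.

```python
import math
from collections import Counter

K_BITS = 5  # assumed bits per character, as in the running example

def corpus_cost(morph_counts):
    n = sum(morph_counts.values())
    source = -sum(c * math.log(c / n) for c in morph_counts.values())   # token term
    codebook = sum(K_BITS * len(m) for m in morph_counts)               # type term
    return source + codebook

def best_split(word, morph_counts):
    """Try the word unsplit and every binary split; keep whichever is cheaper."""
    candidates = [[word]] + [[word[:i], word[i:]] for i in range(1, len(word))]
    best, best_cost = None, float("inf")
    for morphs in candidates:
        trial = morph_counts.copy()
        trial.update(morphs)
        cost = corpus_cost(trial)
        if cost < best_cost:
            best, best_cost = morphs, cost
    return best

counts = Counter()
for w in ["unbreakable", "breakable", "unbearable", "bear", "break"]:
    counts.update(best_split(w, counts))
print(counts)
```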

5.2.2 Method 2: Sequential Segmentation & ML Cost

1. Model cost using Maximum Likelihood: The cost function used is the likelihood of the data given the model. The ML estimate for the morph tokens is the same as the one in equation 6. In this method we consider only the probability of the data given the model and ignore the cost of the model itself:

Cost(source text) = \sum_{\text{morph tokens}} -\log p(m_i)   (7)

Figure 3: Effect of dreaming on the cost (Creutz and Lagus, 2002)

2. Search algorithm: This section describes how the splitting is done. The following steps are followed to learn the splits:

(a) For initialization, the length of a split is sampled from a Poisson distribution with \lambda = 5.5, and the word is split at intervals of this length. If the sampled length is greater than the chunk left after segmenting, segmentation of that word stops.

(b) The following steps are repeated a fixed number of times, which is a hyperparameter of the training process:

i. The morph probabilities are estimated for the current splitting.

ii. With the current splitting, i.e. the codebook, the source text is re-segmented using the Viterbi algorithm to find the best segmentation of each word into morphs with minimum cost.

iii. If this is not the last iteration, the segmentation is tested against the rejection criteria. If a segmentation does not pass the rejection criteria, the split is made randomly, as in the initialization step. This random splitting leads to better learning, since it allows new morphs to be learned; without it the model would not reach an optimal solution.

(c) Rejection criteria: The following rules constitute the rejection criteria:

i. Removal of rare morphs: morphs that have occurred only once up to the previous iteration are removed, since morphs with extremely low occurrence are most probably wrong.

ii. A segmentation is rejected if it contains multiple single-letter morphs. Multiple single-letter morphs in a segmentation are often a sign of bad performance, meaning that the model is stuck in a local optimum.
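Step ii above relies on Viterbi segmentation of a word given the current morph probabilities; a minimal sketch, using -log p as the cost, is shown below (the fallback for unsegmentable words is our own simplification).

```python
import math

def viterbi_segment(word, morph_probs):
    """Lowest-cost split of 'word' into known morphs, cost = -log p(morph)."""
    n = len(word)
    best = [0.0] + [float("inf")] * n       # best[i] = minimal cost of word[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            m = word[j:i]
            if m in morph_probs:
                cost = best[j] - math.log(morph_probs[m])
                if cost < best[i]:
                    best[i], back[i] = cost, j
    if best[n] == float("inf"):
        return [word]                        # no segmentation with known morphs
    morphs, i = [], n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return morphs[::-1]

probs = {"un": 0.2, "break": 0.3, "able": 0.3, "bear": 0.2}
print(viterbi_segment("unbreakable", probs))   # ['un', 'break', 'able']
```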

5.2.3 Method 3: MAP Estimate

This section is closely related to the mathematical formulation described for Morfessor 1.0 (Creutz and Lagus, 2002). The Minimum Description Length (MDL) principle and the Maximum a Posteriori (MAP) estimate are equivalent and produce the same result.

1. MAP of the overall probability: The aim is to induce a model of the language from the given source text in an unsupervised manner. The model of the language (M) consists of two elements:

(a) Lexicon: the set of distinct morphs in the source text; basically, the vocabulary of the segmented corpus.

(b) Grammar: the information about how the segmented text can be recombined.

The aim is to create a model of the language such that the set of morphs is concise and also gives a concise representation of the source text or corpus. The MAP estimate to be maximized is described in equations 8 and 9:

\arg\max_{M} P(M \mid corpus) = \arg\max_{M} P(corpus \mid M) \cdot P(M)   (8)

P(M) = P(lexicon, grammar)   (9)

Equation 8 shows that the MAP estimate is composed of two parts: the model probability and the likelihood of the corpus given the model. Equation 9 states that the probability of the model of the language is the joint probability of the lexicon and the grammar, incorporating the assumption that the morphology learning task is affected by both kinds of features.

2. Lexicon: The lexicon is the vocabulary of the segmented text, or simply the collection of all the morph units in the corpus. Each morph carries some properties; in the Morfessor Baseline model, the only properties considered are the frequency of the morph in the source text (corpus) and the string the morph is composed of, which also implicitly carries the length of the string, i.e. the number of characters in the sequence.

3. Grammar: The grammar contains the information about how the segmented text can be recombined. In the baseline model of Morfessor, however, the grammar is not considered (it is taken into account in later models), so P(M) reduces to P(lexicon) only. In the absence of grammar, the model cannot decide whether a morph should be placed at the start, middle or end of the word being formed. The probability of morph \mu_i is the ML estimate, calculated as in equation 10:

P(\mu_i) = \frac{f_{\mu_i}}{N} = \frac{f_{\mu_i}}{\sum_{j=1}^{M} f_{\mu_j}}   (10)

4. Corpus: The words present in the corpus are represented as sequences of morphs. There might be multiple segmentations of a word, and one is chosen based on the MAP estimate. The probability of the corpus given the model is estimated as in equation 11, where the corpus contains W words, the j-th word is split into n_j morphs, and the k-th morph of the j-th word is denoted \mu_{jk}:

P(corpus \mid M) = \prod_{j=1}^{W} \prod_{k=1}^{n_j} P(\mu_{jk})   (11)

5. Search algorithm: A greedy search algorithm is used to find the optimal lexicon and segmentation. Different segmentations are tried, and the one that produces the highest probability is selected; this continues until no significant improvement is seen. The probabilities are replaced by log probabilities and the product by a sum; the negative log probabilities are treated as code lengths, as in the MDL framework.

6. Recursive segmentation: A recursive approach is applied to search for the best morph segmentation, as in Method 1:

• Initially, all the tokens in the source text are considered valid morphs and are added to the codebook.

• Each token is tried with every possible split into two, and the split with the minimum code length is chosen. The total code length is calculated by recursively performing this step on the two segments obtained after splitting.

• The segmentation can be seen as a binary tree; an example is shown in figure 4, which considers two Finnish words with some overlap in their morphs. The numbers associated with each box are in the format a:b, where a is the position at which the split happens and b is the count of that token in the source text. The count of a chunk must equal the sum of the occurrences of its parents; in the example, the morph 'auton' has count 9, the sum of the occurrences of 'linja-auton' (5) and 'autonkulijettajallakaan' (2).

• To segment a word, we search for an exact match of the word in this hierarchical tree and recursively trace the splits until the leaf nodes are reached.

• The above steps are repeated for multiple epochs, and between two epochs the words are randomly shuffled to avoid any inherent bias due to the order in which the words appear in the corpus.

Figure 4: Hierarchical segmentation of two words from Finnish (Creutz and Lagus, 2002)

6 BiRNN encoder, RNN decoder architecture

Bahdanau et al. (2015) proposed the first system that jointly learns to align and translate input and output sentences, with a novel architecture. The probability of the output sequence is given by:

p(y) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1})

The conditional probability for each output word is given by:

p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

where x is the input sentence, with words denoted by vectors x_1 \ldots x_{T_x}, and s_i is the hidden state of the decoder RNN at time i. The words of a sentence are encoded as one-hot vectors over a limited vocabulary of the words of a language. The input is encoded by a bidirectional RNN (BiRNN): the forward RNN reads the input sequence in the given order and calculates a sequence of forward hidden states; the backward RNN does the reverse operation. The encoder maps the input sentence to a sequence of annotations; each word j has an associated annotation h_j, whose value is the concatenation of the two hidden states.

The model generates a context vector for each word in the output sentence, which was different from previous models of NMT. The context for target word i is c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, where \alpha_{ij} is the weight of each annotation, given by:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}   (12)

Here e_{ij} scores how well the input words around position j and the output word at position i match.
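A minimal sketch of one decoder step of this attention mechanism is given below; the additive scoring function and the parameter names (W_a, U_a, v_a) follow the usual formulation of Bahdanau-style attention and are assumptions, not code from the paper.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d_s,) previous decoder state; H: (T_x, d_h) encoder annotations."""
    scores = np.tanh(H @ U_a + s_prev @ W_a) @ v_a        # e_ij for j = 1..T_x
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                           # softmax over source positions
    context = alpha @ H                                   # c_i = sum_j alpha_ij h_j
    return context, alpha

T_x, d_h, d_s, d_a = 6, 8, 8, 10
rng = np.random.default_rng(0)
H = rng.normal(size=(T_x, d_h))
s_prev = rng.normal(size=(d_s,))
W_a, U_a, v_a = rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)), rng.normal(size=(d_a,))
context, alpha = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha.round(3), context.shape)
```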

Figure 5: Proposed model generating target word y_t (Bahdanau et al., 2015)

Whenever the proposed model generates an output word, it searches for a set of positions in the source sentence where the most relevant information can be found. The output word is predicted based on the context vectors associated with those pertinent source positions as well as the previously generated output words. The decoder uses a softmax at the final layer and finds the associated probability for each word in the target vocabulary. In earlier approaches to NMT, the entire input sentence was encoded into a fixed size vector; here instead a subset of context vectors is adaptively chosen on the fly while decoding.

In this paradigm, alignment is not considered to be a hidden variable; it is computed by a feedforward neural network. The model directly finds a soft alignment, meaning that each word need not map with certainty to a particular word or words (hard alignment); instead we can assign different probabilities of its mapping to all words in the sentence. The default option for the gated units in OpenNMT-py is the LSTM.

    7 Word Embeddings

Word embeddings are n-dimensional vector representations of words. A commonly used length for the embedding vector is 300. The whole vocabulary is represented as a word embedding matrix, which is pre-learned using algorithms like Word2Vec, GloVe, BERT and many others. The following subsections briefly describe the most common and basic algorithms for obtaining word embeddings from a monolingual corpus.

7.1 Word2Vec
Word2Vec (Mikolov et al., 2013a; Mikolov et al., 2013b) was a breakthrough technique for finding word embeddings, as opposed to earlier methods that used global counts to obtain vector representations of words. The Word2Vec model gave representations that were found to be successful in word analogy tasks. For instance, in the famous word analogy example, the vector arithmetic "king - man + woman" gives a vector close to that of "queen", which was a surprisingly good result and made Word2Vec the state-of-the-art technique for finding word embeddings.

The Word2Vec model chooses a window size, say 5 (2 words before and 2 words after the word in focus), and applies either of two approaches to learn word embeddings:

Figure 6: CBOW visualization (Mikolov et al., 2013a)

• Continuous Bag-of-Words model: The CBOW model aims to predict the word in focus by looking at the words surrounding it, i.e. those that lie within the context window range. Example: "The dog ---- at strangers." Using The, dog, at and strangers, the CBOW model tries to predict the word 'barks'.

• Skip-gram model: The Skip-gram model aims to predict the context words by looking at the word in focus. Using the word 'barks', the Skip-gram model tries to predict the words The, dog, at and strangers. The training procedure of Word2Vec using the Skip-gram model is described below.

Figure 7: Skip-gram visualization (Mikolov et al., 2013a)

7.1.1 Word2Vec Training
Word2Vec training is explained below using the Skip-gram model with negative sampling.

• Creating the training data: A window size is chosen, say 5 (2 words before and 2 words after the focus word), and the entire monolingual corpus is scanned. To understand exactly how the training data set is built, take the example sentence: The cat sat on the mat.

Table 1: Training Data

focus  context  target
sat    the      1
sat    cat      1
sat    on       1
sat    the      1
on     cat      1
on     sat      1
on     the      1
on     mat      1

But all these examples have target '1' (e.g. 'the' lies in the context of 'sat', so the output label is 1, otherwise it would be '0'), which means the embeddings learn nothing from training: the model will reach 100% accuracy on such a training set while learning only garbage embeddings.

Negative sampling: we add some 'k' negative samples for each training example to the training set, i.e. word pairs that are not actually neighbours, with target value '0'.

Table 2: Negative Sampling

focus  context  target
sat    the      1
sat    kite     0
sat    table    0
sat    cat      1
sat    mango    0
sat    bike     0
sat    on       1
sat    the      1
on     cat      1
on     sat      1
on     the      1
on     mat      1

• Initialize two matrices, an Embedding matrix and a Context matrix, each of size (vocabulary size x chosen dimension).

• Take the focus word (sat), the context word (the) and the negative samples (kite, table) for the first step of training.

• Get the word embedding of the focus word from the Embedding matrix and of the context word from the Context matrix (v1, v2, v3, v4 represent word embeddings in table 3).

Table 3: Error Calculation

e1        e2          target  e1.e2  sigmoid  error
Sat [v1]  The [v2]    1       0.7    0.66     0.34
Sat [v1]  Kite [v3]   0       0.2    0.5      -0.5
Sat [v1]  Table [v4]  0       -1.6   0.16     -0.84

The objective function (Goldberg and Levy, 2014) to be maximized is as follows:

\arg\max_{\Theta} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)   (13)

where v_c is the embedding of the context word, v_w is the embedding of the word in focus, D is the set of observed (word, context) pairs and D' is the set of negative samples.
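A minimal sketch of one skip-gram-with-negative-sampling update implementing equation 13 is shown below; the vocabulary indices, dimensions and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 10, 8, 0.05
W_emb = rng.normal(scale=0.1, size=(V, d))   # Embedding matrix (focus words)
W_ctx = rng.normal(scale=0.1, size=(V, d))   # Context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(focus, context, negatives):
    """Gradient step on -[log sigma(v_c.v_w) + sum_neg log sigma(-v_n.v_w)]."""
    v_w = W_emb[focus].copy()
    grad_w = np.zeros_like(v_w)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        v_c = W_ctx[c]
        err = sigmoid(v_c @ v_w) - label     # sigma(score) minus target
        grad_w += err * v_c
        W_ctx[c] = v_c - lr * err * v_w      # update context / negative vector
    W_emb[focus] -= lr * grad_w              # update focus vector

# e.g. focus 'sat' = 0, true context 'the' = 1, negatives 'kite' = 2, 'table' = 3
sgns_step(0, 1, [2, 3])
print(W_emb[0].round(3))
```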

7.2 GloVe

GloVe (Pennington et al., 2014) stands for Global Vectors for word representation. GloVe embeddings use global count knowledge, unlike word2vec. The learning is based on a co-occurrence matrix formed from the monolingual corpus, and the word vectors (embeddings) are trained such that their differences predict co-occurrence ratios. Some pairs of words occur more often than others, so the method ensures that such pairs do not contribute disproportionately to the objective by introducing a weight factor for each word pair based on its frequency of occurrence.

7.2.1 GloVe Training
GloVe training is based purely on the co-occurrence matrix. The principle under which GloVe works is that the ratio of the co-occurrences of two words with context words tells us about the meaning of the word in focus. The loss function being minimized is:

\sum_{i,j} weight(X_{ij}) \big( w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \big)^2   (14)

where:

• X_{ij} : co-occurrence value of the i-th and j-th words in the co-occurrence matrix X,

• w_i : word embedding of the input word,

• \tilde{w}_j : word embedding of the output word,

• b_i and \tilde{b}_j : bias terms.
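A minimal sketch of evaluating this weighted least-squares loss is given below; the weighting function with x_max = 100 and alpha = 0.75 follows the values reported by Pennington et al. (2014), and the random inputs are purely illustrative.

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe loss over observed co-occurrences X[i, j] > 0."""
    loss = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        weight = min((X[i, j] / x_max) ** alpha, 1.0)
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += weight * diff ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 5, 4
X = rng.integers(0, 10, size=(V, V)).astype(float)   # toy co-occurrence counts
print(glove_loss(X, rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V)))
```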

    8 Bilingual Word Embeddings

There are different ways of representing words in neural models. One particular method is one-hot encoding. In this scheme, we have a very high dimensional vector for every input word; the dimension of this vector equals the number of words in our vocabulary. Every word has a position in this vector, and this position is set to 1 for the current word while all the other positions are set to 0. For instance:

• man = (0, 0, 1, 0)^T

• tan = (0, 0, 0, 1)^T

• crib = (1, 0, 0, 0)^T

This one-hot vector is very sparse, since most of the entries are 0. We map these one-hot vectors to a lower dimensional space where every input word is represented by a vector in that space; this is generally referred to as a word embedding. We typically include a word embedding layer between the one-hot vector and the hidden layer of the network. Word embeddings give us clustering, i.e. generalization between words, and backoff, i.e. robust predictions under unseen contexts. A complete set of word embeddings can identify similar words and naturally capture relationships between words. For example:

Figure 8: Visualizing the relationship-representing capabilities of word vectors

In Mikolov et al. (2013c), the aim is to learn a linear mapping W \in R^{d \times d} between the word vectors of a seed dictionary:

\min_{W \in R^{d \times d}} \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i)

Typically the square loss is used as the learning objective. W is constrained to be orthogonal, since this preserves distances between word vectors and likewise word similarities; experimentally, this has been observed to improve the quality of the inferred lexicon.

Once a mapping W is learned, one can infer word correspondences for words that are not in the initial lexicon. The translation t(i) of a source word i is obtained as its nearest neighbour under the mapping:

t(i) \in \arg\min_{j \in \{1, \ldots, N\}} \| W x_i - y_j \|

Word embeddings can improve the quality of translations when the embeddings of two or more languages are aligned using a transformation matrix. Many approaches have been suggested to align word embeddings, and we discuss a few of them.

Joulin et al. (2018) present a different approach to bilingual word mapping. They minimize a convex relaxation of the cross-domain similarity local scaling (CSLS) loss, which significantly improves the quality of bilingual word vector alignment. The CSLS criterion is a similarity measure between vectors x and y defined as:

CSLS(x, y) = -2 \cos(x, y) + \frac{1}{k} \sum_{y' \in N_Y(x)} \cos(x, y') + \frac{1}{k} \sum_{x' \in N_X(y)} \cos(x', y)   (15)

where N_Y(x) is the set of k nearest neighbors of the point x in the set of target word vectors Y = {y_1, \ldots, y_N}, and cos is the cosine similarity. They removed the orthogonality constraint and found that this does not degrade the quality of the aligned vectors. Finding the k nearest neighbors of W x_i among the elements of Y is equivalent to finding the elements of Y with the largest dot product with W x_i; employing this idea leads to a convex formulation when the orthogonality constraint on W is relaxed. Their approach can be generalized to other loss functions by replacing the term x_i^T W^T y_j by any function convex in W. They also apply extended normalization: instead of computing the k nearest neighbors only amongst the annotated words, they use the whole dictionary, treating unpaired words in the dictionaries as 'negatives'. The dataset used was Wikipedia fastText vectors, with supervision from a lexicon composed of 5k words and their translations.

The success of orthogonal mapping methods relies on the assumption that the embedding spaces are isomorphic, i.e. that they have the same inner-product structure across languages, but this does not hold for all languages, especially distant ones.
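A minimal sketch of computing the CSLS loss of equation 15 for all source-target pairs is shown below, assuming row-normalised embedding matrices so that dot products equal cosine similarities; the value of k and the random inputs are illustrative.

```python
import numpy as np

def csls_loss(X, Y, k=10):
    """CSLS loss of equation (15); lower is better. X, Y are L2-normalised rows."""
    sims = X @ Y.T                                     # cosine similarities
    r_x = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # mean sim of x to its k NNs in Y
    r_y = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # mean sim of y to its k NNs in X
    return -2 * sims + r_x[:, None] + r_y[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(60, 16)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
loss = csls_loss(X, Y, k=10)
print(loss.shape, int(loss[0].argmin()))               # best CSLS match for source word 0
```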

Hoshen and Wolf (2018) align fastText embeddings with orthogonal mappings and report 81% English-Spanish word translation accuracy but only 2% for English-Japanese.

To combat this, Zhang et al. (2019) put forth a preprocessing technique called 'Iterative Normalization' that makes orthogonal alignment easier by transforming the monolingual embeddings while simultaneously enforcing that (1) individual word vectors are unit length, and (2) each language's average vector is zero. In Iterative Normalization, each word vector x_i is iteratively transformed by first making the vectors unit length and then making their mean zero (a short sketch follows the list below). This is what sets it apart from previous approaches:

• Earlier, the zero-mean condition was motivated by the heuristic argument that two randomly selected word types should not be semantically similar (or dissimilar) in expectation.

• But word types might not be evenly distributed in the semantic space; some words have more synonyms than others.

• A single round of length and center normalization does not suffice.
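A minimal sketch of the Iterative Normalization steps described above (alternately unit-normalising and re-centering the vectors) follows; the number of iterations is an illustrative assumption.

```python
import numpy as np

def iterative_normalization(X, n_iter=5):
    """Repeat: make every vector unit length, then make the mean vector zero."""
    X = X.copy()
    for _ in range(n_iter):
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit length
        X -= X.mean(axis=0, keepdims=True)              # zero mean
    return X

rng = np.random.default_rng(0)
E = iterative_normalization(rng.normal(size=(1000, 300)))
print(np.linalg.norm(E, axis=1).mean().round(3), np.abs(E.mean(axis=0)).max().round(4))
```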

    9 Transformer

While RNN and LSTM based neural networks had been established as the state of the art in sequence modeling tasks like MT, they are constrained by sequential computation: the generated sequence of hidden states h_t is a function of the previous hidden state h_{t-1} and the input at time t. Attention mechanisms have become a vital part of such models, allowing the modeling of dependencies regardless of their distance in the input or output sequences. The Transformer, proposed by Vaswani et al. (2017), is a model architecture that forgoes recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and faster training, and it reached a new state of the art in translation quality.

Figure 9: The Transformer model architecture

Both the encoder and the decoder are composed of a stack of N = 6 identical layers. Each encoder layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)). The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. The decoder's self-attention mechanism is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

The authors describe the attention function as mapping a query and a set of key-value pairs to an output, where the queries, keys, values and outputs are all vectors.

9.1 Attention
Two kinds of attention are described in the paper. The input consists of queries and keys of dimension d_k, and values of dimension d_v.

• Scaled Dot-Product Attention: computed via matrix multiplication on a set of queries, using highly optimized code. Q, K, V are the query, key and value matrices into which the respective vectors are packed. It is much faster and more space-efficient in practice than additive attention (see the sketch after this list).

Attention(Q, K, V) = softmax\Big(\frac{Q K^T}{\sqrt{d_k}}\Big) V   (16)

The product is divided by \sqrt{d_k} to counteract the softmax function being pushed into extremely low gradient regions when the dot products grow too large in magnitude for large values of d_k.

• Multi-Head Attention: Instead of a single attention function, the authors found it beneficial to project the queries, keys and values h (fixed as 8) times, in parallel, with different learned linear projections. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h) W^O   (17)

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (18)
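A minimal sketch of scaled dot-product attention (equation 16) applied as a single self-attention head is given below; the shapes and parameter names are illustrative, and a full multi-head layer would run h such heads and concatenate them as in equation 17.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))                         # token representations
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
head = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)   # one self-attention head
print(head.shape)                                          # (5, 8)
```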

Multi-head attention is utilized in three ways. In the encoder-decoder attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder, allowing every position in the decoder to attend over all positions in the input sequence (reminiscent of sequence-to-sequence models). In the encoder's self-attention layers, the queries, keys and values come from the output of the previous layer, allowing each position to attend to all positions in the previous encoder layer. In the decoder, however, as required by the auto-regressive property, each position is only allowed to attend to positions up to and including that position.

9.2 Position-wise Feed-Forward Networks
Each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. While the linear transformations are identical across different positions, there is no weight sharing between layers.

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

9.3 Positional Encoding
Since the model contains no recurrence (or convolution), positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks to impart knowledge about the order of the sequence. In this work, sine and cosine functions of different frequencies are used:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})

PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})

where pos is the position and i is the dimension.
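A minimal sketch of computing these sinusoidal encodings is shown below (assuming an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)   # (50, 512)
```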

9.4 Advantages of Self-Attention
The authors mention three key benefits of using self-attention:

• Total computational complexity per layer: the complexity per layer is O(n^2 d) for self-attention layers and O(n d^2) for recurrent layers, making the former faster when the sequence length n is smaller than the representation dimensionality d, which is most often the case.

• A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. This makes parallelizing computation easier in the former case.

• The path length between long-range dependencies in the network is also constant with self-attention layers. Learning long-range dependencies is a key challenge in many sequence transduction tasks, and shorter paths make learning easier.

    10 Byte Pair Encoding

Closed vocabulary systems are those that have a finite, predefined vocabulary; the test set will contain only words from the vocabulary itself, i.e. there will be no unknown words. Translation, however, is an open vocabulary problem: in real-world situations, a word in the test set may never have been seen during training. Such words are called Out of Vocabulary (OOV) words. In an open vocabulary system, such OOV words are replaced by a special token, <UNK>, while the output is generated. Neural machine translation (NMT) models typically operate with a fixed vocabulary on the source and target side (due to memory and computational constraints), so OOV words pose a problem that impacts the output's fluency and adequacy. Instead of using words as input and output tokens during translation, Sennrich et al. (2016) found that, in addition to making the translation process simpler, subword models achieve better accuracy for the translation of rare words than large-vocabulary models and back-off dictionaries. They are also able to productively generate new words that were not seen at training time, i.e. they can learn compounding and transliteration from subword representations. The intuition behind this approach comes from the fact that some words are translatable even if they are new to a competent translator, based on a translation of known subword units such as morphemes or phonemes. Such words can be of these types:

• Named entities: can often be copied from source to target text via transcription or transliteration. Example: Barack Obama (English) as बराक ओबामा (Hindi).

• Cognates and loanwords: words with a common origin can differ in regular ways between languages, so that character-level translation rules suffice. Example: Kettle (English) as केतली (ketlii, Hindi).

• Morphologically complex words: words containing multiple morphemes, for instance formed via compounding, affixation or inflection, may be translatable by translating the morphemes separately. Example: Solar System (English) as Sonnensystem (Sonne + System, German).

In an analysis of a sample of 100 rare tokens in their German training data, they found that the majority of tokens are potentially translatable from English through smaller units. They identified 56 compounds, 21 names, 6 loanwords with a common origin (emancipate -> emanzipieren), 5 cases of transparent affixation, 1 number, and 1 computer language identifier.

Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. The authors adapted this algorithm for word segmentation by merging characters or character sequences instead.

Their algorithm is as follows (a code sketch follows this list):

• Initialize the symbol vocabulary with the character vocabulary. Represent each word as a sequence of characters, plus a special end-of-word symbol '·' to allow restoration of the original tokenization after translation.

• Iteratively count all symbol pairs and replace each occurrence of the most frequent pair ('X', 'Y') with a new symbol 'XY'.

• Each merge operation produces a new symbol that represents a character n-gram. Frequent character n-grams (or whole words) are eventually merged into a single symbol, so BPE requires no shortlist.
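The following compact sketch of the learning procedure is adapted from the pseudocode published by Sennrich et al. (2016); the toy vocabulary and the number of merges are illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over the space-separated word representations."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# words are sequences of characters plus the end-of-word marker '·'
vocab = {"l o w ·": 5, "l o w e r ·": 2, "n e w e s t ·": 6, "w i d e s t ·": 3}
for _ in range(10):                      # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)     # most frequent symbol pair
    vocab = merge_pair(best, vocab)
    print(best)
```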

They evaluated two methods of applying BPE:

• Learning two independent encodings, one for the source and one for the target vocabulary. This has the advantage of being more compact in terms of text and vocabulary size, and of giving stronger guarantees that each subword unit has been seen in the respective language's training corpus.

• Joint BPE: learning the encoding on the union of the two vocabularies. This improves consistency between the source and target segmentations.

When BPE is applied independently, the same name may be segmented differently in the two languages, which makes it harder for the neural models to learn a mapping between the subword units. To increase the consistency between English and Russian segmentation despite the differing alphabets, they transliterated the Russian vocabulary into Latin characters with ISO-9, learnt the joint BPE encoding, and then transliterated the BPE merge operations back into Cyrillic to apply them to the Russian training corpus. For OOVs, the baseline strategy of copying unknown words works well for English -> German; however, when the alphabets differ, as in English -> Russian, the subword models do much better.

    The final symbol vocabulary size is equal to the size of the initial vocabulary plus the number of merge operations. The number of merge operations thus acts as a hyperparameter that can be tuned for optimal performance.

    11 Pivot Based MT

    Having looked at some of the literature in the domain of subword translation, we now look at another way to alleviate the problem of data scarcity for low-resource languages, namely pivot-based approaches. In this approach we use an intermediate language, called a pivot language, such that we have parallel corpora from the actual source to the pivot and from the pivot to the actual target. In this way, we circumvent the need for a parallel corpus from the actual source to the actual target language. The use of pivot languages is a boon especially in situations where the parallel corpus between the source and the target is very scarce. For pivot-based translation we have several approaches that we can use:

    1. Triangulation Approach: In this approach, we combine the source-pivot phrase table and the pivot-target phrase table to get the source-target phrase table (a minimal sketch of this combination is given after the list).

    2. Sentence Translation Approach: In this approach, we first translate from the source to the pivot and then from the pivot to the target language.

    3. Corpus Synthesis Approach: In this approach, we build a synthetic source-target corpus, either by translating the pivot sentences of the source-pivot corpus using the pivot-target translation system, or by translating the pivot sentences of the pivot-target corpus using the source-pivot translation system.
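    As a concrete illustration of the triangulation approach in item 1, the Python sketch below marginalises over the shared pivot phrases to combine the two tables. The function name, the dictionary-based table layout, and the toy Hindi→English→German entries are our own assumptions for illustration.

```python
from collections import defaultdict

def triangulate(src_piv, piv_tgt):
    """Combine a source-pivot and a pivot-target phrase table into a
    source-target table by marginalising over the shared pivot phrases:
        p(t | s) = sum_z p(t | z) * p(z | s)
    Each table maps a phrase to a dict of {translation: probability}."""
    src_tgt = defaultdict(lambda: defaultdict(float))
    for s, piv_probs in src_piv.items():
        for z, p_z_given_s in piv_probs.items():
            for t, p_t_given_z in piv_tgt.get(z, {}).items():
                src_tgt[s][t] += p_t_given_z * p_z_given_s
    return {s: dict(t) for s, t in src_tgt.items()}

# Toy example, Hindi -> English (pivot) -> German:
src_piv = {'ghar': {'house': 0.7, 'home': 0.3}}
piv_tgt = {'house': {'Haus': 0.9}, 'home': {'Zuhause': 0.8}}
print(triangulate(src_piv, piv_tgt))
# {'ghar': {'Haus': 0.63, 'Zuhause': 0.24}}  (up to floating-point noise)
```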

    One of the major problems in pivot-based translation systems is the propagation of error. The translation produced by the pivot-target system depends on the quality of the translation of the source-pivot system, i.e. any errors in the first stage are bound to be carried onto the second stage. The paper by Cheng (2019) addresses this concern by joint training for pivot-based NMT. The crux of the approach is that the source-to-pivot and the pivot-to-target systems are allowed to interact with each other during the training phase, either through sharing of word embeddings on the pivot language or by maximizing the likelihood of the cascaded model on a small source-target parallel corpus. The objective function can be visualized as consisting of a source-to-pivot likelihood, a pivot-to-target likelihood, and a connection term that connects both models and whose influence is governed by a hyperparameter. The models are connected using word embeddings of the pivot; they can also be connected by using a small bridging corpus.
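    Schematically, and only as a hedged reading of the description above (the exact formulation in Cheng (2019) differs in its details), the joint objective can be written as follows, where x, z and y denote source, pivot and target sentences and the notation is ours:

```latex
% Two direction-specific likelihoods plus a connection term J_conn,
% whose influence is governed by the hyperparameter \lambda.
J(\theta_{x\to z}, \theta_{z\to y}) =
      \sum_{(x,z)} \log P(z \mid x; \theta_{x\to z})
    + \sum_{(z,y)} \log P(y \mid z; \theta_{z\to y})
    + \lambda \, J_{\mathrm{conn}}(\theta_{x\to z}, \theta_{z\to y})
```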

    12 Word and Morpheme injected NMT

    We suspected that BPE could cause ambiguity due to word segmentation, as smaller word segments might have less context when compared to the whole word. This led us to augment our BPE model by adding word and morpheme features to it. The idea was to merge word and morpheme embeddings with the underlying BPE embeddings, to explore whether this addition can help the model mitigate the problem of ambiguity and learn better. For combining the embeddings, we have used a multi-layer perceptron.

    We updated this model by initializing it with pre-trained BPE embeddings. We accomplished this by first obtaining pre-trained BPE embeddings from (Heinzerling and Strube, 2018) and then using these pre-trained BPE embeddings to initialize our model. We managed to get pre-trained embeddings for around eighty per cent of our vocabulary, and the rest was initialized with random vectors.

    We further updated the model by replacing the softmax function with the sparsemax function while calculating attention. The intuition behind this is that there can be several words in a sentence which may not contribute at all to the attention, but the softmax function will still give them some non-zero (though low) probability score. This could lead to an unnecessary dilution of the final context vector, since the context vector is a weighted sum over all the words using these probability scores. With the sparsemax function we can eliminate this problem, as sparsemax truncates to zero those probability scores that are too low (a minimal sketch is given below).
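    Below is a minimal NumPy sketch of the standard sparsemax projection, illustrating the truncation behaviour described above. Note that the threshold is computed from the scores themselves rather than being a user-supplied constant; the function name and the example scores are ours.

```python
import numpy as np

def sparsemax(scores):
    """Sparsemax over a 1-D vector of attention scores.
    Like softmax it returns a probability distribution, but it projects the
    scores onto the simplex, so sufficiently low scores become exactly zero."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = k[1 + k * z_sorted > cumsum]      # positions kept in the support
    k_max = support[-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max     # data-dependent threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([2.0, 1.0, 0.1]))   # [1. 0. 0.] -- the low scores are zeroed out
```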

    13 BERT augmented NMT

    BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. BERT is typically used for other downstream tasks like next sentence prediction and text classification. We aimed to leverage the BERT model in our NMT task. We accomplished this by taking the sentence vector (which captures the contextual relationship between words in a sentence) generated by BERT and merging it with the sentence vector generated by the encoder, using a multi-layer perceptron (MLP). The sentence vector generated by BERT has 768 dimensions, while the one produced by the encoder has 300 dimensions. The final vector obtained after merging the former two is of 768 dimensions, and is then passed to the decoder (a minimal sketch of this merger is given below).
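    A minimal PyTorch sketch of such a merger follows. The class name, the choice of a two-layer MLP, and the hidden size of 512 are assumptions made for illustration; only the 768-, 300- and 768-dimensional interfaces come from the description above.

```python
import torch
import torch.nn as nn

class SentenceVectorMerger(nn.Module):
    """Merge a 768-d BERT sentence vector with a 300-d encoder sentence
    vector into a single 768-d vector that is then passed to the decoder."""
    def __init__(self, bert_dim=768, enc_dim=300, out_dim=768, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bert_dim + enc_dim, hidden_dim),  # hidden size is an assumption
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, bert_vec, enc_vec):
        # bert_vec: (batch, 768), enc_vec: (batch, 300)
        return self.mlp(torch.cat([bert_vec, enc_vec], dim=-1))

merger = SentenceVectorMerger()
merged = merger(torch.randn(2, 768), torch.randn(2, 300))
print(merged.shape)   # torch.Size([2, 768])
```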

    14 Conclusion

    Machine Translation has come a long way since the time of RBMT. Now, humans do not need to hand-craft their translation systems, programming them with rules and examples that would guide them. That approach did not scale well, suffered from a high-precision, low-recall fault, and was cumbersome and expensive. Data became the driver of MT systems, with SMT learning mappings and their probabilities from corpora. Such systems required calibration of several subsystems. Their success led them to be the reigning paradigm for a long time. In recent years, however, with powerful processors and massive amounts of data, Neural Networks managed to catch up with SMT. SMT still performs better in low-resource conditions, but NMT has the advantage of being an end-to-end model that can be trained jointly. NMT achieves translations of good quality due to its attention mechanism, whereby it can concentrate on different parts of the input sentence and the output it has generated thus far to predict the next token. To rectify the issue of NMT needing large amounts of parallel data, researchers are trying to adopt unsupervised approaches and leverage monolingual data. The key tenets of this are proper initialization of the model, good language models and the concept of back-translation. There is still scope for improving the performance of such models and for understanding how exactly these models work.

    References

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

    Pushpak Bhattacharyya. 2015. Machine Translation. CRC Press.

    Yong Cheng. 2019. Joint training for pivot-based neural machine translation. In Joint Training for Neural Machine Translation, pages 41–54. Springer.

    Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, pages 21–30. Association for Computational Linguistics.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.

    Yoav Goldberg and Omer Levy. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

    Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

    Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2979–2984, Brussels, Belgium. Association for Computational Linguistics.

    Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Pushpak Bhattacharyya. 2014. Sata-Anuvadak: Tackling multiway translation of Indian languages. pan, 841(54,570):4–135.

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

    Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

    Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, and Jordan Boyd-Graber. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with iterative normalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3180–3189, Florence, Italy. Association for Computational Linguistics.
