


Invited Applications Paper

Climbing the Tower of Babel: Unsupervised Multilingual Learning

Benjamin Snyder [email protected]

Regina Barzilay [email protected]

MIT Computer Science & Artificial Intelligence Lab, 32 Vassar Street, Cambridge, MA 02139 USA

Abstract

For centuries, scholars have explored the deep links among human languages. In this paper, we present a class of probabilistic models that use these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for the computational decipherment of lost languages.

1. Overview

Electronic text is currently being produced at a vast and unprecedented scale across the languages of the world. Natural Language Processing (NLP) holds out the promise of automatically analyzing this growing body of text. However, over the last several decades, NLP research efforts have focused on the English language, often neglecting the thousands of other languages of the world (Bender, 2009).

Most of these languages are currently beyond the reach of NLP technology due to several factors. One of these is simply the lack of the kinds of hand-annotated linguistic resources that have helped propel the performance of English language systems. For complex tasks of linguistic analysis, hand-annotated corpora can be prohibitively time-consuming and expensive to produce. For example, the most widely used annotated corpus in the English language, the Penn Treebank (Marcus et al., 1994), took years for a team of professional linguists to produce. It is unrealistic to expect such resources to ever exist for the majority of the world's languages.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Another difficulty for multilingual NLP is that languages exhibit wide variation in their underlying linguistic structure. A model that has been developed for one language may not account for the kinds of structure found in others. In fact, there exists an entire academic discipline devoted to studying and describing systematic cross-lingual variations in language structure, known as linguistic typology (Comrie, 1989).

At first glance, it may seem that linguistic diversity would make developing intelligent text-processing tools for the world's languages a very daunting task. However, we argue that in fact it is possible to harness systematic linguistic diversity and use it to our advantage, utilizing a framework which we call multilingual learning. The goal of this enterprise is two-fold:

• To induce more accurate models of individual language structure without any human annotation.

• To induce accurate models of the relationships between languages.

The multilingual learning framework is based on the hypothesis that cross-lingual variations in linguistic structure correspond to variations in ambiguity. As an example, consider the syntactically ambiguous English sentence: "I ate pasta with cheese." The prepositional phrase "with cheese" can be interpreted as attaching to the noun "pasta" (meaning the pasta had cheese), or as attaching to the verb "ate" (meaning perhaps that the pasta was eaten by means of a cheese-based utensil). As humans, we know that the first of these is the only plausible interpretation, but there is nothing in the sentence itself to indicate the correct parse. In contrast, the parallel sentence in Japanese uses an explicit genitive marker to mark the fact that the word for "pasta" is being modified.

This example is an instance of a more general phenomenon: what one language leaves implicit, and thus ambiguous for computers or humans, another will express directly through overt linguistic forms. In the framework of multilingual learning, we treat these vari-


2. Part-of-Speech Tagging

that the number of induced values ranges from 11 (for a pair of languages) to 17 (for eight languages).

We evaluate our model on a parallel corpus of eight languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Serbian, and Slovene (Erjavec, 2004). We perform inference using Markov chain Monte Carlo sampling and always test on held-out monolingual data for each language. We ran our inference algorithm over all 255 subsets of the eight languages in our corpus, so we could examine the average change in performance as the number of languages increases. In the monolingual scenario, our model reduces to the Bayesian HMM of Goldwater & Griffiths (2007). When a complete part-of-speech dictionary (i.e., entries indicating the set of potential parts of speech for each word) is available and our model is trained using eight languages, average tag prediction accuracy increases from 91.1% for monolingual models to 95%. In more realistic cases, where the tag dictionary is restricted to only frequently occurring words, we see even larger gaps between monolingual and multilingual performance. In one such scenario, where dictionary entries are only available for words occurring more than five times in the corpus, average multilingual performance increases to 82.8% from the monolingual baseline of 74.8%. As seen in Figure 3, accuracy increases steadily as languages are added to the mix.
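In the monolingual case, the model reduces to the Bayesian HMM just cited, which is compact enough to sketch. The following Python sketch is illustrative only and is not the authors' multilingual model: a bigram HMM with symmetric Dirichlet priors over transitions and emissions, resampled by collapsed Gibbs sampling, with each word's candidate tags restricted to its tag dictionary entries. The toy corpus, tag set, and hyperparameters are invented, and the sampler omits the exact collapsed-posterior correction terms for adjacent identical tags.

```python
import random
from collections import defaultdict

random.seed(0)
ALPHA, BETA = 1.0, 0.1          # symmetric Dirichlet hyperparameters (illustrative)
TAGS = ["DET", "NOUN", "VERB"]
BOS = "<s>"                     # sentence-boundary pseudo-tag

# Toy corpus and tag dictionary; the real experiments use the MULTEXT-East
# corpus with dictionaries of varying coverage.
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
tag_dict = {"the": ["DET"], "dog": ["NOUN"], "cat": ["NOUN"],
            "barks": ["VERB", "NOUN"], "sleeps": ["VERB", "NOUN"]}
V = len(tag_dict)               # vocabulary size for emission smoothing

# Random dictionary-consistent initialization, plus count tables.
tags = [[random.choice(tag_dict[w]) for w in sent] for sent in corpus]
trans, emit, tag_tot = defaultdict(int), defaultdict(int), defaultdict(int)
for sent, ts in zip(corpus, tags):
    prev = BOS
    for w, t in zip(sent, ts):
        trans[(prev, t)] += 1; emit[(t, w)] += 1; tag_tot[t] += 1
        prev = t

def gibbs_sweep():
    """Resample each tag from its conditional, holding all others fixed."""
    for sent, ts in zip(corpus, tags):
        for i, w in enumerate(sent):
            old = ts[i]
            prev = ts[i - 1] if i > 0 else BOS
            nxt = ts[i + 1] if i + 1 < len(ts) else None
            # Subtract the current assignment from the counts.
            trans[(prev, old)] -= 1; emit[(old, w)] -= 1; tag_tot[old] -= 1
            if nxt is not None:
                trans[(old, nxt)] -= 1
            # Weight each dictionary-licensed tag by the collapsed posterior
            # (omitting the exact correction terms for adjacent equal tags).
            cand = tag_dict[w]
            weights = []
            for t in cand:
                w_t = (trans[(prev, t)] + ALPHA) \
                    * (emit[(t, w)] + BETA) / (tag_tot[t] + BETA * V)
                if nxt is not None:
                    w_t *= (trans[(t, nxt)] + ALPHA) / (tag_tot[t] + ALPHA * len(TAGS))
                weights.append(w_t)
            new = random.choices(cand, weights=weights)[0]
            ts[i] = new
            trans[(prev, new)] += 1; emit[(new, w)] += 1; tag_tot[new] += 1
            if nxt is not None:
                trans[(new, nxt)] += 1

for _ in range(100):
    gibbs_sweep()
print(tags)   # e.g. [['DET', 'NOUN', 'VERB'], ['DET', 'NOUN', 'VERB']]
```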

3. Morphological Segmentation

In the task of morphological analysis, the goal is to segment words into morphemes, the smallest units of meaning (e.g., "misunderstanding" segments into three morphemes: "mis understand ing"). While the morphology of English is fairly simple, many languages exhibit a richer and more productive set of morphological patterns. In the unsupervised setting, morphological segmentation consists of finding recurrent prefix and suffix patterns which allow a more compact representation of the many possible derived word forms. Our multilingual model for this task automatically induces a segmentation and morpheme alignment from a multilingual (unannotated) corpus of short parallel phrases. For example, given parallel phrases meaning "in my land" in English, Arabic, Hebrew, and Aramaic, we wish to segment and align morphemes as shown in Figure 4.
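To make the search space concrete, here is a minimal monolingual sketch of segmentation as model selection: enumerate all splits of a word and score each by a smoothed unigram morpheme model, so that segmentations reusing frequent recurring substrings win. The morpheme counts are invented, and this toy is only a stand-in for the bilingual Bayesian model described below.

```python
counts = {"mis": 40, "understand": 25, "ing": 120, "stand": 30, "under": 50}
TOTAL = sum(counts.values())
UNSEEN = 0.5   # smoothing mass for previously unseen morphemes

def morpheme_prob(m):
    """Smoothed unigram probability of a candidate morpheme."""
    return (counts.get(m, 0) + UNSEEN) / (TOTAL + UNSEEN)

def segmentations(word):
    """Yield every split of `word` into non-empty contiguous morphemes."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest

def best_segmentation(word):
    """Pick the split whose morphemes have the highest joint probability."""
    def score(seg):
        p = 1.0
        for m in seg:
            p *= morpheme_prob(m)
        return p
    return max(segmentations(word), key=score)

print(best_segmentation("misunderstanding"))   # ['mis', 'understand', 'ing']
```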

This example illustrates the potential benefits of unsupervised multilingual morphological analysis. The three Semitic languages use cognates (words derived from a common ancestor) to represent the word "land".


Figure 4. Morphological segmentation and alignment of parallel phrases meaning "in my land" in English, Arabic, Hebrew, and Aramaic.

They also use an identical suffix (-y) to represent the first person possessive pronoun ("my"). These similarities in form should guide the model by constraining the space of joint segmentations and alignments. The corresponding English phrase lacks this resemblance to its Semitic counterparts. However, in this as in many cases, no segmentation is required for English, as all the morphemes are expressed as individual words. For this reason, English should provide a strong source of disambiguation for highly inflected languages, such as Arabic and Hebrew. More generally speaking, our model exploits the fact that each language distributes morphemes across words in a unique pattern. Note that morphemes expressed in one language often have no counterpart at all in some other languages, so morphemes must be allowed to remain unaligned.

The technical difficulty when compared to the part-of-speech model of Section 2 is that the units of alignment now depend on the results of the model's segmentation predictions. Whereas before we could treat word-level alignments as fixed and observed (as the result of preprocessing with standard NLP word-alignment tools), we must now fold alignment uncertainty into the morphology model itself.

We start with a sketch of the probabilistic process posited by our model for the generation of short bilingual phrases (see Figure 5 for an accompanying example). First, the numbers of unaligned language-specific morphemes (m and n) and the number of aligned morpheme pairs (k) are drawn from a Poisson distribution. These are the numbers of morphemes that will ultimately compose the bilingual parallel phrase. Next, the morphemes are drawn from the appropriate distributions: m and n morphemes are respectively drawn from language-specific morpheme distributions E and F, and k bilingual morpheme pairs are drawn from A. The resulting morphemes for each language are finally ordered and fused into words.
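As a rough illustration, this generative story can be written down directly. The sketch below uses invented morpheme inventories and rates, ASCII placeholders instead of real Arabic or Hebrew forms, and a trivial shuffle in place of the model's ordering-and-fusion step; actual use of the model runs this process in reverse, inferring segmentations and alignments from observed phrases.

```python
import math
import random

random.seed(0)
RATE = 2.0   # illustrative Poisson rate shared by m, n, and k

E = {"in": 0.5, "my": 0.3, "land": 0.2}        # language-specific morphemes
F = {"b-": 0.4, "ar": 0.3, "-y": 0.3}          # (placeholder inventories)
A = {("my", "-y"): 0.5, ("land", "ar"): 0.5}   # aligned morpheme pairs

def sample(dist):
    """Draw one key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def poisson(rate):
    """Knuth's method; adequate for the small rates used in this sketch."""
    L, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def generate_phrase_pair():
    # Draw the numbers of unaligned morphemes (m, n) and aligned pairs (k).
    m, n, k = poisson(RATE), poisson(RATE), poisson(RATE)
    # Draw unaligned morphemes from E and F, and aligned pairs jointly from A.
    lang1 = [sample(E) for _ in range(m)]
    lang2 = [sample(F) for _ in range(n)]
    for _ in range(k):
        a, b = sample(A)
        lang1.append(a)
        lang2.append(b)
    # "Order and fuse into words": a shuffle stands in for the real step.
    random.shuffle(lang1)
    random.shuffle(lang2)
    return " ".join(lang1), "".join(lang2)

print(generate_phrase_pair())
```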

As in the previous section, the scope of cross-lingual connections (now in the form of aligned morpheme pairs) is not known a priori. Indeed, even the number of morphemes in each language is not known in


4. Grammar Induction

Figure 6. A pair of trees (i) and two possible alignment trees. In (ii), no empty spaces are inserted, but the order of one of the original tree's siblings has been reversed. In (iii), only two pairs of nodes have been aligned (indicated by arrows) and many empty spaces inserted.

through a similar comparison to the English sentence.

However, even in this simplest of sentence pairs, we notice syntactic divergence. While the English sentence uses the simple transitive verb "climbed" to express the fact that John completed his climb of Everest, the verb in the Hindi/Urdu sentence takes the postpositional argument "Everest on." The syntactic divergence in real-life examples becomes only more severe. The key challenge then is representational. We need to parse both sentences with possibly quite divergent trees, while recognizing shared syntactic structure. In effect, we seek to produce two loosely bound trees: node-to-node alignments need only be used where repeated bilingual patterns can be discerned in the data.

We achieve this loose binding of trees by adapting unordered tree alignment (Jiang et al., 1995) to a probabilistic setting. Under this formalism, any two trees can be aligned using an alignment tree. The alignment tree embeds the original two trees within it: each node is labeled by a pair (x, y), (λ, y), or (x, λ), where x is a node from the first tree, y is a node from the second tree, and λ is an empty space. The individual structure of each tree must be preserved under the embedding, with the exception of sibling order (to allow variations in phrase and word order).

The flexibility of this formalism can be demonstrated by two extreme cases: (1) an alignment between two trees may actually align none of their individual nodes, instead inserting an empty space λ for each of the original two trees' nodes; (2) if the original trees are isomorphic to one another, the alignment may match their nodes exactly, without inserting any empty spaces. See Figure 6 for an example. An additional benefit of this formalism is computational: the marginalized probability over all possible alignments for any two trees can be efficiently computed with a dynamic program in time bilinear in the sizes of the two trees.
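The sketch below illustrates this marginalization under strong simplifying assumptions: the two roots are aligned, a matched node's children are either matched to each other or left wholly unaligned, and sibling order is ignored. The weight functions are invented placeholders, and for readability the child matchings are enumerated directly rather than computed with the polynomial-time decomposition of Jiang et al. (1995), which also permits more general embeddings.

```python
from functools import lru_cache
from itertools import combinations, permutations

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, tuple(children)

def p_pair(x, y):
    """Weight for aligning node x with node y (placeholder model)."""
    return 0.9 if x.label == y.label else 0.1

def p_skip(node):
    """Weight for pairing a node with an empty space (lambda)."""
    return 0.2

@lru_cache(maxsize=None)
def unaligned(node):
    """Total weight of leaving an entire subtree unaligned."""
    w = p_skip(node)
    for c in node.children:
        w *= unaligned(c)
    return w

@lru_cache(maxsize=None)
def marginal(x, y):
    """Sum over all restricted alignment trees in which x aligns with y."""
    xs, ys = x.children, y.children
    total = 0.0
    for k in range(min(len(xs), len(ys)) + 1):          # matched child pairs
        for xi in combinations(range(len(xs)), k):      # which children of x
            for yi in permutations(range(len(ys)), k):  # match which of y
                w = 1.0
                for i, j in zip(xi, yi):
                    w *= marginal(xs[i], ys[j])
                for i in set(range(len(xs))) - set(xi):
                    w *= unaligned(xs[i])
                for j in set(range(len(ys))) - set(yi):
                    w *= unaligned(ys[j])
                total += w
    return p_pair(x, y) * total

# Sibling order is ignored: VP(V NP) aligns with VP(NP V) just as well.
t1 = Node("VP", [Node("V"), Node("NP")])
t2 = Node("VP", [Node("NP"), Node("V")])
print(marginal(t1, t2))
```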

We formulated a generative Bayesian model which seeks to explain sentence- and word-aligned parallel sentences through a combination of bilingual and monolingual syntactic parameters. Our model views each bilingual pair of sentences as having been probabilistically generated as follows: First, an alignment tree is drawn uniformly from the set of all such trees. This alignment tree specifies the structure of each of the two individual trees, as well as the pairs of nodes which are aligned and those which are not aligned (i.e., paired with a λ). For each pair of aligned nodes, a corresponding pair of sentence constituents is jointly drawn from a bilingual distribution. For unaligned nodes (i.e., nodes paired with a λ in the alignment tree), a single sentence constituent is drawn, in this case from a language-specific distribution. Finally, word-level alignments are drawn based on the structure of the alignment tree.

To perform inference under this model, we use a Metropolis-Hastings within-Gibbs sampler. We sample pairs of trees and then compute marginalized probabilities over all possible alignments using dynamic programming.
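Schematically, one such update looks as follows. This is a generic Metropolis-Hastings-within-Gibbs skeleton, not the paper's actual sampler: the proposal distribution and the alignment-marginalized likelihood (for which the dynamic-program sketch above could serve) are passed in as placeholders.

```python
import math
import random

def mh_step(current, propose, log_marginal_likelihood):
    """One Metropolis-Hastings update for a single sentence pair's trees."""
    proposal, log_q_fwd, log_q_rev = propose(current)
    log_accept = (log_marginal_likelihood(proposal) + log_q_rev
                  - log_marginal_likelihood(current) - log_q_fwd)
    if math.log(random.random()) < log_accept:
        return proposal        # accept the proposed tree pair
    return current             # reject: keep the current tree pair

def gibbs_sweep(state, propose, log_marginal_likelihood):
    """Cycle through sentence pairs, resampling each pair's trees in turn."""
    for idx in range(len(state)):
        state[idx] = mh_step(state[idx], propose, log_marginal_likelihood)
    return state
```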

We tested the effectiveness of our bilingual grammar induction model on three corpora of parallel text: English-Korean, English-Urdu, and English-Chinese. The model is trained using bilingual data with automatically induced word-level alignments, but is tested on purely monolingual data for each language. In all cases, our model outperforms a state-of-the-art baseline, the Constituent Context Model (CCM) (Klein & Manning, 2002), sometimes by substantial margins. On average, over all the testing scenarios that we studied, our model achieves an absolute increase in F-measure of 8.8 points, and a 19% reduction in error



monolingual test data. We believe this to be a realistic scenario for a large number of the world's languages, as parallel texts are widely available. Finally, in Section 5, we considered the special case of lost language decipherment, where parallel text is not present, but information about a closely related language is available.

For future work, we pose the following two questions: (i) Can multilingual learning be used to triangulate the information content of sentences in multiple languages? (ii) Can knowledge of linguistic typology (and universal features of language) be used to induce more accurate unsupervised models, even without the use of parallel text?

References

Bender, Emily M. Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pp. 26–32, Morristown, NJ, USA, 2009. Association for Computational Linguistics.

Charniak, Eugene and Carroll, Glen. Two experiments on learning probabilistic dependency grammars from corpora. In Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques, pp. 1–13, 1992.

Comrie, Bernard. Language universals and linguistic typology: Syntax and morphology. Oxford: Blackwell, 1989.

Erjavec, T. MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In Fourth International Conference on Language Resources and Evaluation, LREC, volume 4, pp. 1535–1538, 2004.

Ferguson, T.S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the ACL, pp. 744–751, 2007.

Jiang, T., Wang, L., and Zhang, K. Alignment of trees– an alternative to tree edit. Theoretical ComputerScience, 143(1):137–148, 1995.

Klein, Dan and Manning, Christopher D. A generative constituent-context model for improved grammar induction. In Proceedings of the ACL, pp. 128–135, 2002.

Marcus, M.P., Santorini, B., and Marcinkiewicz, M.A. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1994.

Merialdo, Bernard. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994.

Naseem, Tahira, Snyder, Benjamin, Eisenstein, Jacob, and Barzilay, Regina. Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research, 36(1):341–385, 2009. ISSN 1076-9757.

Och, Franz Josef and Ney, Hermann. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.

Robinson, Andrew. Lost Languages: The Enigma of the World's Undeciphered Scripts. McGraw-Hill, 2002.

Snyder, Benjamin and Barzilay, Regina. Unsupervised multilingual learning for morphological segmentation. In Proceedings of the ACL/HLT, pp. 737–745, 2008a.

Snyder, Benjamin and Barzilay, Regina. Cross-lingual propagation for morphological analysis. In Proceedings of the AAAI, pp. 848–854, 2008b.

Snyder, Benjamin, Naseem, Tahira, Eisenstein, Jacob, and Barzilay, Regina. Unsupervised multilingual learning for POS tagging. In Proceedings of EMNLP, pp. 1041–1050, 2008.

Snyder, Benjamin, Naseem, Tahira, and Barzilay, Regina. Unsupervised multilingual grammar induction. In Proceedings of the ACL, pp. 73–81, 2009a.

Snyder, Benjamin, Naseem, Tahira, Eisenstein, Jacob, and Barzilay, Regina. Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach. In Proceedings of the NAACL, pp. 83–91, 2009b.