
ACL 2019

TyP-NLP: The Workshop on Typology for Polyglot NLP

Proceedings of the First Workshop

August 1, 2019
Florence, Italy

Page 2: Proceedings of TyP-NLP: The First Workshop on Typology for ... · ACL 2019 TyP-NLP: The Workshop on Typology for Polyglot NLP Proceedings of the First Workshop August 1, 2019 Florence,

© 2019 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-950737-29-1



Introduction

The TyP-NLP workshop will be the first dedicated venue for typology-related research and its integration into multilingual Natural Language Processing (NLP). The workshop will be hosted by the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) in Florence, Italy.

The ultimate goal of TyP-NLP is the development of robust language technology applicable across the world's languages. Long overdue, the workshop is specifically aimed at raising awareness of linguistic typology and its potential in supporting and widening the global reach of multilingual NLP. It will foster research and discussion on open problems relevant to the multilingual NLP community, but it will also invite input from leading researchers in linguistics and cognitive sciences.

The final program of TyP-NLP contains 5 keynote talks and 18 accepted posters, selected among a large number of non-archival submissions. This workshop would not have been possible without the hard work of its program committee. We would like to express our gratitude to them for writing meticulous reviews in a very constrained span of time. We should also thank our invited speakers, Isabelle Augenstein, Emily Bender, Balthasar Bickel, Jason Eisner, and Sabine Stoll, for their irreplaceable contribution to our program. The workshop is generously sponsored by Google and by the European Research Council (ERC) Consolidator Grant LEXICAL (no. 648909).

Find more details on the TyP-NLP 2019 website: https://typology-and-nlp.github.io/



Organizing Committee:

Haim Dubossarsky, University of Cambridge
Arya D. McCarthy, Johns Hopkins University
Edoardo M. Ponti, University of Cambridge
Ivan Vulić, University of Cambridge
Ekaterina Vylomova, University of Melbourne

Steering Committee:

Yevgeni Berzak, MIT
Ryan Cotterell, University of Cambridge
Manaal Faruqui, Google
Eitan Grossman, Hebrew University of Jerusalem
Anna Korhonen, University of Cambridge
Thierry Poibeau, Centre national de la recherche scientifique (CNRS)
Roi Reichart, Technion - Israel Institute of Technology

Program Committee:

Željko Agić, University of Copenhagen
Ehsaneddin Asgari, University of California, Berkeley
Barend Beekhuizen, University of Toronto
Christian Bentz, University of Zürich
Richard Futrell, UC Irvine
Edward Gibson, MIT
Giuseppe Celano, Leipzig University
Mans Hulden, University of Colorado
Gerhard Jäger, University of Tübingen
Katharina Kann, NYU
Douwe Kiela, Facebook
Silvia Luraghi, University of Pavia
John Mansfield, University of Melbourne
David R. Mortensen, Carnegie Mellon University
Phoebe Mulcaire, University of Washington
Jason Naradowsky, Johns Hopkins
Joakim Nivre, Uppsala University
Robert Östling, Stockholm University
Ella Rabinovich, University of Toronto
Michael Regan, University of New Mexico
Tanja Samardžić, University of Zurich
Hinrich Schütze, University of Munich
Sabine Stoll, University of Zurich
Jörg Tiedemann, University of Helsinki
Reut Tsarfaty, Open University of Israel
Yulia Tsvetkov, Carnegie Mellon University



Daan van Esch, Google
Shuly Wintner, University of Haifa

Invited Speakers:

Isabelle Augenstein, University of Copenhagen
Emily M. Bender, University of Washington
Balthasar Bickel, University of Zurich
Jason Eisner, Johns Hopkins University
Sabine Stoll, University of Zurich

Panelists:

Timothy Baldwin, University of Melbourne
Emily M. Bender, University of Washington
Balthasar Bickel, University of Zurich
Kilian Evang, University of Düsseldorf
Jungo Kasai, University of Washington
Yova Kementchedjhieva, University of Copenhagen
Sabine Stoll, University of Zurich



Conference Program

Thursday, August 1, 2019

7:30–8:45 Breakfast

8:45–9:00 Opening Remarks

9:00–9:45 Invited Talk: Emily M. Bender

9:45–10:30 Invited Talk: Jason Eisner

10:30–11:00 Coffee Break

11:00–12:30 Poster Session

Unsupervised Document Classification in Low-resource Languages for Emergency Situations
Nidhi Vyas, Eduard Hovy and Dheeraj Rajagopal

Polyglot Parsing for One Thousand and One Languages (and Then Some)
Ali Basirat, Miryam de Lhoneux, Artur Kulmizev, Murathan Kurfalı, Joakim Nivre and Robert Östling

Dissecting Treebanks to Uncover Typological Trends: A Multilingual Comparative Approach
Chiara Alzetta, Felice Dell'Orletta, Simonetta Montemagni and Giulia Venturi

What Do Multilingual Neural Machine Translation Models Learn about Typology?
Ryokan Ri and Yoshimasa Tsuruoka

Syntactic Typology from Plain Text Using Language Embeddings
Taiqi He and Kenji Sagae

Typological Feature Prediction with Matrix Completion
Annebeth Buis and Mans Hulden

Feature Comparison across Typological Resources
Tifa de Almeida, Youyun Zhang, Kristen Howell and Emily M. Bender

Using Typological Information in WALS to Improve Grammar Inference
Youyun Zhang, Tifa de Almeida, Kristen Howell and Emily M. Bender



Thursday, August 1, 2019 (continued)

Cross-linguistic Robustness of Infant Word Segmentation Algorithms: Oversegmenting Morphologically Complex Languages
Georgia R. Loukatou

Cross-linguistic Semantic Tagset for Case Relationships
Ritesh Kumar, Bornini Lahiri and Atul Kr. Ojha

AfricaSign - a Crowd-sourcing Platform for Lexical Documentation of African Sign Languages
Abdelhadi Soudi, Kristof Van Laerhoven and Elmosta Bou-Souf

Predicting Continuous Vowel Spaces in the Wilderness
Emily Ahn and David R. Mortensen

Towards a Multi-view Language Representation: A Shared Space of Discrete and Continuous Language Features
Arturo Oncevay, Barry Haddow and Alexandra Birch

Cross-lingual CCG Induction: Learning Categorial Grammars via Parallel Corpora
Kilian Evang

Towards a Computationally Relevant Typology for Polyglot/Multilingual NLP
Ada Wan

Transfer Learning for Cognate Identification in Low-Resource Languages
Eliel Soisalon-Soininen and Mark Granroth-Wilding

Contextualization of Morphological Inflection
Ekaterina Vylomova, Ryan Cotterell, Timothy Baldwin, Trevor Cohn and Jason Eisner

Towards Unsupervised Extraction of Linguistic Typological Features from Language Descriptions
Søren Wichmann and Taraka Rama

12:30–14:00 Lunch

14:00–14:45 Invited Talk: Balthasar Bickel

14:45–15:30 Invited Talk: Sabine Stoll



Thursday, August 1, 2019 (continued)

15:30–16:00 Coffee Break

16:00–16:45 Invited Talk: Isabelle Augenstein

16:45–17:30 Panel Discussion

17:30–17:45 Best Paper Announcement and Closing Remarks



Biographies of the Speakers

Isabelle Augenstein has been a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science, since July 2017. She is affiliated with the Copenhagen NLP group and the Machine Learning Section, and works in the general areas of Statistical Natural Language Processing and Machine Learning. Her main research interests are weakly supervised and low-resource learning, with applications including information extraction, machine reading and fact checking.

Emily M. Bender's primary research interests are in multilingual grammar engineering, the study of variation, both within and across languages, and the relationship between linguistics and computational linguistics. She is the LSA's delegate to the ACL. Her 2013 book Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax aims to present linguistic concepts in a manner accessible to NLP practitioners.

Jason Eisner works on machine learning, combinatorial algorithms, probabilistic models of linguistic structure, and declarative specification of knowledge and algorithms. His work addresses the question, "How can we appropriately formalize linguistic structure and discover it automatically?"

Balthasar Bickel aims at understanding the diversity of human language with rigorously tested causal models, i.e. at answering the question of what's where why in language. What structures are there, and how exactly do they vary? Engaged in both linguistic fieldwork and statistical modeling, he focuses on explaining universally consistent biases in the diachrony of grammar properties, biases that are independent of local historical events.

Sabine Stoll questions how children can cope with the incredible variation exhibited in the approximately 6,000–7,000 languages spoken around the world. Her main focus is the interplay of innate biological factors (such as the capacity for pattern recognition and imitation) with idiosyncratic and culturally determined factors (such as, for instance, the type and quantity of input). Her approach is radically empirical, based first and foremost on the quantitative analysis of large corpora that record how children learn diverse languages.



Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 1–3, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Unsupervised Document Classification in Low-resource Languages for Emergency Situations

Nidhi Vyas, Eduard Hovy, Dheeraj Rajagopal
Carnegie Mellon University

{nkvyas,dheeraj}@[email protected]

1 Introduction

During emergencies, relief workers need to constantly track updates so that they can learn of situations that require immediate attention (Stowe et al., 2016). However, it is challenging to carry out these efforts rapidly when the information is expressed in people's native languages, which have little to no resources for NLP. We aim to build an adaptable, language-agnostic system for such emergent situations that can classify incident-language documents into a set of relevant fine-grained classes of humanitarian needs and unrest situations. Our approach requires no language-specific feature engineering and instead leverages the semantic difference between generic class features to build a classification framework that supports relief efforts. We assume no knowledge of the incident language, except the commonly available bilingual dictionaries (which tend to be very small or are generated from out-of-domain data such as Bible alignments). First, we obtain keywords for each target class using English news corpora (Naik et al., 2017; Marujo et al., 2015; Wen and Rose, 2012; Ozgur et al., 2005; Tran et al., 2013), which are then translated using the available bilingual dictionary (Zhang et al., 2016; Adams et al., 2017). Second, an unsupervised bootstrapping module enhances the generic keywords by adding incident-specific, language-specific keywords (Knopp, 2011; Huang and Riloff, 2013; Ebrahimi et al., 2016). Next, we use all the keywords to generate labeled data. Finally, this data is used to train a downstream document classifier. This entire procedure is language-agnostic because it bypasses the necessity to create training data from scratch. We validate this procedure in a low-resource setup with 7 distinct languages, showing significant improvements over the baseline by at least 13 F1 points. To the best of our knowledge, our approach is the first to combine the use of distant supervision from English and in-language semantic bootstrapping for such a low-resource task. We believe our method can be used as a strong benchmark for future developments in low-resource unsupervised classification.

2 Approach

Figure 1 shows the overall architecture of our approach, composed of three primary modules.

Keywords: We use English in-domain corpora, viz. Google News and the Relief-Web corpus [1], to generate task-specific keywords, such that each keyword is strongly indicative of the underlying class(es) of a document. We cluster the documents based on their classes and use tf-idf to pick the top 100 candidate words for each class. We then compute a label affinity score between each candidate and the class labels using the cosine similarity between their corresponding Word2Vec embeddings [2]. In this way, each candidate keyword has a different association strength across all classes, and we only retain the ones above a threshold of 0.9. The pruned keywords are translated into the incident language using the available bilingual dictionary, dropping the ones that are absent. Finally, we use keyword spotting to label each document with the class(es) of the keyword(s) present in it.

Bootstrapping: Dropping keywords during translation leaves a significant fraction of documents unlabeled. To improve the percentage of labeled documents, we expand the keywords within the incident language using a two-step process. First, we cluster the labeled documents (obtained from keyword spotting) based on their classes.

[1] English documents pre-classified into disaster relief needs and emergency situations (https://reliefweb.int)

[2] https://code.google.com/archive/p/word2vec/



Figure 1: System architecture for low-resource document classification.

           Mandarin        Spanish         Uzbek           Farsi           Tigrinya        Uyghur          Oromo
           P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1    P    R    F1
Random     0.09 0.09 0.09  0.09 0.09 0.09  0.09 0.09 0.09  0.09 0.09 0.09  0.09 0.09 0.09  0.09 0.09 0.09  0.09 0.09 0.09
+ KWD      0.40 0.22 0.28  0.57 0.14 0.23  0.51 0.11 0.18  0.49 0.18 0.26  0.54 0.55 0.55  0.61 0.28 0.39  0.45 0.08 0.14
+ BS       0.34 0.38 0.36  0.22 0.74 0.34  0.33 0.47 0.39  0.31 0.30 0.30  0.53 0.59 0.56  0.52 0.31 0.39  0.33 0.09 0.14
+ KNN      0.31 0.40 0.35  0.31 0.46 0.37  0.29 0.47 0.36  0.24 0.48 0.32  0.54 0.58 0.56  0.38 0.36 0.37  0.10 0.12 0.11
SVM        0.30 0.39 0.34  0.30 0.45 0.36  0.34 0.48 0.40  0.29 0.29 0.29  0.55 0.58 0.57  0.47 0.38 0.41  0.20 0.09 0.12
R-Forest   0.37 0.38 0.38  0.31 0.46 0.37  0.33 0.47 0.39  0.23 0.40 0.30  0.56 0.61 0.58  0.42 0.38 0.40  0.12 0.10 0.11
Log-Reg    0.29 0.39 0.33  0.31 0.45 0.37  0.32 0.47 0.38  0.23 0.37 0.29  0.56 0.61 0.58  0.51 0.36 0.42  0.12 0.10 0.10
GNB        0.32 0.40 0.36  0.30 0.46 0.36  0.33 0.51 0.40  0.28 0.31 0.30  0.56 0.62 0.58  0.47 0.37 0.41  0.11 0.10 0.11
DAN        0.22 0.38 0.27  0.21 0.74 0.33  0.23 0.52 0.32  0.25 0.68 0.37  0.56 0.58 0.57  0.37 0.38 0.37  0.26 0.19 0.22
LSTM       0.29 0.39 0.33  0.43 0.23 0.30  0.29 0.48 0.37  0.25 0.24 0.24  0.51 0.63 0.56  0.37 0.31 0.33  0.09 0.20 0.13

Table 1: Classification results across 7 languages, for each module. Modules: KWD = Keywords, BS = Bootstrapping. Classifiers: R-Forest = Random Forest, GNB = Gaussian Naive Bayes, Log-Reg = Logistic Regression, LSTM (Hochreiter and Schmidhuber, 1997), and DAN (Iyyer et al., 2015; Chen et al., 2016).

For each word in a cluster, we compute the sum of its tf-idf score across the other clusters and its average word similarity with all other keywords present in that cluster. Each keyword belongs to the same class as its cluster. Second, we prune words with a score of less than 0.9, and use the rest to again label more documents using keyword spotting (e.g. the fraction of labeled documents in Uzbek increased by 36%).

Classification: Finally, we use all labeled documents obtained from the Keywords and Bootstrapping modules to train a classifier, which classifies all the remaining incident-language documents.
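To make the Keywords module concrete, the sketch below illustrates the candidate selection and affinity pruning described above. It is an illustration under stated assumptions, not the authors' code: it uses scikit-learn for tf-idf and a gensim KeyedVectors model for Word2Vec; the input docs_by_class (a dict from class label to English documents), the embedding file name, and the assumption that class labels are single in-vocabulary words are all hypothetical.

    # Sketch of the Keywords module (assumptions noted above; not the authors' code).
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical embedding file; any Word2Vec-format model works here.
    w2v = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

    def class_keywords(docs_by_class, top_k=100, threshold=0.9):
        """Pick tf-idf candidates per class, keep those whose cosine
        similarity with the class label exceeds the threshold."""
        labels = list(docs_by_class)
        # Treat each class's concatenated documents as one tf-idf "document".
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(
            [" ".join(docs_by_class[label]) for label in labels])
        vocab = np.array(vectorizer.get_feature_names_out())
        keywords = {}
        for i, label in enumerate(labels):
            scores = tfidf[i].toarray().ravel()
            candidates = vocab[np.argsort(scores)[::-1][:top_k]]
            keywords[label] = [
                w for w in candidates
                if w in w2v and label in w2v
                and w2v.similarity(w, label) >= threshold]
        return keywords

The retained keywords would then be passed through the bilingual dictionary and used for keyword spotting, as described above.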

3 Experiments and Results

We used the LDC corpora [3] for 7 low-resource languages having 11 class labels: Crime-violence, Terrorism, Regime-change, Medical, Food, Water, Evacuation, Shelter, Search-rescue, Infrastructure, and Utilities. Mandarin, Uzbek, Farsi and Spanish have 190 documents with an average of 2.7 labels per document. Tigrinya, Uyghur and Oromo have 1.1K, 3.6K and 2.7K documents with 1.4, 0.1 and 1.0 labels per document, respectively. Apart from the differences in language families and writing scripts, morphological complexity adds a further challenge to classification.

[3] https://www.ldc.upenn.edu

As shown in Table 1 [4], the Keywords module results in the highest average F1 gain over the baseline, showing the effectiveness of using language-agnostic information for such tasks. As expected, this module primarily improves precision. Further, on average 84.53% of keywords were dropped in translation, suggesting that improvements in the bilingual dictionary can benefit this module. Similarly, the Bootstrap module focuses on incident-specific information and primarily improves recall, resulting in an overall F1 gain of 6%. Finally, the Classifier module achieves an overall improvement of 4% F1. We observe low performance on Oromo, which is a morphologically rich language. On finer inspection, we found the corpus had several misspelled words. For instance, we identified different versions of Ethiopia, such as itiyoophiyaa. We also observe that languages of the same family, like Uyghur and Uzbek, have similar performances. In most cases, the gain in F1 provided by keyword extraction and bootstrapping is significantly higher than that from any classifier. This suggests that classifier performance will improve only when we improve the mappings between source and target languages.

[4] We use the LOREHLT evaluation guidelines (https://goo.gl/ZT7sMq) for scoring.



Acknowledgments

We acknowledge NIST for coordinating the SF type evaluation and providing the test data. NIST serves to coordinate the evaluations in order to support research and to help advance the state of the art. NIST evaluations are not viewed as a competition, and such results reported by NIST are not to be construed, or represented, as endorsements of any participant's system, or as official findings on the part of NIST or the U.S. Government. This project was sponsored by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114.

References

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947.

Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2016. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.

Javid Ebrahimi, Dejing Dou, and Daniel Lowd. 2016. Weakly supervised tweet stance classification by relational bootstrapping. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1012–1017. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ruihong Huang and Ellen Riloff. 2013. Multi-faceted event recognition with bootstrapped dictionaries. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 41–51.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.

Johannes Knopp. 2011. Extending a multilingual lexical resource by bootstrapping named entity classification using Wikipedia's category system. In Proceedings of the Fifth International Workshop on Cross Lingual Information Access, pages 35–43. Asian Federation of Natural Language Processing.

Luis Marujo, Wang Ling, Isabel Trancoso, Chris Dyer, Alan W Black, Anatole Gershman, David Martins de Matos, João Neto, and Jaime Carbonell. 2015. Automatic keyword extraction on Twitter. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 637–643. Association for Computational Linguistics.

Aakanksha Naik, Chris Bogart, and Carolyn Rose. 2017. Extracting personal medical events for user timeline construction using minimal supervision. BioNLP 2017, pages 356–364.

Arzucan Özgür, Levent Özgür, and Tunga Güngör. 2005. Text categorization with class-based and corpus-based keyword selection. In International Symposium on Computer and Information Sciences, pages 606–615. Springer.

Kevin Stowe, Michael J. Paul, Martha Palmer, Leysia Palen, and Kenneth Anderson. 2016. Identifying and categorizing disaster-related tweets. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pages 1–6. Association for Computational Linguistics.

Dang Tran, Cuong Chu, Son Pham, and Minh Nguyen. 2013. Learning based approaches for Vietnamese question classification using keywords extraction from the web. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 740–746.

Miaomiao Wen and Carolyn Penstein Rosé. 2012. Understanding participant behavior trajectories in online health support groups using automatic extraction methods. In Proceedings of the 17th ACM International Conference on Supporting Group Work, pages 179–188. ACM.

Dongxu Zhang, Boliang Zhang, Xiaoman Pan, Xiaocheng Feng, Heng Ji, and Weiran Xu. 2016. Bitext name tagging for cross-lingual entity annotation projection. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 461–470.



Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 4–6, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Dissecting Treebanks to Uncover Typological Trends: A Multilingual Comparative Approach

Chiara Alzetta •⋄, Felice Dell'Orletta ⋄, Simonetta Montemagni ⋄, Giulia Venturi ⋄
• DIBRIS, Università degli Studi di Genova, Italy

[email protected]
⋄ Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR) - ItaliaNLP Lab

{name.surname}@ilc.cnr.it

Introduction and Motivation. Over the last few years, linguistic typology has started attracting the interest of the community working on cross- and multilingual NLP as a way to tackle the bottleneck deriving from the lack of annotated data for many languages. Typological information is mostly acquired from publicly accessible typological databases, manually constructed by linguists. As reported in Ponti et al. (2018), despite the abundant information they contain for many languages, these resources suffer from two main shortcomings, i.e. their limited coverage and the discrete nature of their features (only "the majority value rather than the full range of possible values and their corresponding frequencies" is reported). Corpus-based studies can help to automatically acquire quantitative typological evidence which might be exploited for polyglot NLP. Recently, the availability of corpora annotated following a cross-linguistically consistent annotation scheme, such as the one developed in the Universal Dependencies project, is prompting new comparative linguistic studies aimed at identifying similarities as well as idiosyncrasies among typologically different languages (Nivre, 2015). The line of research described here is aimed at acquiring quantitative typological evidence from UD treebanks through a multilingual contrastive approach.

Method. The proposed methodology is inspired by Alzetta et al. (2018), where an algorithm originally developed for assessing the plausibility of automatically produced syntactic annotations was used to infer quantitative typological evidence from treebanks. The authors demonstrate that the linguistic properties used by this algorithm to rank dependency annotations from reliable to unreliable can also be effectively used on manually revised corpora, i.e. gold treebanks. In this case, the resulting ranking of gold dependencies turned out to closely reflect the degree of prototypicality of dependency relations in the target corpus.

Figure 1: LISCA work-flow.

In this study, we rely on the same algorithm, LISCA (Dell'Orletta et al., 2013), which operates in two steps (see Figure 1): i) it creates a language model collecting statistics about a set of linguistically motivated features extracted from the automatically parsed sentences of a large reference corpus, and ii) it uses the model to assign a score to each dependency arc contained in a target corpus. Rather than the plausibility of the annotation, the score should be seen here as reflecting the degree of prototypicality of a given relation, based on a wide variety of features including its context of occurrence. The higher the score of a ranked arc, the more prototypical the arc is with respect to the statistics acquired from the large reference corpus. In Alzetta et al. (2018), the algorithm was used to acquire typological evidence from treebanks relying on LISCA models (LMs) of the same language. The main novelty of this study consists in the multilingual comparative approach through which typological evidence is acquired. As illustrated in Figure 2, we ranked the same monolingual treebank using LMs built for different languages, thus obtaining four different dependency rankings of the same monolingual treebank.



Figure 2: Method work-flow exemplified on IUDT.

LM     Target UDT models
       IT      EN      SP      BUL
IT     1.00    0.94    0.97    0.79
EN     0.84    1.00    0.87    0.90
SP     0.98    0.95    1.00    0.93
BUL    0.79    0.91    0.83    1.00

Table 1: Spearman's correlation between pairs of ranked amod using different LMs on each treebank.

Different positions of the same dependency relation (DR) across the four rankings reflect different degrees of prototypicality of that DR instance: a bigger ranking difference associated with the same DR is connected with stronger typological differences between the languages represented in the selected treebanks, whereas closer rankings reflect typological closeness of the languages.

Data. In this study, we considered four UD treebanks (v2.2) (Nivre et al., 2017): English (Silveira et al., 2014), Italian (Bosco et al., 2013), Spanish (McDonald et al., 2013) and Bulgarian (Simov et al., 2005). Statistics to build the LMs for the examined languages were extracted from four monolingual corpora of around 40 million tokens each, parsed by the UDPipe pipeline (Straka et al., 2016) trained on the UD treebanks.

Results. Due to space constraints, the methodology is illustrated here with respect to the Italian UD Treebank (IUDT) and, in particular, with respect to an individual DR: adjectival modifier (amod). Table 1 reports Spearman's rank correlation coefficients (p < 0.00) obtained through pairwise comparisons of amod DRs across the LISCA rankings. Each pair is represented by the ranking obtained using the LM of the target UD treebank (Target UDT models) and the ranking obtained using one of the other LMs.

LM     Prenominal               Postnominal
       Up          Down         Up         Down
ENG    21,691.18   2,566.57     290.13     40,934.52
SP     1,541.31    9,163.83     5,462.83   5,322.66
BUL    30,597.76   967.66       885.15     43,003.64

Table 2: Average difference of ranking positions of IUDT pre- and post-nominal amod in different rankings with respect to the ranking obtained with the Italian LM.

Figure 3: Distributions of the IUDT pre- and post-nominal amod fluctuating in different rankings with respect to the ranking obtained with the Italian LM.

Interestingly, typologically similar languages such as IT and SP show higher correlation values (0.98 and 0.97, respectively) than typologically distant ones (e.g. BUL and IT). A similar trend is observed considering the dependency direction of amod and its syntactic head. Table 2 and Figure 3 report i) the average difference of positions of pre- and post-nominal adjectival modifiers between the IUDT ranking obtained using the Italian LM and those obtained using the other LMs, and ii) the percentage distribution of fluctuations across rankings. It turns out that the higher the number of ranking fluctuations, the more typologically distant the languages are. Namely, in the rankings obtained using the LMs of EN and BUL, a higher percentage of prenominal amod goes up with respect to the ranking obtained using the IT LM. This reflects the linguistic properties used to build the LMs: right-headed adjectives are more prototypical in EN and BUL than in IT, and accordingly they are highly scored by LISCA. As a consequence, a higher percentage of pre-nominal amod goes up in the rankings obtained using the EN (86.50) and BUL (91.35) LMs with respect to the ranking obtained using the IT LM. In addition, the average difference of positions of prenominal adjectives going up in the rankings obtained with the EN and BUL LMs is higher (21,691.18 and 30,597.76), as is the difference for the going-down ranking fluctuations (40,934.52 and 43,003.64). This latter result reflects the lower degree of prototypicality of left-headed adjectives in EN and BUL with respect to IT.
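The pairwise comparison underlying Table 1 reduces to rank correlation between two score assignments over the same arcs. Below is a minimal sketch, assuming each LISCA run is available as a dict from arc identifiers to scores (a hypothetical format, not the authors' implementation):

    # Sketch of the cross-LM ranking comparison (assumed data format).
    from scipy.stats import spearmanr

    def ranking_correlation(scores_a, scores_b):
        """Spearman correlation between two LISCA score assignments
        over the same set of dependency arcs (e.g., all amod arcs of IUDT)."""
        arcs = sorted(set(scores_a) & set(scores_b))
        rho, p = spearmanr([scores_a[a] for a in arcs],
                           [scores_b[a] for a in arcs])
        return rho, p

    # Illustrative values only: IUDT amod arcs scored with the IT and SP LMs.
    it_lm = {"arc-1": 0.91, "arc-2": 0.45, "arc-3": 0.78}
    sp_lm = {"arc-1": 0.88, "arc-2": 0.40, "arc-3": 0.80}
    print(ranking_correlation(it_lm, sp_lm))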



References

C. Alzetta, F. Dell'Orletta, S. Montemagni, and G. Venturi. 2018. Universal dependencies and quantitative typological trends. A case study on word order. In Proceedings of the 11th Edition of the International Conference on Language Resources and Evaluation (LREC 2018), pages 4540–4549. Association for Computational Linguistics.

C. Bosco, S. Montemagni, and M. Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the ACL Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria.

F. Dell'Orletta, G. Venturi, and S. Montemagni. 2013. Linguistically-driven selection of correct arcs for dependency parsing. Computación y Sistemas, 2:125–136.

R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini, N. Bertomeu Castelló, and J. Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 92–97.

J. Nivre. 2015. Towards a universal grammar for natural language processing. In Computational Linguistics and Intelligent Text Processing - Proceedings of the 16th International Conference, CICLing 2015, Part I, pages 3–16, Cairo, Egypt.

J. Nivre, Ž. Agić, L. Ahrenberg, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

E.M. Ponti, H. O'Horan, Y. Berzak, I. Vulić, R. Reichart, T. Poibeau, E. Shutova, and A. Korhonen. 2018. Modeling language variation and universals: A survey on typological linguistics for natural language processing. arXiv preprint arXiv:1807.00914.

N. Silveira, T. Dozat, M.C. de Marneffe, S. Bowman, M. Connor, J. Bauer, and C.D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).

K. Simov, P. Osenova, A. Simov, and M. Kouylekov. 2005. Design and implementation of the Bulgarian HPSG-based treebank. Journal of Research on Language and Computation, Special Issue, pages 495–522.

M. Straka, J. Hajič, and J. Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC).



Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 7–9, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

What Do Multilingual Neural Machine Translation Models Learn about Typology?

Ryokan Ri and Yoshimasa Tsuruoka
The University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
{li0123,tsuruoka}@logos.t.u-tokyo.ac.jp

1 Introduction

Unlike traditional statistical machine translation, neural machine translation (NMT) has enabled translation between multiple languages using a single model (Ha et al., 2016; Johnson et al., 2017). It enjoys easier maintainability and production deployment, without changing the model architecture or hurting performance much.

However, its simplicity raises a natural question: how does the multilingual model handle multilingualism? Recent work has shown that the representations learned in NMT models encode a great deal of linguistic information, such as morphology (Belinkov et al., 2017a; Bisazza and Tump, 2018), syntax (Shi et al., 2016), and semantics (Belinkov et al., 2017b; Poliak et al., 2018). However, most analyses are for models that translate in one direction, and little is known about the linguistic competence of multilingual NMT models.

The goal of this work is to address the following question: how much do multilingual NMT models capture the universality and variation of languages? Specifically, we will answer the following questions in this paper:

• How much typological information does each module in a model contain?

• How does the architectural choice of a model, specifically a subword versus a character model, affect its ability to capture linguistic typology?

The experimental design in this work is similar to Malaviya et al. (2017), who trained a many-to-one multilingual NMT system to predict the typological features of the languages. However, we differ in that, while they focus on learning representations that can be used to predict missing features in typological databases, we aim to analyze the linguistic properties of multilingual NMT models, and we conduct more fine-grained analyses.

2 Methodology

2.1 NMT Training

For a fair comparison among languages, we need sentences aligned in multiple languages. In this experiment, we use the Bible Corpus (Christodouloupoulos and Steedman, 2015), which contains translations of the Bible in 100 languages. We extracted verses aligned among 58 languages, and then split them into train/dev/test sets of 23,555/455/455 verses, respectively. The test data is used afterward for the typological prediction task.

The NMT model is an attentional encoder-decoder model similar to Luong et al. (2015). The model has two stacked LSTM layers for the encoder and decoder, and the sizes of the embeddings and hidden states are set to 500. The model is trained in the many-to-one scheme, i.e., translating from multiple languages into one single target language. Following Johnson et al. (2017), we do not explicitly specify which source language the model is translating from.

Sentences are segmented by SentencePiece (Kudo, 2018), a language-agnostic tokenizer. For the source languages (57 languages in total), we created a shared vocabulary with a size of 32,000. For the target language (English), the vocabulary size is set to 8,000. We also experimented with character tokenization.
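For illustration, the tokenization setup could look roughly as follows; the file names and any training options beyond the two vocabulary sizes are assumptions, since the paper only specifies SentencePiece and the vocabulary sizes.

    # Sketch of the SentencePiece setup (file names are assumptions).
    import sentencepiece as spm

    # Shared 32k-piece vocabulary over all 57 source languages.
    spm.SentencePieceTrainer.Train(
        "--input=train.sources --model_prefix=src --vocab_size=32000")
    # Separate 8k-piece vocabulary for the English target side.
    spm.SentencePieceTrainer.Train(
        "--input=train.en --model_prefix=tgt --vocab_size=8000")

    sp = spm.SentencePieceProcessor()
    sp.Load("src.model")
    print(sp.EncodeAsPieces("In the beginning God created the heaven and the earth."))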

2.2 Probing Task

We investigate the extent to which the NMT model captures the typology of the source languages. We use the URIEL Typological Database (Littell et al., 2017), which compiles typological features of languages extracted from multiple linguistic sources. We used the data where missing values are predicted with kNN regression based on phylogenetic or geographical neighbors. We only use the syntactic features, which amount to 103 features, from the database (e.g., S_SUBJECT_BEFORE_VERB, S_PLURAL_PREFIX), as we focus on features that would be directly learned in translation.

Our approach utilizes a probing task (Adi et al., 2017; Conneau et al., 2018). We trained binary logistic regression classifiers dedicated to each typological feature. The classifiers are asked to predict the typological features of the source languages based on a sentence representation extracted from the trained model (max-pooling of hidden states). We performed 10-fold cross-validation, with no overlapping of languages in each train/test split. As the data in the test set is from languages unknown to the classifier, the accuracy indicates how well the extracted representation generalizes the typological features across languages.
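A minimal sketch of one such probing classifier follows, under the assumption that sentence representations have already been extracted from the trained NMT model; GroupKFold keeps all sentences of a language in the same fold, which approximates the no-language-overlap constraint described above.

    # Probing sketch (assumed inputs; not the authors' code):
    #   X: (n_sentences, d) array of max-pooled hidden states from one module
    #   y: binary value of one URIEL feature for each sentence's source language
    #   languages: source-language ID per sentence
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold

    def probe_feature(X, y, languages, n_splits=10):
        accuracies = []
        for train, test in GroupKFold(n_splits).split(X, y, groups=languages):
            clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
            accuracies.append(clf.score(X[test], y[test]))
        return float(np.mean(accuracies))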

3 Results and Discussion

This section presents the results for the two questions we asked: How much typological information does each module in a model contain? How do different architectures, specifically subword or character models, affect the ability to capture linguistic typology? Table 1 summarizes the results together with the majority baseline, where classifiers always predict the majority class for each typological feature.

            Encoder    Decoder    Attention
majority    80.90%     80.90%     80.90%
subword     84.90%     80.10%     81.30%
character   87.00%     80.10%     84.90%

Table 1: The accuracy of typology prediction using features extracted from different layers of the model. The values are averaged across all predicted typological features and languages. Encoder and Decoder denote the output from the top layer of the encoder and decoder, respectively; the lower layers gave lower accuracy in most cases. Attention is the representation after computing attention, before the output projection layer.

Effect of module

The representations from the encoder predict the typological features of the source language significantly better than the majority baseline, whereas the decoder shows almost no improvement over the baseline. This indicates that the encoder is aware of what language it encodes, whereas the decoder is ignorant of the source and focuses on generating the target language.

However, although the decoder is unaware of the source language, the representation after attention, again, contains typological information about the source language. This indicates an inefficiency in the current shared-attention architecture. Ideally, to achieve the most efficient parameter sharing in multilingual translation systems, target sequence generation should be ignorant of the source language properties, as sentences with the same meaning are eventually mapped to the same target sequence regardless of the source language. In other words, the decoder has to generate a word sequence based on an interlingua, i.e., a shared meaning representation across all languages (Richens, 1958; Schwenk and Douze, 2017; Johnson et al., 2017). Our result confirms that attention is one of the obstacles to language-agnostic generation in the decoder, and is in line with recent efforts to improve multilingual NMT by seeking a neural interlingua (Lu et al., 2018; Cífka and Bojar, 2018).

Subword vs. characters

The character model is more predictive of typological properties than the subword model in every layer. This can be attributed to the ability of the character model to capture morphology (Qian et al., 2016; Belinkov et al., 2017a), which is corroborated by the top 5 typological features that see improvement from the subword model to the character model. Three of them are features concerning part-of-speech (ADJECTIVE and NOUN), and one is about dependency marking:

• S_ADJECTIVE_AFTER_NOUN
• S_ADJECTIVE_BEFORE_NOUN
• S_INDEFINITE_WORD
• S_ADJECTIVE_WITHOUT_NOUN
• S_TEND_DEPMARK

4 Conclusion

We will continue to experiment with other datasets, larger models, and different target languages to verify the observations in this paper and conduct further analyses. We also intend to use other probing tasks, such as universal part-of-speech tagging and natural language inference, to investigate the generalization ability of multilingual NMT models across languages.



References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Proceedings of the International Conference on Learning Representations.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks. In Proceedings of the International Joint Conference on Natural Language Processing.

Arianna Bisazza and Clara Tump. 2018. The Lazy Encoder: A Fine-Grained Analysis of the Role of Morphology in Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

Ondřej Cífka and Ondřej Bojar. 2018. Are BLEU and Meaning Representation in Opposition? In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2016. Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. arXiv.org.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Taku Kudo. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Patrick Littell, David Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.

Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua for multilingual machine translation. In Proceedings of the Conference on Machine Translation.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning Language Representations for Typology Prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Adam Poliak, Yonatan Belinkov, James Glass, and Benjamin Van Durme. 2018. On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Investigating Language Universal and Specific Properties in Word Embeddings. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

R. H. Richens. 1958. Interlingual Machine Translation. The Computer Journal, 1(3):144–147.

Holger Schwenk and Matthijs Douze. 2017. Learning Joint Multilingual Sentence Representations with Neural Machine Translation. In Proceedings of the Workshop on Representation Learning for NLP.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In Proceedings of the Conference on Empirical Methods in Natural Language Processing.



Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 10–12, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Syntactic Typology from Plain Text Using Language Embeddings

Taiqi He
University of California, Davis

[email protected]

Kenji Sagae
University of California, Davis
[email protected]

1 Introduction

We examine the question of whether information about linguistic typology can be derived automatically, solely from text corpora, without access to any kind of annotation or parallel data. We describe an ongoing effort to use text from various languages to develop an unsupervised approach to characterize languages through language embeddings, which encode information about the structure of languages as vectors. We then explore whether these language vector representations encode typological information, which would traditionally require human expertise.

In recent work, Malaviya et al. (2017) showed that it is possible to extract language representations from neural machine translation models, while Wang and Eisner (2017) used POS tag sequences to predict dependency orders for various languages. In contrast, we do not leverage any kind of linguistic annotation or parallel data. Bjerva and Augenstein (2018) used an architecture similar to ours to generate language representations and showed improvement in task-related WALS predictions after task-specific transfer learning. We show that it is possible to obtain better-than-baseline results without optimizing for specific tasks.

2 Method

Our approach is based on the idea behind a denoising autoencoder (Vincent et al., 2008), applied to many languages simultaneously. Given a text corpus with unrelated sentences in each language, we use an encoder-decoder model that learns to reorder, or denoise, sentences in each language.

We first map the words from the various languages into a common representation, leveraging the multilingual 300-dimensional word embeddings from the Facebook project MUSE (Conneau et al., 2017). We then build multilingual dictionaries using CSLS (Conneau et al., 2017) and replace each word in each sentence with its English translation. We therefore have a corpus with sentences consisting of English words in the original word orders of the different languages. The words in each sentence of this corpus are reordered randomly, creating the input, or source, sequences. The target sequences are the corresponding original sentences. The model then must learn to reorder words in each language, from a random order, to the original order. Additionally, we provide the model with information about which language a sentence is from by appending to each word, on both the source and target sides, a feature that corresponds to the language identity. Table 1 shows how our target sentences are represented, with English words, original word orders, and language features, along with the original sentences for comparison. A 50-dimensional embedding of this language identity feature is learned along with the reordering task. The intuition is that the model will learn that reordering the same words is done differently depending on whether the language is English, French, Turkish, Vietnamese, etc., but that certain languages are more similar to, or more different from, each other. Once the model is trained, the language feature embedding that helps the model learn how to reorder words for a specific language is the language embedding. After training, we retrieve the language embeddings from the decoder of the model and examine them in the following section.
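The data construction can be summarized in a few lines. The sketch below reflects our reading of the procedure, with two assumptions not fixed by the text: to_english stands for the CSLS-built dictionary, and out-of-dictionary words are kept unchanged.

    # Sketch of the denoising-reordering data construction (assumptions noted above).
    import random

    def make_example(sentence, lang, to_english):
        """Build one (source, target) pair for the reordering task."""
        # Translate word-by-word, keep the original order, and append the
        # language-identity feature to every token.
        tokens = [f"{to_english.get(w.lower(), w)}|{lang}" for w in sentence.split()]
        source = tokens[:]       # randomly reordered copy as the noisy input
        random.shuffle(source)
        return source, tokens    # target keeps the original order

    src, tgt = make_example(
        "Er hat den roten Hund nicht gesehen", "de",
        {"er": "he", "hat": "has", "den": "the", "roten": "red",
         "hund": "dog", "nicht": "not", "gesehen": "seen"})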

Our BiLSTM encoder and LSTM decoder both have two layers of 500 units. We used 29 languages, with 200,000 sentences each.



Figure 1: PCA projection of the language embeddings. Shapes represent automatically derived clusters. The Rand score between actual (color) and predicted (shape) language categorizations is .62.

Original:     Er hat den roten Hund nicht gesehen
Transformed:  he|de has|de the|de red|de dog|de not|de seen|de

Original:     No vio al perro rojo
Transformed:  not|es saw|es the|es dog|es red|es

Original:     Il n'a pas vu le chien rouge
Transformed:  he|fr not|fr has|fr seen|fr the|fr dog|fr red|fr

Table 1: The sentence "He didn't see the red dog" transformed from German, Spanish, and French to English. Word orders were preserved, and a label denoting the origin language was attached to each word.

Although the MUSE embeddings used in our experiments were created using bilingual dictionaries, violating our goal of deriving the language representations from text only, we also plan to examine the use of multilingual embeddings obtained from text alone (Lample et al., 2017).

3 Examining Language Embeddings

Figure 1 shows the two-dimensional PCA projection of the normalized language embeddings. We can clearly see a clustering of Slavic languages on the lower left, Romance on the right, and Germanic on the upper left. Our dataset also had two Finnic languages, which appear in the upper region among the Slavic languages, and two Semitic languages, which appear on the lower right. The other languages are from families underrepresented in our dataset, and appear either around the center (in the case of Hungarian, Turkish and Greek) or in the lower right corner (Vietnamese, Indonesian). Romanian, a Romance language, appears miscategorized by our language embeddings, also in the lower right cluster.

In addition to actual language relationships, represented by color, we also present the result of spectral clustering with four categories, shown through different shapes. This indicates that, broadly, the language embeddings do capture similarities within language families and dissimilarities across language families. Finally, we train linear models to predict WALS (Dryer and Haspelmath, 2013) features for each language based on the language's 50-dimensional embedding. Even with only 28 training samples (the models were evaluated by leaving one language out), the models predict features in the areas of verbal categories, word order, nominal categories, simple clauses, phonology and lexicon above the level of a majority baseline, with an average accuracy of 0.77 (baseline 0.71), while morphology, nominal syntax and other were worse than or equal to a majority baseline, with an average accuracy of 0.73 (baseline 0.76). Despite the surprising failure to capture nominal syntax, it does appear that the language embeddings capture some aspects of syntactic typology.
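Both analyses in this section are straightforward to sketch. In the snippet below, E stands for the (29, 50) matrix of normalized language embeddings (the file name is hypothetical), the clustering is four-way spectral clustering, and logistic regression stands in for the unspecified "linear models" in the leave-one-language-out WALS experiment.

    # Sketch of the clustering and WALS prediction analyses (assumed inputs).
    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut

    E = np.load("language_embeddings.npy")  # hypothetical file, shape (29, 50)

    # Four-way spectral clustering over the embeddings (the shapes in Figure 1).
    clusters = SpectralClustering(n_clusters=4).fit_predict(E)

    def loo_accuracy(E, y):
        """Leave-one-language-out accuracy for one WALS feature;
        y holds one integer-coded feature value per language."""
        y = np.asarray(y)
        hits = 0
        for train, test in LeaveOneOut().split(E):
            clf = LogisticRegression(max_iter=1000).fit(E[train], y[train])
            hits += int(clf.predict(E[test])[0] == y[test][0])
        return hits / len(y)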



References

Johannes Bjerva and Isabelle Augenstein. 2018. From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings. arXiv:1802.09375 [cs].

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv:1710.04087 [cs].

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised Machine Translation Using Monolingual Corpora Only. arXiv:1711.00043 [cs].

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning Language Representations for Typology Prediction. arXiv:1707.09569 [cs].

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Dingquan Wang and Jason Eisner. 2017. Fine-Grained Prediction of Syntactic Typology: Discovering Latent Structure with Supervised Learning. Transactions of the Association for Computational Linguistics, 5:147–161.



Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 13–15, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Typological Feature Prediction with Matrix Completion

Annebeth Buis
Department of Linguistics
University of Colorado
[email protected]

Mans Hulden
Department of Linguistics
University of Colorado
[email protected]

1 Introduction

Currently, there are approximately 7,000 non-extinct languages in the world (Lewis, 2009). Linguistic typology aims at studying and classifying these languages in a systematic way, based on their structural and functional features. The World Atlas of Language Structures (WALS, Dryer and Haspelmath, 2013) is an online database that describes typological features (phonological, syntactic, lexical, word order features, etc.) and records the values of 144 such features for 2,679 languages. However, even for this small set of languages and features, the value of 80% of the language-feature combinations is undefined. Previous research has shown that WALS feature values can be predicted based on the existing data in WALS (e.g., Takamura et al., 2016). Predicted values for missing language-feature combinations in WALS can be useful both for downstream NLP tasks (e.g., Naseem et al., 2012; Daiber et al., 2016) and for typological research.

We discuss predicting WALS data using matrix completion. In our first experiment, we use a simple set-up and leave-one-out cross-validation to predict feature values in the database. We compare the results to a majority class baseline and a logistic regression classifier. In further experiments, we test the robustness of our method by leaving out (1) features in the same domain and (2) languages within the same language family. To our knowledge, matrix completion approaches have not been used for typological prediction tasks previously; we show that they outperform our two baselines on the WALS data set.

2 Related Work

For space reasons, we point the reader to Ponti et al. (2018) for a comprehensive overview of research on typological information in NLP/computational linguistics. We have included a comparison to accuracies obtained in other WALS prediction experiments in Table 1. Note that it is difficult to objectively compare performance between different projects because of the wide disparity in methods and subsets of the WALS data used. Because of this, the focus of our comparison will be the two baselines evaluated on the same data set used for matrix completion.

3 WALS and Preprocessing

Features in WALS are split into 11 domains.1,2

The data set also includes 10 meta-features (isocodes, language family, genus, etc.), which are not included in the data for prediction. Including meta-features improves accuracy and would therefore be desirable in combination with a downstream NLP task. However, the current set-up is more interesting from a linguistic perspective, since it constrains feature predictions to depend on typological implications (such as VSO order → Noun-Adjective order; Greenberg, 1963) and other statistical patterns in the WALS data.

The original WALS matrix contains categorical feature values, which were binarized before running matrix completion. We excluded 214 languages for which only one feature value has been recorded in WALS. Our method requires no additional preprocessing.3
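
As a rough illustration of this preprocessing step, the sketch below one-hot binarizes a toy WALS-like table with pandas. The feature names and values are hypothetical stand-ins, and the paper does not specify its tooling; the point is only that undefined cells must remain undefined after binarization so the completion algorithm knows which entries to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy slice of WALS: rows are languages, columns are
# categorical features; NaN marks undefined language-feature combinations.
wals = pd.DataFrame(
    {"81A_word_order": ["SVO", None, "SOV", "SVO"],
     "49A_num_cases": ["2 cases", "2 cases", "6-7 cases", None],
     "26A_affixation": ["suffixing", None, "prefixing", "suffixing"]},
    index=["lang1", "lang2", "lang3", "lang4"])

# Mirror the paper's filter: drop languages with only one recorded value.
wals = wals[wals.notna().sum(axis=1) > 1]

# One-hot binarization; cells that were undefined are reset to NaN.
binary = pd.get_dummies(wals).astype(float)
for feat in wals.columns:
    undefined = wals[feat].isna()
    cols = [c for c in binary.columns if c.startswith(feat)]
    binary.loc[undefined, cols] = np.nan
```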

4 Matrix completion

Matrix completion algorithms are not yet part of the standard computational linguistics toolkit. However, there are several reasons why matrix completion is potentially a good method for our task.

1 Phonology, sign languages, morphology, nominal categories, nominal syntax, verbal categories, word order, simple clauses, complex sentences, lexicon and other.

2 Both sign languages and features related to sign languages have been excluded from the data in this project.

3 All data and code used to obtain the results in this paper is available at https://github.com/annebeth/wals-matrix-completion.


                             Accuracy   Method
Georgi et al. (2010)         65.5%      Language clustering
Takamura et al. (2016)       75.5%      Logistic regression
  without language family    73.0%      Logistic regression
Murawaki (2017)              74.5%      Bayesian model

Baseline 1                   53.1%      Majority class
Baseline 2                   65.7%      Logistic regression
Matrix completion            74.3%      IterativeSVD
  without domain             61.6%      IterativeSVD
  without language family    71.2%      IterativeSVD

Table 1: Matrix completion experiment results compared with results obtained in previous work.

First, these algorithms have been used extensively with sparse matrices. Second, since they learn from the entire matrix at once, we expect them to be able to learn more holistic patterns in the data than individual local predictors (such as our logistic regression baseline).

The matrix completion algorithm used in this paper is IterativeSVD,4 based on Troyanskaya et al. (2001). This method attempts to learn a low-rank approximation of the original matrix by using Singular Value Decomposition (SVD).
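
The core of the Troyanskaya-style iteration can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the FancyImpute implementation the paper uses: missing cells are initialized to column means, then repeatedly replaced with the values of a rank-k SVD reconstruction until they stop changing.

```python
import numpy as np

def iterative_svd_impute(X, rank=10, n_iters=100, tol=1e-4):
    """Fill the NaN entries of X with a rank-`rank` SVD approximation,
    iterating until the imputed values converge; observed cells are kept."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # init: column means
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        delta = np.max(np.abs(filled[missing] - approx[missing]))
        filled[missing] = approx[missing]  # only missing cells are updated
        if delta < tol:
            break
    return filled
```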

5 Experimental set-up

In our experiment, we predict each language × feature combination that currently has a value in WALS (i.e., it is not undefined) separately, using leave-one-out cross-validation (LOOCV).

First, we calculate results for two baselines. The first baseline predicts a feature value by simply assigning the majority class for each feature. The second baseline consists of logistic regression classifiers that are trained to predict a specific feature based on all other features. Besides our basic matrix completion setting, we have run the experiment in two additional settings: (1) the without domain setting, in which all features from the same domain as the feature being predicted are excluded from the matrix, and (2) the without family setting, in which, when predicting a feature value for a certain language, all other languages in the same language family are excluded from the matrix.
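
A sketch of this set-up, assuming the FancyImpute interface in which fit_transform fills the NaN cells of its input. Rounding a single binarized cell is a simplification of the paper's evaluation, which predicts the categorical feature value, and the loop runs one completion per observed cell, which is expensive at WALS scale.

```python
import numpy as np
from fancyimpute import IterativeSVD

def loocv_cell_accuracy(X, rank=10):
    """Hide each observed cell in turn, complete the matrix, and check
    whether the imputed value rounds to the true one.

    X: binarized WALS matrix as a float array with NaN for undefined cells.
    """
    observed = np.argwhere(~np.isnan(X))
    correct = 0
    for i, j in observed:
        masked = X.copy()
        masked[i, j] = np.nan                       # hide the target cell
        completed = IterativeSVD(rank=rank).fit_transform(masked)
        correct += int(round(completed[i, j]) == X[i, j])
    return correct / len(observed)
```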

6 Results

Table 1 shows the prediction results obtained with matrix completion on the WALS data and compares them to results obtained in related work.5

4 IterativeSVD as implemented in the FancyImpute Python package: https://github.com/iskandr/.

[Figure 1 here: number of examples per feature plotted against accuracy per feature (%) (top), and number of examples per language plotted against accuracy per language (%) (bottom).]

Figure 1: Comparison of (top) the number of examples of each feature to the prediction accuracy for that feature and (bottom) the number of examples of each language to the prediction accuracy for that language.


Matrix completion significantly outperforms our two baselines and also improves on the baselines in the without language family setting.

Figure 1 shows different distributional patterns in the comparison of the number of examples with the obtained accuracy. The number of examples shows no correlation with the prediction accuracy per language. For feature accuracy, however, having more examples of a feature can result in better predictions.

7 Conclusion

Matrix completion outperforms the baselines on the WALS data and performs on par with previous work. This shows that matrix completion captures holistic patterns in the data that cannot be learned in a traditional classifier approach. Furthermore, our method requires minimal preprocessing and can easily be used with any typological database.

Our work has shown that treating WALS as a matrix is an effective approach. Daumé III and Campbell (2007) showed that typological implications can be learned from WALS. Non-negative matrix factorization (Lee and Seung, 1999) could be used for the analysis of these implications or to improve the clustering of languages.

5 We included papers that use only WALS as training data and evaluate on all domains in WALS.


References

Joachim Daiber, Miloš Stanojević, and Khalil Sima'an. 2016. Universal reordering via linguistic typology. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3167–3176.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Ryan Georgi, Fei Xia, and William Lewis. 2010. Comparing language similarity across genetic and typologically-based groupings. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 385–393.

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge.

Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788.

M. Paul Lewis, editor. 2009. Ethnologue: Languages of the World, sixteenth edition. SIL International, Dallas, TX, USA.

Yugo Murawaki. 2017. Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 451–461, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 629–637. Association for Computational Linguistics.

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2018. Modeling language variation and universals: A survey on typological linguistics for natural language processing. arXiv preprint arXiv:1807.00914.

Hiroya Takamura, Ryo Nagata, and Yoshifumi Kawasaki. 2016. Discriminative analysis of linguistic features for typological study. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 69–76.

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525.


Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 16–18. Florence, Italy, August 1, 2019. ©2019 Association for Computational Linguistics

Feature Comparison across Typological Resources

Tifa de Almeida, Youyun Zhang, Kristen Howell, and Emily M. Bender

Department of Linguistics, University of Washington, Seattle, WA, USA

{trda, youyunzh, kphowell, ebender}@uw.edu

1 Introduction

We explore in this abstract the relationship between a typological database (WALS) and a grammar specification system (the LinGO Grammar Matrix).

The LinGO Grammar Matrix (GM) customization system automatically generates a custom HPSG grammar based on user inputs to a web-based questionnaire (Bender et al., 2002, 2010). One of its goals is to make the process of creating new grammars easier for linguists of all backgrounds working on testing linguistic hypotheses.

The World Atlas of Language Structures (WALS), often considered a typological atlas due to its detailed geographical data, distinguishes itself from other typological databases such as Terraling (Terraling) by the quantity and the quality of data it contains. WALS was built upon the work of 55 authors who classified over 2,500 languages according to 192 features,1 and integrated this information with geographical coordinates for each language. It has been used to discover universal typological implications (Daumé III and Campbell, 2007), compare phylogenetic relationships within feature-based language clusters (Georgi et al., 2010) and provide a benchmark for automatic typological feature identification from corpora (Lewis and Xia, 2008).

We explore this database in an effort to assist the creation of custom grammars for the GM user. We develop a method to determine the overlap between the GM questionnaire and WALS features and how they correspond to each other. In the following sections, we detail how WALS features can be mapped to the Grammar Matrix and what conclusions can be drawn from these mappings. To illustrate this process we provide examples and conclude with a discussion of the potential impacts of these matches and how they may be applied in future work.

1 This dataset is sparse, however: different features are specified for different languages.


2 Methodology

There is a fundamental difference between the features defined in the WALS database and the ones elicited by the GM user interface: while both are built with features extracted from detailed grammars and current typological literature, WALS is fundamentally a reference database of language typology. Accordingly, WALS features classify typological information about a language but are generally not concerned with all the detail that would be required to implement language-specific grammars (e.g. the particular form affixes take).2

Due to this, we developed a simple method for matching WALS features to GM features, in order to determine to what extent WALS features can be imported and utilized in the grammar customization process. After studying the documentation for a GM feature, for example Adnominal Possession, we examine WALS' inventory of features, looking for key terms in titles that correspond to the GM phenomenon, such as Feature 24A Locus of Marking in Possessive Noun Phrases (Nichols and Bickel, 2013). We assess the values of the feature and organize them with the corresponding questionnaire item. An example of this 1-to-1 pairing can be seen in Table 1.
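
A rough sketch of this kind of title-keyword matching is given below. The stopword list, the prefix length, and the toy feature titles are illustrative; the paper's actual matching was done by hand against the documentation of each system.

```python
STOP = {"of", "in", "the", "and", "a", "for"}

def stems(title, k=6):
    """Crude stemming: lowercase, drop stopwords, keep 6-char prefixes."""
    return {w.lower()[:k] for w in title.split() if w.lower() not in STOP}

def candidate_matches(gm_phenomenon, wals_features):
    """Return WALS features whose titles share a stem with the GM phenomenon."""
    gm = stems(gm_phenomenon)
    return [(fid, title) for fid, title in wals_features.items()
            if gm & stems(title)]

wals_features = {
    "24A": "Locus of Marking in Possessive Noun Phrases",
    "49A": "Number of Cases",
    "98A": "Alignment of Case Marking of Full Noun Phrases",
}
# Shares the stem "posses-" with Feature 24A:
print(candidate_matches("Adnominal Possession", wals_features))
```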

This is where a careful interpretation of the documentation for each system is necessary. WALS utilizes the term locus to refer to a head-dependent marking relationship (Comrie, 2013), designating the possessed noun as the head noun and the possessor as the dependent. The GM refers to the possessor and the possessum (Nielsen and Bender, 2018).

2 WALS also contains significant information about linguistic properties outside the morphological, syntactic and semantic information required by the GM, including phonological and lexical features.


Grammar Matrix (Morpheme Placement)        WALS (Feature 24A)              Number of Languages
On the possessum                           Possessor is head-marked        78
On the possessor                           Possessor is dependent-marked   98
On both the possessor and the possessum   Possessor is double-marked      22
No possessive morphemes appear             Possessor has no marking        32
—                                          Other types                     6

Table 1: Correlation of GM Adnominal Possession morpheme placement options and WALS Feature 24A Locus of Marking in Possessive Noun Phrases. Data available for 236 languages.

Therefore we pair the values "on the possessum" (GM) with "possessor is head-marked" (WALS). This analysis is repeated with all values for each feature.

Another point of consideration this particular feature brings up is that after asking about the position of the possessive morpheme, the GM poses further clarifying questions that depend on which option was chosen. If the morpheme appears "on the possessor", the user is asked if it is an affix, separate word or clitic. When the option "on the possessor" is chosen, the GM offers the user the option to add a feature constraint, such as Case. This kind of information, along with the orthography and distribution of the morphemes, is regrettably not provided by WALS and must remain under the user's purview to add.

Below we offer two more examples of WALS/GM features to illustrate the challenges of feature mapping when a one-to-one correspondence cannot be found or is not particularly useful.

2.1 Case Marking Strategy

The core case marking section of the GM (Drellishak, 2008) corresponds to WALS Feature 98A (Comrie, 2013), entitled Alignment of Case Marking of Full Noun Phrases. The GM designates the argument abbreviations according to Dixon (1994) – A (agent of a transitive verb), O (object of a transitive verb) and S (subject of an intransitive verb) – whereas WALS uses P (patient) instead of O (Comrie, 1978).

The GM questionnaire gives the user the option to check which case marking strategy is being used and also what each case is called, e.g. ergative. By extracting the values of Feature 98A from WALS, the user is given a head start on this. For the 190 languages specified for this feature in WALS, the GM user has a readily usable source of information for the first of these steps.


2.2 Number of Cases

WALS Feature 49A (Iggesen, 2013), defined for 261 languages, records how many cases a language contains. One would think that knowing the number of cases would be helpful in building a grammar, but it is actually not. Without also knowing what each case is called, this feature could only notify the user that they must manually add N cases, which is not useful for our purposes.

3 Conclusion and Future Work

Having reviewed 33 WALS features, we estimate that about 20 of them (10.4% of the total) can be usefully imported into the GM system to facilitate the grammar generation process for the user. This corresponds to about 8.5% of the GM's grammar specification options.

Our work identifying which features can be mapped and in what way could support an API that extracts the pertinent information from WALS when the user starts a custom grammar. The first page of the user interface asks the user to input the language ISO code, which is also used in the WALS database. The user would be given a choice of importing this information or not, and should they choose to do so would be shown a notification detailing how many features were found.
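
A sketch of what such an import hook might look like is given below. GM_TO_WALS and load_wals_values are hypothetical stand-ins for the hand-built feature mapping and a WALS data loader keyed by ISO 639-3 code; no such API exists yet.

```python
# Hypothetical mapping from GM questionnaire fields to WALS feature IDs.
GM_TO_WALS = {"adnominal-possession-marking": "24A",
              "core-case-alignment": "98A"}

def prefill_questionnaire(iso_code, load_wals_values):
    """Return GM questionnaire fields pre-filled from WALS, plus a count
    to show the user ("N features were found for this language")."""
    wals = load_wals_values(iso_code)  # e.g. {"24A": "Possessor is head-marked"}
    prefilled = {gm: wals[wals_id]
                 for gm, wals_id in GM_TO_WALS.items() if wals_id in wals}
    return prefilled, len(prefilled)
```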

Additionally, our mapping of WALS to GM features could support work on automatically answering the GM questionnaire. The AGGREGATION project (e.g. Zamaraeva et al., 2019) presently approaches this problem by inferring GM grammar specifications from collections of interlinear glossed text. Where WALS features map to GM features, and WALS values are available for the language at hand, the WALS values can potentially be used to guide grammar specification inference, as explored in Zhang et al. 2019.


References

Emily M. Bender, Scott Drellishak, Antske Fokkens, Laurie Poulson, and Safiyyah Saleem. 2010. Grammar customization. Research on Language and Computation, 8(1):23–72.

Emily M. Bender, Dan Flickinger, and Stephan Oepen. 2002. The Grammar Matrix: An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In COLING-02: Grammar Engineering and Evaluation.

Bernard Comrie. 1978. Ergativity. In Winfred P. Lehmann, editor, Syntactic Typology: Studies in the Phenomenology of Language, pages 329–394. University of Texas Press, Austin.

Bernard Comrie. 2013. Alignment of case marking of full noun phrases. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, Prague, Czech Republic. Association for Computational Linguistics.

Robert M. W. Dixon. 1994. Ergativity. Cambridge University Press, Cambridge.

Scott Drellishak. 2008. Complex case phenomena in the Grammar Matrix. In Proceedings of the 15th International Conference on Head-Driven Phrase Structure Grammar, National Institute of Information and Communications Technology, Keihanna, pages 67–86, Stanford, CA. CSLI Publications.

Ryan Georgi, Fei Xia, and William Lewis. 2010. Comparing language similarity across genetic and typologically-based groupings. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 385–393, Beijing, China. Coling 2010 Organizing Committee.

Oliver A. Iggesen. 2013. Number of cases. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

William D. Lewis and Fei Xia. 2008. Automatically identifying computationally relevant typological features. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.

Johanna Nichols and Balthasar Bickel. 2013. Locus of marking in possessive noun phrases. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Elizabeth Nielsen and Emily M. Bender. 2018. Modeling adnominal possession in multilingual grammar engineering. In Proceedings of the 25th International Conference on Head-Driven Phrase Structure Grammar, University of Tokyo, pages 140–153, Stanford, CA. CSLI Publications.

Terraling. Available online: http://test.terraling.com. Accessed: 26 April 2019.

Olga Zamaraeva, Kristen Howell, and Emily M. Bender. 2019. Handling cross-cutting properties in automatic inference of lexical classes: A case study of Chintang. In Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages, volume 1 Papers, pages 28–38, Honolulu, Hawai'i.

Youyun Zhang, Tifa de Almeida, Kristen Howell, and Emily M. Bender. 2019. Using typological information in WALS to improve grammar inference. Unpublished paper, submitted to TyP-NLP.


Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 19–21. Florence, Italy, August 1, 2019. ©2019 Association for Computational Linguistics

From phonemes to morphemes: relating linguistic complexity to unsupervised word over-segmentation

Georgia Loukatou
Laboratoire de sciences cognitives et de psycholinguistique, Département d'études cognitives
ENS, EHESS, CNRS, PSL
[email protected]

Abstract

Previous work has documented variation in word segmentation performance across languages, with a trend to yield lower scores for languages with elaborate morphological structure. However, segmenting smaller chunks than words, "oversegmenting", is reasonable from a computational point of view. We predict that oversegmentation would be encountered more often in complex languages. In this work in progress, we use a dataset of 9 languages varying in complexity and focus on cognitively-inspired word segmentation algorithms. Complexity is defined by compression-based, type-token ratio and word length metrics. Preliminary results show that a possible relation between morphological complexity and oversegmentation cannot be predicted exactly by any of these metrics, but may be best approximated by word length.

1 Introduction

The issue of word segmentation is open in the NLP community (e.g., Harris, 1955). Its implementations include processing languages with no orthographic word boundaries, such as Chinese and Japanese. It is also a key problem humans face when acquiring language.

Previous work documented variation in the success rate of segmentation across languages, and a trend to yield lower scores for languages with elaborate morphological structure. This is true for both cognitively inspired (Johnson, 2008; Fourtassi et al., 2013; Loukatou et al., 2018) and other models (Mochihashi et al., 2009; Zhikov et al., 2013; Chen et al., 2011). Evaluation is conventionally based on orthographic word boundaries.

Do these models manage to learn more linguistic structure than what is actually described by these accuracy scores?

Segmenting smaller meaningful chunks than words is reasonable from a computational point of view: morphologically complex languages often feature multimorphemic, long words, and algorithms might break words up into component morphemes, treating frequent morphemes as words. Finding morphemes might be useful for later linguistic analysis, especially for languages with rich morphological systems, and such morphemes could be used as cues to further bootstrap segmentation. Thus, a "useful" error in segmentation could be oversegmentation (Gervain and Erra, 2012; Johnson, 2008), the percentage of word tokens returned as two or more subparts in the output.
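
To make the metric concrete: given gold and predicted segmentations of the same utterances, a token can be counted as oversegmented if its edges are recovered but at least one predicted boundary falls inside it. The sketch below is one reasonable reading of the definition, not the WordSeg implementation.

```python
def _cuts(tokens):
    """Boundary positions (in characters) implied by a token sequence."""
    pos, cuts = 0, {0}
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts

def oversegmentation_rate(gold_utts, pred_utts):
    """Fraction of gold word tokens returned as two or more subparts:
    both edges recovered, plus at least one boundary inside the token."""
    over = total = 0
    for gold, pred in zip(gold_utts, pred_utts):
        pred_cuts = _cuts(pred)
        start = 0
        for tok in gold:
            end = start + len(tok)
            internal = any(start < c < end for c in pred_cuts)
            if internal and start in pred_cuts and end in pred_cuts:
                over += 1
            total += 1
            start = end
    return over / total

# e.g. oversegmentation_rate([["doggie", "barks"]],
#                            [["dog", "gie", "barks"]])  # -> 0.5
```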

We thus predict that oversegmentation might be encountered more often in complex languages. To test this, we need data from languages varying in complexity. Since there is no standard way to define complexity, for this study, three metrics are used: first, the Moving Average Type-Token Ratio (500-word window) (Kettunen, 2014), and second, two versions of compression-based complexity (Szmrecsanyi, 2016).1 The two metrics are normalized (0 = least complex, 1 = most complex) and their average score is attributed to each language. Third, we look at word length, since, in general, longer words could attract more division.

2 Methods

We use the ACQDIV database (Moran et al., 2016) of typologically diverse languages, with transcriptions of infant-directed and -surrounding speech recordings, from Inuktitut (Allen, 1996), Chintang (Stoll et al., 2015), Turkish (Küntay et al., unpublished), Yucatec (Pfeiler, 2003), Russian (Stoll and Meyer, 2008), Sesotho (Demuth, 1992), Indonesian (Gil and Tadmor, 2007) and Japanese (Miyata and Nisisawa, 2010; Nisisawa and Miyata, 2010). In order to compare with a previously studied language, we included the English Bernstein corpus (MacWhinney, 2000).

1 First metric: the size of the compressed corpus (gzip) divided by the size of the raw corpus. Second metric: systematic distortion of morphological regularities, so as to estimate the role of morphological information in the corpus. Each word type is replaced with a randomly chosen number. The size of the distorted compressed corpus is then divided by the size of the originally compressed corpus.
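
The three complexity metrics can be approximated in a few lines. The sketch below follows the footnote's definitions (gzip compression ratio, distortion of word types, and MATTR over a 500-word window); the tokenization details are our assumptions.

```python
import gzip
import random
import re

def compression_ratio(text):
    """First metric: gzip-compressed size over raw size."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

def morph_distortion_ratio(text, seed=0):
    """Second metric: replace each word type with a random number,
    destroying morphological regularities, then compare compressed sizes."""
    rng = random.Random(seed)
    words = re.findall(r"\w+", text.lower())
    mapping = {w: str(rng.randrange(10**6)) for w in set(words)}
    distorted = " ".join(mapping[w] for w in words)
    return (len(gzip.compress(distorted.encode("utf-8")))
            / len(gzip.compress(text.encode("utf-8"))))

def mattr(text, window=500):
    """Moving Average Type-Token Ratio over a fixed word window."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < window:
        return len(set(words)) / len(words)
    ratios = [len(set(words[i:i + window])) / window
              for i in range(len(words) - window + 1)]
    return sum(ratios) / len(ratios)
```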


lang   compr.   MATTR   w length   % over   % corr   % total
Inu    1.00     0.90    8.56       51       22       73
Chi    0.56     0.87    4.39       44       24       68
Tur    0.44     0.86    4.92       39       26       65
Yuc    0.42     0.92    3.80       31       27       58
Rus    0.41     0.91    4.47       46       19       65
Ses    0.31     0.86    4.28       44       25       69
Ind    0.28     0.85    4.11       42       25       67
Jap    0.14     0.87    3.94       37       25       62
Eng    0.02     0.39    3.04       6        51       57

Table 1: Complexity scores for the three metrics are given in the first columns. Percentages of average oversegmented and correct word tokens, and their sum, are then given per language.



Several models have been proposed as plausible strategies used by learners retrieving words from input. We used a set of these strategies (Bernard et al., 2018). Two baselines were Base0, treating each sentence as a word, and Base1, treating each phoneme as a word. DiBS2 (Daland, 2009) implements the idea that unit sequences often spanning phrase boundaries probably span word breaks. FTP3 (Saksida et al., 2017) measures transitional probabilities between phonemes and cuts depending on a local threshold (relative, FTPr) or a global threshold (absolute, FTPa). Adaptor Grammar (AG) (Johnson, 2008) assumes that learners create a lexicon of minimal, recombinable units and use it to segment the input; AG implements the Pitman-Yor process. Finally, PUDDLE4 (Monaghan and Christiansen, 2010) is incremental: learners insert into a lexicon any utterance that cannot be broken down further, and use its entries to find subparts in subsequent utterances. Before segmentation, spaces between words were removed, leaving the input parsed into phonemes, with utterance boundaries preserved.
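
Of these strategies, FTP is the simplest to sketch. Below is a toy implementation of the forward-transitional-probability idea, with each character standing in for one phoneme; the exact thresholding used in WordSeg may differ.

```python
from collections import defaultdict

def ftp_segment(utterances, relative=True):
    """Toy FTP segmenter: estimate P(b|a) for adjacent phonemes a, b from
    the unsegmented corpus, then cut where the transitional probability dips."""
    bigram, unigram = defaultdict(int), defaultdict(int)
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            bigram[a, b] += 1
            unigram[a] += 1
    tp = {pair: n / unigram[pair[0]] for pair, n in bigram.items()}
    mean_tp = sum(tp.values()) / len(tp)   # global threshold for FTPa

    segmented = []
    for utt in utterances:
        tps = [tp[a, b] for a, b in zip(utt, utt[1:])]
        words, start = [], 0
        for i, val in enumerate(tps):
            if relative:
                left = tps[i - 1] if i > 0 else float("inf")
                right = tps[i + 1] if i + 1 < len(tps) else float("inf")
                cut = val < left and val < right   # local minimum (FTPr)
            else:
                cut = val < mean_tp                # below global mean (FTPa)
            if cut:
                words.append(utt[start:i + 1])
                start = i + 1
        words.append(utt[start:])
        segmented.append(words)
    return segmented

# e.g. ftp_segment(["lookatthedoggie", "thedoggiebarks"])
```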

3 Results

Statistics regarding corpora and results are presented in Table 1. In general, languages had similar oversegmentation scores (ranging from 31% to 51% if we exclude English), which did not exactly follow their complexity ranking. Performance differences across languages decreased when considering oversegmented tokens as correctly segmented.

2 Diphone-Based Segmentation algorithm.

3 Forward Transitional Probabilities algorithm.

4 Phonotactics from Utterances Determine Distributional Lexical Elements.


4 Discussion

Word length was the best predictor of oversegmentation, compared to the other metrics, compression and MATTR. This suggests that longer words have more alternative parses, and this could explain the oversegmentation results better than other properties inherent to morphologically complex languages. That said, a possible relation between morphological complexity and oversegmentation could not be fully explained by any of these complexity metrics.

It was also observed that there was no absolute ranking of complexity across languages; on the contrary, it would change according to the feature studied. In general, cross-linguistic differences were small for such a typologically distinct dataset of languages. Further research might shed light on whether this behavior is due to linguistic properties common across languages, or a confound (e.g. corpus size).

Moreover, discovering meaningful units is of particular importance to language acquisition models, such as the ones implemented here. Infant word segmentation algorithms are cognitively plausible only if they are cross-linguistically valid and offer useful insights to learn all linguistic structures. It would also be interesting to compare the performance of these models to state-of-the-art NLP algorithms, such as HPYLM (Mochihashi et al., 2009) or ESA (Chen et al., 2011).

A limitation of this study is that the current implementation of WordSeg does not single out oversegmentation cases that result in meaningful, morpheme-like subparts. A next step would be to focus on reasonable oversegmentation errors, even though not all of these corpora have morpheme annotations.

Measuring reasonable errors such as oversegmentation could shed light on the segmentability of morphologically complex languages and the cross-linguistic applicability of models. Further research might include over- but also undersegmentation errors, where two or more words in the input are returned as a single unit in the output.


References

Shanley E. M. Allen. 1996. Aspects of Argument Structure Acquisition in Inuktitut. Benjamins, Amsterdam.

Mathieu Bernard, Roland Thiollière, Amanda Saksida, Georgia Loukatou, Elin Larsen, Mark Johnson, Laia Fibla Reixachs, Emmanuel Dupoux, Robert Daland, Xuan Nga Cao, and Alejandrina Cristia. 2018. WordSeg: Standardizing unsupervised word form segmentation from text. Behavior Research Methods.

Songjian Chen, Yabo Xu, and Huiyou Chang. 2011. A simple and effective unsupervised word segmentation approach. In Twenty-Fifth AAAI Conference on Artificial Intelligence.

Robert Daland. 2009. Word segmentation, word recognition, and word learning: A computational model of first language acquisition. Ph.D. thesis, Northwestern University.

Katherine A. Demuth. 1992. Acquisition of Sesotho. In Dan Isaac Slobin, editor, The Crosslinguistic Study of Language Acquisition, volume 3, pages 557–638. Lawrence Erlbaum Associates, Hillsdale, NJ.

Abdellah Fourtassi, Benjamin Börschinger, Mark Johnson, and Emmanuel Dupoux. 2013. WhyisEnglishsoeasytosegment. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics, pages 1–10.

Judit Gervain and Ramon Guevara Erra. 2012. The statistical signature of morphosyntax: A study of Hungarian and Italian infant-directed speech. Cognition, 125(2):263–287.

David Gil and Uri Tadmor. 2007. The MPI-EVA Jakarta child language database. A joint project of the Department of Linguistics, Max Planck Institute for Evolutionary Anthropology and the Center for Language and Culture Studies, Atma Jaya Catholic University.

Zellig S. Harris. 1955. From phoneme to morpheme. Language, 31(2):190–222.

Mark Johnson. 2008. Unsupervised word segmentation for Sesotho using adaptor grammars. In Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27. Association for Computational Linguistics.

Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3):223–245.

Aylin C. Küntay, Dilara Koçbaş, and Süleyman Sabri Taşcı. Unpublished. Koç University longitudinal language development database on language acquisition of 8 children from 8 to 36 months of age.

Georgia Loukatou, Sabine Stoll, Damian Blasi, and Alejandrina Cristia. 2018. Modeling infant segmentation of two morphologically diverse languages. TALN.

Brian MacWhinney. 2000. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah, NJ.

Susanne Miyata and Hiro Yuki Nisisawa. 2010. MiiPro – Tomito Corpus. TalkBank, Pittsburgh, PA.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 100–108. Association for Computational Linguistics.

Padraic Monaghan and Morten H. Christiansen. 2010. Words in puddles of sound: Modelling psycholinguistic effects in speech segmentation. Journal of Child Language, 37(3):545–564.

Steven Moran, Robert Schikowski, D. Pajovic, Cazim Hysi, and Sabine Stoll. 2016. The ACQDIV database: Min(d)ing the ambient language. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4423–4429.

Hiro Yuki Nisisawa and Susanne Miyata. 2010. MiiPro – ArikaM Corpus. TalkBank, Pittsburgh, PA.

Barbara Pfeiler. 2003. Early acquisition of the verbal complex in Yucatec Maya. Development of Verb Inflection in First Language Acquisition, pages 379–399.

Amanda Saksida, Alan Langus, and Marina Nespor. 2017. Co-occurrence statistics as a language-dependent cue for speech segmentation. Developmental Science, 20(3):1–11.

Sabine Stoll, Elena Lieven, Goma Banjade, Toya Nath Bhatta, Martin Gaenszle, Netra P. Paudyal, Manoj Rai, Novel Kishor Rai, Ichchha P. Rai, Taras Zakharko, Robert Schikowski, and Balthasar Bickel. 2015. Audiovisual corpus on the acquisition of Chintang by six children.

Sabine Stoll and Roland Meyer. 2008. Audiovisual longitudinal corpus on the acquisition of Russian by 5 children.

Benedikt Szmrecsanyi. 2016. An information-theoretic approach to assess linguistic complexity. Complexity, Isolation, and Variation, 57:71.

Valentin Zhikov, Hiroya Takamura, and Manabu Okumura. 2013. An efficient algorithm for unsupervised word segmentation with branching entropy and MDL. Information and Media Technologies, 8(2):514–527.


Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 22–24. Florence, Italy, August 1, 2019. ©2019 Association for Computational Linguistics

Predicting Continuous Vowel Spaces in the Wilderness

Emily Ahn and David Mortensen
Language Technologies Institute
Carnegie Mellon University
{eahn1,dmortens}@cs.cmu.edu

Abstract

We aim to model the acoustic vowel spaces of 24 diverse languages—a subset taken from the CMU Wilderness corpus of Bible recordings. With this model, we test hypotheses from phonological typology. We also expand upon previous work that used formant measurements taken by field linguists, and we use automatic tools to align and extract vowel segments in large-scale recorded speech. This work in progress is at the stage where data has been carefully processed, just prior to implementation of the model.

1 Introduction

Every language has a system of vowels, whether few or many, and understanding how these systems work crosslingually has been a goal in linguistic phonological typology. We are interested in empirically studying vowel spaces from the CMU Wilderness Multilingual Speech Dataset (Black, 2019), a large database of audio recordings from around 700 languages. Cotterell and Eisner (2018) developed a deep generative model that uses acoustic formants to predict vowel spaces across a dataset of 223 languages. We aim to expand upon their findings and build a model from a set of languages that contain rich acoustic data per language and many vowel tokens per vowel type.

From this model, we can analyze our results to answer other questions regarding phonological typology. Dispersion Theory predicts that vowel types will be maximally dispersed within the vowel space, and Focalization Theory predicts that vowel types will preferentially be centered around canonical foci (Schwartz et al., 1997). Both theories make predictions about the distribution of centroids of types within formant space, but neither theory makes an explicit prediction about the dispersion of vowel tokens of a given type.

In fact, both of these theories suggest that within-type dispersion should be relatively insensitive to other factors, since they treat the vowel space as a system of categories. Other theories of vowel spaces, like those based on Evolutionary Phonology and exemplar theory, predict that the dispersion of tokens of each type should be inversely related to the number of contrasting types within a vowel system, since phonological categories can be seen as competing with one another for phonetic space (Vaux and Samuels, 2015). Reduced vowel inventories, in such a theory, are the result of mergers, and merged categories take up more phonetic space than either of the categories prior to merger.

The Wilderness corpus provides a unique opportunity to test whether the number of vowel contrasts in a language's phonological inventory predicts the average dispersion of tokens in each vowel type. Rather than just providing idealized tokens of vowels from many languages or many tokens of vowels from few languages, it provides a massive number of tokens from a very large number of languages. If a relationship between number of types and token dispersion does exist on a large scale, it would be important evidence for evolutionary approaches to vowel space typology. If token dispersion is insensitive to the number of vowel types, support would be lent to the dispersion-focalization model.

These findings would be of interest to computational typologists and have implications for low-resource NLP and speech technologies. In addition to this narrow scientific question, this paper would contribute a replicable methodology for extracting vowels in a subset consisting of 24 languages from the public Wilderness corpus of 700 languages. This methodology could be used to extract vowel tokens from the corpus on a much larger scale.


2 Data

The Wilderness corpus comprises roughly 700 languages of read speech from the New Testament of the Bible, originally scraped from Bible.is.1

2.1 Selection of 24 Languages

For this work, we intersected these languages with the PHOIBLE database of crosslingual phonological inventories (Moran and McCloy, 2019). We chose 24 languages where for each integer between 3 and 10, there are three languages (from distinct regions) whose vowel inventory size is that integer. We also used a criterion of choosing languages with the highest automatic alignment scores, as determined by algorithms provided with the Wilderness data; this list is given in Table 1.

Language        Country         Vowels   Hours
Cebuano         Philippines     3        22
Kabyle          Algeria         3        8
Tena Quechua    Ecuador         3        19
Yupik           United States   4        22
Maranao         Philippines     4        24
Podoko          Cameroon        4        21
Russian         Russia          5        15
Twampa          Ethiopia        5        31
Urarina         Peru            5        31
Hanga           Ghana           6        14
Paumari         Brazil          6        48
Manado Malay    Indonesia       6        25
Komi            Russia          7        17
Sundanese       Indonesia       7        20
Tigrinya        Ethiopia        7        14
Denya           Cameroon        8        15
Huambisa        Peru            8        28
Maithili        India           8        14
Moru            Sudan           9        23
Nomatsigenga    Peru            9        36
Ossetian        Georgia         9        12
Eastern Oromo   Ethiopia        10       24
Maka            Paraguay        10       29
Tamang          Nepal           10       18

Table 1: The 24 languages chosen for this analysis, from the Wilderness data. They are sorted by number of vowel types as determined by PHOIBLE, and we attempted to balance the languages by region.

2.2 Preprocessing

We first obtained phone-level alignments via the tool provided by Festvox.2 We then manually mapped the phoneme lists from the data to the IPA from PHOIBLE, since there is noise in the pronunciation model. Given vowel alignments, we extracted means of the first few formants using DeepFormants,3 a tool for formant estimation. All preprocessing scripts will be made publicly available for future analyses on any language from the Wilderness.

1 http://www.bible.is/
2 http://festvox.org/


A limitation of the data is that each language recording is spoken by few and undocumented speakers. We also anticipate challenges with regard to formant normalization across speaker gender, given that females tend to have higher formants and a greater range of formants than male speakers, even when controlling for vocal tract.

3 Hypotheses

We hypothesize that vowel cloud size is inversely correlated with the number of vowels in a language's inventory. Cloud size will be measured as the level of dispersion in the probabilistic distribution of the two-dimensional vowel space.
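
One simple way to operationalize "cloud size" is the spread of (F1, F2) tokens around each vowel type's centroid. The sketch below is such a hypothetical operationalization for exploratory analysis, not the DPP-based generative model the paper plans to use.

```python
import numpy as np

def cloud_size(f1, f2):
    """Root mean squared distance of (F1, F2) tokens from their centroid."""
    pts = np.column_stack([f1, f2])
    centroid = pts.mean(axis=0)
    return np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())

def mean_dispersion_by_type(tokens):
    """tokens: dict mapping a vowel type (e.g. 'i') to (f1_array, f2_array).
    Returns the average cloud size across a language's vowel types, to be
    correlated with inventory size across languages."""
    return np.mean([cloud_size(f1, f2) for f1, f2 in tokens.values()])
```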

4 Methodology

We plan to implement the deep generative models using determinantal point processes (DPP) from Cotterell and Eisner (2018) to analyze the formants in vowel spaces of our subset of the Wilderness data. We may choose a different method as well, in order to capture the variety of vowel tokens, since the prior work used one representative vowel token per phoneme. We may use cross-entropy to evaluate our generative models, and will present our findings with regard to our hypotheses.

References

Alan W. Black. 2019. CMU Wilderness multilingual speech dataset. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5971–5975. IEEE.

Ryan Cotterell and Jason Eisner. 2018. A deep generative model of vowel formant typology. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 37–46.

Steven Moran and Daniel McCloy, editors. 2019. PHOIBLE 2.0. Max Planck Institute for the Science of Human History, Jena.

Jean-Luc Schwartz, Louis-Jean Boë, Nathalie Vallée, and Christian Abry. 1997. The dispersion-focalization theory of vowel systems. Journal of Phonetics, 25(3):255–286.

3 https://github.com/MLSpeech/DeepFormants



Bert Vaux and Bridget Samuels. 2015. Explaining vowel systems: dispersion theory vs natural selection. The Linguistic Review, 32(3):573–599.


Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP, pages 25–27. Florence, Italy, August 1, 2019. ©2019 Association for Computational Linguistics

Transfer Learning for Cognate Identification in Low-Resource Languages

Eliel Soisalon-Soininen
Department of Computer Science
University of Helsinki
[email protected]

Mark Granroth-Wilding
Department of Computer Science
University of Helsinki
[email protected]

Introduction

In our on-going work, we are addressing the problem of identifying cognates across unannotated vocabularies of any pair of languages. In particular, we assume that the languages of interest are low-resource to the extent that no training data whatsoever, even in closely related languages, are available for the task. Instead, we are investigating the performance of language-independent transfer learning approaches utilising training data from a completely unrelated, higher-resource language family. Our results so far suggest that a Siamese convolutional neural network generalises more effectively across language families than baselines.

Cognate identification is a core task in the comparative method, a collection of techniques used in historical linguistics, a field closely tied with linguistic typology (Shields, 2011). Cognate information is also useful for applications such as machine translation (Grönroos et al., 2018). In addition, knowledge of cognates is useful for second-language learning (Beinborn et al., 2014).

Cognate identification

In cognate identification, we are essentially given two string sets X = {x1, . . . , xn} and Y = {y1, . . . , ym}. The task is to extract those pairs (x, y) in relation R:

R = {(x, y) ∈ X × Y | x is cognate with y}

Each element x ∈ X and y ∈ Y is a string over alphabets Σx and Σy respectively. The alphabet sets do not necessarily overlap.

Table 1 illustrates the difficulty of cognate identification. As can be seen, some cognates are straightforward, with a similar form and meaning (e.g. notte–noche). On the other hand, there is large variation in the degree of similarity in terms of both form and meaning. However, common to all of these examples is that they exhibit regular sound correspondences, i.e. word segments regularly occurring in similar positions and contexts (List, 2013; Kondrak, 2012), such as oa–e and th–d in English–Swedish cognates. Therefore, cognate identification should rely on the identification of such cross-lingual correspondences (i.e. pairs of single characters or substrings).

Word A          Word B          Meanings
it: notte       es: noche       'night'
fi: huvittava   et: huvitav     'amusing'; 'interesting'
en: attend      fr: attendre    'attend'; 'wait'
en: oath        sv: ed          'oath'
fi: pöytä       sv: bord        'table'
en: bite        fr: fendre      'bite'; 'split'

Table 1: Examples of cognates with varying degrees of similarity in form and meaning.


Most previous work attempts to design a string similarity metric that would tend to assign a higher score to cognate than to unrelated words. Common approaches include extensions of the traditional Levenshtein distance (Levenshtein, 1966) that either assign weights to pairs of symbols according to their phonetic properties (e.g. List, 2013; Kondrak, 2000), or that learn such weights from example cognates (Ciobanu and Dinu, 2014; Gomes and Pereira Lopes, 2011). McCoy and Frank (2018) use weights based on character embeddings.

In contrast to much of previous work, we make no strict assumptions about the degree of similarity in form or meaning that any two cognates should exhibit. Instead, following Rama (2016) and Jäger (2014), we treat regular correspondences as the main driving factor in the cognate relation and attempt to capture these in a completely data-driven manner. We aim to contribute to this line of research by considering the ability of our models to generalise across language families.


Models and experiments

In our experiments, we have trained our models with an etymological database of Indo-European languages (De Melo, 2014), and tested their performance on combinations of three unannotated vocabularies from Sami languages of the Uralic family. We have experimented with two similarity learning models, a Siamese convolutional neural network (S-CNN) based on Rama (2016) and a support vector machine (SVM) based on Hauer and Kondrak (2011), compared with a Levenshtein-distance (LD) baseline (Levenshtein, 1966). In addition, we have experimented with fine-tuning the S-CNN model in order to quantify the benefit of having small amounts of target-language training data.

The Levenshtein distance between two strings is the minimum number of insertions, deletions, and substitutions needed to transform one string into another. It is straightforward to turn this into a similarity metric. For the SVM, word pairs are encoded into vectors of the following features: Levenshtein distance, number of common bigrams, prefix length, lengths of both words, and the absolute difference between the lengths. The S-CNN is a two-input version of a convolutional neural network, where input words are encoded into matrices of concatenated one-hot vectors representing characters. As shown in Figure 1, the network creates a merged representation of a word pair, to be classified as cognate or unrelated.
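
For concreteness, the Levenshtein distance, one common normalization into a similarity, and the SVM feature vector described above can be sketched as follows. The normalization choice is our assumption; the paper only says that turning the distance into a similarity is straightforward.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """One common way to turn the distance into a similarity in [0, 1]."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def svm_features(a, b):
    """Feature vector for the SVM baseline, following the description above."""
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        prefix += 1
    return [levenshtein(a, b),
            len(bigrams(a) & bigrams(b)),   # number of common bigrams
            prefix,                          # length of common prefix
            len(a), len(b),
            abs(len(a) - len(b))]
```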

Figure 2 shows a precision-recall curve for each model, including both a fine-tuned (with 500 target-language training pairs) and an unadapted S-CNN. As expected, the fine-tuned S-CNN outperforms the other models. Interestingly, even the unadapted S-CNN, relying only on Indo-European training data, outperforms the SVM and LD. This suggests that the S-CNN is able to more effectively capture aspects of the cognateness relation that carry over across language families.

Work in progress

We are currently investigating approaches to improve target-family performance with unsupervised methods of domain adaptation. One of our lines of work is to use an adversarial approach to making target-family word pair representations more similar to source-family representations, similarly to the method of Tzeng et al. (2017) intended for domain adaptation of images.

[Figure 1 diagram: inputs Xa and Xb are each convolved with a shared filter W, passed through ReLU and max-pooling to give representations ra and rb, merged as m = |ra − rb|, and fed to a fully-connected layer with dropout producing the output y.]

Figure 1: The S-CNN architecture. Column vectors in input matrices represent one-hot-encoded characters. The filter W is convolved over character sequences.

[Figure 2 plot: precision (y-axis) against recall (x-axis) curves for S-CNN + FT, S-CNN, SVM, and ED.]

Figure 2: Precision-recall for Sami test set.

Another way to extend the S-CNN model is to use unsupervised multilingual character embeddings (Granroth-Wilding and Toivonen, 2019), trained with small corpora from the target languages. This could be a way to make characters across languages more comparable to each other, thus tackling the issue that orthographies are often not directly comparable.

In addition to unsupervised methods, we also intend to compare our data-driven approaches with more linguistically-informed ones, in order to assess the benefit of such information. For example, our work could benefit from the use of universal phonetic encodings of words instead of orthographic forms.

Although we have thus far specifically focused on the problem of cognate identification, we believe that these methods could be extended to the study of other typological features of language and the automatic inference of such features. Language typology could also provide a means to interpret the representations of the S-CNN model.


References

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2014. Readability for foreign language learning: The importance of cognates. ITL – International Journal of Applied Linguistics, 165(2):136–162.

Alina Maria Ciobanu and Liviu P. Dinu. 2014. Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 99–105.

Gerard De Melo. 2014. Etymological Wordnet: Tracing the history of words. In LREC, pages 1148–1154. Citeseer.

Luís Gomes and José Pereira Lopes. 2011. Measuring spelling similarity for cognate identification. Progress in Artificial Intelligence, pages 624–633.

Mark Granroth-Wilding and Hannu Toivonen. 2019. Unsupervised learning of cross-lingual symbol embeddings without parallel data. Proceedings of the Society for Computation in Linguistics (SCiL) 2019, pages 19–28.

Stig-Arne Grönroos, Sami Virpioja, and Mikko Kurimo. 2018. Cognate-aware morphological segmentation for multilingual neural translation. arXiv preprint arXiv:1808.10791.

Bradley Hauer and Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 865–873.

Gerhard Jäger. 2014. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. In Quantifying Language Dynamics, pages 155–204. Brill, Leiden, The Netherlands.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 288–295. Association for Computational Linguistics.

Grzegorz Kondrak. 2012. Similarity patterns in words. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 49–53. Association for Computational Linguistics.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Johann-Mattis List. 2013. Sequence comparison in historical linguistics. Ph.D. thesis, Heinrich-Heine-Universität Düsseldorf.

Richard T. McCoy and Robert Frank. 2018. Phonologically informed edit distance algorithms for word alignment with low-resource languages. Proceedings of the Society for Computation in Linguistics (SCiL) 2018, pages 102–112.

Taraka Rama. 2016. Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1018–1027.

Kenneth Shields. 2011. Linguistic typology and historical linguistics. In The Oxford Handbook of Linguistic Typology.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4.


Author Index

Ahn, Emily, 22
Alzetta, Chiara, 4

Bender, Emily M., 16
Buis, Annebeth, 13

de Almeida, Tifa, 16
Dell'Orletta, Felice, 4

Granroth-Wilding, Mark, 25

He, Taiqi, 10
Hovy, Eduard, 1
Howell, Kristen, 16
Hulden, Mans, 13

Loukatou, Georgia R., 19

Montemagni, Simonetta, 4
Mortensen, David R., 22

Rajagopal, Dheeraj, 1
Ri, Ryokan, 7

Sagae, Kenji, 10
Soisalon-Soininen, Eliel, 25

Tsuruoka, Yoshimasa, 7

Venturi, Giulia, 4
Vyas, Nidhi, 1

Zhang, Youyun, 16
