


Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté a,1, David Hall b, Thomas L. Griffiths c, and Dan Klein b

a Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; b Computer Science Division and c Department of Psychology, University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)

One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.

ancestral | computational | diachronic

Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, occupying historical linguists since the late 18th century. To solve this problem, linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows to prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7).

In many cases, direct evidence of the form of protolanguages is not available. Fortunately, owing to the world’s considerable linguistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions.

In this paper, we present an automated system capable of large-scale reconstruction of protolanguages directly from words that appear in modern languages. This system is based on a probabilistic model of sound change at the level of phonemes, building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically.

The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difficult to analyze a large number of languages simultaneously. The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable algorithms for reconstruction.

Being able to reconstruct large numbers of languages also makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specific hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis.
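The intuition behind functional load can be illustrated with a toy calculation. One simple proxy, assumed here purely for illustration and not the measure used in this study, is the number of minimal pairs a phonemic contrast distinguishes in a lexicon: the more word pairs kept apart solely by the contrast between two sounds, the higher the contrast's functional load, and, under the hypothesis, the less likely those sounds are to merge.

```python
from itertools import combinations

def functional_load(lexicon, p1, p2):
    """Count minimal pairs distinguished solely by the p1/p2 contrast.

    lexicon: iterable of words, each a tuple of phoneme symbols.
    A higher count means the contrast does more work in keeping
    words apart in this lexicon.
    """
    count = 0
    for w1, w2 in combinations(set(lexicon), 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        # A minimal pair differs in exactly one position, and that
        # difference must be the contrast under study.
        if len(diffs) == 1 and set(diffs[0]) == {p1, p2}:
            count += 1
    return count

# A hypothetical five-word lexicon:
lexicon = [("p", "a", "t"), ("b", "a", "t"),
           ("p", "i", "t"), ("b", "i", "t"), ("m", "a", "t")]
print(functional_load(lexicon, "p", "b"))  # 2 minimal pairs
print(functional_load(lexicon, "m", "b"))  # 1 minimal pair
```

On this toy lexicon, the p/b contrast carries a higher load than m/b, so the hypothesis would predict p and b are the less likely pair to merge.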

Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board.

Freely available online through the PNAS open access option.

See Commentary on page 4159.

1 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.

4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11 www.pnas.org/cgi/doi/10.1073/pnas.1204678110



Model

We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages’ phylogenetic relationships. The tree’s internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes, using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over possible changes that a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike simple molecular InDel models used in computational biology such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language—words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, …, C}, let L(c) denote the subset of modern languages that have a word form in the cth cognate set. For each set c ∈ {1, …, C} and each language ℓ ∈ L(c), we denote the modern word form by w_cℓ. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion are also dependent on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

If cognate assignments are not available, our system can be applied just to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, and there are thus G groups of words. The model is then defined as in the previous case, with words indexed as w_gℓ. Second, the generation process is augmented with a notion of innovation, wherein a word w_gℓ′ in some language ℓ′ may instead be generated independently from its parent word w_gℓ. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is “cut” at a language when innovation happens, and so the word begins anew. The probability of innovation in any given

Table 1. Sample of reconstructions produced by the system

*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.





language is initially unknown and must be learned automatically along with the other branch-specific model parameters.
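The generative process described above can be sketched in a few lines of code. Everything below — the phoneme inventory, the edit probabilities, and the innovation rate — is an illustrative placeholder, not the learned, branch-specific parameterization the model actually uses.

```python
import random

PHONEMES = list("ptkbdgmnaiu")  # toy inventory standing in for IPA

def sample_root(length_range=(2, 5)):
    """Placeholder root language model: a uniform distribution over strings."""
    return [random.choice(PHONEMES) for _ in range(random.randint(*length_range))]

def mutate(word, branch_params):
    """One pass of stochastic edits; probabilities may depend on context."""
    out = []
    for i, ph in enumerate(word):
        left = word[i - 1] if i > 0 else "#"       # context: preceding symbol
        p_sub, p_del, p_ins = branch_params(ph, left)
        r = random.random()
        if r < p_del:
            continue                                # deletion
        elif r < p_del + p_sub:
            out.append(random.choice(PHONEMES))     # substitution
        else:
            out.append(ph)                          # faithful copy
        if random.random() < p_ins:
            out.append(random.choice(PHONEMES))     # insertion after ph
    return out

def evolve(tree, root_word, branch_params, p_innovation=0.05):
    """tree: {language: [children]}, root listed first; returns a form per language."""
    forms = {}
    def recurse(lang, w):
        forms[lang] = w
        for child in tree.get(lang, []):
            if random.random() < p_innovation:
                child_w = sample_root()             # tree is "cut": word begins anew
            else:
                child_w = mutate(w, branch_params)
            recurse(child, child_w)
    recurse(next(iter(tree)), root_word)
    return forms

tree = {"Proto": ["A", "B"], "A": ["A1", "A2"], "B": []}
uniform_params = lambda ph, left: (0.05, 0.02, 0.02)  # context ignored in this toy
forms = evolve(tree, sample_root(), uniform_params)
```

In the real model each branch would have its own `branch_params`, estimated with expectation-maximization rather than fixed by hand, and the context sensitivity would be exploited rather than ignored.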

Results

Our results address three questions about the performance of our system. First, how well does it reconstruct protolanguages? Second, how well does it identify cognate sets? Finally, how can this approach be used to address outstanding questions in historical linguistics?

Protolanguage Reconstructions. To test our system, we applied it to a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue tree (29) (we also describe experiments with other trees in Fig. 1). For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at the time of download (August 7, 2010), including a few languages outside the Austronesian family and some manually reconstructed protolanguages used for evaluation. The total data comprised 142,661 word forms and 7,708 cognate sets. The goal was to reconstruct the word in each protolanguage that corresponded to each cognate set and to infer the patterns of sound changes along each branch in the phylogeny. See SI Appendix, Section 2 for further details of our simulations.

We used the Austronesian dataset to quantitatively evaluate the performance of our system by comparing withheld words from known languages with automatic reconstructions of those words. The Levenshtein distance between the held-out and reconstructed forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more languages helped reconstruction and also to assess the overall performance of our system. Specifically, we compared the system’s error rate on the ancestral reconstructions to a baseline and also to the amount of divergence between the reconstructions of two linguists (Fig. 1A). Given enough data, the system can achieve reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions perfectly agree with manual reconstructions, and only a few contain big errors. Refer to Table 1 for examples of reconstructions. See SI Appendix, Section 3 for the full lists.

We also present in Fig. 1B the effect of the tree topology on reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.

For comparison, we also evaluated previous automatic reconstruction methods. These previous methods do not scale to large datasets, so we performed comparisons on smaller subsets of the Austronesian dataset. We show in SI Appendix, Section 2 that our method outperforms these baselines.

We analyze the output of our system in more depth in Fig. 2 A–C, which shows the system learned a variety of realistic sound changes across the Austronesian family (30). In Fig. 2D, we show the most frequent substitution errors in the Proto-Austronesian reconstruction experiments. See SI Appendix, Section 5 for details and similar plots for the most common incorrect insertions and deletions.
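The Levenshtein-based scoring used in the evaluation above is straightforward to reproduce. A minimal sketch, with hypothetical phoneme-level forms standing in for actual reconstructions and reference protoforms:

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences; substitution,
    insertion, and deletion each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution / match
        prev = cur
    return prev[-1]

# Hypothetical (reconstructed, reference) protoform pairs:
pairs = [(("t", "a", "ŋ", "i", "s"), ("t", "a", "ŋ", "i", "s")),  # exact match
         (("l", "i", "m", "a"),      ("l", "i", "m", "a", "ʔ"))]  # one edit apart
distances = [levenshtein(rec, ref) for rec, ref in pairs]
avg_distance = sum(distances) / len(distances)
```

Operating on tuples of phoneme symbols rather than raw characters keeps multi-character IPA segments from being counted as several edits.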

Cognate Recovery. Previous reconstruction systems (22) required that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at least some reconstruction be done by hand. To demonstrate that our model can accurately infer cognate sets automatically, we used a version of our system that learns which words are cognate, starting only from raw word lists and their meanings. This system uses a faster but lower-fidelity model of sound change to infer correspondences. We then ran our reconstruction system on cognate sets that our cognate recovery system found. See SI Appendix, Section 1 for details.

This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision (the fraction of cognate pairs identified by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of labeled cognate pairs identified by our system), and pairwise F1 measure (defined as the harmonic mean of precision and recall) for the cognates found by our system against the known cognates that are encoded in the ABVD. We also report cluster purity, which is the fraction of words that are in a cluster whose known cognate group matches the cognate group of the cluster. See SI Appendix, Section 2.3 for a detailed description of the metrics.

Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of 0.918. Thus, over 9 of 10 words are correctly grouped, and our system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in historical linguistics is to deem words to be unrelated unless proved otherwise, a slight undergrouping is the desired behavior.
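The pairwise metrics and cluster purity defined above can be sketched as follows. The word clusters here are invented for illustration, not taken from the ABVD; each clustering is a list of sets of word identifiers.

```python
from itertools import combinations
from collections import Counter

def cluster_pairs(clusters):
    """All unordered word pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_scores(predicted, gold):
    p, g = cluster_pairs(predicted), cluster_pairs(gold)
    precision = len(p & g) / len(p)   # predicted pairs that are truly cognate
    recall = len(p & g) / len(g)      # true cognate pairs that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def purity(predicted, gold):
    """Fraction of words whose cluster's majority gold label matches their own."""
    label = {w: i for i, c in enumerate(gold) for w in c}
    correct = sum(Counter(label[w] for w in c).most_common(1)[0][1]
                  for c in predicted)
    return correct / sum(len(c) for c in predicted)

# Toy gold cognate sets and a prediction that undergroups one of them:
gold = [{"mata", "maca"}, {"lima", "rima", "nima"}]
pred = [{"mata", "maca"}, {"lima", "rima"}, {"nima"}]
p, r, f1 = pairwise_scores(pred, gold)
# precision 1.0, recall 0.5, purity 1.0: undergrouping costs recall, not precision
```

The toy example mirrors the behavior reported above: splitting a true cognate set lowers pairwise recall while leaving precision and purity intact, which is the conservative failure mode preferred in historical linguistics.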

[Fig. 1. (A) Reconstruction error rate (y axis, 0 to 0.500) for three conditions: a random modern-language baseline, the automated reconstruction, and the agreement between two linguists. The remainder of the figure shows the Austronesian phylogeny; the individual language labels are omitted here.]

MerinaMalaMaanyan

TunjungKerinc i

Me layuSara

Me layuOganLowMalay

Indones ian

BanjareseM

Ma layBahas

Me layuBrun

Minangkaba

PhanRangCh

ChruHainanCham

MokenIbanGayo

Idaan

BunduDusun

TimugonM

ur

KelabitBar

Bintulu

BerawanLon

Belait

KenyahLong

MelanauM

uk

Lahanan

EngganoMa l

EngganoBan

Nias

TobaBatak

Madures e

Iv atanBasc

Itbayat

Babuyan

Iraralay

Itbayaten

Isamorong

Ivasay

Imorod

Yami

SambalBoto

Kapampanga

IfugaoBayn

Ka llahanKa

Ka llahanKe

Inibaloi

Pangas inan

KakidugenI

IlongotKak

IfugaoAmga

IfugaoBata

BontokGuin

BontocGuin

KankanayNo

Balangaw

KalingaGui

ItnegBinon

Gaddang

AttaPamplo

IsnegDibag

AgtaDum

agatCas

IlokanoYogyaOldJavanesW

es ternMa layoPolynes ian

BugineseSo

Ma loh

Makassa rTaeSToraja

SangirTabu

SangirSangilSa raBantikKaidipang

Goronta loHBolaangM

onPopaliaBonerateW

unaM

unaKa tobuTontem

boan

T on s ea

Bang

g ai W

diBa

ree

Wol

ioM

ori

Kaya

nUm

aJu

Buka

tPu

nanK

e la i

Wam

par

Yabe

m

Num

bam

iSib

Mou

kAr

ia

Lam

oga i

Mul

Amar

a

Seng

seng

Kaul

ongA

uVBi

libil

Meg

iar

Geda

ged

Mat

ukar

Bilia

u

Men

gen

Mal

euKo

veTa

rpia

Sobe

i

Kayu

pula

uKRi

wo

Kairi

ruKis

Wog

eoAli

Auhe

lawa

Salib

a

Suau

Ma i

s in

Gum

awan

a

Diod

io

Mol

ima

Ubir

Gapa

paiw

a

Wed

au

Dobu

an

Mis

ima

Kiliv

ila

Gaba

diDo

uraKu

niRo

ro

Mek

eoLala

Vilir

upu

Mot

u

Mag

oriS

out

Kand

asTang

aSiar

Kuan

ua

Patp

atar

Rov i

ana

S im

boNduk

e

Kus a

gheHoav

a

Lung

gaLuqa

Kubokota

Ughele

Ghanongga

Vangunu

Marovo

MbarekeSolosTaiof

Teop

NehanHapeNis sanHaku

Sis ingga

BabatanaTu

VarisiGhonVaris

iRirioSengga

BabatanaAvVaghua

Babatana

BabatanaLo

BabatanaKaLungaLunga

MonoMonoAluTorauUruava

MonoFauroBilurChekeHolo

MaringeKmaMaringeTatNggaoPoroMaringeLel

ZabanaKiaLaghuSamas

BlablangaBlablangaGKokotaKilokakaYsBanoniMadaraTungagTungKaraWes tNalikTiangTigakL ihirSunglMadakLamas

BarokMaututuNakanaiBilLaka la iVituYapes eMbughotuDhBugotuTalis eMoli

Talis eMalaGhariNdiGhariToloTalis e

Ma langoGhariNggaeMbirao

GhariTandaTalisePoleGhariNggerTalis eKoo

GhariNginiLengo

LengoGhaimLengoParip

Nggela

ToambaitaKwai

LangalangaKwaio

MbaengguuLau

Mbaele lea

KwaraaeSolFata leka

LauWa ladeLauNorthLonggu

SaaUkiNiMa

SaaSaaVillOroha

SaaUlawaSaa

Dorio

AreareMaas

AreareWaia

SaaAuluVil

BauroBaroo

BauroHaunu

Fagani

KahuaMami

SantaCatal

Aros iOneib

Aros iTawatAros i

Kahua

BauroPawaV

FaganiAguf

FaganiRihu

SantaAna

Tawaroga

TannaSouth

Kwamera

Lenakel

SyeErromanUra

ArakiSouthMere i

AmbrymSoutMota

PeteraraMa

PaameseSou

Mwotlap

Raga

SouthEfate

Orkon

Nguna

Namakir

Nati

Nahavaq

NeseTape

AnejomAnei

SakaoPortO

AvavaNeveei

Naman

Puluwatese

PuloAnnanW

oleaian

Satawalese

ChuukeseAKM

ortlockes

Carolinian

ChuukeseSonsoroles

SaipanCaroPuloAnnaW

oleai

Mokiles e

PingilapesPonapeanKiriba tiKus a ie

Marsha lles

NauruNelem

waJawe

CanalaIaai

NengoneDehu

Rotuman

WesternF ij

Maori

TahitiPenrhyn

Tuamotu

Manihiki

RurutuanTahitianM

oRarotongan

TahitianthM

arquesanM

angarevaHawaiian

RapanuiEasSam

oanBellonaTikopia

Emae

AnutaRennellese

FutunaAniwIfiraM

eleMUveaW

estFutunaEast

VaeakauTauTakuu

Tuva luSikaiana

Kapingamar

NukuoroLuangiua

PukapukaTokelau

Uve aEa stTon gan

NiueF ij ia nB au

T ean

uVa

noT a

nem

aBu

ma

Asum

boa

Tani

mbi

liNe

mba

oSe

imat

Wuv

ulu

Lou

Naun

aLe

ipon

Sivi

saTi

taLi

kum

Leve

iLo

niu

Mus

sau

Kas i

raIra

hGi

man

Buli

Mor

Min

yaifu

inBi

gaM

isoo

lAs War

open

Num

for

Mar

auAm

baiY

apen

Win

des i

Wan

North

Baba

rDa

wera

Dawe

Dai

Tela

Mas

bua

Imro

ing

Empl

awas

Eas t

Mas

e la

Seril

iCe

ntra

lMas

Sout

hEas

tBW

estD

amar

Tara

ngan

Ba

UjirN

Aru

Ngai

borS

Ar

Gese

rW

atub

ela

Ela t

KeiB

es

Hitu

Ambo

n

SouA

man

aTe

Amah

aiPa

uloh

iAl

une

Mur

nate

nAl

Bonf

iaW

erin

ama

Buru

Nam

rol

Sobo

yoM

ambo

ruM

angg

arai

Ngad

haPo

ndok

Kodi

Wej

ewaT

ana

Bim

aSo

aSa

vuW

anuka

ka

Gaur

aNgg

au

Eas tSumban

Baliled

o

Lamboya

L ioF loresT

Ende

NageKambera

PalueNitun

Anakalang

Sekar

PulauArgun

RotiTerm

an

Atoni

Mambai

TetunTerik

Kemak

Lamalera le

LamaholotI

S ika

Kedang

Era i

TalurAputai

Perai

Tugun

Iliun

Serua

NilaTeunKis ar

Roma

Eas tDamar

LetineseYamdena

KeiTanimba

Se laru

KoiwaiIriaChamorro

Lampung

Komering

Sunda

RejangRe ja

TboliTagab

TagabiliKoronadalB

SaranganiB

BilaanKoroBilaanSara

CiuliAtayaSquliqAtay

SediqBunun

ThaoFavorlang

RukaiPaiwanKanakanabu

SaaroaTsouPuyumaKav alanBasaiCentralAmiS iraya

PazehSa is ia t

N/A

B

0 10 20 30 40 50 600.3

0.35

0.4

0.45

0.5

0.55

Number of modern languages

Rec

onst

ruct

ion

erro

r ra

te

C

Rec

onst

ruct

ion

erro

r ra

te

Effect of tree quality

Effect of tree sizeFig. 1. Quantitative validation ofreconstructions and identificationof some important factors influ-encing reconstruction quality. (A)Reconstruction error rates for abaseline (which consists of pickingone modern word at random), oursystem, and the amount of dis-agreement between two linguist’smanual reconstructions. Reconstruc-tion error rates are Levenshteindistances normalized by the meanword form length so that errorscan be compared across languages.Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) Theeffect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ranon an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant.On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between usingthe consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Re-construction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected togo down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled “Ethnologue”in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in Cis restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).

4226 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al.


Because we are ultimately interested in reconstruction, we then compared our reconstruction system's ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words, using our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different when averaging per cognate group.) Compared with reconstruction from manually labeled cognate sets, automatically identified cognates led to an increase in error rate of only 12.8%, with a significant reduction in the cost of curating linguistic databases. See SI Appendix, Fig. S1 for the fraction of words with each Levenshtein distance for these reconstructions.
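The evaluation metric used above (average Levenshtein distance, normalized by mean word-form length so rates are comparable across languages) can be sketched as follows; the word forms in the example are hypothetical, not data from the paper.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def error_rate(predictions, references):
    # Mean edit distance to the reference reconstructions, normalized
    # by the mean reference word-form length.
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    mean_len = sum(len(r) for r in references) / len(references)
    return total / len(references) / mean_len

# Hypothetical reconstructions vs. reference protoforms:
print(error_rate(["patu", "ikan"], ["batu", "ikan"]))  # → 0.125
```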

Functional Load. To demonstrate the utility of large-scale reconstruction of protolanguages, we used the output of our system to investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that the probability that a sound will change over time is related to the amount of information provided by that sound. Intuitively, if two phonemes appear only in words that are differentiated from one another by at least one other sound, then one can argue that no information is lost if those phonemes merge together, because no new ambiguous forms can be created by the merger.

A first step toward quantitatively testing the FLH was taken in 1967 (7). By defining a statistic that formalizes the amount of information lost when a language undergoes a certain sound change—on the basis of the proportion of words that are discriminated by each pair of phonemes—it became possible to evaluate the empirical support for the FLH. However, this initial investigation was based on just four languages and found little evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number of languages and sound changes considered, although they provided no positive counterevidence.

Using the output of our system, we collected sound change statistics from our reconstruction of 637 Austronesian languages, including the probability of a particular change as estimated by our system. These statistics provided the information needed to give a more comprehensive quantitative evaluation of the FLH, using a much larger sample than previous work (details in SI Appendix, Section 2.4). We show in Fig. 3 A and B that this analysis provides clear quantitative evidence in favor of the FLH. The revealed pattern would not be apparent had we not been able to reconstruct large numbers of protolanguages and supply probabilities of different kinds of change taking place for each pair of languages.
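A simple minimal-pair-style proxy for the functional-load statistic (the fraction of word pairs that would collapse into homophones if two phonemes merged) can be sketched as below. This is an illustration of the idea only, not the exact statistic of ref. 7, and the toy lexicon is a made-up example.

```python
from itertools import combinations

def functional_load(lexicon, x, y):
    """Fraction of distinct word pairs that become homophones if
    phonemes x and y merge. A minimal-pair-based proxy for the
    information lost by the merger, not the exact statistic of ref. 7."""
    def merge(word):
        # Collapse the x/y contrast by rewriting every y as x.
        return tuple(x if p == y else p for p in word)
    pairs = list(combinations(lexicon, 2))
    if not pairs:
        return 0.0
    collided = sum(1 for a, b in pairs if a != b and merge(a) == merge(b))
    return collided / len(pairs)

# Hypothetical toy lexicon, words as phoneme tuples:
lex = [("p", "a"), ("b", "a"), ("t", "a")]
print(functional_load(lex, "p", "b"))  # one of three pairs collapses
```

Under the FLH, mergers of phoneme pairs with low values of such a statistic should be more probable than mergers of high-load pairs.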

Discussion

We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages.

In developing an automated system for reconstructing ancient languages, it is by no means our goal to replace the careful reconstructions performed by linguists. It should be emphasized that the reconstruction mechanism used by our system ignores many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer

[Figure 2 panels A–D: the Austronesian phylogenetic tree with numbered branches (A); an IPA chart with arcs marking the most supported sound changes, numbered 1–12 (B); a zoomed view of the Oceanic portion of the tree (C); and a plot of the most common substitution errors, y-axis "Fraction of substitution errors" (D).]

Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system in SI Appendix, Section 4. (B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible. Several attested sound changes, such as debuccalization to Maori and the place of articulation change /t/ > /k/ to Hawaiian (30), are successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error (p, v) could arise from a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are common sources of disagreement.

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227

COMPUTER SCIENCES | PSYCHOLOGICAL AND COGNITIVE SCIENCES | SEE COMMENTARY


formalism, but other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by a language and the lack of morphological analysis. Challenges specific to the cognate inference task, for example difficulties with polymorphisms, are also discussed in more detail in SI Appendix. Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption violated by borrowing, dialect variation, and creole languages. However, we believe our system will be useful to linguists in several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using the system to propose short lists of potential sound changes and correspondences across highly divergent word forms.

An exciting possible application of this work is to use the

model described here to infer the phylogenetic relationships between languages jointly with reconstructions and cognate sets. This will remove a source of circularity present in most previous computational work in historical linguistics. Systems for inferring phylogenies such as ref. 13 generally assume that cognate sets are given as a fixed input, but cognacy as determined by linguists is in turn motivated by phylogenetic considerations. The phylogenetic tree hypothesized by the linguist is therefore affecting the tree built by systems using only these cognates. This problem can be avoided by inferring cognates at the same time as a phylogeny, something that should be possible using an extended version of our probabilistic model.

Our system is able to reconstruct the words that appear in

ancient languages because it represents words as sequences of sounds and uses a rich probabilistic model of sound change. This is an important step forward from previous work applying computational ideas to historical linguistics. By leveraging the full sequence information available in the word forms in modern languages, we hope to see in historical linguistics a breakthrough similar to the advances in evolutionary biology prompted by the transition from morphological characters to molecular sequences in phylogenetic analysis.

Materials and Methods

This section provides a more detailed specification of our probabilistic model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.

Distributions. The conditional distributions over pairs of evolving strings are specified using a lexicalized stochastic string transducer (33).

Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = w_{cℓ′}. The generative process for producing y = w_{cℓ} works as follows. First, we consider x to be composed of characters x₁x₂…xₙ, with the first and last ones being a special boundary symbol x₁ = # ∈ Σ, which is never deleted, mutated, or created. The process generates y = y₁y₂…yₙ in n chunks yᵢ ∈ Σ*, i ∈ {1, …, n}, one for each xᵢ. The yᵢ may be a single character, multiple characters, or even empty. To generate yᵢ, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty yᵢ. First, we decide whether the current phoneme in the top word, t = xᵢ, will be deleted, in which case yᵢ = ε (the probabilities of the decisions taken in this process depend on a context to be specified shortly). If t is not deleted, we choose a single substitution character in the bottom word. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of yᵢ₋₁) and the current character in the previous generation string (t), providing a way to make changes context sensitive. This multinomial decision acts as the initial distribution of the mutation Markov chain. We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to yᵢ. In this case, the conditioning environment is t = xᵢ and the current rightmost symbol p in yᵢ. Insertions continue until ζ is selected. We use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the probabilities over the substitution and insertion decisions in the current branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or when an innovation happens at some language ℓ, treating this word as a single string y₁ generated from a dummy ancestor t = x₁. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root or in innovative languages, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ ℝ^{|Σ|+1}, where ω ∈ {R, S, I}. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random variable n_{cℓ} for each language in the tree. When known cognate groups are assumed, n_{cℓ} is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters ν_ℓ.
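The per-character generative step described above can be sketched as follows. This is a minimal illustration: the alphabet and the degenerate distributions standing in for the learned multinomials θ are assumptions, and the '#' boundary symbol of the full model is omitted for brevity.

```python
import random

SIGMA = list("ptkbdgaiu")   # toy phoneme alphabet (an assumption)
ZETA = "<zeta>"             # special outcome: delete / stop inserting

def evolve_word(x, subst_probs, ins_probs):
    """Generate a child word y from parent x, one chunk y_i per parent
    character x_i. subst_probs(t, p) and ins_probs(t, p) return weight
    vectors over SIGMA + [ZETA], conditioned on the parent character t
    and the rightmost character p generated so far."""
    y, p = [], "#"
    for t in x:
        # Substitution decision: drawing ZETA here deletes x_i (y_i = ε).
        out = random.choices(SIGMA + [ZETA], weights=subst_probs(t, p))[0]
        if out == ZETA:
            continue
        chunk = [out]
        # Insertion decisions: keep appending characters until ZETA.
        while True:
            out = random.choices(SIGMA + [ZETA],
                                 weights=ins_probs(t, chunk[-1]))[0]
            if out == ZETA:
                break
            chunk.append(out)
        y.extend(chunk)
        p = chunk[-1]
    return "".join(y)

# Degenerate distributions that force a faithful copy of the parent:
copy = lambda t, p: [1.0 if s == t else 0.0 for s in SIGMA] + [0.0]
stop = lambda t, p: [0.0] * len(SIGMA) + [1.0]
print(evolve_word("batu", copy, stop))  # → batu
```

With non-degenerate, context-dependent weights in place of `copy` and `stop`, the same loop produces substitutions, deletions, and insertions with the context sensitivity described in the text.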

Mutation distributions confined to the family of transducers miss certain phylogenetic phenomena. For example, reduplication (as in "bye-bye") is a well-studied mechanism for deriving morphological and lexical forms that is not explicitly captured by transducers. The same situation arises with metathesis (e.g., Old English frist > English first). However, these changes are generally not regular and are therefore less informative (1). Moreover, because we are using a probabilistic framework, these events can still be handled by our system, even though their costs will simply not be discounted as much as they should be.

Note also that the generative process described in this section does not allow explicit dependencies on the next character in ℓ. Relaxing this assumption can be done in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the transducer normalization computation) (34). A simpler approach is to use the next character in the parent ℓ′ as a surrogate for the next character in ℓ. Using the context in the parent word is also more closely aligned with the standard representation of sound change used in historical linguistics, where the context is likewise defined on the parent.

More generally, dependencies limited to a bounded context on the parent string can be incorporated in our formalism. By bounded, we mean that it should be possible to fix an integer k beforehand such that all of the modeled dependencies are within k characters of the string operation. The caveat is that the computational cost of inference grows exponentially in k. We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).
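To make the exponential cost concrete: the number of distinct conditioning environments that inference must track grows as |Σ|^k when the context is bounded by k parent characters. A toy count (the 6-symbol alphabet is purely illustrative):

```python
from itertools import product

SIGMA = list("ptkaiu")  # illustrative 6-symbol alphabet

def num_contexts(k):
    """Number of distinct k-character parent contexts: |SIGMA| ** k."""
    return sum(1 for _ in product(SIGMA, repeat=k))

# Contexts grow 6, 36, 216, ... as k increases, and realistic phoneme
# inventories are considerably larger than 6 symbols.
```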

Parameterization. Instead of directly estimating the transition probabilities of the mutation Markov chain (which could be done, in principle, by taking them to be the parameters of a collection of multinomial distributions), we express them as the output of a multinomial logistic regression model (36). This

[Figure 3: two heat-map panels, A and B. In each panel, the x axis is functional load (0 to 1 × 10⁻⁴), the y axis is the merger posterior (0 to 1), and the color scale encodes counts from 1 to 10⁶ on a log scale.]

Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional load on the probability of merging two sounds. The plots shown are heat maps, where the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit square, (l, m), as explained in Materials and Methods. To convey the amount of noise one could expect from a study with the number of languages that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages in B. Only in this latter setup is structure clearly visible: most of the points with a high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.

4228 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110 Bouchard-Côté et al.


model specifies a distribution over transition probabilities by assigningweights to a set of features that describe properties of the sound changesinvolved. These features provide a more coherent representation of thetransition probabilities, capturing regularities in sound changes that reflectthe underlying linguistic structure.

We used the following feature templates: OPERATION, which identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form x > y with x = y), or the end of an insertion event; MARKEDNESS, which consists of language-specific n-gram indicator functions for all symbols in Σ (during reconstruction, only unigram and bigram features are used for computational reasons; for cognate inference, only unigram features are used); and FAITHFULNESS, which consists of indicators for mutation events of the form 1[x > y], where x ∈ Σ and y ∈ S. Feature templates similar to these can be found, for instance, in refs. 37 and 38, in the context of string-to-string transduction models used in computational linguistics. This approach to specifying the transition probabilities produces an interesting connection to stochastic optimality theory (39, 40), where a logistic regression model mediates markedness and faithfulness in the production of an output form from an underlying input form.
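A minimal sketch of how indicator features from these three templates might be computed for a single event; the feature-name encodings are hypothetical, and the real system's feature set may differ in detail:

```python
def extract_features(omega, t, p, ell, xi):
    """Active indicator features for one mutation event.

    omega : operation type, "S" (substitution step) or "I" (insertion step)
    t     : current character in the parent word
    p     : previously generated character
    ell   : language at the bottom of the branch (consumed by the
            parameter-sharing scheme described in the text; unused here)
    xi    : outcome drawn (a character, or "ZETA" for deletion/stop)
    """
    feats = set()
    # OPERATION template: what kind of edit is this?
    if xi == "ZETA":
        feats.add("OP=stop" if omega == "I" else "OP=deletion")
    elif omega == "I":
        feats.add("OP=insertion")
    elif xi == t:
        feats.add("OP=self-substitution")
    else:
        feats.add("OP=substitution")
    # MARKEDNESS template: n-gram indicators on the generated symbol.
    if xi != "ZETA":
        feats.add("MARK=" + xi)       # unigram
        feats.add("MARK=" + p + xi)   # bigram with the previous output
    # FAITHFULNESS template: indicator 1[x > y] for the mutation itself.
    feats.add("FAITH=" + t + ">" + xi)
    return feats
```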

Data sparsity is a significant challenge in protolanguage reconstruction.Although the experiments we present here use an order of magnitude morelanguages than previous computational approaches, the increase in observeddata also brings with it additional unknowns in the form of intermediateprotolanguages. Because there is one set of parameters for each language,adding more data is not sufficient to increase the quality of the recon-struction; it is important to share parameters across different branches in thetree to benefit from having observations from more languages. We used thefollowing technique to address this problem: We augment the parameteri-zation to include the current language (or language at the bottom of thecurrent branch) and use a single, global weight vector instead of a set ofbranch-specific weights. Generalization across branches is then achievedby using features that ignore ℓ, whereas branch-specific features dependon ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, andFAITHFULNESS have universal and branch-specific versions.
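The sharing scheme can be sketched as a simple doubling of each feature into a universal copy and a branch-specific copy (the name encodings are hypothetical):

```python
def share_features(base_feats, ell):
    """Duplicate each base feature into a universal version (which ignores
    the language ell) and a branch-specific version, so that a single
    global weight vector can both generalize across branches and capture
    language-specific regularities."""
    shared = set()
    for f in base_feats:
        shared.add("UNIV|" + f)              # shared across all branches
        shared.add("LANG=" + ell + "|" + f)  # specific to this branch
    return shared
```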

Using these features and parameter sharing, the logistic regression modeldefines the transition probabilities of the mutation process and the rootlanguage model to be

θω,t,p,ℓ = θω,t,p,ℓ(ξ; λ) = (exp{⟨λ, f(ω, t, p, ℓ, ξ)⟩} / Z(ω, t, p, ℓ, λ)) × μ(ω, t, ξ),   [1]

where ξ ∈ S, f : {S, I, R} × Σ × Σ × L × S → ℝ^k is the feature function (which indicates which features apply for each event), ⟨·, ·⟩ denotes the inner product, and λ ∈ ℝ^k is a weight vector. Here, k is the dimensionality of the feature space of the logistic regression model. In the terminology of exponential families, Z and μ are the normalization function and the reference measure, respectively:

Z(ω, t, p, ℓ, λ) = Σ_{ξ′ ∈ S} exp{⟨λ, f(ω, t, p, ℓ, ξ′)⟩} μ(ω, t, ξ′)

μ(ω, t, ξ) = 0 if ω = S, t = #, and ξ ≠ #
             0 if ω = R and ξ = ζ
             0 if ω ≠ R and ξ = #
             1 otherwise

Here, μ is used to handle boundary conditions, ensuring that the resultingprobability distribution is well defined.
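Put together, Eq. 1 is an ordinary multinomial logistic regression whose support is restricted by μ. A sketch under the assumption that features are represented as sets of names with a weight dictionary (both representations are hypothetical):

```python
import math

def theta(omega, t, p, ell, feats_fn, weights, outcomes, mu):
    """Transition probabilities of Eq. 1 as a softmax restricted by mu.

    feats_fn(omega, t, p, ell, xi) -> set of active feature names
    weights  : dict feature name -> weight (the vector lambda)
    outcomes : the outcome set S (characters plus the special symbol)
    mu(omega, t, xi) -> 0 or 1, the reference measure handling boundaries
    """
    scores = {}
    for xi in outcomes:
        dot = sum(weights.get(f, 0.0) for f in feats_fn(omega, t, p, ell, xi))
        scores[xi] = math.exp(dot) * mu(omega, t, xi)
    Z = sum(scores.values())  # normalizer over the allowed outcomes
    return {xi: s / Z for xi, s in scores.items()}
```

Outcomes zeroed out by μ receive probability exactly 0, and the remaining mass is renormalized to sum to 1.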

During cognate inference, the innovation Bernoulli random variables ncℓ are similarly parameterized, using a logistic regression model with two kinds of features: a global innovation feature with weight κglobal ∈ ℝ and a language-specific feature with weight κℓ ∈ ℝ. The parameter νℓ of each of these Bernoulli variables then takes the form

νℓ = 1 / (1 + exp{−κglobal − κℓ}).   [2]
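Eq. 2 is the standard logistic (sigmoid) function of the sum of the two feature weights; a direct transcription:

```python
import math

def innovation_prob(kappa_global, kappa_ell):
    """Eq. 2: the innovation probability is a logistic function of a
    global bias plus a language-specific bias."""
    return 1.0 / (1.0 + math.exp(-kappa_global - kappa_ell))
```

Large positive weights push the innovation probability toward 1, and large negative weights push it toward 0.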

ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 fromthe National Science Foundation.

1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic sound changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp 111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84:710–759.
21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia).
22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist 7:233–244.
23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto).
24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc Comput Linguist 45:15–22.
25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124.
26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions with finite-state methods. Empirical Methods on Natural Language Processing 13:1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation Within Optimality Theory, eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm), pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst Linguist Acad Sinica 1:31–94.

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229

COMPUTER SCIENCES | PSYCHOLOGICAL AND COGNITIVE SCIENCES | SEE COMMENTARY
