Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté (a,1), David Hall (b), Thomas L. Griffiths (c), and Dan Klein (b)
(a) Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; (b) Computer Science Division and (c) Department of Psychology, University of California, Berkeley, CA 94720
Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)
One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.
ancestral | computational | diachronic
Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, occupying historical linguists since the late 18th century. To solve this problem, linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows to prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7).

In many cases, direct evidence of the form of protolanguages is
not available. Fortunately, owing to the world’s considerable linguistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions.

In this paper, we present an automated system capable of
large-scale reconstruction of protolanguages directly from words that appear in modern languages. This system is based on a probabilistic model of sound change at the level of phonemes,
building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically.

The focus on problems other than reconstruction in previous
computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difficult to analyze a large number of languages simultaneously. The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable algorithms for reconstruction.

Being able to reconstruct large numbers of languages also
makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specific hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis.
Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
See Commentary on page 4159.
1To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.
4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11 www.pnas.org/cgi/doi/10.1073/pnas.1204678110
Model

We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages’ phylogenetic relationships. The tree’s internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes,
using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over possible changes that a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike simple molecular InDel models used in computational biology such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each
language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the
tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language—words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, …, C}, let L(c)
denote the subset of modern languages that have a word form in the cth cognate set. For each set c ∈ {1, …, C} and each language ℓ ∈ L(c), we denote the modern word form by w_{cℓ}. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process
defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion are also dependent on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally
millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

If cognate assignments are not available, our system can be
applied just to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, and there are thus G groups of words. The model is then defined as in the previous case, with words indexed as w_{gℓ}. Second, the generation process is augmented with a notion of innovation, wherein a word w_{gℓ′} in some language ℓ′ may instead be generated independently from its parent word w_{gℓ}. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is “cut” at a language when innovation happens, and so the word begins anew. The probability of innovation in any given
Table 1. Sample of reconstructions produced by the system
*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.
language is initially unknown and must be learned automatically along with the other branch-specific model parameters.
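Concretely, the generative process described in this section can be illustrated with a short simulation. This is a toy sketch: the phoneme inventory, uniform root language model, and flat edit probabilities below are placeholders of our own, not the system’s actual parameterization; in the real model the mutation distributions are context sensitive and learned separately for each branch.

```python
import random

PHONEMES = list("ptkaiu")  # placeholder inventory, not real IPA data

def sample_root_word(rng, max_len=5):
    # Stand-in for the learned root language model: uniform random strings.
    n = rng.randint(2, max_len)
    return "".join(rng.choice(PHONEMES) for _ in range(n))

def mutate(word, rng, p_sub=0.1, p_del=0.05, p_ins=0.05):
    # Stand-in for a branch-specific transducer: per-position
    # substitutions, deletions, and insertions with flat probabilities.
    out = []
    for ch in word:
        r = rng.random()
        if r < p_del:
            continue                          # deletion
        elif r < p_del + p_sub:
            out.append(rng.choice(PHONEMES))  # substitution
        else:
            out.append(ch)                    # copy unchanged
        if rng.random() < p_ins:
            out.append(rng.choice(PHONEMES))  # insertion after position
    return "".join(out)

def generate(tree, rng, p_innovate=0.02):
    # tree: {language: [children]} rooted at "root".
    # Returns a word form for every language in the tree.
    words = {"root": sample_root_word(rng)}
    stack = ["root"]
    while stack:
        parent = stack.pop()
        for child in tree.get(parent, []):
            if rng.random() < p_innovate:
                # Innovation: the tree is "cut" and the word begins anew.
                words[child] = sample_root_word(rng)
            else:
                words[child] = mutate(words[parent], rng)
            stack.append(child)
    return words

rng = random.Random(0)
tree = {"root": ["proto-A", "proto-B"],
        "proto-A": ["L1", "L2"], "proto-B": ["L3"]}
words = generate(tree, rng)
```

Inference in the actual system runs in the opposite direction: only the leaf words are observed, and the root and internal words are recovered by Monte Carlo inference.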
Results

Our results address three questions about the performance of our system. First, how well does it reconstruct protolanguages? Second, how well does it identify cognate sets? Finally, how can this approach be used to address outstanding questions in historical linguistics?
Protolanguage Reconstructions. To test our system, we applied it to a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue tree (29) (we also describe experiments with other trees in Fig. 1). For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at the time of download (August 7, 2010), including a few languages outside the Austronesian family and some manually reconstructed protolanguages used for evaluation. The total data comprised 142,661 word forms and 7,708 cognate sets. The goal was to reconstruct the word in each protolanguage that corresponded to each cognate set and to infer the patterns of sound changes along each branch in the phylogeny. See SI Appendix, Section 2 for further details of our simulations.

We used the Austronesian dataset to quantitatively evaluate the
performance of our system by comparing withheld words from known languages with automatic reconstructions of those words. The Levenshtein distance between the held-out and reconstructed forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more languages helped reconstruction and also to assess the overall performance of our system. Specifically, we compared the system’s error rate on the ancestral reconstructions to a baseline and also to the amount of divergence between the reconstructions of two linguists (Fig. 1A). Given enough data, the system can achieve reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions perfectly agree with manual reconstructions, and only a few contain large errors. Refer to Table 1 for examples of reconstructions. See SI Appendix, Section 3 for the full lists.

We also present in Fig. 1B the effect of the tree topology on
reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase
except from 32 to 64 languages, where the average edit distance improvement was 0.05.

For comparison, we also evaluated previous automatic reconstruction methods. These previous methods do not scale to large datasets, so we performed comparisons on smaller subsets of the Austronesian dataset. We show in SI Appendix, Section 2 that our method outperforms these baselines.

We analyze the output of our system in more depth in Fig. 2
A–C, which shows that the system learned a variety of realistic sound changes across the Austronesian family (30). In Fig. 2D, we show the most frequent substitution errors in the Proto-Austronesian reconstruction experiments. See SI Appendix, Section 5 for details and similar plots for the most common incorrect insertions and deletions.
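The error metric used throughout these experiments, Levenshtein distance normalized by mean word form length, is straightforward to implement. The following is a minimal sketch; the paper’s exact averaging choices are described in SI Appendix and may differ in detail.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def error_rate(predicted, reference):
    # Mean edit distance divided by mean reference word length,
    # so that error rates are comparable across languages.
    total_dist = sum(levenshtein(p, r) for p, r in zip(predicted, reference))
    mean_len = sum(len(r) for r in reference) / len(reference)
    return total_dist / len(reference) / mean_len

# e.g. levenshtein("lima", "rima") == 1
```

Under this normalization, an error rate of 0.25 on four-phoneme words corresponds to roughly one wrong phoneme per word.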
Cognate Recovery. Previous reconstruction systems (22) required that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at least some reconstruction be done by hand. To demonstrate that our model can accurately infer cognate sets automatically, we used a version of our system that learns which words are cognate, starting only from raw word lists and their meanings. This system uses a faster but lower-fidelity model of sound change to infer correspondences. We then ran our reconstruction system on cognate sets that our cognate recovery system found. See SI Appendix, Section 1 for details.

This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision (the fraction of cognate pairs identified by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of labeled cognate pairs identified by our system), and pairwise F1 measure (defined as the harmonic mean of precision and recall) for the cognates found by our system against the known cognates that are encoded in the ABVD. We also report cluster purity, which is the fraction of words that are in a cluster whose known cognate group matches the cognate group of the cluster. See SI Appendix, Section 2.3 for a detailed description of the metrics.

Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of 0.918. Thus, over 9 of 10 words are correctly grouped, and our system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in historical linguistics is to deem words to be unrelated unless proved otherwise, a slight undergrouping is the desired behavior.
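The pairwise metrics and cluster purity just defined can be computed directly from a predicted and a gold clustering. The sketch below is a toy illustration; the variable names and the tiny example clusterings are our own, not drawn from the ABVD.

```python
from itertools import combinations
from collections import Counter

def pairs(clusters):
    # All unordered word pairs that share a cluster.
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_scores(predicted, gold):
    # Pairwise precision, recall, and F1 over same-cluster word pairs.
    P, G = pairs(predicted), pairs(gold)
    precision = len(P & G) / len(P) if P else 1.0
    recall = len(P & G) / len(G) if G else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def purity(predicted, gold_label):
    # Fraction of words whose cluster's majority gold label matches their own.
    total = sum(len(c) for c in predicted)
    correct = sum(Counter(gold_label[w] for w in c).most_common(1)[0][1]
                  for c in predicted)
    return correct / total

# Toy example: the system split one true cognate set into two clusters.
pred = [{"mata", "maca"}, {"ikan"}]
gold = [{"mata", "maca", "ikan"}]
scores = pairwise_scores(pred, gold)
```

In the toy example, undergrouping yields perfect precision but reduced recall, mirroring the precision/recall asymmetry reported above.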
[Fig. 1 graphic omitted in this extraction: a large phylogeny of the Austronesian languages together with panels titled “Effect of tree quality” and “Effect of tree size”; recoverable axis and legend labels include “Reconstruction error rate,” “Number of modern languages,” “Random modern,” “Automated reconstruction,” and “Agreement between two linguists.”]

Fig. 1. Quantitative validation of reconstructions and identification of some important factors influencing reconstruction quality. (A) Reconstruction error rates for a baseline (which consists of picking one modern word at random), our system, and the amount of disagreement between two linguists’ manual reconstructions. Reconstruction error rates are Levenshtein distances normalized by the mean word form length so that errors can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant. On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Reconstruction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled “Ethnologue” in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).
Because we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words, using our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different when averaging per cognate group.) Compared with reconstruction from manually labeled cognate sets, automatically identified cognates led to an increase in error rate of only 12.8%, while significantly reducing the cost of curating linguistic databases. See SI Appendix, Fig. S1 for the fraction of words with each Levenshtein distance for these reconstructions.
Functional Load. To demonstrate the utility of large-scale reconstruction of protolanguages, we used the output of our system to investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that the probability that a sound will change over time is related to the amount of information provided by that sound. Intuitively, if two phonemes appear only in words that are differentiated from one another by at least one other sound, then one can argue that no information is lost if those phonemes merge, because no new ambiguous forms can be created by the merger.

A first step toward quantitatively testing the FLH was taken in 1967 (7). By defining a statistic that formalizes the amount of information lost when a language undergoes a certain sound change, based on the proportion of words that are discriminated by each pair of phonemes, it became possible to evaluate the empirical support for the FLH. However, this initial investigation was based on just four languages and found little evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number of languages and sound changes considered, although they provided no positive counterevidence.

Using the output of our system, we collected sound change statistics from our reconstruction of 637 Austronesian languages, including the probability of a particular change as estimated by our system. These statistics provided the information needed to give a more comprehensive quantitative evaluation of the FLH, using a much larger sample than previous work (details in SI Appendix, Section 2.4). We show in Fig. 3 A and B that this analysis provides clear quantitative evidence in favor of the FLH. The revealed pattern would not be apparent had we not been able to reconstruct large numbers of protolanguages and supply probabilities of different kinds of change taking place for each pair of languages.
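A simplified version of the functional-load statistic (in the spirit of King, ref. 7) counts the fraction of word pairs in a lexicon that are distinguished only by a given phoneme pair, so that merging the pair would make them homophones. The statistic actually used in our analysis is detailed in SI Appendix, Section 2.4; the sketch below, with a toy lexicon, is purely illustrative.

```python
from itertools import combinations

def functional_load(lexicon, x, y):
    """Fraction of distinct word pairs that would collide if phonemes
    x and y merged (a simplified, King-style functional load)."""
    n_pairs = 0
    n_merged = 0
    for w1, w2 in combinations(lexicon, 2):
        n_pairs += 1
        # Collapse y into x and check whether the two words become identical.
        if w1 != w2 and w1.replace(y, x) == w2.replace(y, x):
            n_merged += 1
    return n_merged / n_pairs if n_pairs else 0.0

# Toy example: "pat"/"bat" are distinguished only by p vs. b, so merging
# /p/ and /b/ collapses one of the six word pairs in this lexicon.
lexicon = ["pat", "bat", "pin", "tip"]
load = functional_load(lexicon, "p", "b")
```

Under the FLH, sound changes x > y with low values of this quantity should be the ones most likely to occur, which is the pattern visible in Fig. 3B.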
Discussion

We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages.

In developing an automated system for reconstructing ancient languages, it is by no means our goal to replace the careful reconstructions performed by linguists. It should be emphasized that the reconstruction mechanism used by our system ignores many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer
[Fig. 2 appears here: (A) the Austronesian phylogenetic tree of 637 languages, with a number in parentheses on each branch indexing its most prominent inferred sound change; (B) an IPA chart linking the most supported sound changes; (C) a zoomed view of the Oceanic languages, with the Nuclear Polynesian (i) and Polynesian (ii) families marked; (D) a chart of the most common substitution errors in the Proto-Austronesian reconstructions. Larger versions of each quadrant are in SI Appendix, Figs. S2–S5.]
Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system in SI Appendix, Section 4. (B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible. Several attested sound changes, such as debuccalization to Maori and the place of articulation change /t/ > /k/ to Hawaiian (30), are successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error, (p, v), could arise from a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are common sources of disagreement.
Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227
COMPUTER SCIENCES | PSYCHOLOGICAL AND COGNITIVE SCIENCES | SEE COMMENTARY
formalism, but other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by a language and the lack of morphological analysis. Challenges specific to the cognate inference task, for example difficulties with polymorphisms, are also discussed in more detail in SI Appendix. Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption violated by borrowing, dialect variation, and creole languages. However, we believe our system will be useful to linguists in several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using the system to propose short lists of potential sound changes and correspondences across highly divergent word forms.

An exciting possible application of this work is to use the model described here to infer the phylogenetic relationships between languages jointly with reconstructions and cognate sets. This will remove a source of circularity present in most previous computational work in historical linguistics. Systems for inferring phylogenies such as ref. 13 generally assume that cognate sets are given as a fixed input, but cognacy as determined by linguists is in turn motivated by phylogenetic considerations. The phylogenetic tree hypothesized by the linguist therefore affects the tree built by systems using only these cognates. This problem can be avoided by inferring cognates at the same time as a phylogeny, something that should be possible using an extended version of our probabilistic model.

Our system is able to reconstruct the words that appear in ancient languages because it represents words as sequences of sounds and uses a rich probabilistic model of sound change. This is an important step forward from previous work applying computational ideas to historical linguistics. By leveraging the full sequence information available in the word forms in modern languages, we hope to see in historical linguistics a breakthrough similar to the advances in evolutionary biology prompted by the transition from morphological characters to molecular sequences in phylogenetic analysis.
Materials and Methods

This section provides a more detailed specification of our probabilistic model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.

Distributions. The conditional distributions over pairs of evolving strings are specified using a lexicalized stochastic string transducer (33).
Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = w_{cℓ′}. The generative process for producing y = w_{cℓ} works as follows. First, we consider x to be composed of characters x₁x₂…xₙ, with the first and last ones being a special boundary symbol x₁ = # ∈ Σ, which is never deleted, mutated, or created. The process generates y = y₁y₂…yₙ in n chunks yᵢ ∈ Σ*, i ∈ {1, …, n}, one for each xᵢ. The yᵢs may be a single character, multiple characters, or even empty. To generate yᵢ, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty yᵢ. First, we decide whether the current phoneme in the top word t = xᵢ will be deleted, in which case yᵢ = ε (the probabilities of the decisions taken in this process depend on a context to be specified shortly). If t is not deleted, we choose a single substitution character in the bottom word. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of yᵢ₋₁) and the current character in the previous generation string (t), providing a way to make changes context sensitive. This multinomial decision acts as the initial distribution of the mutation Markov chain. We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to yᵢ. In this case, the conditioning environment is t = xᵢ and the current rightmost symbol p in yᵢ. Insertions continue until ζ is selected. We use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the probabilities over the substitution and insertion decisions in the current branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or when an innovation happens at some language ℓ, treating this word as a single string y₁ generated from a dummy ancestor t = x₁. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root or innovative languages, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ ℝ^{|Σ|+1}, where ω ∈ {R, S, I}. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random variable n_{gℓ} for each language in the tree. When known cognate groups are assumed, n_{cℓ} is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters ν_ℓ.
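The branch-wise generative process above (per parent character: delete, or substitute and then insert until the stop outcome ζ) can be simulated as follows. The toy phoneme inventory and the uniform/biased sampling distributions are illustrative stand-ins for the learned, context-dependent multinomials θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ}.

```python
import random

SIGMA = list("ptkbdgaiu")   # toy phoneme inventory (hypothetical)
ZETA = None                 # special outcome ζ: deletion / stop-insertion

def sample_substitution(t, p):
    # In the real model this multinomial depends on t (parent character)
    # and p (previous output character); here it is uniform over Σ ∪ {ζ}.
    return random.choice(SIGMA + [ZETA])

def sample_insertion(t, p):
    # Stop inserting (draw ζ) with probability 0.5; otherwise insert.
    return ZETA if random.random() < 0.5 else random.choice(SIGMA)

def evolve(parent):
    """Generate a child word from a parent word, one chunk y_i per x_i."""
    out = []
    prev = "#"
    for t in "#" + parent + "#":
        if t == "#":            # boundary: never deleted, mutated, or created
            out.append(t)
            prev = t
            continue
        s = sample_substitution(t, prev)
        if s is not ZETA:       # ζ in the substitution step means deletion,
            out.append(s)       # and deleted characters allow no insertions
            prev = s
            while True:         # insertions continue until ζ is drawn
                ins = sample_insertion(t, prev)
                if ins is ZETA:
                    break
                out.append(ins)
                prev = ins
    return "".join(out).strip("#")

random.seed(0)
child = evolve("pati")
```

Each call yields one sample from the toy branch distribution; in the actual model these probabilities are estimated from data and made context sensitive, as described next.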
Mutation distributions confined to the family of transducers miss certain phylogenetic phenomena. For example, the process of reduplication (as in "bye-bye") is a well-studied mechanism for deriving morphological and lexical forms that is not explicitly captured by transducers. The same situation arises in metatheses (e.g., Old English frist > English first). However, these changes are generally not regular and therefore less informative (1). Moreover, because we are using a probabilistic framework, these events can still be handled in our system, even though their costs will simply not be as discounted as they should be.

Note also that the generative process described in this section does not allow explicit dependencies on the next character in ℓ. Relaxing this assumption can be done in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the transducer normalization computation) (34). A simpler approach is to use the next character in the parent ℓ′ as a surrogate for the next character in ℓ. Using the context in the parent word is also more aligned with the standard representation of sound change used in historical linguistics, where the context is defined on the parent as well.

More generally, dependencies limited to a bounded context on the parent string can be incorporated in our formalism. By bounded, we mean that it should be possible to fix an integer k beforehand such that all of the modeled dependencies are within k characters of the string operation. The caveat is that the computational cost of inference grows exponentially in k. We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).
Parameterization. Instead of directly estimating the transition probabilities of the mutation Markov chain (which could be done, in principle, by taking them to be the parameters of a collection of multinomial distributions), we express them as the output of a multinomial logistic regression model (36). This
[Fig. 3 appears here: two heat maps, A and B, plotting merger posterior (vertical axis, 0 to 1) against functional load (horizontal axis, 0 to 1×10⁻⁴).]

Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional load on the probability of merging two sounds. The plots shown are heat maps where the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit square, (l, m), as explained in Materials and Methods. To convey the amount of noise one could expect from a study with the number of languages that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages in B. Only in this latter setup is structure clearly visible: Most of the points with high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.
model specifies a distribution over transition probabilities by assigning weights to a set of features that describe properties of the sound changes involved. These features provide a more coherent representation of the transition probabilities, capturing regularities in sound changes that reflect the underlying linguistic structure.

We used the following feature templates: OPERATION, which identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form x > y, x = y), or the end of an insertion event; MARKEDNESS, which consists of language-specific n-gram indicator functions for all symbols in Σ (during reconstruction, only unigram and bigram features are used for computational reasons; for cognate inference, only unigram features are used); and FAITHFULNESS, which consists of indicators for mutation events of the form 1[x > y], where x ∈ Σ, y ∈ S. Feature templates similar to these can be found, for instance, in the work of refs. 37 and 38, in the context of string-to-string transduction models used in computational linguistics. This approach to specifying the transition probabilities produces an interesting connection to stochastic optimality theory (39, 40), where a logistic regression model mediates markedness and faithfulness of the production of an output form from an underlying input form.

Data sparsity is a significant challenge in protolanguage reconstruction. Although the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed data also brings with it additional unknowns in the form of intermediate protolanguages. Because there is one set of parameters for each language, adding more data is not sufficient to increase the quality of the reconstruction; it is important to share parameters across different branches in the tree to benefit from having observations from more languages. We used the following technique to address this problem: We augment the parameterization to include the current language (or language at the bottom of the current branch) and use a single, global weight vector instead of a set of branch-specific weights. Generalization across branches is then achieved by using features that ignore ℓ, whereas branch-specific features depend on ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, and FAITHFULNESS have universal and branch-specific versions.
Using these features and parameter sharing, the logistic regression model defines the transition probabilities of the mutation process and the root language model to be

θ_{ω,t,p,ℓ} = θ_{ω,t,p,ℓ}(ξ; λ) = (exp{⟨λ, f(ω, t, p, ℓ, ξ)⟩} / Z(ω, t, p, ℓ, λ)) × μ(ω, t, ξ),   [1]

where ξ ∈ S, f : {S, I, R} × Σ × Σ × L × S → ℝ^k is the feature function (which indicates which features apply for each event), ⟨·, ·⟩ denotes inner product, and λ ∈ ℝ^k is a weight vector. Here, k is the dimensionality of the feature space of the logistic regression model. In the terminology of exponential families, Z and μ are the normalization function and the reference measure, respectively:

Z(ω, t, p, ℓ, λ) = Σ_{ξ′ ∈ S} exp{⟨λ, f(ω, t, p, ℓ, ξ′)⟩}

μ(ω, t, ξ) =
  0 if ω = S, t = #, ξ ≠ #
  0 if ω = R, ξ = ζ
  0 if ω ≠ R, ξ = #
  1 otherwise.

Here, μ is used to handle boundary conditions, ensuring that the resulting probability distribution is well defined.
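Eq. 1 can be sketched in code as a featurized softmax weighted by the reference measure μ. The tiny alphabet, the feature names, and the weight values below are our own illustrations; the real model uses the OPERATION, MARKEDNESS, and FAITHFULNESS templates, each with universal and branch-specific versions. For simplicity the sketch normalizes over the μ-feasible outcomes so that it always returns a proper distribution.

```python
import math

SIGMA = ["#", "p", "b"]      # toy alphabet (illustrative)
ZETA = "ζ"                   # deletion / stop-insertion outcome
OUTCOMES = SIGMA + [ZETA]    # S = Σ ∪ {ζ}

def features(omega, t, p, lang, xi):
    """Sparse feature function f(ω, t, p, ℓ, ξ): universal features plus
    branch-specific copies, mirroring the parameter-sharing scheme above."""
    base = [f"OP:{omega}:{'del' if xi == ZETA else 'sub'}",
            f"FAITH:{t}>{xi}"]
    return base + [f"{lang}|{g}" for g in base]

def mu(omega, t, xi):
    """Reference measure handling boundary conditions, as in Eq. 1."""
    if omega == "S" and t == "#" and xi != "#":
        return 0.0
    if omega == "R" and xi == ZETA:
        return 0.0
    if omega != "R" and xi == "#":
        return 0.0
    return 1.0

def theta(omega, t, p, lang, weights):
    """Distribution over outcomes ξ ∈ S for one mutation decision."""
    scores = {xi: mu(omega, t, xi) * math.exp(
                  sum(weights.get(g, 0.0)
                      for g in features(omega, t, p, lang, xi)))
              for xi in OUTCOMES}
    z = sum(scores.values())
    return {xi: s / z for xi, s in scores.items()}

# A positive weight on the faithful event p > p makes copying most likely.
dist = theta("S", "p", "#", "toy-lang", {"FAITH:p>p": 2.0})
```

Because universal features fire on every branch while branch-specific copies fire only for their language, a single global weight vector λ can share statistical strength across the whole tree yet still specialize per branch.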
During cognate inference, the innovation Bernoulli random variables ν_{gℓ} are similarly parameterized, using a logistic regression model with two kinds of features: a global innovation feature κ_global ∈ ℝ and a language-specific feature κ_ℓ ∈ ℝ. The likelihood function for each ν_{gℓ} then takes the form

ν_{gℓ} = 1 / (1 + exp(−κ_global − κ_ℓ)).   [2]
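Eq. 2 is a standard logistic (sigmoid) function of the two innovation features; rendered as code, with illustrative weight values:

```python
import math

def innovation_prob(kappa_global: float, kappa_lang: float) -> float:
    """Eq. 2: innovation probability as a logistic function of a global
    bias plus a language-specific bias (weights here are illustrative)."""
    return 1.0 / (1.0 + math.exp(-kappa_global - kappa_lang))
```

With both weights at zero the model is indifferent (probability 1/2); a strongly negative language-specific weight drives the innovation probability toward zero for that language.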
ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from the National Science Foundation.
1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic sound changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp 111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84:710–759.
21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia).
22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist 7:233–244.
23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto).
24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc Comput Linguist 45:15–22.
25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124.
26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions with finite-state methods. Empirical Methods on Natural Language Processing 13:1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation Within Optimality Theory, eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm), pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst Linguist Acad Sinica 1:31–94.