from geometry and physics to computational linguistics

From Geometry and Physics to Computational

Linguistics

Matilde Marcolli

Geometric Science of Information, Paris, October 2015

Matilde Marcolli Geometry, Physics, Linguistics

A Mathematical Physicist’s adventures in Linguistics

Based on:

1 Alexander Port, Iulia Gheorghita, Daniel Guth, John M.Clark,Crystal Liang, Shival Dasu, Matilde Marcolli, PersistentTopology of Syntax, arXiv:1507.05134

2 Karthik Siva, Jim Tao, Matilde Marcolli, Spin Glass Models ofSyntax and Language Evolution, arXiv:1508.00504

3 Jeong Joon Park, Ronnel Boettcher, Andrew Zhao, Alex Mun,Kevin Yuh, Vibhor Kumar, Matilde Marcolli, Prevalence andrecoverability of syntactic parameters in sparse distributedmemories, arXiv:1510.06342

4 Sharjeel Aziz, Vy-Luan Huynh, David Warrick, MatildeMarcolli, Syntactic Phylogenetic Trees, in preparation...coming soon to an arXiv near you


What is Linguistics?• Linguistics is the scientific study of language

- What is Language? (langage, lenguaje, ...)- What is a Language? (lange, lengua,...)

Similar to ‘What is Life?’ or ‘What is an organism?’ in biology

• natural languageas opposed to artificial (formal, programming, ...) languages

• The point of view we will focus on:Language is a kind of Structure

- It can be approached mathematically and computationally, likemany other kinds of structures- The main purpose of mathematics is the understanding ofstructures


• How are di↵erent languages related?What does it mean that they come in families?

• How do languages evolve in time?Phylogenetics, Historical Linguistics, Etymology

• How does the process of language acquisition work?(Neuroscience)

• Semiotic viewpoint (mathematical theory of communication)

• Discrete versus Continuum(probabilistic methods, versus discrete structures)

• Descriptive or Predictive?to be predictive, a science needs good mathematical models


A language exists at many di↵erent levels of structure

An Analogy: Physics looks very di↵erent at di↵erent scales:

General Relativity and Cosmology (� 1010 m)

Classical Physics (⇠ 1 m)

Quantum Physics ( 10�10 m)

Quantum Gravity (10�35 m)

Despite dreams of a Unified Theory, we deal with di↵erentmathematical models for di↵erent levels of structure


Similarly, we view language at di↵erent “scales”:

units of sound (phonology)

words (morphology)

sentences (syntax)

global meaning (semantics)

We expect to be dealing with di↵erent mathematical structuresand di↵erent models at these various di↵erent levels

Main level I will focus on: Syntax


Linguistics view of syntax kind of looks like this...

Alexander Calder, Mobile, 1960


Modern Syntactic Theory:

• grammaticality: judgement on whether a sentence is well formed(grammatical) in a given language, i-language gives people thecapacity to decide on grammaticality

• generative grammar: produce a set of rules that correctly predictgrammaticality of sentences

• universal grammar: ability to learn grammar is built in thehuman brain, e.g. properties like distinction between nouns andverbs are universal ... is universal grammar a falsifiable theory?


Principles and Parameters (Government and Binding)(Chomsky, 1981)

• principles: general rules of grammar

• parameters: binary variables (on/o↵ switches) that distinguishlanguages in terms of syntactic structures

• Example of parameter: head-directionality(head-initial versus head-final)English is head-initial, Japanese is head-final

VP= verb phrase, TP= tense phrase, DP= determiner phraseMatilde Marcolli Geometry, Physics, Linguistics

...but not always so clear-cut: German can use both structures

auf seine Kinder stolze Vater (head-final) orer ist stolz auf seine Kinder (head-initial)

AP= adjective phrase, PP= prepositional phrase

• Corpora based statistical analysis of head-directionality (HaitaoLiu, 2010): a continuum between head-initial and head-final


Examples of Parameters

Head-directionality

Subject-side

Pro-drop

Null-subject

Problems• Interdependencies between parameters• Diachronic changes of parameters in language evolution


Dependent parameters

• null-subject parameter: can drop subjectExample: among Latin languages, Italian and Spanish havenull-subject (+), French does not (-)it rains, piove, llueve, il pleut

• pro-drop parameter: can drop pronouns in sentences

• Pro-drop controls Null-subject

How many independent parameters?Geometry of the space of syntactic parameters?


Persistent Topology of Syntax

• Alexander Port, Iulia Gheorghita, Daniel Guth, John M.Clark,Crystal Liang, Shival Dasu, Matilde Marcolli, Persistent Topologyof Syntax, arXiv:1507.05134

Databases of Syntactic Parameters of World Languages:

1 Syntactic Structures of World Languages (SSWL)http://sswl.railsplayground.net/

2 TerraLing http://www.terraling.com/

3 World Atlas of Language Structures (WALS)http://wals.info/


Persistent Topology of Data Sets

how data cluster around topological shapes at di↵erent scalesMatilde Marcolli Geometry, Physics, Linguistics

Vietoris–Rips complexes

• set X = {x↵} of points in Euclidean space EN , distanced(x , y) = kx � yk = (

PNj=1(xj � yj)2)1/2

• Vietoris-Rips complex R(X , ✏) of scale ✏ over field K:

Rn(X , ✏) is K-vector space spanned by all unordered (n + 1)-tuplesof points {x↵0 , x↵1 , . . . , x↵n} in X where all pairs have distancesd(x↵i , x↵j ) ✏


• inclusion maps R(X , ✏1) ,! R(X , ✏2) for ✏1 < ✏2 induce maps inhomology by functoriality Hn(X , ✏1) ! Hn(X , ✏2)

barcode diagrams: births and deaths of persistent generators


Persistent Topology of Syntactic Parameters

• Data: 252 languages from SSWL with 115 parameters

• if consider all world languages together too much noise in thepersistent topology: subdivide by language families

• Principal Component Analysis: reduce dimensionality of data

• compute Vietoris–Rips complex and barcode diagrams

Persistent H0: clustering of data in components– language subfamilies

Persistent H1: clustering of data along closed curves (circles)– linguistic meaning?


Sources of Persistent H1

• “Hopf bifurcation” type phenomenon

• two di↵erent branches of a tree closing up in a loop

two di↵erent types of phenomena of historical linguisticdevelopment within a language family


Persistent Topology of Indo-European Languages

• Two persistent generators of H0 (Indo-Iranian, European)• One persistent generator of H1


Persistent Topology of Niger–Congo Languages

• Three persistent components of H0

(Mande, Atlantic-Congo, Kordofanian)• No persistent H1


The origin of persistent H1 of Indo-European Languages?

Naive guess: the Anglo-Norman bridge ... but lexical not syntactic


Answer: No, it is not the Anglo-Norman bridge!

Persistent topology of the Germanic+Latin languages


Answer: It’s all because of Ancient Greek!

Persistent topology with Hellenic (and Indo-Iranic) branch removed


Syntactic Parameters as Dynamical Variables• Example: Word Order: SOV, SVO, VSO, VOS, OVS, OSV

Very uneven distribution across world languages


• Word order distribution: a neuroscience explanation?

- D. Kemmerer, The cross-linguistic prevalence of SOV and SVOword orders reflects the sequential and hierarchical representationof action in Broca’s area, Language and Linguistics Compass, 6(2012) N.1, 50–66.

• Internal reasons for diachronic switch?

- F.Antinucci, A.Duranti, L.Gebert, Relative clause structure,relative clause perception, and the change from SOV to SVO,Cognition, Vol.7 (1979) N.2 145–176.


Changes over time in Word Order

• Ancient Greek: switched from Homeric to Classical- A. Taylor, The change from SOV to SVO in Ancient Greek,Language Variation and Change, 6 (1994) 1–37

• Sanskrit: di↵erent word orders allowed, but prevalent one inVedic Sanskrit is SOV (switched at least twice by influence ofDravidian languages)- F.J. Staal, Word Order in Sanskrit and Universal Grammar,Springer, 1967

• English: switched from Old English (transitional between SOVand SVO) to Middle English (SVO)- J. McLaughlin, Old English Syntax: a handbook, Walter deGruyter, 1983.

Syntactic Parameters are Dynamical in Language Evolution


Spin Glass Models of Syntax

• Karthik Siva, Jim Tao, Matilde Marcolli, Spin Glass Models ofSyntax and Language Evolution, arXiv:1508.00504

– focus on linguistic change caused by language interactions

– think of syntactic parameters as spin variables

– spin interaction tends to align (ferromagnet)

– strength of interaction proportional to bilingualism (MediaLab)

– role of temperature parameter: probabilistic interpretation ofparameters

– not all parameters are independent: entailment relations

– Metropolis–Hastings algorithm: simulate evolution


The Ising Model of spin systems on a graph G

• configurations of spins s : V (G ) ! {±1}

• magnetic field B and correlation strength J: Hamiltonian

H(s) = �JX

e2E(G):@(e)={v ,v 0}

sv sv 0 � BX

v2V (G)

sv

• first term measures degree of alignment of nearby spins

• second term measures alignment of spins with direction ofmagnetic field


Equilibrium Probability Distribution

• Partition Function ZG (�)

ZG (�) =X

s:V (G)!{±1}

exp(��H(s))

• Probability distribution on the configuration space: Gibbsmeasure

PG ,�(s) =e��H(s)

ZG (�)

• low energy states weight most

• at low temperature (large �): ground state dominates; at highertemperature (� small) higher energy states also contribute


Average Spin Magnetization

MG (�) =1

#V (G )

X

s:V (G)!{±1}

X

v2V (G)

sv P(s)

• Free energy FG (�,B) = logZG (�,B)

MG (�) =1

#V (G )

1

�

✓@FG (�,B)

@B

◆|B=0

Ising Model on a 2-dimensional lattice

• 9 critical temperature T = Tc where phase transition occurs

• for T > Tc equilibrium state has m(T ) = 0 (computed withrespect to the equilibrium Gibbs measure PG ,�

• demagnetization: on average as many up as down spins

• for T < Tc have m(T ) > 0: spontaneous magnetization


Syntactic Parameters and Ising/Potts Models

• characterize set of n = 2N languages Li by binary strings of Nsyntactic parameters (Ising model)

• or by ternary strings (Potts model) if take values ±1 forparameters that are set and 0 for parameters that are not definedin a certain language

• a system of n interacting languages = graph G with n = #V (G )

• languages Li = vertices of the graph (e.g. language thatoccupies a certain geographic area)

• languages that have interaction with each other = edges E (G )(geographical proximity, or high volume of exchange for otherreasons)


graph of language interaction (detail) from Global LanguageNetwork of MIT MediaLab, with interaction strengths Je on edgesbased on number of book translations (or Wikipedia edits)


• if only one syntactic parameter, would have an Ising model onthe graph G : configurations s : V (G ) ! {±1} set the parameterat all the locations on the graph

• variable interaction energies along edges (some pairs oflanguages interact more than others) • magnetic field B andcorrelation strength J: Hamiltonian

H(s) = �X

e2E(G):@(e)={v ,v 0}

NX

i=1

Je sv ,i sv 0,i

• if N parameters, configurations

s = (s1, . . . , sN) : V (G ) ! {±1}N

• if all N parameters are independent, then it would be like havingN non-interacting copies of a Ising model on the same graph G (orN independent choices of an initial state in an Ising model on G )


Metropolis–Hastings

• detailed balance condition P(s)P(s ! s 0) = P(s 0)P(s 0 ! s) forprobabilities of transitioning between states (Markov process)

• transition probabilities P(s ! s 0) = ⇡A(s ! s 0) · ⇡(s ! s 0) with⇡(s ! s 0) conditional probability of proposing state s 0 given states and ⇡A(s ! s 0) conditional probability of accepting it

• Metropolis–Hastings choice of acceptance distribution (Gibbs)

⇡A(s ! s 0) =

⇢1 if H(s 0)� H(s) 0

exp(��(H(s 0)� H(s))) if H(s 0)� H(s) > 0.

satisfying detailed balance

• selection probabilities ⇡(s ! s 0) single-spin-flip dynamics

• ergodicity of Markov process ) unique stationary distribution


Example: Single parameter dynamics Subject-Verb parameter

Initial configuration: most languages in SSWL have +1 forSubject-Verb; use interaction energies from MediaLab data


Equilibrium: low temperature all aligned to +1; high temperature:

Temperature: fluctuations in bilingual users between di↵erentstructures (“code-switching” in Linguistics)


Entailment relations among parameters

• Example: {p1, p2} = {Strong Deixis, Strong Anaphoricity}

p1 p2`1 +1 +1`2 �1 0`3 +1 +1`4 +1 �1

{`1, `2, `3, `4} = {English,Welsh,Russian,Bulgarian}


Modeling Entailment

• variables: S`,p1 = exp(⇡iX`,p1) 2 {±1}, S`,p2 2 {±1, 0} andY`,p2 = |S`,p2 | 2 {0, 1}

• Hamiltonian H = HE + HV

HE = Hp1 + Hp2 = �X

`,`02languagesJ``0

⇣�S`,p1 ,S`0,p1

+ �S`,p2 ,S`0,p2

⌘

HV =X

`

HV ,` =X

`

J` �X`,p1,Y`,p2

J` > 0 anti-ferromagnetic

• two parameters: temperature as before and coupling energy ofentailment

• if freeze p1 and evolution for p2: Potts model with externalmagnetic field


Acceptance probabilities

⇡A(s ! s ± 1 (mod 3)) =

⇢1 if �H 0exp(��H) if �H > 0.

�H := min{H(s + 1 (mod 3)),H(s � 1 (mod 3))}� H(s)

Equilibrium configuration

(p1, p2) HT/HE HT/LE LT/HE LT/LE

`1 (+1, 0) (+1,�1) (+1,+1) (+1,�1)`2 (+1,�1) (�1,�1) (+1,+1) (+1,�1)`3 (�1, 0) (�1,+1) (+1,+1) (�1, 0)`4 (+1,+1) (�1,�1) (+1,+1) (�1, 0)


Average value of spin

p1 left and p2 right in low entailment energy case


Syntactic Parameters in Kanerva Networks

• Jeong Joon Park, Ronnel Boettcher, Andrew Zhao, Alex Mun,Kevin Yuh, Vibhor Kumar, Matilde Marcolli, Prevalence andrecoverability of syntactic parameters in sparse distributedmemories, arXiv:1510.06342

– Address two issues: relative prevalence of di↵erent syntacticparameters and “degree of recoverability” (as sign of underlyingrelations between parameters)– If corrupt information about one parameter in data of group oflanguages can recover it from the data of the other parameters?– Answer: di↵erent parameters have di↵erent degrees ofrecoverability– Used 21 parameters and 165 languages from SSWL database


Kanerva networks (sparse distributed memories)• P. Kanerva, Sparse Distributed Memory, MIT Press, 1988.

• field F2 = {0, 1}, vector space FN2 large N

• uniform random sample of 2k hard locations with 2k << 2N

• median Hamming distance between hard locations

• Hamming spheres of radius slightly larger than median value(access sphere)

• writing to network: storing datum X 2 FN2 , each hard location in

access sphere of X gets i-th coordinate (initialized at zero)incremented depending on i-th entry ot X

• reading at a location: i-th entry determined by majority rule ofi-th entries of all stored data in hard locations within access sphere

Kanerva networks are good at reconstructing corrupted data


Procedure

• 165 data points (languages) stored in a Kanerva Network in F212

(choice of 21 parameters)

• corrupting one parameter at a time: analyze recoverability

• language bit-string with a single corrupted bit used as readlocation and resulting bit string compared to original bit-string(Hamming distance)

• resulting average Hamming distance used as score ofrecoverability (lowest = most easily recoverable parameter)


Parameters and frequencies01 Subject-Verb (0.64957267)

02 Verb-Subject (0.31623933)

03 Verb-Object (0.61538464)

04 Object-Verb (0.32478634)

05 Subject-Verb-Object (0.56837606)

06 Subject-Object-Verb (0.30769232)

07 Verb-Subject-Object (0.1923077)

08 Verb-Object-Subject (0.15811966)

09 Object-Subject-Verb (0.12393162)

10 Object-Verb-Subject (0.10683761)

11 Adposition-Noun-Phrase (0.58974361)

12 Noun-Phrase-Adposition (0.2905983)

13 Adjective-Noun (0.41025642)

14 Noun-Adjective (0.52564102)

15 Numeral-Noun (0.48290598)

16 Noun-Numeral (0.38034189)

17 Demonstrative-Noun (0.47435898)

18 Noun-Demonstrative (0.38461539)

19 Possessor-Noun (0.38034189)

20 Noun-Possessor (0.49145299)

A01 Attributive-Adjective-Agreement (0.46581197)


Specific e↵ects due to individual parameterMatilde Marcolli Geometry, Physics, Linguistics

Overall e↵ect related to relative prevalence of a parameter


More refined e↵ect after normalizing for prelavence (syntacticdependencies)


• Overall e↵ect relating recoverability in a Kanerva Network toprevalence of a certain parameter among languages (depends onlyon frequencies: see in random data with assigned frequencies)

• Additional e↵ects (that deviate from random case) which detectpossible dependencies among syntactic parameters: increasedrecoverability beyond what e↵ect based on frequency

• Possible neuroscience implications? Kanerva Networks as modelsof human memory (parameter prevalence linked to neurosciencemodels)

• More refined data if divided by language families?


Phylogenetic Linguistics (WORK IN PROGRESS)

• Constructing family trees for languages(sometimes possibly graphs with loops)

• Main information about subgrouping: shared innovationa specific change with respect to other languages in the family thatonly happens in a certain subset of languages

- Example: among Mayan languages: Huastecan branchcharacterized by initial w becoming voiceless before a vowel and tsbecoming t, q becoming k , ... Quichean branch by velar nasalbecoming velar fricative, c becoming c (prepalatal a↵ricate topalato-alveolar)...

Known result by traditional Historical Linguistics methods:


Mayan Language Tree


Computational Methods for Phylogenetic Linguistics

• Peter Foster, Colin Renfrew, Phylogenetic methods and theprehistory of languages, McDonald Institute Monographs, 2006

• Several computational methods for constructing phylogenetictrees available from mathematical and computational biology

• Phylogeny Programshttp://evolution.genetics.washington.edu/phylip/software.html

• Standardized lexical databases: Swadesh list(100 words, or 207 words)


• Use Swadesh lists of languages in a given family to look forcognates:- without additional etymological information (keep false positives)- with additional etymological information (remove false positives)

• Two further choices about loan words:- remove loan words- keep loan words

• Keeping loan words produces graphs that are not trees

• Without loan words it should produce trees, but small loops stillappear due to ambiguities (di↵erent possible trees matching samedata)... more precisely: coding of lexical data ...


Coding of lexical data

• After compiling lists of cognate words for pairs of languageswithin a given family(with/without lexical information and loan words)

• Produce a binary string S(L1, L2) = (s1, . . . , sN) for each pair oflanguages L1, L2, with entry 0 or 1 at the i-th word of the lexicallist of N words if cognates for that meaning exist in the twolanguages or not (important to pay attention to synonyms)

• lexical Hamming distance between two languages

d(L1, L2) = #{i 2 {1, . . . ,N} | si = 1}

counts words in the list that do not have cognates in L1 and L2


Distance-matrix method of phylogenetic inference

• after producing a measure of “genetic distance”Hamming metric dH(La, Lb)

• hierarchical data clustering: collecting objects in clustersaccording to their distance

• simplest method of tree construction: neighbor joining

(1) - create a (leaf) vertex for each index a(ranging over languages in given family)

(2) - given distance matrix D = (Dab)distances between each pair Dab = dH(La, Lb)construct a new matrix Q-test

Q = (Qab) with Qab = (n � 2)Dab �nX

k=1

Dak �nX

k=1

Dbk

this matrix Q decides first pairs of vertices to join


(3) - identify entries Qab with lowest values: join each such pair(a, b) of leaf vertices to a newly created vertex vab

(4) - set distances to new vertex by

d(a, vab) =1

2Dab +

1

2(n � 2)

nX

k=1

Dak �nX

k=1

Dbk

!

d(b, vab) = Dab � d(a, vab)

d(k , vab) =1

2(Dak + Dbk � Dab)

(5) - remove a and b and keep vab and all the remaining verticesand the new distances, compute new Q matrix and repeat untiltree is completed


Neighborhood-Joining Method for Phylogenetic Inference


Example of a neighbor-joining lexical linguistic phylogenetic tree

from Delmestri-Cristianini’s paperMatilde Marcolli Geometry, Physics, Linguistics

N. Saitou, M. Nei, The neighbor-joining method: a newmethod for reconstructing phylogenetic trees, Mol Biol Evol.Vol.4 (1987) N. 4, 406-425.

R. Mihaescu, D. Levy, L. Pachter, Why neighbor-joiningworks, arXiv:cs/0602041v3

A. Delmestri, N. Cristianini, Linguistic Phylogenetic Inferenceby PAM-like Matrices, Journal of Quantitative Linguistics,Vol.19 (2012) N.2, 95-120.

F. Petroni, M. Serva, Language distance and treereconstruction, J. Stat. Mech. (2008) P08012


Syntactic Phylogenetic Trees (instead of lexical)

• instead of coding lexical data based on cognate words, use binaryvariables of syntactic parameters

• Hamming distance between binary string of parameter values

• shown recently that one gets an accurate reconstruction of thephylogenetic tree of Indo-European languages from syntacticparameters only

• G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin,Towards a syntactic phylogeny of modern Indo-Europeanlanguages, Journal of Historical Linguistics 3 (2013) N.1, 122–152.• G. Longobardi, C. Guardiano, Evidence for syntax as a signal ofhistorical relatedness, Lingua 119 (2009) 1679–1706.


Work in Progress

• Sharjeel Aziz, Vy-Luan Huynh, David Warrick, Matilde Marcolli,Syntactic Phylogenetic Trees, in preparation...coming soon to an arXiv near you

– Assembled a phylogenetic tree of world languages using theSSWL database of syntactic parameters– Ongoing comparison with specific historical linguisticreconstruction of phylogenetic trees– Comparison with Computational Linguistic reconstructions basedon lexical data (Swadesh lists) and on phonetical analysis– not all linguistic families have syntactic parameters mapped withsame level of completeness... di↵erent levels of accuracy inreconstruction


from geometry and physics to computational linguistics

Documents