Statistical NLP Spring 2010 Lecture 25: Diachronics Dan Klein – UC Berkeley


Page 1

Statistical NLP

Spring 2010

Lecture 25: Diachronics

Dan Klein – UC Berkeley

Page 2

Evolution: Main Phenomena

Mutations of sequences

Time

Speciation

Time

Page 3

Tree of Languages

Challenge: identify the phylogeny

Much work in biology, e.g. work by Warnow, Felsenstein, Steele…

Also in linguistics, e.g. Warnow et al., Gray and Atkinson…

http://andromeda.rutgers.edu/~jlynch/language.html

Page 4

Statistical Inference Tasks

Inputs: Modern Text; Phylogeny (FR, IT, PT, ES)

Outputs: Ancestral Word Forms; Cognate Groups / Translations; Grammatical Inference

Examples: fuego, feu → focus (ancestral word form); fuego, feu and huevo, oeuf (cognate groups); les faits sont très clairs (grammatical inference)

Page 5

Outline

Ancestral Word Forms (fuego, feu → focus)

Cognate Groups / Translations (fuego, feu; huevo, oeuf)

Grammatical Inference (les faits sont très clairs)

Page 6

Language Evolution: Sound Change

Latin camera /kamera/ → French chambre /ʃambʁ/

Deletion: /e/

Change: /k/ .. /tʃ/ .. /ʃ/

Insertion: /b/

Eng. camera from Latin, “camera obscura”

Eng. chamber from Old Fr. before the initial /t/ dropped

Page 7

Diachronic Evidence

“tonitru non tonotru”: Appendix Probi [ca. 300]

“tonight not tonite”: Yahoo! Answers [2009]

Page 8

Synchronic (Comparative) Evidence

Page 9

Simple Model: Single Characters

[Figure: tree whose nodes carry single characters, C or G]
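For a single character per language, the probability of the observed leaf characters on a fixed tree can be computed exactly with a bottom-up pass (Felsenstein's pruning algorithm). The sketch below is only an illustration under assumptions that are not from the slides: a two-letter alphabet, a toy four-leaf tree, a uniform prior at the root, and a hypothetical per-branch mutation probability `mu`.

```python
ALPHABET = ("C", "G")

def transition(parent, child, mu):
    """P(child | parent) on one branch: switch letter with probability mu (assumed)."""
    return 1.0 - mu if parent == child else mu

def partial_likelihood(node, mu):
    """Return {char: P(observed leaves below node | node = char)}.

    A leaf is a string such as "C"; an internal node is a tuple of children.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child in node:
        below = partial_likelihood(child, mu)
        for a in ALPHABET:
            # Sum out the child's character, then multiply across children.
            result[a] *= sum(transition(a, b, mu) * below[b] for b in ALPHABET)
    return result

def likelihood(tree, mu):
    """P(observed leaves) with a uniform prior over the root character."""
    root = partial_likelihood(tree, mu)
    return sum(root[a] / len(ALPHABET) for a in ALPHABET)

def root_posterior(tree, mu):
    """Posterior over the ancestral (root) character given the leaves."""
    root = partial_likelihood(tree, mu)
    total = sum(root.values())
    return {a: root[a] / total for a in ALPHABET}

toy_tree = (("C", "C"), ("C", "G"))  # hypothetical leaf observations
print(root_posterior(toy_tree, mu=0.1))
```

With three of the four leaves showing C, the posterior favours C at the root. Whole words need a richer model because insertions and deletions change sequence length.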

Page 10

A Model for Word Forms

Tree: LA → IB, IT; IB → ES, PT

LA: focus /fokus/

IB: /fogo/

ES: fuego /fweɣo/

PT: fogo /fogo/

IT: fuoco /fwɔko/

Branch parameters: θ_IB, θ_IT, θ_ES, θ_PT

[BGK 07]

Page 11

Contextual Changes

/fokus/ → /fwɔko/ (IT branch, parameters θ_IT)

[Figure: after word-initial /f/ (# f), /o/ is rewritten as /w ɔ/]

Page 12

Changes are Systematic

/fokus/ → /fweɣo/, /fogo/, /fogo/, /fwɔko/

/kentrum/ → /sentro/, /sentro/, /sentro/, /tʃɛntro/

Page 13

Experimental Setup

Data sets

Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets; target: (Vulgar) Latin

Large: Oceanic; 661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993]

Page 14

Data: Romance

Page 15

Learning: Objective

$$\max_{\theta} P(\theta \mid w_1, \ldots, w_L)$$

[Figure: tree of forms /fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/, with node labels z and w]

Page 16

Learning: EM

[Figure: tree of forms /fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/]

M-Step: Find parameters which fit (expected) sound change counts. Easy: gradient ascent on theta

E-Step: Find (expected) change counts given parameters. Hard: variables are string-valued
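The M-step can be pictured with a minimal sketch. It assumes a simplification that is not taken from the slides: the outcomes of one sound on one branch follow a softmax over one free parameter per outcome, and the expected counts handed over by the E-step are made-up numbers. The function names are hypothetical.

```python
import math

def softmax(theta):
    """Turn a dict of scores into a dict of probabilities."""
    m = max(theta.values())
    exps = {k: math.exp(v - m) for k, v in theta.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def m_step(expected_counts, steps=1000, lr=0.5):
    """Gradient ascent on sum_o count[o] * log p_theta(o)."""
    theta = {o: 0.0 for o in expected_counts}
    total = sum(expected_counts.values())
    for _ in range(steps):
        probs = softmax(theta)
        for o in theta:
            # Gradient: expected count from the E-step minus the model's count.
            grad = expected_counts[o] - total * probs[o]
            theta[o] += lr * grad / total
    return theta

# Hypothetical expected counts for the outcomes of /k/ on one branch
# ("ch" stands in for the affricate; "deleted" for loss of the sound).
counts = {"k": 6.5, "ch": 3.0, "deleted": 0.5}
print(softmax(m_step(counts)))
```

At convergence the fitted probabilities match the normalized expected counts, which is the sense in which the parameters "fit" them. A model with shared features across sounds would not reduce to normalized counts, which is why a gradient method is used.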

Page 17

Computing Expectations

‘grass’

Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence

[Holmes 01, BGK 07]
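To make the sampling idea concrete, here is a deliberately reduced sketch: each node holds a single character rather than a whole sequence, so it is not the sampler of [Holmes 01] or [BGK 07]. The tree shape, the two-letter alphabet, the mutation probability `mu`, and all function names are assumptions for illustration.

```python
import random

ALPHABET = ("r", "l")

def edge_prob(parent, child, mu):
    """P(child | parent) on one branch: switch with probability mu (assumed)."""
    return 1.0 - mu if parent == child else mu

def gibbs_sweep(state, parent_of, hidden, mu, rng):
    """Resample each hidden node given the current values of its neighbours."""
    children_of = {}
    for child, parent in parent_of.items():
        children_of.setdefault(parent, []).append(child)
    for node in hidden:
        weights = []
        for a in ALPHABET:
            w = 1.0  # a uniform root prior only contributes a constant factor
            if node in parent_of:
                w *= edge_prob(state[parent_of[node]], a, mu)
            for child in children_of.get(node, ()):
                w *= edge_prob(a, state[child], mu)
            weights.append(w)
        state[node] = rng.choices(ALPHABET, weights=weights)[0]

def root_marginal(leaves, sweeps=2000, mu=0.1, seed=0):
    """Estimate P(root = "r" | leaves) from the visited states."""
    parent_of = {"A": "root", "B": "root",
                 "x1": "A", "x2": "A", "x3": "B", "x4": "B"}
    hidden = ["root", "A", "B"]
    state = dict(leaves, root="l", A="l", B="l")  # deliberately poor start
    rng = random.Random(seed)
    hits = 0
    for _ in range(sweeps):
        gibbs_sweep(state, parent_of, hidden, mu, rng)
        hits += state["root"] == "r"
    return hits / sweeps

print(root_marginal({"x1": "r", "x2": "r", "x3": "r", "x4": "r"}))
```

Because every update changes one variable while its neighbours stay fixed, a sampler of this kind moves slowly whenever several nodes would have to change together.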

Page 18

A Gibbs Sampler

‘grass’


Page 21

Getting Stuck

How to jump to a state where the liquids /r/ and /l/ have a common ancestor?

Page 22

Getting Stuck

Page 23

Solution: Vertical Slices

Single Sequence Resampling

Ancestry Resampling

[BGK 08]

Page 24

Details: Defining “Slices”

The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences

Correct construction section(G) is non-trivial but very efficient

anchor

Page 25

Results: Alignment Efficiency

Is ancestry resampling faster than basic Gibbs?

Hypothesis: Larger gains for deeper trees

Setup: Fixed wall time; synthetic data, same parameters

[Plot: x-axis: depth of the phylogenetic tree; y-axis: sum of pairs]

Page 26

Results: Romance

Page 27

Learned Rules / Mutations

Page 28

Learned Rules / Mutations

Page 29

Comparison to Other Methods

Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)

[Chart labels: Us, Oakes]

Comparison to system from [Oakes, 2000]

Uses exact inference and deterministic rules

Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]
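The evaluation metric can be sketched as a standard Levenshtein distance averaged over word pairs. The slides do not say which edit costs were used, so unit costs for insertion, deletion, and substitution are an assumption here, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def mean_edit_distance(pairs):
    """Average distance over (system reconstruction, reference) pairs."""
    return sum(edit_distance(s, r) for s, r in pairs) / len(pairs)

print(edit_distance("fokus", "fogo"))
```

Passing lists of phoneme symbols instead of plain strings keeps multi-character symbols such as an affricate from being counted as two segments.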

Page 30

Data: Oceanic

Proto-Oceanic

Page 31

Data: Oceanic

http://language.psy.auckland.ac.nz/austronesian/research.php

Page 32

Result: More Languages Help

Distance from [Blust, 1993] Reconstructions

[Plot: x-axis: number of modern languages used; y-axis: mean edit distance]

Page 33

Results: Large Scale

Page 34

Visualization: Learned universals

*The model did not have features encoding natural classes

Page 35

Regularity and Functional Load

In a language, some pairs of sounds are more contrastive than others (higher functional load)

Example: English “p”/“d” versus “t”/“th”

“p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ...

“t”/“th”: thin/tin

Page 36

Functional Load: Timeline

Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]

Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]

Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]

Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]

Page 37

Regularity and Functional Load

Data: only 4 languages from the Austronesian data

Each dot is a sound change identified by the system

[Plot: x-axis: functional load as computed by [King, 67]; y-axis: merger posterior probability]

Page 38

Regularity and Functional Load

Data: all 637 languages from the Austronesian data

[Plot: x-axis: functional load as computed by [King, 67]; y-axis: merger posterior probability]

Page 39

Outline

Ancestral Word Forms (fuego, feu → focus)

Cognate Groups / Translations (fuego, feu; huevo, oeuf)

Grammatical Inference (les faits sont très clairs)

Page 40

Cognate Groups

/fweɣo/, /fogo/, /fwɔko/ (‘fire’)

/berβo/, /vɛrbo/, /vɛrbo/

/tʃɛntro/, /sentro/, /sɛntro/

Page 41

Model: Cognate Survival

[Figure: two copies of the tree LA → IB, IT; IB → ES, PT. First copy: omnis at LA, ogni at IT, “-” at the other nodes. Second copy: survival indicators, “+” at LA and IT, “-” at IB, ES, PT]

Page 42

Results: Grouping Accuracy

Fraction of Words Correctly Grouped

Method

[Hall and Klein, in submission]

Page 43

Semantics: Matching Meanings

day (EN): occurs with “night”, “sun”, “week”

tag (DE): occurs with “nacht”, “sonne”, “woche”

tag (EN): occurs with “name”, “label”, “along”
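A minimal sketch of matching by context follows. It assumes a small seed dictionary for the context words and uses Jaccard overlap as the similarity score. Both choices are illustrative assumptions, not the method from the lecture, and the function name is hypothetical.

```python
def context_overlap(contexts_src, contexts_tgt, seed_dictionary):
    """Jaccard overlap between target contexts and translated source contexts."""
    translated = {seed_dictionary[w] for w in contexts_src if w in seed_dictionary}
    if not translated or not contexts_tgt:
        return 0.0
    return len(translated & contexts_tgt) / len(translated | contexts_tgt)

# Hypothetical seed dictionary covering the German context words on this slide.
seed = {"nacht": "night", "sonne": "sun", "woche": "week"}

de_tag = {"nacht", "sonne", "woche"}
en_day = {"night", "sun", "week"}
en_tag = {"name", "label", "along"}

print(context_overlap(de_tag, en_day, seed))
print(context_overlap(de_tag, en_tag, seed))
```

German "tag" and English "tag" are spelled the same, but only English "day" shares the German word's contexts, so the context score separates the true translation from the lookalike.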

Page 44

Outline

Ancestral Word Forms (fuego, feu → focus)

Cognate Groups / Translations (fuego, feu; huevo, oeuf)

Grammatical Inference (les faits sont très clairs)

Page 45

Grammar Induction

congress narrowly passed the amended bill

les faits sont très clairs

Task: Given sentences, infer grammar (and parse tree structures)

Page 46

Shared Prior

congress narrowly passed the amended bill

les faits sont très clairs

$$P(\theta) = \mathcal{N}(0, \sigma^2 I)$$

$$P(\theta_\ell \mid \theta) = \mathcal{N}(\theta, \sigma^2 I)$$
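The shared prior can be read as follows: a global parameter vector is drawn from a zero-mean isotropic Gaussian, and each language's parameter vector is drawn from a Gaussian centred on the global one. The sketch below only evaluates that log prior. The parameter vectors and language codes are made-up examples, and the function names are hypothetical.

```python
import math

def log_gaussian(x, mean, sigma2):
    """Log density of an isotropic Gaussian N(mean, sigma2 * I) at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

def log_prior(shared, per_language, sigma2):
    """log P(theta) plus the sum over languages of log P(theta_l | theta)."""
    total = log_gaussian(shared, [0.0] * len(shared), sigma2)
    for theta_l in per_language.values():
        total += log_gaussian(theta_l, shared, sigma2)
    return total

shared = [0.5, -0.2]  # hypothetical global grammar parameters
similar = {"en": [0.6, -0.1], "fr": [0.4, -0.3]}
dissimilar = {"en": [3.0, 2.0], "fr": [-3.0, -2.0]}

print(log_prior(shared, similar, sigma2=1.0))
print(log_prior(shared, dissimilar, sigma2=1.0))
```

Parameters that stay close to the shared vector score higher, which is how the prior ties the grammars of different languages together during learning.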

Page 47

Results: Phylogenetic Prior

Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English

Tree node labels: WG, NG, RM, G, IE, GL

Avg rel gain: 29%

Page 48

Conclusion

Phylogeny-structured models can:

Accurately reconstruct ancestral words

Give evidence to open linguistic debates

Detect translations from form and context

Improve language learning algorithms

Lots of questions still open:

Can we get better phylogenies using these high-res models?

What do these models have to say about the very earliest languages? Proto-world?

Page 49

Thank you!

nlp.cs.berkeley.edu