Part-of-Speech Tagging for Historical English Yi Yang and Jacob Eisenstein Georgia Tech

Post on 29-Jun-2020


TRANSCRIPT

Page 1: Part-of-Speech Tagging for Historical English

Yi Yang and Jacob Eisenstein, Georgia Tech

Pages 2-4:

‣ Digital humanities research [Muralidharan and Hearst, 2011 & 2012]
  ‣ How does the portrayal of men and women differ in Shakespeare's plays?
  ‣ What are the language use patterns in North American slave narratives?
‣ NLP can help!
‣ Only if NLP works for historical texts…

Pages 5-6:

Early Modern English

Hee said nobody had said anything agt mee. [Henry Oxinden, 1660]

‣ Spelling variation: Hee → He, agt → against, mee → me

Pages 7-8:

Stanford POS Tagger

Hee said nobody had said anything agt mee.

‣ Spelling variation

[Figure: Stanford tagger output versus gold tags for the sentence above, with three tokens marked.]

Pages 9-10:

Transfer Loss for POS Tagging [Rayson et al., 2007]

[Bar chart: error rate (%). Modern English: 3.0; Early Modern English: 18.0.]

Pages 11-12:

Approaches

‣ Spelling normalization: Rayson et al. (2007), Scheible et al. (2011), Bollmann (2011)
  ‣ Map from historical spellings to contemporary forms.
‣ Domain adaptation (this work): Yang & Eisenstein (2014), Yang & Eisenstein (2015)
  ‣ Build robust NLP systems with representation learning.

Pages 13-16:

Spelling Normalization [VARD; Baron and Rayson, 2008]

Original: Hee said nobody had said anything agt mee.
Normalized: Hee said nobody had said anything aged me.

‣ Correct normalization: mee → me
‣ Incorrect normalization: agt → aged (should be against)
‣ False negative: Hee left unchanged (should be He)
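The three outcomes on this slide can be reproduced with a toy lookup-table normalizer. This is only a sketch: the table `HYPOTHETICAL_TABLE` and the helper names are invented for illustration and do not reflect how VARD actually works.

```python
# Toy lookup-table normalizer. The table is hypothetical and only reproduces
# the three outcomes shown on the slide; it is not VARD's rule set.
HYPOTHETICAL_TABLE = {"mee": "me", "agt": "aged"}

# Intended modern forms for the variant spellings in the example sentence.
GOLD = {"Hee": "He", "agt": "against", "mee": "me"}


def normalize(tokens, table):
    """Replace each token found in the table; leave all others unchanged."""
    return [table.get(tok, tok) for tok in tokens]


def classify(original, normalized, gold):
    """Label each variant spelling by what the normalizer did to it."""
    outcomes = {}
    for orig, norm in zip(original, normalized):
        if orig not in gold:
            continue  # already a modern spelling
        if norm == gold[orig]:
            outcomes[orig] = "correct normalization"
        elif norm == orig:
            outcomes[orig] = "false negative"
        else:
            outcomes[orig] = "incorrect normalization"
    return outcomes


tokens = "Hee said nobody had said anything agt mee .".split()
normalized = normalize(tokens, HYPOTHETICAL_TABLE)
outcomes = classify(tokens, normalized, GOLD)
print(" ".join(normalized))
print(outcomes)
```

A tagger run on the normalized sentence still sees the unfamiliar "Hee" and the misleading "aged", which is the motivation for the representation-learning approach that follows.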

Pages 17-18:

Spelling Normalization [VARD; Baron and Rayson, 2008]

Normalized: Hee said nobody had said anything aged me.

[Figure: Stanford tagger output versus gold tags for the normalized sentence, with per-token marks.]

Pages 19-22:

Representation Learning

Hee said nobody had said anything agt mee.

‣ Out-of-vocabulary (OOV) word: Hee. Context: said, was, came, told, …
‣ In-vocabulary (IV) words: He, I, We, … Context: said, was, came, told, …
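The idea that an OOV spelling can be tied to IV words through shared contexts can be illustrated with a small distributional-similarity sketch. The context counts below are invented for illustration, and cosine similarity over raw counts is a stand-in for the learned representations discussed later, not the authors' method.

```python
import math
from collections import Counter

# Hypothetical next-word counts, invented for illustration only.
contexts = {
    "Hee": Counter({"said": 3, "was": 2, "came": 1, "told": 1}),
    "He": Counter({"said": 40, "was": 35, "came": 12, "told": 9}),
    "against": Counter({"the": 30, "him": 8, "me": 5}),
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sim_he = cosine(contexts["Hee"], contexts["He"])
sim_against = cosine(contexts["Hee"], contexts["against"])
print(f"Hee vs. He: {sim_he:.3f}")
print(f"Hee vs. against: {sim_against:.3f}")
```

With these toy counts, "Hee" is far closer to "He" than to "against" because the two pronoun spellings precede the same verbs.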

Page 23:

Model

Pages 24-29:

Feature Embeddings [FEMA; Yang and Eisenstein, 2015]

Hee said nobody had said anything agt mee.

Features for the token "Hee":
1. CurrWord=hee
2. NextWord=said
3. Prefix1=h
4. Suffix1=e
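A minimal sketch of how the four feature templates on this slide could be instantiated for a token. The function name `extract_features` and the `<END>` padding symbol are assumptions made for illustration; the full tagger uses more templates than these four.

```python
def extract_features(tokens, i):
    """Instantiate the four templates from the slide for the token at index i."""
    word = tokens[i].lower()
    # "<END>" is a hypothetical padding symbol for the last token.
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        1: f"CurrWord={word}",
        2: f"NextWord={next_word}",
        3: f"Prefix1={word[:1]}",
        4: f"Suffix1={word[-1:]}",
    }


tokens = "Hee said nobody had said anything agt mee .".split()
features = extract_features(tokens, 0)
for template, feature in features.items():
    print(template, feature)
```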

Pages 30-31:

Feature Embeddings [FEMA; Yang and Eisenstein, 2015]

Features: 1. CurrWord=hee, 2. NextWord=said, 3. Prefix1=h, 4. Suffix1=e

‣ Input embedding: $\mathbf{u}_2$
‣ Output embeddings: $\mathbf{v}_1, \mathbf{v}_3, \mathbf{v}_4$

$$p(f_t \mid f_2) \propto \exp\left(\mathbf{u}_2^{\top} \mathbf{v}_t\right)$$

$$\ell = \sum_{t \neq 2}^{T} \log p(f_t \mid f_2)$$
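The objective on this slide can be sketched numerically. The slide gives only the proportionality, so this sketch assumes a full softmax normalized over the candidate features of each template; the two-dimensional embeddings and the candidate sets are invented for illustration, and the published training procedure may differ.

```python
import math


def log_prob(u, output_vectors, target):
    """log p(target | input), with p proportional to exp(u . v) over output_vectors."""
    scores = {f: sum(a * b for a, b in zip(u, v)) for f, v in output_vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores.values()))
    return scores[target] - log_z


# Hypothetical 2-d embeddings; all values are invented.
u_2 = [0.5, -0.2]  # input embedding of feature 2 (NextWord=said)
candidates = {  # output embeddings, grouped by template t != 2
    1: {"CurrWord=hee": [0.4, 0.1], "CurrWord=he": [0.6, -0.3]},
    3: {"Prefix1=h": [1.0, 0.0], "Prefix1=s": [-1.0, 0.0]},
    4: {"Suffix1=e": [0.2, 0.2], "Suffix1=d": [0.0, -0.5]},
}
observed = {1: "CurrWord=hee", 3: "Prefix1=h", 4: "Suffix1=e"}

# l = sum over t != 2 of log p(f_t | f_2)
log_likelihood = sum(log_prob(u_2, candidates[t], observed[t]) for t in observed)
print(f"log-likelihood: {log_likelihood:.4f}")
```

Training would adjust the input and output embeddings to increase this quantity over all tokens in the unlabeled data.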

Pages 32-36:

Word Embeddings [word2vec; Mikolov et al., 2013]

Words: 1. hee, 2. said, 3. nobody, 4. had, …
Features: 1. CurrWord=hee, 2. NextWord=said, 3. Prefix1=h, 4. Suffix1=e

‣ Word embeddings
  ‣ Generic representations
  ‣ Word co-occurrences
‣ Feature embeddings
  ‣ Task-specific representations
  ‣ Feature co-occurrences

Pages 37-39:

Learning from Multiple Domains [FEMA; Yang and Eisenstein, 2015]

‣ Previous work on unsupervised domain adaptation involves two domains.
‣ Unsupervised multi-domain adaptation

Pages 40-48:

Multiple Feature Embeddings [FEMA; Yang and Eisenstein, 2015]

Hee said nobody had said anything agt mee.

Domain attributes: Genre = letters, Epoch = 1600+

Features: 1. CurrWord=hee, 2. NextWord=said, 3. Prefix1=h, 4. Suffix1=e

Each feature's embedding is the sum of a shared embedding and one embedding per domain attribute (here, letters and 1600+):

$$\mathbf{u}_2 = \mathbf{h}_2^{(\text{shared})} + \mathbf{h}_2^{(\text{letters})} + \mathbf{h}_2^{(1600+)}$$

$$p(f_t \mid f_2) \propto \exp\left(\mathbf{u}_2^{\top} \mathbf{v}_t\right)$$
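The attribute-wise sum for u_2 can be sketched as a lookup over one embedding table per domain attribute. The table layout, the function name `compose_embedding`, and all vector values are assumptions made for illustration.

```python
def compose_embedding(feature, attributes, tables):
    """Sum the shared embedding and one embedding per active domain attribute."""
    keys = ["shared"] + list(attributes)
    total = [0.0] * len(tables["shared"][feature])
    for key in keys:
        for d, value in enumerate(tables[key][feature]):
            total[d] += value
    return total


# Hypothetical 2-d embedding tables, one per attribute plus a shared table.
tables = {
    "shared": {"NextWord=said": [0.5, -0.2]},
    "letters": {"NextWord=said": [0.1, 0.0]},
    "1600+": {"NextWord=said": [-0.25, 0.5]},
}

# u_2 = h_2(shared) + h_2(letters) + h_2(1600+)
u_2 = compose_embedding("NextWord=said", ["letters", "1600+"], tables)
print(u_2)
```

A document from a different genre or epoch would reuse the shared table and swap in the tables for its own attributes, so information is shared across domains that have an attribute in common.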

Page 49:

Experiments

Page 50:

Penn Corpora of Historical English [Kroch and Taylor, 2000; Kroch et al., 2004]

Modern British English (MBE)

Period       # of tokens
1840-1914    322,255
1770-1839    427,424
1700-1769    343,024

Early Modern English (EME)

Period       # of tokens
1640-1710    614,315
1570-1639    706,587
1500-1569    640,255
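As a quick consistency check, the per-period counts on this slide add up to the corpus totals quoted on the later "Adaptation from PTB" slide (1,092,703 tokens for MBE and 1,961,157 for EME). In the sketch below, the assignment of each count to a period is an assumption; the sums do not depend on it.

```python
# Token counts as listed on the corpus slide. Which count belongs to which
# period is an assumption here; the totals are unaffected by that choice.
mbe_tokens = {"1840-1914": 322_255, "1770-1839": 427_424, "1700-1769": 343_024}
eme_tokens = {"1640-1710": 614_315, "1570-1639": 706_587, "1500-1569": 640_255}

mbe_total = sum(mbe_tokens.values())
eme_total = sum(eme_tokens.values())
print(f"MBE: {mbe_total:,} tokens")
print(f"EME: {eme_total:,} tokens")
```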

Pages 51-52:

Tagset Mappings [Moon and Baldridge, 2007]

‣ Penn Corpora of Historical English (PCHE) tagset: 83 tags
‣ Penn Treebank (PTB) tagset: 45 tags

PCHE         PTB
ADJ          JJ
ADV, ALSO    RB
VB, VBI      VB
…            …
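A many-to-one tagset mapping of this kind is naturally expressed as a dictionary. The sketch below contains only the five entries visible on the slide, not the full 83-tag mapping; the `UNMAPPED` placeholder and the tag `XYZ` are hypothetical.

```python
# Only the entries shown on the slide; the full mapping covers 83 PCHE tags.
PCHE_TO_PTB = {
    "ADJ": "JJ",
    "ADV": "RB",
    "ALSO": "RB",
    "VB": "VB",
    "VBI": "VB",
}


def map_tags(pche_tags, mapping, unknown="UNMAPPED"):
    """Convert PCHE tags to PTB tags, flagging tags missing from the mapping."""
    return [mapping.get(tag, unknown) for tag in pche_tags]


# "XYZ" is a made-up tag used to show the fallback behaviour.
mapped = map_tags(["ADJ", "ALSO", "VBI", "XYZ"], PCHE_TO_PTB)
print(mapped)
```

Because several PCHE tags collapse onto one PTB tag, the mapping is lossy in one direction only: gold PCHE annotations can be converted to PTB tags for evaluation, but not the reverse.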

Pages 53-54:

Systems

‣ Support vector machine (SVM) tagger
  ‣ Sixteen basic feature templates by Ratnaparkhi (1996)
‣ Representation learning methods [Blitzer et al., 2006; Brown et al., 1992; Mikolov et al., 2013]
  ‣ Structural correspondence learning (SCL)
  ‣ Brown clustering
  ‣ word2vec embeddings
  ‣ Multiple feature embeddings (FEMA)

Page 55:

Temporal Adaptation

Modern British English (MBE)

Period       # of tokens    Split
1840-1914    322,255        Train
1770-1839    427,424        Test1
1700-1769    343,024        Test2

Early Modern English (EME)

Period       # of tokens    Split
1640-1710    614,315        Train
1570-1639    706,587        Test1
1500-1569    640,255        Test2

Pages 56-58:

Results: Modern British English

[Bar chart: average error rate (%). Baseline: 4.6; SCL: 4.3; Brown: 4.2; word2vec: 4.4; FEMA (our method): 3.7 (-0.9).]

Pages 59-61:

Results: Early Modern English

[Bar chart: average error rate (%). Baseline: 9.4; SCL: 8.2; Brown: 8.0; word2vec: 8.3; FEMA (our method): 6.6 (-2.8).]

Page 62:

Adaptation from PTB

Corpus                    # of tokens    Split
Penn Treebank             969,905        Train
Modern British English    1,092,703      Test1
Early Modern English      1,961,157      Test2

Pages 63-64:

Adaptation from PTB

Standard evaluation scenario for English POS tagging.

‣ Low resource languages
‣ Specific genres, styles, or epochs

Insufficient data annotation for historical texts.

Pages 65-67:

Results: Modern British English

[Bar chart: error rate (%). Baseline: 18.9; SCL: 18.4; Brown: 18.4; word2vec: 18.3; FEMA (our method): 17.5 (-1.4).]

Pages 68-70:

Results: Early Modern English

[Bar chart: error rate (%). Baseline: 25.9; SCL: 24.1; Brown: 24.0; word2vec: 24.2; FEMA (our method): 22.1 (-3.8).]
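The parenthesized deltas on the four results slides are absolute differences in error rate between the baseline and FEMA. The sketch below recomputes them from the reported values and also derives relative reductions, which do not appear on the slides; the dictionary labels are informal names chosen here.

```python
# (baseline error, FEMA error) as reported on the four results slides.
results = {
    "MBE, temporal adaptation": (4.6, 3.7),
    "EME, temporal adaptation": (9.4, 6.6),
    "MBE, adaptation from PTB": (18.9, 17.5),
    "EME, adaptation from PTB": (25.9, 22.1),
}

reductions = {}
for name, (baseline, fema) in results.items():
    absolute = round(baseline - fema, 1)  # matches the slide annotations
    relative = 100.0 * (baseline - fema) / baseline  # derived here, not on the slides
    reductions[name] = absolute
    print(f"{name}: -{absolute} points ({relative:.1f}% relative)")
```

In both settings the absolute gain is larger on Early Modern English, the period furthest from the training data.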

Pages 71-73:

Normalization vs. Representation Learning

[Bar chart: error rate (%). Baseline: 25.9; VARD (spelling normalization): 23.3 (-2.6); FEMA (representation learning): 22.1 (-3.8); FEMA+VARD (representation learning + normalization): 21.0 (-4.9).]
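The arithmetic behind this comparison can be checked directly from the reported error rates. The sketch below computes each system's gain over the baseline and compares the combined gain with the sum of the individual gains; that comparison is an observation made here, not a claim from the slides.

```python
baseline = 25.9
errors = {"VARD": 23.3, "FEMA": 22.1, "FEMA+VARD": 21.0}

# Absolute error reduction of each system relative to the baseline.
gains = {name: round(baseline - err, 1) for name, err in errors.items()}
sum_of_individual = round(gains["VARD"] + gains["FEMA"], 1)

print(gains)
print(f"sum of individual gains: {sum_of_individual}")
print(f"combined gain: {gains['FEMA+VARD']}")
```

The combined system improves on either method alone, but by less than the two gains added together, which suggests the methods fix partly overlapping sets of errors.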

Pages 74-77:

Error Analysis

‣ Annotation inconsistencies and tagset mismatches

Token            Annotations in PCHE                      Annotations in PTB
, (comma)        , (comma; 83.4%), . (period; 16.6%)      , (comma)
. (period)       , (comma; 12.3%), . (period; 87.7%)      . (period)
to               TO (54.6%), IN (44.3%)                   TO
all/any/every    JJ (quantifier)                          DT
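Per-token tag distributions like those in this table can be computed from any tagged corpus with a few lines of counting. The sketch below uses a tiny invented sample, so its percentages are not the ones reported on the slide; the function name `tag_distribution` is also an assumption.

```python
from collections import Counter, defaultdict


def tag_distribution(tagged_tokens):
    """Return, for each token, the percentage of occurrences carrying each tag."""
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token.lower()][tag] += 1
    return {
        token: {tag: 100.0 * n / sum(c.values()) for tag, n in c.items()}
        for token, c in counts.items()
    }


# Hypothetical (token, tag) pairs; not drawn from PCHE or PTB.
toy_corpus = [
    ("to", "TO"), ("to", "IN"), ("to", "TO"), ("to", "IN"),
    (",", ","), (",", "."), (",", ","), (",", ","),
]
dist = tag_distribution(toy_corpus)
print(dist)
```

Running the same count on two corpora and comparing the distributions for frequent tokens is one way to surface annotation differences of the kind listed above.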

Pages 78-81:

Conclusions

‣ Feature embeddings outperform word embeddings by exploiting task-specific information in feature templates.
‣ Representation learning and spelling normalization are complementary for improving tagging performance.
‣ Tagset mismatches make it hard to evaluate modern POS taggers for historical English.