NLP for Low-resource or Endangered Languages and Cross-lingual Transfer

Antonios Anastasopoulos

MTMA, May 31, 2019

Page 1: NLP for Low-resource or Endangered Languages and Cross

NLP for Low-resource or Endangered Languages and Cross-lingual Transfer

Antonios Anastasopoulos

MTMA, May 31, 2019

Page 2: NLP for Low-resource or Endangered Languages and Cross

2

There are about 7,000 languages in the world.

Page 3: NLP for Low-resource or Endangered Languages and Cross

3

There are about 7,000 languages in the world.

According to UNESCO, 43% of the world’s languages are endangered or vulnerable.

Page 4: NLP for Low-resource or Endangered Languages and Cross

[Maps from The Endangered Languages Project, shown across Pages 4–8; legend: dormant, critically endangered, endangered, at risk]

Page 9: NLP for Low-resource or Endangered Languages and Cross

�9

Language Documentation1. Collect (record) data

2. Transcribe and translate 3. Perform analysis

4. Elicit further paradigms 5. Prepare a grammar

Page 10: NLP for Low-resource or Endangered Languages and Cross

10

Steven Bird

Making an Audio Rosetta Stone

Page 11: NLP for Low-resource or Endangered Languages and Cross

My work

11

Develop methods that will automate and speed up the language documentation process:

• Alignment (segmentation)
• Transcription
• Translation
• Analysis

Page 12: NLP for Low-resource or Endangered Languages and Cross

Leveraging Translations for Speech Transcription in Low-Resource Settings

Antonios Anastasopoulos and David Chiang. Interspeech 2018

12

Speech Transcription using Translations

Page 13: NLP for Low-resource or Endangered Languages and Cross

13

multi-source models: encoder-decoder model

[Diagram: a standard attention-based encoder-decoder, shown twice on the slide. The input x_1 … x_N is encoded into hidden states h_1 … h_N; at each output step, attention over these states produces a context vector c_m, the decoder computes a state s_m, and a softmax yields P(y_1 … y_M).]
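To make the pipeline above concrete, here is a minimal PyTorch sketch of a single attention-decoder step (illustrative only; the layer types, sizes, and the name AttentionDecoderStep are assumptions, not the model used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, hid_dim: int, vocab_size: int):
        super().__init__()
        self.score = nn.Linear(2 * hid_dim, 1)        # additive-style attention scoring (assumed)
        self.rnn_cell = nn.GRUCell(hid_dim, hid_dim)  # decoder recurrence
        self.out = nn.Linear(hid_dim, vocab_size)     # projection before the softmax

    def forward(self, s_prev, enc_states):
        # enc_states: (N, hid_dim) encoder states h_1..h_N; s_prev: (hid_dim,) previous decoder state
        expanded = s_prev.unsqueeze(0).expand_as(enc_states)
        alpha = F.softmax(self.score(torch.cat([enc_states, expanded], dim=-1)).squeeze(-1), dim=0)
        c = (alpha.unsqueeze(-1) * enc_states).sum(dim=0)                   # context vector c_m
        s = self.rnn_cell(c.unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)   # new decoder state s_m
        return F.log_softmax(self.out(s), dim=-1), s, alpha                 # P(y_m | ...), state, attention
```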

Page 14: NLP for Low-resource or Endangered Languages and Cross

14

multi-source models: results

[Bar chart: Character Error Rate (0–80) on Ainu (2k), Mboshi (5k), and Spanish (17k), comparing speech translation, ensemble, multisource, multisource+tied, and multisource+shared systems.]

Page 15: NLP for Low-resource or Endangered Languages and Cross

15

multi-source models: coupled ensemble [Dutt et al. 2017]

[Diagram: two complete encoder-attention-decoder models, one per input source (x^1_1 … x^1_{N1} and x^2_1 … x^2_{N2}); their decoder states are combined in a single softmax producing P(y_1 … y_M).]

Page 16: NLP for Low-resource or Endangered Languages and Cross

16

multi-source models: results

[Same Character Error Rate chart, with bars for additional systems added.]

Page 17: NLP for Low-resource or Endangered Languages and Cross

17

multi-source models: multisource

[Diagram: two encoders, one per input source; each has its own attention over its states, and the resulting context vectors c_1 … c_M feed a single shared decoder and softmax producing P(y_1 … y_M).]

Page 18: NLP for Low-resource or Endangered Languages and Cross

18

multi-source models: results

[Same Character Error Rate chart, with bars for additional systems added.]

Page 19: NLP for Low-resource or Endangered Languages and Cross

19

multi-source models: attention parameter sharing

[Diagram: the same two-encoder multisource model, with its two attention mechanisms highlighted.]

Standard: each source has its own attention parameters.
Shared: the two attention mechanisms share their parameters.
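A minimal sketch of the multi-source decoder step, including the attention parameter sharing option described above (the names, layer choices, and the share_attention flag are my assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceDecoderStep(nn.Module):
    def __init__(self, hid_dim: int, vocab_size: int, share_attention: bool = True):
        super().__init__()
        self.attn1 = nn.Linear(2 * hid_dim, 1)                       # attention over source 1 states
        # "shared" variant: reuse the very same attention parameters for source 2
        self.attn2 = self.attn1 if share_attention else nn.Linear(2 * hid_dim, 1)
        self.rnn_cell = nn.GRUCell(2 * hid_dim, hid_dim)             # consumes [c^1; c^2]
        self.out = nn.Linear(hid_dim, vocab_size)

    def _context(self, attn, s_prev, states):
        scores = attn(torch.cat([states, s_prev.unsqueeze(0).expand_as(states)], dim=-1)).squeeze(-1)
        return (F.softmax(scores, dim=0).unsqueeze(-1) * states).sum(dim=0)

    def forward(self, s_prev, enc1_states, enc2_states):
        c1 = self._context(self.attn1, s_prev, enc1_states)          # context over e.g. the speech
        c2 = self._context(self.attn2, s_prev, enc2_states)          # context over e.g. the translation
        s = self.rnn_cell(torch.cat([c1, c2]).unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)
        return F.log_softmax(self.out(s), dim=-1), s
```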

Page 20: NLP for Low-resource or Endangered Languages and Cross

20

multi-source models: results

[Same Character Error Rate chart, now with bars for all compared systems.]

Page 21: NLP for Low-resource or Endangered Languages and Cross

Tied Multitask Models for Speech Transcription and Translation

Antonios Anastasopoulos and David Chiang. NAACL 2018.

21

Speech Transcription and Translation

Page 22: NLP for Low-resource or Endangered Languages and Cross

22

multi-task models: pivot

[Diagram: two encoder-decoder models in a cascade: the first transcribes the speech, and a second encoder-decoder translates that transcription.]

Page 23: NLP for Low-resource or Endangered Languages and Cross

23

multi-task models: end-to-end

[Diagram: a single encoder-decoder model mapping the speech directly to the target-language text, without an intermediate transcription.]

An Attentional Model for Speech Translation without Transcription
Long Duong, Antonios Anastasopoulos, Trevor Cohn, Steven Bird, and David Chiang. NAACL 2016.

Page 24: NLP for Low-resource or Endangered Languages and Cross

24

multi-task models: simple multitask

[Diagram: one shared encoder over the speech x_1 … x_N; two separate attention + decoder + softmax branches produce the transcription distribution P(y^1_1 … y^1_{M1}) and the translation distribution P(y^2_1 … y^2_{M2}).]
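A minimal sketch of the simple multitask objective: one shared encoder, one decoder per task, and the two cross-entropy losses added together (the function and argument names are placeholders, not the paper's implementation):

```python
import torch.nn.functional as F

def multitask_loss(encoder, transcribe_decoder, translate_decoder,
                   speech, transcript_ids, translation_ids):
    # Shared encoder states h_1..h_N, consumed by both task-specific decoders.
    enc_states = encoder(speech)
    # Each decoder returns per-position vocabulary logits (teacher forcing / shifting omitted for brevity).
    logits_tr = transcribe_decoder(enc_states, transcript_ids)
    logits_mt = translate_decoder(enc_states, translation_ids)
    loss_tr = F.cross_entropy(logits_tr.view(-1, logits_tr.size(-1)), transcript_ids.view(-1))
    loss_mt = F.cross_entropy(logits_mt.view(-1, logits_mt.size(-1)), translation_ids.view(-1))
    # Joint objective: simply sum the transcription and translation losses.
    return loss_tr + loss_mt
```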

Page 25: NLP for Low-resource or Endangered Languages and Cross

25

multi-task models: results

[Two bar charts: Translation character BLEU for English (from Ainu), French (from Mboshi), and English (from Spanish), and Transcription Character Error Rate for Ainu, Mboshi, and Spanish, comparing single-task, multitask, cascade, triangle, and triangle+trans systems.]

Page 26: NLP for Low-resource or Endangered Languages and Cross

26

multi-task models: triangle

[Diagram: one shared encoder; a first attention + decoder branch produces the transcription P(y^1_1 … y^1_{M1}); the second decoder attends over both the encoder states and the first decoder's states and produces the translation P(y^2_1 … y^2_{M2}).]

Page 27: NLP for Low-resource or Endangered Languages and Cross

27

multi-task models: results

[Same BLEU and Character Error Rate charts, with bars for additional systems added.]

Page 28: NLP for Low-resource or Endangered Languages and Cross

28

multi-task models: transitivity regularizer

[Diagram: the triangle model, with its attention matrices highlighted.]

R_trans = −λ_trans ‖A^(12) A^(1) − A^(2)‖_2^2

If A attends over B…
and B attends over C…
this should be similar to A attending directly over C.
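A minimal sketch of computing the transitivity penalty, assuming A^(1) is the transcription-over-speech attention matrix, A^(12) the translation-over-transcription matrix, and A^(2) the translation-over-speech matrix (the shapes, the default weight, and the sign convention — whether the term is added to a loss or to an objective — are assumptions):

```python
import torch

def transitivity_regularizer(A1: torch.Tensor, A12: torch.Tensor, A2: torch.Tensor,
                             lam_trans: float = 0.2) -> torch.Tensor:
    # A1:  (M1, N)  transcription attending over the speech frames (assumed shapes)
    # A12: (M2, M1) translation attending over the transcription
    # A2:  (M2, N)  translation attending directly over the speech frames
    # Penalize the mismatch between the composed attention A12·A1 and the direct attention A2.
    return lam_trans * torch.sum((A12 @ A1 - A2) ** 2)
```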

Page 29: NLP for Low-resource or Endangered Languages and Cross

29

multi-task models: results

[Same BLEU and Character Error Rate charts, now with bars for all compared systems.]

Page 30: NLP for Low-resource or Endangered Languages and Cross

30

We can improve translation and transcription accuracy by jointly performing the two tasks.

Translation can be further improved by using intermediate representations and transitivity.

Page 31: NLP for Low-resource or Endangered Languages and Cross

31

Other (relevant and ongoing) work

work with Graham Neubig

Build a tool for linguists that uses ML in its backend to aid annotation:

Page 32: NLP for Low-resource or Endangered Languages and Cross

Data Augmentation, Cross-Lingual Transfer,

and other nice things

Page 33: NLP for Low-resource or Endangered Languages and Cross

Using related languages for MT

"Generalized Data Augmentation for Low-Resource Translation"

Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. ACL 2019.

Page 34: NLP for Low-resource or Endangered Languages and Cross

Transfer for MT

Typical scenario: continued training

figure from "Transfer Learning across Low-Resource, Related Languages for NMT". Nguyen and Chiang, 2018.

Page 35: NLP for Low-resource or Endangered Languages and Cross

Machine Translation

The current best approach is a semi-supervised one:

• Back-translation of target-side monolingual data

What if we don’t have tons of monolingual data for a language?

Does the quality of the back-translated data matter?

Page 36: NLP for Low-resource or Endangered Languages and Cross

Generalized Back-Translation

For low-resource languages, there may exist a related high-resource one, e.g.:

1. Azerbaijani (Turkish)

2. Belarusian (Russian)

3. Galician (Portuguese)

4. Slovak (Czech)

We should use them!

Page 37: NLP for Low-resource or Endangered Languages and Cross

Generalized Back-Translation

Page 38: NLP for Low-resource or Endangered Languages and Cross

Generalized Back-Translation

Typical:

only use [1] for data augmentation

OR

add [b] to [c] and train.

But HRL to LRL might be easier!

Page 39: NLP for Low-resource or Endangered Languages and Cross

From HRL to LRL

Assuming a parallel HRL-LRL dataset is probably too much to ask. If the languages are related enough:

1. Get monolingual embeddings
2. Align the embedding spaces [Lample et al., 2018]
3. Learn a dictionary
4. Use word substitution in the HRL data to create pseudo-LRL data (see the sketch below)
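A minimal sketch of step 4, assuming a word-level HRL-to-LRL dictionary has already been induced from the aligned embeddings (the function name and the toy Turkish-to-Azerbaijani entries are hypothetical):

```python
# Replace HRL words with their induced LRL translations to turn an ENG-HRL parallel
# corpus into pseudo ENG-LRL training data.
def make_pseudo_lrl(hrl_sentence: str, hrl2lrl: dict) -> str:
    # Words without a dictionary entry are kept unchanged, which is reasonable for
    # closely related languages that share cognates and proper nouns.
    return " ".join(hrl2lrl.get(word, word) for word in hrl_sentence.split())

# Hypothetical usage with a toy induced dictionary (Turkish -> Azerbaijani):
toy_dict = {"ben": "mən", "geldim": "gəldim"}
print(make_pseudo_lrl("ben eve geldim", toy_dict))   # -> "mən eve gəldim"
```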

Page 40: NLP for Low-resource or Endangered Languages and Cross

From ENG to LRL through HRL

An ENG-to-LRL system would be bad (duh!)

An ENG-to-HRL system would be better…

…and HRL-to-LRL might be easy-ish (because the languages are related)

Page 41: NLP for Low-resource or Endangered Languages and Cross

Results

Page 42: NLP for Low-resource or Endangered Languages and Cross

Takeaways

Translating from HRL to LRL:

• it is better to use word substitution than plain NMT or standard UMT (cf. lines 5,6 vs. 7,8,9)

Pivoting from ENG through HRL also brings improvements, but not as large.

Best of both worlds works best (line 12)

• More ENG data, with as-good-as-possible LRL data

Page 43: NLP for Low-resource or Endangered Languages and Cross

Using Related Languages for

Morphological Inflection

Page 44: NLP for Low-resource or Endangered Languages and Cross

Inflection task and SIGMORPHON

SIGMORPHON challenge: 100 language pairs (43 test languages)

Page 45: NLP for Low-resource or Endangered Languages and Cross

Previous Work

Concatenate tags and lemma, single encoder-decoder

• Issue: tags and lemma characters are inherently different (in order and function)

Half the task is identifying the stem/root and copying its characters, so other work focuses on copying:

• explicit copy mechanism, or

• hard monotonic attention, or

• learn to output the string transduction steps

Page 46: NLP for Low-resource or Endangered Languages and Cross

Augmentation approach: hallucinating data

Most low-resource languages have just 50 or 100 examples.

You can hallucinate more data:

b a i l a r
| | | | |
b a i l a b a

The aligned stem characters (shown in red on the slide) are replaced with random characters.
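A minimal sketch of the hallucination idea: find the aligned stem shared by the lemma and the inflected form and swap it for a random string (the alignment heuristic, minimum stem length, and alphabet here are assumptions; the paper's procedure may differ):

```python
import random
import string
from difflib import SequenceMatcher

def hallucinate(lemma: str, inflected: str, alphabet: str = string.ascii_lowercase):
    # Find the longest shared substring and treat it as the aligned stem.
    match = SequenceMatcher(None, lemma, inflected).find_longest_match(0, len(lemma), 0, len(inflected))
    if match.size < 3:          # too little shared material to treat as a stem
        return lemma, inflected
    # Replace the stem with a random string (ideally sampled from the language's own alphabet).
    fake_stem = "".join(random.choice(alphabet) for _ in range(match.size))
    new_lemma = lemma[:match.a] + fake_stem + lemma[match.a + match.size:]
    new_infl = inflected[:match.b] + fake_stem + inflected[match.b + match.size:]
    return new_lemma, new_infl

# e.g. hallucinate("bailar", "bailaba") might return ("qzkvmr", "qzkvmba")
```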

Page 47: NLP for Low-resource or Endangered Languages and Cross

Results on transfer from single language

If languages are genetically distant, transfer does NOT help.

A shared alphabet is crucial: see Kurmanji-Sorani.

Page 48: NLP for Low-resource or Endangered Languages and Cross

But we can do better if transferring from multiple (related) languages

e.g.

Page 49: NLP for Low-resource or Endangered Languages and Cross

Interpreting the model

Page 50: NLP for Low-resource or Endangered Languages and Cross

Takeaways

1. Monolingual data hallucination can take you a long way…

2. …and it is preferable to cross-lingual transfer from distant languages

3. If the languages are close enough, both data hallucination and cross-lingual transfer should help

4. The closer the languages, the larger the improvements

Main issues:

• Data hallucination is language-agnostic; a more informed sampling could probably do better

• Different alphabets really hurt performance (Dutch-Yiddish, Kurmanji-Sorani). We need either an a priori mapping between the two alphabets, or to map them to a common space (IPA?)

Page 51: NLP for Low-resource or Endangered Languages and Cross

What language should you use for cross-lingual transfer?

"Choosing Transfer Languages for Cross-Lingual Learning"

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig

ACL 2019

Page 52: NLP for Low-resource or Endangered Languages and Cross

Setting

Cross-lingual transfer on 4 tasks:

1. MT: 54x54 pairs (X-Eng TED)

2. POS-tagging: 60x26

3. Entity Linking: 53x9

4. DEP parsing: 30x30

Page 53: NLP for Low-resource or Endangered Languages and Cross

Learning to Rank

For each language pair, extract features:

1. dataset-dependent:

• dataset size

• type-token ratio

• word/subword overlap

2. dataset-independent:

• typological features (from URIEL)

[plug: check out the lang2vec python library, now with pre-computed distances!]
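For illustration, querying URIEL typological features through lang2vec might look roughly like this (hedged: the exact feature-set names and return format may differ across package versions):

```python
# pip install lang2vec
import lang2vec.lang2vec as l2v

# Typological feature vectors (kNN-imputed syntax features) for two ISO 639-3 codes.
features = l2v.get_features(["aze", "tur"], "syntax_knn")
print(len(features["aze"]), "syntactic features for Azerbaijani")
```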

Page 54: NLP for Low-resource or Endangered Languages and Cross

Learning to Rank

For each test language and a list of potential transfer languages (each pair represented by the features), train a model to rank the candidate languages.

Model: tree-based LambdaRank (good in limited feature/data settings)

[ Available as a python package too: https://github.com/neulab/langrank ]
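As a rough illustration of the setup (not the released langrank model itself), a tree-based LambdaRank ranker can be trained with LightGBM; the features, labels, and group sizes below are random placeholders:

```python
import numpy as np
import lightgbm as lgb

# One row per (task language, candidate transfer language) pair, e.g.
# [transfer dataset size, type-token ratio, word overlap, typological distance].
X = np.random.rand(120, 4)
# Relevance label per pair, e.g. derived from how well that transfer language actually performed.
y = np.random.randint(0, 5, size=120)
# "group" marks which rows share the same task language (here: 4 task languages x 30 candidates each).
groups = [30, 30, 30, 30]

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=groups)

# At test time, score all candidate transfer languages for a new task language and sort.
scores = ranker.predict(np.random.rand(30, 4))
best_candidates = np.argsort(-scores)[:3]
print(best_candidates)
```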

Page 55: NLP for Low-resource or Endangered Languages and Cross

Other Cool Things

Page 56: NLP for Low-resource or Endangered Languages and Cross

Language Technology for Language Documentation and Revitalization

Hackathon-type workshop at CMU, Aug 12-16, 2019

▪ Language community members
▪ Documentary linguists
▪ Computational linguists
▪ Computer scientists and developers

Example projects:

▪ Building and using tools for rapid dictionary creation
▪ Building and using tools for the development of speech recognition systems
▪ Building and using tools to analyze the syntax of a language and extract example sentences for educational materials
▪ Creating a plugin that incorporates language technology into language documentation software such as ELAN/Praat