language stuff - alan ritteraritter.github.io/courses/slides/speech_mt.pdf7 hal daumé iii...

67
Language Stuff (Slides from Hal Daume III)

Upload: others

Post on 07-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

Language Stuff(Slides from Hal Daume III)

Page 2: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI2 Hal Daumé III ([email protected])

Digitizing Speech

Page 3: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI3 Hal Daumé III ([email protected])

Speech in an Hour

➢ Speech input is an acoustic wave form

s p ee ch l a b

Graphs from Simon Arnfield’s web tutorial on speech, Sheffield:http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

“l” to “a”transition:

Page 4: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI4 Hal Daumé III ([email protected])

➢ Frequency gives pitch; amplitude gives volume➢ sampling at ~8 kHz phone, ~16 kHz mic (kHz=1000 cycles/sec)

➢ Fourier transform of wave displayed as a spectrogram➢ darkness indicates energy at each frequency

s p ee ch l a b

freq

uen

cyam

pl i

tud

e

Spectral Analysis

Page 5: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI5 Hal Daumé III ([email protected])

Adding 100 Hz + 1000 Hz Waves

Time (s)0 0.05

–0.9654

0.99

0

Page 6: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI6 Hal Daumé III ([email protected])

Spectrum

100 1000Frequency in Hz

Am

pli

tude

Frequency components (100 and 1000 Hz) on x-axis

Page 7: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI7 Hal Daumé III ([email protected])

Part of [ae] from “lab”

➢ Note complex wave repeating nine times in figure

➢ Plus smaller waves which repeats 4 times for every large pattern

➢ Large wave has frequency of 250 Hz (9 times in .036 seconds)

➢ Small wave roughly 4 times this, or roughly 1000 Hz

➢ Two little tiny waves on top of peak of 1000 Hz waves

Page 8: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI8 Hal Daumé III ([email protected])

Back to Spectra

➢ Spectrum represents these freq components

➢ Computed by Fourier transform, algorithm which separates out each frequency component of wave.

➢ x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)

➢ Peaks at 930 Hz, 1860 Hz, and 3020 Hz.

Page 9: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI9 Hal Daumé III ([email protected])

Acoustic Feature Sequence

➢ Time slices are translated into acoustic feature vectors (~39 real numbers per slice)

➢ These are the observations, now we need the hidden states X

frequency

……………………………………………..e12e13e14e15e16………..

Page 10: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI10 Hal Daumé III ([email protected])

State Space

➢ P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)

➢ P(X|X’) encodes how sounds can be strung together

➢ We will have one state for each sound in each word

➢ From some state x, can only:➢ Stay in the same state (e.g. speaking slowly)

➢ Move to the next position in the word

➢ At the end of the word, move to the start of the next word

➢ We build a little state graph for each word and chain them together to form our state space X

Page 11: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI11 Hal Daumé III ([email protected])

HMMs for Speech

Page 12: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI12 Hal Daumé III ([email protected])

Markov Process with Bigrams

Figure from Huang et al page 618

Page 13: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI13 Hal Daumé III ([email protected])

Decoding

➢ While there are some practical issues, finding the words given the acoustics is an HMM inference problem

➢ We want to know which state sequence x1:T is most likely

given the evidence e1:T:

➢ From the sequence x, we can simply read off the words

Page 14: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI14 Hal Daumé III ([email protected])

Training (aka “preview of ML”)

➢ Two key components of a speech HMM:

➢ Acoustic model: p(E | X)

➢ Language model: p(X | X')

➢ Where do these come from?

➢ Can we estimate these models from data:

➢ p(E | X) might be estimated from transcribed speech

➢ p(X | X') might be estimated from large amounts of raw text

Page 15: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI15 Hal Daumé III ([email protected])

n-gram Language Models

➢ Assign a probability to a sequences of words

➢ If I gave you a copy of the web, how would you estimate these probabilities?

pw1,w2, ... , wI = ∏i=1

I

pwi ∣w1, ... ,wi−1

≈ ∏i=1

I

pwi ∣wi−k , ... , wi−1

Need to “smooth” estimates intelligently to avoid zero probability n-grams.

Language modeling is the art of good smoothing.

See [Goodman 1998], [Teh 2007]

Page 16: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI16 Hal Daumé III ([email protected])

Acoustic models

➢ What if I gave you data like:

➢ How would you estimate p(E|X)?

➢ What's wrong with this approach?

freq

uen

cy

………………………...…………..sp ee ch l ae b......

Page 17: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI17 Hal Daumé III ([email protected])

Acoustic models II

➢ What does our data really look like:

➢ We'd like to know alignments between transcript and waveform

➢ Suppose someone gave us a good speech recognizer.... could we figure out alignments from that?

yesterday I went to visit the speech lab

Acc:

W:

Page 18: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI18 Hal Daumé III ([email protected])

Expectation Maximization

➢ A general framework to do parameter estimation in the presence of hidden variables

➢ Repeat ad infinitum:➢ E-step: make probabilistic guesses at latent variables

➢ M-step: fit parameters according to these guesses

I LIKE A IW:

Page 19: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI19 Hal Daumé III ([email protected])

Expectation Maximization

I LIKE A IW:

Acc:

e p( e | “I”) p( e | “LIKE”) p( e | “A”)

0.33 0.33 0.33

0.33 0.33 0.33 0.33 0.33 0.33

→ 2 → 1 → 1

→ 1 → 1 → 1

→ 1 → 1 → 1

Page 20: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI20 Hal Daumé III ([email protected])

Expectation Maximization

I LIKE A IW:

Acc:

e p( e | “I”) p( e | “LIKE”) p( e | “A”)

0.5 0.33 0.33

0.25 0.33 0.33 0.25 0.33 0.33

→ 4 → 1 → 1

→ 1 → 2 → 2

→ 1 → 2 → 2

Page 21: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI21 Hal Daumé III ([email protected])

State of the Art DBNs for Speech

Page 22: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI22 Hal Daumé III ([email protected])

Summary

➢ HMMs allow us to “separate” two models:➢ acoustic model (how does what I want to say sound?)

➢ language model (what do I want to say)

➢ Speech recognition is “just” decoding in an HMM/DBN

➢ Plus a heck of a lot of engineering

➢ Expectation maximization lets us estimate parameters in models with hidden variables

➢ Most research today focuses on language modeling

Page 23: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI23 Hal Daumé III ([email protected])

Translate Centauri -> Arcturan

1a. ok-voon ororok sprok .1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .12b. wat nnat forat arrat vat gat .

Your assignment, translate this Centauri sentence to Arcturan:

farok crrrok hihok yorok clok kantok ok-yurp

Page 24: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI24 Hal Daumé III ([email protected])

Topology of the Field

Human Language Technologies

Automatic SpeechRecognition

InformationRetrieval

ComputationalLinguistics

Natural LanguageProcessing

ICASSP

??? ACL

SIGIR

NLP

Machine Translation

Summarization

Question Answering

Information Extraction

Parsing

“Understanding”

Generation

Page 25: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI25 Hal Daumé III ([email protected])

A Bit of History

1940s Computations begins, AI hot, Turing test

Machine translation = Code-breaking?

1950s Cold war continues

1960s Chomsky and statistics, ALPAC report

1970s Dry spell

1980s Statistics makes significant advances in speech

1990s Web arrives

Statistical revolution in machine translation, parsing, IE, etc

Serious “corpus” work, increasing focus on evaluation

2000s Focus on optimizing loss functions, reranking

How much can we automate?

Huge process in machine translation

Gigantic corpora become available, scaling

New challenges

Page 26: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI26 Hal Daumé III ([email protected])

Ready-to-use Data

1994 1996 1998 2000 2002 20040

20

40

60

80

100

120

140

160

180

French-English

Chinese-English

Arabic-English

Mil

lio

ns

of

Wo

rds

(En

gli

sh s

i de)

Page 27: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI27 Hal Daumé III ([email protected])

Classical MT (1970s and 1980s)

SourceText

SourceLanguageAnalysis

SourceLexicon

TargetText

TargetLanguageGeneration

TargetLexicon

KnowledgeBase

Transfer/Interlingua

Representation

TransferRules

Page 28: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI28 Hal Daumé III ([email protected])

Layers of complexity

➢ Text:

➢ Morphology:

➢ Syntax:

➢ Semantics:

John saw hisbrotherJohn see hebrother

+past +gen

S

NNP VBD PRN NN

NP NP

VP

Agent [ Person John ]Event see (+past)Patient Person brother

Poss *

Page 29: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI29 Hal Daumé III ([email protected])

How Much Analysis?

Source Words Target Words

SourceMorphology

Source Syntax

Source Semantics

Interlingua

TargetMorphology

Target Syntax

Target Semantics

AnalysisGeneration

Direct

Classical

Not practicalfor open domain

Becoming Popular

Page 30: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI30 Hal Daumé III ([email protected])

Syntactic Transfer

➢ Now, just get a bunch of linguists to sit down and write rules and grammars

The student will see the man

D N AX V D N

NP NP

SUB A V OB

S

Der Student wird den Mann seben

D N AX D N V

NP NP

SUB A OB V

S

Page 31: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI31 Hal Daumé III ([email protected])

While We're Busy Writing Grammars

Claude Shannon

InformationSource

NoisyChannel

CorruptedOutput

p(s) p(o | s) p(s | o) ∝ p(s) * p(o | s)

“Imagined”Words

Speech Process Acoustic Signal

Need: p(word sequence) and p(signal | word sequence)

Page 32: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI32 Hal Daumé III ([email protected])

Acoustic Modeling: p(a | w)

Signal:

Transcription: the man ate

Key notion: acoustic-word alignments

Chicken-and-eggproblem that wecan solve usingExpectationMaximization(EM)

Page 33: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI33 Hal Daumé III ([email protected])

Speech Rec = Machine Translation?

➢ Peter F. Brown

➢ Stephen A. Della Pietra

➢ Vincent J. Della Pietra

➢ Robert Mercer

➢ The Mathematics of Statistical Machine Translation: Parameter Estimation

➢ Computational Linguistics 19 (2), June 1993

➢ Probably the most important paper in NLP in the last 20 years

“Brown 93”

Page 34: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI34 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 35: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI35 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 36: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI36 Hal Daumé III ([email protected])

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Page 37: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI37 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Page 38: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI38 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 39: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI39 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 40: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI40 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 41: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI41 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Page 42: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI42 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Page 43: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI43 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

process of

elimination

Page 44: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI44 Hal Daumé III ([email protected])

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

cognate?

Page 45: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI45 Hal Daumé III ([email protected])

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .

4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .

5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .

6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

zerofertility

Page 46: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI46 Hal Daumé III ([email protected])

Unsupervised EM Training

… la maison ….. la maison bleue …... la fleur …

… the house ….. the blue house ...… the flower …

All P(french-word | english-word) equally likely

Page 47: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI47 Hal Daumé III ([email protected])

Unsupervised EM Training

… la maison ..… la maison bleue ...… la fleur …

… the house ..… the blue house ...… the flower …

“la” and “the” observed to co-occur frequently,so P(la | the) is increased.

Page 48: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI48 Hal Daumé III ([email protected])

Unsupervised EM Training

… la maison ..… la maison bleue …... la fleur …

… the house ..… the blue house ...… the flower …

“maison” co-occurs with both “the” and “house”, butP(maison | house) can be raised without limit, to 1.0,

while P(maison | the) is limited because of “la”

(pigeonhole principle)

Page 49: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI49 Hal Daumé III ([email protected])

Unsupervised EM Training

… la maison ….. la maison bleue ...… la fleur …

… the house ..… the blue house …... the flower …

settling down after another iteration

Page 50: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI50 Hal Daumé III ([email protected])

Unsupervised EM Training

… la maison ….. la maison bleue …... la fleur …

… the house ….. the blue house ...… the flower …

Inherent hidden structure revealed by EM training!

• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.

• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)

• Software: GIZA++

Page 51: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI51 Hal Daumé III ([email protected])

The IBM Model [Brown et al., 1993]

Mary did not slap the green witch

Mary not slap slap slap the green witch n(3|slap)

Maria no daba una botefada a la bruja verde

d(j|i)

Mary not slap slap slap NULL the green witch

P NULL

Maria no daba una botefada a la verde bruja

t(la|the)

Use the EM algorithm for training the parameters

Page 52: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI52 Hal Daumé III ([email protected])

Decoding for Machine Translation

EnglishSource

TranslationModel

FrenchOutput

p(e) p(f | e) p(e | f) ∝ p(e) * p(f | e)

Decoding: e = argmax

epe p f ∣ e

Problem in NP-hard; use search:

Beam Search

Greedy SearchInteger Programming

A* Search

Page 53: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI53 Hal Daumé III ([email protected])

Progress in Statistical MT

insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

20022002 20032003

slide from C. Wayne, DARPA

Page 54: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI54 Hal Daumé III ([email protected])

Reference translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Reference translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Reference translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Tri-gram match

Bi-gram matches

Automatic Evaluation of Translation

“Bleu” metric

Page 55: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI55 Hal Daumé III ([email protected])

Minimum Error Rate Training for MT

➢ Desire MT system with high BLEU/??? scores

➢ Algorithm:➢ Build MT system based on generative parameters

➢ Decode development corpus to get n-best lists (~10k best)

➢ Optimize parameters to get high BLEU scores on n-best lists

➢ Repeat until converged

[Och, ACL03]

Page 56: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI56 Hal Daumé III ([email protected])

56

Phrase-Based Translation

Austra

lia iswith

Nor

th

diplom

atic relations

have

that

coun

tries

few

one of

Kor

e

a

澳洲 是 北与 韩 有 邦交 的 国家 之一少数

Australia is withNorthKorea

isdiplomaticrelations

one of the few

countries

Australia isdiplomaticrelations

withNorthKorea

isone of the few

countries

[Koehn, Och and Marcu, NAACL03]

Page 57: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI57 Hal Daumé III ([email protected])

57

Training Phrase-Based MT Systems

Mar

ia no

dab

a

una

botef a

da a la

bru

ja

ver

de

witch

green

the

slap

not

did

Mary

[Koehn, Och and Marcu, NAACL03]

Page 58: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI58 Hal Daumé III ([email protected])

58

Decoding Phrase-Based MT

Maria no daba una botefada a la bruja verde

Mary did not slap the green witch

➢ Each step induces a cost attributed to:➢ Language model probability: p(slap | did not)

➢ T-table probability: p(the | a la) and p(a la | the)

➢ Distortion probability: p(skip 1) [for a la --> verde]

➢ Length penalty

➢ ...

[Koehn, Och and Marcu, NAACL03]

Page 59: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI59 Hal Daumé III ([email protected])

59

Hierarchical Phrase-Based MT

澳洲 是 与北 韩 有 邦交 的 国家 之一少数

fewcountries

NorthKorea

diplomaticrelationshave with 之一的isAustralia

Australia isNorthKorea

diplomaticrelations

fewcountries

有 的与

[Chiang, ACL05]

Page 60: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI60 Hal Daumé III ([email protected])

60

Hierarchical Phrase-Based MT

fewcountries

NorthKorea

diplomaticrelationshave with 之一的isAustralia

fewcountries

NorthKorea

diplomaticrelationshave withthatisAustralia

one of

the

isAustraliafew

countriesNorthKorea

diplomaticrelationshave withthat 之一

[Chiang, ACL05]

Page 61: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI61 Hal Daumé III ([email protected])

Syntax for MT

I ate lunch

PRO VBD NN

NPNP

VP

S

tabetaate

VBD

hirugohan

lunch

NN

NP

watashi

I

PRO

NP

VP

x yy wo x

S

x yx wa y

I

PRO

NP

ate lunch

VBD NN

NP

VP

wa

ate lunch

VBD NN

NP

VP

watashi wa

atelunch

VBD

NN

NP

watashi wa wo

watashi wa hirugohan wo tabeta

Kevin Knight, Daniel Marcu, Ignacio Thayer, Jonathan Graehl, Jon May, Steve DeNeefe

Page 62: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI62 Hal Daumé III ([email protected])

Syntax for MT

➢ Decoding:

➢ Tree-to-tree/string automata

➢ CKY parsing algorithm

➢ Rule learning:

➢ Parsed English corpus

➢ Aligned data (GIZA++)

➢ Extract rules and assign probabilities

Time

BL

EU

Phrase-based MT

Syntax-based MT

Page 63: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

Conversation as MT[Ritter, Cherry, Dolan EMNLP 2011]

SMT

INPUT: Foreign Text

OUTPUT: English Text

LEARNING: Parallel Corpora

Page 64: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

Conversation as MT[Ritter, Cherry, Dolan EMNLP 2011]

SMT Chat

INPUT: Foreign Text User Utterance

OUTPUT: English Text Response

LEARNING: Parallel Corpora Conversations

Page 65: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

Phrase-based MT[Ritter, Cherry, Dolan EMNLP 2011]

Who wants to come over for dinner tomorrow?Input:

Output:

{want toYum ! I

{be there

{tomorrow !

{

Page 66: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating
Page 67: Language Stuff - Alan Ritteraritter.github.io/courses/slides/speech_mt.pdf7 Hal Daumé III (me@hal3.name) CS421: Intro to AI Part of [ae] from “lab” Note complex wave repeating

CS421: Intro to AI63 Hal Daumé III ([email protected])

Summary

➢ Old school tranislation = interlingua➢ Works well for limited domains

➢ Costs a lot of money

➢ New school translation = statistical➢ Started out naïve

➢ Becoming more linguistically motivated every year

➢ Translation is currently the “hot topic” in NLP➢ It looks like linguistics really is going to help, after all

➢ (so long as you use it wisely in conjunction with statistics)