
Post on 20-Dec-2015


Page 1: Administration Introduction/Signup sheet Course web site  Course location and time: Thursday,

Administration

Introduction/Signup sheet

Course web site

http://www.cs.princeton.edu/courses/archive/spring09/cos401/

Course location and time: Thursday, 1:30pm – 4:20pm, Robertson Hall 023

TA: Juan Carlos Niebles

Office: 215 Computer Science Bldg.

Phone: (609) 258-8241

Email:  jniebles [at] princeton

Office hour: TBD or by appointment.

Suggested Reading List:

(NSW) Readings in Machine Translation, S. Nirenberg, H. Somers and Y. Wilks, MIT Press, 2002

(AT) Translation Engines: Techniques for Machine Translation, Arturo Trujillo, Springer 1999

(JM) Speech and Language Processing, Jurafsky and Martin, Prentice Hall

(HS) An Introduction to Machine Translation, W. John Hutchins and Harold L. Somers, London: Academic Press, 1992.

Assessment:

Class participation and attendance 15%

Homework assignments 20%

Midterm exam 30%

Final exam/Term Paper 35%

Page 2:

Machine Translation

Srinivas Bangalore

AT&T Research

Florham Park, NJ 07932

Page 3:

The funnier side of translation…

• In a Belgrade hotel elevator
  – “The lift is being fixed for the next day. During that time we regret that you will be unbearable”
• In a Paris hotel lobby
  – “Please leave your values at the front desk”
• On the menu of a Swiss restaurant
  – “Our wines leave you nothing to hope for”
• Outside a Hong Kong tailor shop
  – “Ladies may have a fit upstairs”
• In an advertisement by a Hong Kong dentist
  – “Teeth extracted by the latest Methodists”
• In a Norwegian cocktail lounge
  – “Ladies are requested not to have children in the bar”
• In a pet shop in Malaysia
  – “For hygienic purposes, do not feed your hand to the dog”
• Machine Translation:
  – “The spirit is willing but the flesh is weak” → Russian → “The vodka is good but the meat is rotten”

Source: the web

Page 4:

Outline

• History of Machine Translation

• Machine Translation Paradigms

• Machine Translation Evaluation

• Applications of Machine Translation

Page 5:

Early days of Machine Translation

• Success in cryptography (code-breaking) during the war
• Source Text → Encoded Source Text → Transmit Text
• Receive Text → Decode Text → Target Text
• Ciphers: algorithms to encode and decode
  – Plain text → cipher text → decoded cipher text
  – cat → dog; fog → bat; ?? → bog
• Warren Weaver (1947)
  – “When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
• Ciphers are created to be hard to break, but are usually unambiguous.
• Natural languages are not as simple!!
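The cipher puzzle on this slide can be worked through mechanically. The sketch below assumes the pairs read plain text → cipher text and that the cipher is a simple letter-for-letter substitution; both are readings of the slide, not something it states. The function names are illustrative.

```python
def learn_substitution(pairs):
    """Build a plain-letter -> cipher-letter table from (plain, cipher) word pairs."""
    table = {}
    for plain, cipher in pairs:
        if len(plain) != len(cipher):
            raise ValueError("a letter substitution preserves word length")
        for p, c in zip(plain, cipher):
            # setdefault stores the mapping the first time and returns the
            # stored value afterwards, so a conflict shows up as a mismatch.
            if table.setdefault(p, c) != c:
                raise ValueError(f"inconsistent mapping for {p!r}")
    return table


def decode(cipher_word, table):
    """Invert the table and decode one word; unknown letters become '?'."""
    inverse = {c: p for p, c in table.items()}
    return "".join(inverse.get(ch, "?") for ch in cipher_word)


# The two solved pairs from the slide (assumed direction: plain -> cipher).
table = learn_substitution([("cat", "dog"), ("fog", "bat")])
print(decode("bog", table))
```

Under these assumptions the unknown plain-text word is recovered letter by letter. Natural language offers no such one-to-one table, which is the point of the slide's last bullet.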

Page 6:

Complexity of Machine Translation

• Computer program compilation is translation
  – Languages are designed to be unambiguous and formal
  – Source language and target language
• Natural languages are ambiguous
  – Lexical (e.g. bank, lead)
  – Structural (e.g. John saw a man with a telescope; flying planes can be dangerous)
• For Machine Translation:
  – Ambiguity is compounded!!
  – Mapping between words of the two languages is not unique
  – Lexical gaps
    • Languages have different mappings from concepts to words
  – Word order differences
    • English: Subject-Verb-Object
    • Japanese, Hindi: Subject-Object-Verb

Page 7:

Issues in Machine Translation

• Orthography
  – Writing from left-to-right vs right-to-left
  – Character sets (alphabetic, logograms, pictograms)
  – Segmentation into word/word-like units
• Morphology
• Lexical: Word senses
  – bank: “river bank”, “financial institution”
• Syntactic: Word order
  – Subject-verb-object vs. subject-object-verb
• Semantic: meaning
  – “ate pasta with a spoon”, “ate pasta with marinara”, “ate pasta with John”
• Pragmatic: world knowledge
  – “Can you pass me the salt?”
• Social: conversational norms
  – pronoun usage depends on the conversational partner
• Cultural: idioms and phrases
  – “out of the ballpark”, “came from left field”
• Contextual
• In addition for Speech Translation
  – Prosody: JOHN eats bananas; John EATS bananas; John eats BANANAS
  – Pronunciation differences
  – Speech recognition errors
• In a multilingual environment
  – Code Switching: Use of linguistic apparatus of one language to express ideas in another language.

Page 8:

Machine Translation: Why and what’s it good for?

• Understanding people across linguistic barriers
  – Socio-Political
  – Commercial: Globalization
• Limited availability of human expertise
• What is it good for?
  – Tasks with limited vocabulary and syntax (technical manuals)
  – Rough translations for web pages, emails
  – Applications that use translation as one of the components
• What is it not good for?
  – Hard and important domains (literature, legal, medical)
• Machine Translation need not be fully automated!!
  – Human-assisted machine translation
  – Machine-assisted human translation
  – Machine Translation as a productivity enhancement tool.

Page 9:

Machine Translation: Past and Present

• 1947-1954: MT as code breaking; IBM-Georgetown Univ. demonstration
• 1954-1966: Large bilingual dictionaries; syntactic reordering motivated by linguistics and formal grammars; lots of funding, little progress
• 1966: ALPAC report: “there is no immediate or predictable prospect of useful fully automatic machine translation”
• 1966-1980s: Translation continued in Canada, France and Germany. Beyond English-Russian translation. Meteo for translating weather reports. Systran in 1970
• 1980-1990: Emphasis on ‘indirect’ translation: semantic and knowledge-based. Advent of microcomputers. Translation companies: Systran, Logos, GlobalLink. Domain-specific machine-aided translation systems.
• 1990-present: Corpus-based methods: IBM’s Candide, Japanese ‘example-based’ translation. Speech-to-speech translation: Verbmobil, Janus. ‘Pure’ to practical MT for embedded applications: cross-lingual IR

Page 10:

MT Approaches: Different levels of meaning transfer

[Figure: MT pyramid. The vertical axis is Depth of Analysis. On the source side, Parsing leads to a Syntactic Structure and Semantic Interpretation leads up to the Interlingua; on the target side, Semantic Generation leads down to a Syntactic Structure and Syntactic Generation produces the Target. Direct MT links Source and Target at the base, Transfer-based MT links the two Syntactic Structures, and the Interlingua sits at the apex.]

Page 11:

Direct Machine Translation

• Words are replaced using a dictionary
  – Some amount of morphological processing
• Word reordering is limited
• Quality depends on the size of the dictionary, closeness of languages

Spanish : ajá quiero usar mi tarjeta de crédito
English : yeah I wanna use my credit card
Alignment : 1 3 4 5 7 0 6

English : I need to make a collect call
Japanese : 私は コレクト コールを かける 必要があります
Alignment : 1 5 0 3 0 2 4
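One way to read the alignment rows above is sketched here. It assumes that entry i gives the 1-based position, in the second sentence, of the word linked to word i of the first sentence, and that 0 means "no counterpart". That reading fits the Spanish-English pair but is not spelled out on the slide; the function names are illustrative.

```python
def linked_pairs(source, target, links):
    """Pair each source word with its linked target word (None when the link is 0)."""
    src, tgt = source.split(), target.split()
    if len(src) != len(links):
        raise ValueError("expected one alignment entry per source word")
    return [(s, tgt[j - 1] if j else None) for s, j in zip(src, links)]


def unlinked_target_words(target, links):
    """Target words that no source word points to (they must be inserted)."""
    used = {j for j in links if j}
    return [w for i, w in enumerate(target.split(), 1) if i not in used]


spanish = "ajá quiero usar mi tarjeta de crédito"
english = "yeah I wanna use my credit card"
links = [1, 3, 4, 5, 7, 0, 6]

for pair in linked_pairs(spanish, english, links):
    print(pair)
print("inserted on the English side:", unlinked_target_words(english, links))
```

Read this way, "de" has no English counterpart, "I" has to be inserted, and the only reordering is the local swap of "tarjeta ... crédito" into "credit card", which matches the slide's point that reordering in direct MT is limited.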

Page 12:

Example-based MT

Translation-by-analogy:

a. A collection of source/target text pairs

b. A matching metric

c. A word- or phrase-level alignment

d. A method for recombination

ATR EBMT System (E. Sumita, H. Iida, 1991); CMU Pangloss EBMT (R. Brown, 1996)

[Figure: Source → MATCHING (analysis) → ALIGNMENT (transfer) → RECOMBINATION (generation) → Target. An exact match corresponds to direct translation.]

Page 13:

Example run of EBMT

English-Japanese Examples in the Corpus:

1. He buys a notebook → Kare wa noto o kau

2. I read a book on international politics → Watashi wa kokusai seiji nitsuite kakareta hon o yomu

Translation Input: He buys a book on international politics

Translation Output: Kare wa kokusai seiji nitsuite kakareta hon o kau

• Challenge: Finding a good matching metric
  – He bought a notebook
  – A book was bought
  – I read a book on world politics
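The example run can be replayed with a very small sketch. It assumes each corpus example comes with one hand-written fragment alignment (the third tuple element below, which is not on the slide) and uses exact string matching as the matching metric. It is a toy illustration, not the ATR or Pangloss method.

```python
# (source, target, (aligned source fragment, aligned target fragment)).
# The fragment alignments are hand-written assumptions for this sketch.
EXAMPLES = [
    ("He buys a notebook", "Kare wa noto o kau", ("a notebook", "noto")),
    ("I read a book on international politics",
     "Watashi wa kokusai seiji nitsuite kakareta hon o yomu",
     ("a book on international politics",
      "kokusai seiji nitsuite kakareta hon")),
]


def translate(sentence):
    """Exact match first; otherwise reuse one example as a template with a slot."""
    fragments = {src: tgt for _, _, (src, tgt) in EXAMPLES}
    for source, target, _ in EXAMPLES:
        if source == sentence:                      # exact match: direct lookup
            return target
    for source, target, (src_frag, tgt_frag) in EXAMPLES:
        prefix, _, suffix = source.partition(src_frag)
        fits = (sentence.startswith(prefix) and sentence.endswith(suffix)
                and len(sentence) >= len(prefix) + len(suffix))
        if not fits:
            continue
        filler = sentence[len(prefix):len(sentence) - len(suffix)]
        if filler in fragments:                     # recombination step
            return target.replace(tgt_frag, fragments[filler], 1)
    return None                                     # nothing in the corpus matches


print(translate("He buys a book on international politics"))
print(translate("He bought a notebook"))
```

The second call returns `None` because exact string matching cannot relate "bought" to "buys". That is the matching-metric challenge the slide lists.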

Page 14:

NLP Pipeline: Beads on a String

Tokenization → Sentence Segmentation → Part-of-speech tagging → Named Entity Detection → Noun/Verb Chunking → Syntactic Parsing → Semantic Role Labeling → Word Sense Disambiguation → Co-reference resolution

Page 15:

NLP Pipeline: Sentence Segmentation

Input:

U.S. President lives in Washington D.C. He will travel to Florida this week.

Output:

U.S. President lives in Washington D.C.

He will travel to Florida this week.
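The example is harder than it looks: "U.S." and "D.C." both end in a period, but only one of them ends a sentence. A minimal rule-based sketch follows. The abbreviation list and the list of likely sentence-opening words are hypothetical and chosen just for this example; real segmenters typically learn such cues from data.

```python
ABBREVIATIONS = {"U.S.", "D.C."}                              # hypothetical list
SENTENCE_STARTERS = {"He", "She", "It", "They", "We", "The"}  # hypothetical list


def split_sentences(text):
    """Split on sentence-final punctuation, treating abbreviations with care."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith((".", "?", "!")):
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None:
            boundary = True                        # end of text
        elif tok in ABBREVIATIONS:
            boundary = nxt in SENTENCE_STARTERS    # "D.C. He" yes, "U.S. President" no
        else:
            boundary = nxt[0].isupper()
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("U.S. President lives in Washington D.C. "
        "He will travel to Florida this week.")
for sentence in split_sentences(text):
    print(sentence)
```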

Page 16:

NLP Pipeline: Part-of-speech Tagging

He will travel to Florida this week .

He/PRP will/MD travel/VB to/TO Florida/NNP this/DT week/NN ./.

Page 17:

NLP Pipeline: Named Entity Detection

President Bush will travel to Florida on February 20 2007 to meet with the CEO of AT&T

[President Bush]PERSON will travel to [Florida]PLACE on [February 20 2007]DATE to meet with the [CEO]JOB of [AT&T]ORG
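A pattern-based sketch of this step is given below. The patterns are hypothetical stand-ins written for this one sentence (practical detectors rely on trained models or much larger gazetteers), and the `$LABEL` placeholders follow the notation used on the parsing slide.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# (pattern, placeholder) pairs; all hypothetical and tuned to the slide example.
PATTERNS = [
    (r"President \w+", "$PERSON"),
    (rf"(?:{MONTHS}) \d{{1,2}} \d{{4}}", "$DATE"),
    (r"\bFlorida\b", "$PLACE"),
    (r"\bCEO\b", "$JOB"),
    (r"AT&T", "$ORG"),
]


def detect_entities(sentence):
    """Replace each detected entity with its class placeholder; also list the finds."""
    found = []
    for pattern, label in PATTERNS:
        def replace(match, label=label):
            found.append((match.group(0), label))
            return label
        sentence = re.sub(pattern, replace, sentence)
    return sentence, found


tagged, entities = detect_entities(
    "President Bush will travel to Florida on February 20 2007 "
    "to meet with the CEO of AT&T")
print(tagged)
print(entities)
```

Replacing entities with class placeholders lets later stages, such as parsing, treat "President Bush" or "February 20 2007" as a single unit.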

Page 18:

NLP Pipeline: Noun/Verb Chunking

President Bush will travel to Florida on February 20 2007 to meet with the CEO of AT&T

Page 19:

NLP Pipeline: Syntactic Parsing

$PERSON will travel to $PLACE on $DATE to meet with the $JOB of $ORG

Parse tree (children indented under their head):

will travel
  $PERSON
  to
    $PLACE
  on
    $DATE
  to meet
    with
      $JOB
        the
        of
          $ORG

Page 20:

NLP Pipeline: Semantic Role Labeling

Roles assigned to the arguments of "will travel":

will travel
  $PERSON : ARG0
  to $PLACE : ARGM-loc
  on $DATE : ARGM-tmp

Page 21:

NLP Pipeline: Word Sense Disambiguation

The man went to the bank to get some money

The man went to the bank to get some flowers
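The two sentences differ only in their last word, and that word decides which "bank" is meant. The sketch below picks a sense by counting overlap between the sentence and a small signature of cue words per sense. This is a much simplified relative of dictionary-overlap methods such as Lesk's, and the signatures are hypothetical.

```python
# Hypothetical cue words for each sense of "bank".
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "account"},
        "river bank": {"river", "water", "flowers", "fishing"},
    }
}


def disambiguate(word, sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(sentence.lower().split())
    scores = {sense: len(cues & context)
              for sense, cues in SENSES[word].items()}
    return max(scores, key=scores.get)   # ties go to the first listed sense


print(disambiguate("bank", "The man went to the bank to get some money"))
print(disambiguate("bank", "The man went to the bank to get some flowers"))
```

For translation the choice matters because the two senses usually map to different words in the target language.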

Page 22:

NLP Pipeline: Co-reference resolution

The U.S. President lives in Washington D.C.

He will return to the capital this week .

Co-referring mentions: "The U.S. President" and "He"; "Washington D.C." and "the capital".

Page 23:

Syntactic Transfer-based Machine Translation

• Direct and Example-based approaches
  – Two ends of a spectrum
  – Recombination of fragments for better coverage.
• What if the matching/transfer is done at the syntactic parse level?
• Three Steps
  – Parse: Syntactic parse of the source language sentence
    • Hierarchical representation of a sentence
  – Transfer: Rules to transform source parse tree into target parse tree
    • Subject-Verb-Object → Subject-Object-Verb
  – Generation: Regenerating target language sentence from parse tree
    • Morphology of the target language
• Tree structure provides better matching and longer-distance transformations than is possible in string-based EBMT.
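The transfer step can be sketched on a toy constituency tree. The `Node` class, the single SVO → SOV rule and the example sentence are all assumptions made for illustration. Lexical transfer and target-language morphology are left out, so the output only shows the reordering.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None           # set on leaves only


def transfer_svo_to_sov(node):
    """Transfer rule: inside every VP, move the object NP in front of the verb."""
    children = [transfer_svo_to_sov(child) for child in node.children]
    if node.label == "VP" and [c.label for c in children] == ["V", "NP"]:
        children = [children[1], children[0]]
    return Node(node.label, children, node.word)


def generate(node):
    """Generation: read the leaves of the (transferred) tree left to right."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in generate(child)]


# Parse step, done by hand here: [S [NP John] [VP [V eats] [NP bananas]]]
source_tree = Node("S", [
    Node("NP", word="John"),
    Node("VP", [Node("V", word="eats"), Node("NP", word="bananas")]),
])

print(" ".join(generate(source_tree)))
print(" ".join(generate(transfer_svo_to_sov(source_tree))))
```

Because the rule operates on tree nodes, the object NP moves as a unit however many words it contains. A purely string-based method cannot express that as easily.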

Page 24:

Examples of SynTran-MT

Spanish parse (children indented under their head):

quiero
  ajá
  usar
    tarjeta
      mi
      de
        crédito

English parse:

wanna
  yeah
  I
  use
    card
      my
      credit

• Mostly parallel parse structures

• Might have to insert words: pronouns, morphological particles

Page 25:

Example of SynTran MT -2

• Pros:
  – Allows for structure transfer
  – Re-orderings are typically restricted to the parent-child nodes.
• Cons:
  – Transfer rules are for each language pair (N² sets of rules)
  – Hard to reuse rules when one of the languages is changed

English parse (children indented under their head):

need
  I
  make
    to
    call
      a
      collect

Japanese parse:

必要があります (need)
  私は (I)
  かける (make)
    コールを (call)
      コレクト (collect)

Page 26:

Interlingua-based Machine Translation

• Syntactic transfer-based MT
  – Couples the syntax of the two languages
• What if we abstract away the syntax?
  – All that remains is meaning
  – Meaning is the same across languages
  – Simplicity: Only N components needed to translate among N languages
• Two “small” problems:
  – What is meaning?
  – How do we represent meaning?

[Figure: the MT pyramid (Direct MT, Transfer-based MT, Interlingua; Parsing, Semantic Interpretation, Semantic Generation, Syntactic Generation), shown next to English, Spanish and Japanese analyzers feeding an interlingual representation, from which English, Spanish and Japanese generators produce output.]
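The scaling argument can be made concrete with a little arithmetic. The count below treats each translation direction as its own rule set, giving N(N-1), which grows like the N² on the earlier slide. It counts one analyzer plus one generator per language, giving 2N, which is linear like the N on this slide. Both counting conventions are assumptions; the slides only give the orders of growth.

```python
def transfer_rule_sets(n_languages):
    """One rule set per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """One analyzer and one generator per language."""
    return 2 * n_languages


for n in (3, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_modules(n))
```

With three languages, as in the figure, the two designs need the same number of components. The gap opens quickly as languages are added.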

Page 27:

Example of Interlingua Machine Translation

Need(I, e1); e1: Make(I, e2); collect_call(e2)

English parse (children indented under their head):

need
  I
  make
    to
    call
      a
      collect

Interlingua representation:

[ Event: Need
  Tense: present
  Agent: I
  Theme: [ Event: Make
           Tense: Infinitive
           Agent: I
           Theme: [ call
                    attributes: collect
                    Definiteness: indef ] ] ]

Japanese parse:

必要があります (need)
  私は (I)
  かける (make)
    コールを (call)
      コレクト (collect)
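To make the idea tangible, the representation on this slide can be written as a nested dictionary from which two independent generators produce English and Japanese. The key names (including `head` for the bare value "call"), the word lists and both generators are assumptions hard-wired to this one frame. Real generators would need grammar and morphology.

```python
# The slide's interlingua frame as a nested dict (key names are illustrative).
FRAME = {
    "event": "need", "tense": "present", "agent": "I",
    "theme": {
        "event": "make", "tense": "infinitive", "agent": "I",
        "theme": {"head": "call", "attributes": "collect",
                  "definiteness": "indef"},
    },
}

# Japanese words taken from the glosses on the slides.
JAPANESE = {"I": "私は", "collect": "コレクト", "call": "コールを",
            "make": "かける", "need": "必要があります"}


def generate_english(frame):
    """Head-initial order: agent, main verb, 'to' + embedded verb, object."""
    inner = frame["theme"]
    obj = inner["theme"]
    article = "a" if obj["definiteness"] == "indef" else "the"
    noun_phrase = f'{article} {obj["attributes"]} {obj["head"]}'
    return f'{frame["agent"]} {frame["event"]} to {inner["event"]} {noun_phrase}'


def generate_japanese(frame):
    """Head-final order: agent, object, embedded verb, main verb."""
    inner = frame["theme"]
    obj = inner["theme"]
    concepts = [frame["agent"], obj["attributes"], obj["head"],
                inner["event"], frame["event"]]
    return " ".join(JAPANESE[c] for c in concepts)


print(generate_english(FRAME))
print(generate_japanese(FRAME))
```

Neither generator looks at the other language. Each sees only the frame, which is what separates this design from transfer-based MT.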

Page 28:

Probabilistic Direct Machine Translation

• Starting in the early 1990s, full circle back to the code-breaking paradigm of machine translation
  – With a probabilistic twist
• What is it:
  – If you want to translate from English to Japanese
    • assume that the English text started out as a Japanese text
    • but went through a noisy channel which changed it into English
  – Goal is to recover the best (most probable) Japanese text
    • J* = argmax_J P(J|E) = argmax_J P(E|J) × P(J)
    • The second step is Bayes' rule; P(E) is the same for every candidate J, so it drops out of the argmax
    • P(E|J): translation faithfulness; P(J): translation fluency
• Popular approach due to:
  – Availability of large amounts of bilingual data (parallel data)
  – Large memory and high speed computers

[Figure: 私は コレクト コールを かける 必要があります → Noisy Channel/Encryption, P(E|J) → I need to make a collect call]
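The decision rule J* = argmax_J P(E|J) × P(J) can be sketched over a handful of candidates. The candidate list and every probability below are made-up numbers for illustration only. A real system estimates P(E|J) from parallel text and P(J) from Japanese text, and searches a far larger space.

```python
import math

# Candidate Japanese outputs for E = "I need to make a collect call".
# All probabilities are invented for this sketch.
CANDIDATES = {
    "私は コレクト コールを かける 必要があります": {"p_e_given_j": 0.20, "p_j": 0.0100},
    "私は 必要があります かける コレクト コールを": {"p_e_given_j": 0.20, "p_j": 0.0001},
    "私は コールを かける 必要があります":         {"p_e_given_j": 0.05, "p_j": 0.0200},
}


def score(probs):
    """log P(E|J) + log P(J); logs avoid underflow when probabilities get tiny."""
    return math.log(probs["p_e_given_j"]) + math.log(probs["p_j"])


def decode(candidates):
    """Return the candidate J with the highest P(E|J) * P(J)."""
    return max(candidates, key=lambda j: score(candidates[j]))


print(decode(CANDIDATES))
```

In this toy table the second candidate is as faithful as the first but loses on fluency (scrambled order), while the third is fluent but loses on faithfulness (it drops "collect"). The product rewards the candidate that is acceptable on both counts.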

Page 29:

Probabilistic Direct Machine Translation

• Learn pattern mappings (words and sequences of words) between pairs of sentences in the two languages.
  – Use the result of translation; not the process of translation
  – Infer a process that produces a similar result.

English : I need to make a collect call
Japanese : 私は コレクト コールを かける 必要があります
Alignment : 1 5 0 3 0 2 4

Spanish : ajá quiero usar mi tarjeta de crédito
English : yeah I wanna use my credit card
Alignment : 1 3 4 5 7 0 6
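How word mappings can be inferred from sentence pairs alone is sketched below with expectation-maximization in the style of IBM Model 1. The slide does not name a specific model, so this is a swapped-in standard technique, simplified by omitting the NULL word. The two-sentence corpus is hypothetical and far too small for real use.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t[(e, f)] = P(English word e | foreign word f) by EM."""
    pairs = [(f.split(), e.split()) for f, e in corpus]
    english_vocab = {e for _, es in pairs for e in es}
    t = defaultdict(lambda: 1.0 / len(english_vocab))    # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    delta = t[(e, f)] / norm     # expected share of the link e-f
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalise the expected counts per foreign word.
        t = defaultdict(float, {(e, f): c / total[f]
                                for (e, f), c in count.items()})
    return t


def best_translation(t, foreign_word):
    """Most probable English word for one foreign word."""
    options = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(options, key=options.get)


# Hypothetical toy corpus: (Spanish, English) sentence pairs.
CORPUS = [("mi tarjeta", "my card"), ("mi crédito", "my credit")]
table = train_model1(CORPUS)
for word in ("mi", "tarjeta", "crédito"):
    print(word, "->", best_translation(table, word))
```

No dictionary is given to the procedure. Because "mi" co-occurs with "my" in both pairs, "my" is explained by "mi", and that frees "tarjeta" and "crédito" to account for "card" and "credit".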

Page 30:

Applications of Machine Translation

Page 31:

Applications of Machine Translation

Sector: Consumer, Business, Government

Example Applications
• Call Center
• Web Search
• Call Center
• Collaborative Workspace
• Surveillance
• Information Dissemination

Translation needs
• Multilingual dialog
• Web page translation
• Localization
• Document translation
• E-mail/Chat translation
• Speech/text translation

AT&T MT prototypes
• Multilingual customer care
• Multilingual Instant Messaging
• Speech/Text Instant messaging

Page 32:

Multilingual Customer Care

Page 33:

Making Travel Arrangements using Multilingual Chat

Page 34:

Large Vocabulary Speech Recognition and Translation

Page 35:

Large Vocabulary Speech Recognition and Translation

Page 36:

Evaluation of Machine Translation

Page 37:

Machine Translation Evaluation

• What is a good translation?
  – Meaning-preserving and (social, cultural, conversational) context-appropriate rendering of the source language sentence
• Bilingual Human Annotators
  – Mark the output of a translation system on a 5 point scale.
  – Expensive!!
  – Too coarse to arrive at a feedback signal to improve the translation system
• Objective Metrics: Approximations to the real thing!!
  – Lexical Accuracy (LA)
    • Bag of words.
  – Translation Accuracy (TA)
    • Based on string alignment
  – Application-driven evaluation
    • “How May I Help You?”
    • Spoken dialog for call routing
    • Classification based on salient phrase detection
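The two objective metrics can be sketched as follows. The slide gives only the ideas ("bag of words", "string alignment"), so the exact formulas here are assumptions: LA is taken as the fraction of reference words also found in the system output, ignoring order, and TA as one minus the word-level edit distance divided by the reference length.

```python
from collections import Counter


def lexical_accuracy(hypothesis, reference):
    """Bag-of-words overlap: shared word tokens / reference length (assumed formula)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    return sum((hyp & ref).values()) / sum(ref.values())


def edit_distance(a, b):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def translation_accuracy(hypothesis, reference):
    """1 - edit distance / reference length (assumed formula; can go below 0)."""
    hyp, ref = hypothesis.split(), reference.split()
    return 1 - edit_distance(hyp, ref) / len(ref)


reference = "I need to make a collect call"
hypothesis = "I need make to a collect call"
print(lexical_accuracy(hypothesis, reference))
print(translation_accuracy(hypothesis, reference))
```

The hypothesis contains all the right words in a slightly wrong order, so under these definitions it scores perfectly on LA but is penalized on TA. The two metrics therefore measure different things.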

Page 38:

Machine Translation Evaluation for call routing

Page 39:

Summary

• Fully Automatic Machine Translation in its full complexity is a very hard task
• Pragmatic approaches to Machine Translation have been successful
  – Limited domain/vocabulary
  – Human-assisted machine translation
  – Machine-assisted human translation
• A range of applications for “rough” machine translation
• Machine Translation will improve as we better understand how people communicate.

Page 40:

Spoken Language Translation

[Figure: speech-to-speech translation pipeline. ENGLISH SPEECH → FEATURE EXTRACTION → ACOUSTIC SEGMENT FEATURE VALUES (bit vectors) → RECOGNITION SEARCH → ENGLISH WORD LATTICE (competing paths over words such as "book", "please", "the", "this", "flies", "flight", "three") → MACHINE TRANSLATION → CHINESE TEXT (請預訂這班機) → PHONETIC ANALYSIS → PRONUNCIATION (qing3 yU4ding4 zhe4 ban1ji1) → AUDIO SYNTHESIS → CHINESE SPEECH.]