EBMT: Example-Based Machine Translation, as used in the Pangloss system at Carnegie Mellon University
Post on 20-Dec-2015
EBMT 1
EBMT
Example-Based Machine Translation as used in the Pangloss system at Carnegie Mellon University
Dave Inman
EBMT 2
Outline
• EBMT in outline
• What data do we need?
• How do we create a lexicon?
• Indexing the corpus
• Finding chunks to translate
• Matching a chunk against the target
• Quality of translation
• Speed of translation
• Good and bad points
• Conclusions
EBMT 3
EBMT in outline - Corpus
Corpus
S1: The cat eats a fish. → Le chat mange un poisson.
S2: A dog eats a cat. → Un chien mange un chat.
…
S99,999,999: …
Index
the: S1
cat: S1
eats: S1
…
dog: S2
EBMT 4
EBMT in outline – find chunks
A source language sentence is input:
The cat eats a dog.
Chunks of this sentence are matched against the corpus.
The cat: S1
The cat eats: S1
The cat eats a: S1
a dog: S2
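The chunk lookup above can be sketched with a toy inverted index. The corpus, index layout, and matching logic here are illustrative only, not the Pangloss implementation:

```python
# Toy parallel corpus of (source, target) sentence pairs.
corpus = [
    ("the cat eats a fish", "le chat mange un poisson"),  # S1
    ("a dog eats a cat", "un chien mange un chat"),       # S2
]

# Inverted index: word -> set of sentence ids containing it.
index = {}
for sid, (src, _) in enumerate(corpus, start=1):
    for word in src.split():
        index.setdefault(word, set()).add(sid)

def find_chunks(sentence):
    """Return chunks (>= 2 adjacent words) of the input that occur
    contiguously in some corpus sentence, with the matching ids."""
    words = sentence.lower().rstrip(".").split()
    chunks = []
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            chunk = words[i:j]
            # Candidate sentences contain every word in the chunk.
            cands = set.intersection(*(index.get(w, set()) for w in chunk))
            # Keep only sentences where the chunk appears contiguously
            # (simplified here to a substring test).
            hits = sorted(s for s in cands
                          if " ".join(chunk) in corpus[s - 1][0])
            if hits:
                chunks.append((" ".join(chunk), hits))
    return chunks

print(find_chunks("The cat eats a dog."))
```

Run on the slide's example input, this finds "The cat" in S1 and "a dog" in S2, as shown above.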
EBMT 5
How does EBMT work in outline - Corpus
1. The target language sentences are retrieved for each chunk.
The cat eats: S1
Corpus
S1: The cat eats a fish. → Le chat mange un poisson.
2. The chunks are aligned with target sentences (hard!).
The cat eats → Le chat mange
EBMT 6
How does EBMT work in outline - Corpus
Chunks are scored to find good matches…
The cat eats → Le chat mange (score 78%)
The cat eats → Le chat dorme (score 43%)
…
a dog → un chien (score 67%)
a dog → le chien (score 56%)
a dog → un arbre (score 22%)
The best translated chunks are put together to make the final translation.
The cat eats → Le chat mange
a dog → un chien
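The assembly step can be sketched as picking the highest-scoring candidate per chunk and concatenating. The data structure and scores below just restate the toy values from the slide:

```python
# Candidate translations per source chunk, with their match scores
# (values taken from the slide's example; structure is illustrative).
candidates = {
    "The cat eats": [("Le chat mange", 0.78), ("Le chat dorme", 0.43)],
    "a dog": [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

# Keep the best-scoring translation for each chunk, in input order.
translation = " ".join(
    max(options, key=lambda pair: pair[1])[0]
    for options in candidates.values()
)
print(translation)  # Le chat mange un chien
```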
EBMT 7
What data do we need?
1. A large corpus of parallel sentences… if possible in the same domain as the translations.
2. A bilingual dictionary… but we can induce this from the corpus.
3. A target language root/synonym list… so we can see the similarity between words and inflected forms (e.g. verbs).
4. Classes of words easily translated… such as numbers, towns, weekdays.
EBMT 8
How to create a lexicon.
1.Take each sentence pair in the corpus.
2.For each word in the source sentence, add each word in the target sentence and increment the frequency count.
3.Repeat for as many sentences as possible.
4.Use a threshold to get possible alternative translations.
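Steps 1–4 can be sketched as a simple co-occurrence counter. Function names and the threshold handling are illustrative only:

```python
from collections import defaultdict

def build_lexicon(pairs):
    """Steps 1-3: for each source word, count every target word
    it co-occurs with across the sentence pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, tgt in pairs:
        for s in src.lower().rstrip(".").split():
            for t in tgt.lower().rstrip(".").split():
                counts[s][t] += 1
    return counts

def translations(counts, word, threshold):
    """Step 4: keep target words whose count reaches the threshold."""
    return {t: n for t, n in counts[word].items() if n >= threshold}

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]
lex = build_lexicon(corpus)
print(lex["cat"])  # 'chat' co-occurs with 'cat' in both sentences
print(translations(lex, "cat", threshold=2))
```

With only two sentences the threshold still lets common words like "un" and "mange" through; as the slides show next, after many sentences the genuine translation dominates the counts.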
EBMT 9
How to create a lexicon… example
The cat eats a fish. → Le chat mange un poisson.

the: le,1  chat,1  mange,1  un,1  poisson,1
cat: le,1  chat,1  mange,1  un,1  poisson,1
eats: le,1  chat,1  mange,1  un,1  poisson,1
a: le,1  chat,1  mange,1  un,1  poisson,1
fish: le,1  chat,1  mange,1  un,1  poisson,1
EBMT 10
Create a lexicon…after many sentences
the:
le, 956
la, 925
un, 235
------ Threshold ----------
chat, 47
mange, 33
poisson, 28
....
arbre, 18
EBMT 11
Create a lexicon…after many sentences
cat:
chat, 963
------ Threshold ----------
le, 604
la, 485
un, 305
mange, 33
poisson, 28
....
arbre, 47
EBMT 12
Indexing the corpus.
For speed the corpus is indexed on the source language sentences.
Each word in each source language sentence is stored with info about the target sentence.
Words can be added to the corpus and the index easily updated.
Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
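Replacing common word classes with tokens might look like the sketch below. The class names and token spellings are invented for illustration:

```python
import re

# Invented token classes for illustration; a real system would cover
# more classes (numbers, towns, weekdays, dates, ...).
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

def tokenize(sentence):
    """Replace numbers and weekday names with class tokens
    before indexing."""
    tokens = []
    for w in sentence.lower().rstrip(".").split():
        if re.fullmatch(r"\d+", w):
            tokens.append("<NUMBER>")
        elif w in WEEKDAYS:
            tokens.append("<WEEKDAY>")
        else:
            tokens.append(w)
    return tokens

print(tokenize("Flight 714 leaves Monday."))
# ['flight', '<NUMBER>', 'leaves', '<WEEKDAY>']
```

The payoff is that "Flight 714 leaves Monday" and "Flight 22 leaves Friday" index to the same token sequence, so a chunk from one matches the other.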
EBMT 13
Finding chunks to translate.
Look up each word in the source sentence in the index.
Look for chunks in the source sentence (at least two adjacent words) which match the corpus.
Select the last few matches against the corpus (translation memory).
Pangloss uses the last 5 matches for any chunk.
EBMT 14
Matching a chunk against the target.
For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
Try to find the translation for the source chunk from these sentences.
This is the hard bit!
Look for the minimum and maximum segments in the target sentences which could correspond with the source chunk. Score each of these segments.
EBMT 15
Scoring a segment…
Unmatched Words : Higher priority is given to sentences containing all the words in an input chunk.
Noise : Higher priority is given to corpus sentences which have fewer extra words.
Order : Higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
Morphology : Higher priority is given to sentences in which words match exactly rather than against morphological variants.
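One way to combine these four criteria into a single segment score is sketched below. The weights and formulas are invented for illustration and are not Pangloss's actual scoring:

```python
def score_segment(chunk_words, segment_words, morph_match=None):
    """Score a candidate target segment against a source chunk.

    chunk_words   -- expected target words for the source chunk
                     (e.g. via the induced lexicon); must be non-empty
    segment_words -- words of the candidate target segment
    morph_match   -- fraction of matches that are exact rather than
                     morphological variants (1.0 = all exact)
    """
    matched = [w for w in chunk_words if w in segment_words]
    # Unmatched words: penalise segments missing expected words.
    unmatched = 1 - (len(chunk_words) - len(matched)) / len(chunk_words)
    # Noise: penalise segments with extra words.
    noise = len(matched) / len(segment_words)
    # Order: fraction of matched words appearing in the same
    # relative order as in the chunk.
    positions = [segment_words.index(w) for w in matched]
    in_order = sum(1 for a, b in zip(positions, positions[1:]) if a < b)
    order = (in_order / (len(positions) - 1)) if len(positions) > 1 else 1.0
    # Morphology: exact matches beat variant matches.
    morph = 1.0 if morph_match is None else morph_match
    return 0.4 * unmatched + 0.2 * noise + 0.2 * order + 0.2 * morph

# A perfect segment scores higher than one with a wrong word.
print(score_segment(["le", "chat", "mange"], ["le", "chat", "mange"]))
print(score_segment(["le", "chat", "mange"], ["le", "chat", "dorme"]))
```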
EBMT 16
Whole sentence match…
If we are lucky the whole sentence will be found in the corpus!
In that case the target sentence is used directly, without the alignment step.
Useful if translation memory is available (sentences recently translated are added to the corpus).
EBMT 17
Quality of translation.
Pangloss was tested against source sentences in a different domain to the examples in the corpus.
Pangloss “covered” about 70% of the sentences input.
This means a match was found against the corpus….
…but not necessarily a good match.
Others report that around 60% of the translations can be understood by a native speaker; Systran manages about 70%.
EBMT 18
Speed of translation.
Translations are much faster than with Systran.
Simple sentences are translated in seconds.
The corpus can be extended (translation memory) at about 6 MB per minute (on a Sun SPARCstation).
A 270 MB corpus takes 45 minutes to index.
EBMT 19
Good points.
Fast
Easy to add a new language pair
No need to analyse languages (much)
Can induce a dictionary from the corpus
Allows easy implementation of translation memory
Graceful degradation as size of corpus reduced
EBMT 20
Bad points.
Quality is second-best at present.
Depends on a large corpus of parallel, well-translated sentences.
About 30% of the source has no coverage (no translation found).
Word matching is brittle – we can see a match that Pangloss cannot.
The domain of the corpus should match the domain to be translated, so that chunks can be matched.
EBMT 21
Conclusions.
An alternative to Systran
Faster
Lower quality
Quick to develop for a new language pair – if a corpus exists!
Needs no linguistics
Might improve as bigger corpora become available?