EBMT: Example-Based Machine Translation, as used in the Pangloss system at Carnegie Mellon University
Post on 20-Dec-2015
EBMT 1
EBMT
Example-Based Machine Translation as used in the Pangloss system at Carnegie Mellon University
Dave Inman
EBMT 2
Outline
• EBMT in outline
• What data do we need?
• How do we create a lexicon?
• Indexing the corpus
• Finding chunks to translate
• Matching a chunk against the target
• Quality of translation
• Speed of translation
• Good and bad points
• Conclusions
EBMT 3
EBMT in outline - Corpus
Corpus
S1: The cat eats a fish. → Le chat mange un poisson.
S2: A dog eats a cat. → Un chien mange un chat.
…
S99,999,999: …
Index
the: S1
cat: S1
eats: S1
…
dog: S2
EBMT 4
EBMT in outline – find chunks
A source language sentence is input:
The cat eats a dog.
Chunks of this sentence are matched against the corpus.
The cat: S1
The cat eats: S1
The cat eats a: S1
a dog: S2
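The chunk lookup above can be sketched with a toy inverted index. The corpus, index layout, and matching logic here are illustrative only, not the Pangloss implementation:

```python
# Toy parallel corpus of (source, target) sentence pairs.
corpus = [
    ("the cat eats a fish", "le chat mange un poisson"),  # S1
    ("a dog eats a cat", "un chien mange un chat"),       # S2
]

# Inverted index: word -> set of sentence ids containing it.
index = {}
for sid, (src, _) in enumerate(corpus, start=1):
    for word in src.split():
        index.setdefault(word, set()).add(sid)

def find_chunks(sentence):
    """Return chunks (>= 2 adjacent words) of the input that occur
    contiguously in some corpus sentence, with the matching ids."""
    words = sentence.lower().rstrip(".").split()
    chunks = []
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            chunk = words[i:j]
            # Candidate sentences contain every word in the chunk.
            cands = set.intersection(*(index.get(w, set()) for w in chunk))
            # Keep only sentences where the chunk appears contiguously
            # (simplified here to a substring test).
            hits = sorted(s for s in cands
                          if " ".join(chunk) in corpus[s - 1][0])
            if hits:
                chunks.append((" ".join(chunk), hits))
    return chunks

print(find_chunks("The cat eats a dog."))
```

Run on the slide's example input, this finds "The cat" in S1 and "a dog" in S2, as shown above.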
EBMT 5
How does EBMT work in outline - Corpus
1. The target language sentences are retrieved for each chunk.
The cat eats: S1
Corpus
S1: The cat eats a fish. → Le chat mange un poisson.
2. The chunks are aligned with target sentences (hard!).
The cat eats → Le chat mange
EBMT 6
How does EBMT work in outline - Corpus
Chunks are scored to find good matches…
The cat eats → Le chat mange (score 78%)
The cat eats → Le chat dorme (score 43%)
…
a dog → un chien (score 67%)
a dog → le chien (score 56%)
a dog → un arbre (score 22%)
The best translated chunks are put together to make the final translation.
The cat eats → Le chat mange
a dog → un chien
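The assembly step can be sketched as picking the highest-scoring candidate per chunk and concatenating. The data structure and scores below just restate the toy values from the slide:

```python
# Candidate translations per source chunk, with their match scores
# (values taken from the slide's example; structure is illustrative).
candidates = {
    "The cat eats": [("Le chat mange", 0.78), ("Le chat dorme", 0.43)],
    "a dog": [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

# Keep the best-scoring translation for each chunk, in input order.
translation = " ".join(
    max(options, key=lambda pair: pair[1])[0]
    for options in candidates.values()
)
print(translation)  # Le chat mange un chien
```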
EBMT 7
What data do we need?
1. A large corpus of parallel sentences… if possible in the same domain as the translations.
2. A bilingual dictionary… but we can induce this from the corpus.
3. A target language root/synonym list… so we can see the similarity between words and inflected forms (e.g. verbs).
4. Classes of words easily translated… such as numbers, towns, weekdays.
EBMT 8
How to create a lexicon.
1.Take each sentence pair in the corpus.
2.For each word in the source sentence, add each word in the target sentence and increment the frequency count.
3.Repeat for as many sentences as possible.
4.Use a threshold to get possible alternative translations.
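Steps 1–4 can be sketched as a simple co-occurrence counter. Function names and the threshold handling are illustrative only:

```python
from collections import defaultdict

def build_lexicon(pairs):
    """Steps 1-3: for each source word, count every target word
    it co-occurs with across the sentence pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, tgt in pairs:
        for s in src.lower().rstrip(".").split():
            for t in tgt.lower().rstrip(".").split():
                counts[s][t] += 1
    return counts

def translations(counts, word, threshold):
    """Step 4: keep target words whose count reaches the threshold."""
    return {t: n for t, n in counts[word].items() if n >= threshold}

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]
lex = build_lexicon(corpus)
print(lex["cat"])  # 'chat' co-occurs with 'cat' in both sentences
print(translations(lex, "cat", threshold=2))
```

With only two sentences the threshold still lets common words like "un" and "mange" through; as the slides show next, after many sentences the genuine translation dominates the counts.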
EBMT 9
How to create a lexicon… example
The cat eats a fish. → Le chat mange un poisson.

the: le,1  chat,1  mange,1  un,1  poisson,1
cat: le,1  chat,1  mange,1  un,1  poisson,1
eats: le,1  chat,1  mange,1  un,1  poisson,1
a: le,1  chat,1  mange,1  un,1  poisson,1
fish: le,1  chat,1  mange,1  un,1  poisson,1
EBMT 10
Create a lexicon…after many sentences
the:
le, 956
la, 925
un, 235
------ Threshold ----------
chat, 47
mange, 33
poisson, 28
....
arbre, 18
EBMT 11
Create a lexicon…after many sentences
cat:
chat, 963
------ Threshold ----------
le, 604
la, 485
un, 305
mange, 33
poisson, 28
....
arbre, 47
EBMT 12
Indexing the corpus.
For speed the corpus is indexed on the source language sentences.
Each word in each source language sentence is stored with info about the target sentence.
Words can be added to the corpus and the index easily updated.
Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
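Replacing common word classes with tokens might look like the sketch below. The class names and token spellings are invented for illustration:

```python
import re

# Invented token classes for illustration; a real system would cover
# more classes (numbers, towns, weekdays, dates, ...).
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

def tokenize(sentence):
    """Replace numbers and weekday names with class tokens
    before indexing."""
    tokens = []
    for w in sentence.lower().rstrip(".").split():
        if re.fullmatch(r"\d+", w):
            tokens.append("<NUMBER>")
        elif w in WEEKDAYS:
            tokens.append("<WEEKDAY>")
        else:
            tokens.append(w)
    return tokens

print(tokenize("Flight 714 leaves Monday."))
# ['flight', '<NUMBER>', 'leaves', '<WEEKDAY>']
```

The payoff is that "Flight 714 leaves Monday" and "Flight 22 leaves Friday" index to the same token sequence, so a chunk from one matches the other.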
EBMT 13
Finding chunks to translate.
Look up each word in the source sentence in the index.
Look for chunks in the source sentence (at least two adjacent words) which match the corpus.
Select the last few matches against the corpus (translation memory).
Pangloss uses the last 5 matches for any chunk.
EBMT 14
Matching a chunk against the target.
For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
Try to find the translation for the source chunk from these sentences.
This is the hard bit!
Look for the minimum and maximum segments in the target sentences which could correspond with the source chunk. Score each of these segments.
EBMT 15
Scoring a segment…
Unmatched Words : Higher priority is given to sentences containing all the words in an input chunk.
Noise : Higher priority is given to corpus sentences which have fewer extra words.
Order : Higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
Morphology : Higher priority is given to sentences in which words match exactly rather than against morphological variants.
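One way to combine these four criteria into a single segment score is sketched below. The weights and formulas are invented for illustration and are not Pangloss's actual scoring:

```python
def score_segment(chunk_words, segment_words, morph_match=None):
    """Score a candidate target segment against a source chunk.

    chunk_words   -- expected target words for the source chunk
                     (e.g. via the induced lexicon); must be non-empty
    segment_words -- words of the candidate target segment
    morph_match   -- fraction of matches that are exact rather than
                     morphological variants (1.0 = all exact)
    """
    matched = [w for w in chunk_words if w in segment_words]
    # Unmatched words: penalise segments missing expected words.
    unmatched = 1 - (len(chunk_words) - len(matched)) / len(chunk_words)
    # Noise: penalise segments with extra words.
    noise = len(matched) / len(segment_words)
    # Order: fraction of matched words appearing in the same
    # relative order as in the chunk.
    positions = [segment_words.index(w) for w in matched]
    in_order = sum(1 for a, b in zip(positions, positions[1:]) if a < b)
    order = (in_order / (len(positions) - 1)) if len(positions) > 1 else 1.0
    # Morphology: exact matches beat variant matches.
    morph = 1.0 if morph_match is None else morph_match
    return 0.4 * unmatched + 0.2 * noise + 0.2 * order + 0.2 * morph

# A perfect segment scores higher than one with a wrong word.
print(score_segment(["le", "chat", "mange"], ["le", "chat", "mange"]))
print(score_segment(["le", "chat", "mange"], ["le", "chat", "dorme"]))
```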
EBMT 16
Whole sentence match…
If we are lucky the whole sentence will be found in the corpus!
In that case the target sentence is used directly, without the alignment step.
Useful if translation memory is available (sentences recently translated are added to the corpus).
EBMT 17
Quality of translation.
Pangloss was tested against source sentences in a different domain to the examples in the corpus.
Pangloss “covered” about 70% of the sentences input.
This means a match was found against the corpus….
…but not necessarily a good match.
Others report that around 60% of the translations can be understood by a native speaker; Systran manages about 70%.
EBMT 18
Speed of translation.
Translations are much faster than with Systran.
Simple sentences are translated in seconds.
The corpus can be extended (translation memory) at about 6 MB per minute (on a Sun SPARCstation).
A 270 MB corpus takes 45 minutes to index.
EBMT 19
Good points.
Fast
Easy to add a new language pair
No need to analyse languages (much)
Can induce a dictionary from the corpus
Allows easy implementation of translation memory
Graceful degradation as size of corpus reduced
EBMT 20
Bad points.
Quality is second-best at present.
Depends on a large corpus of parallel, well-translated sentences.
About 30% of the source has no coverage (no translation found).
Word matching is brittle – we can see a match that Pangloss cannot.
The domain of the corpus should match the domain to be translated, so that chunks can be matched.
EBMT 21
Conclusions.
An alternative to Systran
Faster
Lower quality
Quick to develop for a new language pair – if a corpus exists!
Needs no linguistics
Might improve as bigger corpora become available?