
Page 1: DKRLM: Discriminant Knowledge-Rich Language Modeling for Machine Translation

Alon Lavie

“Visionary Talk”, LTI Faculty Retreat

May 4, 2007

Page 2: Background: Search-based MT

• All state-of-the-art MT approaches work within a general search-based paradigm:
  – Translation Models “propose” pieces of translation for various sub-sentential segments
  – The Decoder puts these pieces together into complete translation hypotheses and searches for the best-scoring hypothesis
• (Target) Language Modeling is the most dominant source of information in scoring alternative translation hypotheses
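A minimal sketch of how such a decoder might score one complete hypothesis, using a log-linear combination of translation-model and target-LM features; the feature names, values and weights here are illustrative assumptions, not details from the talk.

```python
import math

def score_hypothesis(features, weights):
    """Log-linear hypothesis score: weighted sum of log feature values."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

# Hypothetical feature values for one complete translation hypothesis.
features = {"translation_model": 0.012, "language_model": 0.0004}
weights = {"translation_model": 1.0, "language_model": 1.2}

# The decoder would enumerate many such hypotheses and keep the best-scoring one.
print(score_hypothesis(features, weights))
```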

Page 3: The Problem

• Most MT systems use standard statistical LMs that come from speech recognition (SR), usually “as is”
  – SRI-LM toolkit, CMU/CU LM, SALM toolkit
  – Until recently, usually trigram models
• The Problem: these LMs are not good at discriminating between good and bad translations!
• How do we know?
  – Oracle experiments on n-best lists of MT output consistently show that far better translations are “hiding” in the n-best lists but are not being selected by our MT systems
  – Also true of our MEMT system… which led me to start thinking about this problem!

Page 4: The Problem

• Why do standard statistical LMs not work well for MT?
  – MT hypotheses are very different from SR hypotheses:
    • Speech: mostly correct word order, confusable homonyms
    • MT: garbled syntax and word order, wrong choices for some translated words
  – MT violates some basic underlying assumptions of statistical LMs:
    • Indirect Discrimination: better translations should have better LM scores, but LMs are not trained to directly discriminate between good and bad translations!
    • Fundamental Probability Estimation Problems: backoff “smoothing” for unseen n-grams is based on an assumption of training-data sparsity, but the majority of n-grams in MT hypotheses have not been seen because they are not grammatical (they really should have a zero probability!)

Page 5: The New Idea

• Rather than attempting to model the probabilities of unseen n-grams, we look at the problem differently:
  – Extract instances of lexical, syntactic and semantic features from each translation hypothesis
  – Determine whether these instances have been “seen before” (at least once) in a large monolingual corpus
• The Conjecture: more grammatical MT hypotheses are likely to contain higher proportions of feature instances that have been seen in a corpus of grammatical sentences.
• Goals:
  – Find the set of features that provides the best discrimination between good and bad translations
  – Learn how to combine these into an LM-like function for scoring alternative MT hypotheses

Page 6: Outline

• Knowledge-Rich Features
• Preliminary Experiments:

– Compare feature occurrence statistics for MT hypotheses versus human-produced (reference) translations

– Compare ranking of MT and “human” systems according to statistical LMs versus a function based on long n-gram occurrence statistics

– Compare n-grams and n-chains as features for binary classification “human versus MT”

• Research Challenges
• New Connections with IR

Page 7: Knowledge-Rich Features

• Lexical Features:
  – “long” n-gram sequences (4 words and up)
• Syntactic/Semantic Features:
  – POS n-grams
  – Head-word chains
  – Specific types of dependencies:
    • Verbs and their dependents
    • Nouns and their dependents
    • “long-range” dependencies
  – Content-word co-occurrence statistics
• Mixtures of Lexical and Syntactic Features:
  – Abstracted versions of word n-gram sequences, where words are replaced by POS tags or named-entity tags
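A minimal sketch of such an abstracted n-gram, assuming POS and named-entity tags are available from external taggers; the particular replacement policy (abstract named entities to their NE tag and nouns to their POS tag) is just one illustrative choice.

```python
def abstract_ngram(words, pos_tags, ne_tags):
    """Mix surface words with derived tags: replace named entities by their NE tag
    and nouns by their POS tag; keep all other words as-is (illustrative policy)."""
    out = []
    for word, pos, ne in zip(words, pos_tags, ne_tags):
        if ne != "O":                 # word is part of a named entity
            out.append("$" + ne)
        elif pos.startswith("N"):     # abstract nouns to their POS tag
            out.append("POS=" + pos)
        else:
            out.append(word)
    return tuple(out)

# "John ate the apple" with hypothetical tagger output:
print(abstract_ngram(["John", "ate", "the", "apple"],
                     ["NNP", "VBD", "DT", "NN"],
                     ["PersonName", "O", "O", "O"]))
# ('$PersonName', 'ate', 'the', 'POS=NN')
```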

Page 8: Head-Word Chains (n-chains)

• Head-word chains are chains of syntactic dependency links (from dependents to their heads)
• Example sentence: “The boy ate the red apple”
• Bi-chains: [the→boy] [boy→ate] [the→apple] [red→apple] [apple→ate]
• Tri-chains: [the→boy→ate] [the→apple→ate] [red→apple→ate]
• Four-chains: none (for this example)!
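A minimal sketch of n-chain extraction for the example above, assuming the dependency parse is given as a hand-coded head-index array (in practice it would come from a dependency parser).

```python
def head_word_chains(words, heads, n):
    """All n-chains: paths of n words obtained by repeatedly following
    dependent -> head links; heads[i] is the head index of word i (-1 = root)."""
    chains = []
    for start in range(len(words)):
        chain, node = [], start
        while node != -1 and len(chain) < n:
            chain.append(words[node])
            node = heads[node]
        if len(chain) == n:
            chains.append(tuple(chain))
    return chains

# "The boy ate the red apple", parsed by hand:
words = ["the", "boy", "ate", "the", "red", "apple"]
heads = [1, 2, -1, 5, 5, 2]          # e.g. heads[0] = 1 means "the" -> "boy"

print(head_word_chains(words, heads, 2))   # the five bi-chains from the slide
print(head_word_chains(words, heads, 3))   # the three tri-chains
print(head_word_chains(words, heads, 4))   # [] -- no four-chains for this sentence
```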

Page 9: Specific Types of Dependencies

• Some types of syntactic dependencies may be more important than others for MT

• Consider specific types of dependencies that are most important for syntactic and semantic structure:
  – Dependencies involving content words
  – Long-distance dependencies
  – Verb/argument dependencies: focus only on the bi-chains where the head is the verb: [boy→ate] and [apple→ate]
  – Noun/modifier dependencies: focus only on the bi-chains where the noun is the head: [the→boy] [the→apple] [red→apple]

Page 10: Feature Occurrence Statistics for MT Hypotheses

• The general idea: determine the fraction of feature instances that have been observed to occur in a large human-produced corpus
• For n-grams:
  – Extract all n-gram sequences of order n from the hypothesis
  – Look up whether each n-gram instance occurs in the corpus
  – Calculate the fraction of “found” n-grams for each order n
• For n-chains:
  – Parse the MT hypothesis (into a dependency structure)
  – Look up whether each n-chain instance occurs in a database of n-chains extracted from the large corpus
  – Calculate the fraction of “found” n-chains for each order n
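A minimal sketch of the n-gram half of this procedure, assuming the corpus’s n-grams have already been collected into an in-memory set; the actual experiments use a suffix-array index (SALM) rather than materializing all n-grams, and the toy corpus here is purely illustrative.

```python
def ngrams(tokens, n):
    """All n-gram sequences of order n in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def found_fraction(hyp_tokens, corpus_ngrams, n):
    """Fraction of the hypothesis's order-n n-grams that occur in the corpus."""
    grams = ngrams(hyp_tokens, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in corpus_ngrams) / len(grams)

# Toy "large human-produced corpus", indexed as a set of n-grams of orders 1-8.
corpus = [["the", "boy", "ate", "the", "red", "apple"]]
corpus_ngrams = {g for sent in corpus for n in range(1, 9) for g in ngrams(sent, n)}

hyp = ["the", "boy", "ate", "red", "the", "apple"]   # garbled word order
for n in range(1, 5):
    print(n, found_fraction(hyp, corpus_ngrams, n))
```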

Page 11: Content-word Co-occurrence Statistics

• Content-word co-occurrences: (unordered) pairs of content words (nouns, verbs, adjectives, adverbs) that co-occur in the same sentence

• Restricted version: subset of co-occurrences that are in a direct syntactic dependency within the sentence (subset of bi-chains)

• Idea:
  – Learn co-occurrence pair strengths from large monolingual corpora using statistical association measures: Dice, t-score, chi-square, likelihood ratio
  – Use average co-occurrence pair strength as a feature for scoring MT hypotheses
  – A weak way of capturing the syntax/semantics within sentences
• Preliminary experiments show that these features are somewhat effective in discriminating between MT output and human references
• Thanks, Ben Han! [MT Lab Project, 2005]
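A minimal sketch of one of these association measures, the Dice coefficient, computed from sentence-level co-occurrence counts; the toy corpus, and the assumption that sentences have already been reduced to their content words, are illustrative.

```python
from collections import Counter
from itertools import combinations

def dice_scores(sentences):
    """Dice coefficient for unordered content-word pairs co-occurring in a sentence:
    dice(a, b) = 2 * count(a, b) / (count(a) + count(b))."""
    word_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        uniq = sorted(set(words))            # count each word once per sentence
        word_counts.update(uniq)
        pair_counts.update(combinations(uniq, 2))
    return {pair: 2.0 * c / (word_counts[pair[0]] + word_counts[pair[1]])
            for pair, c in pair_counts.items()}

# Toy corpus, already reduced to content words (nouns, verbs, adjectives, adverbs).
corpus = [["boy", "ate", "apple"], ["boy", "ate", "bread"], ["apple", "fell"]]
scores = dice_scores(corpus)
print(scores[("ate", "boy")])     # 1.0: the pair always co-occurs
print(scores[("apple", "ate")])   # 0.5: "apple" also occurs without "ate"
```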

Page 12: Preliminary Experiments I

• Goal: compare n-gram occurrence statistics for MT hypotheses versus human-produced (reference) translations
• Setup:
  – Data: NIST Arabic-to-English MT-Eval 2003 (about 1,000 sentences)
  – Output from three strong MT systems and four reference translations
  – Used the Suffix-Array LM (SALM) toolkit [Zhang and Vogel 2006], modified to return, for each string call, the length of the longest suffix of the string that occurs in the corpus
  – SALM used to index a subset of 600 million words from the Gigaword corpus
  – Searched for all n-gram sequences of length eight extracted from the translation
• Thanks to Greg Hanneman!

Page 13: Preliminary Experiments I

Fraction of n-grams found in the corpus, by order:

Order      MT Translations   Reference Translations   Ref/MT Ratio   Margin
8-grams         2.1%                 2.9%                1.38         +38%
7-grams         4.9%                 6.4%                1.31         +31%
6-grams        11.4%                14.1%                1.24         +24%
5-grams        25.2%                29.1%                1.15         +15%
4-grams        48.4%                52.2%                1.08          +8%
3-grams        75.9%                77.7%                1.02          +2%
2-grams        94.8%                94.4%                0.995        -0.5%
1-grams        99.3%                98.2%                0.989        -1.1%

Page 14: Preliminary Experiments II

• Goal: Compare ranking of MT and “human” systems according to statistical LMs versus a function based on long n-gram occurrence statistics
• Same data setup as in the first experiment
• Calculate sentence scores as the average per-word LM score
• System score is the average over all of its sentence scores
• Score each system with three different LMs:
  – SRI-LM trigram LM trained on 260 million words
  – SALM suffix-array LM trained on 600 million words
  – A new function that assigns exponentially more weight to longer n-gram “hits”:

        score = (1/n) · Σ_{i=1..n} 3^(ord(i) − 8)

    where n is the number of words in the hypothesis and ord(i) is the order (up to 8) of the longest n-gram ending at word i that is found in the corpus
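A minimal sketch of this occurrence-based exponential score as reconstructed above: for each word position, find the order of the longest n-gram ending there (capped at 8) that occurs in the corpus, and average 3^(order − 8) over all positions. A plain set of corpus n-grams stands in here for the SALM suffix-array index used in the experiment.

```python
def longest_match_order(tokens, i, corpus_ngrams, max_n=8):
    """Order of the longest n-gram ending at position i that occurs in the corpus."""
    best = 0
    for n in range(1, min(max_n, i + 1) + 1):
        if tuple(tokens[i - n + 1:i + 1]) in corpus_ngrams:
            best = n
    return best

def occurrence_exp_score(tokens, corpus_ngrams, max_n=8, base=3.0):
    """Average of base**(ord(i) - max_n) over all word positions, so longer
    matches count exponentially more (score = 1.0 when every 8-gram is found)."""
    if not tokens:
        return 0.0
    total = sum(base ** (longest_match_order(tokens, i, corpus_ngrams, max_n) - max_n)
                for i in range(len(tokens)))
    return total / len(tokens)

# Toy usage; a real index would hold all corpus n-grams up to order 8.
corpus_ngrams = {("the",), ("boy",), ("ate",), ("the", "boy"), ("boy", "ate")}
print(occurrence_exp_score(["the", "boy", "ate"], corpus_ngrams))
```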

Page 15: Preliminary Experiments II

System          SRI-LM trigram LM   SALM 8-gram LM   Occurrence-based exp score
Ref ahe             -2.23 (1)          -5.59 (1)          0.01059 (1)
Ref ahi             -2.28 (4)          -5.87 (4)          0.00957 (2)
Ref ahd             -2.31 (5)          -5.99 (5)          0.00926 (3)
Ref ahg             -2.33 (6)          -6.04 (7)          0.00914 (4)
MT system 1         -2.27 (3)          -5.77 (3)          0.00895 (5)
MT system 2         -2.24 (2)          -5.75 (2)          0.00855 (6)
MT system 3         -2.39 (7)          -6.01 (6)          0.00719 (7)

(Each cell shows the system’s average score, with its rank under that scoring function in parentheses.)

Page 16: Preliminary Experiments III

• Goal: Directly discriminate between MT and human translations using a binary SVM classifier trained on n-gram versus n-chain occurrence statistics

• Setup:
  – Data: NIST Chinese-to-English MT-Eval 2003 (919 sentences)
  – Four MT system outputs and four human reference translations
  – N-chain database created using SALM by extracting all n-chains from a dependency-parsed version of the English Europarl corpus (600K sentences)
  – Train an SVM classifier on 400 sentences from two MT systems and two human “systems”
  – Test classification accuracy on 200 unseen test sentences from the same MT and human systems
  – Features for the SVM: n-gram “hit” fractions (all n) vs. n-chain fractions
• Thanks to Vamshi Ambati!
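A minimal sketch of this classification setup, assuming the per-sentence hit-fraction feature vectors have already been computed (e.g. with the earlier n-gram sketch); scikit-learn’s SVC is used here only as a stand-in for whatever SVM implementation the experiment actually used, and all numbers are invented.

```python
import numpy as np
from sklearn.svm import SVC

def train_mt_vs_human_classifier(features, labels):
    """Binary SVM over hit-fraction vectors (one row per sentence, one column per
    n-gram or n-chain order); labels: 1 = human reference, 0 = MT output."""
    clf = SVC(kernel="rbf")
    clf.fit(features, labels)
    return clf

# Hypothetical hit-fraction vectors (orders 1..8) for a few training sentences.
X_train = np.array([
    [0.99, 0.95, 0.78, 0.52, 0.29, 0.14, 0.06, 0.03],   # human reference
    [0.99, 0.94, 0.75, 0.48, 0.25, 0.11, 0.05, 0.02],   # MT output
])
y_train = np.array([1, 0])

clf = train_mt_vs_human_classifier(X_train, y_train)
print(clf.predict([[0.99, 0.95, 0.77, 0.50, 0.28, 0.13, 0.06, 0.03]]))
```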

Page 17: Preliminary Experiments III

• Results:
  – Experiment 1:
    • N-gram classifier: 49% accuracy
    • N-chain classifier: 69% accuracy
  – Experiment 2:
    • N-gram classifier: 52% accuracy
    • N-chain classifier: 63% accuracy
• Observations:
  – Mixing both n-grams and n-chains did not improve classification accuracy
  – Features include both high- and low-order instances (did not try with only high-order ones)
  – The n-chain database is from a different domain than the test data, and is not a very large corpus

Page 18: Preliminary Conclusions

• Statistical LMs do not discriminate well between MT hypotheses and human reference translations; they are also poor at discriminating between good and bad MT hypotheses
• Occurrence statistics for long n-grams and n-chains differ significantly between MT hypotheses and human reference translations
• These can potentially be useful as discriminant features for identifying better (more grammatical and fluent) translations

Page 19: Research Challenges

• Develop Infrastructure for Computing with Knowledge-Rich Features:
  – Scale up to querying against much larger monolingual corpora (terabytes and up)
  – Parsing and annotation of such vast corpora
• Explore more complex features
• Find the set of features that are most discriminant
• Develop Methodologies for training LM-like discriminant scoring functions:
  – SVM and/or other classifiers on MT versus human
  – SVM and/or other classifiers on MT versus MT “Oracle”
  – Direct regression against human judgments
  – Parameter optimization for maximizing automatic MT metric scores (BLEU, METEOR, etc.)
• “Incremental” features that can be used during decoding versus the full set of features for n-best list reranking

Page 20: New Connections with IR

• The “occurrence-based” formulation of the LM problem transforms it from a counting and estimation problem into an IR-like querying problem:
  – To be effective, we think this may require querying against extremely large volumes of monolingual text, and structured versions of such text; can we do this against local snapshots of the entire web?
  – The SALM suffix-array infrastructure can currently handle up to about the size of the Gigaword corpus (within 16 GB of memory)
  – Can IR engines such as LEMUR/Indri be adapted to the task?

Page 21: New Connections with IR

• Challenges this type of task imposes on IR (insights from Jamie Callan):
  – The larger issue: IR search engines as query interfaces to vast collections of structured text:
    • Building an index suitable for very fast lookups of “n-grams” that satisfy certain properties
    • The n-gram sequences might be a mix of surface features and derived features based on text annotations, e.g., $PersonName or POS=N
  – Specific Challenges:
    • How to build such indexes for fast access?
    • What does the query language look like?
    • How to deal with memory/disk vs. speed trade-off issues?
• Can we get LTI students to do this kind of research?

Page 22: Final Words…

• A novel and exciting new research direction; there are at least one or two PhD theses hiding in here…

• Submitted as a grant proposal to NSF last December (jointly with Rebecca Hwa from Pitt)

• Influences: Some of these ideas were influenced by Jaime’s CBMT work, and by Rebecca’s work on using syntactic features for automatic MT evaluation metrics

• Acknowledgments:
  – Thanks to Joy Zhang and Stephan Vogel for making the SALM toolkit available to us
  – Thanks to Rebecca Hwa and to my students Ben Han, Greg Hanneman and Vamshi Ambati for preliminary work on these ideas.