Novel Inference, Training and Decoding Methods over Translation Forests
Zhifei Li
Center for Language and Speech Processing
Computer Science Department
Johns Hopkins University
Advisor: Sanjeev Khudanpur
Co-advisor: Jason Eisner
Statistical Machine Translation Pipeline
• Bilingual Data → (generative training) → Translation Models
• Monolingual English → (generative training) → Language Models
• Held-out Bilingual Data → (discriminative training) → Optimal Weights
• Decoding: Unseen Sentences → Translation Outputs
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
Phrase pairs extracted: dianzi shang → the mat
Training a Translation Model
垫子 上 的 猫
dianzi shang de mao
a cat on the mat
Phrase pairs extracted: dianzi shang → the mat, mao → a cat
Training a Translation Model
垫子 上 的 猫
dianzi shang de
on the mat
Phrase pairs extracted: dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
Training a Translation Model
垫子 上 的 猫
de mao
a cat on
Phrase pairs extracted: de mao → a cat on, dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
Training a Translation Model
垫子 上 的 猫
de
on
Phrase pairs extracted: de → on, de mao → a cat on, dianzi shang de → on the mat, dianzi shang → the mat, mao → a cat
Decoding a Test Sentence
垫子 上 的 狗
dianzi shang de gou
the dog on the mat (via a derivation tree)
Translation is easy?
Translation Ambiguity
垫子 上 的 猫: dianzi shang de mao → a cat on the mat
zhongguo de shoudu → capital of China
wo de mao → my cat
zhifei de mao → zhifei ’s cat
dianzi shang de mao → Joshua (chart parser)
dianzi shang de mao → Joshua (chart parser) → candidate translations:
the mat a cat
a cat on the mat
a cat of the mat
the mat ’s a cat
dianzi shang de mao → Joshua (chart parser) → hypergraph
A hypergraph is a compact data structure to encode exponentially many trees.
(Figure: a hypergraph with its nodes, edges, and hyperedges labeled, contrasted with an FSA; also called a packed forest.)
A hypergraph is a compact data structure to encode exponentially many trees: structure sharing.
Why Hypergraphs?
• Contains a much larger hypothesis space than a k-best list
• General compact data structure; special cases include
  • finite state machine (e.g., lattice)
  • and/or graph
  • packed forest
• Can be used for speech, parsing, tree-based MT systems, and many more
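The packed-forest idea can be made concrete with a minimal sketch (class and field names here are hypothetical, not from Joshua): a node stores its alternative incoming hyperedges, and counting the encoded trees is a bottom-up sum of products.

```python
from dataclasses import dataclass, field

@dataclass
class Hyperedge:
    head: str        # node this hyperedge derives
    tails: list      # child nodes consumed (empty for leaves)
    weight: float    # rule/model weight on this hyperedge

@dataclass
class Node:
    label: str
    incoming: list = field(default_factory=list)  # alternative ways to derive this node

class Hypergraph:
    """Packed forest: exponentially many trees, polynomially many edges."""

    def __init__(self):
        self.nodes = {}

    def add_edge(self, head, tails, weight):
        for label in [head] + tails:
            self.nodes.setdefault(label, Node(label))
        self.nodes[head].incoming.append(Hyperedge(head, tails, weight))

    def num_trees(self, label):
        # counting semiring: sum over incoming edges, product over tail nodes
        node = self.nodes[label]
        if not node.incoming:
            return 1  # leaf node: exactly one (trivial) tree
        total = 0
        for edge in node.incoming:
            count = 1
            for tail in edge.tails:
                count *= self.num_trees(tail)
            total += count
        return total

hg = Hypergraph()
hg.add_edge("X", ["a", "b"], 1.0)   # two alternative ways to build X
hg.add_edge("X", ["a", "b"], 0.5)
hg.add_edge("S", ["X"], 1.0)        # two alternative ways to build S from X
hg.add_edge("S", ["X"], 0.8)
# 2 choices at S times 2 choices at X = 4 encoded trees
```

Structure sharing is exactly why the count multiplies: the two edges into S reuse the same node X rather than duplicating its subtrees.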
Weighted Hypergraph
(Figure: four derivations with weights p = 2, 1, 3, 2.)
Linear model: the score of a derivation of the foreign input is a dot product of weights and features.
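The linear-model equation did not survive the slide extraction; reconstructed under the talk's notation (Θ for weights, f for features, d for a derivation of foreign input x), it reads:

```latex
\operatorname{score}_{\Theta}(x, d)
\;=\; \Theta \cdot \mathbf{f}(x, d)
\;=\; \sum_{k} \Theta_k \, f_k(x, d)
```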
Probabilistic Hypergraph
Log-linear model: normalize the derivation weights globally, Z = 2 + 1 + 3 + 2 = 8, giving p = 2/8, 1/8, 3/8, 2/8.
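The normalizer Z need not be computed by enumerating derivations: the inside algorithm sums them in one bottom-up pass. A sketch with an assumed adjacency-list encoding (head node → list of (tail nodes, edge weight)), on a toy forest reproducing the slide's Z = 2 + 1 + 3 + 2 = 8:

```python
def inside(hg, v, memo=None):
    """Inside algorithm in the (+, *) semiring: total weight of all
    derivations of node v; at the root this is the partition function Z."""
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    edges = hg.get(v, [])
    if not edges:                 # leaf / axiom node
        memo[v] = 1.0
        return 1.0
    total = 0.0
    for tails, w in edges:
        prod = w
        for t in tails:
            prod *= inside(hg, t, memo)
        total += prod
    memo[v] = total
    return total

# Toy forest with four derivations of weight 2, 1, 3, 2 (Z = 8, as on the slide)
hg = {
    "A": [([], 2.0), ([], 1.0)],
    "B": [([], 3.0), ([], 2.0)],
    "root": [(["A"], 1.0), (["B"], 1.0)],
}
```

Memoization is what makes this a dynamic program: each node's inside value is computed once, even though it participates in exponentially many trees.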
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees! The distribution is parameterized by Θ. (Derivation probabilities as above: p = 2/8, 1/8, 3/8, 2/8.)
Probabilistic Hypergraph
The hypergraph defines a probability distribution over trees, parameterized by Θ.
• Training: how do we set the parameters Θ?
• Decoding: which translation do we present to a user?
• Atomic inference: what atomic operations do we need to perform?
Why are these problems difficult? Brute force is too slow, as there are exponentially many trees, so sophisticated dynamic programs are required; some problems are intractable and require approximations.
Inference, Training and Decoding on Hypergraphs
• Atomic inference
  • finding one-best derivations
  • finding k-best derivations
  • computing expectations (e.g., of features)
• Training
  • Perceptron, conditional random field (CRF), minimum error rate training (MERT), minimum risk, and MIRA
• Decoding
  • Viterbi decoding, maximum a posteriori (MAP) decoding, and minimum Bayes risk (MBR) decoding
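One-best extraction is the same inside recursion in the (max, ×) semiring, with back-pointers to recover the tree. A sketch over an assumed adjacency-list encoding (head node → list of (tail nodes, edge weight)), not Joshua's actual API:

```python
def viterbi(hg, v, memo=None):
    """Return (best weight, best tree) for node v in the (max, *) semiring."""
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    best_score, best_tree = None, None
    for tails, w in hg.get(v, []):
        score, subtrees = w, []
        for t in tails:
            s, tree = viterbi(hg, t, memo)
            score *= s
            subtrees.append(tree)
        if best_score is None or score > best_score:
            best_score, best_tree = score, (v, subtrees)
    if best_score is None:                # leaf node: weight 1, trivial tree
        best_score, best_tree = 1.0, (v, [])
    memo[v] = (best_score, best_tree)
    return memo[v]

# Toy forest with four derivations of weight 2, 1, 3, 2
hg = {
    "A": [([], 2.0), ([], 1.0)],
    "B": [([], 3.0), ([], 2.0)],
    "root": [(["A"], 1.0), (["B"], 1.0)],
}
```

k-best extraction follows the same recursion, keeping the top-k score/tree combinations per node instead of the single best.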
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training (main focus)
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
Training Setup
• Each training example consists of
  • a foreign sentence (from which a hypergraph is generated to represent many possible translations)
  • a reference translation
    x: dianzi shang de mao
    y: a cat on the mat
• Training: adjust the parameters Θ so that the reference translation is preferred by the model
Supervised: Minimum Empirical Risk
• Minimum Empirical Risk Training: run the MT decoder on each input x, compare the MT output with the reference y under a loss (negated BLEU), and minimize the loss over the empirical distribution of (x, y) pairs
  • examples: MERT, CRF, Perceptron
• Uniform empirical distribution over the training pairs
What if the input x is missing?
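The objective on this slide was lost in extraction; reconstructed in standard notation, with a uniform empirical distribution over training pairs (x_i, y_i) and loss L (here negated BLEU):

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N}
L\big(\hat{y}(x_i; \Theta),\, y_i\big),
\qquad
\hat{y}(x; \Theta) \;=\; \arg\max_{y} \; p_{\Theta}(y \mid x)
```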
Unsupervised: Minimum Imputed Risk
• Minimum Empirical Risk Training requires the input x
• Minimum Imputed Risk Training: a reverse model produces an imputed input from the observed English; the forward system translates it back, and the loss is measured against the original English
  • round-trip translation (cf. analogous setups in speech recognition?)
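The imputed-risk objective, reconstructed under the same notation: the reverse model q(x | y) imputes the missing input x from the observed English y_i, and the forward system is trained against the imputed inputs:

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N}
\sum_{x} q(x \mid y_i)\; L\big(\hat{y}(x; \Theta),\, y_i\big)
```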
Training the Reverse Model
The forward and reverse models are parameterized and trained separately. Our goal is to train a good forward system; the reverse model is held fixed while training the forward one.
Approximating the Imputed-Input Distribution
• There are exponentially many imputed inputs x, stored in a hypergraph
• Approximations: k-best, sampling, or a lattice (variational approximation + lattice decoding; Dyer et al., 2008)
• Note: an SCFG composed with another SCFG is problematic, since CFGs are not closed under composition!
The Forward System
• Deterministic decoding: use the one-best translation; the objective is not differentiable ☹
• Randomized decoding: use a distribution over translations (an expected loss); differentiable ☺
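Randomized decoding replaces the argmax with an expectation, which makes the objective a smooth function of the model scores. A sketch over a k-best list (the softmax scale `gamma` is a hypothetical knob, not from the talk):

```python
import math

def expected_loss(scores, losses, gamma=1.0):
    """Expected loss under p(y) proportional to exp(gamma * score(y)),
    taken over a k-best list. Unlike the one-best loss, this is
    differentiable with respect to the scores."""
    m = max(gamma * s for s in scores)            # subtract max for stability
    exps = [math.exp(gamma * s - m) for s in scores]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    return sum(p * l for p, l in zip(probs, losses)), probs

# Three hypotheses with model scores and per-hypothesis losses (e.g., negated BLEU)
risk, probs = expected_loss([2.0, 1.0, 0.5], [0.1, 0.3, 0.6])
```

As `gamma` grows, the distribution sharpens toward the one-best hypothesis and the expected loss approaches the non-differentiable deterministic objective.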
Experiments
• Supervised training: requires bitext
• Unsupervised training: requires monolingual English
• Semi-supervised training: an interpolation of the supervised and unsupervised objectives
Semi-supervised Training
Adding unsupervised data helps! (40K sentence pairs, 551 features)
Supervised vs. Unsupervised
Unsupervised training performs as well as (and often better than) supervised training. Unsupervised training uses 16 times as much data as supervised, but the comparison is fair. For example:

         Chinese    English
Sup      100        16*100
Unsup    16*100     16*16*100

• More experiments: different k-best sizes, different reverse models
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
Language Modeling
• Language model: assigns a probability to an English sentence y; typically an n-gram model (locally normalized)
• Global log-linear model (whole-sentence maximum-entropy LM; Rosenfeld et al., 2001): globally normalized over all English sentences of any length; the features are a set of n-grams occurring in y; training requires sampling, which is slow ☹
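The whole-sentence maximum-entropy model of Rosenfeld et al. (2001), reconstructed: the features f(y) are n-grams occurring in y, and the normalizer ranges over all English sentences of any length, which is why training resorts to slow sampling:

```latex
p_{\Theta}(y) \;=\; \frac{\exp\big(\Theta \cdot \mathbf{f}(y)\big)}{Z(\Theta)},
\qquad
Z(\Theta) \;=\; \sum_{y'} \exp\big(\Theta \cdot \mathbf{f}(y')\big)
```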
Contrastive Estimation
• The global log-linear model (whole-sentence maximum-entropy LM; Rosenfeld et al., 2001) has an intractable normalizer
• Contrastive Estimation (CE; Smith and Eisner, 2005) replaces it with a neighborhood (contrastive set): a set of alternate English sentences of y, produced by a neighborhood function; the loss trains the model to recover the original English as much as possible
  • improves both speed and accuracy
  • was not originally proposed for language modeling
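Contrastive estimation (Smith and Eisner, 2005) shrinks the intractable global normalizer to a neighborhood N(y_i) of each observed sentence:

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta} \; \sum_{i}
\log \frac{\exp\big(\Theta \cdot \mathbf{f}(y_i)\big)}
          {\sum_{y' \in N(y_i)} \exp\big(\Theta \cdot \mathbf{f}(y')\big)}
```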
Contrastive Language Model Estimation
• Step 1: extract a confusion grammar (CG), an English-to-English SCFG; this is the neighborhood function, capturing paraphrase, insertion, and re-ordering
• Step 2: for each English sentence, generate a contrastive set (or neighborhood) using the CG
• Step 3: discriminative training
Step 1: Extracting a Confusion Grammar (CG)
• Derive the CG from a bilingual grammar, using the Chinese sides as pivots: two bilingual rules sharing a Chinese side yield a confusion rule
• Our neighborhood function is learned and MT-specific: the CG captures the confusion an MT system will have when translating an input
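The pivoting step can be sketched in a few lines (the rule strings below are illustrative, not actual grammar entries): any two bilingual rules that share a Chinese source side are paired into an English-to-English confusion rule.

```python
from collections import defaultdict
from itertools import permutations

def extract_confusion_grammar(bilingual_rules):
    """Pivot bilingual rules on their shared Chinese side:
    (zh -> e1) and (zh -> e2) yield the confusion rule (e1 -> e2)."""
    english_by_source = defaultdict(set)
    for zh, en in bilingual_rules:
        english_by_source[zh].add(en)
    confusion_rules = set()
    for alternatives in english_by_source.values():
        for e1, e2 in permutations(sorted(alternatives), 2):
            confusion_rules.add((e1, e2))
    return confusion_rules

# Hypothetical bilingual rules for the Chinese pattern "X0 de X1"
rules = [
    ("X0 de X1", "X1 of X0"),
    ("X0 de X1", "X1 on X0"),
    ("X0 de X1", "X0 's X1"),
]
cg = extract_confusion_grammar(rules)
```

Because every English side of a pivot appears on both sides of some confusion rule, the resulting grammar is symmetric, which is what later makes it usable for paraphrase generation.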
Step 2: Generating Contrastive Sets
Translating “dianzi shang de mao”? For the English sentence “a cat on the mat”, the CG generates the contrastive set:
the cat the mat
the cat ’s the mat
the mat on the cat
the mat of the cat
Step 3: Discriminative Training
• Training objective: CE maximizes the conditional likelihood of the original English against its contrastive set (or minimizes an expected loss)
• Iterative training
Applying the Contrastive Model
• We can use the contrastive model as a regular language model
• We can incorporate the contrastive model into an end-to-end MT system as a feature
• We may also use the contrastive model to generate paraphrase sentences (if the loss function measures semantic similarity), since the rules in the CG are symmetric
Test on Synthesized Hypergraphs of English Data
(Pipeline: each English sentence is parsed with the confusion grammar into a hypergraph, its neighborhood; the contrastive LM, trained on monolingual English, ranks the hypergraph; the one-best English is compared with the original via a BLEU score.)
Results on Synthesized Hypergraphs
• Features: baseline LM (5-gram), word penalty, target side of a confusion rule, rule bigram features; all features look only at the target sides of confusion rules
The contrastive LM recovers the original English better than a regular n-gram LM.
Results on MT Test Set
Added as a feature, the contrastive LM helps to improve MT performance.
Adding Features on the CG Itself
• On the English set and on the MT set
• Paraphrasing model (as one big feature)
• glue rules or regular confusion rules?
Summary for Discriminative Training
• Supervised: Minimum Empirical Risk (requires bitext)
• Unsupervised: Minimum Imputed Risk (requires monolingual English)
• Unsupervised: Contrastive LM Estimation (requires monolingual English)
Summary for Discriminative Training
• Supervised Training: requires bitext; can have both TM and LM features
• Unsupervised: Minimum Imputed Risk: requires monolingual English and a reverse model; can have both TM and LM features
• Unsupervised: Contrastive LM Estimation: requires monolingual English; can have LM features only
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
Variational Decoding
• We want to do inference under p, but it is intractable
• Instead, we derive a simpler distribution q*
• Then, we use q* as a surrogate for p in inference
(Diagram: the true distribution p(y|x) lies in an intractable family P; q*(y) is its closest member in a tractable family Q. MAP decoding under p is intractable (Sima’an 1996); estimating q* and decoding with it are both tractable.)
Variational Decoding for MT: an Overview
Sentence-specific decoding: given a foreign sentence x, MAP decoding under p is intractable because p(y | x) = ∑_{d ∈ D(x,y)} p(d | x) sums over derivations.
Three steps. Step 1: generate a hypergraph, with derivation distribution p(d | x), for the foreign sentence.
Step 2: estimate a model q*(y | x) ≈ ∑_{d ∈ D(x,y)} p(d | x) from the hypergraph by minimizing KL divergence; q* is an n-gram model over output strings, so in effect we approximate a hypergraph with a lattice!
Step 3: decode using q* on the hypergraph.
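In the actual method the expected n-gram counts are computed by dynamic programming over the hypergraph; purely as an illustration, the same estimate can be sketched over an explicit list of (translation, posterior) pairs. Minimizing KL(p ‖ q) within an n-gram family reduces to normalizing p's expected n-gram counts:

```python
from collections import defaultdict

def estimate_bigram_q(candidates):
    """Fit the KL-minimizing bigram model: each conditional probability is
    an expected bigram count under p, normalized by the expected count
    of its history."""
    bigram = defaultdict(float)
    history = defaultdict(float)
    for words, p in candidates:
        toks = ["<s>"] + words + ["</s>"]
        for u, v in zip(toks, toks[1:]):
            bigram[(u, v)] += p
            history[u] += p
    return {(u, v): c / history[u] for (u, v), c in bigram.items()}

def q_score(q, words):
    """Score a candidate string under the fitted bigram surrogate q*."""
    toks = ["<s>"] + words + ["</s>"]
    score = 1.0
    for u, v in zip(toks, toks[1:]):
        score *= q.get((u, v), 0.0)
    return score

# Stand-in for the hypergraph: strings with their posterior probabilities
candidates = [
    ("a cat on the mat".split(), 0.5),
    ("a cat of the mat".split(), 0.3),
    ("the mat 's a cat".split(), 0.2),
]
q = estimate_bigram_q(candidates)
```

Because q* factors over bigrams, decoding with it is an ordinary lattice/hypergraph rescoring problem, which is exactly what makes the surrogate tractable.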
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
Probabilistic Hypergraph
• “Decoding” quantities: Viterbi, k-best, counting, ...
• First-order expectations: expectation, entropy, expected loss, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order expectations: expectation over a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of expected loss or entropy
A single semiring framework computes all of these. Recipe to compute a quantity:
• Choose a semiring
• Specify a semiring weight for each hyperedge
• Run the inside algorithm
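The recipe can be instantiated for the first-order expectation semiring: elements are pairs (p, r), ⊕ adds componentwise, ⊗ follows the product rule, and each hyperedge carries (w, w·value). The inside algorithm then yields (Z, ∑_d w(d)·value(d)) at the root, so the expectation is the ratio. A sketch over an assumed adjacency-list encoding (head node → list of (tail nodes, edge weight, edge value)); the per-edge values here are hypothetical quantities such as a loss or a feature count:

```python
def e_plus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def e_times(a, b):
    # (p1, r1) ⊗ (p2, r2) = (p1*p2, p1*r2 + p2*r1): the product rule
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def inside_expectation(hg, v, memo=None):
    """Inside algorithm in the first-order expectation semiring.
    The root value is (Z, sum over derivations of weight * value),
    so expectation = second component / first component."""
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    edges = hg.get(v, [])
    if not edges:
        memo[v] = (1.0, 0.0)     # leaf: weight 1, value 0
        return memo[v]
    total = (0.0, 0.0)
    for tails, w, val in edges:
        k = (w, w * val)          # lift the edge weight into the semiring
        for t in tails:
            k = e_times(k, inside_expectation(hg, t, memo))
        total = e_plus(total, k)
    memo[v] = total
    return total

# Toy forest: derivations of weight 2, 1, 3, 2 carrying values 1, 2, 0, 1
hg = {
    "A": [([], 2.0, 1.0), ([], 1.0, 2.0)],
    "B": [([], 3.0, 0.0), ([], 2.0, 1.0)],
    "root": [(["A"], 1.0, 0.0), (["B"], 1.0, 0.0)],
}
```

Here the expected value is (2·1 + 1·2 + 3·0 + 2·1)/8 = 0.75; second-order semirings extend the pairs to quadruples to carry products of two such quantities.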
Applications of Expectation Semirings: a Summary
Inference, Training and Decoding on Hypergraphs
• Unsupervised Discriminative Training (In Preparation)
  ‣ minimum imputed risk
  ‣ contrastive language model estimation
• Variational Decoding (Li et al., ACL 2009)
• First- and Second-order Expectation Semirings (Li and Eisner, EMNLP 2009)
My Other MT Research
• Training methods (supervised)
  • Discriminative forest reranking with Perceptron (Li and Khudanpur, GALE book chapter 2009)
  • Discriminative n-gram language models (Li and Khudanpur, AMTA 2008)
• Algorithms
  • Oracle extraction from hypergraphs (Li and Khudanpur, NAACL 2009)
  • Efficient intersection between n-gram LM and CFG (Li and Khudanpur, ACL SSST 2008)
• Others
  • System combination (Smith et al., GALE book chapter 2009)
  • Unsupervised translation induction for Chinese abbreviations (Li and Yarowsky, ACL 2008)
Research other than MT
• Information extraction
• Relation extraction between formal and informal phrases (Li and Yarowsky, EMNLP 2008)
• Spoken dialog management
• Optimal dialog in consumer-rating systems using a POMDP (Li et al., SIGDial 2008)
Joshua Project
• An open-source parsing-based MT toolkit (Li et al., 2009)
  • supports Hiero (Chiang, 2007) and SAMT (Venugopal et al., 2007)
  • relies only on a word aligner and the SRI LM toolkit
• Team members: Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Wren Thornton, Jonathan Weese, Juri Ganitkevitch, Lane Schwartz, and Omar Zaidan
All the methods presented have been implemented in Joshua!
Thank you! Xiexie! (谢谢!)
Decoding over a Hypergraph
Given a hypergraph of possible translations (generated for a given foreign sentence by an already-trained model), pick a single translation to output. (Why not just pick the tree with the highest weight?)
Spurious Ambiguity
• Statistical models in MT exhibit spurious ambiguity: many different derivations (e.g., trees or segmentations) generate the same translation string
• Tree-based MT systems: derivation-tree ambiguity
• Regular phrase-based MT systems: phrase-segmentation ambiguity
Spurious Ambiguity in Derivation Trees
机器 翻译 软件 (jiqi fanyi ruanjian)
• Same output: “machine translation software”
• Three different derivation trees, built from rules such as:
  S → ( 机器 , machine),  S → ( 翻译 , translation),  S → ( 软件 , software)
  S → (S0 S1, S0 S1)   (two ways to combine the three lexical rules)
  S → (S0 翻译 S1, S0 translation S1)   (with the 机器 and 软件 lexical rules)
• Another translation: “machine transfer software”
![Page 66: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/66.jpg)
81
MAP, Viterbi and N-best Approximations
• Viterbi approximation
• N-best approximation (crunching) (May and Knight 2006)
• Exact MAP decoding: NP-hard (Sima’an 1996)
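Written out, the three decoders differ only in how they sum over the derivations D(x, y) that yield each string y. This is a sketch in assumed notation (Y(d) denotes the translation string yielded by derivation d), not verbatim from the slides:

```latex
\begin{aligned}
\hat y_{\text{MAP}}     &= \operatorname*{argmax}_{y} \sum_{d \in D(x,y)} p(d \mid x) \\
\hat y_{\text{Viterbi}} &= Y\Big(\operatorname*{argmax}_{d} p(d \mid x)\Big) \\
\hat y_{\text{crunch}}  &= \operatorname*{argmax}_{y} \sum_{d \in \text{NBest}(x)\,\cap\,D(x,y)} p(d \mid x)
\end{aligned}
```

Viterbi keeps only one derivation per string (in fact, only the single best derivation overall); crunching keeps at most N.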
![Page 67: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/67.jpg)
82
MAP vs. Approximations
Eight derivations, grouped by the translation string they yield:

| translation string | derivation probabilities | MAP (sum) | Viterbi (max) | 4-best crunching |
|---|---|---|---|---|
| red translation | 0.16, 0.12 | 0.28 | 0.16 | 0.16 |
| blue translation | 0.14, 0.14 | 0.28 | 0.14 | 0.28 |
| green translation | 0.13, 0.11, 0.10, 0.10 | 0.44 | 0.13 | 0.13 |

MAP picks the green translation (0.44), Viterbi picks red (0.16), and 4-best crunching picks blue (0.28).
• Exact MAP decoding under spurious ambiguity is intractable on hypergraphs
• Viterbi and crunching are efficient, but ignore most derivations
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
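The contrast between the three decoders can be replayed on the toy distribution above. A minimal sketch (the grouping of derivation probabilities under each translation string is reconstructed from the per-string totals, and the function names are illustrative, not from the thesis):

```python
from collections import defaultdict

# Eight derivations, each yielding one of three translation strings.
derivations = [
    ("red translation", 0.16), ("red translation", 0.12),
    ("blue translation", 0.14), ("blue translation", 0.14),
    ("green translation", 0.13), ("green translation", 0.11),
    ("green translation", 0.10), ("green translation", 0.10),
]

def map_decode(derivs):
    """Exact MAP: sum probability over ALL derivations of each string."""
    totals = defaultdict(float)
    for y, p in derivs:
        totals[y] += p
    return max(totals, key=totals.get)

def viterbi_decode(derivs):
    """Viterbi: output the string of the single best derivation."""
    return max(derivs, key=lambda yp: yp[1])[0]

def crunch_decode(derivs, n):
    """Crunching: sum probabilities within the n-best derivations only."""
    return map_decode(sorted(derivs, key=lambda yp: -yp[1])[:n])

print(map_decode(derivations))       # green translation (0.44)
print(viterbi_decode(derivations))   # red translation (0.16)
print(crunch_decode(derivations, 4)) # blue translation (0.28)
```

Each approximation picks a different winner here, which is exactly the failure mode the thesis' variational approach targets.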
![Page 68: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/68.jpg)
83
Variational Decoding
Decoding using a sentence-specific approximate distribution.
![Page 69: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/69.jpg)
84
Variational Decoding for MT: an Overview
Sentence-specific decoding of a foreign sentence x, in three steps:
1. Generate a hypergraph for the foreign sentence; it encodes p(y, d | x).
(MAP decoding of y under p(y | x) is intractable.)
![Page 70: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/70.jpg)
85
The three steps:
1. Generate a hypergraph, which encodes p(d | x).
2. Estimate a model q*(y | x) from the hypergraph: q*(y | x) ≈ ∑d∈D(x,y) p(d | x). q* is an n-gram model over output strings.
3. Decode using q* on the hypergraph.
![Page 71: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/71.jpg)
86
Variational Inference
• We want to do inference under p, but it is intractable.
• Instead, we derive a simpler distribution q* from a family Q.
• Then we use q* as a surrogate for p in inference.
![Page 72: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/72.jpg)
87
Variational Approximation
• q*: the member of a family Q of distributions with minimum KL distance to p (minimizing KL(p || q) over q is the same as minimizing the cross-entropy, since the entropy of p is a constant in q)
• Three questions:
  • how to parameterize q? as an n-gram model
  • how to estimate q*? compute expected n-gram counts and normalize
  • how to use q* for decoding? score the hypergraph with the n-gram model
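The "compute expected n-gram counts and normalize" step can be sketched in code. In the thesis the expected counts come from dynamic programming over the hypergraph; this illustrative sketch (function name and data layout are assumptions, not the thesis' implementation) instead enumerates an explicit list of weighted translations:

```python
from collections import defaultdict

def estimate_bigram_q(derivs):
    """Estimate q*(w | prev) from expected bigram counts.

    `derivs` is a list of (tokenized translation, probability) pairs.
    On a real hypergraph the expected counts are computed by dynamic
    programming; here we enumerate the derivations explicitly.
    """
    count = defaultdict(float)    # expected count of bigram (prev, w)
    context = defaultdict(float)  # expected count of context word prev
    for tokens, p in derivs:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, w in zip(padded, padded[1:]):
            count[(prev, w)] += p
            context[prev] += p
    # Normalize expected counts into conditional probabilities.
    return {bigram: c / context[bigram[0]] for bigram, c in count.items()}

# The four-translation toy example from the semiring slides.
derivs = [
    ("the mat a cat".split(), 2 / 8),
    ("a cat on the mat".split(), 1 / 8),
    ("a cat of the mat".split(), 3 / 8),
    ("the mat 's a cat".split(), 2 / 8),
]
q = estimate_bigram_q(derivs)
```

Decoding then rescores the hypergraph with this bigram model, e.g., q("cat" | "a") = 1.0 here because "a" is always followed by "cat".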
![Page 73: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/73.jpg)
88
KL divergences under different variational models
• Larger n ==> better approximation q_n ==> smaller KL divergence from p
• The reduction in KL divergence happens mostly when switching from unigram to bigram
![Page 74: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/74.jpg)
89
BLEU Results on Chinese-English NIST MT 2004 Task
• Variational decoding (new) improves over Viterbi, MBR (Kumar and Byrne, 2004), and crunching (May and Knight, 2006)
![Page 75: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/75.jpg)
90
Variational Inference
• We want to do inference under p, but exact inference under p is intractable.
• Instead, we derive a simpler distribution q* from a tractable family Q.
• Then we use q* as a tractable surrogate for p in inference.
![Page 76: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/76.jpg)
91
Outline
• Hypergraph as Hypothesis Space
• Unsupervised Discriminative Training
‣ minimum imputed risk
‣ contrastive language model estimation
• Variational Decoding
• First- and Second-order Expectation Semirings
![Page 77: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/77.jpg)
92
Probabilistic Hypergraph
• “Decoding” quantities: Viterbi, K-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.
![Page 78: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/78.jpg)
93
Compute Quantities on a Hypergraph: a Recipe
• Semiring-weighted inside algorithm, in three steps:
‣ choose a semiring: a set with plus and times operations, e.g., the integers with regular + and ×
‣ specify a weight for each hyperedge: each weight is a semiring member
‣ run the inside algorithm: complexity is O(hypergraph size)
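The recipe can be sketched as a generic inside algorithm parameterized by the semiring operations. The hypergraph below is reconstructed from the four-tree example on the following slides (node/edge names v1..v5, e1..e8 are taken from those slides; the exact topology is an assumption based on the recursions shown there):

```python
from operator import add, mul

def inside(topo_nodes, incoming, weight, plus, times):
    """Semiring-weighted inside algorithm.

    topo_nodes: node ids in bottom-up (topological) order.
    incoming[v]: list of hyperedges (edge_id, [antecedent nodes]);
                 axioms are nullary hyperedges with empty tails.
    weight[e]:   the semiring weight of hyperedge e.
    plus/times:  the semiring operations.
    Runs in time linear in the hypergraph size.
    """
    k = {}
    for v in topo_nodes:
        total = None
        for e, tails in incoming[v]:
            prod = weight[e]
            for u in tails:
                prod = times(prod, k[u])
            total = prod if total is None else plus(total, prod)
        k[v] = total
    return k

# Counting semiring (ordinary integers, +, x), every edge weight 1.
incoming = {
    "v1": [("e1", [])], "v2": [("e2", [])],
    "v3": [("e3", ["v1", "v2"]), ("e4", ["v1", "v2"])],
    "v4": [("e5", ["v1", "v2"]), ("e6", ["v1", "v2"])],
    "v5": [("e7", ["v3"]), ("e8", ["v4"])],
}
weights = {e: 1 for e in ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"]}
counts = inside(["v1", "v2", "v3", "v4", "v5"], incoming, weights, add, mul)
print(counts["v5"])  # 4
```

Swapping in a different `plus`/`times` pair (max/+, expectation-semiring tuples, ...) reuses the same traversal unchanged, which is the point of the recipe.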
![Page 79: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/79.jpg)
94
Semirings
• “Decoding”-time semirings (Goodman, 1999): counting, Viterbi, K-best, etc.
• “Training”-time semirings:
  • first-order expectation semirings (Eisner, 2002)
  • second-order expectation semirings (new)
• Applications of the semirings (new): entropy, risk, gradients of them, and many more
![Page 80: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/80.jpg)
95
dianzi0 shang1 de2 mao3
How many trees? Four ☺. Can we compute this with a semiring?
![Page 81: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/81.jpg)
96
Compute the Number of Derivation Trees
Three steps:
‣ choose a semiring: the counting semiring, ordinary integers with regular + and ×
‣ specify a weight for each hyperedge: ke = 1 for every hyperedge
‣ run the inside algorithm
![Page 82: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/82.jpg)
97
Bottom-up computation of the number of trees (every edge weight ke = 1):
k(v1) = k(e1) = 1
k(v2) = k(e2) = 1
k(v3) = k(e3) × k(v1) × k(v2) + k(e4) × k(v1) × k(v2) = 2
k(v4) = k(e5) × k(v1) × k(v2) + k(e6) × k(v1) × k(v2) = 2
k(v5) = k(e7) × k(v3) + k(e8) × k(v4) = 4
![Page 83: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/83.jpg)
98
Four translations in the hypergraph, with their lengths and probabilities:
• “the mat a cat”: length 4, p = 2/8
• “a cat on the mat”: length 5, p = 1/8
• “a cat of the mat”: length 5, p = 3/8
• “the mat ’s a cat”: length 5, p = 2/8
Expected translation length? 2/8 × 4 + 6/8 × 5 = 4.75
Variance? 2/8 × (4 − 4.75)² + 6/8 × (5 − 4.75)² ≈ 0.19
![Page 84: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/84.jpg)
99
First- and Second-order Expectation Semirings
• First-order (Eisner, 2002): each member is a 2-tuple ⟨p, r⟩
• Second-order: each member is a 4-tuple ⟨p, r, s, t⟩
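The operations behind these tuples can be written out. This is a sketch following Eisner (2002): ⟨p, r⟩ pairs a weight with a weighted value, ⊕ is elementwise in both semirings, and ⊗ obeys the product rule:

```latex
\begin{aligned}
\langle p_1, r_1\rangle \oplus \langle p_2, r_2\rangle &= \langle p_1 + p_2,\; r_1 + r_2\rangle \\
\langle p_1, r_1\rangle \otimes \langle p_2, r_2\rangle &= \langle p_1 p_2,\; p_1 r_2 + p_2 r_1\rangle \\
\langle p_1, r_1, s_1, t_1\rangle \otimes \langle p_2, r_2, s_2, t_2\rangle
  &= \langle p_1 p_2,\; p_1 r_2 + p_2 r_1,\; p_1 s_2 + p_2 s_1,\; p_1 t_2 + p_2 t_1 + r_1 s_2 + r_2 s_1\rangle
\end{aligned}
```

With edge weights k_e = ⟨p_e, p_e r_e⟩ (first-order) or ⟨p_e, p_e r_e, p_e s_e, p_e r_e s_e⟩ (second-order), the inside algorithm accumulates ⟨Z, ∑_d p(d) r(d)⟩, and in the second-order case additionally ∑_d p(d) s(d) and ∑_d p(d) r(d) s(d).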
![Page 85: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/85.jpg)
100
First-order: each semiring member is a 2-tuple (fake numbers):
Edge weights: ⟨1, 2⟩, ⟨1, 2⟩, ⟨2, 0⟩, ⟨3, 3⟩, ⟨1, 1⟩, ⟨2, 2⟩, ⟨1, 0⟩, ⟨1, 0⟩
k(v1) = k(e1) = ⟨1, 2⟩
k(v2) = k(e2) = ⟨1, 2⟩
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2) = ⟨0, 1⟩
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2) = ⟨1, 2⟩
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4) = ⟨8, 4.75⟩
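The first-order bookkeeping is small enough to run directly. A sketch (not the thesis' code) that computes the expected translation length of the four-translation example by folding ⟨p, p·length⟩ tuples with the first-order ⊕, with ⊗ shown for completeness since it is what combines antecedents along a hyperedge:

```python
def eplus(a, b):
    """First-order expectation semiring: addition is elementwise."""
    return (a[0] + b[0], a[1] + b[1])

def etimes(a, b):
    """Multiplication follows the product rule: <p1 p2, p1 r2 + p2 r1>."""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

# The four translations with their probabilities and lengths; each
# contributes a tuple <p, p * length>.
items = [(2 / 8, 4), (1 / 8, 5), (3 / 8, 5), (2 / 8, 5)]
total = (0.0, 0.0)
for p, length in items:
    total = eplus(total, (p, p * length))
Z, rbar = total
print(Z, rbar / Z)  # 1.0 4.75
```

Dividing the second component by the first (here Z = 1 since the probabilities are normalized) recovers the expectation 4.75 from the earlier slide.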
![Page 86: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/86.jpg)
101
Second-order: each semiring member is a 4-tuple (fake numbers):
Edge weights: ⟨1, 2, 2, 4⟩, ⟨2, 0, 0, 0⟩, ⟨3, 3, 3, 3⟩, ⟨1, 1, 1, 1⟩, ⟨2, 2, 2, 2⟩, ⟨1, 0, 0, 0⟩, ⟨1, 0, 0, 0⟩, ⟨1, 2, 2, 4⟩
k(v1) = k(e1) = ⟨1, 2, 2, 4⟩
k(v2) = k(e2) = ⟨1, 2, 2, 4⟩
k(v3) = k(e3) ⊗ k(v1) ⊗ k(v2) ⊕ k(e4) ⊗ k(v1) ⊗ k(v2) = ⟨1, 1, 1, 1⟩
k(v4) = k(e5) ⊗ k(v1) ⊗ k(v2) ⊕ k(e6) ⊗ k(v1) ⊗ k(v2) = ⟨1, 2, 1, 3⟩
k(v5) = k(e7) ⊗ k(v3) ⊕ k(e8) ⊗ k(v4) = ⟨8, 4.5, 4.5, 5⟩
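The second-order tuples support second moments, e.g., the variance of translation length from the earlier slide. A sketch (again not the thesis' code) taking r = s = length, so the fourth component accumulates E[length²]:

```python
def eplus2(a, b):
    """Second-order expectation semiring: elementwise addition of 4-tuples."""
    return tuple(x + y for x, y in zip(a, b))

def etimes2(a, b):
    """<p, r, s, t> product; t also picks up the cross terms r1 s2 + r2 s1."""
    p1, r1, s1, t1 = a
    p2, r2, s2, t2 = b
    return (p1 * p2,
            p1 * r2 + p2 * r1,
            p1 * s2 + p2 * s1,
            p1 * t2 + p2 * t1 + r1 * s2 + r2 * s1)

# Each translation contributes <p, p*len, p*len, p*len^2>.
items = [(2 / 8, 4), (1 / 8, 5), (3 / 8, 5), (2 / 8, 5)]
total = (0.0, 0.0, 0.0, 0.0)
for p, n in items:
    total = eplus2(total, (p, p * n, p * n, p * n * n))
Z, rbar, sbar, tbar = total
mean = rbar / Z
variance = tbar / Z - mean ** 2
print(mean, variance)  # 4.75 0.1875
```

The result matches the hand computation: 2/8 × (4 − 4.75)² + 6/8 × (5 − 4.75)² = 0.1875 ≈ 0.19.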
![Page 87: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/87.jpg)
102
Expectations on Hypergraphs
• r(d) is a function over a derivation d, e.g., the length of the translation yielded by d
• The expectation over a hypergraph naively sums over an exponential number of derivations
• r(d) must be additively decomposed over hyperedges; e.g., translation length is additively decomposed!
![Page 88: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/88.jpg)
103
Second-order Expectations on Hypergraphs
• Expectation of a product over a hypergraph: again a naive sum over an exponential number of derivations
• r and s are additively decomposed; they can be identical or different functions
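The two expectations above, written out in assumed notation (the sums over d range over exponentially many derivations, but additive decomposability makes them computable in O(hypergraph size) by the inside algorithm):

```latex
\begin{aligned}
\mathbb{E}[r]   &= \sum_{d} p(d)\, r(d), & r(d) &= \sum_{e \in d} r_e \\
\mathbb{E}[r\,s] &= \sum_{d} p(d)\, r(d)\, s(d), & s(d) &= \sum_{e \in d} s_e
\end{aligned}
```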
![Page 89: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/89.jpg)
104
Compute the expectation using an expectation semiring.
Why? Entropy is an expectation.
Entropy: H(p) = E[−log p(d)]
pe: transition probability or log-linear score at edge e. What is re? re = log pe, because log p(d) is additively decomposed!
![Page 90: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/90.jpg)
105
Compute the expectation using an expectation semiring.
Why? Cross-entropy is an expectation too.
Cross-entropy: H(p, q) = Ep[−log q(d)]
What is re? re = log qe, because log q(d) is additively decomposed!
106
Compute the expectation using an expectation semiring.
Why? Bayes risk is an expectation as well.
Bayes risk: R = Ep[L(Y(d))] (Tromble et al. 2008)
What is re? re = Le, because the loss L(Y(d)) is additively decomposed!
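Putting the three cases together, the recipe can be sketched as follows (a reconstruction in assumed notation, not verbatim from the slides): weight each edge with k_e = ⟨p_e, p_e r_e⟩, run the inside algorithm to obtain ⟨Z, r̄⟩, then read off the quantity:

```latex
\begin{aligned}
&k_e = \langle p_e,\; p_e r_e\rangle
  \;\Rightarrow\; \text{inside algorithm yields } \langle Z,\; \bar r\rangle,
  \quad \bar r = \sum_d p(d)\, r(d) \\
&\text{entropy:}\quad r_e = \log p_e, \qquad H(p) = \log Z - \bar r / Z \\
&\text{cross-entropy:}\quad r_e = \log q_e, \qquad H(p, q) = -\bar r / Z \ \ (\text{for normalized } q) \\
&\text{Bayes risk:}\quad r_e = L_e, \qquad R = \bar r / Z
\end{aligned}
```

The log Z correction in the entropy case accounts for p being an unnormalized log-linear score over the hypergraph.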
![Page 92: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/92.jpg)
107
Applications of Expectation Semirings: a Summary
![Page 93: 1 Novel Inference, Training and Decoding Methods over Translation Forests Zhifei Li Center for Language and Speech Processing Computer Science Department](https://reader030.vdocuments.us/reader030/viewer/2022032606/56649ea75503460f94baa3de/html5/thumbnails/93.jpg)
108
Probabilistic Hypergraph
• “Decoding” quantities: Viterbi, K-best, counting, ...
• First-order quantities: expectation, entropy, Bayes risk, cross-entropy, KL divergence, feature expectations, first-order gradient of Z
• Second-order quantities: expectation of a product, interaction between features, Hessian matrix of Z, second-order gradient descent, gradient of an expectation, gradient of entropy or Bayes risk
A semiring framework computes all of these.