conditional random fields
![Page 1: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/1.jpg)
Conditional Random Fields
William W. Cohen
CALD
![Page 2: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/2.jpg)
Announcements
• Upcoming assignments:
  – Today: Sha & Pereira, Lafferty et al
  – Mon 2/23: Klein & Manning, Toutanova et al
  – Wed 2/25: no writeup due
  – Mon 3/1: no writeup due
  – Wed 3/3: project proposal due: personnel + 1-2 page
  – Spring break week, no class
![Page 3: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/3.jpg)
Review: motivation for CMM’s
Ideally we would like to use many, arbitrary, overlapping features of words.
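• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …
[Figure: chain of states S(t-1), S(t), S(t+1), each with an observation O(t-1), O(t), O(t+1); the observations are annotated with example features such as is “Wisniewski”, ends in “-ski”, part of noun phrase, …]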
![Page 4: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/4.jpg)
Motivation for CMMs
[Figure: the same state/observation chain and word-feature list as on the previous slide]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
$$\Pr(s_t \mid x_t, s_{t-1}, \ldots)$$
![Page 5: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/5.jpg)
Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
  – “a node is conditionally independent of its non-descendants given its parents”
![Page 6: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/6.jpg)
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
• P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
  – In the training data, label value 2 is the only label value observed after label value 1.
  – Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.
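The argument above can be checked numerically. This is a minimal sketch, assuming a toy MEMM whose state numbers, successor table, and scores are all made up for illustration (they are not taken from the slide's figure). Because state 1 has a single successor, per-state normalization gives that transition probability 1 whatever the observation is.

```python
import math

# Hypothetical toy MEMM: each state lists the successors seen in training.
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5]}

def score(prev, nxt, obs):
    """Made-up unnormalized log-score for moving prev -> nxt on observation obs."""
    preferred = {(1, 2): "i", (4, 5): "o"}  # 1->2 'likes' i, 4->5 'likes' o
    if (prev, nxt) not in preferred:
        return 0.0
    return 3.0 if preferred[(prev, nxt)] == obs else -3.0

def local_prob(prev, nxt, obs):
    """Per-state (local) normalization over the successors of prev."""
    z = sum(math.exp(score(prev, s, obs)) for s in SUCCESSORS[prev])
    return math.exp(score(prev, nxt, obs)) / z

def path_prob(path, observations):
    """Product of locally normalized transition probabilities along a path."""
    p = 1.0
    for prev, nxt, obs in zip(path, path[1:], observations):
        p *= local_prob(prev, nxt, obs)
    return p

p_ri = path_prob([0, 1, 2], "ri")
p_ro = path_prob([0, 1, 2], "ro")
# The two values are identical: the score on 1 -> 2 is normalized away.
print(p_ri, p_ro)
```

The score function clearly prefers "i" on the 1 -> 2 transition, yet that preference never reaches the path probability. That is the label bias effect in miniature.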
![Page 7: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/7.jpg)
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
Pr(0123|rib)=1
Pr(0453|rob)=1
![Page 8: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/8.jpg)
How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for next week….
![Page 9: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/9.jpg)
Another view of label bias [Sha & Pereira]
So what’s the alternative?
![Page 10: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/10.jpg)
Review of maxent
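$$\Pr(x) = \alpha_0 \prod_i \alpha_i^{f_i(x)} \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big)$$

$$\Pr(x, y) \propto \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$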
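The conditional maxent form Pr(y|x) can be sketched in a few lines. This is an illustration only: the function name `maxent_prob`, the two features, and the weights are hypothetical, chosen to echo the "Wisniewski" example from the motivation slides.

```python
import math

def maxent_prob(weights, features, x, labels):
    """Return {y: Pr(y|x)} for a maxent model.

    weights:  list of lambda_i
    features: list of functions f_i(x, y) -> float
    """
    scores = {y: sum(w * f(x, y) for w, f in zip(weights, features)) for y in labels}
    m = max(scores.values())  # subtract the max for numerical stability
    expd = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(expd.values())    # the normalizer: sum over all candidate labels y'
    return {y: v / z for y, v in expd.items()}

# Hypothetical features for a name / nonName decision on a single word.
features = [
    lambda x, y: 1.0 if x[0].isupper() and y == "name" else 0.0,
    lambda x, y: 1.0 if x.endswith("ski") and y == "name" else 0.0,
]
weights = [1.0, 2.0]  # made-up lambda_i
probs = maxent_prob(weights, features, "Wisniewski", ["name", "nonName"])
print(probs)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged and avoids overflow when scores are large.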
![Page 11: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/11.jpg)
Review of maxent/MEMM/CMMs
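$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$$

For MEMM:

$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$$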
![Page 12: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/12.jpg)
Details on CMMs
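$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$$

$$\prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j)}, \text{ where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$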
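The CMM probability can be computed two ways: as a product of locally normalized steps, or as one exponential of summed features divided by the product of the local normalizers. The sketch below checks that the two agree. The labels, features, weights, and the "start" label are assumptions made up for the example. The slides write the local normalizer as Z(x_j), but in code it also has to take the previous label, since the features depend on it.

```python
import math

LABELS = ["name", "nonName"]
LAMBDA = [1.5, 0.5]  # hypothetical weights lambda_i

def f(x_j, y_j, y_prev):
    """Hypothetical local feature vector f_i(x_j, y_j, y_{j-1})."""
    return [
        1.0 if x_j[0].isupper() and y_j == "name" else 0.0,
        1.0 if y_j == y_prev else 0.0,
    ]

def local_score(x_j, y_j, y_prev):
    return sum(l * v for l, v in zip(LAMBDA, f(x_j, y_j, y_prev)))

def local_z(x_j, y_prev):
    """Local normalizer at one position (depends on the previous label too)."""
    return sum(math.exp(local_score(x_j, y, y_prev)) for y in LABELS)

def cmm_prob_product(xs, ys, start="start"):
    """prod_j exp(sum_i lambda_i f_i(x_j, y_j, y_{j-1})) / Z_j."""
    p, prev = 1.0, start
    for x_j, y_j in zip(xs, ys):
        p *= math.exp(local_score(x_j, y_j, prev)) / local_z(x_j, prev)
        prev = y_j
    return p

def cmm_prob_summed_features(xs, ys, start="start"):
    """exp(sum_i lambda_i F_i(x, y)) / prod_j Z_j, with F_i summed over positions."""
    F = [0.0] * len(LAMBDA)
    z_prod, prev = 1.0, start
    for x_j, y_j in zip(xs, ys):
        F = [a + b for a, b in zip(F, f(x_j, y_j, prev))]
        z_prod *= local_z(x_j, prev)
        prev = y_j
    return math.exp(sum(l * v for l, v in zip(LAMBDA, F))) / z_prod

xs = ["William", "Cohen", "teaches"]
ys = ["name", "name", "nonName"]
p_product = cmm_prob_product(xs, ys)
p_summed = cmm_prob_summed_features(xs, ys)
print(p_product, p_summed)
```

The numerator is already a single global exponential. Only the denominator is still a product of local terms, which is the part the CRF replaces.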
![Page 13: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/13.jpg)
From CMMs to CRFs
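$$\prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j)}, \text{ where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$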
Recall why we’re unhappy: we don’t want local normalization
New model:

$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$$
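A brute-force sketch of the globally normalized model follows. Z(x) is computed by enumerating every label sequence, which is only feasible for tiny examples. The labels, features, weights, and the "start" label are hypothetical.

```python
import itertools
import math

LABELS = ["name", "nonName"]
LAMBDA = [1.5, 0.5]  # hypothetical weights lambda_i

def f(x_j, y_j, y_prev):
    """Hypothetical local feature vector f_i(x_j, y_j, y_{j-1})."""
    return [
        1.0 if x_j[0].isupper() and y_j == "name" else 0.0,
        1.0 if y_j == y_prev else 0.0,
    ]

def global_score(xs, ys, start="start"):
    """sum_i lambda_i F_i(x, y), with F_i(x, y) = sum_j f_i(x_j, y_j, y_{j-1})."""
    total, prev = 0.0, start
    for x_j, y_j in zip(xs, ys):
        total += sum(l * v for l, v in zip(LAMBDA, f(x_j, y_j, prev)))
        prev = y_j
    return total

def crf_prob(xs, ys):
    """exp(score(x, y)) / Z(x), where Z(x) sums over ALL label sequences."""
    z = sum(math.exp(global_score(xs, cand))
            for cand in itertools.product(LABELS, repeat=len(xs)))
    return math.exp(global_score(xs, ys)) / z

xs = ["William", "Cohen", "teaches"]
all_probs = {ys: crf_prob(xs, ys)
             for ys in itertools.product(LABELS, repeat=len(xs))}
print(sum(all_probs.values()))
```

With a single global normalizer, a low score at one position lowers the probability of the whole sequence instead of being renormalized away locally.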
![Page 14: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/14.jpg)
What’s the new model look like?
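$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: graphical model with observations x1, x2, x3 and labels y1, y2, y3]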
What’s independent?
![Page 15: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/15.jpg)
What’s the new model look like?
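$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: graphical model with a single observation node x and labels y1, y2, y3]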
What’s independent now??
![Page 16: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/16.jpg)
Hammersley-Clifford
• For positive distributions P(x1,…,xn):
  – Pr(xi|x1,…,xi-1,xi+1,…,xn) = Pr(xi|Neighbors(xi))
  – Pr(A|B,S) = Pr(A|S) where A,B are sets of nodes and S is a set that separates A and B
  – P can be written as normalized product of “clique potentials”

$$\Pr(x) = \frac{1}{Z} \prod_{C\ \text{clique}} \phi_C(x_C)$$
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”)
![Page 17: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/17.jpg)
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
![Page 18: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/18.jpg)
Example of CRFs
![Page 19: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/19.jpg)
Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical models of an HMM, an MEMM, and a CRF, side by side]
![Page 20: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/20.jpg)
Lafferty et al notation
$\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
x is a data sequence
y is a label sequence
v is a vertex from vertex set V = set of label random variables
e is an edge from edge set E over V
f_k and g_k are given and fixed. g_k is a Boolean vertex feature; f_k is a Boolean edge feature
k is the number of features
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \Big)$$
![Page 21: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/21.jpg)
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \Big)$$
Z(x) is a normalization over the data sequence x
• Learning:
  – Lafferty et al’s IIS-based method is rather inefficient.
  – Gradient-based methods are faster
  – Trickiest bit is computing normalization, which is over exponentially many y vectors.
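For reference, the gradient that gradient-based methods follow has a standard form for log-linear models. It is stated here in the notation of the earlier slides, with weights λ_i and summed features F_i(x, y) = Σ_j f_i(x_j, y_j, y_{j-1}), for a single training pair (x, y):

$$\frac{\partial}{\partial \lambda_i} \log \Pr(y \mid x) = F_i(x, y) - \sum_{y'} \Pr(y' \mid x)\, F_i(x, y')$$

The first term is the observed feature count. The second is the expected count under the model, and it involves the same sum over exponentially many label sequences as Z(x).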
![Page 22: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/22.jpg)
CRF learning – from Sha & Pereira
![Page 23: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/23.jpg)
CRF learning – from Sha & Pereira
![Page 24: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/24.jpg)
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
• Define matrix of y,y’ “affinities” at stage i
• Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i
• Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
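The matrix idea can be sketched as follows: build one matrix of affinities per stage, multiply them, and sum the entries to get Z(x). The labels, the local score, and the "start" label are hypothetical, and the brute-force enumeration is included only to check the result on a tiny input.

```python
import itertools
import math

LABELS = ["name", "nonName"]
START = "start"  # hypothetical label preceding the first position

def score(x_i, y_prev, y):
    """Hypothetical local score: sum_k lambda_k f_k(x_i, y, y_prev)."""
    s = 0.0
    if x_i[0].isupper() and y == "name":
        s += 1.5
    if y_prev == y:
        s += 0.5
    return s

def stage_matrix(x_i, prev_labels):
    """M_i[y][y2] = exp(score): unnormalized affinity of moving y -> y2 at stage i."""
    return [[math.exp(score(x_i, y, y2)) for y2 in LABELS] for y in prev_labels]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def partition_by_matrices(xs):
    """Z(x) as the sum of the entries of M_1 * M_2 * ... * M_n."""
    prod = stage_matrix(xs[0], [START])  # 1 x |LABELS| row for the first stage
    for x_i in xs[1:]:
        prod = matmul(prod, stage_matrix(x_i, LABELS))
    return sum(prod[0])

def partition_by_enumeration(xs):
    """Z(x) by summing exp(score) over every label sequence (exponential cost)."""
    total = 0.0
    for ys in itertools.product(LABELS, repeat=len(xs)):
        prev, s = START, 0.0
        for x_i, y in zip(xs, ys):
            s += score(x_i, prev, y)
            prev = y
        total += math.exp(s)
    return total

xs = ["William", "Cohen", "teaches", "here"]
z_matrix = partition_by_matrices(xs)
z_brute = partition_by_enumeration(xs)
print(z_matrix, z_brute)
```

The matrix route costs on the order of n·|labels|² multiplications, against |labels|^n terms for the enumeration.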
![Page 25: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/25.jpg)
[Figure: observation x with labels y1, y2, y3, drawn twice]
![Page 26: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/26.jpg)
Forward backward ideas
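[Figure: trellis over three stages, each with states name and nonName; the edges between stages are labeled with weights a through h]

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae + bg & af + bh \\ \ldots & \ldots \end{pmatrix}$$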
![Page 27: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/27.jpg)
CRF learning – from Sha & Pereira
![Page 28: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/28.jpg)
CRF learning – from Sha & Pereira
![Page 29: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/29.jpg)
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
![Page 30: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/30.jpg)
Sha & Pereira results
in minutes, 375k examples
![Page 31: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/31.jpg)
POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)
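The orthographic features listed above could be extracted along these lines. This is a sketch: the function name, the feature names, and the choice to lowercase before matching suffixes are assumptions, not details from Lafferty et al.

```python
# Suffixes as listed on the slide.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    """Map a word to a dict of Boolean spelling features."""
    feats = {
        "begins_with_number": word[:1].isdigit(),
        "begins_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    lowered = word.lower()  # assumption: suffix matching ignores case
    for suffix in SUFFIXES:
        feats["suffix_" + suffix] = lowered.endswith(suffix)
    return feats

example = orthographic_features("Well-known")
print(example)
```

Several suffix features can fire at once (for example, a word ending in "-tion" also ends in "-ion"). This kind of overlapping feature is what the conditional models are meant to accommodate.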
![Page 32: Conditional Random Fields](https://reader038.vdocuments.us/reader038/viewer/2022103006/5681399f550346895da13be6/html5/thumbnails/32.jpg)
POS tagging vs MXPost