TRANSCRIPT
Margin Learning, Online Learning, and The Voted Perceptron
SPLODD ~= AE* – 3, 2011
* Autumnal Equinox
Review
• Computer science is full of equivalences:
– SQL ≈ relational algebra
– YFCL ≈ optimizing … on the training data
– gcc -O4 foo.c ≈ gcc foo.c
• Also full of relationships between sets:
– Finding the smallest error-free decision tree >> 3-SAT
– DataLog >> relational algebra
– CFL >> Det FSMs = RegEx
Review
• Bayes Nets: describe a (family of) joint distribution(s) between random variables
– They are an operational description (a program) for how data can be generated
– They are a declarative description (a definition) for the joint distribution, and from this we can derive algorithms for doing stuff other than generation (a minimal sampling sketch follows this list)
• There is a close connection between Naïve Bayes and loglinear models
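To make the "operational vs. declarative" contrast concrete, here is a minimal sketch (not from the lecture) of a toy two-variable net Y → W, with invented probability tables: the same tables serve both as a sampling program and as a definition of the joint.

```python
import random

# Toy two-variable Bayes net Y -> W (class generates word).
# All probability tables below are invented for illustration.
P_Y = {"sports": 0.5, "politics": 0.5}
P_W_given_Y = {
    "sports":   {"ball": 0.7, "vote": 0.3},
    "politics": {"ball": 0.2, "vote": 0.8},
}

def sample():
    """Operational view: run the net as a program (ancestral sampling)."""
    y = random.choices(list(P_Y), weights=list(P_Y.values()))[0]
    w_dist = P_W_given_Y[y]
    w = random.choices(list(w_dist), weights=list(w_dist.values()))[0]
    return y, w

def joint(y, w):
    """Declarative view: the same tables define Pr(Y=y, W=w) directly."""
    return P_Y[y] * P_W_given_Y[y][w]

print(sample(), joint("sports", "ball"))
```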
NB vs loglinear models

[Diagram: nested model families in classifier space. Multinomial(?) classifiers sit inside NB classifiers, which sit inside loglinear classifiers. Marked points include NB-JL (joint-likelihood training), NB-CL (conditional-likelihood training), and NB-CL* (trained to Max CL(y|x) with a Gaussian prior G(0,1.0)); smoothing annotations include SymDir(100) and AbsDisc(0.01).]

$$P(y \mid \mathbf{x}) \propto \exp\Big(\sum_i \lambda_i f_i(\mathbf{x}, y)\Big)$$

$$f_{j,w}(\mathbf{x}, y) = [\text{word } j \text{ of } \mathbf{x} \text{ is } w], \qquad \lambda_{j,w} = \ln \Pr(w \mid y)$$
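A small numeric check of the reduction on this slide, with invented NB parameters; the per-class bias feature with weight ln Pr(y) is an assumption added here to carry the class prior.

```python
import math

# Invented NB parameters for a two-class toy problem.
prior = {"pos": 0.5, "neg": 0.5}
p_w_given_y = {
    "pos": {"good": 0.6, "bad": 0.4},
    "neg": {"good": 0.3, "bad": 0.7},
}

def nb_score(words, y):
    """NB joint score: Pr(y) * prod_j Pr(w_j | y)."""
    s = prior[y]
    for w in words:
        s *= p_w_given_y[y][w]
    return s

def loglinear_score(words, y):
    """exp(sum_i lambda_i f_i(x, y)): lambda_{j,w} = ln Pr(w|y) fires once
    per token, plus a per-class bias feature with weight ln Pr(y)."""
    total = math.log(prior[y])  # bias feature (an assumption added here)
    for w in words:
        total += math.log(p_w_given_y[y][w])
    return math.exp(total)

x = ["good", "good", "bad"]
for y in ("pos", "neg"):
    assert abs(nb_score(x, y) - loglinear_score(x, y)) < 1e-12
```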
NB vs loglinear models (slide repeated, adding the NB plate diagram Y → W_j: the NB parameter settings are "optimal if" the NB independence assumptions hold).
Similarly for sequences…
• An HMM is a Bayes net
– It implies a set of independence assumptions
– ML parameter setting and Viterbi are optimal if these hold
• A CRF is a Markov field
– It implies a set of independence assumptions
– These, plus the goal of maximizing Pr(y|x), give us a learning algorithm
• You can construct features so that any HMM can be emulated by a CRF with those features (see the sketch below)
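To make the last bullet concrete: a sketch, with an invented two-state HMM, of the standard construction where the CRF features count transition and emission events and the weights are the HMM's log-probabilities, so the unnormalized CRF score equals the HMM joint probability.

```python
import math

# Invented HMM parameters: states B/I over a toy alphabet {a, b}.
start = {"B": 0.6, "I": 0.4}
trans = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.5, "I": 0.5}}
emit  = {"B": {"a": 0.8, "b": 0.2}, "I": {"a": 0.1, "b": 0.9}}

def hmm_joint(xs, ys):
    """HMM joint probability of an observation/state sequence pair."""
    p = start[ys[0]] * emit[ys[0]][xs[0]]
    for t in range(1, len(xs)):
        p *= trans[ys[t-1]][ys[t]] * emit[ys[t]][xs[t]]
    return p

def crf_score(xs, ys):
    """exp(sum_i w_i F_i(x, y)): F_i counts each start / transition /
    emission event; w_i is the corresponding log HMM probability."""
    s = math.log(start[ys[0]])
    for t in range(len(xs)):
        s += math.log(emit[ys[t]][xs[t]])
        if t > 0:
            s += math.log(trans[ys[t-1]][ys[t]])
    return math.exp(s)

xs, ys = ["a", "b", "b"], ["B", "I", "I"]
assert abs(hmm_joint(xs, ys) - crf_score(xs, ys)) < 1e-12
```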
In sequence space…

[Diagram: nested model families in sequence-model space. Multinomial(?) models sit inside HMMs, which sit inside CRF/loglinear models. Marked points include JL (joint-likelihood training), CL (conditional-likelihood training), and CL* (Max CL(y|x) with a Gaussian prior G(0,1.0)); smoothing annotations include SymDir(100) and AbsDisc(0.01).]

$$P(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big(\sum_i w_i F_i(\mathbf{x}, \mathbf{y})\Big)$$
Review: CRFs/Markov Random Fields

Example: "When will prof Cohen post the notes …"

Semantics of a Markov random field: [chain diagram Y1 – Y2 – Y3 – Y4 – Y5 – Y6 – Y7]

What's independent: $\Pr(Y_i \mid \text{other } Y\text{'s}) = \Pr(Y_i \mid Y_{i-1}, Y_{i+1})$

Probability distribution: $P(Y_1{=}y_1, \ldots, Y_n{=}y_n) \propto \Phi(y_1, y_2) \cdots \Phi(y_{n-1}, y_n)$
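A brute-force check of that independence claim on a 5-node chain with an invented potential table: two contexts that agree on the neighbors of Y_2 but differ far away yield the same conditional.

```python
# Invented pairwise potential on a 5-node chain Y0 - Y1 - Y2 - Y3 - Y4.
VALS = (0, 1)
phi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def weight(ys):
    """Unnormalized score: product of potentials over adjacent pairs."""
    w = 1.0
    for a, b in zip(ys, ys[1:]):
        w *= phi[(a, b)]
    return w

def conditional(context, i):
    """Pr(Y_i = v | all other Y's fixed as in context), by enumeration."""
    num = {v: weight([v if k == i else y for k, y in enumerate(context)])
           for v in VALS}
    z = sum(num.values())
    return {v: num[v] / z for v in VALS}

# Same neighbors (Y1 = 1, Y3 = 1), different far-away Y0 and Y4:
c1 = conditional([0, 1, None, 1, 0], 2)
c2 = conditional([1, 1, None, 1, 1], 2)
assert all(abs(c1[v] - c2[v]) < 1e-12 for v in VALS)
```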
Review: CRFs/Markov Random Fields

[Trellis diagram: tags {B, I, O} at each of the 7 positions, with transitions between adjacent positions.]

"When will prof Cohen post the notes …"

$$P(X_1{=}x_1, \ldots, X_n{=}x_n) = \frac{1}{Z}\, M(x_1, x_2) \cdots M(x_{n-1}, x_n)$$
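A sketch of the 1/Z matrix form above, assuming one invented transfer matrix per adjacent pair of positions over the three tags: Z comes out of a chain of matrix products (the forward recursion), checked against brute-force enumeration.

```python
from itertools import product

import numpy as np

TAGS = ["B", "I", "O"]  # matrix indices 0, 1, 2 correspond to these tags
n = 7  # positions: "When will prof Cohen post the notes"

# Invented transfer matrices: M_t[a, b] = potential of (y_t = a, y_{t+1} = b).
rng = np.random.default_rng(0)
Ms = [rng.uniform(0.5, 2.0, size=(3, 3)) for _ in range(n - 1)]

# Z as a chain of matrix products (the forward recursion).
alpha = np.ones(3)
for M in Ms:
    alpha = alpha @ M
Z = alpha.sum()

def path_weight(path):
    """Product of potentials along one tag path."""
    w = 1.0
    for t in range(n - 1):
        w *= Ms[t][path[t], path[t + 1]]
    return w

# Sanity check against brute-force enumeration of all 3**7 paths.
Z_brute = sum(path_weight(p) for p in product(range(3), repeat=n))
assert abs(Z - Z_brute) / Z < 1e-10
```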
Review: CRFs/Markov Random Fields

Example: "When will prof Cohen post the notes …"

$$P(Y_1{=}y_1, \ldots, Y_n{=}y_n) \propto \prod_{(i,j)\in E} \Phi_{i,j}(y_i, y_j)$$

[Chain diagram Y1 – Y2 – Y3 – Y4 – Y5 – Y6 – Y7]

What's independent: $\Pr(Y_i \mid \text{other } Y\text{'s}) = \Pr(Y_i \mid \text{neighbors of } Y_i)$
Pseudo-likelihood and dependency networks
• Any Markov field defines a (family of) probability distributions D
– But not a simple program for generation/sampling
– We can use MCMC in the general case
• If you have, for each node i, P_D(X_i | Pa_i), that's a dependency net
– Still no simple program for generation/sampling (but we can use Gibbs; see the sketch after this list)
– You can learn these from data using YFCL
– Equivalently: learning this maximizes pseudo-likelihood, just as HMM learning maximizes (real) likelihood on a sequence.
• A weirdness: every MRF has an equivalent dependency net, but not every dependency net (set of local conditionals) has an equivalent MRF
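As promised above, a sketch of Gibbs sampling from the local conditionals on the same invented toy chain: there is no direct ancestral-sampling program, but repeatedly resampling each node given its neighbors works.

```python
import random

# Same invented toy chain as before.
VALS = (0, 1)
phi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def local_conditional(ys, i):
    """P(Y_i = v | neighbors) on a chain MRF: product of touching potentials."""
    num = {}
    for v in VALS:
        w = 1.0
        if i > 0:
            w *= phi[(ys[i - 1], v)]
        if i < len(ys) - 1:
            w *= phi[(v, ys[i + 1])]
        num[v] = w
    z = sum(num.values())
    return {v: num[v] / z for v in VALS}

def gibbs(n=5, sweeps=1000, seed=0):
    """Sample by sweeping the chain, resampling each node from its local conditional."""
    random.seed(seed)
    ys = [random.choice(VALS) for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            dist = local_conditional(ys, i)
            ys[i] = random.choices(VALS, weights=[dist[v] for v in VALS])[0]
    return ys

print(gibbs())
```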
And now for …