TRANSCRIPT
Margin Learning, Online Learning, and The Voted Perceptron
SPLODD ~= AE* – 3, 2011
* Autumnal Equinox
Review
• Computer science is full of equivalences:
– SQL ≈ relational algebra
– YFCL ≈ optimizing … on the training data
– gcc -O4 foo.c ≈ gcc foo.c
• Also full of relationships between sets:
– Finding the smallest error-free decision tree >> 3-SAT
– DataLog >> relational algebra
– CFL >> Det FSMs = RegEx
Review
• Bayes Nets: describe a (family of) joint distribution(s) between random variables
– They are an operational description (a program) for how data can be generated
– They are a declarative description (a definition) for the joint distribution, and from this we can derive algorithms for doing stuff other than generation (a minimal sampling sketch follows this list)
• There is a close connection between Naïve Bayes and loglinear models
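To make the "operational vs. declarative" contrast concrete, here is a minimal sketch (not from the lecture) of a toy two-variable net Y → W, with invented probability tables: the same tables serve both as a sampling program and as a definition of the joint.

```python
import random

# Toy two-variable Bayes net Y -> W (class generates word).
# All probability tables below are invented for illustration.
P_Y = {"sports": 0.5, "politics": 0.5}
P_W_given_Y = {
    "sports":   {"ball": 0.7, "vote": 0.3},
    "politics": {"ball": 0.2, "vote": 0.8},
}

def sample():
    """Operational view: run the net as a program (ancestral sampling)."""
    y = random.choices(list(P_Y), weights=list(P_Y.values()))[0]
    w_dist = P_W_given_Y[y]
    w = random.choices(list(w_dist), weights=list(w_dist.values()))[0]
    return y, w

def joint(y, w):
    """Declarative view: the same tables define Pr(Y=y, W=w) directly."""
    return P_Y[y] * P_W_given_Y[y][w]

print(sample(), joint("sports", "ball"))
```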
NB vs loglinear models

[Diagram: nested model families in classifier space. Multinomial(?) classifiers sit inside NB classifiers, which sit inside loglinear classifiers. Marked points include NB-JL (joint-likelihood training), NB-CL (conditional-likelihood training), and NB-CL* (trained to Max CL(y|x) with a Gaussian prior G(0,1.0)); smoothing annotations include SymDir(100) and AbsDisc(0.01).]

$$P(y \mid \mathbf{x}) \propto \exp\Big(\sum_i \lambda_i f_i(\mathbf{x}, y)\Big)$$

$$f_{j,w}(\mathbf{x}, y) = [\text{word } j \text{ of } \mathbf{x} \text{ is } w], \qquad \lambda_{j,w} = \ln \Pr(w \mid y)$$
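A small numeric check of the reduction on this slide, with invented NB parameters; the per-class bias feature with weight ln Pr(y) is an assumption added here to carry the class prior.

```python
import math

# Invented NB parameters for a two-class toy problem.
prior = {"pos": 0.5, "neg": 0.5}
p_w_given_y = {
    "pos": {"good": 0.6, "bad": 0.4},
    "neg": {"good": 0.3, "bad": 0.7},
}

def nb_score(words, y):
    """NB joint score: Pr(y) * prod_j Pr(w_j | y)."""
    s = prior[y]
    for w in words:
        s *= p_w_given_y[y][w]
    return s

def loglinear_score(words, y):
    """exp(sum_i lambda_i f_i(x, y)): lambda_{j,w} = ln Pr(w|y) fires once
    per token, plus a per-class bias feature with weight ln Pr(y)."""
    total = math.log(prior[y])  # bias feature (an assumption added here)
    for w in words:
        total += math.log(p_w_given_y[y][w])
    return math.exp(total)

x = ["good", "good", "bad"]
for y in ("pos", "neg"):
    assert abs(nb_score(x, y) - loglinear_score(x, y)) < 1e-12
```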
NB vs loglinear models (slide repeated, adding the NB plate diagram Y → W_j: the NB parameter settings are "optimal if" the NB independence assumptions hold).
Similarly for sequences…
• An HMM is a Bayes net
– It implies a set of independence assumptions
– ML parameter setting and Viterbi are optimal if these hold
• A CRF is a Markov field
– It implies a set of independence assumptions
– These, plus the goal of maximizing Pr(y|x), give us a learning algorithm
• You can construct features so that any HMM can be emulated by a CRF with those features (see the sketch below)
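To make the last bullet concrete: a sketch, with an invented two-state HMM, of the standard construction where the CRF features count transition and emission events and the weights are the HMM's log-probabilities, so the unnormalized CRF score equals the HMM joint probability.

```python
import math

# Invented HMM parameters: states B/I over a toy alphabet {a, b}.
start = {"B": 0.6, "I": 0.4}
trans = {"B": {"B": 0.3, "I": 0.7}, "I": {"B": 0.5, "I": 0.5}}
emit  = {"B": {"a": 0.8, "b": 0.2}, "I": {"a": 0.1, "b": 0.9}}

def hmm_joint(xs, ys):
    """HMM joint probability of an observation/state sequence pair."""
    p = start[ys[0]] * emit[ys[0]][xs[0]]
    for t in range(1, len(xs)):
        p *= trans[ys[t-1]][ys[t]] * emit[ys[t]][xs[t]]
    return p

def crf_score(xs, ys):
    """exp(sum_i w_i F_i(x, y)): F_i counts each start / transition /
    emission event; w_i is the corresponding log HMM probability."""
    s = math.log(start[ys[0]])
    for t in range(len(xs)):
        s += math.log(emit[ys[t]][xs[t]])
        if t > 0:
            s += math.log(trans[ys[t-1]][ys[t]])
    return math.exp(s)

xs, ys = ["a", "b", "b"], ["B", "I", "I"]
assert abs(hmm_joint(xs, ys) - crf_score(xs, ys)) < 1e-12
```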
In sequence space…

[Diagram: nested model families in sequence-model space. Multinomial(?) models sit inside HMMs, which sit inside CRF/loglinear models. Marked points include JL (joint-likelihood training), CL (conditional-likelihood training), and CL* (Max CL(y|x) with a Gaussian prior G(0,1.0)); smoothing annotations include SymDir(100) and AbsDisc(0.01).]

$$P(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big(\sum_i w_i F_i(\mathbf{x}, \mathbf{y})\Big)$$
Review: CRFs/Markov Random Fields

Example: "When will prof Cohen post the notes …"

Semantics of a Markov random field: [chain diagram Y1 – Y2 – Y3 – Y4 – Y5 – Y6 – Y7]

What's independent: $\Pr(Y_i \mid \text{other } Y\text{'s}) = \Pr(Y_i \mid Y_{i-1}, Y_{i+1})$

Probability distribution: $P(Y_1{=}y_1, \ldots, Y_n{=}y_n) \propto \Phi(y_1, y_2) \cdots \Phi(y_{n-1}, y_n)$
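A brute-force check of that independence claim on a 5-node chain with an invented potential table: two contexts that agree on the neighbors of Y_2 but differ far away yield the same conditional.

```python
# Invented pairwise potential on a 5-node chain Y0 - Y1 - Y2 - Y3 - Y4.
VALS = (0, 1)
phi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def weight(ys):
    """Unnormalized score: product of potentials over adjacent pairs."""
    w = 1.0
    for a, b in zip(ys, ys[1:]):
        w *= phi[(a, b)]
    return w

def conditional(context, i):
    """Pr(Y_i = v | all other Y's fixed as in context), by enumeration."""
    num = {v: weight([v if k == i else y for k, y in enumerate(context)])
           for v in VALS}
    z = sum(num.values())
    return {v: num[v] / z for v in VALS}

# Same neighbors (Y1 = 1, Y3 = 1), different far-away Y0 and Y4:
c1 = conditional([0, 1, None, 1, 0], 2)
c2 = conditional([1, 1, None, 1, 1], 2)
assert all(abs(c1[v] - c2[v]) < 1e-12 for v in VALS)
```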
Review: CRFs/Markov Random Fields

[Trellis diagram: tags {B, I, O} at each of the 7 positions, with transitions between adjacent positions.]

"When will prof Cohen post the notes …"

$$P(X_1{=}x_1, \ldots, X_n{=}x_n) = \frac{1}{Z}\, M(x_1, x_2) \cdots M(x_{n-1}, x_n)$$
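A sketch of the 1/Z matrix form above, assuming one invented transfer matrix per adjacent pair of positions over the three tags: Z comes out of a chain of matrix products (the forward recursion), checked against brute-force enumeration.

```python
from itertools import product

import numpy as np

TAGS = ["B", "I", "O"]  # matrix indices 0, 1, 2 correspond to these tags
n = 7  # positions: "When will prof Cohen post the notes"

# Invented transfer matrices: M_t[a, b] = potential of (y_t = a, y_{t+1} = b).
rng = np.random.default_rng(0)
Ms = [rng.uniform(0.5, 2.0, size=(3, 3)) for _ in range(n - 1)]

# Z as a chain of matrix products (the forward recursion).
alpha = np.ones(3)
for M in Ms:
    alpha = alpha @ M
Z = alpha.sum()

def path_weight(path):
    """Product of potentials along one tag path."""
    w = 1.0
    for t in range(n - 1):
        w *= Ms[t][path[t], path[t + 1]]
    return w

# Sanity check against brute-force enumeration of all 3**7 paths.
Z_brute = sum(path_weight(p) for p in product(range(3), repeat=n))
assert abs(Z - Z_brute) / Z < 1e-10
```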
Review: CRFs/Markov Random Fields

Example: "When will prof Cohen post the notes …"

$$P(Y_1{=}y_1, \ldots, Y_n{=}y_n) \propto \prod_{(i,j)\in E} \Phi_{i,j}(y_i, y_j)$$

[Chain diagram Y1 – Y2 – Y3 – Y4 – Y5 – Y6 – Y7]

What's independent: $\Pr(Y_i \mid \text{other } Y\text{'s}) = \Pr(Y_i \mid \text{neighbors of } Y_i)$
Pseudo-likelihood and dependency networks
• Any Markov field defines a (family of) probability distributions D
– But not a simple program for generation/sampling
– We can use MCMC in the general case
• If you have, for each node i, P_D(X_i | Pa_i), that's a dependency net
– Still no simple program for generation/sampling (but we can use Gibbs; see the sketch after this list)
– You can learn these from data using YFCL
– Equivalently: learning this maximizes pseudo-likelihood, just as HMM learning maximizes (real) likelihood on a sequence.
• A weirdness: every MRF has an equivalent dependency net, but not every dependency net (set of local conditionals) has an equivalent MRF
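As promised above, a sketch of Gibbs sampling from the local conditionals on the same invented toy chain: there is no direct ancestral-sampling program, but repeatedly resampling each node given its neighbors works.

```python
import random

# Same invented toy chain as before.
VALS = (0, 1)
phi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def local_conditional(ys, i):
    """P(Y_i = v | neighbors) on a chain MRF: product of touching potentials."""
    num = {}
    for v in VALS:
        w = 1.0
        if i > 0:
            w *= phi[(ys[i - 1], v)]
        if i < len(ys) - 1:
            w *= phi[(v, ys[i + 1])]
        num[v] = w
    z = sum(num.values())
    return {v: num[v] / z for v in VALS}

def gibbs(n=5, sweeps=1000, seed=0):
    """Sample by sweeping the chain, resampling each node from its local conditional."""
    random.seed(seed)
    ys = [random.choice(VALS) for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            dist = local_conditional(ys, i)
            ys[i] = random.choices(VALS, weights=[dist[v] for v in VALS])[0]
    return ys

print(gibbs())
```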
And now for …