
Foundations of Machine Learning

Maximum Entropy Models, Logistic Regression

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation

Probabilistic models:

• density estimation.

• classification.


This Lecture

Notions of information theory.

Introduction to density estimation.

Maxent models.

Conditional Maxent models.

Entropy (Shannon, 1948)

Definition: the entropy of a discrete random variable $X$ with probability mass function $p(x) = \Pr[X = x]$ is
$$H(X) = -\mathbb{E}[\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

Properties:

• measure of uncertainty of $X$: $H(X) \geq 0$.

• maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log N.$$

Entropy

Base of logarithm: not critical; for base 2, $-\log_2(p(x))$ is the number of bits needed to represent $p(x)$.

Definition and notation: the entropy of a distribution $p$ is defined by the same quantity and denoted by $H(p)$.

Special case of Rényi entropy (Rényi, 1961).

Binary entropy: $H(p) = -p \log p - (1 - p)\log(1 - p)$.
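As a quick illustration (not from the original slides), here is a minimal numpy sketch of these quantities; the function names and the toy distributions are assumptions made for this example:

```python
import numpy as np

def entropy(p, base=np.e):
    # H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def binary_entropy(p, base=np.e):
    # H(p) = -p log p - (1 - p) log(1 - p).
    return entropy([p, 1.0 - p], base=base)

# The uniform distribution maximizes entropy: H = log N (here log2(8) = 3 bits).
print(entropy(np.ones(8) / 8, base=2))   # 3.0
print(binary_entropy(0.5, base=2))       # 1.0 bit, the maximum of the binary entropy
print(binary_entropy(0.1, base=2))       # ~0.47 bits
```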

Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951)

Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions $p$ and $q$ (discrete case) is
$$D(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = +\infty$.

Properties:

• asymmetric: in general, $D(p \,\|\, q) \neq D(q \,\|\, p)$ for $p \neq q$.

• non-negative: $D(p \,\|\, q) \geq 0$ for all $p$ and $q$.

• definite: $(D(p \,\|\, q) = 0) \Rightarrow (p = q)$.

Non-Negativity of Relative Entropy

By the concavity of log and Jensen's inequality,
$$-D(p \,\|\, q) = \sum_{x\colon p(x) > 0} p(x) \log\left(\frac{q(x)}{p(x)}\right) \leq \log\left(\sum_{x\colon p(x) > 0} p(x)\,\frac{q(x)}{p(x)}\right) = \log\left(\sum_{x\colon p(x) > 0} q(x)\right) \leq \log(1) = 0.$$
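For illustration (not part of the slides), a small numpy sketch of the relative entropy with the conventions above; the toy distributions are arbitrary choices:

```python
import numpy as np

def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0 and p log(p/0) = +inf.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q) >= 0)                      # True: non-negativity
print(kl_divergence(p, p))                           # 0.0: definiteness
print(kl_divergence(p, q) == kl_divergence(q, p))    # False: asymmetry
```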

Bregman Divergence (Bregman, 1967)

[Figure: $B_F(y \,\|\, x)$ shown as the gap between $F(y)$ and the first-order approximation $F(x) + (y - x)\cdot\nabla F(x)$.]

Definition: let $F$ be a convex and differentiable function defined over a convex set $C$ in a Hilbert space $H$. Then, the Bregman divergence $B_F$ associated to $F$ is defined by
$$B_F(x \,\|\, y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle.$$

Bregman Divergence

Examples (divergence $B_F(x \,\|\, y)$ and generating function $F(x)$):

• Squared L2-distance: $B_F(x \,\|\, y) = \|x - y\|^2$, with $F(x) = \|x\|^2$.

• Mahalanobis distance: $B_F(x \,\|\, y) = (x - y)^\top K^{-1} (x - y)$, with $F(x) = x^\top K^{-1} x$.

• Unnormalized relative entropy: $B_F(x \,\|\, y) = \widetilde{D}(x \,\|\, y)$, with $F(x) = \sum_{i \in I} x_i \log x_i - x_i$.

• note: the relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy
$$\widetilde{D}(p \,\|\, q) = \sum_{x \in \mathcal{X}} \left[ p(x) \log \frac{p(x)}{q(x)} \right] + \left[ q(x) - p(x) \right].$$
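Two of the examples above can be checked numerically; this sketch is illustrative only, and the specific vectors are arbitrary:

```python
import numpy as np

def bregman(F, grad_F, x, y):
    # B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>.
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# F(x) = ||x||^2 gives B_F(x || y) = ||x - y||^2.
x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
print(bregman(lambda v: np.dot(v, v), lambda v: 2 * v, x, y))  # 1.25
print(np.sum((x - y) ** 2))                                    # 1.25

# F(x) = sum_i x_i log x_i - x_i gives the unnormalized relative entropy.
neg_ent = lambda v: np.sum(v * np.log(v) - v)
p, q = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
print(bregman(neg_ent, np.log, p, q))
print(np.sum(p * np.log(p / q) + (q - p)))                     # same value
```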

Conditional Relative Entropy

Definition: let $p$ and $q$ be two probability distributions over $\mathcal{X} \times \mathcal{Y}$. Then, the conditional relative entropy of $p$ and $q$ with respect to distribution $r$ over $\mathcal{X}$ is defined by
$$\mathbb{E}_{X \sim r}\Big[ D\big(p(\cdot \mid X) \,\|\, q(\cdot \mid X)\big) \Big] = \sum_{x \in \mathcal{X}} r(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = D(\widetilde{p} \,\|\, \widetilde{q}),$$
with $\widetilde{p}(x, y) = r(x)\, p(y \mid x)$, $\widetilde{q}(x, y) = r(x)\, q(y \mid x)$, and the conventions $0 \log 0 = 0$, $0 \log \frac{0}{0} = 0$, and $p \log \frac{p}{0} = +\infty$.

• note: the definition of conditional relative entropy is not intrinsic; it depends on a third distribution $r$.

This Lecture

Notions of information theory.

Introduction to density estimation.

Maxent models.

Conditional Maxent models.

Density Estimation Problem

Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$

Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.

Maximum Likelihood Solution

Maximum Likelihood principle: select distribution $p \in \mathcal{P}$ maximizing the likelihood of the observed sample $S$,
$$p_{\mathrm{ML}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \mid p] = \operatorname*{argmax}_{p \in \mathcal{P}} \prod_{i=1}^{m} p(x_i) = \operatorname*{argmax}_{p \in \mathcal{P}} \sum_{i=1}^{m} \log p(x_i).$$

Relative Entropy Formulation

Lemma: let $\widehat{p}_S$ be the empirical distribution for sample $S$; then
$$p_{\mathrm{ML}} = \operatorname*{argmin}_{p \in \mathcal{P}} D(\widehat{p}_S \,\|\, p).$$

Proof:
$$\begin{aligned}
D(\widehat{p}_S \,\|\, p) &= \sum_{x} \widehat{p}_S(x) \log \widehat{p}_S(x) - \sum_{x} \widehat{p}_S(x) \log p(x) \\
&= -H(\widehat{p}_S) - \sum_{x} \frac{\sum_{i=1}^{m} 1_{x = x_i}}{m} \log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^{m} \sum_{x} \frac{1_{x = x_i}}{m} \log p(x) \\
&= -H(\widehat{p}_S) - \sum_{i=1}^{m} \frac{\log p(x_i)}{m}.
\end{aligned}$$
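As an aside not in the slides, the lemma can be checked on a toy categorical sample: the average log-likelihood and $-D(\widehat{p}_S \,\|\, p)$ differ only by the constant $H(\widehat{p}_S)$, so both criteria rank candidate distributions identically. The support and sample below are made up for the example:

```python
import numpy as np
from collections import Counter

support = ["a", "b", "c"]
sample = ["a", "a", "b", "a", "c", "b"]

counts = Counter(sample)
p_hat = np.array([counts[x] / len(sample) for x in support])   # empirical distribution

def kl(p, q):
    mask = p > 0
    return np.inf if np.any(q[mask] == 0) else np.sum(p[mask] * np.log(p[mask] / q[mask]))

for p in [p_hat, np.array([1/3, 1/3, 1/3])]:
    avg_log_lik = np.mean([np.log(p[support.index(x)]) for x in sample])
    # By the lemma, avg_log_lik = -H(p_hat) - D(p_hat || p), so the two printed
    # quantities differ by a constant and are both maximized at p = p_hat.
    print(avg_log_lik, -kl(p_hat, p))
```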

Maximum a Posteriori (MAP)

Maximum a Posteriori principle: select distribution $p \in \mathcal{P}$ that is the most likely, given the observed sample $S$ and assuming a prior distribution $\Pr[p]$ over $\mathcal{P}$,
$$p_{\mathrm{MAP}} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[p \mid S] = \operatorname*{argmax}_{p \in \mathcal{P}} \frac{\Pr[S \mid p]\,\Pr[p]}{\Pr[S]} = \operatorname*{argmax}_{p \in \mathcal{P}} \Pr[S \mid p]\,\Pr[p].$$

• note: for a uniform prior, ML = MAP.

This Lecture

Notions of information theory.

Introduction to density estimation.

Maxent models.

Conditional Maxent models.

Density Estimation + Features

Training data: sample $S$ of size $m$ drawn i.i.d. from set $\mathcal{X}$ according to some distribution $D$,
$$S = (x_1, \ldots, x_m).$$

Features: mapping $\Phi$ associated to elements of $\mathcal{X}$,
$$\Phi\colon \mathcal{X} \to \mathbb{R}^N, \qquad x \mapsto \Phi(x) = \begin{bmatrix} \Phi_1(x) \\ \vdots \\ \Phi_N(x) \end{bmatrix}.$$

Problem: find distribution $p$ out of hypothesis set $\mathcal{P}$ that best estimates $D$.

• for simplicity, in what follows, $\mathcal{X}$ is assumed to be finite.

Features

Feature functions $\Phi_j$ assumed to be in a family $H$, with $\|\Phi\|_\infty \leq \Lambda$.

Examples of $H$:

• family of threshold functions $\{x \mapsto 1_{x_i \leq \theta} \colon x \in \mathbb{R}^N, \theta \in \mathbb{R}\}$ defined over the $N$ variables.

• functions defined via decision trees with larger depths.

• $k$-degree monomials of the original features.

• zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag).

Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)

Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\Big\| \mathbb{E}_{x \sim D}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq 2\mathfrak{R}_m(H) + \Lambda \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Maxent principle: find distribution $p$ that is closest to a prior distribution $p_0$ (typically the uniform distribution) while verifying
$$\Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{D}}[\Phi(x)] \Big\|_\infty \leq \lambda.$$

Closeness is measured using the relative entropy.

• note: no set $\mathcal{P}$ needed to be specified.

Maxent Formulation

Optimization problem:
$$\min_{p \in \Delta} D(p \,\|\, p_0) \quad \text{subject to:} \quad \Big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim S}[\Phi(x)] \Big\|_\infty \leq \lambda.$$

• convex optimization problem, unique solution.

• $\lambda = 0$: standard Maxent (or unregularized Maxent).

• $\lambda > 0$: regularized Maxent.

Relation with Entropy

Relationship with entropy: for a uniform prior $p_0$,
$$D(p \,\|\, p_0) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{p_0(x)} = -\sum_{x \in \mathcal{X}} p(x) \log p_0(x) + \sum_{x \in \mathcal{X}} p(x) \log p(x) = \log |\mathcal{X}| - H(p).$$

Maxent Problem

Optimization: convex optimization problem.
$$\begin{aligned}
\min_{p} \quad & \sum_{x \in \mathcal{X}} p(x) \log p(x) \\
\text{subject to:} \quad & p(x) \geq 0, \; \forall x \in \mathcal{X} \\
& \sum_{x \in \mathcal{X}} p(x) = 1 \\
& \Bigg| \sum_{x \in \mathcal{X}} p(x)\, \Phi_j(x) - \frac{1}{m} \sum_{i=1}^{m} \Phi_j(x_i) \Bigg| \leq \lambda, \; \forall j \in [1, N].
\end{aligned}$$
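As a sketch of how this convex program might be solved numerically (not from the slides), one can hand it directly to a generic convex solver. The finite support, the two feature functions, the empirical average mu_hat, and the value of lam below are made-up assumptions:

```python
import cvxpy as cp
import numpy as np

F = np.array([[0., 1., 2., 3., 4.],     # Phi_1(x) on a 5-point support
              [1., 0., 1., 0., 1.]])    # Phi_2(x)
mu_hat = np.array([2.2, 0.5])           # empirical feature average (1/m) sum_i Phi(x_i)
lam = 0.05

p = cp.Variable(5, nonneg=True)
objective = cp.Maximize(cp.sum(cp.entr(p)))       # entr(p) = -p log p, so this is H(p)
constraints = [cp.sum(p) == 1,
               cp.abs(F @ p - mu_hat) <= lam]     # |E_p[Phi_j] - empirical avg| <= lam
cp.Problem(objective, constraints).solve()
print(p.value)   # maxent distribution; by duality it has the Gibbs form of the next slide
```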

Gibbs Distributions

Gibbs distributions: set $\mathcal{Q}$ of distributions $p_w$ with $w \in \mathbb{R}^N$,
$$p_w[x] = \frac{p_0[x] \exp\big(w \cdot \Phi(x)\big)}{Z} = \frac{p_0[x] \exp\big(\sum_{j=1}^{N} w_j \Phi_j(x)\big)}{Z},
\quad \text{with } Z = \sum_{x} p_0[x] \exp\big(w \cdot \Phi(x)\big).$$

Rich family:

• for linear and quadratic features: includes Gaussians and other distributions with non-PSD quadratic forms in exponents.

• for higher-degree polynomials of raw features: more complex multi-modal distributions.
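A small numpy sketch of a Gibbs distribution over a finite support (the support, prior, features, and weights below are invented for illustration):

```python
import numpy as np

def gibbs(w, Phi, p0):
    # p_w[x] = p0[x] exp(w . Phi(x)) / Z, computed in log-space for stability.
    scores = Phi @ w + np.log(p0)
    scores -= scores.max()
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()

Phi = np.array([[0., 1.], [1., 0.], [1., 1.], [2., 1.]])  # Phi(x) for 4 support points
p0 = np.full(4, 0.25)                                     # uniform prior
w = np.array([0.7, -0.3])
p_w = gibbs(w, Phi, p0)
print(p_w, p_w.sum())   # a valid distribution over the support
```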

Dual Problems

Regularized Maxent problem:
$$\min_{p} F(p) = \overline{D}(p \,\|\, p_0) + I_C\big(\mathbb{E}_p[\Phi]\big),$$
with

• $\overline{D}(p \,\|\, p_0) = D(p \,\|\, p_0)$ if $p \in \Delta$, $+\infty$ otherwise;

• $C = \{u \colon \|u - \mathbb{E}_S[\Phi]\|_\infty \leq \lambda\}$;

• $I_C(x) = 0$ if $x \in C$, $I_C(x) = +\infty$ otherwise.

Regularized Maximum Likelihood problem with Gibbs distributions:
$$\sup_{w} G(w) = \frac{1}{m} \sum_{i=1}^{m} \log \frac{p_w[x_i]}{p_0[x_i]} - \lambda \|w\|_1.$$

Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015)

Theorem: the regularized Maxent and ML with Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} G(w) = \min_{p} F(p).$$

• furthermore, let $p^* = \operatorname*{argmin}_p F(p)$; then, for any $\epsilon > 0$,
$$\Big( \big| G(w) - \sup_{w \in \mathbb{R}^N} G(w) \big| < \epsilon \Big) \Rightarrow \Big( D(p^* \,\|\, p_w) \leq \epsilon \Big).$$

Notes

Maxent formulation:

• no explicit restriction to a family of distributions $\mathcal{P}$.

• but the solution coincides with regularized ML with a specific family $\mathcal{P}$!

• more general Bregman divergence-based formulation.

L1-Regularized Maxent (Kazama and Tsujii, 2003)

Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: equivalent to MAP with a Laplacian prior $q_{\mathrm{prior}}(w)$ (Williams, 1994),
$$\max_{w} \log \Big( \prod_{i=1}^{m} p_w[x_i]\, q_{\mathrm{prior}}(w) \Big), \quad \text{with } q_{\mathrm{prior}}(w) = \prod_{j=1}^{N} \frac{\lambda_j}{2} \exp(-\lambda_j |w_j|).$$
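One standard way to attack this non-smooth objective is proximal gradient descent with soft-thresholding for the L1 term; the sketch below is only an illustration under assumed toy data, not an algorithm prescribed by the slides. The gradient of the smooth part is $\mathbb{E}_{p_w}[\Phi] - \frac{1}{m}\sum_i \Phi(x_i)$:

```python
import numpy as np

def l1_maxent(Phi, mu_hat, lam, eta=0.1, n_iter=5000):
    # Proximal gradient on  lam ||w||_1 - (1/m) sum_i log p_w[x_i]
    # over a finite support with a uniform prior. Phi: (num_points, N).
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        scores = Phi @ w
        p_w = np.exp(scores - scores.max())
        p_w /= p_w.sum()                        # Gibbs distribution p_w
        grad = Phi.T @ p_w - mu_hat             # E_{p_w}[Phi] - empirical average
        w = w - eta * grad
        w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)   # soft-threshold
    return w

Phi = np.array([[0., 1.], [1., 0.], [1., 1.], [2., 1.]])
mu_hat = np.array([1.2, 0.8])    # hypothetical empirical feature average
print(l1_maxent(Phi, mu_hat, lam=0.01))
```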

Generalization Guarantee (Dudík et al., 2007)

Notation: $L_D(w) = \mathbb{E}_{x \sim D}[-\log p_w[x]]$, $L_S(w) = \mathbb{E}_{x \sim S}[-\log p_w[x]]$.

Theorem: Fix $\delta > 0$. Let $\widehat{w}$ be the solution of the L1-regularized Maxent problem for $\lambda = 2\mathfrak{R}_m(H) + \Lambda\sqrt{\log(\frac{2}{\delta})/(2m)}$. Then, with probability at least $1 - \delta$,
$$L_D(\widehat{w}) \leq \inf_{w} \left\{ L_D(w) + 2\|w\|_1 \left[ 2\mathfrak{R}_m(H) + \Lambda\sqrt{\frac{\log \frac{2}{\delta}}{2m}} \right] \right\}.$$

Proof

By Hölder's inequality and the concentration bound for average feature vectors,
$$L_D(\widehat{w}) - L_S(\widehat{w}) = \widehat{w} \cdot \big[\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big] \leq \|\widehat{w}\|_1\, \big\|\mathbb{E}_S[\Phi] - \mathbb{E}_D[\Phi]\big\|_\infty \leq \lambda \|\widehat{w}\|_1.$$

Since $\widehat{w}$ is a minimizer,
$$\begin{aligned}
L_D(\widehat{w}) - L_D(w) &= L_D(\widehat{w}) - L_S(\widehat{w}) + L_S(\widehat{w}) - L_D(w) \\
&\leq \lambda \|\widehat{w}\|_1 + L_S(\widehat{w}) - L_D(w) \\
&\leq \lambda \|w\|_1 + L_S(w) - L_D(w) \leq 2\lambda \|w\|_1.
\end{aligned}$$

L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001)

Different relaxations:

• L1 constraints: $\forall j \in [1, N], \; \big| \mathbb{E}_{x \sim p}[\Phi_j(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi_j(x)] \big| \leq \lambda_j$.

• L2 constraints: $\big\| \mathbb{E}_{x \sim p}[\Phi(x)] - \mathbb{E}_{x \sim \widehat{p}}[\Phi(x)] \big\|_2 \leq B$.

L2-Regularized Maxent

Optimization problem:
$$\inf_{w \in \mathbb{R}^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[x_i], \quad \text{where } p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: equivalent to MAP with a Gaussian prior $q_{\mathrm{prior}}(w)$ (Goodman, 2004),
$$\max_{w} \log \Big( \prod_{i=1}^{m} p_w[x_i]\, q_{\mathrm{prior}}(w) \Big), \quad \text{with } q_{\mathrm{prior}}(w) = \prod_{j=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{w_j^2}{2\sigma^2}}.$$

This Lecture

Notions of information theory.

Introduction to density estimation.

Maxent models.

Conditional Maxent models.

Conditional Maxent Models

Maxent models for conditional probabilities:

• conditional probability modeling each class.

• use in multi-class classification.

• can use different features for each class.

• a.k.a. multinomial logistic regression.

• logistic regression: special case of two classes.


Problem

Data: sample $S$ drawn i.i.d. according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m.$$

• $\mathcal{Y} = \{1, \ldots, k\}$, or $\mathcal{Y} = \{0, 1\}^k$ in the multi-label case.

Features: mapping $\Phi\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$.

Problem: find accurate conditional probability models $\Pr[\cdot \mid x]$, $x \in \mathcal{X}$, based on the sample $S$.

Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015)

Idea: the empirical feature vector average is close to its expectation. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\bigg\| \mathop{\mathbb{E}}_{\substack{x \sim \widehat{p} \\ y \sim D[\cdot \mid x]}}[\Phi(x, y)] - \mathop{\mathbb{E}}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot \mid x]}}[\Phi(x, y)] \bigg\|_\infty \leq 2\mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Maxent principle: find conditional distributions $p[\cdot \mid x]$ that are closest to priors $p_0[\cdot \mid x]$ (typically uniform distributions) while verifying
$$\bigg\| \mathop{\mathbb{E}}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot \mid x]}}[\Phi(x, y)] - \mathop{\mathbb{E}}_{\substack{x \sim \widehat{p} \\ y \sim \widehat{p}[\cdot \mid x]}}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$

Closeness is measured using the conditional relative entropy based on $\widehat{p}$.

Conditional Maxent Formulation (Berger et al., 1996; Cortes et al., 2015)

Optimization problem: find distribution $p$ solution of
$$\min_{p[\cdot \mid x] \in \Delta} \sum_{x \in \mathcal{X}} \widehat{p}[x]\, D\big(p[\cdot \mid x] \,\|\, p_0[\cdot \mid x]\big)
\quad \text{s.t.} \quad \bigg\| \mathop{\mathbb{E}}_{x \sim \widehat{p}}\Big[ \mathop{\mathbb{E}}_{y \sim p[\cdot \mid x]}[\Phi(x, y)] \Big] - \mathop{\mathbb{E}}_{(x, y) \sim S}[\Phi(x, y)] \bigg\|_\infty \leq \lambda.$$

• convex optimization problem, unique solution.

• $\lambda = 0$: unregularized conditional Maxent.

• $\lambda > 0$: regularized conditional Maxent.

Dual Problems

Regularized conditional Maxent problem:
$$\widetilde{F}(p) = \mathop{\mathbb{E}}_{x \sim \widehat{p}}\Big[ D\big(p[\cdot \mid x] \,\|\, p_0[\cdot \mid x]\big) \Big] + I_\Delta\big(p[\cdot \mid x]\big) + I_C\Big( \mathop{\mathbb{E}}_{\substack{x \sim \widehat{p} \\ y \sim p[\cdot \mid x]}}[\Phi] \Big).$$

Regularized Maximum Likelihood problem with conditional Gibbs distributions:
$$\widetilde{G}(w) = \frac{1}{m} \sum_{i=1}^{m} \log \frac{p_w[y_i \mid x_i]}{p_0[y_i \mid x_i]} - \lambda \|w\|_1,$$
where, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y \mid x] = \frac{p_0[y \mid x] \exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \qquad Z(x) = \sum_{y \in \mathcal{Y}} p_0[y \mid x] \exp\big(w \cdot \Phi(x, y)\big).$$
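For concreteness (an illustration, not from the slides), the conditional Gibbs distribution for a single $x$ with a uniform prior is a softmax over the scores $w \cdot \Phi(x, y)$; the feature values and weights below are arbitrary:

```python
import numpy as np

def conditional_gibbs(w, Phi_xy):
    # p_w[y | x] = exp(w . Phi(x, y)) / Z(x); Phi_xy: (num_labels, N), rows Phi(x, y).
    scores = Phi_xy @ w
    scores -= scores.max()            # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

Phi_xy = np.array([[1.0, 0.0],        # Phi(x, y=0)
                   [0.0, 1.0],        # Phi(x, y=1)
                   [0.5, 0.5]])       # Phi(x, y=2)
w = np.array([0.2, -0.1])
print(conditional_gibbs(w, Phi_xy))   # conditional distribution over the 3 labels
```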

Duality Theorem (Cortes et al., 2015)

Theorem: the regularized conditional Maxent and ML with conditional Gibbs distributions problems are equivalent,
$$\sup_{w \in \mathbb{R}^N} \widetilde{G}(w) = \min_{p} \widetilde{F}(p).$$

• furthermore, let $p^* = \operatorname*{argmin}_p \widetilde{F}(p)$; then, for any $\epsilon > 0$,
$$\Big( \big| \widetilde{G}(w) - \sup_{w \in \mathbb{R}^N} \widetilde{G}(w) \big| < \epsilon \Big) \Rightarrow \mathop{\mathbb{E}}_{x \sim \widehat{p}}\Big[ D\big(p^*[\cdot \mid x] \,\|\, p_w[\cdot \mid x]\big) \Big] \leq \epsilon.$$

Regularized Conditional Maxent (Berger et al., 1996; Cortes et al., 2015)

Optimization problem: convex optimizations, regularization parameter $\lambda \geq 0$.
$$\min_{w \in \mathbb{R}^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[y_i \mid x_i]
\qquad \text{or} \qquad
\min_{w \in \mathbb{R}^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[y_i \mid x_i],$$
where, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
$$p_w[y \mid x] = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{Z(x)}, \qquad Z(x) = \sum_{y \in \mathcal{Y}} \exp\big(w \cdot \Phi(x, y)\big).$$

More Explicit Forms

Optimization problem:
$$\min_{w \in \mathbb{R}^N} \; \lambda \|w\|_1 \;(\text{or } \lambda \|w\|_2^2) \; + \frac{1}{m} \sum_{i=1}^{m} \log\bigg[ \sum_{y \in \mathcal{Y}} \exp\Big( w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \Big) \bigg].$$

Equivalently,
$$\min_{w \in \mathbb{R}^N} \; \lambda \|w\|_1 \;(\text{or } \lambda \|w\|_2^2) \; - w \cdot \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i, y_i) + \frac{1}{m} \sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} e^{w \cdot \Phi(x_i, y)}.$$
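A possible numpy/scipy sketch of the first (L2-regularized) form and its gradient, fit by plain gradient descent; the data shapes, random toy sample, step size, and iteration count are assumptions made for illustration:

```python
import numpy as np
from scipy.special import logsumexp

def objective(w, Phi, y, lam):
    # lam ||w||_2^2 + (1/m) sum_i log sum_y exp(w.Phi(x_i,y) - w.Phi(x_i,y_i)).
    m = Phi.shape[0]
    scores = Phi @ w                                    # shape (m, num_labels)
    margins = scores - scores[np.arange(m), y][:, None]
    return lam * np.dot(w, w) + np.mean(logsumexp(margins, axis=1))

def gradient(w, Phi, y, lam):
    m = Phi.shape[0]
    scores = Phi @ w
    p = np.exp(scores - logsumexp(scores, axis=1, keepdims=True))   # p_w[y | x_i]
    expected = np.einsum('iy,iyn->in', p, Phi).mean(axis=0)         # E_{p_w}[Phi]
    observed = Phi[np.arange(m), y].mean(axis=0)                    # (1/m) sum_i Phi(x_i, y_i)
    return 2 * lam * w + expected - observed

rng = np.random.default_rng(0)
Phi = rng.normal(size=(4, 3, 2))      # Phi[i, y] = Phi(x_i, y): 4 points, 3 labels, N=2
y = np.array([0, 2, 1, 0])
w = np.zeros(2)
for _ in range(2000):                 # plain gradient descent
    w -= 0.1 * gradient(w, Phi, y, lam=0.1)
print(w, objective(w, Phi, y, lam=0.1))
```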

Related Problem

Optimization problem: log-sum-exp replaced by max.
$$\min_{w \in \mathbb{R}^N} \; \lambda \|w\|_1 \;(\text{or } \lambda \|w\|_2^2) \; + \frac{1}{m} \sum_{i=1}^{m} \underbrace{\max_{y \in \mathcal{Y}} \Big( w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \Big)}_{-\rho_w(x_i, y_i)}.$$

Common Feature Choice

Multi-class features:
$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_{y-1} \\ w_y \\ w_{y+1} \\ \vdots \\ w_{|\mathcal{Y}|} \end{bmatrix}, \qquad
\Phi(x, y) = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \Phi(x) \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad
w \cdot \Phi(x, y) = w_y \cdot \Phi(x).$$

L2-regularized conditional Maxent optimization:
$$\min_{w \in \mathbb{R}^N} \; \lambda \sum_{y \in \mathcal{Y}} \|w_y\|_2^2 + \frac{1}{m} \sum_{i=1}^{m} \log\bigg[ \sum_{y \in \mathcal{Y}} \exp\Big( w_y \cdot \Phi(x_i) - w_{y_i} \cdot \Phi(x_i) \Big) \bigg].$$
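The block construction can be written out directly; this snippet (an illustration with made-up numbers) just checks that $w \cdot \Phi(x, y) = w_y \cdot \Phi(x)$:

```python
import numpy as np

def multiclass_features(phi_x, y, num_labels):
    # Phi(x, y): the base feature vector placed in the y-th block, zeros elsewhere.
    d = len(phi_x)
    out = np.zeros(num_labels * d)
    out[y * d:(y + 1) * d] = phi_x
    return out

phi_x = np.array([1.0, 2.0, 3.0])
w = np.arange(9, dtype=float)                    # stacked blocks w_0, w_1, w_2
Phi_x_1 = multiclass_features(phi_x, 1, num_labels=3)
print(np.dot(w, Phi_x_1))                        # 26.0
print(np.dot(w[3:6], phi_x))                     # 26.0 = w_1 . phi_x
```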

Prediction

Prediction with $p_w[y \mid x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}$:
$$\widehat{y}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} p_w[y \mid x] = \operatorname*{argmax}_{y \in \mathcal{Y}} w \cdot \Phi(x, y).$$

Binary Classification

Simpler expression:
$$\begin{aligned}
\sum_{y \in \mathcal{Y}} \exp\Big( w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \Big)
&= e^{w \cdot \Phi(x_i, +1) - w \cdot \Phi(x_i, y_i)} + e^{w \cdot \Phi(x_i, -1) - w \cdot \Phi(x_i, y_i)} \\
&= 1 + e^{-y_i\, w \cdot [\Phi(x_i, +1) - \Phi(x_i, -1)]} \\
&= 1 + e^{-y_i\, w \cdot \Psi(x_i)},
\end{aligned}$$
with $\Psi(x) = \Phi(x, +1) - \Phi(x, -1)$.

Logistic Regression (Berkson, 1944)

Binary case of conditional Maxent.

Optimization problem:
$$\min_{w \in \mathbb{R}^N} \; \lambda \|w\|_1 \;(\text{or } \lambda \|w\|_2^2) \; + \frac{1}{m} \sum_{i=1}^{m} \log\Big[ 1 + e^{-y_i\, w \cdot \Psi(x_i)} \Big].$$

• convex optimization.

• variety of solutions: SGD, coordinate descent, etc.

• coordinate descent: similar to AdaBoost, with the logistic loss $\phi(-u) = \log_2(1 + e^{-u}) \geq 1_{u \leq 0}$ instead of the exponential loss.
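A minimal gradient-descent sketch of the L2-regularized version of this objective on synthetic data; the data generation, step size, and regularization value are assumptions made for the example, and SGD or coordinate descent would work equally well:

```python
import numpy as np

def fit_logistic_l2(Psi, y, lam=0.1, eta=0.1, n_iter=2000):
    # Gradient descent on  lam ||w||_2^2 + (1/m) sum_i log(1 + exp(-y_i w.Psi(x_i))).
    m, N = Psi.shape
    w = np.zeros(N)
    for _ in range(n_iter):
        margins = y * (Psi @ w)
        coeff = -y / (1.0 + np.exp(margins))   # per-example derivative factor
        w -= eta * (2 * lam * w + (Psi.T @ coeff) / m)
    return w

rng = np.random.default_rng(0)
Psi = rng.normal(size=(100, 2))                          # rows are Psi(x_i)
y = np.sign(Psi @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=100))
w = fit_logistic_l2(Psi, y)
print(w, np.mean(np.sign(Psi @ w) == y))                 # weights and training accuracy
```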

Generalization Bound

Theorem: assume that $\pm\Phi_j \in H$ for all $j \in [1, N]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $f\colon x \mapsto w \cdot \Phi(x)$,
$$R(f) \leq \frac{1}{m} \sum_{i=1}^{m} \log_{u_0}\Big( 1 + e^{-y_i\, w \cdot \Phi(x_i)} \Big) + 4\|w\|_1 \mathfrak{R}_m(H) + \sqrt{\frac{\log \log_2 (2\|w\|_1)}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{m}},$$
where $u_0 = \log(1 + 1/e)$.

Proof

Proof: by the learning bound for convex ensembles holding uniformly for all $\rho$, with probability at least $1 - \delta$, for all $f$ and $\rho > 0$,
$$R(f) \leq \frac{1}{m} \sum_{i=1}^{m} 1_{\frac{y_i\, w \cdot \Phi(x_i)}{\rho \|w\|_1} - 1 \leq 0} + \frac{4}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{m}}.$$

Choosing $\rho = \frac{1}{\|w\|_1}$ and using $1_{u \leq 0} \leq \log_{u_0}(1 + e^{-u - 1})$ yields immediately the learning bound of the theorem.

Logistic Regression (Berkson, 1944)

Logistic model:
$$\Pr[y = +1 \mid x] = \frac{e^{w \cdot \Phi(x, +1)}}{Z(x)}, \quad \text{where } Z(x) = e^{w \cdot \Phi(x, +1)} + e^{w \cdot \Phi(x, -1)}.$$

Properties:

• linear decision rule, sign of the log-odds ratio:
$$\log \frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]} = w \cdot \big( \Phi(x, +1) - \Phi(x, -1) \big) = w \cdot \Psi(x).$$

• logistic form:
$$\Pr[y = +1 \mid x] = \frac{1}{1 + e^{-w \cdot [\Phi(x, +1) - \Phi(x, -1)]}} = \frac{1}{1 + e^{-w \cdot \Psi(x)}}.$$

Logistic/Sigmoid Function

$$f\colon x \mapsto \frac{1}{1 + e^{-x}}, \qquad \Pr[y = +1 \mid x] = f\big( w \cdot \Psi(x) \big).$$

Applications

Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010).

Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011).

Computer vision (Jeon and Manmatha, 2004).

Extensions

Extensive theoretical study of alternative regularizations: (Dudík et al., 2007) (see also (Altun and Smola, 2006), though some proofs are unclear).

Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)).

Structural Maxent models (Cortes et al., 2015):

• extension to the case of multiple feature families.

• empirically outperform Maxent and L1-Maxent.

• conditional structural Maxent: coincides with deep boosting using the logistic loss.


Conclusion

Logistic regression/Maxent models:

• theoretical foundation.

• natural solution when probabilities are required.

• widely used for density estimation/classification.

• often very effective in practice.

• distributed optimization solutions.

• no natural non-linear L1-version (use of kernels).

• connections with boosting.

• connections with neural networks.


References

• Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of COLT, pages 139-153, 2006.

• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.

• Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American Statistical Association 39, 357–365.

• Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.

• Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent. In Proceedings of ICML, 2015.

• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 205-237, 1984.

• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), pp. 1409-1413, 1989.

• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), pp. 1470-1480, 1972.

• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19:4, pp.380--393, April, 1997.

• Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007.

• E. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.

• E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.

• Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–86, 1951.

• O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.

• Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-561. University of California Press, 1961.

• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187-228, 1996.

• Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.