
An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators

by Percy Liang and Michael Jordan

(ICML 2008)

Presented by Lihan He

ECE, Duke University

June 27, 2008

Outline

Introduction

Exponential family estimators

• Generative

• Fully discriminative

• Pseudolikelihood discriminative

Asymptotic analysis

Experiments

Conclusions

Introduction

Data points are not considered to be drawn independently.

There are correlations between data points.

Given data $z = (x, y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we have to consider the joint distribution over all the data points.

Correspondingly, the overall likelihood is not the product of the likelihoods of the individual data points:

$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$
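The chain-rule factorization above can be checked numerically on a tiny case. This is a minimal sketch: the joint table `p` over two binary data points is made up for illustration and does not come from the paper.

```python
# Hypothetical joint distribution over two correlated binary data points (z1, z2).
# The numbers are illustrative only.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(z1) and p(z2).
p_z1 = {a: sum(p[(a, b)] for b in (0, 1)) for a in (0, 1)}
p_z2 = {b: sum(p[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Conditional p(z2 = b | z1 = a), keyed by (a, b).
p_z2_given_z1 = {(a, b): p[(a, b)] / p_z1[a] for (a, b) in p}

# Chain rule: p(z1, z2) = p(z1) * p(z2 | z1) recovers the joint.
chain = {(a, b): p_z1[a] * p_z2_given_z1[(a, b)] for (a, b) in p}

# Product of marginals: only correct if the points were independent.
indep = {(a, b): p_z1[a] * p_z2[b] for (a, b) in p}

chain_matches = all(abs(chain[k] - p[k]) < 1e-12 for k in p)
indep_matches = all(abs(indep[k] - p[k]) < 1e-12 for k in p)
```

Here `chain_matches` is true while `indep_matches` is false, which is the point of the slide: with correlated data points the likelihood does not factor into per-point terms.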

Introduction

Generative vs. Discriminative

Generative model:

• A model for randomly generating observed data;

• Learns a joint probability distribution over both observations and labels:

$p(x, y) = p(x_1, \ldots, x_n, y_1, \ldots, y_n)$

Discriminative model:

• A model only of the label variables conditional on the observed data;

• Learns a conditional distribution over labels given observations:

$p(y \mid x) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_n)$
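To make the distinction concrete, the sketch below derives a conditional $p(y \mid x)$ from a joint table. It assumes a single binary observation and a single binary label (the slides deal with whole sequences), and both joint tables are hypothetical. The two joints differ in $p(x)$ but share the same conditional, so a discriminative model cannot tell them apart while a generative model can.

```python
def conditional(joint):
    """Return p(y | x) from a joint table keyed by (x, y)."""
    p_x = {}
    for (x, y), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
    return {(x, y): prob / p_x[x] for (x, y), prob in joint.items()}

# Two hypothetical joints: different p(x), identical p(y | x).
joint_a = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.40}
joint_b = {(0, 0): 0.18, (0, 1): 0.02, (1, 0): 0.16, (1, 1): 0.64}

cond_a = conditional(joint_a)
cond_b = conditional(joint_b)

# The conditionals agree even though the joints do not.
same_conditional = all(abs(cond_a[k] - cond_b[k]) < 1e-12 for k in cond_a)
same_joint = all(abs(joint_a[k] - joint_b[k]) < 1e-12 for k in joint_a)
```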

Introduction

Full Likelihood vs. Pseudolikelihood

Full likelihood:

$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$

• Could be intractable;

• Computationally inefficient.

Pseudolikelihood:

$p(z) \approx \prod_i p(z_i \mid z_j \text{ for all } \{z_i, z_j\} \in E)$

where $E$ is a set of dependencies between data points.

• An approximation of the full likelihood;

• Computationally more efficient.
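The computational difference can be sketched on a small binary chain. Everything here is an illustrative assumption rather than the paper's setup: the model scores a configuration by a weight `w` for each agreeing pair of neighbors, and the dependency set is the set of adjacent pairs. The full likelihood needs a normalizer over all $2^n$ configurations, while each pseudolikelihood term normalizes over the two values of a single variable.

```python
import itertools
import math

def score(z, w):
    """Unnormalized log-probability: w for each agreeing pair of neighbors."""
    return w * sum(1 for a, b in zip(z, z[1:]) if a == b)

def full_log_likelihood(z, w):
    # The normalizer sums over all 2^n configurations (exponential in n).
    log_Z = math.log(sum(math.exp(score(c, w))
                         for c in itertools.product((0, 1), repeat=len(z))))
    return score(z, w) - log_Z

def log_pseudolikelihood(z, w):
    # Each term conditions z_i on the rest and normalizes over 2 values only.
    total = 0.0
    for i in range(len(z)):
        alternatives = [z[:i] + (v,) + z[i + 1:] for v in (0, 1)]
        log_norm = math.log(sum(math.exp(score(c, w)) for c in alternatives))
        total += score(z, w) - log_norm
    return total

z = (1, 1, 0, 0, 0)   # hypothetical observed configuration
w = 0.8               # hypothetical coupling weight
full = full_log_likelihood(z, w)
pseudo = log_pseudolikelihood(z, w)
```

With `w = 0` the variables are independent and the two quantities coincide; for other weights they generally differ, which is why the pseudolikelihood is only an approximation.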

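Estimators

Exponential Family Estimators

$p_\theta(z) = \exp\{\phi(z)^T \theta - A(\theta)\}$ for $z \in \mathcal{Z}$

$z = (x, y)$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$

$\phi(z)$: features

$\theta$: model parameters

$A(\theta)$: normalization (log-partition function)

Example: conditional random field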
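As a minimal sketch of the exponential family form, the code below computes the normalization term by brute-force enumeration over a tiny space. The feature function `phi`, the parameter values, and the two-variable space are hypothetical choices for illustration.

```python
import itertools
import math

def phi(z):
    """Hypothetical feature vector for z = (x, y) with binary x and y."""
    x, y = z
    return (float(x == y), float(y == 1))

def dot(z, theta):
    return sum(f * t for f, t in zip(phi(z), theta))

def log_partition(theta, outcomes):
    # A(theta) = log sum_z exp{phi(z)^T theta}
    return math.log(sum(math.exp(dot(z, theta)) for z in outcomes))

def prob(z, theta, outcomes):
    # p_theta(z) = exp{phi(z)^T theta - A(theta)}
    return math.exp(dot(z, theta) - log_partition(theta, outcomes))

outcomes = list(itertools.product((0, 1), repeat=2))
theta = (1.0, -0.5)   # hypothetical parameters
total = sum(prob(z, theta, outcomes) for z in outcomes)
```

Subtracting `log_partition` is what makes the probabilities sum to one; for structured models the enumeration is replaced by dynamic programming or approximation.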

Estimators

Composite Likelihood Estimators [Lindsay 1988]

One class of pseudolikelihood estimators;

Consists of a weighted sum of component likelihoods, each of which is the probability of one subset of data points conditioned on another.

Partitions the space $\mathcal{Z}$ (the partitioning is denoted by $r$ and is drawn according to a fixed distribution $P_r$), and obtains the component likelihood $p_\theta(z \mid z \in r(z))$, where $r(z)$ is the cell of the partition that contains $z$.

Defines the criterion function

$m_\theta(z) = E_{r \sim P_r} \log p_\theta(z \mid z \in r(z))$

which reflects the quality of the estimator.

The maximum composite likelihood estimator:

$\hat{\theta} = \arg\max_\theta \hat{E}_z[m_\theta(z)]$

Estimators

Three estimators to be compared in the paper:

Generative: one component

$r_g(x, y) = \mathcal{X} \times \mathcal{Y}$

Fully discriminative: one component

$r_d(x, y) = \{x\} \times \mathcal{Y}$

Pseudolikelihood discriminative: one component for each data point $i$

$r_i(x, y) = \{(x', y') : x' = x,\ y' \in \mathcal{Y},\ y'_j = y_j \text{ for } j \neq i\}$
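The three partitionings can be compared on a toy model by evaluating the log of each component likelihood, i.e. the score of $z$ minus the log-sum-exp of scores over the cell $r(z)$. This is a sketch under stated assumptions: the `score` function, the parameter values, the two-position binary space, and the uniform weighting of the per-label components are all hypothetical.

```python
import itertools
import math

VALUES = (0, 1)

def score(x, y, theta):
    """Hypothetical phi(z)^T theta: a y1 = y2 term plus x_i = y_i terms."""
    return (theta[0] * (y[0] == y[1])
            + theta[1] * ((x[0] == y[0]) + (x[1] == y[1])))

def log_component(x, y, theta, cell):
    # log p_theta(z | z in r(z)) = score(z) - log sum_{z' in r(z)} exp(score(z'))
    log_norm = math.log(sum(math.exp(score(xp, yp, theta)) for xp, yp in cell))
    return score(x, y, theta) - log_norm

def r_g(x, y):
    # Generative: the whole space X x Y.
    pairs = list(itertools.product(VALUES, repeat=2))
    return [(xp, yp) for xp in pairs for yp in pairs]

def r_d(x, y):
    # Fully discriminative: x fixed, all label vectors.
    return [(x, yp) for yp in itertools.product(VALUES, repeat=2)]

def r_i(x, y, i):
    # Pseudolikelihood discriminative: x and every label except y_i fixed.
    return [(x, y[:i] + (v,) + y[i + 1:]) for v in VALUES]

x, y, theta = (0, 1), (0, 1), (0.5, 1.0)
m_gen = log_component(x, y, theta, r_g(x, y))
m_dis = log_component(x, y, theta, r_d(x, y))
# Assumed uniform weighting over the two per-label components.
m_pseudo = sum(log_component(x, y, theta, r_i(x, y, i)) for i in range(2)) / 2
```

The cells shrink from 16 outcomes (generative) to 4 (fully discriminative) to 2 (pseudolikelihood), so the normalization gets cheaper as more of $z$ is conditioned on.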

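Estimators

Risk Decomposition

Bayes risk:

$R^* = H(Y \mid X) = E_{(X,Y) \sim p^*}[-\log p^*(Y \mid X)]$

The excess risk of an estimator over the Bayes risk has two sources:

• Estimation error: we have only finite data;

• Approximation error: the intrinsic suboptimality of the estimator.

Define

$\theta^\circ = \arg\max_\theta E_{Z \sim p^*}[m_\theta(Z)]$

which is unrelated to the data samples $z$.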
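Written out, the decomposition takes the following form. This is a sketch that assumes the risk of a parameter is its expected conditional log-loss under the true distribution $p^*$, with $\hat\theta$ the estimate from $n$ samples and $\theta^\circ$ the infinite-data limit defined above; the definition of $R(\theta)$ is not legible on the slide and is stated here as an assumption.

```latex
% Assumed definition of the risk (expected conditional log-loss):
R(\theta) = E_{(X,Y) \sim p^*}\left[-\log p_\theta(Y \mid X)\right]

% Excess risk split at the infinite-data limit \theta^\circ:
R(\hat\theta) - R^*
  = \underbrace{R(\hat\theta) - R(\theta^\circ)}_{\text{estimation error (finite data)}}
  + \underbrace{R(\theta^\circ) - R^*}_{\text{approximation error (intrinsic suboptimality)}}
```

The first term shrinks as $n$ grows; the second does not depend on $n$ at all.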

Asymptotic Analysis

Well-specified model: all three estimators achieve an $O(n^{-1})$ convergence rate.

Misspecified model: only the fully discriminative estimator achieves an $O(n^{-1})$ rate.
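To give a feel for the gap between the two rates that appear in this analysis, the sketch below tabulates $n^{-1}$ and $n^{-1/2}$ for a few sample sizes. It ignores constants entirely (an assumption made purely for illustration), so only the relative scaling is meaningful.

```python
# Hypothetical unit constants: only the scaling with n is meaningful.
sample_sizes = (100, 10_000, 1_000_000)

# Map each n to (n^{-1}, n^{-1/2}).
rates = {n: (n ** -1.0, n ** -0.5) for n in sample_sizes}

for n, (fast, slow) in rates.items():
    print(f"n={n:>9}  n^-1={fast:.1e}  n^-1/2={slow:.1e}")
```

At the $n^{-1/2}$ rate, halving the error takes four times as much data; at the $n^{-1}$ rate it takes only twice as much.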

Asymptotic Analysis

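Experiments

Toy example: four-node binary-valued graphical model $z = (x_1, x_2, y_1, y_2)$

True model, writing the three components of $\theta^*$ as $\theta^*_1, \theta^*_2, \theta^*_3$:

$\phi(z)^T \theta^* = \theta^*_1 \mathbf{1}(y_1 = y_2) + \theta^*_2 [\mathbf{1}(x_1 = y_1) + \mathbf{1}(x_2 = y_2)] + \theta^*_3 [\mathbf{1}(x_1 = y_2) + \mathbf{1}(x_2 = y_1)]$

Learned model:

$\phi(z)^T \theta = \theta_1 \mathbf{1}(y_1 = y_2) + \theta_2 [\mathbf{1}(x_1 = y_1) + \mathbf{1}(x_2 = y_2)]$

When $\theta^*_3 = 0$, the learned model is well-specified;

When $\theta^*_3 \neq 0$, the learned model is misspecified.

Experiments

[Figure: results on the toy example with $n = 20000$, one panel for $\theta^*_3 = 0$ (well-specified) and one for $\theta^*_3 = 0.5$ (misspecified).]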
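The toy model is small enough to enumerate. The sketch below builds the true distribution (with the extra cross terms pairing $x_1$ with $y_2$ and $x_2$ with $y_1$) and the learned family (without them), then compares them at one parameter setting. The names `t1`, `t2`, `t3` and the value 1.0 for the shared weights are assumptions for illustration; only the cross-term weights 0 and 0.5 are taken from the slide.

```python
import itertools
import math

def true_score(z, t1, t2, t3):
    x1, x2, y1, y2 = z
    return (t1 * (y1 == y2)
            + t2 * ((x1 == y1) + (x2 == y2))
            + t3 * ((x1 == y2) + (x2 == y1)))   # cross terms, absent from the learned model

def learned_score(z, t1, t2):
    x1, x2, y1, y2 = z
    return t1 * (y1 == y2) + t2 * ((x1 == y1) + (x2 == y2))

def normalize(score_fn, *params):
    """Turn a score function into a distribution over all 16 outcomes."""
    outcomes = list(itertools.product((0, 1), repeat=4))
    weights = [math.exp(score_fn(z, *params)) for z in outcomes]
    total = sum(weights)
    return {z: w / total for z, w in zip(outcomes, weights)}

def max_gap(p, q):
    return max(abs(p[z] - q[z]) for z in p)

# Hypothetical shared weights of 1.0 for the first two terms.
learned = normalize(learned_score, 1.0, 1.0)
gap_well = max_gap(normalize(true_score, 1.0, 1.0, 0.0), learned)
gap_mis = max_gap(normalize(true_score, 1.0, 1.0, 0.5), learned)
```

With a zero cross-term weight the learned family reproduces the true distribution exactly (`gap_well` is zero up to rounding). With weight 0.5 this particular learned setting does not match; checking a single setting is only suggestive, not a proof that no setting matches.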

Experiments

Part-of-speech (POS) Tagging:

Input: a sequence of words $x = (x_1, \ldots, x_l)$

Output: a sequence of POS tags $y = (y_1, \ldots, y_l)$, e.g. noun, verb, etc. (45 tags total)

Specified model:

Node features $\phi_{node}(y_i, x_i)$: indicator functions of the form $\mathbf{1}(y_i = a, x_i = b)$

Edge features $\phi_{edge}(y_i, y_{i+1})$: indicator functions of the form $\mathbf{1}(y_i = a, y_{i+1} = b)$

Training: Wall Street Journal, 38K sentences.

Testing: Wall Street Journal, 5.5K sentences, different sections from training.
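The node and edge indicator features can be sketched as a sparse count vector for one tagged sentence. The function name `extract_features`, the tuple keys, and the example sentence are hypothetical; the paper's actual feature implementation is not shown in the slides.

```python
from collections import Counter

def extract_features(words, tags):
    """Count active node and edge indicator features for one tagged sentence."""
    feats = Counter()
    # Node features: fire when tag a is paired with word b at the same position.
    for word, tag in zip(words, tags):
        feats[("node", tag, word)] += 1
    # Edge features: fire when tag a is immediately followed by tag b.
    for left_tag, right_tag in zip(tags, tags[1:]):
        feats[("edge", left_tag, right_tag)] += 1
    return feats

# Hypothetical sentence with Penn Treebank style tags.
words = ["the", "dog", "barks"]
tags = ["DT", "NN", "VBZ"]
feats = extract_features(words, tags)
```

A sentence of length $l$ activates $l$ node features and $l - 1$ edge features, so the feature vector stays sparse even though the full feature space is large.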

Experiments

Use the learned generative model to sample 1000 training samples and 1000 test samples, as synthetic data.

Conclusions

When the model is well-specified:

• All three estimators achieve an $O(n^{-1})$ convergence rate;

• There is no approximation error;

• The asymptotic estimation error is ordered as: generative < fully discriminative < pseudolikelihood discriminative.

When the model is misspecified:

• The fully discriminative estimator still achieves an $O(n^{-1})$ convergence rate, but the other two estimators achieve only an $O(n^{-1/2})$ convergence rate;

• The approximation error and asymptotic estimation error of the fully discriminative estimator are lower than those of the generative estimator and the pseudolikelihood discriminative estimator.
