An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators, by Percy Liang and Michael Jordan (ICML 2008). Presented by Lihan He, ECE, Duke University, June 27, 2008.



Outline

• Introduction
• Exponential family estimators
    • Generative
    • Fully discriminative
    • Pseudolikelihood discriminative
• Asymptotic analysis
• Experiments
• Conclusions

Introduction

Data points are not considered to be drawn independently: there are correlations between data points.

Given data $z = (x, y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we have to consider the joint distribution over all the data points.

Correspondingly, the overall likelihood is not the product of the likelihoods of the individual data points:

$$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$$
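A minimal numerical sketch of this point, assuming a made-up joint distribution over two correlated binary data points (the table `joint` and all variable names are illustrative, not from the paper). It checks that the chain-rule factorization recovers the joint, while the product of marginals does not.

```python
import numpy as np

# Hypothetical joint distribution over two correlated binary points (z1, z2).
# joint[a, b] = p(z1 = a, z2 = b); the numbers are invented for illustration.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_z1 = joint.sum(axis=1)                 # marginal p(z1)
p_z2 = joint.sum(axis=0)                 # marginal p(z2)
p_z2_given_z1 = joint / p_z1[:, None]    # conditional p(z2 | z1)

chain = p_z1[:, None] * p_z2_given_z1    # p(z1) p(z2 | z1)
independent = np.outer(p_z1, p_z2)       # p(z1) p(z2), valid only under independence

print("chain rule matches joint:", np.allclose(chain, joint))
print("product of marginals matches joint:", np.allclose(independent, joint))
```

With these numbers the product of marginals is 0.25 in every cell, so it misses the correlation that the chain rule keeps.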

Introduction

Generative vs. Discriminative

Generative model:
• A model for randomly generating observed data;
• Learns a joint probability distribution over both observations and labels:

$$p(x, y) = p(x_1, \ldots, x_n, y_1, \ldots, y_n)$$

Discriminative model:
• A model only of the label variables conditional on the observed data;
• Learns a conditional distribution over labels given observations:

$$p(y \mid x) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_n)$$

Introduction

Full Likelihood vs. Pseudolikelihood

Full likelihood:

$$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$$

• Could be intractable;
• Computationally inefficient.

Pseudolikelihood:
• An approximation of the full likelihood;
• Computationally more efficient.

$$p(z) \approx \prod_i p(z_i \mid z_j \text{ for all } \{z_i, z_j\} \in E)$$

where $E$ is a set of dependencies between data points.
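The contrast between the two likelihoods can be sketched on a tiny model. The three-node chain, the coupling strength `W`, and the function names below are all assumptions chosen for illustration. The full likelihood normalizes over every joint configuration, while each pseudolikelihood factor normalizes over the values of a single variable.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2)]  # hypothetical dependency set E: a three-node chain
W = 1.0                   # hypothetical coupling strength

def score(z):
    """Unnormalized log-probability: W times the number of agreeing edges."""
    return W * sum(z[i] == z[j] for i, j in EDGES)

def log_full_likelihood(z):
    """log p(z), normalizing over all 2^n joint configurations."""
    all_states = itertools.product([0, 1], repeat=len(z))
    log_norm = math.log(sum(math.exp(score(s)) for s in all_states))
    return score(z) - log_norm

def log_pseudolikelihood(z):
    """Sum over i of log p(z_i | all other z_j); in this pairwise model the
    conditional depends only on the neighbors of i in EDGES."""
    total = 0.0
    for i in range(len(z)):
        # The only configurations that differ from z in coordinate i.
        alternatives = [z[:i] + (v,) + z[i + 1:] for v in (0, 1)]
        log_norm = math.log(sum(math.exp(score(a)) for a in alternatives))
        total += score(z) - log_norm
    return total

z = (0, 0, 0)
print(log_full_likelihood(z), log_pseudolikelihood(z))
```

The full normalizer sums over 2^n terms, whereas the pseudolikelihood needs only n sums of two terms each, which is where the computational saving comes from. The two values differ, as expected of an approximation.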

Estimators

Exponential Family Estimators

$$p_\theta(z) = \exp\{\phi(z)^\top \theta - A(\theta)\} \quad \text{for } z \in \mathcal{Z}$$

• $z = (x, y)$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$
• $\phi(z)$: features
• $\theta$: model parameters
• $A(\theta)$: normalization (log-partition function)

Example: conditional random field
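A small sketch of the exponential family form on a discrete space, where the normalization can be computed by enumeration as A(θ) = log Σ_z exp{φ(z)ᵀθ}. The two-variable outcome space, the feature map `phi`, and the parameter values are hypothetical choices made only to keep the example short.

```python
import itertools
import numpy as np

# Hypothetical outcome space: z = (x, y) with binary x and y.
STATES = list(itertools.product([0, 1], repeat=2))

def phi(z):
    """Hypothetical feature map: [1(x == y), 1(y == 1)]."""
    x, y = z
    return np.array([float(x == y), float(y == 1)])

def log_partition(theta):
    """A(theta) = log sum_z exp(phi(z)^T theta), computed by enumeration."""
    return np.log(sum(np.exp(phi(z) @ theta) for z in STATES))

def prob(z, theta):
    """p_theta(z) = exp(phi(z)^T theta - A(theta))."""
    return np.exp(phi(z) @ theta - log_partition(theta))

theta = np.array([1.0, -0.5])  # arbitrary illustrative parameters
probs = {z: prob(z, theta) for z in STATES}
print(probs)
```

Enumeration is only feasible for tiny spaces; for structured models such as a CRF the same quantity is computed with dynamic programming or approximated.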

Estimators

Composite Likelihood Estimators [Lindsay 1988]

• One class of pseudolikelihood estimators;
• The criterion is a weighted sum of component likelihoods, each of which is the probability of one subset of data points conditioned on another;
• Each component is defined by a partition $r$ of the outcome space, drawn according to a fixed distribution $P_r$; $r(z)$ denotes the cell of the partition that contains $z$.

The criterion function, which reflects the quality of the estimator, is

$$m_\theta(z) = \mathbb{E}_{r \sim P_r} \log p_\theta(z \mid z \in r(z))$$

The maximum composite likelihood estimator is

$$\hat{\theta} = \arg\max_\theta \hat{\mathbb{E}}_z[m_\theta(z)]$$

where $\hat{\mathbb{E}}_z$ is the average over the observed data.

Estimators

Three estimators to be compared in the paper:

• Generative: one component,

$$r_g(x, y) = \mathcal{X} \times \mathcal{Y}$$

• Fully discriminative: one component,

$$r_d(x, y) = \{x\} \times \mathcal{Y}$$

• Pseudolikelihood discriminative: one component for each data point $i$,

$$r_i(x, y) = \{(x', y') : x' = x,\ y' \in \mathcal{Y},\ y'_j = y_j \text{ for } j \neq i\}$$
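The three choices of component can be sketched by enumeration on a four-variable binary model z = (x1, x2, y1, y2). The feature map below and the uniform weighting over the pseudolikelihood components are assumptions made for illustration; each criterion is the log-probability of z renormalized over the cell r(z).

```python
import itertools
import numpy as np

STATES = list(itertools.product([0, 1], repeat=4))  # z = (x1, x2, y1, y2)

def features(z):
    """Hypothetical features: [1(y1 == y2), 1(x1 == y1) + 1(x2 == y2)]."""
    x1, x2, y1, y2 = z
    return np.array([float(y1 == y2), float(x1 == y1) + float(x2 == y2)])

def log_cond(theta, z, cell):
    """log p_theta(z | z in cell), where cell is a list of states containing z."""
    scores = np.array([features(s) @ theta for s in cell])
    return features(z) @ theta - np.log(np.exp(scores).sum())

def r_generative(z):
    return STATES                                    # the whole space X x Y

def r_discriminative(z):
    return [s for s in STATES if s[:2] == z[:2]]     # x fixed, all y free

def r_pseudo(z, i):
    j = 1 - i                                        # index of the other label
    return [s for s in STATES if s[:2] == z[:2] and s[2 + j] == z[2 + j]]

def m_generative(theta, z):
    return log_cond(theta, z, r_generative(z))

def m_discriminative(theta, z):
    return log_cond(theta, z, r_discriminative(z))

def m_pseudo(theta, z):
    # Assumes P_r is uniform over the two single-label components.
    return np.mean([log_cond(theta, z, r_pseudo(z, i)) for i in (0, 1)])

theta = np.array([0.7, -0.3])  # arbitrary illustrative parameters
z = (0, 1, 0, 0)
print(m_generative(theta, z), m_discriminative(theta, z), m_pseudo(theta, z))
```

The cells shrink from 16 states to 4 to 2, so each criterion conditions on more of z than the one before it and normalizes over fewer alternatives.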

Estimators

Risk Decomposition

Bayes risk:

$$R^* = H(Y \mid X) = \mathbb{E}_{(X, Y) \sim p^*}[-\log p^*(Y \mid X)]$$

The excess risk over the Bayes risk has two sources:
• Estimation error: we have only finite data;
• Approximation error: intrinsic suboptimality of the estimator.

Define

$$\theta^\circ = \arg\max_\theta \mathbb{E}_{Z \sim p^*}\, m_\theta(Z)$$

which is unrelated to the data samples $z$.
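A sketch of how such a decomposition is commonly written, under the assumption that $R(\theta)$ denotes the expected conditional log-loss of $p_\theta$, that $\hat{\theta}$ is the estimate obtained from finite data, and that $\theta^\circ$ is its infinite-data limit. The notation $R(\theta)$ is introduced here for illustration.

```latex
\begin{align*}
R(\theta) &= \mathbb{E}_{(X,Y)\sim p^*}\left[-\log p_\theta(Y \mid X)\right] \\
R(\hat{\theta}) - R^* &=
  \underbrace{R(\theta^\circ) - R^*}_{\text{approximation error}}
  + \underbrace{R(\hat{\theta}) - R(\theta^\circ)}_{\text{estimation error}}
\end{align*}
```

The first term does not shrink with more data, since it depends only on the estimator and the model family; the second term vanishes as the sample size grows.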

Asymptotic Analysis

• Well-specified model: all three estimators achieve an $O(n^{-1})$ convergence rate.
• Misspecified model: only the fully discriminative estimator achieves an $O(n^{-1})$ rate.
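To make an $O(n^{-1})$ rate concrete, here is a minimal Monte Carlo sketch on a deliberately simple well-specified model: a single Bernoulli variable fit by maximum likelihood. This is not the paper's setting. The function name `mean_excess_risk`, the value `p_star = 0.5`, and the sample sizes are illustrative assumptions. If the excess risk scales like 1/n, quadrupling n should cut it by roughly a factor of four.

```python
import numpy as np

def mean_excess_risk(n, p_star=0.5, trials=20000, seed=0):
    """Average KL(p* || p_hat) of the Bernoulli MLE over many datasets of size n."""
    rng = np.random.default_rng(seed)
    # p_hat is the fraction of ones in each simulated dataset of size n.
    p_hat = rng.binomial(n, p_star, size=trials) / n
    # Clip to avoid log(0) in the very unlikely event that p_hat is 0 or 1.
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)
    kl = (p_star * np.log(p_star / p_hat)
          + (1 - p_star) * np.log((1 - p_star) / (1 - p_hat)))
    return kl.mean()

risk_100 = mean_excess_risk(100)
risk_400 = mean_excess_risk(400)
ratio = risk_100 / risk_400  # close to 4 if the excess risk scales like 1/n
print(risk_100, risk_400, ratio)
```

Under an $O(n^{-1/2})$ rate the same ratio would be close to 2 instead, which is the qualitative gap the slide describes for misspecified models.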


Experiments

Toy example: four-node binary-valued graphical model $z = (x_1, x_2, y_1, y_2)$

True model:

$$\phi(z)^\top \theta^* = \theta_1^*\, \mathbb{1}(y_1 = y_2) + \theta_2^*\, [\mathbb{1}(x_1 = y_1) + \mathbb{1}(x_2 = y_2)] + \theta_3^*\, [\mathbb{1}(x_1 = y_2) + \mathbb{1}(x_2 = y_1)]$$

Learned model:

$$\phi(z)^\top \theta = \theta_1\, \mathbb{1}(y_1 = y_2) + \theta_2\, [\mathbb{1}(x_1 = y_1) + \mathbb{1}(x_2 = y_2)]$$

When $\theta_3^* = 0$, the learned model is well-specified;
when $\theta_3^* \neq 0$, the learned model is misspecified.

Experiments

[Figure: results on the toy model with $n = 20000$, for $\theta_3^* = 0$ (well-specified) and $\theta_3^* = 0.5$ (misspecified).]

Experiments

Part-of-speech (POS) Tagging:

Input: a sequence of words $x = (x_1, \ldots, x_l)$

Output: a sequence of POS tags $y = (y_1, \ldots, y_l)$, e.g. noun, verb, etc. (45 tags total)

Specified model:
• Node features $\phi_{\text{node}}(y_i, x_i)$: indicator functions of the form $\mathbb{1}(y_i = a, x_i = b)$
• Edge features $\phi_{\text{edge}}(y_i, y_{i+1})$: indicator functions of the form $\mathbb{1}(y_i = a, y_{i+1} = b)$

Training: Wall Street Journal, 38K sentences.

Testing: Wall Street Journal, 5.5K sentences, different sections from training.
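The node and edge indicator features of the specified model can be sketched as sparse counts over a tagged sentence. The function name `sequence_features`, the key layout, and the example sentence are hypothetical; the sketch only shows which indicators fire for a given (words, tags) pair.

```python
from collections import Counter

def sequence_features(words, tags):
    """Count the indicator features that are active for one tagged sentence."""
    feats = Counter()
    # Node features: 1(y_i = a, x_i = b) for each position i.
    for word, tag in zip(words, tags):
        feats[("node", tag, word)] += 1
    # Edge features: 1(y_i = a, y_{i+1} = b) for each adjacent pair of tags.
    for tag, next_tag in zip(tags, tags[1:]):
        feats[("edge", tag, next_tag)] += 1
    return feats

words = ["the", "dog", "barks"]
tags = ["DT", "NN", "VBZ"]
feats = sequence_features(words, tags)
print(feats)
```

A sentence of length l activates l node features and l - 1 edge features, so the feature vector stays sparse even though the full feature space is the number of tags times the vocabulary size, plus the number of tags squared.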

Experiments

Use the learned generative model to sample 1000 training samples and 1000 test samples, as synthetic data.

Conclusions

When the model is well-specified:
• All three estimators achieve an $O(n^{-1})$ convergence rate;
• There is no approximation error;
• The asymptotic estimation errors are ordered: generative < fully discriminative < pseudolikelihood discriminative.

When the model is misspecified:
• The fully discriminative estimator still achieves an $O(n^{-1})$ convergence rate, but the other two estimators achieve only an $O(n^{-1/2})$ convergence rate;
• The approximation error and asymptotic estimation error of the fully discriminative estimator are lower than those of the generative estimator and the pseudolikelihood discriminative estimator.