
Regression for Proportion Data

Julian Center
Creative Research Corp.
Andover, MA, USA

July 10, 2007

MaxEnt2007

Overview

Introduction
- What is proportion data?
- What do we mean by regression?
- Examples
- Why should you care?

Coordinate transformation to facilitate regression

Measurement models
- Multinomial
- Laplace approximation to the multinomial
- Log-normal

Regression models
- Kernel regression (Nadaraya-Watson model)
- Gaussian process regression
  - with log-normal measurements
  - with multinomial measurements (expectation propagation)

Conclusion

What is Proportion Data?

Proportion data = compositional data over categorical data.

Proportion data: a (d+1)-dimensional vector r of the relative proportions of items assigned to one of d+1 categories. It is similar to a discrete probability distribution.

In mathematical terms, r is confined to the d-simplex,

$$ r \in S_d = \left\{ r \in \mathbb{R}^{d+1}_{+} : \mathbf{1}_{(d+1)}^{T}\, r = 1 \right\} $$

Here $\mathbf{1}_{(d+1)}$ is the (d+1)-dimensional vector of all ones, i.e. $[\mathbf{1}_{(d+1)}]_i = 1$ for all i.
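As a minimal numeric sketch of the definition above (the counts are invented: 400 items sorted into 3 categories, echoing the pollen example later in the deck):

```python
import numpy as np

# Hypothetical sample: 400 items sorted into 3 categories.
counts = np.array([120.0, 200.0, 80.0])
r = counts / counts.sum()          # vector of relative proportions
d = r.size - 1                     # r lives on the d-simplex (here d = 2)

# The simplex constraints: nonnegative entries that sum to one.
on_simplex = bool(np.all(r >= 0) and np.isclose(r.sum(), 1.0))
```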

What is Regression?

Regression = smoothing + calibration + interpolation.

It relates data gathered under one set of conditions to data gathered under similar, but different, conditions.

It accounts for measurement "noise" and determines p(r | x).

Examples

- Geostatistics: composition of rock samples at different locations.
- Medicine: response to different levels of treatment.
- Political science: opinion polls across different demographic groups.
- Climate research: infer climate history from fossil pollen samples; calibrate the model using present-day samples from known climates. Typically, examine 400 pollen grains and sort them into 14 categories.

Why Should You Care?

- Either you have proportion data to analyze,
- or you want to do pattern classification,
- or you want to apply a similar approach to your own problem:
  - transform constrained variables so that a Laplace approximation makes sense;
  - two different regression techniques;
  - expectation propagation for improving model fit.

Coordinate Transformation

Well-known regression methods can't deal with the pesky constraints of the simplex.

We need a one-to-one mapping between the d-simplex and d-dimensional real vectors. Then we can model probability distributions on real vectors and relate them to distributions on the simplex.

Coordinate Transformation

We can establish a one-to-one mapping between $S_d$ and $\mathbb{R}^d$ by

$$ \mathrm{sm} : \mathbb{R}^d \to S_d, \qquad \mathrm{sm}(f) = \left[ \mathbf{1}_{(d+1)}^{T} \exp\!\left(T^{T} f\right) \right]^{-1} \exp\!\left(T^{T} f\right) $$

$$ \mathrm{clr} : S_d \to \mathbb{R}^d, \qquad \mathrm{clr}(y) = T \ln(y) $$

where $T$ is a $d \times (d+1)$ matrix that satisfies

$$ T T^{T} = I_d, \qquad T \mathbf{1}_{(d+1)} = 0, \qquad T^{T} T + \frac{1}{d+1}\, \mathbf{1}_{(d+1)} \mathbf{1}_{(d+1)}^{T} = I_{(d+1)} $$

The rows of $T$ span the orthogonal complement of $\mathbf{1}_{(d+1)}$; we can always find $T$ by the Gram-Schmidt process.

$\mathrm{sm}$ is the symmetric softmax activation function; $\mathrm{clr}$ is the centered log-ratio link function.
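The mapping can be sketched numerically. This is an illustration, not the author's code; T is built with a QR factorization, which plays the role of the Gram-Schmidt process mentioned on the slide, and d = 2 is an arbitrary choice:

```python
import numpy as np

d = 2
ones = np.ones((d + 1, 1))
# QR of [1 | e1 ... ed]: the first column of Q is proportional to the ones
# vector, so the remaining columns span its orthogonal complement.
Q, _ = np.linalg.qr(np.hstack([ones, np.eye(d + 1)[:, :d]]))
T = Q[:, 1:].T                     # d x (d+1); T @ ones = 0, T @ T.T = I_d

def sm(f):
    """Symmetric softmax: R^d -> interior of the d-simplex."""
    e = np.exp(T.T @ f)
    return e / e.sum()

def clr(y):
    """Centered log-ratio link: d-simplex -> R^d."""
    return T @ np.log(y)

y = np.array([0.2, 0.3, 0.5])
round_trip = np.allclose(sm(clr(y)), y)   # the mapping is one-to-one
```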

Coordinate Transformation

[Figure: the simplex in (y1, y2) coordinates and its image under ln. The f coordinate runs along the line ln(y1) = -ln(y2); softmax is insensitive to the orthogonal all-ones direction.]

Measurement Models

- Multinomial
- Log-Normal

Measurement Model - Multinomial

Assume that the proportion vector r comes from S independent samples from the discrete probability distribution represented by the vector y:

$$ p(r \mid y) = M_S(r \mid y), \qquad M_S(r \mid y) \triangleq \frac{S!}{\prod_i \left(S [r]_i\right)!} \prod_i \left([y]_i\right)^{S [r]_i} $$

To get the likelihood function for $f = \mathrm{clr}(y)$, we take into account the Jacobian of the transformation, $\prod_i [y]_i$. The log-likelihood function corresponding to $f$ is then

$$ L(f) = (S + d + 1)\, \tilde{r}^{T} \ln(y) + \text{const}, \qquad \tilde{r} = \frac{S r + \mathbf{1}_{(d+1)}}{S + d + 1} $$
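A quick numeric check of the reconstructed log-likelihood (made-up counts; the constant term is dropped). By Gibbs' inequality, $\tilde r^T \ln(y)$ is maximized over the simplex at $y = \tilde r$, which is what the Laplace approximation on the next slides exploits:

```python
import numpy as np

S = 400
counts = np.array([80.0, 120.0, 200.0])   # invented category counts
r = counts / S                            # observed proportions
d = r.size - 1
r_tilde = (S * r + 1.0) / (S + d + 1.0)   # smoothed proportions

def log_likelihood(y):
    """(S + d + 1) * r_tilde . ln(y); Jacobian included, constant dropped."""
    return (S + d + 1) * float(r_tilde @ np.log(y))

best = log_likelihood(r_tilde)            # the maximizing y is r_tilde
```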

Multinomial Measurement Model

[Figure: binomial likelihood functions plotted against f for S = 400, for observed proportions r1 = 0, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5.]

Measurement Model - Laplace Approximation

Some regression methods assume a Gaussian measurement model. Therefore, we are tempted to approximate each multinomial measurement with a Gaussian measurement. Let's try a Laplace approximation to each measurement.

Laplace approximation:
- Find the peak of the log-likelihood function.
- Pick a Gaussian centered at the peak, with inverse covariance matrix matching the negative second derivative of the log-likelihood function at the peak.
- Pick an amplitude factor to match the height of the peak.

Measurement Model - Laplace Approximation

The value of f that maximizes the log-likelihood is

$$ m = T \ln(\tilde{r}) $$

The Laplace approximation to a single measurement is

$$ p(f) = a\, N(f \mid m, V) = a\, |2\pi V|^{-\frac{1}{2}} \exp\!\left[ -\frac{1}{2} (f - m)^{T} V^{-1} (f - m) \right] $$

where

$$ a = |2\pi V|^{\frac{1}{2}}\, \frac{S!}{\prod_i \left(S [r]_i\right)!}\, \exp[L(m)] $$

$$ V^{-1} = (S + d + 1)\, T \left[ \mathrm{Diag}(\tilde{r}) - \tilde{r}\tilde{r}^{T} \right] T^{T} $$
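The Laplace parameters m and V can be computed directly from the formulas above; a sketch with invented counts (T is rebuilt by QR, as a stand-in for Gram-Schmidt):

```python
import numpy as np

S, counts = 400, np.array([80.0, 120.0, 200.0])   # invented counts
r = counts / S
d = r.size - 1
r_tilde = (S * r + 1.0) / (S + d + 1.0)

# T: rows form an orthonormal basis of the complement of the ones vector.
Q, _ = np.linalg.qr(np.hstack([np.ones((d + 1, 1)), np.eye(d + 1)[:, :d]]))
T = Q[:, 1:].T

m = T @ np.log(r_tilde)                            # peak of the log-likelihood
V_inv = (S + d + 1) * T @ (np.diag(r_tilde) - np.outer(r_tilde, r_tilde)) @ T.T
V = np.linalg.inv(V_inv)                           # covariance of the Gaussian
```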

Laplace Approximation to Multinomial

[Figures: Laplace approximation vs. exact multinomial likelihood, plotted as p(f) against f, for r1 = 0/400, 1/400, 2/400, 4/400, 80/400, and 120/400.]

Measurement Model - Log-Normal

- General log-normal model form: $p(f) = a\, N(f \mid m, V)$.
- Can match the Laplace approximation to the multinomial.
- Can do much more, e.g. model over-dispersion or under-dispersion.
- Basis for the regression methods.

Regression Models

- A way of relating data taken under different conditions.
- Intuition: similar conditions should produce similar data.
- The best method to use depends on the problem.
- Two methods considered here: the Nadaraya-Watson model and the Gaussian process model.

Nadaraya-Watson Model

Based on applying Parzen density estimation to the joint distribution of f and x.

General form:

$$ p(f, x) = \sum_{j=1}^{J} w_j\, p(f, x \mid j) $$

Simplified model:

$$ p(f, x \mid j) = N\!\left(f \mid \hat{f}_j, B_j\right) N\!\left(x \mid x_j, D_j\right) $$

All Data Points

[Figure: training data points in the (x, f) plane.]

Nadaraya-Watson Model

[Figure: the Nadaraya-Watson mixture fitted over the data in the (x, f) plane.]

Nadaraya-Watson Model

This model implies that

$$ p(x) = \sum_{j=1}^{J} w_j\, p(x \mid j), \qquad p(x \mid j) = N(x \mid x_j, D_j) $$

$$ p(f \mid x) = \frac{p(f, x)}{p(x)} = \sum_{j=1}^{J} c_j(x)\, N\!\left(f \mid \hat{f}_j, B_j\right), \qquad c_j(x) = \frac{w_j\, p(x \mid j)}{p(x)} $$

Nadaraya-Watson Model

To determine the distribution for a new measurement, we compute

$$ p(r \mid x) = \int p(r \mid f)\, p(f \mid x)\, df = \sum_{j=1}^{J} c_j(x) \int p(r \mid f)\, N\!\left(f \mid \hat{f}_j, B_j\right) df $$

If we use the Laplace approximation to the multinomial, we can solve the integrals analytically to get

$$ p(r \mid x) = a \sum_{j=1}^{J} c_j(x)\, N\!\left(m \mid \hat{f}_j, B_j + V\right) $$

where m and V are computed from r as described above. Otherwise, we can use stochastic integration to compute the integrals.
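The mixture form of p(f | x) can be sketched in one latent dimension (all centers, widths, and weights below are invented):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])     # component centers x_j
fs = np.array([-1.0, 0.0, 1.5])    # component means f_hat_j
w = np.array([0.3, 0.4, 0.3])      # mixture weights w_j
D, B = 0.5, 0.1                    # shared x- and f-variances

def normal_pdf(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive_mean(x):
    """E[f | x] under p(f | x) = sum_j c_j(x) N(f | f_hat_j, B)."""
    c = w * normal_pdf(x, xs, D)   # w_j p(x | j)
    c = c / c.sum()                # c_j(x) = w_j p(x | j) / p(x)
    return float(c @ fs)
```

Near x = 0 the prediction is pulled toward f_hat_1 = -1; near x = 2 it is pulled toward f_hat_3 = 1.5, reflecting the "similar conditions, similar data" intuition.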

Nadaraya-Watson Model

Problem: we must compare a new point to every training point.

Solution:
- Choose a sparse set of "knots", and center density components only on the knots.
- Adjust weights and covariances by "diagnostic training".
- Mixture-model training tools apply.

Sparse Nadaraya-Watson Model

[Figure: a sparse set of mixture components (knots) covering the data in the (x, f) plane.]

Gaussian Process Model

- A probability distribution on functions.
- Specified by a mean function m(x) and a covariance kernel k(x1, x2).
- For any finite collection of points, the corresponding function values are jointly Gaussian.
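The defining property — jointly Gaussian values at any finite set of points — can be illustrated with a small sketch (zero mean and a squared-exponential kernel are assumed here; the slides do not fix a kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([0.0, 0.5, 1.0, 2.0])          # any finite collection of points

def kernel(x1, x2, length=1.0):
    """Squared-exponential covariance kernel k(x1, x2)."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

K = kernel(xs[:, None], xs[None, :]) + 1e-9 * np.eye(xs.size)  # jitter
sample = rng.multivariate_normal(np.zeros(xs.size), K)  # one draw of f at xs
```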

Gaussian Process Model

[Figure: a Gaussian process fit to the data in the (x, f) plane.]

Applying Gaussian Process Regression to Proportion Data

- Prior: model each component of f(x) as a zero-mean Gaussian process with covariance kernel k(x1, x2). Assume that the components of f are independent of each other.
- Posterior: use the Laplace approximations to the measurements and apply Kalman filter methods.
- Use expectation propagation to improve the fit.

Sparse Gaussian Process Model

Choose a subset of K training points to act as knots. Rearrange the latent function values at the knots into one large vector g:

$$ [g]_{(l-1)K + k} \triangleq [f(x_k)]_l, \qquad k \in \{1, 2, \ldots, K\},\; l \in \{1, 2, \ldots, d\} $$

so that g stacks the rows of

$$ \begin{bmatrix} [f(x_1)]_1 & [f(x_2)]_1 & \cdots & [f(x_K)]_1 \\ [f(x_1)]_2 & [f(x_2)]_2 & \cdots & [f(x_K)]_2 \\ \vdots & \vdots & \ddots & \vdots \\ [f(x_1)]_d & [f(x_2)]_d & \cdots & [f(x_K)]_d \end{bmatrix} $$

Sparse Gaussian Process Model

Under our assumptions, the prior is $p(g) = N(g \mid 0, G)$, where

$$ G \triangleq I_d \otimes C = \begin{bmatrix} C & 0 & \cdots & 0 \\ 0 & C & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C \end{bmatrix}, \qquad [C]_{jk} \triangleq k\!\left(x_j, x_k\right), \quad j, k \in \{1, 2, \ldots, K\} $$
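The Kronecker structure of G is easy to verify numerically (knot locations, d, and the squared-exponential kernel are invented for the sketch):

```python
import numpy as np

knots = np.array([0.0, 1.0, 2.5])            # K = 3 knot locations
d = 2                                        # latent dimension

def kernel(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

C = kernel(knots[:, None], knots[None, :])   # K x K Gram matrix at the knots
G = np.kron(np.eye(d), C)                    # I_d (x) C: d diagonal blocks of C
```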

Sparse Gaussian Process Model

$$ p(f(x) \mid g) = N\!\left[ f(x) \mid H(x)\, g,\; v(x)\, I_d \right] $$

where

$$ H(x) \triangleq I_d \otimes \left[ k(x)^{T} C^{-1} \right], \qquad v(x) \triangleq k(x, x) - k(x)^{T} C^{-1} k(x), \qquad [k(x)]_j \triangleq k(x, x_j), \quad j \in \{1, 2, \ldots, K\} $$

We can express this by the equation

$$ f(x) = H(x)\, g + u(x) $$

where $u(x) \sim N[0, v(x) I_d]$ and $u(x)$ is independent of g.
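H(x) and v(x) in code (same invented knots and kernel as the sketch above; an illustration, not the author's implementation):

```python
import numpy as np

knots = np.array([0.0, 1.0, 2.5])            # K = 3 knot locations
d = 2

def kernel(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

C = kernel(knots[:, None], knots[None, :])
C_inv = np.linalg.inv(C)

def H(x):
    """I_d (x) [k(x)^T C^{-1}]: maps knot values g to E[f(x) | g]."""
    k = kernel(x, knots)
    return np.kron(np.eye(d), k @ C_inv)

def v(x):
    """Residual variance k(x, x) - k(x)^T C^{-1} k(x)."""
    k = kernel(x, knots)
    return float(kernel(x, x) - k @ C_inv @ k)
```

At a knot, v(x) vanishes and H(x) simply selects the corresponding entries of g, matching the remark on the next slide.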

Sparse Gaussian Process Model

In particular, the values of the latent function at the training points can be expressed as

$$ f_n = H_n g + u_n $$

where $H_n = H(x_n)$ and $u_n = u(x_n)$. To simplify computations, we assume that $u_n$ is independent of $u_m$ for $n \neq m$.

Note that if $x_n$ is one of the knots, i.e., $n \leq K$, then $u_n = 0$ and $H_n$ is a $d \times dK$ sparse matrix that simply selects the appropriate elements of g.

GP - Log-Normal Model

Using the log-normal measurement model,

$$ p(r_n \mid g) = \int a_n\, N(f \mid m_n, V_n)\, N(f \mid H_n g, v_n I)\, df = a_n\, N(m_n \mid H_n g, R_n) $$

where $R_n = V_n + v_n I$. Thus everything is Gaussian, and therefore $p(g \mid \mathcal{T}) = N(g \mid \hat{g}, P)$, where $\mathcal{T}$ denotes the training data.

GP - Log-Normal Model

We can determine $\hat{g}$ and P by the Kalman filter algorithm:

(1) Start with

$$ \hat{g} \leftarrow 0, \qquad P \leftarrow G $$

(2) For n = 1 to N, iterate

$$ K_n \leftarrow P H_n^{T} \left( H_n P H_n^{T} + R_n \right)^{-1} $$
$$ \hat{g} \leftarrow \hat{g} + K_n \left( m_n - H_n \hat{g} \right) $$
$$ P \leftarrow P - K_n H_n P $$

If we believe that the log-normal measurement model is correct, then we are finished after one pass through all the training data.
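One pass of the filter on toy numbers (the prior, H_n, m_n, and R_n below are all invented; in the deck's setting, G comes from the kernel and H_n, m_n, R_n from the measurement model):

```python
import numpy as np

G = np.eye(3)                    # toy prior covariance over g
g_hat = np.zeros(3)
P = G.copy()

# Two toy measurements (H_n, m_n, R_n).
measurements = [
    (np.array([[1.0, 0.0, 0.0]]), np.array([0.5]), np.array([[0.1]])),
    (np.array([[0.0, 1.0, 0.0]]), np.array([-0.2]), np.array([[0.2]])),
]
for H_n, m_n, R_n in measurements:
    K_n = P @ H_n.T @ np.linalg.inv(H_n @ P @ H_n.T + R_n)
    g_hat = g_hat + K_n @ (m_n - H_n @ g_hat)
    P = P - K_n @ H_n @ P
```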

GP - Log-Normal Model

We can compute the evidence by

$$ a = p(\mathcal{T}) = \left[ \prod_{n=1}^{N} N(0 \mid m_n, R_n) \right] N(0 \mid 0, G)\, \left[ N(0 \mid \hat{g}, P) \right]^{-1} $$

We can determine the probability distribution of seeing a new measurement r at x by

$$ p(r \mid x, \mathcal{T}) = a\, N\!\left[ m \mid H(x)\, \hat{g},\; V + v(x) I + H(x)\, P\, H(x)^{T} \right] $$

GP - Multinomial Model

If we believe that the measurement model is really multinomial, we can get a more accurate approximation using the Expectation Propagation (EP) algorithm.

As before, we approximate the joint distribution $p(r_1, r_2, \ldots, r_N, g)$ by the form

$$ q(g) = \prod_{n} b_n\, N(H_n g \mid m_n, R_n)\, N(g \mid 0, G) $$

Now our aim is to adjust the $b_n$'s, $m_n$'s, and $R_n$'s to minimize the Kullback-Leibler divergence

$$ D(p \,\|\, q) = -\int \ln\!\left( \frac{q(g)}{p(g)} \right) p(g)\, dg $$

Expectation Propagation Method

To minimize $D(p \,\|\, q)$, we iteratively choose a measurement n and minimize $D(p_n^* \,\|\, q_n^*)$, where

$$ p_n^*(g) = \frac{p(r_n \mid g)}{b_n\, N(H_n g \mid m_n, R_n)}\, q(g) $$

$$ q_n^*(g) = \frac{b_n^*\, N(H_n g \mid m_n^*, R_n^*)}{b_n\, N(H_n g \mid m_n, R_n)}\, q(g) $$

We can accomplish this by choosing $b_n^*$, $m_n^*$, and $R_n^*$ so that the moments of $q_n^*(g)$ match those of $p_n^*(g)$.

Expectation Propagation Method

To approximate the moments, we compute

$$ b_n^* \approx \frac{1}{M} \sum_{i=1}^{M} \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} $$

$$ \hat{h} \approx \frac{1}{b_n^*} \frac{1}{M} \sum_{i=1}^{M} h^{(i)}\, \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} $$

$$ W \approx \frac{1}{b_n^*} \frac{1}{M} \sum_{i=1}^{M} h^{(i)} h^{(i)T}\, \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} - \hat{h}\hat{h}^{T} $$

where the samples are drawn from

$$ h^{(i)} \sim N\!\left( H_n \hat{g},\; H_n P H_n^{T} \right) $$

Expectation Propagation Method

To get $q_n^*$ to have the same moments as $p_n^*$, we choose

$$ R_n^{*-1} = R_n^{-1} + W^{-1} - \left( H_n P H_n^{T} \right)^{-1} $$

$$ m_n^* = R_n^* \left[ R_n^{-1} m_n + W^{-1} \hat{h} - \left( H_n P H_n^{T} \right)^{-1} H_n \hat{g} \right] $$
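In one dimension the moment-matching update reduces to scalar arithmetic; a sketch with invented values (R and m are the current site parameters, HPH stands for $H_n P H_n^T$, Hg for $H_n \hat g$, and h_hat, W are the Monte Carlo moments):

```python
# Current site parameters and posterior quantities (all invented scalars).
R, m = 0.5, 0.1        # site variance R_n and site mean m_n
HPH, Hg = 0.8, 0.0     # H_n P H_n^T and H_n g_hat
h_hat, W = 0.3, 0.4    # matched moments of the tilted distribution p*_n

# Moment-matching update from the slide, specialized to scalars.
R_star = 1.0 / (1.0 / R + 1.0 / W - 1.0 / HPH)
m_star = R_star * (m / R + h_hat / W - Hg / HPH)
```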

Expectation Propagation Method

If n is one of the knots,

$$ p\!\left(r_n \mid h^{(i)}\right) = M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)}\right) \right) $$

Otherwise, we approximate it by

$$ p\!\left(r_n \mid h^{(i)}\right) = \int M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)} + u\right) \right) N(u \mid 0, v_n I)\, du \approx \frac{1}{M'} \sum_{j=1}^{M'} M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)} + u^{(j)}\right) \right), \qquad u^{(j)} \sim N(0, v_n I) $$

Expectation Propagation Method

Now we can update the smoother parameters. If $R_n^{*-1} = R_n^{-1}$, then the error covariance P does not change, and we update the estimate of g by

$$ \hat{g} \leftarrow \hat{g} + P H_n^{T} R_n^{-1} \left( m_n^* - m_n \right) $$

Otherwise, we use

$$ R_\Delta \leftarrow \left( R_n^{*-1} - R_n^{-1} \right)^{-1} $$
$$ K_n \leftarrow P H_n^{T} \left( H_n P H_n^{T} + R_\Delta \right)^{-1} $$
$$ P \leftarrow P - K_n H_n P $$
$$ \hat{g} \leftarrow \hat{g} + K_n \left[ R_\Delta \left( R_n^{*-1} m_n^* - R_n^{-1} m_n \right) - H_n \hat{g} \right] $$

Expectation Propagation Method

Finally, we replace the parameters for measurement n,

$$ b_n \leftarrow b_n^*, \qquad m_n \leftarrow m_n^*, \qquad R_n \leftarrow R_n^* $$

and go to the next iteration.

Choosing the Regression Model

If you have two samplings taken under the same conditions, do you want to treat them as coming from a bimodal distribution (NW model) or combine them into one big sampling (GP model)?

Conclusion

- A coordinate transformation makes it possible to analyze proportion data with known regression methods.
- The multinomial distribution can be well approximated by a Gaussian on the transformed variable.
- The choice of regression model depends on the effect that you want: multimodal vs. unimodal fit.