
Regression for Proportion Data

Julian Center
Creative Research Corp.
Andover, MA, USA

July 10, 2007

MaxEnt2007

Overview

Introduction
- What is proportion data?
- What do we mean by regression?
- Examples
- Why should you care?

Coordinate transformation to facilitate regression

Measurement models
- Multinomial
- Laplace approximation to the multinomial
- Log-normal

Regression models
- Kernel regression (Nadaraya-Watson model)
- Gaussian process regression
  - with log-normal measurements
  - with multinomial measurements (expectation propagation)

Conclusion

What is Proportion Data?

Proportion data = compositional data over categorical data.

Proportion data: a (d+1)-dimensional vector r of the relative proportions of items assigned to one of d+1 categories. It is similar to a discrete probability distribution.

In mathematical terms, r is confined to the d-simplex,

$$ r \in S_d = \left\{ r \in \mathbb{R}^{d+1}_{+} : \mathbf{1}_{(d+1)}^{T}\, r = 1 \right\} $$

Here $\mathbf{1}_{(d+1)}$ is the (d+1)-dimensional vector of all ones, i.e. $[\mathbf{1}_{(d+1)}]_i = 1$ for all i.
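As a minimal numeric sketch of the definition above (the counts are invented: 400 items sorted into 3 categories, echoing the pollen example later in the deck):

```python
import numpy as np

# Hypothetical sample: 400 items sorted into 3 categories.
counts = np.array([120.0, 200.0, 80.0])
r = counts / counts.sum()          # vector of relative proportions
d = r.size - 1                     # r lives on the d-simplex (here d = 2)

# The simplex constraints: nonnegative entries that sum to one.
on_simplex = bool(np.all(r >= 0) and np.isclose(r.sum(), 1.0))
```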

What is Regression?

Regression = smoothing + calibration + interpolation.

It relates data gathered under one set of conditions to data gathered under similar, but different, conditions.

It accounts for measurement "noise" and determines p(r | x).

Examples

- Geostatistics: composition of rock samples at different locations.
- Medicine: response to different levels of treatment.
- Political science: opinion polls across different demographic groups.
- Climate research: infer climate history from fossil pollen samples; calibrate the model using present-day samples from known climates. Typically, examine 400 pollen grains and sort them into 14 categories.

Why Should You Care?

- Either you have proportion data to analyze,
- or you want to do pattern classification,
- or you want to apply a similar approach to your own problem:
  - transform constrained variables so that a Laplace approximation makes sense;
  - two different regression techniques;
  - expectation propagation for improving model fit.

Coordinate Transformation

Well-known regression methods can't deal with the pesky constraints of the simplex.

We need a one-to-one mapping between the d-simplex and d-dimensional real vectors. Then we can model probability distributions on real vectors and relate them to distributions on the simplex.

Coordinate Transformation

We can establish a one-to-one mapping between $S_d$ and $\mathbb{R}^d$ by

$$ \mathrm{sm} : \mathbb{R}^d \to S_d, \qquad \mathrm{sm}(f) = \left[ \mathbf{1}_{(d+1)}^{T} \exp\!\left(T^{T} f\right) \right]^{-1} \exp\!\left(T^{T} f\right) $$

$$ \mathrm{clr} : S_d \to \mathbb{R}^d, \qquad \mathrm{clr}(y) = T \ln(y) $$

where $T$ is a $d \times (d+1)$ matrix that satisfies

$$ T T^{T} = I_d, \qquad T \mathbf{1}_{(d+1)} = 0, \qquad T^{T} T + \frac{1}{d+1}\, \mathbf{1}_{(d+1)} \mathbf{1}_{(d+1)}^{T} = I_{(d+1)} $$

The rows of $T$ span the orthogonal complement of $\mathbf{1}_{(d+1)}$; we can always find $T$ by the Gram-Schmidt process.

$\mathrm{sm}$ is the symmetric softmax activation function; $\mathrm{clr}$ is the centered log-ratio link function.
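The mapping can be sketched numerically. This is an illustration, not the author's code; T is built with a QR factorization, which plays the role of the Gram-Schmidt process mentioned on the slide, and d = 2 is an arbitrary choice:

```python
import numpy as np

d = 2
ones = np.ones((d + 1, 1))
# QR of [1 | e1 ... ed]: the first column of Q is proportional to the ones
# vector, so the remaining columns span its orthogonal complement.
Q, _ = np.linalg.qr(np.hstack([ones, np.eye(d + 1)[:, :d]]))
T = Q[:, 1:].T                     # d x (d+1); T @ ones = 0, T @ T.T = I_d

def sm(f):
    """Symmetric softmax: R^d -> interior of the d-simplex."""
    e = np.exp(T.T @ f)
    return e / e.sum()

def clr(y):
    """Centered log-ratio link: d-simplex -> R^d."""
    return T @ np.log(y)

y = np.array([0.2, 0.3, 0.5])
round_trip = np.allclose(sm(clr(y)), y)   # the mapping is one-to-one
```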

Coordinate Transformation

[Figure: the simplex in (y1, y2) coordinates and its image under ln. The f coordinate runs along the line ln(y1) = -ln(y2); softmax is insensitive to the orthogonal all-ones direction.]

Measurement Models

- Multinomial
- Log-Normal

Measurement Model - Multinomial

Assume that the proportion vector r comes from S independent samples from the discrete probability distribution represented by the vector y:

$$ p(r \mid y) = M_S(r \mid y), \qquad M_S(r \mid y) \triangleq \frac{S!}{\prod_i \left(S [r]_i\right)!} \prod_i \left([y]_i\right)^{S [r]_i} $$

To get the likelihood function for $f = \mathrm{clr}(y)$, we take into account the Jacobian of the transformation, $\prod_i [y]_i$. The log-likelihood function corresponding to $f$ is then

$$ L(f) = (S + d + 1)\, \tilde{r}^{T} \ln(y) + \text{const}, \qquad \tilde{r} = \frac{S r + \mathbf{1}_{(d+1)}}{S + d + 1} $$
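A quick numeric check of the reconstructed log-likelihood (made-up counts; the constant term is dropped). By Gibbs' inequality, $\tilde r^T \ln(y)$ is maximized over the simplex at $y = \tilde r$, which is what the Laplace approximation on the next slides exploits:

```python
import numpy as np

S = 400
counts = np.array([80.0, 120.0, 200.0])   # invented category counts
r = counts / S                            # observed proportions
d = r.size - 1
r_tilde = (S * r + 1.0) / (S + d + 1.0)   # smoothed proportions

def log_likelihood(y):
    """(S + d + 1) * r_tilde . ln(y); Jacobian included, constant dropped."""
    return (S + d + 1) * float(r_tilde @ np.log(y))

best = log_likelihood(r_tilde)            # the maximizing y is r_tilde
```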

Multinomial Measurement Model

[Figure: binomial likelihood functions plotted against f for S = 400, for observed proportions r1 = 0, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5.]

Measurement Model - Laplace Approximation

Some regression methods assume a Gaussian measurement model. Therefore, we are tempted to approximate each multinomial measurement with a Gaussian measurement. Let's try a Laplace approximation to each measurement.

Laplace approximation:
- Find the peak of the log-likelihood function.
- Pick a Gaussian centered at the peak, with inverse covariance matrix matching the negative second derivative of the log-likelihood function at the peak.
- Pick an amplitude factor to match the height of the peak.

Measurement Model - Laplace Approximation

The value of f that maximizes the log-likelihood is

$$ m = T \ln(\tilde{r}) $$

The Laplace approximation to a single measurement is

$$ p(f) = a\, N(f \mid m, V) = a\, |2\pi V|^{-\frac{1}{2}} \exp\!\left[ -\frac{1}{2} (f - m)^{T} V^{-1} (f - m) \right] $$

where

$$ a = |2\pi V|^{\frac{1}{2}}\, \frac{S!}{\prod_i \left(S [r]_i\right)!}\, \exp[L(m)] $$

$$ V^{-1} = (S + d + 1)\, T \left[ \mathrm{Diag}(\tilde{r}) - \tilde{r}\tilde{r}^{T} \right] T^{T} $$
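The Laplace parameters m and V can be computed directly from the formulas above; a sketch with invented counts (T is rebuilt by QR, as a stand-in for Gram-Schmidt):

```python
import numpy as np

S, counts = 400, np.array([80.0, 120.0, 200.0])   # invented counts
r = counts / S
d = r.size - 1
r_tilde = (S * r + 1.0) / (S + d + 1.0)

# T: rows form an orthonormal basis of the complement of the ones vector.
Q, _ = np.linalg.qr(np.hstack([np.ones((d + 1, 1)), np.eye(d + 1)[:, :d]]))
T = Q[:, 1:].T

m = T @ np.log(r_tilde)                            # peak of the log-likelihood
V_inv = (S + d + 1) * T @ (np.diag(r_tilde) - np.outer(r_tilde, r_tilde)) @ T.T
V = np.linalg.inv(V_inv)                           # covariance of the Gaussian
```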

Laplace Approximation to Multinomial

[Figures: Laplace approximation vs. exact multinomial likelihood, plotted as p(f) against f, for r1 = 0/400, 1/400, 2/400, 4/400, 80/400, and 120/400.]

Measurement Model - Log-Normal

- General log-normal model form: $p(f) = a\, N(f \mid m, V)$.
- Can match the Laplace approximation to the multinomial.
- Can do much more, e.g. model over-dispersion or under-dispersion.
- Basis for the regression methods.

Regression Models

- A way of relating data taken under different conditions.
- Intuition: similar conditions should produce similar data.
- The best method to use depends on the problem.
- Two methods considered here: the Nadaraya-Watson model and the Gaussian process model.

Nadaraya-Watson Model

Based on applying Parzen density estimation to the joint distribution of f and x.

General form:

$$ p(f, x) = \sum_{j=1}^{J} w_j\, p(f, x \mid j) $$

Simplified model:

$$ p(f, x \mid j) = N\!\left(f \mid \hat{f}_j, B_j\right) N\!\left(x \mid x_j, D_j\right) $$

All Data Points

[Figure: training data points in the (x, f) plane.]

Nadaraya-Watson Model

[Figure: the Nadaraya-Watson mixture fitted over the data in the (x, f) plane.]

Nadaraya-Watson Model

This model implies that

$$ p(x) = \sum_{j=1}^{J} w_j\, p(x \mid j), \qquad p(x \mid j) = N(x \mid x_j, D_j) $$

$$ p(f \mid x) = \frac{p(f, x)}{p(x)} = \sum_{j=1}^{J} c_j(x)\, N\!\left(f \mid \hat{f}_j, B_j\right), \qquad c_j(x) = \frac{w_j\, p(x \mid j)}{p(x)} $$

Nadaraya-Watson Model

To determine the distribution for a new measurement, we compute

$$ p(r \mid x) = \int p(r \mid f)\, p(f \mid x)\, df = \sum_{j=1}^{J} c_j(x) \int p(r \mid f)\, N\!\left(f \mid \hat{f}_j, B_j\right) df $$

If we use the Laplace approximation to the multinomial, we can solve the integrals analytically to get

$$ p(r \mid x) = a \sum_{j=1}^{J} c_j(x)\, N\!\left(m \mid \hat{f}_j, B_j + V\right) $$

where m and V are computed from r as described above. Otherwise, we can use stochastic integration to compute the integrals.
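The mixture form of p(f | x) can be sketched in one latent dimension (all centers, widths, and weights below are invented):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])     # component centers x_j
fs = np.array([-1.0, 0.0, 1.5])    # component means f_hat_j
w = np.array([0.3, 0.4, 0.3])      # mixture weights w_j
D, B = 0.5, 0.1                    # shared x- and f-variances

def normal_pdf(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive_mean(x):
    """E[f | x] under p(f | x) = sum_j c_j(x) N(f | f_hat_j, B)."""
    c = w * normal_pdf(x, xs, D)   # w_j p(x | j)
    c = c / c.sum()                # c_j(x) = w_j p(x | j) / p(x)
    return float(c @ fs)
```

Near x = 0 the prediction is pulled toward f_hat_1 = -1; near x = 2 it is pulled toward f_hat_3 = 1.5, reflecting the "similar conditions, similar data" intuition.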

Nadaraya-Watson Model

Problem: we must compare a new point to every training point.

Solution:
- Choose a sparse set of "knots", and center density components only on the knots.
- Adjust weights and covariances by "diagnostic training".
- Mixture-model training tools apply.

Sparse Nadaraya-Watson Model

[Figure: a sparse set of mixture components (knots) covering the data in the (x, f) plane.]

Gaussian Process Model

- A probability distribution on functions.
- Specified by a mean function m(x) and a covariance kernel k(x1, x2).
- For any finite collection of points, the corresponding function values are jointly Gaussian.
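The defining property — jointly Gaussian values at any finite set of points — can be illustrated with a small sketch (zero mean and a squared-exponential kernel are assumed here; the slides do not fix a kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([0.0, 0.5, 1.0, 2.0])          # any finite collection of points

def kernel(x1, x2, length=1.0):
    """Squared-exponential covariance kernel k(x1, x2)."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

K = kernel(xs[:, None], xs[None, :]) + 1e-9 * np.eye(xs.size)  # jitter
sample = rng.multivariate_normal(np.zeros(xs.size), K)  # one draw of f at xs
```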

Gaussian Process Model

[Figure: a Gaussian process fit to the data in the (x, f) plane.]

Applying Gaussian Process Regression to Proportion Data

- Prior: model each component of f(x) as a zero-mean Gaussian process with covariance kernel k(x1, x2). Assume that the components of f are independent of each other.
- Posterior: use the Laplace approximations to the measurements and apply Kalman filter methods.
- Use expectation propagation to improve the fit.

Sparse Gaussian Process Model

Choose a subset of K training points to act as knots. Rearrange the latent function values at the knots into one large vector g:

$$ [g]_{(l-1)K + k} \triangleq [f(x_k)]_l, \qquad k \in \{1, 2, \ldots, K\},\; l \in \{1, 2, \ldots, d\} $$

so that g stacks the rows of

$$ \begin{bmatrix} [f(x_1)]_1 & [f(x_2)]_1 & \cdots & [f(x_K)]_1 \\ [f(x_1)]_2 & [f(x_2)]_2 & \cdots & [f(x_K)]_2 \\ \vdots & \vdots & \ddots & \vdots \\ [f(x_1)]_d & [f(x_2)]_d & \cdots & [f(x_K)]_d \end{bmatrix} $$

Sparse Gaussian Process Model

Under our assumptions, the prior is $p(g) = N(g \mid 0, G)$, where

$$ G \triangleq I_d \otimes C = \begin{bmatrix} C & 0 & \cdots & 0 \\ 0 & C & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C \end{bmatrix}, \qquad [C]_{jk} \triangleq k\!\left(x_j, x_k\right), \quad j, k \in \{1, 2, \ldots, K\} $$
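The Kronecker structure of G is easy to verify numerically (knot locations, d, and the squared-exponential kernel are invented for the sketch):

```python
import numpy as np

knots = np.array([0.0, 1.0, 2.5])            # K = 3 knot locations
d = 2                                        # latent dimension

def kernel(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

C = kernel(knots[:, None], knots[None, :])   # K x K Gram matrix at the knots
G = np.kron(np.eye(d), C)                    # I_d (x) C: d diagonal blocks of C
```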

Sparse Gaussian Process Model

$$ p(f(x) \mid g) = N\!\left[ f(x) \mid H(x)\, g,\; v(x)\, I_d \right] $$

where

$$ H(x) \triangleq I_d \otimes \left[ k(x)^{T} C^{-1} \right], \qquad v(x) \triangleq k(x, x) - k(x)^{T} C^{-1} k(x), \qquad [k(x)]_j \triangleq k(x, x_j), \quad j \in \{1, 2, \ldots, K\} $$

We can express this by the equation

$$ f(x) = H(x)\, g + u(x) $$

where $u(x) \sim N[0, v(x) I_d]$ and $u(x)$ is independent of g.
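H(x) and v(x) in code (same invented knots and kernel as the sketch above; an illustration, not the author's implementation):

```python
import numpy as np

knots = np.array([0.0, 1.0, 2.5])            # K = 3 knot locations
d = 2

def kernel(x1, x2, length=1.0):
    return np.exp(-0.5 * (x1 - x2) ** 2 / length ** 2)

C = kernel(knots[:, None], knots[None, :])
C_inv = np.linalg.inv(C)

def H(x):
    """I_d (x) [k(x)^T C^{-1}]: maps knot values g to E[f(x) | g]."""
    k = kernel(x, knots)
    return np.kron(np.eye(d), k @ C_inv)

def v(x):
    """Residual variance k(x, x) - k(x)^T C^{-1} k(x)."""
    k = kernel(x, knots)
    return float(kernel(x, x) - k @ C_inv @ k)
```

At a knot, v(x) vanishes and H(x) simply selects the corresponding entries of g, matching the remark on the next slide.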

Sparse Gaussian Process Model

In particular, the values of the latent function at the training points can be expressed as

$$ f_n = H_n g + u_n $$

where $H_n = H(x_n)$ and $u_n = u(x_n)$. To simplify computations, we assume that $u_n$ is independent of $u_m$ for $n \neq m$.

Note that if $x_n$ is one of the knots, i.e., $n \leq K$, then $u_n = 0$ and $H_n$ is a $d \times dK$ sparse matrix that simply selects the appropriate elements of g.

GP - Log-Normal Model

Using the log-normal measurement model,

$$ p(r_n \mid g) = \int a_n\, N(f \mid m_n, V_n)\, N(f \mid H_n g, v_n I)\, df = a_n\, N(m_n \mid H_n g, R_n) $$

where $R_n = V_n + v_n I$. Thus everything is Gaussian, and therefore $p(g \mid \mathcal{T}) = N(g \mid \hat{g}, P)$, where $\mathcal{T}$ denotes the training data.

GP - Log-Normal Model

We can determine $\hat{g}$ and P by the Kalman filter algorithm:

(1) Start with

$$ \hat{g} \leftarrow 0, \qquad P \leftarrow G $$

(2) For n = 1 to N, iterate

$$ K_n \leftarrow P H_n^{T} \left( H_n P H_n^{T} + R_n \right)^{-1} $$
$$ \hat{g} \leftarrow \hat{g} + K_n \left( m_n - H_n \hat{g} \right) $$
$$ P \leftarrow P - K_n H_n P $$

If we believe that the log-normal measurement model is correct, then we are finished after one pass through all the training data.
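One pass of the filter on toy numbers (the prior, H_n, m_n, and R_n below are all invented; in the deck's setting, G comes from the kernel and H_n, m_n, R_n from the measurement model):

```python
import numpy as np

G = np.eye(3)                    # toy prior covariance over g
g_hat = np.zeros(3)
P = G.copy()

# Two toy measurements (H_n, m_n, R_n).
measurements = [
    (np.array([[1.0, 0.0, 0.0]]), np.array([0.5]), np.array([[0.1]])),
    (np.array([[0.0, 1.0, 0.0]]), np.array([-0.2]), np.array([[0.2]])),
]
for H_n, m_n, R_n in measurements:
    K_n = P @ H_n.T @ np.linalg.inv(H_n @ P @ H_n.T + R_n)
    g_hat = g_hat + K_n @ (m_n - H_n @ g_hat)
    P = P - K_n @ H_n @ P
```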

GP - Log-Normal Model

We can compute the evidence by

$$ a = p(\mathcal{T}) = \left[ \prod_{n=1}^{N} N(0 \mid m_n, R_n) \right] N(0 \mid 0, G)\, \left[ N(0 \mid \hat{g}, P) \right]^{-1} $$

We can determine the probability distribution of seeing a new measurement r at x by

$$ p(r \mid x, \mathcal{T}) = a\, N\!\left[ m \mid H(x)\, \hat{g},\; V + v(x) I + H(x)\, P\, H(x)^{T} \right] $$

GP - Multinomial Model

If we believe that the measurement model is really multinomial, we can get a more accurate approximation using the Expectation Propagation (EP) algorithm.

As before, we approximate the joint distribution $p(r_1, r_2, \ldots, r_N, g)$ by the form

$$ q(g) = \prod_{n} b_n\, N(H_n g \mid m_n, R_n)\, N(g \mid 0, G) $$

Now our aim is to adjust the $b_n$'s, $m_n$'s, and $R_n$'s to minimize the Kullback-Leibler divergence

$$ D(p \,\|\, q) = -\int \ln\!\left( \frac{q(g)}{p(g)} \right) p(g)\, dg $$

Expectation Propagation Method

To minimize $D(p \,\|\, q)$, we iteratively choose a measurement n and minimize $D(p_n^* \,\|\, q_n^*)$, where

$$ p_n^*(g) = \frac{p(r_n \mid g)}{b_n\, N(H_n g \mid m_n, R_n)}\, q(g) $$

$$ q_n^*(g) = \frac{b_n^*\, N(H_n g \mid m_n^*, R_n^*)}{b_n\, N(H_n g \mid m_n, R_n)}\, q(g) $$

We can accomplish this by choosing $b_n^*$, $m_n^*$, and $R_n^*$ so that the moments of $q_n^*(g)$ match those of $p_n^*(g)$.

Expectation Propagation Method

To approximate the moments, we compute

$$ b_n^* \approx \frac{1}{M} \sum_{i=1}^{M} \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} $$

$$ \hat{h} \approx \frac{1}{b_n^*} \frac{1}{M} \sum_{i=1}^{M} h^{(i)}\, \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} $$

$$ W \approx \frac{1}{b_n^*} \frac{1}{M} \sum_{i=1}^{M} h^{(i)} h^{(i)T}\, \frac{p\!\left(r_n \mid h^{(i)}\right)}{N\!\left(h^{(i)} \mid m_n, R_n\right)} - \hat{h}\hat{h}^{T} $$

where the samples are drawn from

$$ h^{(i)} \sim N\!\left( H_n \hat{g},\; H_n P H_n^{T} \right) $$

Expectation Propagation Method

To get $q_n^*$ to have the same moments as $p_n^*$, we choose

$$ R_n^{*-1} = R_n^{-1} + W^{-1} - \left( H_n P H_n^{T} \right)^{-1} $$

$$ m_n^* = R_n^* \left[ R_n^{-1} m_n + W^{-1} \hat{h} - \left( H_n P H_n^{T} \right)^{-1} H_n \hat{g} \right] $$
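In one dimension the moment-matching update reduces to scalar arithmetic; a sketch with invented values (R and m are the current site parameters, HPH stands for $H_n P H_n^T$, Hg for $H_n \hat g$, and h_hat, W are the Monte Carlo moments):

```python
# Current site parameters and posterior quantities (all invented scalars).
R, m = 0.5, 0.1        # site variance R_n and site mean m_n
HPH, Hg = 0.8, 0.0     # H_n P H_n^T and H_n g_hat
h_hat, W = 0.3, 0.4    # matched moments of the tilted distribution p*_n

# Moment-matching update from the slide, specialized to scalars.
R_star = 1.0 / (1.0 / R + 1.0 / W - 1.0 / HPH)
m_star = R_star * (m / R + h_hat / W - Hg / HPH)
```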

Expectation Propagation Method

If n is one of the knots,

$$ p\!\left(r_n \mid h^{(i)}\right) = M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)}\right) \right) $$

Otherwise, we approximate it by

$$ p\!\left(r_n \mid h^{(i)}\right) = \int M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)} + u\right) \right) N(u \mid 0, v_n I)\, du \approx \frac{1}{M'} \sum_{j=1}^{M'} M_S\!\left( r_n \mid \mathrm{sm}\!\left(h^{(i)} + u^{(j)}\right) \right), \qquad u^{(j)} \sim N(0, v_n I) $$

Expectation Propagation Method

Now we can update the smoother parameters. If $R_n^{*-1} = R_n^{-1}$, then the error covariance P does not change, and we update the estimate of g by

$$ \hat{g} \leftarrow \hat{g} + P H_n^{T} R_n^{-1} \left( m_n^* - m_n \right) $$

Otherwise, we use

$$ R_\Delta \leftarrow \left( R_n^{*-1} - R_n^{-1} \right)^{-1} $$
$$ K_n \leftarrow P H_n^{T} \left( H_n P H_n^{T} + R_\Delta \right)^{-1} $$
$$ P \leftarrow P - K_n H_n P $$
$$ \hat{g} \leftarrow \hat{g} + K_n \left[ R_\Delta \left( R_n^{*-1} m_n^* - R_n^{-1} m_n \right) - H_n \hat{g} \right] $$

Expectation Propagation Method

Finally, we replace the parameters for measurement n,

$$ b_n \leftarrow b_n^*, \qquad m_n \leftarrow m_n^*, \qquad R_n \leftarrow R_n^* $$

and go to the next iteration.

Choosing the Regression Model

If you have two samplings taken under the same conditions, do you want to treat them as coming from a bimodal distribution (NW model) or combine them into one big sampling (GP model)?

Conclusion

- A coordinate transformation makes it possible to analyze proportion data with known regression methods.
- The multinomial distribution can be well approximated by a Gaussian on the transformed variable.
- The choice of regression model depends on the effect that you want: multimodal vs. unimodal fit.