
Nonparametric and Semiparametric Models

Wolfgang Härdle
Marlene Müller
Stefan Sperlich
Axel Werwatz

Institut für Statistik und Ökonometrie
CASE - Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin

Outline

• Introduction

• Part I. Nonparametric Function Estimation
  • Histogram
  • Nonparametric Density Estimation
  • Nonparametric Regression

• Part II. Generalized and Semiparametric Models
  • Semiparametric Regression (Overview)
  • Single Index Models
  • Generalized Partial Linear Models
  • Additive Models
  • Generalized Additive Models

Introduction 1-1

Chapter 1: Introduction

Introduction 1-2

Linear Regression

E(Y|X) = X1β1 + . . . + Xdβd = X⊤β

Example: Wage equation

Y = log wages, X = schooling (measured in years), labor market experience (measured as AGE − SCHOOL − 6) and experience squared.

E(Y | SCHOOL, EXP) = β1 + β2 SCHOOL + β3 EXP + β4 EXP².

CPS 1985, n = 534


Introduction 1-3

coefficient estimates for the wage equation:

Dependent Variable: Log Wages

Variable    Coefficients   S.E.     t-values
SCHOOL       0.0898        0.0083   10.788
EXP          0.0349        0.0056    6.185
EXP2        −0.0005        0.0001   −4.307
constant     0.5202        0.1236    4.209

R² = 0.24, sample size n = 534

Table 1: Results from OLS estimation


Introduction 1-4

[Figure: Wage <-- Schooling (x1) and Wage <-- Experience (x2) profiles]

Figure 1: wage-schooling profile and wage-experience profile

SPMcps85lin

Introduction 1-5

[Figure: Wage <-- Schooling, Experience — surface over Schooling and Experience]

Figure 2: Parametrically estimated regression function

SPMcps85lin

Introduction 1-6

Linear regression

E(Y | SCHOOL, EXP) = const + β1 SCHOOL + β2 EXP

Nonparametric Regression

E(Y | SCHOOL, EXP) = m(SCHOOL, EXP)

• m(•) is a smooth function

Introduction 1-7

[Figure: Wage <-- Schooling, Experience — surface over Schooling and Experience]

Figure 3: Nonparametrically estimated regression function

SPMcps85reg

Introduction 1-8

Example: Engel curve

Engel (1857)

..., daß je ärmer eine Familie ist, ein desto größerer Antheil von der Gesammtausgabe muß zur Beschaffung der Nahrung aufgewendet werden.

(The poorer a family, the bigger the share of total expendituresthat has to be used for food.)


Introduction 1-9

[Figure: Engel curve — Food share vs. Net-income]

Figure 4: Engel curve, U.K. Family Expenditure Survey 1973

SPMengelcurve2

Introduction 1-10

Example: Income distribution changes?

data: UK net-income 1969–1983 (Family Expenditure Survey)

[Figure: log-normal density estimates and kernel density estimates, 1971-1981]

Figure 5: Log-normal and kernel density estimates of net-income

SPMfesdensities

Introduction 1-11

Example: Foreign exchange rates volatility

data: DM/US Dollar exchange rate St (October 1992 until September 1993), returns Yt = ln(St/St−1)

model: Yt = m(Yt−1) + σ(Yt−1) ξt

where

• m(•) is the mean function

• σ(•) is the volatility function

Introduction 1-12

[Figure: FX rate DM/US$ (10/92-09/93) and FX returns DM/US$ (10/92-09/93)]

Figure 6: DM/US Dollar exchange rate and returns

SPMfxrate

Introduction 1-13

[Figure: FX volatility function — estimated variance vs. y(t−1)]

Figure 7: The estimated variance function for DM/US Dollar with uniform confidence bands

SPMfxvolatility

Introduction 1-14

Nonparametric regression

E (Y |SCHOOL, EXP) = m(SCHOOL, EXP)

Semiparametric Regression

E (Y |SCHOOL, EXP) = α + g1(SCHOOL) + g2(EXP)

• g1(•), g2(•) are smooth functions

Introduction 1-15

Example: Wage equation

[Figure: Wage <-- Schooling (left) and Wage <-- Experience (right)]

Figure 8: Additive model fit vs. parametric fit, wage-schooling (left) and wage-experience (right) profiles

SPMcps85add

Introduction 1-16

[Figure: Wage <-- Schooling, Experience — surface over Schooling and Experience]

Figure 9: Surface plot for the additive model

SPMcps85add

Introduction 1-17

Example: Binary choice model

Y = 1 if the person intends to move to the west, 0 otherwise.

⟹ E(Y|x) = P(Y = 1|x) = G(β⊤x)

typically: logistic link function (logit model)

E(Y|x) = P(Y = 1|x) = 1 / (1 + exp(−β⊤x))

Introduction 1-18

[Figure: logit model — link function and responses vs. index]

Figure 10: Logit model for migration

SPMlogit

Introduction 1-19

Heteroscedasticity problems

binary choice model with

Var(ε) = (1/4) {1 + (β⊤x)²}² · Var(u),

where u has a (standard) logistic distribution

Introduction 1-20

[Figure: true versus logit link — G(Index) vs. index]

Figure 11: The link function of a homoscedastic logit model (thin) vs. a heteroscedastic model (solid)

SPMtruelogit

Introduction 1-21

Wrong link consequences

[Figure: true ratio vs. sampling distribution of the estimated coefficient ratio]

Figure 12: Sampling distribution of the ratio of estimated coefficients and the ratio's true value

SPMsimulogit

Introduction 1-22

[Figure: single index model — link function and responses vs. index]

Figure 13: Single index model for migration

SPMsim

Introduction 1-23

Example: Implied volatility surface (IVS)

The Black-Scholes (BS) formula values European call options:

Ct^BS = St Φ(d1) − K e^{−rτ} Φ(d2)

d1 = { log(St/K) + (r + σ²/2)τ } / (σ√τ),   d2 = d1 − σ√τ

Φ   CDF of the standard normal distribution
r   interest rate
St  asset price at time t
K   strike price
τ = T − t   time to maturity with maturity date T

Introduction 1-24

IVS Example (cont.)

implied volatility σ:  Ct − Ct^BS(St, K, r, τ, σ) = 0

empirical IVS by using a nonparametric method (IVS2001.mov)

[Figure: implied volatility surface]

Figure 14: t = 20020516 ODAX

Introduction 1-25

Example: Inner-German Migration


Introduction 1-26

Y = 1 if the person intends to move to the west, 0 otherwise.

E(Y|X) = P(Y = 1|X) = G(β⊤X),

where X is a vector of personal features and f(X) = G(β⊤X) the related parameter.

This leads to the log-likelihood

L(β) = ∑_{i=1}^n [ Yi log{ G(β⊤Xi) / (1 − G(β⊤Xi)) } + log{1 − G(β⊤Xi)} ].

Introduction 1-27

The choice of the logistic link function G(u) = (1 + e^{−u})^{−1} (logit model) corresponds to the canonical parametrization:

L(β) = ∑_{i=1}^n { Yi β⊤Xi − log(1 + e^{β⊤Xi}) }.

Introduction 1-28

influence of household income

[Figure: m(income) vs. income — semiparametric fit and parametric fit]

Figure 15: Estimated influence of income f(t)

Introduction 1-29

[Figure]

Figure 16: Logit model for migration

Introduction 1-30

Summary: Introduction

• Parametric models are fully determined up to a parameter (vector). They lead to an easy interpretation of the resulting fits.

• Nonparametric models only impose qualitative restrictions, like a smooth density function or a smooth regression function m. They may support known parametric models, and their flexibility opens the way for new models.

• Semiparametric models combine parametric and nonparametric parts. This keeps the easy interpretation of the results, but gives more flexibility in some aspects of the model.

Histogram 2-1

Chapter 2: Histogram

Histogram 2-2

Histogram

X continuous random variable with probability density function f

X1, X2, . . . ,Xn i.i.d. observations

aim: estimate distribution (density function), display it graphically

[Figure: histogram, X vs. Y]

Figure 17: Histogram for exponential data

SPMsimulatedexponential

Histogram 2-3

Example: Buffalo Snowfall Data

[Figure: histogram of Buffalo snowfall]

Figure 18: Winter snowfall (in inches) at Buffalo, New York, 1910/11 to 1972/73 (n = 63)

SPMbuffadata

Histogram 2-4

Construction of the Histogram

(a) divide range (support of f ) into bins

Bj : [x0 + (j − 1)h, x0 + jh), j ∈ Z,

with origin x0 and binwidth h

Example:

bins: [x0, x0 + h), [x0 + h, x0 + 2h), [x0 + 2h, x0 + 3h), . . .

(b) count observations in each Bj (=: nj)

(c) normalize to 1: fj = nj/(nh) (relative frequencies, divided by h)

(d) draw bars with height fj for bin Bj
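A minimal sketch of steps (a)-(d) in Python; the function name and the exponential sample are illustrative, not part of the slides:

import numpy as np

def histogram_density(x, origin, h):
    """Histogram density estimate: bin counts normalized by n*h."""
    x = np.asarray(x)
    n = len(x)
    # (a) bins B_j = [x0 + (j-1)h, x0 + jh): bin index of each observation
    j = np.floor((x - origin) / h).astype(int)
    # (b) count observations n_j in each bin
    counts = np.bincount(j - j.min())
    # (c) normalize: f_j = n_j / (n h)
    heights = counts / (n * h)
    # (d) bar heights over the corresponding bin edges
    edges = origin + (np.arange(len(counts) + 1) + j.min()) * h
    return edges, heights

# usage: exponential data as in Figure 17
rng = np.random.default_rng(0)
edges, heights = histogram_density(rng.exponential(size=200), origin=0.0, h=0.2)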


Histogram 2-5

[Figure: Buffalo snowfall data and bin grid]

Figure 19: Buffalo snowfall data and bin grid, binwidth h = 10 and origin x0 = 20

SPMbuffagrid

Histogram 2-6

[Figure: histogram of the Buffalo snowfall data]

Figure 20: Binwidth h = 10 and origin x0 = 20

SPMbuffahisto

Histogram 2-7

Example: Histogram of stock returns

[Figure: histogram of stock returns]

Figure 21: Stock returns, binwidth h = 0.02 and origin x0 = 0

SPMstockreturnhisto

Histogram 2-8

Motivation of the histogram

[Figure: construction of the histogram — density f over data x]

Figure 22: Approximation of the area under the density over an interval by erecting a rectangle over the interval

Histogram 2-9

Mathematical Motivation

• with mj = center of the jth bin, it holds

P(X ∈ Bj) = ∫_{Bj} f(u) du ≈ f(mj) · h

• approximation by

P(X ∈ Bj) ≈ (1/n) #{Xi ∈ Bj}

⟹ fh(mj) = 1/(nh) #{Xi ∈ Bj}

Histogram 2-10

Histogram as a Maximum-Likelihood Estimator

likelihood function:

L = ∏_{i=1}^n f(xi)

support of the random variable

X ∈ [x̲, x̄]

divide the support into d bins

Bj = [x̲ + (j − 1)h, x̲ + jh),  j ∈ {1, . . . , d}

with length

h = (x̄ − x̲)/d

Histogram 2-11

if the density function is a step function

f(x) = ∑_{j=1}^d fj I(x ∈ Bj)

with

∫_{x̲}^{x̄} f(x) dx = 1,

then it follows that

∑_{j=1}^d fj h = 1

Histogram 2-12

the likelihood function is

L = ∏_{j=1}^d fj^{nj}

with constraint

∑_{j=1}^d fj h = 1

hence, maximize

L(f1, . . . , fd) = ∑_{j=1}^d nj log fj + λ (1 − ∑_{j=1}^d h fj)

Histogram 2-13

the first order conditions are

nj = λ h fj ,  j = 1, . . . , d    (2.1)

∑_{j=1}^d fj h = 1    (2.2)

Summing over the d equations in (2.1) and using (2.2) gives

n = λ,

thus, it follows from (2.1) that

fj = nj/(nh)

Histogram 2-14

Example: Two-step densities

consider

F = { f : f = f1 I(x ∈ B1) + f2 I(x ∈ B2), ∫ f = 1, h = 1 }

L = ∏_{i=1}^n f(xi) = f1^{n1} · f2^{n2}

log L = n1 log f1 + n2 log f2 = n1 log f1 + n2 log(1 − f1)

setting the derivative to zero:

n1/f1 − n2/(1 − f1) = 0  ⟺  n1(1 − f1) = n2 f1
⟺  n1 = f1(n1 + n2)
⟺  f1 = n1/(n1 + n2) = n1/n = #{Xi ∈ B1}/n.

Histogram 2-15

(Asymptotic) Statistical Properties

for Xi (i = 1, . . . , n) ∼ f (density) we have

fh(x) = 1/(nh) ∑_{i=1}^n ∑_j I(Xi ∈ Bj) I(x ∈ Bj)

• consistency: fh(x) → f(x) in probability if bias and variance → 0

• notice that fh is constant within a bin

Histogram 2-16

Bias, Variance and MSE

for f(x), x fixed, mj = center of the bin in which x falls

Bias:  E{fh(x)} − f(x) ≈ f′(mj) · (mj − x)

Variance:  Var{fh(x)} ≈ 1/(nh) f(mj)

Mean Squared Error (MSE):

MSE{fh(x)} = E{fh(x) − f(x)}²

MSE(x) = Var + Bias² ≈ 1/(nh) f(mj) + {f′(mj)}² (mj − x)²

Histogram 2-17

Expectation of the histogram

E{fh(x)} = 1/(nh) ∑_{i=1}^n E{I(Xi ∈ Bj)},  x ∈ Bj
         = 1/(nh) · n E{I(Xi ∈ Bj)}
         = (1/h) ∫_{(j−1)h}^{jh} f(u) du
         = (1/h) · bin probability

Histogram 2-18

Bias of the histogram

in general

E{fh(x)} − f(x) = (1/h) ∫_{(j−1)h}^{jh} f(u) du − f(x) ≠ 0

normally, we have

(1/h) ∫_{(j−1)h}^{jh} f(u) du ≠ f(x)

∫_{(j−1)h}^{jh} f(u) du   (bin probability)   ≈   h · f(x)   (rectangle)

Histogram 2-19

[Figure: bias of the histogram under a normal density]

Figure 23: Binwidth h = 2, x = −1 (bin center), true distribution is normal

SPMhistobias2

Histogram 2-20

approximation of the bias using a Taylor expansion:

(1/h) ∫_{Bj} {f(u) − f(x)} du = (1/h) ∫_{Bj} {f′(mj)(u − x) + O(h)} du
                             ≈ (1/h) ∫_{Bj} f′(mj)(u − x) du
                             ≈ f′(mj) {mj − x} = O(h)

since f(u) − f(x) ≈ {f(mj) + f′(mj)(u − mj)} − {f(mj) + f′(mj)(x − mj)}

Histogram 2-21

Variance of the histogram

Var{fh(x)} = 1/(n²h²) · n Var{I(Xi ∈ Bj)}
           = (Bernoulli)  1/(n²h²) · n ∫_{Bj} f(u) du ( 1 − ∫_{Bj} f(u) du )
           = 1/(nh) ( (1/h) ∫_{Bj} f(u) du ) · {1 − O(h)}
           ≈ 1/(nh) f(x) + O(1/(nh)) ≈ 1/(nh) f(x)

Histogram 2-22

Mean squared error (MSE)

MSE{fh(x)} = E{fh(x) − f(x)}²

MSE{fh(x)} = variance + bias² ≈ 1/(nh) f(mj) + {f′(mj)}² (mj − x)²

if we let h → 0 and nh → ∞ with (mj − x) ≤ h/2, we obtain

MSE{fh(x)} → 0

Histogram 2-23

[Figure: Bias², variance and MSE as functions of the bandwidth h]

Figure 24: Bias² (thin solid line), variance (dashed line) and MSE (thick line) for the histogram (at x = 0.5)

SPMhistmse

Histogram 2-24

MISE - a Global Quality Measure

MISE = E[ ∫_{−∞}^{∞} {fh(x) − f(x)}² dx ] = ∫_{−∞}^{∞} E[{fh(x) − f(x)}²] dx
     = ∫_{−∞}^{∞} MSE[fh(x)] dx

two interpretations:

• average global error

• accumulated pointwise error

Histogram 2-25

AMISE

Using the approximation for the MSE

MSE{fh(x)} ≈ 1/(nh) f(x) + f′(mj)² (mj − x)²

we derive the asymptotic MISE (AMISE):

MISE = ∫_{−∞}^{∞} MSE[fh(x)] dx ≈ 1/(nh) + (h²/12) ∫_{−∞}^{∞} f′(x)² dx = AMISE

Histogram 2-26

derivation of the AMISE:

MISE ≈ ∫_{−∞}^{∞} [ 1/(nh) f(x) + ∑_j I(x ∈ Bj) f′(mj)² (mj − x)² ] dx
     ≈ 1/(nh) ∫_{−∞}^{∞} f(x) dx + ∑_j f′(mj)² ∫_{Bj} (mj − x)² dx
     ≈ 1/(nh) + ∑_j f′(mj)² ∫_{(j−1)h}^{jh} { (j − 1/2)h − x }² dx
     ≈ 1/(nh) + ∑_j f′(mj)² (2/3)(h/2)³
     ≈ 1/(nh) + (h²/12) ∑_j h f′(mj)²

Histogram 2-27

Example: Another approximation

define g(x) := f′(x)² and use

g(x) ≈ g(mj) + g′(mj)(x − mj)

it follows that

∫_{Bj} g(x) dx ≈ ∫_{Bj} g(mj) dx + ∫_{Bj} g′(mj)(x − mj) dx
              = g(mj) ∫_{mj−h/2}^{mj+h/2} dx + g′(mj) ∫_{mj−h/2}^{mj+h/2} (x − mj) dx
              = g(mj) h

thus

f′(mj)² h ≈ ∫_{Bj} f′(x)² dx

Histogram 2-28

using f′(mj)² h ≈ ∫_{Bj} f′(x)² dx

MISE ≈ 1/(nh) + (h²/12) ∑_j ∫_{Bj} f′(x)² dx

from

∑_j ∫_{Bj} f′(x)² dx = ∫_{−∞}^{∞} f′(x)² dx

we obtain

AMISE = 1/(nh) + (h²/12) ∫_{−∞}^{∞} f′(x)² dx

Histogram 2-29

optimization of AMISE w.r.t. h

hopt = ( 6 / (n ∫_{−∞}^{∞} f′(x)² dx) )^{1/3} = constant · n^{−1/3} ∼ n^{−1/3}

"best" convergence rate of MISE:

MISE(hopt) = 1/(n hopt) + (hopt²/12) ∫_{−∞}^{∞} f′(x)² dx + O(hopt²) + O(1/(n hopt))
           = constant · n^{−2/3} + O(constant · n^{−2/3}) + O(constant · n^{−2/3})

thus MISE(hopt) ∼ n^{−2/3}

Histogram 2-30

the solution hopt does not help us in practice, since we need

∫_{−∞}^{∞} f′(x)² dx    (2.3)

How can we obtain that value?

Possible approach:

• plug-in method: choose a density and calculate (2.3)

Histogram 2-31

Example: Normal density

Assuming a normal density with variance σ², i.e.

f(x) = 1/√(2πσ²) exp{ −(x − μ)²/(2σ²) },

we can calculate

∫_{−∞}^{∞} f′(x)² dx = 1/(σ⁴√(2πσ²)) · (1/√2) ∫_{−∞}^{∞} (x − μ)² · 1/√(2π σ²/2) exp{ −(x − μ)²/(2 · σ²/2) } dx
                     = 1/(2σ⁵√π) · σ²/2
                     = 1/(4σ³√π)

The integral term is the formula for computing the variance of a normally distributed RV with mean μ and variance σ²/2. So the optimal bandwidth is

hopt = ( 6 / (n · 1/(4σ³√π)) )^{1/3} = 3.5 σ n^{−1/3}
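A small Python sketch of this plug-in binwidth; the helper name is hypothetical and σ is simply replaced by the sample standard deviation:

import numpy as np

def binwidth_normal_reference(x):
    """Plug-in binwidth h_opt = 3.5 * sigma * n^(-1/3) under a normal reference density."""
    x = np.asarray(x)
    return 3.5 * x.std(ddof=1) * len(x) ** (-1 / 3)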

Histogram 2-32

[Figure: four histograms with h = 0.007, 0.02, 0.05, 0.1; origin x0 = 0]

Figure 25: Four histograms for the stock returns data with binwidths h = 0.007, h = 0.02, h = 0.05, h = 0.1; origin x0 = 0

SPMhisdiffbin

Histogram 2-33

[Figure: four histograms with origins x0 = 0, 0.01, 0.02, 0.03; binwidth h = 0.04]

Figure 26: Four histograms for the stock returns data corresponding to different origins: x0 = 0, x0 = 0.01, x0 = 0.02, x0 = 0.03; binwidth h = 0.04

SPMhisdiffori

Histogram 2-34

Drawbacks of the Histogram

• constant over intervals, step function

• results depend strongly on the origin

• binwidth choice

• (slow rate of convergence)

solution to the strong dependence on the origin x0:

• ASH = Averaged Shifted Histogram

Histogram 2-35

Average Shifted Histogram (ASH)

having a histogram with origin x0 = 0, i.e.

Bj = [(j − 1)h, jh),  j ∈ Z,

we generate M − 1 new bin grids by shifting:

Bj,l := [(j − 1 + l/M)h, (j + l/M)h),  l ∈ {1, . . . , M − 1}

the ASH is then obtained by

fh(x) = (1/M) ∑_{l=0}^{M−1} 1/(nh) ∑_{i=1}^n { ∑_j I(Xi ∈ Bj,l) I(x ∈ Bj,l) }
      = (1/n) ∑_{i=1}^n { 1/(Mh) ∑_{l=0}^{M−1} ∑_j I(Xi ∈ Bj,l) I(x ∈ Bj,l) }.
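A rough Python sketch of the ASH formula above, assuming equally spaced bin grids and deliberately simple boundary handling:

import numpy as np

def ash_density(x, h, M, grid):
    """Averaged shifted histogram: average of M histograms whose origins
    are shifted by h/M, evaluated at the points in `grid`."""
    x = np.asarray(x)
    n = len(x)
    est = np.zeros_like(grid, dtype=float)
    for l in range(M):                      # origins 0, h/M, 2h/M, ...
        origin = l * h / M
        bx = np.floor((x - origin) / h)     # bin index of each observation
        bg = np.floor((grid - origin) / h)  # bin index of each grid point
        counts = {b: np.sum(bx == b) for b in np.unique(bg)}
        est += np.array([counts.get(b, 0) for b in bg]) / (n * h)
    return est / M

# usage
rng = np.random.default_rng(1)
fhat = ash_density(rng.normal(size=500), h=0.5, M=8, grid=np.linspace(-3, 3, 121))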

Histogram 2-36

Example

{X1, . . . , X10} = {11, 12, 9, 8, 10, 14, 18, 5, 7, 3}; we take M = 5 and h = 5:

[Figure: shifted bins B4,0, B3,1, B3,2, B3,3, B3,4 over the data points 3, 5, 7, . . . , 21]

Figure 27: Bins for ASH

Histogram 2-37

Example: Stock returns

[Figure: average shifted histogram of stock returns]

Figure 28: Average shifted histogram for stock returns data; average of 8 histograms (M = 8) with binwidth h = 0.04 and origin x0 = 0

SPMashstock

Histogram 2-38

for comparison:

[Figure: ordinary histogram of stock returns]

Figure 29: Ordinary histogram for stock returns; binwidth h = 0.005; origin x0 = 0

SPMhiststock

Histogram 2-39

Summary: Histogram

• The bias of a histogram is E[fh(x) − f(x)] ≈ f′(mj)(mj − x).

• The variance of a histogram is Var[fh(x)] ≈ 1/(nh) f(x).

• The asymptotic MISE is given by AMISE(fh) = 1/(nh) + (h²/12) ‖f′‖₂².

• The optimal binwidth hopt that minimizes AMISE is

hopt = ( 6 / (n ‖f′‖₂²) )^{1/3} ∼ n^{−1/3}.

• The optimal binwidth hopt that minimizes AMISE for N(μ, σ²) is

hopt = 3.5 σ n^{−1/3}.

Kernel Density Estimation 3-1

Chapter 3: Kernel Density Estimation

Kernel Density Estimation 3-2

Kernel Density Estimation

kernel density estimate (KDE)

fh(x) = 1/(nh) ∑_{i=1}^n K((x − Xi)/h)

with kernel function K(u)

Example

K(u) = (1/2) I(|u| ≤ 1)

fh(x) = 1/(nh) ∑_{i=1}^n (1/2) I( |x − Xi|/h ≤ 1 )
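A minimal Python sketch of this estimator; the uniform kernel from the example is the default, and all names are illustrative:

import numpy as np

def kde(x_grid, data, h, kernel=lambda u: 0.5 * (np.abs(u) <= 1)):
    """KDE f_h(x) = (1/(n h)) sum_i K((x - X_i)/h) on a grid of points."""
    u = (x_grid[:, None] - data[None, :]) / h   # scaled distances, shape (grid, n)
    return kernel(u).sum(axis=1) / (len(data) * h)

# usage
rng = np.random.default_rng(2)
data = rng.normal(size=300)
fhat = kde(np.linspace(-3, 3, 200), data, h=0.4)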

Kernel Density Estimation 3-3

KDE as a generalization of the histogram

histogram: (nh)^{−1} #{Xi in interval containing x}

KDE with uniform kernel: (2nh)^{−1} #{Xi in interval around x}

fh(x) = 1/(2hn) ∑_{i=1}^n I( |x − Xi|/h ≤ 1 ) = 1/(2hn) #{Xi in interval around x}

Kernel Density Estimation 3-4

Required properties of kernels

• K(•) is a density function:

∫_{−∞}^{∞} K(u) du = 1  and  K(u) ≥ 0

• K(•) is symmetric:

∫_{−∞}^{∞} u K(u) du = 0

Kernel Density Estimation 3-5

Different Kernel Functions

Kernel          K(u)
Uniform         (1/2) I(|u| ≤ 1)
Triangle        (1 − |u|) I(|u| ≤ 1)
Epanechnikov    (3/4)(1 − u²) I(|u| ≤ 1)
Quartic         (15/16)(1 − u²)² I(|u| ≤ 1)
Triweight       (35/32)(1 − u²)³ I(|u| ≤ 1)
Gaussian        1/√(2π) exp(−u²/2)
Cosinus         (π/4) cos(πu/2) I(|u| ≤ 1)

Table 2: Kernel functions
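The kernels of Table 2 can be written down directly, e.g. as a Python dictionary of vectorized functions (a sketch; the dictionary itself is not part of the slides):

import numpy as np

KERNELS = {
    "uniform":      lambda u: 0.5 * (np.abs(u) <= 1),
    "triangle":     lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
    "quartic":      lambda u: 15 / 16 * (1 - u**2) ** 2 * (np.abs(u) <= 1),
    "triweight":    lambda u: 35 / 32 * (1 - u**2) ** 3 * (np.abs(u) <= 1),
    "gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
    "cosinus":      lambda u: np.pi / 4 * np.cos(np.pi * u / 2) * (np.abs(u) <= 1),
}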


Kernel Density Estimation 3-6

[Figure: kernel functions K(u)]

Figure 30: Some kernel functions: Uniform (top left), Triangle (bottom left), Epanechnikov (top right), Quartic (bottom right)

SPMkernel

Kernel Density Estimation 3-7

Example: Construction of the KDE

consider the KDE using a Gaussian kernel

fh(x) = 1/(nh) ∑_{i=1}^n K((x − Xi)/h) = 1/(nh) ∑_{i=1}^n 1/√(2π) exp(−u²/2)

where

u = (x − Xi)/h

Kernel Density Estimation 3-8

[Figure: construction of the kernel density — density estimate fh and weights Kh over data x]

Figure 31: Kernel density estimate as a sum of bumps

SPMkdeconstruct

Kernel Density Estimation 3-9

[Figure: four KDEs with h = 0.004, 0.015, 0.008, 0.050]

Figure 32: Four kernel density estimates for the stock returns data with different bandwidths and Quartic kernel

SPMdensity

Kernel Density Estimation 3-10

[Figure: Quartic kernel vs. Uniform kernel, h = 0.018]

Figure 33: Two kernel density estimates for the stock returns data with different kernels and h = 0.018

SPMdenquauni

Kernel Density Estimation 3-11

[Figure: Epanechnikov kernel vs. Triweight kernel, h = 0.18]

Figure 34: Different continuous kernels, net-income data from the U.K. Family Expenditure Survey

SPMdenepatri

Kernel Density Estimation 3-12

(Asymptotic) Statistical Properties

bias of the kernel density estimator

Bias{fh(x)} = E[fh(x)] − f(x)
            = E[ 1/(nh) ∑_{i=1}^n K((x − Xi)/h) ] − f(x)
            ≈ (h²/2) f″(x) μ₂(K)   for h → 0

where we define

μ₂(K) := ∫_{−∞}^{∞} u² K(u) du

Kernel Density Estimation 3-13

Derivation of the Bias Approximation

E[fh(x)] = 1/(nh) ∑_{i=1}^n E[ K((x − Xi)/h) ]
         = (1/h) E[ K((x − X)/h) ]
         = ∫_{−∞}^{∞} (1/h) K((x − z)/h) f(z) dz
         = ∫_{−∞}^{∞} K(s) f(x + sh) ds
         = (Taylor) ∫_{−∞}^{∞} K(s) { f(x) + sh f′(x) + (s²h²/2) f″(x) + O(h²) } ds

Kernel Density Estimation 3-14

we obtain

E[fh(x)] = ∫_{−∞}^{∞} K(s) { f(x) + sh f′(x) + (s²h²/2) f″(x) + O(h²) } ds
         = f(x) + (h²/2) f″(x) ∫ s² K(s) ds + O(h²)
         ≈ f(x) + (h²/2) f″(x) μ₂(K)   for h → 0

and thus

Bias{fh(x)} = E[fh(x)] − f(x) ≈ (h²/2) f″(x) μ₂(K)

Kernel Density Estimation 3-15

[Figure: kernel density bias effects — density f and bias]

Figure 35: f(x) and approximation for E[fh(x)]

SPMkdebias

Kernel Density Estimation 3-16

Variance of the Kernel Density Estimator

Var[fh(x)] = Var[ 1/(nh) ∑_{i=1}^n K((x − Xi)/h) ]
           ≈ 1/(nh) ‖K‖₂² f(x)   for nh → ∞

where we define

‖K‖₂² := ∫_{−∞}^{∞} {K(u)}² du

Kernel Density Estimation 3-17

Derivation of the Variance Approximation

Var[fh(x)] = (1/n²) Var[ (1/h) ∑_{i=1}^n K((x − Xi)/h) ]
           = (1/n) Var[ (1/h) K((x − X)/h) ]
           = (1/n) E[ (1/h²) K((x − X)/h)² ] − (1/n) { E[ (1/h) K((x − X)/h) ] }²
           = 1/(nh) ∫_{−∞}^{∞} K(s)² f(x + sh) ds − (1/n) { E[fh(x)] }²

using the Taylor expansion of f(x + sh) and ∫_{−∞}^{∞} K(s)² s ds = 0 (from the symmetry of the kernel)

Kernel Density Estimation 3-18

we obtain

Var[fh(x)] = 1/(nh) ∫_{−∞}^{∞} K(s)² f(x + sh) ds − (1/n) E[fh(x)]²
           = 1/(nh) ‖K‖₂² {f(x) + O(h)} − (1/n) {f(x) + O(h)}²
           = 1/(nh) ‖K‖₂² f(x) + O(1/(nh))
           ≈ 1/(nh) ‖K‖₂² f(x)   for nh → ∞

Kernel Density Estimation 3-19

MSE of the kernel density estimator

MSE{fh(x)} = E[ {fh(x) − f(x)}² ] ≈ (h⁴/4) f″(x)² μ₂(K)² + 1/(nh) ‖K‖₂² f(x)

for h → 0 and nh → ∞

recall the definitions

μ₂(K) := ∫_{−∞}^{∞} u² K(u) du

‖K‖₂² := ∫_{−∞}^{∞} {K(u)}² du

Kernel Density Estimation 3-20

[Figure: Bias², variance and MSE vs. bandwidth h]

Figure 36: Bias² (thin solid), variance (thin dashed) and MSE (thick blue solid) for kernel density estimate as a function of the bandwidth

SPMkdemse

Kernel Density Estimation 3-21

How to choose the bandwidth for the KDE?

find the bandwidth which minimizes the MISE

MISE{fh} = E[ ∫_{−∞}^{∞} {fh(x) − f(x)}² dx ] = ∫_{−∞}^{∞} MSE[fh(x)] dx
         = 1/(nh) ‖K‖₂² + (h⁴/4) μ₂(K)² ‖f″‖₂² + O(1/(nh)) + O(h⁴)

Kernel Density Estimation 3-22

we obtain for h → 0 and nh → ∞

AMISE{fh} = 1/(nh) ‖K‖₂² + (h⁴/4) μ₂(K)² ‖f″‖₂²

and thus

hopt = ( ‖K‖₂² / (‖f″‖₂² μ₂(K)² n) )^{1/5} ∼ n^{−1/5}

Kernel Density Estimation 3-23

Comparing MISE for histogram and KDE

histogram: MISE(hopt) ∼ n^{−2/3}

kernel density estimator: MISE(hopt) ∼ n^{−4/5}

• the error is smaller (−4/5 < −2/3) for the kernel density estimator

• the kernel density estimator has better asymptotic properties

Kernel Density Estimation 3-24

Methods for Bandwidth Selection

we are trying to estimate the optimal bandwidth hopt, which depends on ‖f″‖₂² (which is unknown)

we mainly distinguish

• plug-in methods

• jackknife methods

these methods can be compared via their asymptotic properties (the speed of their convergence to the true value of hopt)

Kernel Density Estimation 3-25

Plug-in Methods

Silverman’s rule of thumb

assuming a normal density function

f(x) = σ^{−1} φ((x − μ)/σ)

and using a Gaussian kernel K(u) = φ(u) gives

‖f″‖₂² = σ^{−5} ∫_{−∞}^{∞} φ″(z)² dz = σ^{−5} · 3/(8√π) ≈ 0.212 σ^{−5}

Kernel Density Estimation 3-26

thus

‖f″‖₂² ≈ 0.212 σ^{−5}

Silverman's rule-of-thumb bandwidth estimator is

hROT = ( ‖φ‖₂² / (‖f″‖₂² μ₂²(φ) n) )^{1/5} = 1.06 σ n^{−1/5}    (3.1)

Kernel Density Estimation 3-27

Better rule of thumb

Practical problem: a single outlier may cause too large an estimate of σ.

A more robust estimator is derived from the interquartile range R = X[0.75n] − X[0.25n], where we still assume X ∼ N(μ, σ²) and Z = (X − μ)/σ ∼ N(0, 1).

Inserting the quartiles of the standard normal distribution into R results in:

σ = R/1.34    (3.2)

Kernel Density Estimation 3-28

Plugging the relation (3.2) into (3.1) leads to

h0 = 1.06 · (R/1.34) · n^{−1/5} = 0.79 R n^{−1/5}

and combining this with Silverman's rule of thumb results in the "better rule of thumb":

hROT = 1.06 min{ σ, R/1.34 } n^{−1/5}.
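A Python sketch of this "better rule of thumb"; σ is estimated by the sample standard deviation and R by the empirical interquartile range:

import numpy as np

def h_rot(x):
    """h_ROT = 1.06 * min(sigma_hat, R/1.34) * n^(-1/5)."""
    x = np.asarray(x)
    sigma = x.std(ddof=1)
    R = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range
    return 1.06 * min(sigma, R / 1.34) * len(x) ** (-1 / 5)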

Kernel Density Estimation 3-29

Park and Marron’s plug-in estimator

estimate

f″h(x) = 1/(nh³) ∑_{i=1}^n K″((x − Xi)/h)

with bandwidth h (calculated using the rule of thumb)

using a bias correction, the estimate of ‖f″‖₂² is

‖f″h‖₂² − 1/(nh⁵) ‖K″‖₂²

which gives

hPM = ( ‖K‖₂² / (‖f″‖₂² μ₂(K)² n) )^{1/5}

Kernel Density Estimation 3-30

Park and Marron showed that

(hPM − hopt)/hopt = Op(n^{−4/13})

The performance of hPM in simulations is quite good. In practice, though, for small bandwidths the estimate of ‖f″‖₂² may be negative.

Kernel Density Estimation 3-31

Cross Validation

instead of estimating ‖f″‖₂², least squares cross-validation minimizes the integrated squared error

ISE(fh) = ∫_{−∞}^{∞} {fh(x) − f(x)}² dx = ‖fh‖₂² − 2 ∫_{−∞}^{∞} fh(x) f(x) dx + ‖f‖₂²

note: only the first and second terms depend on h

Kernel Density Estimation 3-32

the first term can be rewritten:

‖fh‖₂² = ∫_{−∞}^{∞} { 1/(nh) ∑_i K((x − Xi)/h) }² dx
       = 1/(n²h²) ∫_{−∞}^{∞} ∑_{i=1}^n ∑_{j=1}^n K((x − Xi)/h) K((x − Xj)/h) dx
       = 1/(n²h) ∑_{i=1}^n ∑_{j=1}^n ∫_{−∞}^{∞} K(s) K((Xi − Xj)/h + s) ds
       = 1/(n²h) ∑_{i=1}^n ∑_{j=1}^n ∫_{−∞}^{∞} K(s) K((Xj − Xi)/h − s) ds
       = 1/(n²h) ∑_{i=1}^n ∑_{j=1}^n (K ⋆ K)((Xj − Xi)/h)

Kernel Density Estimation 3-33

Example

X, Y ∼ N(0, 1) with fX = φ and fY = φ

Z = X + Y,  L(Z) = N(0, 2)

fZ(z) = 1/√(2π·2) exp{ −(1/2)(z²/2) } = (1/√2) φ(z/√2) = (φ ⋆ φ)(z)

Kernel Density Estimation 3-34

the second term can be estimated:

∫ fh(x) f(x) dx = E[fh(X)] ≈ (1/n) ∑_{i=1}^n fh,−i(Xi)
               = (1/n) ∑_{i=1}^n 1/(h(n−1)) ∑_{j≠i} K((Xi − Xj)/h)
               = 1/(n(n−1)h) ∑_{i=1}^n ∑_{j≠i} K((Xi − Xj)/h)

Kernel Density Estimation 3-35

cross-validation criterion

CV(h) = ‖fh‖₂² − 2 ∫ fh(x) f(x) dx
      = 1/(n²h) ∑_{i=1}^n ∑_{j=1}^n (K ⋆ K)((Xj − Xi)/h) − 2/(n(n−1)h) ∑_{i=1}^n ∑_{j≠i} K((Xi − Xj)/h)

we choose the bandwidth

hCV = arg min_h CV(h)

Kernel Density Estimation 3-36

Example: Gaussian kernel (K = ϕ)

CV(h) = 1/(n²h) ∑_{i=1}^n ∑_{j=1}^n (φ ⋆ φ)((Xj − Xi)/h) − 2/(n(n−1)h) ∑_{i=1}^n ∑_{j≠i} φ((Xi − Xj)/h)

Applying the Gaussian kernel and (φ ⋆ φ)(u) = (1/√2) φ(u/√2) leads to

CV(h) = 1/(2n²h√π) ∑_{i=1}^n ∑_{j=1}^n exp{ −(1/2) ((Xj − Xi)/(h√2))² }
      − 2/(n(n−1)h) · 1/√(2π) ∑_{i=1}^n ∑_{j≠i} exp{ −(1/2) ((Xi − Xj)/h)² }.
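A Python sketch of this Gaussian-kernel CV criterion, minimized numerically over h (the search bounds are arbitrary choices, not from the slides):

import numpy as np
from scipy.optimize import minimize_scalar

def cv_gaussian(h, x):
    """Least squares cross-validation CV(h) for the Gaussian kernel."""
    x = np.asarray(x)
    n = len(x)
    d = x[:, None] - x[None, :]             # pairwise differences X_i - X_j
    term1 = np.exp(-0.25 * (d / h) ** 2).sum() / (2 * n**2 * h * np.sqrt(np.pi))
    off = ~np.eye(n, dtype=bool)            # exclude the i == j terms
    term2 = np.exp(-0.5 * (d[off] / h) ** 2).sum() / (n * (n - 1) * h * np.sqrt(2 * np.pi))
    return term1 - 2 * term2

rng = np.random.default_rng(3)
x = rng.normal(size=200)
h_cv = minimize_scalar(cv_gaussian, bounds=(0.05, 2.0), args=(x,), method="bounded").x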

Kernel Density Estimation 3-37

The CV bandwidth fulfils:

ISE(hCV) / min_h ISE(h) → 1 almost surely

This is called asymptotic optimality. The result goes back to Stone (1984).

Kernel Density Estimation 3-38

Fan and Marron (1992):

√n ( (ĥ − hopt)/hopt ) →^L N(0, σf²)

σf² = (4/25) { ∫ f⁽⁴⁾(x)² f(x) dx / ‖f″‖₂² − 1 }

The relative speed is n^{−1/2}! σf² is the smallest possible variance.

Kernel Density Estimation 3-39

An optimal bandwidth selector?

• one best method does not exist!

• even asymptotically optimal criteria may show bad behavior in small sample simulations

recommendation:

• use different selection methods and compare the resulting density estimates

Kernel Density Estimation 3-40

How to choose the kernel?

or: how can we compare density estimators based on different kernels and bandwidths?

compare the AMISE for different kernels, where the bandwidths are

h⁽ˡ⁾opt = ( ‖K⁽ˡ⁾‖₂² / μ₂(K⁽ˡ⁾)² )^{1/5} ( 1 / (‖f″‖₂² n) )^{1/5} = δ⁽ˡ⁾ c

and (l) ∈ {Uniform, Gaussian, . . .}

Kernel Density Estimation 3-41

we have that

AMISE{ f_{h⁽ˡ⁾opt} } = 1/(n h⁽ˡ⁾opt) ‖K⁽ˡ⁾‖₂² + ((h⁽ˡ⁾opt)⁴/4) ‖f″‖₂² μ₂²(K⁽ˡ⁾),

which is equivalent to

AMISE{ f_{h⁽ˡ⁾opt} } = ( 1/(nc) + (c⁴/4) ‖f″‖₂² ) T(K⁽ˡ⁾)    (3.3)

if we set

T(K⁽ˡ⁾) := { (‖K⁽ˡ⁾‖₂²)⁴ μ₂(K⁽ˡ⁾)² }^{1/5}

Kernel Density Estimation 3-42

derivation of equation (3.3):

AMISE{ f_{h⁽ˡ⁾opt} } = 1/(n h⁽ˡ⁾opt) ‖K⁽ˡ⁾‖₂² + ((h⁽ˡ⁾opt)⁴/4) ‖f″‖₂² μ₂²(K⁽ˡ⁾)
  = 1/(n δ⁽ˡ⁾ c) ‖K⁽ˡ⁾‖₂² + ((δ⁽ˡ⁾c)⁴/4) ‖f″‖₂² μ₂²(K⁽ˡ⁾)
  = 1/(nc) ‖K⁽ˡ⁾‖₂² ( μ₂(K⁽ˡ⁾)² / ‖K⁽ˡ⁾‖₂² )^{1/5} + (c⁴ ‖f″‖₂²/4) ( ‖K⁽ˡ⁾‖₂² / μ₂(K⁽ˡ⁾)² )^{4/5} μ₂(K⁽ˡ⁾)²
  = ( 1/(nc) + (c⁴/4) ‖f″‖₂² ) { (‖K⁽ˡ⁾‖₂²)⁴ μ₂(K⁽ˡ⁾)² }^{1/5}

Kernel Density Estimation 3-43

AMISE with h^i_opt consists of a kernel-independent term and a factor of proportionality which depends on K^i:

AMISE{ f_{h^i_opt} } = ( 1/(nc) + (c⁴/4) ‖f″‖₂² ) T(K^i) = (5/4) (‖f″‖₂²)^{1/5} T(K^i) n^{−4/5}

where

c := ( 1 / (‖f″‖₂² n) )^{1/5}

Kernel Density Estimation 3-44

using the fact that the AMISE is the product of a kernel-independent term and T(K^i), we obtain

AMISE{ f_{h^i_opt} } / AMISE{ f_{h^j_opt} } = T(K^i) / T(K^j)

thus, to obtain a minimal AMISE, kernel i should be preferred to kernel j if

T(K^i) ≤ T(K^j)

Kernel Density Estimation 3-45

Efficiency of Kernels

Kernel          T(K)     T(K)/T(K_Epa)
Uniform         0.3701   1.0602
Triangle        0.3531   1.0114
Epanechnikov    0.3491   1.0000
Quartic         0.3507   1.0049
Triweight       0.3699   1.0595
Gaussian        0.3633   1.0408
Cosinus         0.3494   1.0004

Table 3: Efficiency of kernels

Kernel Density Estimation 3-46

Results for choosing the kernel

since, using the optimal bandwidths for kernels i and j, we have

AMISE{ f_{h^i_opt} } / AMISE{ f_{h^j_opt} } ≈ 1

we can conclude:

• the choice of the kernel has only a small influence on AMISE

Kernel Density Estimation 3-47

Canonical Kernels

recall: we have derived for

h^i_opt = δ^i c

that AMISE is the product of a kernel-independent term and a kernel-dependent scaling factor

this result also holds for non-optimal bandwidths h = c if they fulfill

h^i = δ^i h.

in that case:

AMISE{ f_{h^i} } = { 1/(nh) + (h⁴/4) ‖f″‖₂² } T(K^i)

Kernel Density Estimation 3-48

thus, if h^i = δ^i h,

AMISE{ f_{h^i} } / AMISE{ f_{h^j} } = T(K^i) / T(K^j)

the factors of proportionality δ^i which guarantee an equal amount of smoothing are the canonical bandwidths, and it holds that

h^i / h^j = δ^i / δ^j  ⟹  h^i = h^j δ^i / δ^j

Kernel Density Estimation 3-49

Canonical Bandwidths for Different Kernels

Kernel        δ^i
Uniform       (9/2)^{1/5} ≈ 1.3510
Epanechnikov  15^{1/5} ≈ 1.7188
Quartic       35^{1/5} ≈ 2.0362
Triweight     (9450/143)^{1/5} ≈ 2.3122
Gaussian      (1/(4π))^{1/10} ≈ 0.7764

Table 4: Canonical bandwidths δ^i for different kernels

Kernel Density Estimation 3-50

The averaged squared error

[Figure: averaged squared error vs. h for Gaussian (label 1) and Quartic (label 2) kernel smoothers]

Figure 37: Averaged squared error for Gaussian (label 1) and quartic (label 2) kernel smoothers. The ASE is smallest for the Gaussian kernel at h1 = 0.08, for the quartic kernel at h2 = 0.21.

SPMsimase

Kernel Density Estimation 3-51

Adjusting the Bandwidth across Kernels

we want

AMISE(hA, KA) ≈ AMISE(hB, KB)

which holds if

hB = hA δB/δA.

Example

kernel KA: A = Uniform; kernel KB: B = Gaussian

δA = 1.35, hA = 1.2, δB = 0.77  ⟹  hB = 0.68
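A tiny Python sketch of this adjustment, using the canonical bandwidths δ from Table 4:

# canonical bandwidths from Table 4
DELTA = {"uniform": 1.3510, "epanechnikov": 1.7188, "quartic": 2.0362,
         "triweight": 2.3122, "gaussian": 0.7764}

def adjust_bandwidth(h_a, kernel_a, kernel_b):
    """Transfer a bandwidth between kernels: h_B = h_A * delta_B / delta_A."""
    return h_a * DELTA[kernel_b] / DELTA[kernel_a]

h_b = adjust_bandwidth(1.2, "uniform", "gaussian")   # ≈ 0.69, the slide's 0.68 up to rounding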

Kernel Density Estimation 3-52

Confidence Intervals

let h = hn = c n^{−1/5}; then the KDE has the following asymptotic normal distribution:

n^{2/5} { fhn(x) − f(x) } →^L N( bx, vx² ),

with bx = (c²/2) f″(x) μ₂(K) and vx² = (1/c) f(x) ‖K‖₂²,

consequently:

P( bx − z_{1−α/2} vx ≤ n^{2/5} {fh(x) − f(x)} ≤ bx + z_{1−α/2} vx ) ≈ 1 − α

Kernel Density Estimation 3-53

1 − α confidence interval for f(x):

[ fh(x) − (h²/2) f″(x) μ₂(K) − z_{1−α/2} √( f(x) ‖K‖₂² / (nh) ),
  fh(x) − (h²/2) f″(x) μ₂(K) + z_{1−α/2} √( f(x) ‖K‖₂² / (nh) ) ]

Kernel Density Estimation 3-54

Confidence Bands

lim_{n→∞} P( fh(x) − { fh(x) ‖K‖₂² / (nh) }^{1/2} { z/(2δ log n)^{1/2} + dn }^{1/2} ≤ f(x)

  ≤ fh(x) + { fh(x) ‖K‖₂² / (nh) }^{1/2} { z/(2δ log n)^{1/2} + dn }^{1/2}  ∀x )

= exp{−2 exp(−z)} = 0.95

with

dn = (2δ log n)^{1/2} + (2δ log n)^{−1/2} log( (1/(2π)) ‖K′‖₂ / ‖K‖₂ ),   h = n^{−δ}, δ ∈ (1/5, 1/2)

Note: undersmoothing for bias reduction

Kernel Density Estimation 3-55

Example: Average hourly earnings

[Figure: log-normal and kernel density estimates of wages]

Figure 38: Average hourly earnings, parametric (log-normal, broken line) and nonparametric density estimate (Quartic kernel, h = 5, solid line), CPS 1985, n = 534

SPMcps85dist

Kernel Density Estimation 3-56

[Figure: kernel density estimate with confidence bands]

Figure 39: Average hourly earnings, parametric (log-normal, broken line) with nonparametric kernel density estimate (thick solid line) and confidence bands (thin solid lines) for the kernel estimate, CPS 1985, n = 534

SPMcps85dist

Kernel Density Estimation 3-57

Example: Difference between confidence intervals and bands

[Figure: confidence bands vs. pointwise confidence intervals]

Figure 40: Average hourly earnings, 95% confidence bands (solid lines) and confidence intervals (broken lines) of the kernel density estimate, CPS 1985, n = 534

SPMcps85dist

Kernel Density Estimation 3-58

Summary: Univariate Density

• The asymptotic MISE is given by

AMISE(fh) = 1/(nh) ‖K‖₂² + (h⁴/4) {μ₂(K)}² ‖f″‖₂²

• The optimal bandwidth hopt that minimizes AMISE is

hopt = ( ‖K‖₂² / (‖f″‖₂² {μ₂(K)}² n) )^{1/5} ∼ n^{−1/5},

hence MISE(hopt) ∼ n^{−4/5}

• The AMISE-optimal bandwidth according to Silverman's rule of thumb is hopt ≈ 1.06 σ n^{−1/5}

• The bandwidth can be estimated by cross-validation (based on the delete-one estimates) or by plug-in methods (based on some estimate of the term ‖f″‖₂² in the expression for AMISE).

Kernel Density Estimation 3-59

Multivariate Kernel Density Estimation

d-dimensional data and d-dimensional kernel

X = (X1, . . . , Xd)⊤,  K : R^d → R₊

• multivariate kernel density estimator (simple)

fh(x) = (1/n) ∑_{i=1}^n (1/h^d) K((x − Xi)/h)

each component is scaled equally.

• multivariate kernel density estimator (more general)

fh(x) = (1/n) ∑_{i=1}^n 1/(h1 · . . . · hd) K( (x1 − Xi1)/h1, . . . , (xd − Xid)/hd )

Kernel Density Estimation 3-60

• multivariate kernel density estimator (most general)

fH(x) = (1/n) ∑_{i=1}^n (1/det(H)) K{ H^{−1}(x − Xi) }

where H is a (symmetric) bandwidth matrix. Each component is scaled separately; correlation between the components can be handled.
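A Python sketch of the most general estimator above, with a Gaussian product kernel substituted for K (an assumption for concreteness):

import numpy as np

def kde_multivariate(x, data, H):
    """f_H(x) = (1/n) sum_i det(H)^(-1) K(H^(-1)(x - X_i)), Gaussian kernel."""
    data = np.atleast_2d(data)                 # shape (n, d)
    n, d = data.shape
    u = (x - data) @ np.linalg.inv(H).T        # standardized differences
    kvals = np.exp(-0.5 * (u**2).sum(axis=1)) / (2 * np.pi) ** (d / 2)
    return kvals.sum() / (n * np.linalg.det(H))

# usage on bivariate data
rng = np.random.default_rng(4)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=400)
fhat = kde_multivariate(np.array([0.0, 0.0]), data, H=np.diag([0.4, 0.4]))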

Kernel Density Estimation 3-61

How to get a Multivariate Kernel?

u = (u1, . . . , ud)⊤

• product kernel

K(u) = K(u1) · . . . · K(ud)

• radially symmetric or spherical kernel

K(u) = K(‖u‖) / ∫_{R^d} K(‖u‖) du,  with ‖u‖ := √(u⊤u)

Kernel Density Estimation 3-62

Example: Bandwidth matrix H = (1 0; 0 1)

[Figure: contours of product and radially symmetric kernels]

Figure 41: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel

SPMkernelcontours

Kernel Density Estimation 3-63

Example: Bandwidth matrix H = (1 0; 0 0.5)

[Figure: contours of product and radially symmetric kernels]

Figure 42: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel

SPMkernelcontours

Kernel Density Estimation 3-64

Example: Bandwidth matrix H = (1 0.5; 0.5 1)^{1/2}

[Figure: contours of product and radially symmetric kernels]

Figure 43: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel

SPMkernelcontours

Kernel Density Estimation 3-65

Kernel properties

K is a density function:

∫_{R^d} K(u) du = 1,  K(u) ≥ 0

K is symmetric:

∫_{R^d} u K(u) du = 0_d

K has a second moment (matrix):

∫_{R^d} u u⊤ K(u) du = μ₂(K) I_d,

where I_d denotes the d × d identity matrix

K has a kernel norm:

‖K‖₂² = ∫ K²(u) du

Kernel Density Estimation 3-66

Statistical Properties

E[fH(x)] − f(x) ≈ (1/2) μ₂(K) tr{ H⊤ Hf(x) H }

Var{fH(x)} ≈ 1/(n det(H)) ‖K‖₂² f(x)

AMISE(H) = (1/4) μ₂²(K) ∫_{R^d} [ tr{ H⊤ Hf(x) H } ]² dx + 1/(n det(H)) ‖K‖₂²

derivation using the multivariate Taylor expansion

f(x + t) = f(x) + t⊤ ∇f(x) + (1/2) t⊤ Hf(x) t + O(t⊤t)

where ∇f denotes the gradient and Hf the Hessian of f

Kernel Density Estimation 3-67

E[fH(x)] = E[ (1/n) ∑_{i=1}^n (1/det(H)) K{ H^{−1}(x − Xi) } ]
         = (1/det(H)) ∫_{R^d} K{ H^{−1}(u − x) } f(u) du
         = (1/det(H)) ∫_{R^d} K(s) f(Hs + x) det(H) ds
         ≈ (Taylor) ∫_{R^d} K(s) { f(x) + s⊤H⊤∇f(x) + (1/2) s⊤H⊤ Hf(x) H s } ds
         = f(x) + (1/2) ∫_{R^d} tr{ H⊤ Hf(x) H s s⊤ } K(s) ds
         = f(x) + (1/2) μ₂(K) tr{ H⊤ Hf(x) H }

Kernel Density Estimation 3-68

Var[fH(x)] = (1/n²) Var[ ∑_{i=1}^n (1/det(H)) K{ H^{−1}(Xi − x) } ]
           = (1/n) Var[ (1/det(H)) K{ H^{−1}(X − x) } ]
           = (1/n) E[ (1/det(H)²) K{ H^{−1}(X − x) }² ] − (1/n) E[fH(x)]²
           ≈ 1/(n det(H)²) ∫_{R^d} K{ H^{−1}(u − x) }² f(u) du
           = 1/(n det(H)) ∫_{R^d} K(s)² f(x + Hs) ds
           ≈ (Taylor) 1/(n det(H)) ∫_{R^d} K(s)² { f(x) + s⊤H⊤∇f(x) } ds
           ≈ 1/(n det(H)) ‖K‖₂² f(x)

Kernel Density Estimation 3-69

Univariate Expressions as a Special Case

for d = 1 we obtain H = h, K = K, Hf(x) = f″(x) and

E[fH(x)] − f(x) ≈ (1/2) μ₂(K) tr{ H⊤ Hf(x) H } = (1/2) μ₂(K) h² f″(x)

Var[fH(x)] ≈ 1/(n det(H)) ‖K‖₂² f(x) = 1/(nh) ‖K‖₂² f(x)

Kernel Density Estimation 3-70

Bandwidth selection

• plug-in bandwidths (rule-of-thumb bandwidths)

• resampling methods (cross-validation)

Rule-of-thumb bandwidth

find the bandwidth matrix that minimizes

AMISE(H) = (1/4) μ₂²(K) ∫_{R^d} [ tr{ H⊤ Hf(x) H } ]² dx + 1/(n det(H)) ‖K‖₂²

Kernel Density Estimation 3-71

reformulation of the first term of AMISE under the assumptions K = multivariate Gaussian, f ∼ N_d(μ, Σ). Hence μ₂(K) = 1, ‖K‖₂² = 2^{−d} π^{−d/2}:

∫_{R^d} [ tr{ H⊤ Hf(t) H } ]² dt = 1/(2^{d+2} π^{d/2} det(Σ)^{1/2}) [ 2 tr(H⊤Σ^{−1}H)² + {tr(H⊤Σ^{−1}H)}² ]

Kernel Density Estimation 3-72

simple case H = diag(h1, . . . , hd):

hj,opt = ( 4/(d + 2) )^{1/(d+4)} n^{−1/(d+4)} σj

note: for d = 1 we arrive at Silverman's rule of thumb h = 1.06 n^{−1/5} σ

Kernel Density Estimation 3-73

Scott’s rule of thumb

Since the factor (4/(d + 2))^{1/(d+4)} ∈ [0.924, 1.059], David Scott proposes:

hj,opt = n^{−1/(d+4)} σj

General matrix rule of thumb:

H = n^{−1/(d+4)} Σ^{1/2}
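A Python sketch of the general matrix rule of thumb; the matrix square root is taken via the eigendecomposition of the sample covariance:

import numpy as np

def scott_bandwidth_matrix(data):
    """Scott's rule: H = n^(-1/(d+4)) * Sigma^(1/2)."""
    data = np.atleast_2d(data)
    n, d = data.shape
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    sigma_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T   # Sigma^(1/2)
    return n ** (-1 / (d + 4)) * sigma_half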


Kernel Density Estimation 3-74

Cross-Validation

ISE(H) = ∫ { fH(x) − f(x) }² dx = ∫ fH²(x) dx − 2 ∫ fH(x) f(x) dx + ∫ f²(x) dx

E[fH(X)] ≈ (1/n) ∑_{i=1}^n fH,−i(Xi),   fH,−i(x) = 1/(n−1) ∑_{j≠i} KH(Xj − x)

CV(H) = 1/(n² det(H)) ∑_{i=1}^n ∑_{j=1}^n (K ⋆ K){ H^{−1}(Xj − Xi) } − 2/(n(n−1)) ∑_{i=1}^n ∑_{j≠i} KH(Xj − Xi)

Kernel Density Estimation 3-75

Canonical bandwidths

Let j denote a kernel out of {multivariate Gaussian, multivariate Quartic, . . .}; then the canonical bandwidth of kernel j is

δ_j = { ‖K_j‖₂² / μ₂(K_j)² }^{1/(d+4)}.

consequently, it holds

AMISE(H_j, K_j) ≈ AMISE(H_i, K_i)

where

H_i = (δ_i/δ_j) H_j

Kernel Density Estimation 3-76

Example: Adjust from Gaussian to Quartic product kernel

d    δG      δQ      δQ/δG
1    0.7764  2.0362  2.6226
2    0.6558  1.7100  2.6073
3    0.5814  1.5095  2.5964
4    0.5311  1.3747  2.5883
5    0.4951  1.2783  2.5820

Table 5: Bandwidth adjusting factors for Gaussian and multiplicative Quartic kernel for different dimensions d

Kernel Density Estimation 3-77

Example: East-West German migration intention

A two-dimensional density estimate fh(x) = fh(x1, x2) for two explanatory variables on East-West German migration intention in Spring 1991.

[Figure: 2D density estimate surface]

Figure 44: Two-dimensional density estimation for age and household income from East German SOEP 1991

SPMdensity2D

Kernel Density Estimation 3-78

[Figure: contour plot of the 2D density estimate, Age vs. Income]

Figure 45: Two-dimensional density estimation for age and household income from East German SOEP 1991

SPMcontour2D

• distribution is considerably left skewed

• younger people are more likely to move to the western part of Germany.

Kernel Density Estimation 3-79

Example: Credit scoring

• Credit scoring explained by duration of credit, income and age.

• To visualize this three-dimensional density estimate, one variable is held fixed and the two others are displayed.

Kernel Density Estimation 3-80

[Figure: panels with duration fixed at 38, income fixed at 9337, age fixed at 47]

Figure 46: Three-dimensional density estimation, graphical representation by two-dimensional intersections

SPMslices3D

• In this example a lower income and a lower age will probably lead to a higher credit scoring.

• Results for the variable duration are mixed.

Kernel Density Estimation 3-81

[Figure: contour plot of the 3D density estimate]

Figure 47: Three-dimensional density estimation for duration of credit, household income and age, graphical representation by contour plot

SPMcontour3D

Kernel Density Estimation 3-82

Curse of Dimensionality

recall the AMISE for the multivariate KDE

AMISE(H) = (1/4) μ₂²(K) ∫_{R^d} [ tr{ H⊤ Hf(x) H } ]² dx + 1/(n det(H)) ‖K‖₂²

consider the special case H = h · I_d:

hopt ∼ n^{−1/(4+d)},  AMISE(hopt) ∼ n^{−4/(4+d)}

Hence the convergence rate decreases with the dimension d.

Kernel Density Estimation 3-83

Summary: Multivariate Density

• Straightforward generalization of the one-dimensional density estimator.

• Multivariate kernels are based on the univariate ones (product or radially symmetric).

• Bandwidth selection is based on the same principles as bandwidth selection for univariate densities.

• A new problem is the graphical representation of the estimates. We can use contour plots, 3D plots or 2D intersections.

• We find a curse of dimensionality: the rate of convergence of the estimate towards the true density decreases with the dimension.

Nonparametric Regression 4-1

Chapter 4: Nonparametric Regression

Nonparametric Regression 4-2

Nonparametric Regression

{(Xi, Yi)}, i = 1, . . . , n;  X ∈ R^d, Y ∈ R

• Engel curve: X = net-income, Y = expenditure

Y = m(X) + ε

• CHARN model: time series of the form

Yt = m(Yt−1) + σ(Yt−1) ξt

Nonparametric Regression 4-3

Univariate Kernel Regression

model

Yi = m(Xi) + εi,  i = 1, . . . , n

m(•) smooth regression function, εi i.i.d. error terms with E(εi) = 0

we aim to estimate the conditional expectation of Y given X = x:

m(x) = E(Y|X = x) = ∫ y f(y|x) dy = ∫ y f(x, y)/fX(x) dy

where f(x, y) denotes the joint density of (X, Y) and fX(x) the marginal density of X

Example: Normal variables

(X, Y) ∼ N(μ, Σ)  ⟹  m(x) is linear

Nonparametric Regression 4-4

Let X = (X1, X2)⊤ ∼ Np(μ, Σ), Σ = (Σ11 Σ12; Σ21 Σ22), and define

X2.1 = X2 − Σ21 Σ11^{−1} X1.

The conditional distribution of X2 given X1 = x1 is normal with mean μ2 + Σ21Σ11^{−1}(x1 − μ1) and covariance Σ22.1, i.e.,

(X2 | X1 = x1) ∼ N_{p−r}( μ2 + Σ21Σ11^{−1}(x1 − μ1), Σ22.1 ).

Proof
Since X2 = X2.1 + Σ21Σ11^{−1} X1, for a fixed value of X1 = x1, X2 is equivalent to X2.1 plus a constant term:

(X2 | X1 = x1) = (X2.1 + Σ21Σ11^{−1} x1),

which has the normal distribution N(μ2.1 + Σ21Σ11^{−1} x1, Σ22.1). □

Note that the conditional mean of (X2 | X1) is a linear function of X1 and that the conditional variance does not depend on the particular value of X1.

Nonparametric Regression 4-5

Fixed vs. Random Design

• fixed design: we "know" fX, the density of the Xi

• random design: the density fX has to be estimated

Nonparametric regression estimator

m(x) = (1/n) ∑_{i=1}^n Whi(x) Yi

with weights

W^NW_hi(x) = Kh(x − Xi) / fh(x)   for random design

W^FD_hi(x) = Kh(x − Xi) / fX(x)   for fixed design

Nonparametric Regression 4-6

Nadaraya-Watson Estimator

idea: (Xi, Yi) have a joint pdf, so we can estimate m(•) by a multivariate kernel estimator

fh,h(x, y) = (1/n) ∑_{i=1}^n (1/h) K((x − Xi)/h) · (1/h) K((y − Yi)/h)

and therefore

∫ y fh,h(x, y) dy = (1/n) ∑_{i=1}^n Kh(x − Xi) Yi

resulting estimator:

mh(x) = [ n^{−1} ∑_{i=1}^n Kh(x − Xi) Yi ] / [ n^{−1} ∑_{j=1}^n Kh(x − Xj) ] = rh(x) / fh(x)
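A Python sketch of the Nadaraya-Watson estimator; the Gaussian kernel and all names are illustrative choices:

import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """m_h(x) = sum_i K_h(x - X_i) Y_i / sum_j K_h(x - X_j)."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # Gaussian kernel (constants cancel)
    return (K * Y).sum(axis=1) / K.sum(axis=1)

# usage on simulated data
rng = np.random.default_rng(5)
X = rng.uniform(0, 3, 300)
Y = np.sin(2 * X) + 0.3 * rng.normal(size=300)
mhat = nadaraya_watson(np.linspace(0, 3, 100), X, Y, h=0.2)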


Nonparametric Regression 4-7

[Figure: Engel curve — Food vs. Net-income]

Figure 48: Nadaraya-Watson kernel regression, h = 0.2, U.K. Family Expenditure Survey 1973

SPMengelcurve1

Nonparametric Regression 4-8

Gasser & Müller Estimator

for fixed design with xi ∈ [a, b], si = (x(i) + x(i+1))/2, s0 = a, sn+1 = b, Gasser and Müller (1984) suggest the weights

W^GM_hi(x) = n ∫_{si−1}^{si} Kh(x − u) du,

motivated by (via the mean value theorem):

(si − si−1) Kh(x − ξ) = ∫_{si−1}^{si} Kh(x − u) du

note that, as for the Nadaraya-Watson estimator, ∑_i W^GM_hi(x) = 1

Nonparametric Regression 4-9

[Figure: Gasser-Müller estimate (left) and its first derivative (right)]

Figure 49: Gasser-Müller estimator (left) and its first derivative for simulated data

SPMgasmul

Nonparametric Regression 4-10

Statistical properties of the GM estimator

Theorem
For h → 0, nh → ∞ and regularity conditions we obtain

n^{−1} ∑_{i=1}^n W^GM_hi(x) Yi = mh(x) → m(x) in probability

and

AMSE{ m^GM_h(x) } = 1/(nh) σ² ‖K‖₂² + (h⁴/4) μ₂²(K) {m″(x)}².

Note: here we have fixed design; for random design m^NW_h may have non-existing MSE (random denominator!)

Nonparametric Regression 4-11

Statistical properties of the NW estimator

mh(x) − m(x) = { rh(x)/fh(x) − m(x) } [ fh(x)/fX(x) + {1 − fh(x)/fX(x)} ]
             = [ rh(x) − m(x) fh(x) ] / fX(x) + {mh(x) − m(x)} [ fX(x) − fh(x) ] / fX(x)

calculate now bias and variance in the same way as for the KDE:

AMSE{mh(x)} = 1/(nh) · σ²(x)/fX(x) · ‖K‖₂²   (variance part)
            + (h⁴/4) { m″(x) + 2 m′(x) f′X(x)/fX(x) }² μ₂²(K)   (bias part)

Nonparametric Regression 4-12

[Figure: four kernel regression estimates with h = 0.05, 0.1, 0.2, 0.5]

Figure 50: Four kernel regression estimates for the 1973 U.K. Family Expenditure data with bandwidths h = 0.05, h = 0.1, h = 0.2, and h = 0.5

SPMregress

SPM

Nonparametric Regression 4-13

Asymptotic normal distribution

Under regularity conditions and h = cn−1/5

n2/5 {mh(x) − m(x)}

L−→ N

⎛⎜⎜⎜⎝c2μ2(K )

{m′′(x)

2+

m′(x)f ′X (x)

fX (x)

}︸ ︷︷ ︸

bx

,σ2(x)‖K‖2

2

cfX (x)︸ ︷︷ ︸v2x

⎞⎟⎟⎟⎠ .

� bias is a function of m′′ and m′f ′

� variance is a function of σ2 and f

SPM

Nonparametric Regression 4-14

Pointwise Confidence Intervals

[ mh(x) − z_{1−α/2} √( ‖K‖₂² σ²(x) / (nh fh(x)) ),  mh(x) + z_{1−α/2} √( ‖K‖₂² σ²(x) / (nh fh(x)) ) ]

σ²(x) = n^{−1} ∑_{i=1}^n Whi(x) {Yi − mh(x)}²

• σ² and f both influence the precision of the confidence interval

• correction for bias!?

analogous to KDE: confidence bands

Nonparametric Regression 4-15

Confidence Intervals

[Figure: Engel curve with confidence intervals]

Figure 51: Nadaraya-Watson kernel regression and 95% confidence intervals, h = 0.2, U.K. Family Expenditure Survey 1973

SPMengelconf

Nonparametric Regression 4-16

Local Polynomial Estimation

Taylor expansion for sufficiently smooth functions:

m(t) ≈ m(x) + m′(x)(t − x) + m″(x)(t − x)²/2! + · · · + m⁽ᵖ⁾(x)(t − x)^p/p!

consider a weighted least squares problem

min_β ∑_{i=1}^n { Yi − β0 − β1(Xi − x) − β2(Xi − x)² − . . . − βp(Xi − x)^p }² Kh(x − Xi)

The resulting estimate of β provides estimates for m^{(ν)}(x), ν = 0, 1, . . . , p

Nonparametric Regression 4-17

notations

X = matrix with rows ( 1, Xi − x, (Xi − x)², . . . , (Xi − x)^p ), i = 1, . . . , n

Y = (Y1, . . . , Yn)⊤

W = diag( {Kh(x − Xi)}ⁿᵢ₌₁ )

local polynomial estimator

β(x) = (X⊤WX)^{−1} X⊤WY

local polynomial regression estimator

mh,p(x) = β0(x)
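A Python sketch of the weighted least squares computation above with a Gaussian kernel; β0(x) is the regression estimate at x:

import numpy as np

def local_polynomial(x, X, Y, h, p=1):
    """Degree-p local polynomial fit at x; returns the coefficient vector beta."""
    D = np.vander(X - x, N=p + 1, increasing=True)   # columns 1, (X_i - x), ...
    w = np.exp(-0.5 * ((x - X) / h) ** 2)            # kernel weights K_h(x - X_i)
    DW = D.T * w
    return np.linalg.solve(DW @ D, DW @ Y)

# usage: local linear fit (p = 1) at x = 1.5
rng = np.random.default_rng(6)
X = rng.uniform(0, 3, 300)
Y = np.sin(2 * X) + 0.3 * rng.normal(size=300)
m_at_x = local_polynomial(1.5, X, Y, h=0.25, p=1)[0]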

Nonparametric Regression 4-18

(Asymptotic) Statistical Properties

under regularity conditions, h = c n^{−1/5}, nh → ∞:

n^{2/5} {mh,1(x) − m(x)} →^L N( c² μ₂(K) m″(x)/2 ,  σ²(x) ‖K‖₂² / (c fX(x)) )

remarks:

• asymptotically equivalent to a higher order kernel

• an analogous theorem can be stated for derivative estimation

Nonparametric Regression 4-19

[Figure: local polynomial regression fit of the Engel curve]

Figure 52: Local polynomial regression, p = 1, h = 0.2, U.K. Family Expenditure Survey 1973

SPMlocpolyreg

Nonparametric Regression 4-20

[Figure: local polynomial derivative estimate of the Engel curve]

Figure 53: Local polynomial derivative estimation, p = 2, h by rule of thumb, U.K. Family Expenditure Survey 1973

SPMderivest

Nonparametric Regression 4-21

Other Smoothers

k-Nearest Neighbor Estimator

mk(x) = n^{−1} ∑_{i=1}^n Wki(x) Yi

with the weights

Wki(x) = n/k if i ∈ Jx, 0 otherwise,

where Jx = {i : Xi is one of the k nearest observations to x}.

• smoothing parameter k large: smooth estimate

• smoothing parameter k small: rough estimate
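A minimal Python sketch of the k-NN estimator for univariate X (names illustrative):

import numpy as np

def knn_regression(x, X, Y, k):
    """Average of the Y_i whose X_i are the k observations nearest to x."""
    idx = np.argsort(np.abs(X - x))[:k]   # indices of the k nearest neighbors J_x
    return Y[idx].mean()

rng = np.random.default_rng(7)
X = rng.uniform(0, 3, 400)
Y = np.sin(2 * X) + 0.3 * rng.normal(size=400)
m_at_x = knn_regression(1.0, X, Y, k=101)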

Nonparametric Regression 4-22

[Figure: k-NN regression fit of the Engel curve]

Figure 54: k-nearest-neighbor regression, k = 101, U.K. Family Expenditure Survey 1973

SPMknnreg

Nonparametric Regression 4-23

Statistical Properties

for k → ∞, k/n → 0 and n → ∞:

E{mk(x) − m(x)} ≈ μ₂(K) / (8 fX(x)²) { m″(x) + 2 m′(x) f′X(x)/fX(x) } (k/n)²

Var{mk(x)} ≈ 2 ‖K‖₂² σ²(x) / k.

• variance independent of fX!

• bias has an additional term depending on fX²

• k = 2nh fX(x) ⟹ corresponds to a kernel estimator with local bandwidth!

Nonparametric Regression 4-24

Median Smoothing

m(x) = med{Yi : i ∈ Jx}, where

Jx = {i : Xi is one of the k nearest neighbors of x}.

Nonparametric Regression 4-25

[Figure: median smoothing fit of the Engel curve]

Figure 55: Median smoothing regression, k = 101, U.K. Family Expenditure Survey 1973

SPMmesmooreg

Nonparametric Regression 4-26

Spline Smoothing

The residual sum of squares RSS = ∑_{i=1}^n {Yi − m(Xi)}² is taken as a criterion for the goodness of fit. Spline smoothing avoids functions that minimize the RSS but merely interpolate the data by adding a stabilizer that penalizes non-smoothness.

Possible stabilizer:

‖m″‖₂² = ∫ {m″(x)}² dx

Resulting minimization problem:

mλ = arg min_m ∑_{i=1}^n {Yi − m(Xi)}² + λ ‖m″‖₂²,

where λ controls the weight given to the stabilizer.

Nonparametric Regression 4-27

In the class of all twice differentiable functions, the solution mλ is a set of cubic polynomials

pi(x) = αi + βi x + γi x² + δi x³,  i = 1, . . . , n − 1.

For the estimator to be twice continuously differentiable at the X(i) we require

pi(X(i)) = pi−1(X(i)),  p′i(X(i)) = p′i−1(X(i)),  p″i(X(i)) = p″i−1(X(i))

and an additional boundary condition

p″1(X(1)) = p″n−1(X(n)) = 0.

Nonparametric Regression 4-28

Computational algorithm (introduced by Reinsch)

RSS = ∑_{i=1}^n {Y(i) − m(X(i))}² = (Y − m)⊤(Y − m),    (4.1)

where Y = (Y(1), . . . , Y(n))⊤.

If m(•) were indeed a piecewise cubic polynomial, then the penalty term could be expressed as a quadratic form in m:

∫ {m″(x)}² dx = m⊤ K m,    (4.2)

where K can be decomposed into K = QR^{−1}Q⊤; Q and R denote band matrices and are functions of hi = X(i+1) − X(i).

From (4.1) and (4.2) we get

mλ = (I + λK)^{−1} Y,

where I denotes the n-dimensional identity matrix.
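A discretized Python sketch of mλ = (I + λK)^{−1} Y; as a simplifying assumption, the exact spline matrix K = QR^{−1}Q⊤ is replaced by a squared second-difference penalty:

import numpy as np

def spline_smooth(X, Y, lam):
    """Solve (I + lam*K) m = Y with K = D2'D2, D2 the second-difference matrix."""
    order = np.argsort(X)
    y = Y[order]
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)   # discrete second-derivative operator
    K = D2.T @ D2
    m = np.linalg.solve(np.eye(n) + lam * K, y)
    return X[order], m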

Nonparametric Regression 4-29

Example

We apply the cubic spline estimate to the Engel curve example.

[Figure: spline regression fit of the Engel curve]

Figure 56: Spline regression, λ = 0.005, U.K. Family Expenditure Survey 1973

SPMspline

Nonparametric Regression 4-30

Spline Kernel

Under certain conditions the spline smoother is asymptotically equivalent to a kernel smoother with the spline kernel

KS(u) = (1/2) exp(−|u|/√2) sin(|u|/√2 + π/4),

with local bandwidth h(Xi) = λ^{1/4} n^{−1/4} fX(Xi)^{−1/4}.

Nonparametric Regression 4-31

Orthogonal Series Regression

m(•) is represented as a series of basis functions (e.g. a Fourier series):

m(x) = ∑_{j=0}^∞ βj φj(x),

where {φj}_{j=0}^∞ is a known basis of functions and {βj}_{j=0}^∞ are the unknown Fourier coefficients. The goal is to estimate the unknown coefficients.

Nonparametric Regression 4-32

We have an infinite number of coefficients, which cannot be estimated from a finite number of observations (n). We therefore choose the number of terms N to be included in the representation.

Series estimation procedure:

1. select a basis of functions

2. select N, N < n

3. estimate the N unknown coefficients by a suitable method

Nonparametric Regression 4-33

Legendre polynomials

As functions φj we use the Legendre polynomials (orthogonalized polynomials):

(m + 1) p_{m+1}(x) = (2m + 1) x p_m(x) − m p_{m−1}(x)

On the interval [−1, 1] we have:

p0(x) = 1/√2,   p1(x) = √3 x/√2,   p2(x) = (√5/(2√2)) (3x² − 1), . . .

Nonparametric Regression 4-34

[Figure: orthogonal series regression fit of the Engel curve]

Figure 57: Orthogonal series regression with Legendre polynomials, N = 9, U.K. Family Expenditure Survey 1973

SPMorthogon

Nonparametric Regression 4-35

Wavelet Regression

Wavelets are an alternative to orthogonal series regression; the basis of functions is orthonormal. The wavelet basis {ψjk} on R is generated from a mother wavelet ψ(•) and can be computed by

ψjk(x) = 2^{j/2} ψ(2^j x − k),

where k is a location shift and 2^j a scale factor.

Nonparametric Regression 4-36

An example of a mother wavelet - the Haar wavelet:

ψ(x) = −1 for x ∈ [0, 1/2],  1 for x ∈ (1/2, 1],  0 otherwise.

Nonparametric Regression 4-37

[Figure: wavelet regression fit for simulated data]

Figure 58: Wavelet regression for simulated data, n = 256

SPMwavereg

Nonparametric Regression 4-38

Bandwidth Choice in Kernel Regression

[Figure: Bias², variance and MASE vs. bandwidth h]

Figure 59: MASE (thick line), bias part (thin solid line) and variance part (thin dashed line) for simulated data

SPMsimulmase

Nonparametric Regression 4-39

Example: Simulated data for the previous figure

[Figure: simulated data with true curve]

Figure 60: Simulated data with curve m(x) = {sin(2πx³)}³, Yi = m(Xi) + εi, Xi ∼ U[0, 1], εi ∼ N(0, 0.1)

SPMsimulmase

Nonparametric Regression 4-40

Cross Validation

$$ ASE(h) = n^{-1} \sum_{j=1}^{n} \{\widehat{m}_h(X_j) - m(X_j)\}^2\, w(X_j) $$

$$ MASE(h) = E\{ASE(h) \mid X_1 = x_1, \ldots, X_n = x_n\} $$

Estimate MASE by the resubstitution function (w is a weight function):

$$ p(h) = \frac{1}{n} \sum_{i=1}^{n} \{Y_i - \widehat{m}_h(X_i)\}^2\, w(X_i) $$

Separate estimation and validation by using leave-one-out estimators:

$$ CV(h) = \frac{1}{n} \sum_{i=1}^{n} \{Y_i - \widehat{m}_{h,-i}(X_i)\}^2\, w(X_i) $$

Minimizing CV(h) gives $\widehat{h}_{CV}$.
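A minimal Python sketch of bandwidth selection by CV (not from the slides), using a Gaussian kernel for numerical convenience and the simulated design of the previous figure; grid and seed are illustrative.

import numpy as np

def nw_fit(x, y, h, loo=False):
    # Nadaraya-Watson fit at the sample points; loo=True removes the own
    # observation, giving the leave-one-out estimator m_{h,-i}(X_i)
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    if loo:
        np.fill_diagonal(K, 0.0)
    return (K @ y) / K.sum(axis=1)

def cv(h, x, y):
    # CV(h) with weight function w = 1
    return np.mean((y - nw_fit(x, y, h, loo=True)) ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 150)
y = np.sin(2 * np.pi * x ** 3) ** 3 + rng.normal(0, np.sqrt(0.1), 150)
grid = np.linspace(0.03, 0.3, 28)
h_cv = grid[np.argmin([cv(h, x, y) for h in grid])]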


Nonparametric Regression 4-41

Figure 61: Nadaraya-Watson kernel regression, cross-validated bandwidth ĥ_CV = 0.15, U.K. Family Expenditure Survey 1973 (food share vs. net-income)
SPMnadwaest

Nonparametric Regression 4-42

Figure 62: Local polynomial regression, p = 1, cross-validated bandwidth ĥ_CV = 0.56, U.K. Family Expenditure Survey 1973 (food share vs. net-income)
SPMlocpolyest

Nonparametric Regression 4-43

Penalizing Functions

The conditional expectation of p(h) is not equal to the conditional expectation of ASE(h). The penalizing function approach corrects the bias by penalizing too small h. The corrected version of p(h) is

$$ G(h) = \frac{1}{n} \sum_{i=1}^{n} \{Y_i - \widehat{m}_h(X_i)\}^2\; \Xi\!\left\{\frac{1}{n} W_{hi}(X_i)\right\} w(X_i). \qquad (4.3) $$

Penalizing functions that satisfy the first-order Taylor expansion Ξ(u) = 1 + 2u + O(u²), u → 0:

1. Shibata's model selector: Ξ_S(u) = 1 + 2u
2. Generalized cross-validation: Ξ_GCV(u) = (1 − u)⁻²
3. Akaike's Information Criterion: Ξ_AIC(u) = exp(2u)
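A compact Python sketch of (4.3) with w ≡ 1 (illustrative only). With row-normalized quartic-kernel Nadaraya-Watson weights, the diagonal of the weight matrix equals n⁻¹W_hi(X_i), which is what Ξ is applied to.

import numpy as np

def nw_matrix(x, h):
    # quartic-kernel Nadaraya-Watson weight matrix, rows sum to 1
    u = (x[:, None] - x[None, :]) / h
    K = np.where(np.abs(u) <= 1, (15 / 16) * (1 - u ** 2) ** 2, 0.0)
    return K / K.sum(axis=1, keepdims=True)

xi_shibata = lambda u: 1 + 2 * u            # Shibata's model selector
xi_gcv     = lambda u: (1 - u) ** -2        # generalized cross-validation
xi_aic     = lambda u: np.exp(2 * u)        # Akaike's information criterion

def G(h, x, y, xi=xi_gcv):
    W = nw_matrix(x, h)
    resid = y - W @ y                       # Y_i - m_h(X_i)
    return np.mean(resid ** 2 * xi(np.diag(W)))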


Nonparametric Regression 4-44

Figure 63: Penalizing functions Ξ(h⁻¹) as a function of h (from left to right: Shibata's model selector, Akaike's Information Criterion, Finite Prediction Error, Generalized cross-validation, Rice's T)
SPMpenalize

Nonparametric Regression 4-45

Multivariate Kernel Regression

$$ E(Y|X) = E(Y|X_1, \ldots, X_d) = m(X) $$

Multivariate Nadaraya-Watson estimator:

$$ \widehat{m}_H(x) = \frac{\sum_{i=1}^{n} K_H(X_i - x)\, Y_i}{\sum_{i=1}^{n} K_H(X_i - x)} $$

Multivariate local linear regression:

$$ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left\{Y_i - \beta_0 - (X_i - x)^\top \beta_1\right\}^2 K_H(X_i - x) $$
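The local linear criterion is a weighted least squares problem at each point x. A Python sketch (a diagonal bandwidth matrix H and a product quartic kernel are illustrative assumptions):

import numpy as np

def local_linear(x0, X, Y, H):
    # minimize sum_i {Y_i - b0 - (X_i - x0)' b1}^2 K_H(X_i - x0);
    # returns b0 = m_hat(x0)
    U = (X - x0) / np.diag(H)                       # componentwise scaling
    k = np.where(np.abs(U) <= 1, (15 / 16) * (1 - U ** 2) ** 2, 0.0)
    w = k.prod(axis=1)                              # product kernel K_H(X_i - x0)
    Z = np.column_stack([np.ones(len(X)), X - x0])  # design (1, X_i - x0)
    WZ = Z * w[:, None]
    beta = np.linalg.lstsq(WZ.T @ Z, WZ.T @ Y, rcond=None)[0]
    return beta[0]                                  # intercept = m_{1,H}(x0)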


Nonparametric Regression 4-46

Notations:

$$ X = \begin{pmatrix} 1 & X_1 - x & (X_1 - x)^2 & \ldots & (X_1 - x)^p \\ 1 & X_2 - x & (X_2 - x)^2 & \ldots & (X_2 - x)^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n - x & (X_n - x)^2 & \ldots & (X_n - x)^p \end{pmatrix} $$

$$ Y = (Y_1, \ldots, Y_n)^\top, \qquad W = \operatorname{diag}\left(\{K_H(X_i - x)\}_{i=1}^{n}\right) $$

Multivariate local polynomial estimator:

$$ \widehat{\beta}(x) = \left(\widehat{\beta}_0, \widehat{\beta}_1^\top\right)^\top = \left(X^\top W X\right)^{-1} X^\top W Y $$

Multivariate local linear estimator:

$$ \widehat{m}_{1,H}(x) = \widehat{\beta}_0(x) $$

Nonparametric Regression 4-47

Figure 64: Two-dimensional Nadaraya-Watson and local linear estimate (panels: True Function, Nadaraya-Watson, Local Linear)
SPMtruenadloc

Nonparametric Regression 4-48

Bias, Variance and Asymptotics

Theorem
The conditional asymptotic bias and variance of the multivariate Nadaraya-Watson kernel regression estimator are

$$ Bias\{\widehat{m}_H \mid X_1, \ldots, X_n\} \approx \mu_2(K)\, \frac{\nabla m(x)^\top H H^\top \nabla f_X(x)}{f_X(x)} + \frac{1}{2}\mu_2(K)\, \operatorname{tr}\{H^\top \mathcal{H}_m(x) H\} $$

$$ Var\{\widehat{m}_H \mid X_1, \ldots, X_n\} \approx \frac{1}{n \det(H)}\, \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)} $$

in the interior of the support of f_X, where 𝓗_m(x) denotes the Hessian of m.

Nonparametric Regression 4-49

Theorem
The conditional asymptotic bias and variance of the multivariate local linear regression estimator are

$$ Bias\{\widehat{m}_{1,H} \mid X_1, \ldots, X_n\} \approx \frac{1}{2}\mu_2(K)\, \operatorname{tr}\{H^\top \mathcal{H}_m(x) H\} $$

$$ Var\{\widehat{m}_{1,H} \mid X_1, \ldots, X_n\} \approx \frac{1}{n \det(H)}\, \|K\|_2^2\, \frac{\sigma^2(x)}{f_X(x)} $$

in the interior of the support of f_X.

Nonparametric Regression 4-50

Curse of dimensionality

As in multivariate kernel density estimation, we face a curse of dimensionality, since the asymptotic MSE depends on d:

$$ AMSE(n, h) = \frac{1}{n h^d} C_1 + h^4 C_2. $$

Consequently, the optimal bandwidth is $h_{opt} \sim n^{-1/(4+d)}$ and hence the rate of convergence for the AMSE is $n^{-4/(4+d)}$.

Nonparametric Regression 4-51

Cross-Validation Bandwidth Choice

Motivated by

$$ RSS(H) = n^{-1} \sum_{i=1}^{n} \{Y_i - \widehat{m}_H(X_i)\}^2, $$

the cross-validation function is

$$ CV(H) = n^{-1} \sum_{i=1}^{n} \{Y_i - \widehat{m}_{H,-i}(X_i)\}^2 $$

with the leave-one-out estimator

$$ \widehat{m}_{H,-i}(X_i) = \frac{\sum_{j \neq i} K_H(X_i - X_j)\, Y_j}{\sum_{j \neq i} K_H(X_i - X_j)}. $$

Nonparametric Regression 4-52

For the Nadaraya-Watson estimator:

$$ CV(H) = n^{-1} \sum_{i=1}^{n} \{Y_i - \widehat{m}_{H,-i}(X_i)\}^2 = n^{-1} \sum_{i=1}^{n} \{Y_i - \widehat{m}_H(X_i)\}^2 \left\{\frac{Y_i - \widehat{m}_{H,-i}(X_i)}{Y_i - \widehat{m}_H(X_i)}\right\}^2 $$

and

$$ \frac{Y_i - \widehat{m}_H(X_i)}{Y_i - \widehat{m}_{H,-i}(X_i)} = 1 - \frac{K_H(0)}{\sum_j K_H(X_i - X_j)}, $$

so CV(H) can be computed from a single fit on the full sample.
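The identity is easy to verify numerically. A small Python check (Gaussian kernel; data, bandwidth and seed are illustrative):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(0, 0.2, 100)
h = 0.1
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
S = K / K.sum(axis=1, keepdims=True)           # smoother matrix, m_H = S y
resid = y - S @ y
cv_short = np.mean((resid / (1 - np.diag(S))) ** 2)   # shortcut formula

Km = K.copy(); np.fill_diagonal(Km, 0.0)       # explicit leave-one-out fit
Sm = Km / Km.sum(axis=1, keepdims=True)
cv_direct = np.mean((y - Sm @ y) ** 2)
assert np.allclose(cv_short, cv_direct)        # both versions agree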


Nonparametric Regression 4-53

For the local linear estimator:

Consider, for a fixed point x, the sums

$$ S_0 = \sum_{i=1}^{n} K_H(X_i - x), \quad S_1 = \sum_{i=1}^{n} K_H(X_i - x)(X_i - x), \quad S_2 = \sum_{i=1}^{n} K_H(X_i - x)(X_i - x)(X_i - x)^\top; $$

then

$$ \frac{Y_i - \widehat{m}_H(X_i)}{Y_i - \widehat{m}_{H,-i}(X_i)} = 1 - \frac{K_H(0)}{S_0 - S_1^\top S_2^{-1} S_1}. $$

Nonparametric Regression 4-54

Summary: Nonparametric Regression

• The regression function m(•), which relates an explanatory variable X and a dependent variable Y, is the conditional expectation m(x) = E(Y|X = x).

• A natural kernel regression estimate for random design can be obtained by replacing the unknown densities in E(Y|X = x) by kernel density estimates. This yields the Nadaraya-Watson estimator

$$ \widehat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{j=1}^{n} K_h(x - X_j)} = \frac{1}{n}\sum_{i=1}^{n} W_{hi}(x)\, Y_i. $$

For fixed design we can use the Gasser-Müller estimator with

$$ W_{hi}^{GM}(x) = n \int_{s_{i-1}}^{s_i} K_h(x - u)\, du. $$

Nonparametric Regression 4-55

• The asymptotic MSE of the Nadaraya-Watson estimator is

$$ AMSE\{\widehat{m}_h(x)\} = \frac{1}{nh}\frac{\sigma^2(x)}{f_X(x)}\|K\|_2^2 + \frac{h^4}{4}\left\{m''(x) + 2\,\frac{m'(x) f_X'(x)}{f_X(x)}\right\}^2 \mu_2^2(K); $$

the asymptotic MSE of the Gasser-Müller estimator lacks the 2 m'(x) f_X'(x)/f_X(x) term.

• The Nadaraya-Watson estimator is a local constant estimator. Extending to local polynomials of degree p yields the minimization problem

$$ \min_\beta \sum_{i=1}^{n} \{Y_i - \beta_0 - \beta_1(X_i - x) - \ldots - \beta_p(X_i - x)^p\}^2\, K_h(x - X_i). $$

Odd-order local polynomial estimators outperform even-order fits.

Nonparametric Regression 4-56

In particular, the asymptotic MSE of the local linear kernel regression estimator is

$$ AMSE\{\widehat{m}_{1,h}(x)\} = \frac{1}{nh}\frac{\sigma^2(x)}{f_X(x)}\|K\|_2^2 + \frac{h^4}{4}\{m''(x)\}^2 \mu_2^2(K). $$

The bias does not depend on f_X, f_X' or m', which makes the local linear estimator more design adaptive and improves its behavior in boundary regions. Local polynomial estimation also allows easy estimation of derivatives of the regression function.

• Other smoothing methods are: k-NN estimation, splines, median smoothing, orthogonal series regression and wavelets.

Semiparametric and Generalized Regression Models 5-1

Chapter 5:
Semiparametric and Generalized Regression Models

Semiparametric and Generalized Regression Models 5-2

Semiparametric and Generalized Regression Models

Possibilities to avoid the curse of dimensionality:

• variable selection in nonparametric regression
• nonparametric link function g, but one-dimensional (parametric) index
  • single index model (SIM):  E(Y|X) = m(X) = g(X^⊤β)
• generalization to a semi- or nonparametric index
  • generalized partial linear model (GPLM):  E(Y|X, T) = G{X^⊤β + m(T)}
  • generalized additive model (GAM):  E(Y|X, T) = G{X^⊤β + f₁(T₁) + … + f_d(T_d)}

Semiparametric and Generalized Regression Models 5-3

components \ link       known        unknown
linear                  GLM          SIM
partial nonparametric   GPLM, GAM    —

Table 6: Parametric → semiparametric

Semiparametric and Generalized Regression Models 5-4

Generalized Linear Models

Model:

$$ E(Y|x) = G(x^\top\beta), \qquad \mu = G(\eta) $$

Components of a GLM:

• distribution of Y (exponential family)
• link function G (note that other authors call G⁻¹ the link function)

Semiparametric and Generalized Regression Models 5-5

Exponential Family

$$ f(y, \theta, \psi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\right\} $$

where
• f is the density of Y (continuous case)
• f is the probability function of Y (discrete case)

Note that
• in a GLM, θ = θ(β)
• θ is the parameter of interest
• ψ is a nuisance parameter (like σ in regression)

Semiparametric and Generalized Regression Models 5-6

Example: Y ∼ N(μ, σ²)

$$ f(y) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{\frac{-1}{2\sigma^2}(y - \mu)^2\right\} = \exp\left\{\frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\sigma)\right\} $$

This is an exponential family with

$$ a(\psi) = \psi^2, \qquad b(\theta) = \frac{\theta^2}{2}, \qquad c(y, \psi) = -\frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\sigma), $$

where ψ = σ and θ = μ.

Semiparametric and Generalized Regression Models 5-7

Example: Y is Bernoulli

$$ f(y) = P(Y = y) = p^y(1 - p)^{1-y} = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases} $$

Transform into

$$ P(Y = y) = \left(\frac{p}{1 - p}\right)^y (1 - p) = \exp\left\{y \log\left(\frac{p}{1 - p}\right)\right\}(1 - p) $$

Define the logit

$$ \theta = \log\left(\frac{p}{1 - p}\right) \iff p = \frac{e^\theta}{1 + e^\theta} $$

Semiparametric and Generalized Regression Models 5-8

Thus, we arrive at

$$ P(Y = y) = \exp\{y\theta + \log(1 - p)\}; $$

the parameters of the exponential family are

$$ a(\psi) = 1, \qquad b(\theta) = -\log(1 - p) = \log(1 + e^\theta), \qquad c(y, \psi) = 0. $$

This is a one-parameter exponential family, as there is no nuisance parameter ψ.

Semiparametric and Generalized Regression Models 5-9

Properties of the exponential family

f is a density function:

$$ \int f(y, \theta, \psi)\, dy = 1 $$

Taking the partial derivative with respect to θ:

$$ 0 = \frac{\partial}{\partial\theta}\int f(y, \theta, \psi)\, dy = \int\left\{\frac{\partial}{\partial\theta}\log f(y, \theta, \psi)\right\} f(y, \theta, \psi)\, dy = E\left\{\frac{\partial}{\partial\theta}\ell(y, \theta, \psi)\right\}, $$

where ℓ(y, θ, ψ) = log f(y, θ, ψ) denotes the log-likelihood.

Semiparametric and Generalized Regression Models 5-10

For the score function ∂ℓ(y, θ, ψ)/∂θ we know that

$$ E\left\{\frac{\partial^2}{\partial\theta^2}\ell(y, \theta, \psi)\right\} = -E\left\{\frac{\partial}{\partial\theta}\ell(y, \theta, \psi)\right\}^2. $$

So we get the following properties:

$$ 0 = E\left\{\frac{Y - b'(\theta)}{a(\psi)}\right\}, \qquad E\left\{\frac{-b''(\theta)}{a(\psi)}\right\} = -E\left\{\frac{Y - b'(\theta)}{a(\psi)}\right\}^2 $$

Semiparametric and Generalized Regression Models 5-11

From this we conclude

$$ E(Y) = \mu = b'(\theta), \qquad Var(Y) = V(\mu)\, a(\psi) = b''(\theta)\, a(\psi) $$

• the expectation of Y depends only on θ, whereas the variance of Y depends on θ and ψ
• we assume that a(ψ) is constant (there are modifications using prior weights)

Semiparametric and Generalized Regression Models 5-12

Link Functions

$$ \mu = G(\eta), \qquad \eta = x^\top\beta $$

The canonical link is given when

$$ x^\top\beta = \eta = \theta $$

Example: Y Bernoulli
• logit: μ = 1/{1 + exp(−η)} (canonical)
• probit: μ = Φ(η) (not in the exponential family!)
• complementary log-log link: η = log{−log(1 − μ)}

Example: Power link
• η = μ^λ (if λ ≠ 0) and η = log μ (if λ = 0)

Semiparametric and Generalized Regression Models 5-13

Notation                    Range of y     b(θ)            μ(θ)             Canonical link θ(μ)   V(μ)       a(ψ)
Bernoulli B(1, μ)           {0, 1}         log(1+e^θ)      e^θ/(1+e^θ)      logit                 μ(1−μ)     1
Binomial B(k, μ)            {0, 1, …, k}   k log(1+e^θ)    k e^θ/(1+e^θ)    logit                 μ(1−μ/k)   1
Poisson P(μ)                {0, 1, 2, …}   exp(θ)          exp(θ)           log μ                 μ          1
Geometric GE(μ)             {0, 1, 2, …}   −log(1−e^θ)     e^θ/(1−e^θ)      log{μ/(1+μ)}          μ+μ²       1
Negative Binomial NB(μ, k)  {0, 1, 2, …}   −k log(1−e^θ)   k e^θ/(1−e^θ)    log{μ/(k+μ)}          μ+μ²/k     1
Normal N(μ, σ²)             (−∞, ∞)        θ²/2            θ                identity              1          σ²
Exponential Exp(μ)          (0, ∞)         −log(−θ)        −1/θ             reciprocal            μ²         1
Gamma G(μ, ν)               (0, ∞)         −log(−θ)        −1/θ             reciprocal            μ²         1/ν
Inverse Gaussian IG(μ, σ²)  (0, ∞)         −(−2θ)^{1/2}    −(−2θ)^{−1/2}    squared reciprocal    μ³         σ²

Table 7: Characteristics of some GLM distributions

Semiparametric and Generalized Regression Models 5-14

Maximum-Likelihood Algorithm

Observations Y = (Y₁, …, Y_n)^⊤ with

$$ E(Y_i|x_i) = \mu_i = G(x_i^\top\beta) $$

We maximize the log-likelihood of the sample,

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n} \log f(Y_i, \theta_i, \psi), $$

where θ_i = θ(η_i) = θ(x_i^⊤β). Alternatively, one may minimize the deviance function

$$ D(Y, \mu, \psi) = 2\{\ell(Y, Y, \psi) - \ell(\mu, Y, \psi)\}. $$

Semiparametric and Generalized Regression Models 5-15

For Y_i ∼ N(μ_i, σ²) we have

$$ \ell(Y_i, \theta_i, \psi) = \log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{1}{2\sigma^2}(Y_i - \mu_i)^2. $$

This gives the sample log-likelihood

$$ \ell(Y, \mu, \sigma) = n\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu_i)^2. $$

Semiparametric and Generalized Regression Models 5-16

Log-Likelihood and Exponential Family

The deviance function:

$$ D(Y, \mu, \psi) = 2\{\ell(Y, \mu_{max}, \psi) - \ell(Y, \mu, \psi)\} $$

μ_max is the non-restricted vector maximizing ℓ(Y, •, ψ). The first term does not depend on the model (i.e. not on β), so minimizing the deviance corresponds to maximizing the log-likelihood.

Semiparametric and Generalized Regression Models 5-17

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n}\left\{\frac{Y_i\theta_i - b(\theta_i)}{a(\psi)} + c(Y_i, \psi)\right\} $$

Neither a(ψ) nor c(Y_i, ψ) has an influence on the maximization w.r.t. β; hence we maximize only

$$ \ell(Y, \mu, \psi) = \sum_{i=1}^{n}\{Y_i\theta_i - b(\theta_i)\}. $$

Gradient of ℓ:

$$ \mathcal{D}(\beta) = \frac{\partial}{\partial\beta}\ell(Y, \mu, \psi) = \sum_{i=1}^{n}\{Y_i - b'(\theta_i)\}\frac{\partial}{\partial\beta}\theta_i $$

Semiparametric and Generalized Regression Models 5-18

We have to solve 𝒟(β) = 0.

• Newton-Raphson step:

$$ \beta^{new} = \beta^{old} - \{\mathcal{H}(\beta^{old})\}^{-1}\mathcal{D}(\beta^{old}) $$

• Fisher scoring step:

$$ \beta^{new} = \beta^{old} - \{E\mathcal{H}(\beta^{old})\}^{-1}\mathcal{D}(\beta^{old}) $$

Semiparametric and Generalized Regression Models 5-19

Gradient and Hessian after some calculation

$$ \mathcal{D}(\beta) = \sum_{i=1}^{n}\{Y_i - \mu_i\}\frac{G'(\eta_i)}{V(\mu_i)}\, x_i $$

Newton-Raphson:

$$ \mathcal{H}(\beta) = \sum_{i=1}^{n}\left\{-b''(\theta_i)\left(\frac{\partial}{\partial\beta}\theta_i\right)\left(\frac{\partial}{\partial\beta}\theta_i\right)^\top - \{Y_i - b'(\theta_i)\}\frac{\partial^2}{\partial\beta\,\partial\beta^\top}\theta_i\right\} $$

$$ = \sum_{i=1}^{n}\left\{-\frac{G'(\eta_i)^2}{V(\mu_i)} + \{Y_i - \mu_i\}\frac{G''(\eta_i)V(\mu_i) - G'(\eta_i)^2 V'(\mu_i)}{V(\mu_i)^2}\right\} x_i x_i^\top $$

Fisher scoring:

$$ E\mathcal{H}(\beta) = \sum_{i=1}^{n}\left\{-\frac{G'(\eta_i)^2}{V(\mu_i)}\right\} x_i x_i^\top $$

Semiparametric and Generalized Regression Models 5-20

Simpler presentation of the algorithm (Fisher scoring)

$$ W = \operatorname{diag}\left(\frac{G'(\eta_1)^2}{V(\mu_1)}, \ldots, \frac{G'(\eta_n)^2}{V(\mu_n)}\right) $$

$$ \widetilde{Y} = \left(\frac{Y_1 - \mu_1}{G'(\eta_1)}, \ldots, \frac{Y_n - \mu_n}{G'(\eta_n)}\right)^\top, \qquad X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} $$

Each iteration step for β can be written as

$$ \beta^{new} = \beta^{old} + (X^\top W X)^{-1} X^\top W \widetilde{Y} = (X^\top W X)^{-1} X^\top W Z $$

with adjusted dependent variables

$$ Z_i = x_i^\top\beta^{old} + (Y_i - \mu_i)\{G'(\eta_i)\}^{-1}. $$
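For the canonical logit link this iteration takes a particularly simple form, since G'(η) = μ(1 − μ) = V(μ), so W = diag{μ_i(1 − μ_i)}. A Python sketch of the IRLS loop (illustrative, not the slides' implementation):

import numpy as np

def irls_logit(X, y, n_iter=25, tol=1e-8):
    # Fisher scoring / IRLS in the adjusted-dependent-variable form
    # beta_new = (X'WX)^{-1} X'WZ for the logit GLM
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)                   # W = G'(eta)^2 / V(mu)
        z = eta + (y - mu) / w                # adjusted dependent variable
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:   # relative-change control
            return beta_new
        beta = beta_new
    return beta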


Semiparametric and Generalized Regression Models 5-21

Remarks

• canonical link: Newton-Raphson = Fisher scoring
• initial values:
  • for all but binomial models: μ_{i,0} = Y_i and η_{i,0} = G⁻¹(μ_{i,0})
  • for binomial models: μ_{i,0} = (Y_i + ½)/(w_{x_i} + 1) and η_{i,0} = G⁻¹(μ_{i,0})
    (w_{x_i} denotes the binomial weights, i.e. w_{x_i} = 1 in the Bernoulli case)
• convergence is controlled by checking the relative change in β and/or the deviance

Semiparametric and Generalized Regression Models 5-22

Asymptotics

Theorem
Denote μ = (μ₁, …, μ_n)^⊤ = {G(x₁^⊤β), …, G(x_n^⊤β)}^⊤; then

$$ \sqrt{n}\,(\widehat{\beta} - \beta) \xrightarrow[n\to\infty]{L} N_p(0, \Sigma), $$

and it holds approximately that D(Y, μ̂, ψ) ∼ χ²_{n−p} and

$$ 2\{\ell(Y, \widehat{\mu}, \psi) - \ell(Y, \mu, \psi)\} \sim \chi^2_p. $$

The asymptotic covariance of β̂ can be estimated by the inverse of the estimated Fisher information,

$$ \widehat{\Sigma}_{\widehat{\beta}}^{-1} = -E\mathcal{H}(\beta^{last}) = \sum_{i=1}^{n}\left\{\frac{G'(\eta_{i,last})^2}{V(\mu_{i,last})}\right\} x_i x_i^\top, $$

with "last" denoting the values from the last iteration step.

Semiparametric and Generalized Regression Models 5-23

Example: Migration

                                       Yes     No    (in %)
Y    migration intention               38.5    61.5
X1   family/friends in west            85.6    11.2
X2   unemployed/job loss certain       19.7    78.9
X3   city size 10,000-100,000          29.3    64.2
X4   female                            51.1    49.8

                               Min    Max    Mean      S.D.
X5   age (years)               18     65     39.84     12.61
T    household income (DM)     200    4000   2194.30   752.45

Table 8: Descriptive statistics for migration data, n = 3235

Semiparametric and Generalized Regression Models 5-24

Example: Migration in Mecklenburg–Vorpommern

                                       Yes     No    (in %)
Y    migration intention               39.1    60.9
X1   family/friends in west            88.8    11.2
X2   unemployed/job loss certain       21.1    78.9
X3   city size 10,000-100,000          35.8    64.2
X4   female                            50.2    49.8

                               Min    Max    Mean      S.D.
X5   age (years)               18     65     39.94     12.89
T    household income (DM)     400    4000   2262.20   769.82

Table 9: Descriptive statistics for migration data in Mecklenburg-Vorpommern, n = 402
SPMmigmvdesc

Semiparametric and Generalized Regression Models 5-25

The dependent variable is the response

$$ Y = \begin{cases} 1 & \text{move to the West} \\ 0 & \text{stay in the East} \end{cases} $$

For a binary dependent variable it holds that

$$ E(Y|x) = 1 \cdot P(Y = 1|x) + 0 \cdot P(Y = 0|x) = P(Y = 1|x); $$

using a logit link we have

$$ P(Y = 1|x) = G(x^\top\beta) = \frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)}. $$

Semiparametric and Generalized Regression Models 5-26

Example: Results for the migration data

Logit model                 Coefficients   (t-value)
const.                      0.512          (2.39)
family/friends in west      0.599          (5.20)
unemployed/job loss         0.221          (2.31)
city size 10-100,000        0.311          (3.77)
female                      -0.240         (3.15)
age                         -4.69·10⁻²     (14.56)
household income            1.42·10⁻⁴      (2.73)

Table 10: Logit coefficients for migration data (t-values in parentheses)

Semiparametric and Generalized Regression Models 5-27

Figure 65: Logit model for migration (link function and responses), sample from Mecklenburg-Vorpommern, n = 402
SPMlogit

Semiparametric and Generalized Regression Models 5-28

More on Index Models

Binary Responses

$$ Y = \begin{cases} 1 & \text{if } Y^* = X^\top\beta + \varepsilon \geq 0, \\ 0 & \text{otherwise,} \end{cases} $$

with latent variable Y* = X^⊤β + ε. Assume that ε is independent of X; then

$$ E(Y|X = x) = P(\varepsilon \geq -x^\top\beta) = 1 - P(\varepsilon \leq -x^\top\beta) = 1 - G(-x^\top\beta) = G(x^\top\beta) $$

Semiparametric and Generalized Regression Models 5-29

Multicategorical Responses

Utility Y*_j:

$$ Y_j^* = X^\top\gamma_j + u_j $$

Non-ordered case:

$$ Y = j \iff Y_j^* > Y_k^* \quad \text{for all } k \neq j, \; k \in \{0, 1, \ldots, J\}. $$

Differenced utilities:

$$ Y_j^* - Y_k^* = X^\top(\gamma_j - \gamma_k) + (u_j - u_k) = X^\top\beta_{jk} + \varepsilon_{jk}. $$

The corresponding probability is

$$ P(Y = j|X = x) = P(-\varepsilon_{j0} < x^\top\beta_{j0}, \ldots, -\varepsilon_{jJ} < x^\top\beta_{jJ}) = G(x^\top\beta_{j0}, \ldots, x^\top\beta_{jJ}). $$

Semiparametric and Generalized Regression Models 5-30

Multinomial logit

$$ P(Y = 0|X = x) = \frac{1}{1 + \sum_{k=1}^{J}\exp(x^\top\beta_k)}, \qquad P(Y = j|X = x) = \frac{\exp(x^\top\beta_j)}{1 + \sum_{k=1}^{J}\exp(x^\top\beta_k)} $$

Conditional logit:

$$ P(Y = j|X = x) = \frac{\exp(x_j^\top\beta)}{\sum_{k=0}^{J}\exp(x_k^\top\beta)} $$

Semiparametric and Generalized Regression Models 5-31

Ordered model:

$$ Y^* = x^\top\beta + \varepsilon. $$

Assume thresholds c₁, …, c_{J−1} which divide the real axis into subintervals:

$$ Y = \begin{cases} 0 & \text{if } -\infty < Y^* \leq 0, \\ 1 & \text{if } 0 < Y^* \leq c_1, \\ 2 & \text{if } c_1 < Y^* \leq c_2, \\ \vdots \\ J & \text{if } c_{J-1} < Y^* < \infty \end{cases} $$

Semiparametric and Generalized Regression Models 5-32

Tobit Models

$$ Y^* = X^\top\beta + \varepsilon $$

We observe Y only for positive utility:

$$ Y = Y^* \cdot I(Y^* > 0) = \begin{cases} 0 & \text{if } Y^* \leq 0, \\ Y^* & \text{if } Y^* > 0, \end{cases} $$

which yields a model that depends on the index X^⊤β:

$$ Y = (X^\top\beta + \varepsilon) \cdot I(X^\top\beta + \varepsilon > 0). $$

Semiparametric and Generalized Regression Models 5-33

Sample Selection Models

(regression)  $ Y^* = X^\top\beta + \varepsilon $
(selection)   $ U^* = Z^\top\gamma + \delta $

$$ Y = Y^* \cdot I(U^* > 0) = \begin{cases} 0 & \text{if } U^* \leq 0, \\ Y^* & \text{if } U^* > 0. \end{cases} $$

Semiparametric and Generalized Regression Models 5-34

Summary: Semiparametric Regression and GLM

• The basis for many semiparametric regression models is the generalized linear model (GLM), which is given by

$$ E(Y|X) = G\{X^\top\beta\}. $$

Here, β denotes the parameter vector to be estimated and G denotes a known link function. Prominent examples of this type of regression are binary choice models (logit or probit) or count data models (Poisson regression).

• The GLM can be generalized in several ways: Considering an unknown smooth link function (instead of G) leads to the single index model (SIM). Assuming a nonparametric additive argument of G leads to the generalized additive model (GAM),

Semiparametric and Generalized Regression Models 5-35

whereas a combination of additive linear and nonparametric components in the argument of G gives a generalized additive partial linear model (GAPLM) or generalized partial linear model (GPLM). If there is no link function (or G is the identity), we speak of additive models (AM), partial linear models (PLM) or additive partial linear models (APLM).

• The estimation of the GLM is performed through an iterative algorithm. This algorithm, iteratively reweighted least squares (IRLS), applies weighted least squares to the adjusted dependent variable Z in each iteration step:

$$ \beta^{new} = (X^\top W X)^{-1} X^\top W Z $$

This numerical approach needs to be appropriately modified for estimating the semiparametric modifications of the GLM.

Single Index Models 6-1

Chapter 6:
Single Index Models

Single Index Models 6-2

Single Index Models

A single index model aims to summarize all information of the regressors in one single number, the index X^⊤β.

Parametric single index model (GLM):

$$ E(Y|X) = m(X) = G(X^\top\beta) \qquad (6.1) $$

Semiparametric single index model (SIM):

$$ E(Y|X) = m(X) = g(X^\top\beta) \qquad (6.2) $$

• β = (β₁, …, β_p)^⊤ is the unknown parameter vector
• G(•) is a known link function
• g(•) is an unknown smooth link function

Single Index Models 6-3

Misspecified parametric SIM

Example: Binary Response Models

$$ Y = \begin{cases} 1 & \text{if } Y^* = X^\top\beta - \varepsilon \geq 0, \\ 0 & \text{otherwise,} \end{cases} $$

M1: ε ∼ standard logistic, independent of X
    E(Y|X) = G(X^⊤β) = 1/{1 + exp(−X^⊤β)}
M2: ε ∼ N(0, 1), independent of X
    E(Y|X) = G(X^⊤β) = Φ(X^⊤β)
M3: ε ∼ heteroscedastic logit
    E(Y|X) = G(X^⊤β) = 1/{1 + exp(−X^⊤β/h(X))},
    where h(X) = 0.25{1 + 2(X^⊤β)² + 4(X^⊤β)⁴}

Single Index Models 6-4

Figure 66: Link functions for M1 (logit), M2 (probit) and M3 (heteroskedastic), G(x'beta) against the index x'beta
SPMparametriclinkfunctions

Single Index Models 6-5

Results from Horowitz (1993):

• Suppose β₁ = 1 and β₂ = 1, so β₂/β₁ = 1.
• β₁* and β₂* are the probability limits of the MLE.

True     Estimated   Max error of    β₂*/β₁*
Model    Model       E(Y|X)
M1       M1          0               1
M2       M1          0.02            1
M3       M1          0.37            0.45

⟹ the misspecification error is big (small) if the true link is really different (not very different) from the link used for estimation

Single Index Models 6-6

Estimation of Single Index Models

Single index model (SIM):

$$ E(Y|X) = m(X) = g(X^\top\beta) \qquad (6.3) $$

where
• β = (β₁, …, β_p)^⊤ is a finite dimensional parameter vector
• g(•) is a smooth link function

Single Index Models 6-7

SIM Algorithm

• estimate β by β̂
• compute index values η̂ = X^⊤β̂
• estimate the link function g(•) by using a (univariate) nonparametric method for the regression of Y on η̂

Single Index Models 6-8

Identification Issues

Consider

model (A):  E(Y|X) = g(X^⊤β + c)
model (B):  E(Y|X) = g̃(X^⊤β), where g̃(•) = g(• + c)

⟹ an intercept in the model cannot be identified

Single Index Models 6-9

Consider

model (A):  E(Y|X) = g(X^⊤β)
model (B):  E(Y|X) = g̃(X^⊤β̃), where g̃(•) = g(•/c) and β̃ = c·β

⟹ β is identified only up to a scale factor

Single Index Models 6-10

Figure 67: Two link functions, E(Y|X) = G(Index) and E(Y|X) = G(−Index), plotted against the index

Single Index Models 6-11

Consider the latent variable

$$ Y^* = v_\beta(X) - \varepsilon, \qquad \varepsilon = \omega\{v_\beta(X)\} \cdot \zeta, $$

where ω(•) is an unknown function and ζ a standard logistic error term (independent of X). After some calculation,

$$ P(Y = 1|X = x) = E(Y|X = x) = G_\zeta\left[\frac{v_\beta(x)}{\omega\{v_\beta(x)\}}\right]. $$

Single Index Models 6-12

The regression function E(Y|X) is unknown because ω(•) is unknown; the functional form of the link G_ζ is known. The resulting, not necessarily monotone, link function is

$$ g(\bullet) = G_\zeta\left[\frac{\bullet}{\omega\{\bullet\}}\right]. $$

To compare the effects of two particular explanatory variables, we can only use the ratio of their coefficients.

Single Index Models 6-13

Estimation

• Pseudo-Maximum Likelihood Estimation (PMLE)
  Use a nonparametric estimate of the link function g(•) (or of its derivative) in the log-likelihood.
• (Weighted) Semiparametric Least Squares (SLS, WSLS)
  Construct an objective function that estimates β with √n-rate. The objective function involves the conditional distribution of Y (or its nonparametric estimate) and is minimized w.r.t. the parameter β.
• Average Derivative Methods
  Direct estimation of the parametric coefficients via the average derivative of the regression function m(•).

Single Index Models 6-14

Related methods

• nonlinear index functions:

$$ E(Y|X) = m(X) = g\{v_\beta(X)\} $$

• multiple indices: projection pursuit, sliced inverse regression
• series expansion for g: Gallant and Nychka (1987) estimator
• maximum score estimators: Manski (1985), Horowitz (smoothed version)

Single Index Models 6-15

Average Derivatives (ADE)

Assume X = (T, U), i.e. split the explanatory variables into a continuous part T = (T₁, …, T_q)^⊤ and a discrete part U = (U₁, …, U_p)^⊤.

Single index model with only continuous variables:

$$ E(Y|T) = m(T) = g(T^\top\gamma) $$

Vector of average derivatives:

$$ \delta = E\{\nabla m(T)\} = E\{g'(T^\top\gamma)\}\,\gamma, $$

where

$$ \nabla m(t) = \left\{\frac{\partial m(t)}{\partial t_1}, \ldots, \frac{\partial m(t)}{\partial t_q}\right\}^\top $$

Single Index Models 6-16

Transfer the Derivative from m onto the Density f

Score vector:

$$ s(t) \stackrel{def}{=} \nabla\log f(t) = \frac{\nabla f(t)}{f(t)} $$

Consider

$$ \delta = E\{\nabla m(T)\} = \int \nabla m(t)\, f(t)\, dt \stackrel{\text{(integr. by parts)}}{=} -\int \frac{\nabla f(t)}{f(t)}\, f(t)\, m(t)\, dt = -\int s(t)\, m(t)\, f(t)\, dt = -E\{s(T)\, m(T)\} = -E\{s(T)\, Y\} $$

Estimation by the sample analog:

$$ \widehat{\delta} = -\frac{1}{n}\sum_{i=1}^{n} \widehat{s}_H(T_i)\, Y_i, \qquad H = \text{bandwidth matrix} $$

Single Index Models 6-17

Score estimator with trimming

Estimation of s:

$$ \widehat{s}_H(t) = \{\widehat{f}_H(t)\}^{-1}\left\{\frac{\partial \widehat{f}_H(t)}{\partial t_1}, \ldots, \frac{\partial \widehat{f}_H(t)}{\partial t_q}\right\}^\top, $$

where f̂_H(t) is a (multivariate) density estimator and

$$ \frac{\partial}{\partial t_j}\widehat{f}_H(t) = \frac{1}{n\det(H)}\sum_{i=1}^{n}\frac{\partial}{\partial t_j} K_H(t - T_i). $$

Trimming away small values of f̂_H(•) in the denominator:

$$ \widehat{\delta} = n^{-1}\sum_{i=1}^{n} \widehat{s}_H(T_i)\, Y_i\, \underbrace{I\{\widehat{f}_H(T_i) > b_n\}}_{\text{trimming factor}} $$


Single Index Models 6-18

Asymptotic properties

$$ \sqrt{n}\,(\widehat{\delta} - \delta) \xrightarrow{L} N(0, \Sigma_\delta), $$

where Σ_δ is the covariance matrix of s(T)Y + {∇m(T) − s(T)m(T)}

⟹ parametric rate of convergence

Single Index Models 6-19

Final estimators

$$ \widehat{\delta} = n^{-1}\sum_{i=1}^{n} \widehat{s}_H(T_i)\, Y_i\, I\{\widehat{f}_H(T_i) > b_n\} $$

$$ \widehat{m}_h(t) = \widehat{g}_h(t^\top\widehat{\gamma}) = \frac{\sum_{i=1}^{n} K_h(t^\top\widehat{\gamma} - T_i^\top\widehat{\gamma})\, Y_i}{\sum_{i=1}^{n} K_h(t^\top\widehat{\gamma} - T_i^\top\widehat{\gamma})} $$

ĝ_h(•) converges for h ∼ n^{−1/5} at rate n^{2/5}, i.e. like a univariate kernel regression estimator.

Single Index Models 6-20

Weighted ADE

Introduce weights instead of trimming:

$$ \delta = E\{\nabla m(T)\, w(T)\} = E\{g'(T^\top\gamma)\, w(T)\}\,\gamma $$

Density weighted ADE (WADE) means w(t) = f(t):

$$ \delta = \int \nabla m(t)\, f^2(t)\, dt = -2\int m(t)\, \nabla f(t)\, f(t)\, dt = -2\,E\{Y\,\nabla f(T)\} $$

$$ \widehat{\delta} = -\frac{2}{n}\sum_{i=1}^{n} Y_i \left\{\frac{\partial \widehat{f}_H(T_i)}{\partial t_1}, \ldots, \frac{\partial \widehat{f}_H(T_i)}{\partial t_q}\right\}^\top $$

Single Index Models 6-21

Example: Unemployment after completion of an apprenticeship in West Germany

            GLM       WADE
            (logit)   h = 1     h = 1.25  h = 1.5   h = 1.75  h = 2
constant    -5630     –         –         –         –         –
EARNINGS    -1.00     -1.00     -1.00     -1.00     -1.00     -1.00
CITY SIZE   -0.72     -0.23     -0.47     -0.66     -0.81     -0.91
DEGREE      -1.79     -1.06     -1.52     -1.93     -2.25     -2.47
URATE       363.03    169.63    245.75    319.48    384.46    483.31

Table 11: WADE fit of unemployment data

Single Index Models 6-22

Discrete Covariates

Assume now both continuous and discrete covariates:

$$ E(Y|T, U) = g(U^\top\beta + T^\top\gamma) $$

How can ADE/WADE be extended to this case? Simplest case: binary U (i.e. U ∈ {0, 1}):

E(Y|T, U) = g(T^⊤γ)       if U = 0
E(Y|T, U) = g(T^⊤γ + β)   if U = 1

Single Index Models 6-23

Figure 68: Coefficient of the variable U: horizontal distance β between the link functions F(t) and F(t + α), plotted against index values t

Single Index Models 6-24

Integral Estimator

Split the sample: let superscripts (0) and (1) denote the observations from the subsamples with U_i = 0 and U_i = 1; then use the integral difference as an estimator for β:

$$ \widehat{\beta} = \widehat{J}^{(1)} - \widehat{J}^{(0)} $$

Figure 69: The "integral" estimator: β as the difference of the areas under the link functions F(t) and F(t + α)

Single Index Models 6-25

How to estimate the integrals?

$$ \widehat{J}^{(0)} = \sum_{i=0}^{n_0} Y_i^{(0)}\left(T_{i+1}^{(0)} - T_i^{(0)}\right)^\top\widehat{\gamma}, \qquad \widehat{J}^{(1)} = \sum_{i=0}^{n_1} Y_i^{(1)}\left(T_{i+1}^{(1)} - T_i^{(1)}\right)^\top\widehat{\gamma} $$

The estimator is √n-consistent and can be improved in efficiency by a one-step estimator.

Single Index Models 6-26

Multicategorical U

Thresholded link function:

$$ \bar{g} = c_0\, I(g < c_0) + g\, I(c_0 \leq g \leq c_1) + c_1\, I(g > c_1). $$

Integrated link function conditional on u^{(k)}:

$$ J^{(k)} = \int_{v_0}^{v_1} \bar{g}(v + \beta^\top u^{(k)})\, dv, \qquad k = 1, \ldots, M \text{ categories of } U $$

Single Index Models 6-27

Multiple integral comparisons

$$ J^{(k)} - J^{(0)} = (c_1 - c_0)\left\{u^{(k)} - u^{(0)}\right\}^\top\beta, $$

or simply

$$ \Delta J = (c_1 - c_0)\,\Delta u\,\beta, $$

where

$$ \Delta J = \begin{pmatrix} J^{(1)} - J^{(0)} \\ \vdots \\ J^{(M)} - J^{(0)} \end{pmatrix}, \qquad \Delta u = \begin{pmatrix} u^{(1)} - u^{(0)} \\ \vdots \\ u^{(M)} - u^{(0)} \end{pmatrix} $$

Single Index Models 6-28

Estimator for β

$$ \widehat{\beta} = (c_1 - c_0)^{-1}(\Delta u^\top \Delta u)^{-1}\Delta u^\top \widehat{\Delta J}, $$

where J^{(k)} is replaced by

$$ \widehat{J}^{(k)} = \int_{v_0}^{v_1} \widehat{g}^{(k)}(v)\, dv $$

• ĝ^{(k)} is obtained by a univariate regression of Y_i^{(k)} on the estimated "continuous" indices γ̂^⊤T_i^{(k)}
• using a √n-consistent estimate γ̂ and Nadaraya-Watson estimators ĝ^{(k)} for g^{(k)}, the estimated coefficient β̂ is itself √n-consistent and asymptotically normal

Single Index Models 6-29

Testing model (mis-)specification

Aim: check whether the parametric model is appropriate by comparing it with a non- or semiparametric model.

Härdle-Mammen test (parametric vs. nonparametric)

$$ Y_i = m(X_i) + \varepsilon_i, \qquad X_i \in \mathbb{R}^d $$

Notation:

$$ \widehat{m}_h(\bullet) = \frac{\sum_{i=1}^{n} K_h(\bullet - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(\bullet - X_i)}, \qquad \mathcal{K}_{h,n}\, g(\bullet) = \frac{\sum_{i=1}^{n} K_h(\bullet - X_i)\, g(X_i)}{\sum_{i=1}^{n} K_h(\bullet - X_i)} $$

Single Index Models 6-30

Test

$$ H_0: m \in \{m_\theta : \theta \in \Theta\} $$

Test statistic (with weight function w):

$$ T_n = n h^{d/2}\int\left\{\widehat{m}_h(x) - \mathcal{K}_{h,n} m_{\widehat{\theta}}(x)\right\}^2 w(x)\, dx $$

• h ∼ n^{−1/(d+4)}
• asymptotically normally distributed
• BUT bias and variance contain unknown expressions ⟹ use the wild bootstrap

Single Index Models 6-31

Horowitz-Härdle test / Werwatz test (GLM vs. SIM)

$$ E(Y|X) = G\{v_\beta(X)\}, \qquad X \in \mathbb{R}^d, \; \beta \in \mathbb{R}^k $$

Idea: compare E{Y | v_β(X) = v} (nonparametrically estimated) with G(v) (G the known link).

$$ H_0: E\{Y | v_\beta(X) = v\} = G(v) $$
$$ H_1: E\{Y | v_\beta(X) = v\} = \text{some other smooth function in } v $$

Test statistic:

$$ T = \sqrt{h}\sum_{i=1}^{n}\left[Y_i - G\{v_{\widehat{\beta}}(X_i)\}\right]\left[\widehat{G}_{-i}\{v_{\widehat{\beta}}(X_i)\} - G\{v_{\widehat{\beta}}(X_i)\}\right] w\{v_{\widehat{\beta}}(X_i)\} $$

Single Index Models 6-32

Nonparametric Estimator for the Link

$$ \widehat{G}_{-i}(\bullet) = \frac{\widehat{G}_{h,-i}(\bullet) - (h/\bar{h})^d\,\widehat{G}_{\bar{h},-i}(\bullet)}{1 - (h/\bar{h})^d}, $$

where Ĝ_{h,−i} is a (Bierens-corrected) leave-one-out estimator with bandwidth h and

$$ h \sim n^{-1/(2d+1)}, \qquad \bar{h} \sim n^{-\delta/(2d+1)}, \qquad 0 < \delta < 1. $$

• Ĝ_{−i} is asymptotically unbiased with optimal rate
• T is asymptotically N(0, σ_T²)

Single Index Models 6-33

Summary: Single Index Models

• A single index model (SIM) is of the form

$$ E(Y|X) = m(X) = g\{v_\beta(X)\}, $$

where v_β(•) is an index function known up to the parameter vector β and g(•) is an unknown smooth link function. In most applications the index function is of linear form, i.e. v_β(x) = x^⊤β.

• Due to the nonparametric form of the link function, neither an intercept nor a scale parameter can be identified. For example, if the index is linear, we have to estimate

$$ E(Y|X) = g\{X^\top\beta\}. $$

There is no intercept parameter, and β can only be estimated

Single Index Models 6-34

up to an unknown scale factor. To identify the slope parameters of interest, β is usually assumed to have one component identical to 1 or to be a vector of length 1.

• The estimation of a SIM usually proceeds in the following steps: First the parameter β is estimated. Then, using the index values η̂_i = X_i^⊤β̂, the nonparametric link function g is estimated by a univariate nonparametric regression method.

• For the estimation of β, two approaches are available: iterative methods, such as semiparametric least squares (SLS) or pseudo-maximum likelihood estimation (PMLE), and direct methods, such as (weighted) average derivative estimation (WADE/ADE). Iterative methods can easily handle mixed discrete-continuous regressors but may need to employ sophisticated routines to deal with possible local optima of the optimization criterion. Direct methods such as WADE/ADE avoid

Single Index Models 6-35

the technical difficulties of an optimization but do require continuous explanatory variables. An extension of the direct approach is only possible for a small number of additional discrete variables.

• There are specification tests available to test whether we have a GLM or a true SIM.

Generalized Partial Linear Models 7-1

Chapter 7:
Generalized Partial Linear Models

Generalized Partial Linear Models 7-2

Generalized Partial Linear Models

PLM:   E(Y|U, T) = U^⊤β + m(T)
GPLM:  E(Y|U, T) = G{U^⊤β + m(T)}

• β = (β₁, …, β_p)^⊤ is a finite dimensional parameter
• m(•) is a smooth function

Generalized Partial Linear Models 7-3

Partial Linear Models (PLM)

$$ Y = \beta^\top U + m(T) + \varepsilon $$

Take expectations conditional on T:

$$ E(Y|T) = \beta^\top E(U|T) + E\{m(T)|T\} + E(\varepsilon|T) $$

Difference:

$$ \underbrace{Y - E(Y|T)}_{\widetilde{Y}} = \beta^\top\{\underbrace{U - E(U|T)}_{\widetilde{U}}\} + \underbrace{\varepsilon - E(\varepsilon|T)}_{\widetilde{\varepsilon}} $$

Generalized Partial Linear Models 7-4

Speckman estimators

For the parameter β: linear regression of Ỹ_i on Ũ_i:

$$ \widehat{\beta} = \left(\widetilde{U}^\top\widetilde{U}\right)^{-1}\widetilde{U}^\top\widetilde{Y} $$

For the function m(•): nonparametric regression of Y_i − U_i^⊤β̂ on T_i:

$$ \widehat{m} = S\left(Y - U\widehat{\beta}\right), $$

where the smoother matrix S is defined by

$$ (S)_{ij} = \frac{K_H(T_i - T_j)}{\sum_{k} K_H(T_i - T_k)} $$

and

$$ \widetilde{U} = U - SU, \qquad \widetilde{Y} = Y - SY. $$

Page 72: Spm Slides

Generalized Partial Linear Models 7-5

Backfitting

denote P the projection matrix P = U(U�U)−1U� and S thesmoother matrix

backfitting means to solve

Uβ = P(Y − m)

m = S(Y − Uβ).

explicit solution is given by

β = {U�(I − S)U}−1U�(I − S)Y,

m = S(Y − Uβ).

SPM

Generalized Partial Linear Models 7-6

Introducing a link function

Possible estimation methods for the GPLM:

• profile likelihood
• generalized Speckman estimator
• backfitting for the GPLM

Generalized Partial Linear Models 7-7

Profile Likelihood

• fix the parameter β and estimate the nonparametric function in dependence of this fixed β
• the resulting estimate for m_β(•) is then used to construct the profile likelihood for β

As a consequence of the profile likelihood method, β̂ is asymptotically efficient.

Generalized Partial Linear Models 7-8

Details on Profile likelihood

Requires: the distribution of Y given U, T is parametric in β, m(T).

• least favorable curve m_β(•):

$$ \max_{m_\beta(t)} E\Big\{\underbrace{\ell(Y, U^\top\beta + m_\beta(t), \psi)}_{\text{likelihood}} \,\Big|\, T = t\Big\} $$

The conditional expectation can be estimated by

$$ \max_{m_\beta(t)} \sum_i \ell\{Y_i, U_i^\top\beta + m_\beta(t), \psi\}\, K_H(T_i - t) \qquad (7.1) $$

• profile likelihood estimator for β:

$$ \max_\beta \sum_i \ell_i\{U_i^\top\beta + m_\beta(T_i)\} $$

Generalized Partial Linear Models 7-9

Denote by ℓ_i the individual log-likelihood for observation i and by ℓ'_i, ℓ''_i its first and second derivatives w.r.t. η_i = U_i^⊤β + m(T_i).

Example: Profile likelihood for the normal PLM

$$ Y_i = U_i^\top\beta + m(T_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \text{ i.i.d.} $$

Likelihood:

$$ \ell_i\{U_i^\top\beta + m(T_i)\} = -\frac{1}{2\sigma^2}\{Y_i - U_i^\top\beta - m(T_i)\}^2 + \ldots $$

• least favorable curve:

$$ \min_{m_\beta(t)} \sum_i \{Y_i - U_i^\top\beta - m_\beta(t)\}^2\, K_H(t - T_i) \qquad (7.2) $$

$$ \widehat{m}_\beta(t) = \frac{\sum_i K_H(t - T_i)\,(Y_i - U_i^\top\beta)}{\sum_i K_H(t - T_i)} \iff \widehat{m}_\beta = S(Y - U\beta) $$

� profile likelihood estimator for β

maxβ

∑i

�i{U�i β + mβ(Ti )}

=⇒ need to solve

0 =n∑

i=1

�′i{U�i β + mβ(Ti )} {Ui +

∂βmβ(Ti )}

finding maximum in (7.1) resp. (7.2) means to solve

0 =n∑

i=1

�′i{U�i β + mβ(t)}KH(t − Ti ) (7.3)

taking the derivative of (7.3) with respect to β gives

∂βmβ(t) = −

n∑i=1

KH(t − Ti )�′′i {U�

i β + mβ(t)}Ui

n∑i=1

KH(t − Ti )�′′i {U�i β + mβ(t)}

SPM

Generalized Partial Linear Models 7-11

For the PLM we have ℓ''_i ≡ −1 (a constant, which cancels in the ratio); hence

$$ U_i + \frac{\partial}{\partial\beta} m_\beta(T_i) = U_i - \frac{\sum_{j=1}^{n} K_H(T_i - T_j)\, U_j}{\sum_{j=1}^{n} K_H(T_i - T_j)} = \widetilde{U}_i = [(I - S)U]_i $$

⟹ we need to solve

$$ 0 = \sum_{i=1}^{n}\{Y_i - U_i^\top\beta - m_\beta(T_i)\}\,[(I - S)U]_i = \widetilde{U}^\top(\widetilde{Y} - \widetilde{U}\beta), $$

since

$$ \{Y_i - U_i^\top\beta - m_\beta(T_i)\} = [Y - U\beta - S(Y - U\beta)]_i = [(I - S)(Y - U\beta)]_i $$

⟹ Speckman solution

$$ \widehat{\beta} = \left(\widetilde{U}^\top\widetilde{U}\right)^{-1}\widetilde{U}^\top\widetilde{Y} $$

Profile Likelihood Algorithm (P)

maximize usual likelihood

0 =n∑

i=1

�′i{U�i β + mβ(Ti )} {Ui +

∂βmβ(Ti )}

maximize smoothed quasi-likelihood

0 =n∑

i=1

�′i{U�i β + mβ(Tj)}KH(Ti − Tj)

SPM

Page 74: Spm Slides

Generalized Partial Linear Models 7-13

Algorithm (P)

• updating step for β:

$$ \beta^{new} = \beta - B^{-1}\sum_{i=1}^{n} \ell'_i(U_i^\top\beta + \widehat{m}_i)\,\widetilde{U}_i, \qquad B = \sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_i)\,\widetilde{U}_i\widetilde{U}_i^\top, $$

$$ \widetilde{U}_j = U_j - \frac{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)\, U_i}{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}. $$

• updating step for m̂_j:

$$ \widehat{m}_j^{new} = \widehat{m}_j - \frac{\sum_{i=1}^{n} \ell'_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}. $$

Generalized Partial Linear Models 7-14

Define the smoother matrix S_P by its elements

$$ (S_P)_{ij} = \frac{\ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)} $$

Algorithm (P)

• updating step for β:

$$ \beta^{new} = (\widetilde{U}^\top W\widetilde{U})^{-1}\widetilde{U}^\top W\widetilde{Z} $$

with

$$ \widetilde{U} = (I - S_P)U, \qquad \widetilde{Z} = \widetilde{U}\beta - W^{-1}v, $$

where U is the design matrix, I the identity, v = (ℓ'_i) and W = diag(ℓ''_i).

Generalized Partial Linear Models 7-15

Backfitting Algorithm (B)

• generalized partial linear model:

$$ E(Y|U, T) = G\{U^\top\beta + m(T)\} $$

⟹ backfitting for the adjusted dependent variable.

Use now the smoother matrix S defined through

$$ (S)_{ij} = \frac{\ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)}{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)} $$

Generalized Partial Linear Models 7-16

Algorithm (B)

• updating step for β:

$$ \beta^{new} = (U^\top W\widetilde{U})^{-1} U^\top W\widetilde{Z} $$

• updating step for m:

$$ m^{new} = S(Z - U\beta), $$

using the notations

$$ \widetilde{U} = (I - S)U, \qquad \widetilde{Z} = (I - S)Z = \widetilde{U}\beta - W^{-1}v, $$

where U is the design matrix, I the identity, v = (ℓ'_i) and W = diag(ℓ''_i).

Generalized Partial Linear Models 7-17

Note that the update of the index Uβ + m can be expressed by a linear estimation matrix R_B:

$$ U\beta^{new} + m^{new} = R_B Z $$

with

$$ R_B = \widetilde{U}\{U^\top W\widetilde{U}\}^{-1} U^\top W(I - S) + S. $$

Generalized Partial Linear Models 7-18

Generalized Speckman Estimator (S)

• generalized partial linear model:

$$ E(Y|U, T) = G\{U^\top\beta + m(T)\} $$

⟹ Speckman approach for the adjusted dependent variable.

Recall the smoother matrix from backfitting, with elements

$$ (S)_{ij} = \frac{\ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)}{\sum_{i=1}^{n} \ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)} $$

Generalized Partial Linear Models 7-19

Algorithm (S)

• updating step for β:

$$ \beta^{new} = (\widetilde{U}^\top W\widetilde{U})^{-1}\widetilde{U}^\top W\widetilde{Z} $$

• updating step for m:

$$ m^{new} = S(Z - U\beta), $$

using the notations

$$ \widetilde{U} = (I - S)U, \qquad \widetilde{Z} = (I - S)Z = \widetilde{U}\beta - W^{-1}v, $$

where U is the design matrix, I the identity, v = (ℓ'_i) and W = diag(ℓ''_i).

Generalized Partial Linear Models 7-20

The generalized Speckman estimator (S) shares the property of being linear in the variable Z:

$$ U\beta^{new} + m^{new} = R_S Z $$

with

$$ R_S = \widetilde{U}\{\widetilde{U}^\top W\widetilde{U}\}^{-1}\widetilde{U}^\top W(I - S) + S. $$

Note that here, in contrast to (B), always Ũ is used.

Generalized Partial Linear Models 7-21

Comparison of (P), (S), (B)

(P):  $(S_P)_{ij} = \dfrac{\ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}{\sum_{i=1}^{n}\ell''_i(U_i^\top\beta + \widehat{m}_j)\, K_H(T_i - T_j)}$;
      β^{new} = (Ũ^⊤WŨ)⁻¹Ũ^⊤WZ̃ with Ũ = (I − S_P)U; m̂ updated by the Newton step above.

(S):  $(S)_{ij} = \dfrac{\ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)}{\sum_{i=1}^{n}\ell''_i(U_i^\top\beta + \widehat{m}_i)\, K_H(T_i - T_j)}$;
      β^{new} = (Ũ^⊤WŨ)⁻¹Ũ^⊤WZ̃, m^{new} = S(Z − Uβ).

(B):  same smoother S as in (S);
      β^{new} = (U^⊤WŨ)⁻¹U^⊤WZ̃, m^{new} = S(Z − Uβ).

Generalized Partial Linear Models 7-22

Testing parametric versus semiparametric

Likelihood ratio (LR) test statistic:

$$ LR = 2\sum_{i=1}^{n}\left\{\ell(Y_i, \widehat{\mu}_i, \psi) - \ell(Y_i, \widetilde{\mu}_i, \psi)\right\} = 2\left\{\ell(Y, \widehat{\mu}, \psi) - \ell(Y, \widetilde{\mu}, \psi)\right\} $$

semiparametric: μ̂_i = G{U_i^⊤β̂ + m̂(T_i)}
parametric:     μ̃_i = G{U_i^⊤β̃ + T_i^⊤γ̃ + γ̃₀}

Alternatively, use the deviance

$$ D(Y, \mu, \psi) = 2\{\ell(Y, \mu_{max}, \psi) - \ell(Y, \mu, \psi)\} $$

$$ \Longrightarrow \quad LR = D(Y, \widetilde{\mu}, \psi) - D(Y, \widehat{\mu}, \psi) $$

Generalized Partial Linear Models 7-23

If, at convergence of the iterative estimation,

$$ \widehat{\eta} = R Z = R(\widehat{\eta} - W^{-1}v), $$

then

$$ D(Y, \widehat{\mu}) = a(\psi)\, D(Y, \widehat{\mu}, \psi) \approx -(Z - \widehat{\eta})^\top W^{-1}(Z - \widehat{\eta}). $$

We need approximate degrees of freedom for the semiparametric estimator:

$$ df_{err}(\widehat{\mu}) = n - \operatorname{tr}\left(2R - R^\top W R W^{-1}\right) $$

or, simpler,

$$ df_{err}(\widehat{\mu}) = n - \operatorname{tr}(R). $$

Generalized Partial Linear Models 7-24

For backfitting (B) and algorithm (S) we have η̂ = RZ; for profile likelihood (P) we can use, approximately,

$$ R_P = \widetilde{U}\{\widetilde{U}^\top W\widetilde{U}\}^{-1}\widetilde{U}^\top W(I - S_P) + S, $$

where Ũ denotes (I − S_P)U.

Generalized Partial Linear Models 7-25

Modified LR test

Problem: under the parametric hypothesis, the parametric estimate is unbiased, whereas the non-/semiparametric estimates are always biased.

⟹ use the bias-adjusted parametric estimate m̃(T_j), obtained by smoothing {G(U_i^⊤β̃ + T_i^⊤γ̃ + γ̃₀), U_i, T_i}, i = 1, …, n.

Modified LR test statistic:

$$ LR = 2\sum_{i=1}^{n}\left\{\ell(\widehat{\mu}_i, \widehat{\mu}_i, \psi) - \ell(\widetilde{\mu}_i, \widehat{\mu}_i, \psi)\right\} = 2\left\{\ell(\widehat{\mu}, \widehat{\mu}, \psi) - \ell(\widetilde{\mu}, \widehat{\mu}, \psi)\right\} $$

Generalized Partial Linear Models 7-26

Asymptotically equivalent:

$$ \widetilde{LR} = \sum_{i=1}^{n} w_i\left\{U_i^\top(\widehat{\beta} - \widetilde{\beta}) + \widehat{m}(T_i) - \widetilde{m}(T_i)\right\}^2 $$

with

$$ w_i = \frac{\left[G'\{U_i^\top\widehat{\beta} + \widehat{m}(T_i)\}\right]^2}{V\left[G\{U_i^\top\widehat{\beta} + \widehat{m}(T_i)\}\right]}. $$

Theorem
Under the linearity hypothesis it holds that:

• LR = L̃R + o_p(v_n),
• v_n⁻¹(LR − b_n) →^L N(0, 1)

Generalized Partial Linear Models 7-27

Bootstrapping the modified LR test

Theorem
It holds that

$$ d_K(LR^*, LR) \xrightarrow{L} 0, $$

where d_K denotes the Kolmogorov distance.

Bootstrap algorithm:

(a) generate nboot samples {Y₁*, …, Y_n*} with
    E*(Y_i*) = μ̃_i = G(U_i^⊤β̃ + T_i^⊤γ̃ + γ̃₀),
    Var*(Y_i*) = a(ψ̃) V(μ̃_i)
(b) calculate estimates based on each of the bootstrap samples and from these the bootstrap test statistics LR*
(c) the critical values (the quantiles of the distribution of LR) are then estimated by the empirical quantiles of the nboot LR* values

Generalized Partial Linear Models 7-28

Comparison of methods

• backfitting (B) is best under independence; (P) and (S) seem better otherwise
• for large n: (P) ≈ (S)
• in testing parametric versus nonparametric, (P) and (S) work well with approximate degrees of freedom; bootstrapping LR improves the test
• (S) seems a good compromise between accuracy and computational efficiency in estimation and specification testing

Generalized Partial Linear Models 7-29

Example: Migration

                                       Yes     No    (in %)
Y    migration intention               38.5    61.5
U1   family/friends in west            85.6    11.2
U2   unemployed/job loss certain       19.7    78.9
U3   city size 10,000-100,000          29.3    64.2
U4   female                            51.1    49.8

                               Min    Max    Mean      S.D.
U5   age (years)               18     65     39.84     12.61
T    household income (DM)     200    4000   2194.30   752.45

Table 12: Descriptive statistics for migration data, n = 3235

Generalized Partial Linear Models 7-30

Figure 70: The impact of household income on migration intention: nonparametric, linear and bias-adjusted parametric estimates m̂(household income), bandwidths h = 10%, 20%, 30%, 40% of the range

Generalized Partial Linear Models 7-31

                          Linear (logit)           Part. Linear
                          Coeff.      (t-value)    Coeff.      (t-value)
const.                    0.512       (2.39)       –           –
family/friends in west    0.599       (5.20)       0.598       (5.14)
unemployed/job loss       0.221       (2.31)       0.230       (2.39)
city size 10-100,000      0.311       (3.77)       0.302       (3.63)
female                    -0.240      (3.15)       -0.249      (3.26)
age                       -4.69·10⁻²  (14.56)      -4.74·10⁻²  (14.59)
household income          1.42·10⁻⁴   (2.73)       –           –

Table 13: Logit coefficients and GPLM coefficients for migration data (t-values in parentheses), h = 20% for the GPLM

Generalized Partial Linear Models 7-32

Testing GLM vs. GPLM

h     10%     20%     30%     40%
R     0.001   0.001   0.116   0.516
R̃    0.001   0.002   0.109   0.488

Table 14: Observed significance levels for the linearity test for migration data, normal approximation of the quantiles

Bootstrap approximation of the quantiles: all computed significance levels for rejection are below 0.01.

Generalized Partial Linear Models 7-33

Figure 71: Density estimates of the bootstrapped LR statistic (thick) and limiting normal distribution (thin), nboot = 200 bootstrap replications, bandwidths h = 10%, 20%, 30%, 40% of the range

Generalized Partial Linear Models 7-34

Example: Credit Scoring

$$ Y = \begin{cases} 1 & \text{loan is repaid} \\ 0 & \text{loan has defaulted} \end{cases} $$

n = 1000, of which 300 defaulted

• the data set contains observations on three continuous variables (duration and amount of credit, age of client) and 17 discrete variables

Generalized Partial Linear Models 7-35

                        Linear              Quadratic           Part. Linear
                        Coeff.   (t-val.)   Coeff.   (t-val.)   Coeff.   (t-val.)
constant                -17.605  (-1.91)    -34.909  (-2.51)    –        –
duration                -0.036   (-3.85)    -0.033   (-3.48)    -0.037   (-4.23)
log(amount)             1.654    (1.41)     4.847    (2.57)     –        –
log(amount) squared     –        –          -0.229   (-2.26)    –        –
log(age)                4.119    (1.59)     6.949    (1.11)     –        –
log(age) squared        –        –          -0.501   (-0.60)    –        –
log(age)·log(amount)    -0.484   (-1.47)    -0.384   (-1.17)    –        –
…                       …        …          …        …          …        …

Table 15: Parametric logit and GPLM coefficients (t-values in parentheses), bandwidths are 40% of the range

Generalized Partial Linear Models 7-36

Figure 72: Two-dimensional nonparametric function of log(amount) and log(age) (left, bandwidths are 40% of the range); scatterplot of amount and age (right)

Generalized Partial Linear Models 7-37

Figure 73: Contours of the influence of log(amount) and log(age), bandwidths are 40% of the range

Generalized Partial Linear Models 7-38

Testing GLM vs. GPLM

h                     20%     30%     40%     50%     60%
linear                <0.01   <0.01   <0.01   0.01    0.29
linear & interaction  <0.01   <0.01   <0.01   0.07    0.40
quadratic             <0.01   <0.01   <0.01   0.35    0.55

Table 16: Observed significance levels for the test of GLM against GPLM, bootstrap sample size nboot = 100

Generalized Partial Linear Models 7-39

Summary: Generalized Partial Linear Models

• A partial linear model (PLM) is given by

$$ E(Y|U, T) = U^\top\beta + m(T), $$

where β is an unknown parameter vector and m(•) is an unknown smooth function of a multidimensional argument T.

• A generalized partial linear model (GPLM) is of the form

$$ E(Y|U, T) = G\{U^\top\beta + m(T)\}, $$

where G is a known link function, β is an unknown parameter vector and m(•) is an unknown smooth function of a multidimensional argument T.

Generalized Partial Linear Models 7-40

• Partial linear models are usually estimated by Speckman's estimator. This estimator first determines the parametric component by applying an OLS estimator to a nonparametrically modified design matrix and response vector. In a second step, the nonparametric component is estimated by smoothing the residuals w.r.t. the parametric part.

• Generalized partial linear models should be estimated by the profile likelihood method or by a generalized Speckman estimator.

• The profile likelihood approach is based on the fact that the conditional distribution of Y given U and T is parametric. Its idea is to estimate the least favorable nonparametric function m_β(•) in dependence of β. The resulting estimate for m_β(•) is then used to construct the profile likelihood for β.

Generalized Partial Linear Models 7-41

• The generalized Speckman estimator can be seen as a simplification of the profile likelihood method. It is based on a combination of a parametric IRLS estimator (applied to a nonparametrically modified design matrix and response vector) and a nonparametric smoothing method (applied to the adjusted dependent variable reduced by its parametric component).

• To check whether the underlying true model is a parametric GLM or a semiparametric GPLM, one can use specification tests that are modifications of the classical likelihood ratio test. In the semiparametric setting, either an approximate number of degrees of freedom is used or the test statistic itself is modified such that bootstrapping its distribution leads to appropriate critical values.

Additive Models 8-1

Chapter 8:
Additive Models

Additive Models 8-2

Additive Models

To avoid the curse of dimensionality and for better interpretability we assume

$$ m(x) = E(Y|X = x) = c + \sum_{j=1}^{d} g_j(x_j) $$

⟹ the additive functions g_j can be estimated with the optimal one-dimensional rate

Additive Models 8-3

Two possible methods for estimating an additive model:

• backfitting estimator
• marginal integration estimator

Identification condition for both methods:

$$ E_{X_j}\{g_j(X_j)\} = 0 \quad \forall j = 1, \ldots, d \quad \Longrightarrow \quad E(Y) = c $$

Additive Models 8-4

Backfitting

Consider the following optimization problem:

$$ \min_m E\{Y - m(X)\}^2 \quad \text{such that} \quad m(x) = \sum_{j=1}^{d} g_j(x_j) $$

We aim to find the best projection of Y onto the space spanned by additive univariate smooth functions.

Additive Models 8-5

Formulation in a Hilbert space framework:

• let H_{YX} be the Hilbert space of random variables which are functions of Y, X
• let ⟨U, V⟩ = E(UV) be the scalar product
• define H_X and H_{X_j}, j = 1, …, d, as the corresponding subspaces

⟹ we aim to find the element of H_{X₁} ⊕ ⋯ ⊕ H_{X_d} closest to Y ∈ H_{YX}, or to m ∈ H_X

Additive Models 8-6

By the projection theorem, there exists a unique solution with

$$ E[\{Y - m(X)\}|X_\alpha] = 0 \iff g_\alpha(X_\alpha) = E\Big[\Big\{Y - \sum_{j\neq\alpha} g_j(X_j)\Big\}\Big| X_\alpha\Big], \quad \alpha = 1, \ldots, d $$

Denote the projection P_α(•) = E(•|X_α); then

$$ \begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_d & \cdots & P_d & I \end{pmatrix} \begin{pmatrix} g_1(X_1) \\ g_2(X_2) \\ \vdots \\ g_d(X_d) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_d Y \end{pmatrix} $$

Additive Models 8-7

Denote by S_α the (n × n) smoother matrix such that S_αY is an estimate of the vector {E(Y₁|X_{α1}), …, E(Y_n|X_{αn})}^⊤:

$$ \underbrace{\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_d & \cdots & S_d & I \end{pmatrix}}_{nd \times nd} \begin{pmatrix} \widehat{g}_1 \\ \widehat{g}_2 \\ \vdots \\ \widehat{g}_d \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_d Y \end{pmatrix} $$

Note: in finite samples the matrix on the left-hand side can be singular.

Additive Models 8-8

Backfitting algorithm

In practice, the following backfitting algorithm (a simplification of the Gauss-Seidel procedure) is used:

• initialize ĝ_α^{(0)} ≡ 0 for all α, ĉ = Ȳ
• repeat for α = 1, …, d:

$$ r_\alpha = Y - \widehat{c} - \sum_{j<\alpha} \widehat{g}_j^{(\ell+1)} - \sum_{j>\alpha} \widehat{g}_j^{(\ell)}, \qquad \widehat{g}_\alpha^{(\ell+1)}(\bullet) = S_\alpha(r_\alpha) $$

• proceed until convergence is reached
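A Python sketch of this cycling scheme (Nadaraya-Watson smoothers with a common Gaussian bandwidth; a simplification for illustration):

import numpy as np

def backfit(X, Y, h, n_sweeps=20):
    # backfitting for an additive model: each sweep updates g_a by smoothing
    # the partial residual r_a, then recenters it for identification E g_a = 0
    n, d = X.shape
    c = Y.mean()
    S = []
    for a in range(d):
        u = (X[:, a][:, None] - X[:, a][None, :]) / h
        K = np.exp(-0.5 * u ** 2)
        S.append(K / K.sum(axis=1, keepdims=True))
    g = np.zeros((n, d))
    for _ in range(n_sweeps):
        for a in range(d):
            r = Y - c - g.sum(axis=1) + g[:, a]   # partial residual r_a
            g[:, a] = S[a] @ r
            g[:, a] -= g[:, a].mean()             # recenter component
    return c, g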


Additive Models 8-9

Example: Smoother performance in additive models

Simulated sample of n = 75 regression observations with regressors X_j i.i.d. uniform on [−2.5, 2.5], generated from

$$ Y = \sum_{j=1}^{4} g_j(X_j) + \varepsilon, \qquad \varepsilon \sim N(0, 1), $$

where

g₁(X₁) = −sin(2X₁),   g₂(X₂) = X₂² − E(X₂²),
g₃(X₃) = X₃,          g₄(X₄) = exp(−X₄) − E{exp(−X₄)}

Additive Models 8-10

Figure 74: Estimated (solid lines) versus true additive component functions g1–g4 (circles at the input values), local linear estimator with quartic kernel, bandwidths h = 1.0

Additive Models 8-11

Example: Boston housing prices

Y    median value of owner-occupied homes in $1000
X1   per capita crime rate by town
X2   proportion of non-retail business acres per town
X3   nitric oxides concentration (parts per 10 million)
X4   average number of rooms per dwelling
X5   proportion of owner-occupied units built prior to 1940
X6   weighted distances to five Boston employment centers
X7   full-value property tax rate per $10,000
X8   pupil-teacher ratio by town
X9   1000(Bk − 0.63)², where Bk is the proportion of people of Afro-American descent by town
X10  percent lower status of the population

Additive Models 8-12

Fitted model: E(Y|x) = c + Σ_{j=1}^{10} g_j{log(X_j)}

Figure 75: Function estimates of additive components g1 to g6 (left), g7 to g10 (right) and partial residuals, local linear estimator with quartic kernel, h_j = 0.5 σ_j(X_j)

Additive Models 8-13

Marginal Integration

Motivated by the identifiability conditions

$$ E_{X_j}\{g_j(X_j)\} = 0, \qquad E(Y) = c, $$

it follows that

$$ E_{X_{-\alpha}}\{m(x_\alpha, X_{-\alpha})\} = E_{X_{-\alpha}}\Big\{c + g_\alpha(x_\alpha) + \sum_{j\neq\alpha} g_j(X_j)\Big\} = c + g_\alpha(x_\alpha), $$

where X_{−α} collects all regressors except X_α, i.e. the indices 1, …, α−1, α+1, …, d.

⟹ estimate the high-dimensional surface and integrate out the nuisance directions

Additive Models 8-14

Example: Illustration of the marginal integration principle

Consider the model

$$ Y = 4 + X_1^2 + 2\sin(X_2) + \varepsilon, $$

where X₁ ∼ U[−2, 2], X₂ ∼ U[−3, 3], Eε = 0, so that

$$ m(x_1, x_2) = E(Y|X = x) = 4 + x_1^2 + 2\sin(x_2). $$

We find

$$ E_{X_2}\{m(x_1, X_2)\} = \int_{-3}^{3}\left\{4 + x_1^2 + 2\sin(u)\right\}\frac{1}{6}\, du = 4 + x_1^2, $$

$$ E_{X_1}\{m(X_1, x_2)\} = \int_{-2}^{2}\left\{4 + u^2 + 2\sin(x_2)\right\}\frac{1}{4}\, du = 4 + \frac{4}{3} + 2\sin(x_2), $$

so each marginal recovers the corresponding component up to an additive constant.


Additive Models 8-15

Example: Smoother performance of the marginal integration estimator

Simulated sample of n = 150 regression observations with regressors X_j i.i.d. uniform on [−2.5, 2.5], generated from

$$ Y = \sum_{j=1}^{4} g_j(X_j) + \varepsilon, \qquad \varepsilon \sim N(0, 1), $$

where

g₁(X₁) = −sin(2X₁),   g₂(X₂) = X₂² − E(X₂²),
g₃(X₃) = X₃,          g₄(X₄) = exp(−X₄) − E{exp(−X₄)}

Additive Models 8-16

Figure 76: Estimated local linear (solid line) versus true additive component functions g1–g4 (circles at the input values), quartic kernel, h₁ = 1, h_j = 1.5 (j = 2, 3, 4), h̃ for the nuisance directions

Additive Models 8-17

Details on Marginal Integration

$$ Y_i = c + \sum_{j=1}^{d} g_j(X_{ij}) + \varepsilon_i $$

Assumptions on ε_i: Eε_i = 0, Var(ε_i) = σ²(X_i), mutually independent conditional on the X_j.

Denote the matrices

$$ Z_\alpha = (Z_{ik}) = \left((X_{i\alpha} - x_\alpha)^k\right), \qquad k = 0, \ldots, p, $$

$$ W_{\ell,\alpha} = \operatorname{diag}(W_{i\ell,\alpha}) = \operatorname{diag}\left(K_h(X_{i\alpha} - x_\alpha)\, \widetilde{K}_{\widetilde{h}}(X_{i,-\alpha} - X_{\ell,-\alpha})\right)_{i=1}^{n} $$

Additive Models 8-18

Local Polynomial Estimates of m(x_α, X_{ℓ,−α})

$$ \arg\min_\beta \sum_{i=1}^{n}\{Y_i - \beta_0 - \beta_1(X_{i\alpha} - x_\alpha) - \ldots\}^2\, W_{i\ell,\alpha} $$

Derivative estimation for the marginal effects:

$$ \widehat{g}_\alpha^{(\nu)}(x_\alpha) = \frac{\nu!}{n}\sum_{\ell=1}^{n} e_\nu^\top\left(Z_\alpha^\top W_{\ell,\alpha} Z_\alpha\right)^{-1} Z_\alpha^\top W_{\ell,\alpha} Y, $$

where ν > 0, e_ν = (0, …, 1, …, 0)^⊤, and p − ν is odd.

Additive Models 8-19

Assumptions

(A1) K(•), K̃(•) are positive, bounded, symmetric, compactly supported and Lipschitz continuous; K̃(•) is of order q > 2
(A2) n h h̃^{d−1} / log²(n) → ∞, h̃^q / h^{p+1−ν} → 0, h = h₀ n^{−1/(2p+3)}
(A3) the g_j(•) have bounded Lipschitz continuous (p+1)th derivatives
(A4) σ²(•) is bounded and Lipschitz continuous
(A5) the densities f_X and f_{X_{−α}} are uniformly bounded away from zero and infinity and are Lipschitz continuous

Additive Models 8-20

The proof mainly uses the fact that the local polynomial estimator is asymptotically equivalent to kernel estimation with the kernel

$$ K_\nu^*(u) = \sum_{t=0}^{p} s_{\nu t}\, u^t\, K(u), \qquad \mathcal{K} = \left(\int u^{t+s} K(u)\, du\right)_{0\leq t,s\leq p}, \qquad s_{\nu t} = \left(\mathcal{K}^{-1}\right)_{\nu t} $$

(the "equivalent kernel" of higher order).

Additive Models 8-21

Theorem
Under conditions (A1)-(A5), p − ν odd, and if x is in the interior of the support of f_X, the asymptotic bias and variance of the estimator can be expressed as n^{−(p+1−ν)/(2p+3)} b_α(x_α) and n^{−2(p+1−ν)/(2p+3)} v_α(x_α), where

$$ b_\alpha(x_\alpha) = \frac{\nu!\, h_0^{p+1-\nu}}{(p+1)!}\, \mu_{p+1}(K_\nu^*)\, g_\alpha^{(p+1)}(x_\alpha) $$

and

$$ v_\alpha(x_\alpha) = \frac{\nu!^2}{h_0^{2\nu+1}}\, \|K_\nu^*\|_2^2 \int \sigma^2(x_\alpha, x_{-\alpha})\, \frac{f_{-\alpha}^2(x_{-\alpha})}{f(x_\alpha, x_{-\alpha})}\, dx_{-\alpha}. $$

Additive Models 8-22

Theorem
For the additive component and derivative estimators it holds that:

$$ n^{\frac{p+1-\nu}{2p+3}}\left\{\widehat{g}_\alpha^{(\nu)}(x_\alpha) - g_\alpha^{(\nu)}(x_\alpha)\right\} \xrightarrow{L} N\{b_\alpha(x_\alpha), v_\alpha(x_\alpha)\}. $$

Theorem
For the regression function estimator m̂ = Ȳ + Σ ĝ_α it holds that:

$$ n^{\frac{p+1}{2p+3}}\{\widehat{m}(x) - m(x)\} \xrightarrow{L} N\{b(x), v(x)\}, $$

where b(x) = Σ_{α=1}^{d} b_α(x_α) and v(x) = Σ_{α=1}^{d} v_α(x_α).

Additive Models 8-23

Example: Production function estimation

Assume that separability (additivity) holds, i.e.

$$ \log(Y) = c + \sum_{\alpha=1}^{d} g_\alpha\{\log(X_\alpha)\} $$

⟹ the model is a generalization of the Cobb-Douglas model

$$ \log(Y) = c + \sum_{\alpha=1}^{d} \beta_\alpha \log(X_\alpha) $$

⟹ scale elasticities correspond to g'_α(log X_α), returns to scale to Σ_j g'_j(log X_j)

Additive Models 8-24

Livestock data of (middle sized) Wisconsin farms, 250 observations from 1987:

Y    livestock
X1   family labor force
X2   hired labor force
X3   miscellaneous inputs, e.g. repairs, rent, custom hiring, supplies, insurance, gas, oil, or utilities
X4   animal inputs, e.g. purchased feed, breeding, or veterinary services
X5   intermediate run assets, i.e. assets with a useful life of one to ten years

(all measured in US dollars)

Additive Models 8-25

Figure 77: Density estimates for the regressors ln(X1), …, ln(X5) (Family Labor, Hired Labor, Miscellaneous Inputs, Animal Inputs, Intermediate Run Assets), quartic kernel with bandwidth h = 0.75 σ_j(X_j)

Additive Models 8-26

Figure 78: Function estimates for the additive components of X1 (Family Labor) and X2 (Hired Labor) (left), parametric and nonparametric derivative estimates (right)

Additive Models 8-27

Figure 79: Function estimates for the additive components of X3 (Miscellaneous Inputs) and X4 (Animal Inputs) (left), parametric and nonparametric derivative estimates (right)

Additive Models 8-28

Figure 80: Function estimate for the additive component of X5 (Intermediate Run Assets) (left), parametric and nonparametric derivative estimates (right)

Additive Models 8-29

Backfitting versus Marginal Integration

• backfitting: projection onto the space of additive models; looks for an optimal fit
• marginal integration: estimates marginal effects by integrating out the other directions; does not compute an optimal regression fit

⟹ they are interpreted differently, but if the model is correctly specified, they estimate the same quantity

Additive Models 8-30

Additive Partial Linear Models (APLM)

$$ m(U, T) = E(Y|U, T) = c + \sum_{j=1}^{d} g_j(T_j) + U^\top\beta $$

• first approach:

$$ \widehat{g}_\alpha(t_\alpha) = \frac{1}{n}\sum_{j=1}^{n} \widehat{m}(t_\alpha, T_{j,-\alpha}, U_j) - \widehat{c}, $$

using m̂ from

$$ \arg\min \sum_{i=1}^{n}\{Y_i - \gamma_0 - \gamma_1(T_{i\alpha} - t_\alpha)\}^2\, W_{ij}\, I\{U_i = U_j\} $$

Additive Models 8-31

• second approach:

add "−U_i^⊤β" in the brackets and skip "I{U_i = U_j}"

=⇒ asymptotics for β̂ get complicated; no additional assumptions are needed, only a modification for the densities because of U; β can still be estimated with the parametric rate


Additive Models 8-32

Additive Models with Interaction Terms

\[
Y = m(X) + \sigma(X)\,\varepsilon
\]

allowing now for interaction:

\[
m(X) = c + \sum_{\alpha=1}^{d} g_\alpha(X_\alpha) + \sum_{1 \le \alpha < j \le d} g_{\alpha j}(X_\alpha, X_j)
\]

for identification:

\[
E_{X_\alpha}\{g_\alpha(X_\alpha)\} = E_{X_\alpha}\{g_{\alpha j}(X_\alpha, X_j)\} = E_{X_j}\{g_{\alpha j}(X_\alpha, X_j)\} = 0
\]


Additive Models 8-33

consider marginal integration as used before:

\[
\theta_\alpha(x_\alpha) = \int m(x_\alpha, x_{\underline\alpha})\, f_{\underline\alpha}(x_{\underline\alpha})\, dx_{\underline\alpha}\,, \quad 1 \le \alpha \le d\,,
\]

and in addition

\[
\theta_{\alpha j}(x_\alpha, x_j) = \int m(x_\alpha, x_j, x_{\underline{\alpha j}})\, f_{\underline{\alpha j}}(x_{\underline{\alpha j}})\, dx_{\underline{\alpha j}}\,,
\]
\[
c_{\alpha j} = \int g_{\alpha j}(x_\alpha, x_j)\, f_{\alpha j}(x_\alpha, x_j)\, dx_\alpha\, dx_j\,, \quad 1 \le \alpha < j \le d\,;
\]

it can be shown that

\[
\theta_{\alpha j}(x_\alpha, x_j) - \theta_\alpha(x_\alpha) - \theta_j(x_j) + \int m(x) f(x)\, dx = g_{\alpha j}(x_\alpha, x_j) + c_{\alpha j}\,.
\]

=⇒ centering this function in an appropriate way would hence give us the interaction function of interest
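A minimal sketch of this identity (an illustration under simplifying assumptions, not the slides' procedure: a full-dimensional Nadaraya-Watson pre-estimate with a single Gaussian product-kernel bandwidth h replaces the local linear smoother with separate bandwidths; all names are illustrative):

import numpy as np

def nw(X, y, x0, h):
    """Multivariate Nadaraya-Watson estimate of m(x0), Gaussian product kernel."""
    w = np.exp(-0.5 * (((X - x0) / h) ** 2).sum(axis=1))
    return w @ y / w.sum()

def theta(X, y, fixed, h):
    """Marginal integration: average m-hat over the sample, some coordinates fixed."""
    x0 = X.copy()
    for a, val in fixed.items():
        x0[:, a] = val
    return np.mean([nw(X, y, x0[i], h) for i in range(len(X))])

def interaction(X, y, a, j, xa, xj, h):
    """g_aj(xa, xj) + c_aj = theta_aj - theta_a - theta_j + integral of m*f."""
    grand = np.mean([nw(X, y, X[i], h) for i in range(len(X))])
    return (theta(X, y, {a: xa, j: xj}, h) - theta(X, y, {a: xa}, h)
            - theta(X, y, {j: xj}, h) + grand)

Centering the returned values over the sample of observed (X_iα, X_ij) pairs then removes c_αj, as described above.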


Additive Models 8-34

Local Linear Estimator

\[
\widehat{\{g_{\alpha j} + c_{\alpha j}\}} = \widehat\theta_{\alpha j} - \widehat\theta_\alpha - \widehat\theta_j + \widehat c\,,
\]

where

\[
\widehat\theta_{\alpha j} = \frac1n \sum_{l=1}^{n} e_0^\top \big( \mathbf{X}_{\alpha j}^\top \mathbf{W}_{l\alpha j} \mathbf{X}_{\alpha j} \big)^{-1} \mathbf{X}_{\alpha j}^\top \mathbf{W}_{l\alpha j}\, Y\,,
\]

and

\[
\mathbf{W}_{l\alpha j} = \operatorname{diag}\left( \left\{ \frac1n\, K_H(X_{i\alpha} - x_\alpha,\, X_{ij} - x_j)\, K_{\widetilde H}(X_{i\underline{\alpha j}} - X_{l\underline{\alpha j}}) \right\}_{i=1,\ldots,n} \right),
\]

\[
\mathbf{X}_{\alpha j} = \begin{pmatrix} 1 & X_{1\alpha} - x_\alpha & X_{1j} - x_j \\ \vdots & \vdots & \vdots \\ 1 & X_{n\alpha} - x_\alpha & X_{nj} - x_j \end{pmatrix}
\]

Additive Models 8-35

Example: Production function estimation

we can evaluate the validity of the additivity assumption by introducing second order interaction terms g_{αβ}:

\[
\log(Y) = c + \sum_{\alpha=1}^{d} g_\alpha\{\log(X_\alpha)\} + \sum_{1 \le \alpha < \beta \le d} g_{\alpha\beta}\{\log(X_\alpha), \log(X_\beta)\}
\]

livestock data of (middle sized) Wisconsin farms, 250 observations from 1987, continued


Additive Models 8-36

Figure 81: Estimates of the interaction terms, surface plots for all ten input pairs (Family Labor versus Hired Labor, Family Labor versus Animal Inputs, Hired Labor versus Miscellaneous Inputs, Family Labor versus Miscellaneous Inputs, Family Labor versus Intermediate Run Assets, Hired Labor versus Animal Inputs, Hired Labor versus Intermediate Run Assets, Miscellaneous Inputs versus Intermediate Run Assets, Miscellaneous Inputs versus Animal Inputs, Animal Inputs versus Intermediate Run Assets); quartic kernel, h_j = 1.7 σ̂_j(X_j) and h̃ = 4h


Additive Models 8-37

Testing for Interaction

different approaches:

• looking at the interaction directly:
\[
\int g_{\alpha\beta}^2(x_\alpha, x_\beta)\, f_{\alpha\beta}(x_\alpha, x_\beta)\, dx_\alpha\, dx_\beta
\]

• looking at the mixed derivative:
\[
\int \big\{ \theta_{\alpha\beta}^{(1,1)}(x_\alpha, x_\beta) \big\}^2 f_{\alpha\beta}(x_\alpha, x_\beta)\, dx_\alpha\, dx_\beta
\]


Additive Models 8-38

Example: Production function estimation

applying a special model selection procedure, we tested stepwise for each interaction

as a result we found significance for

• direct method: g13 (family labor and miscellaneous inputs) with a p-value of about 2%; g15 and g35 came closest

• derivative method: g15 (family labor and intermediate run assets) and g35 (miscellaneous inputs and intermediate run assets)


Additive Models 8-39

Summary: Additive Models

• Additive models are of the form
\[
E(Y|X) = m(X) = c + \sum_{\alpha=1}^{d} g_\alpha(X_\alpha).
\]
In estimation they can combine flexible nonparametric modeling of many variables with the statistical precision that is typical of just one explanatory variable, i.e. they circumvent the curse of dimensionality.

• In practice there exist mainly two estimation procedures, backfitting and marginal integration. If the real model is additive, then there are many similarities in terms of what they do to the data. Otherwise their behavior and interpretation are rather different.


Additive Models 8-40

• The backfitting estimator is an iterative procedure of the kind
\[
\widehat g_\alpha^{(l)} = S_\alpha \Big\{ Y - \sum_{j \ne \alpha} \widehat g_j^{(l-1)} \Big\}, \quad l = 1, 2, 3, \ldots
\]
until some prespecified tolerance is reached. This is a successive one-dimensional regression of the partial residuals on the corresponding X_α. It generally fits the regression better than the integration estimator, but it pays for a low MSE (or MASE) in the regression with a high MSE (MASE, respectively) in the estimation of the additive component functions. Furthermore, it is rather sensitive to the bandwidth choice, and an increase in the correlation ρ of the design leads to a worse estimate. A minimal implementation sketch follows below.


Additive Models 8-41

• The marginal integration estimator is based on the idea that
\[
E_{X_{\underline\alpha}}\{ m(x_\alpha, X_{\underline\alpha}) \} = c + g_\alpha(x_\alpha).
\]
Replacing m by a pre-estimator and the expectation by averaging defines a consistent estimate (see the sketch below). This estimator suffers more from sparseness of observations than the backfitting estimator does; for example, the boundary effects are worse for the integration estimator. In the center of the support of X this estimator mostly has a lower MASE for the estimators of the additive component functions. An increasing covariance of the explanatory variables affects the MASE strongly in a negative sense. Regarding the bandwidth, this estimator seems to be quite robust.
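A corresponding marginal integration sketch under the same simplifying assumptions as before (full-dimensional Nadaraya-Watson pre-estimate, single illustrative bandwidth); since the components are centered, subtracting an estimate of c, e.g. the sample mean of Y, yields ĝ_α alone.

import numpy as np

def nw(X, y, x0, h):
    """Full-dimensional Nadaraya-Watson pre-estimate of m at the point x0."""
    w = np.exp(-0.5 * (((X - x0) / h) ** 2).sum(axis=1))
    return w @ y / w.sum()

def marginal_component(X, y, alpha, t, h):
    """Estimate c + g_alpha(t) by averaging m-hat(t, X_i,-alpha) over the sample."""
    x0 = X.copy()
    x0[:, alpha] = t
    return np.mean([nw(X, y, x0[i], h) for i in range(len(X))])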


Additive Models 8-42

• If the real model is not additive, the integration estimator estimates the marginals by integrating out the directions of no interest. In contrast, the backfitting estimator looks in the space of additive models for the best fit of the response Y on X.


Generalized Additive Models 9-1

Chapter 9: Generalized Additive Models


Generalized Additive Models 9-2

Generalized Additive Models

GAM

\[
E(Y|X) = G\Big\{ c + \sum_{j=1}^{d} f_j(X_j) \Big\}
\]

with known link function G

identification conditions for both methods:

\[
E_{X_j}\{ f_j(X_j) \} = 0, \qquad \forall\, j = 1, \ldots, d
\]


Generalized Additive Models 9-3

Estimation via Backfitting and Local Scoring

• consider maximum likelihood estimation as in the GLM, but apply backfitting to the adjusted dependent variables Z_i

• the algorithm consists of two loops:
  – inner loop: backfitting
  – outer loop: local scoring

=⇒ local scoring instead of Fisher scoring


Generalized Additive Models 9-4

Local Scoring Algorithm

initialization: \( c = \overline{Y}, \quad g_\alpha^{(0)} \equiv 0 \ \text{for } \alpha = 1, \ldots, d \)

loop over the outer iteration counter m:

\[
\eta_i^{(m)} = c^{(m)} + \sum_\alpha g_\alpha^{(m)}(x_{i\alpha}), \qquad \mu_i^{(m)} = G(\eta_i^{(m)})
\]
\[
Z_i = \eta_i^{(m)} + \big( Y_i - \mu_i^{(m)} \big) \big\{ G'(\eta_i^{(m)}) \big\}^{-1}
\]
\[
w_i = \big\{ G'(\eta_i^{(m)}) \big\}^2 \big\{ V(\mu_i^{(m)}) \big\}^{-1}
\]

obtain \(c^{(m+1)}, g_\alpha^{(m+1)}\) by applying backfitting to \(Z_i\) with regressors \(x_i\) and weights \(w_i\)

until convergence is reached
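A minimal sketch of the two loops for the logistic link G(η) = (1 + e^{−η})^{−1}, for which G′(η) = μ(1 − μ) = V(μ) and hence w_i = μ_i(1 − μ_i); the weighted kernel smoother, all names, and the bandwidth are illustrative assumptions, not the slides' implementation.

import numpy as np

def wnw_smooth(x, z, w, grid, h):
    """Weighted one-dimensional Nadaraya-Watson smoother."""
    u = (grid[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u ** 2) * w[None, :]
    return (k @ z) / k.sum(axis=1)

def local_scoring_logit(X, y, h, n_outer=10, n_inner=10):
    """Outer loop: local scoring; inner loop: weighted backfitting on Z."""
    n, d = X.shape
    p0 = y.mean()
    c = np.log(p0 / (1 - p0))           # start at the logit of the overall mean
    g = np.zeros((n, d))
    for _ in range(n_outer):
        eta = c + g.sum(axis=1)
        mu = 1.0 / (1.0 + np.exp(-eta))
        gp = np.clip(mu * (1 - mu), 1e-6, None)   # G'(eta); equals V(mu) here
        z = eta + (y - mu) / gp                   # adjusted dependent variable
        w = gp                                    # {G'}^2 / V = mu(1-mu) for logit
        c = np.average(z, weights=w)
        for _ in range(n_inner):
            for a in range(d):
                r = z - c - g.sum(axis=1) + g[:, a]
                g[:, a] = wnw_smooth(X[:, a], r, w, X[:, a], h)
                g[:, a] -= np.average(g[:, a], weights=w)
    return c, g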


Generalized Additive Models 9-5

Backfitting Algorithm

gamback

initialization: \( \overline{Z} = \frac1n \sum_{i=1}^{n} Z_i, \quad g_\alpha^{(0)} \equiv 0 \ \text{for } \alpha = 1, \ldots, d \)

repeat for \(\alpha = 1, \ldots, d\) the cycles:

\[
r_\alpha = Z - \overline{Z} - \sum_{k=1}^{\alpha-1} g_k^{(l+1)} - \sum_{k=\alpha+1}^{d} g_k^{(l)}, \qquad g_\alpha^{(l+1)}(\bullet) = S_\alpha(r_\alpha \,|\, w)
\]

until convergence is reached
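A hedged usage illustration of the local-scoring sketch above on simulated binary data; the inner cycles there are exactly the weighted backfitting displayed here. The design and all values are illustrative.

# Assumes numpy and local_scoring_logit from the sketch above.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
eta_true = -0.5 + np.sin(X[:, 0]) + 0.5 * X[:, 1]       # centered components
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-eta_true))).astype(float)
c_hat, g_hat = local_scoring_logit(X, y, h=0.5)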


Generalized Additive Models 9-6

Estimation using Marginal Integration

• without a linear part in the index (neglecting the constant c):

\[
\widehat f_\alpha(t_\alpha) = \frac1n \sum_{i=1}^{n} G^{-1}\big\{ \widehat m(t_\alpha, T_{i\underline{\alpha}}) \big\}
\]

• with a linear part in the index:

\[
E(Y|U, T) = G\{ U^\top\beta + m(T) \}, \quad\text{where}\quad m(T) = c + \sum_{j=1}^{d} g_j(T_j)
\]
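A minimal sketch of the first display for the logistic link, under the same simplifying assumptions as the earlier sketches (Nadaraya-Watson pre-estimate m̂ of E(Y|T), single illustrative bandwidth); all names are assumptions, not the slides' code.

import numpy as np

def nw(T, y, t0, h):
    """Full-dimensional Nadaraya-Watson pre-estimate of m(t0) = E(Y|T = t0)."""
    w = np.exp(-0.5 * (((T - t0) / h) ** 2).sum(axis=1))
    return w @ y / w.sum()

def gam_component(T, y, alpha, t, h):
    """f_alpha-hat(t): average G^{-1}{m-hat(t, T_i,-alpha)} over the sample."""
    t0 = T.copy()
    t0[:, alpha] = t
    p = np.clip([nw(T, y, t0[i], h) for i in range(len(T))], 1e-6, 1 - 1e-6)
    return np.mean(np.log(p / (1 - p)))   # G^{-1} is the logit for this link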


Generalized Additive Models 9-7

• estimate the GPLM:

\[
\widehat m_{\widehat\beta}(t) = \arg\max_{m_\beta(t)} \ \text{smoothed log-likelihood}, \qquad \widehat\beta = \arg\max_{\beta} \ \text{usual log-likelihood}
\]

• use the multivariate estimate \(\widehat m_{\widehat\beta}(\bullet)\) to obtain the marginal functions by marginal integration:

\[
\Longrightarrow \quad \widehat g_\alpha(t_\alpha) = \frac1n \sum_{i=1}^{n} \widehat m(t_\alpha, T_{i\underline\alpha}) - \widehat c
\]


Generalized Additive Models 9-8

Example: Migration data (for Saxony)

Y migration intention

U1 family/friend in West

U2 unemployed/job loss certain

U3 middle sized city (10,000–100,000 inhabitants)

U4 female (1 if yes)

T1 age of person (in years)

T2 household income (in DM)


Generalized Additive Models 9-9

Figure 82: Density plots for migration data (subsample from Sachsen), AGE on the left, HOUSEHOLD INCOME on the right


Generalized Additive Models 9-10

                                      Yes     No   (in %)
Y    MIGRATION INTENTION             39.6   60.4
U1   FAMILY/FRIENDS                  82.4   17.6
U2   UNEMPLOYED/JOB LOSS             18.3   81.7
U3   CITY SIZE                       26.0   74.0
U4   FEMALE                          51.6   48.4

                      Min     Max      Mean      S.D.
T1   AGE               18      65     40.37     12.69
T2   INCOME           200    4000   2136.31    738.72

Table 17: Descriptive statistics for migration data (subsample from Sachsen, n = 955)


Generalized Additive Models 9-11

                                  GLM                        GAPLM
                         Coefficients    S.E.  p-values   h = 0.75   h = 1.00
FAMILY/FRIENDS                 0.7604  0.1972    <0.001     0.7137     0.7289
UNEMPLOYED/JOB LOSS            0.1354  0.1783     0.447     0.1469     0.1308
CITY SIZE                      0.2596  0.1556     0.085     0.3134     0.2774
FEMALE                        -0.1868  0.1382     0.178    -0.1898    -0.1871
AGE (stand.)                  -0.5051  0.0728    <0.001          –          –
INCOME (stand.)                0.0936  0.0707     0.187          –          –
constant                      -1.0924  0.2003    <0.001    -1.1045    -1.1007

Table 18: Logit and GAPLM coefficients for migration data


Generalized Additive Models 9-12

Figure 83: Additive curve estimates for AGE (left, ĝ1 against T1) and INCOME (right, ĝ2 against T2) in Sachsen (upper plots with h = 0.75, lower with h = 1.0)


Generalized Additive Models 9-13

Testing the GAPLM

hypothesis
\[
H_0:\ g_1(t_1) = \gamma\, t_1 \quad \text{for all } t_1
\]

in analogy to the GPLM, the test statistic is

\[
LR = \sum_{i=1}^{n} w(T_i)\, \frac{\big[ G'\{ U_i^\top\widehat\beta + \widehat m(T_i) \} \big]^2}{V(\widehat\mu_i)}\, \big\{ \widehat g_1(T_{i1}) - E^* \widehat g_1^*(T_{i1}) \big\}^2,
\]

where \(\widehat m(T_i) = \widehat c + \widehat g_1(T_{i1}) + \ldots + \widehat g_q(T_{iq})\), \(\widehat\mu_i = G\{ U_i^\top\widehat\beta + \widehat m(T_i) \}\),

and \(E^*\widehat g_1^*\) denotes the bias-adjusted bootstrap estimate of the linear estimate for component \(T_1\)
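A hedged sketch of the bootstrap calibration of LR; `lr_stat` and `simulate_h0_stat` are placeholders for the GAPLM fitting and test routines described above, not real library calls.

import numpy as np

def bootstrap_pvalue(lr_stat, simulate_h0_stat, n_boot=499, seed=0):
    """lr_stat() returns LR on the observed data; simulate_h0_stat(rng)
    draws one (wild) bootstrap sample under H0 and returns LR* on it."""
    rng = np.random.default_rng(seed)
    lr_obs = lr_stat()
    lr_star = np.array([simulate_h0_stat(rng) for _ in range(n_boot)])
    # reject H0 at level alpha if the p-value falls below alpha
    return (1 + np.sum(lr_star >= lr_obs)) / (n_boot + 1)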


Generalized Additive Models 9-14

Example: Migration data (for Saxony)

• as a test statistic we compute LR and derive its critical values from the bootstrap test statistics LR*

• the bootstrap sample size is set to n_boot = 499 replications

• results for AGE: linearity is always rejected at the 1% level

• results for INCOME: linearity is rejected at the 2% level for h = 0.75 and at the 1% level for h = 1.0


Generalized Additive Models 9-15

Example: Unemployed after apprenticeship?

subsample of n = 462 from the first 9 waves of the GSOEP, including all individuals who have completed an apprenticeship in the years between 1985 and 1992

Y    unemployed after apprenticeship (1 if yes)

U1   female (1 if yes)

U2   years of school education

U3   firm size (1 if large firm)

T1   age of the person

T2   earnings as an apprentice (in DM)

T3   city size (in 100,000 inhabitants)

T4   percentage of people apprenticed in a certain occupation, divided by the percentage of people employed in this occupation in the entire economy

T5   unemployment rate in the county the apprentice is living in


Generalized Additive Models 9-16

                  GLM (logit)            GAPLM
                Coefficients     S.E.   Coefficients
FEMALE              -0.3651   0.3894        -0.3962
AGE                  0.0311   0.1144              –
SCHOOL               0.0063   0.1744         0.0452
EARNINGS            -0.0009   0.0010              –
CITY SIZE           -5.e-07   4.e-07              –
FIRM SIZE           -0.0120   0.4686        -0.1683
DEGREE              -0.0017   0.0021              –
URATE                0.2383   0.0656              –
constant            -3.9849   2.2517        -2.8949

Table 19: Logit and GAPLM coefficients for unemployment data


Generalized Additive Models 9-17

Figure 84: Density plots (left column) and additive component estimates (right column, logit(Unemployment) against the regressor) for some of the explanatory variables: AGE (T1), CITY SIZE (T3), UNEMPLOYMENT RATE (T5), EARNINGS (T2), and DEGREE (T4)


Generalized Additive Models 9-18

• the test results show that the linearity hypothesis cannot be rejected for all variables, at significance levels from 1% to 20%

• the parametric logit coefficients for all variables (except the constant and URATE) are already insignificant

• it seems that the data can be explained by neither the parametric logit model nor the semiparametric GAPLM


Generalized Additive Models 9-19

Summary: Generalized Additive Models

• The nonparametric additive components in all these extensions of simple additive models can be estimated with the rate that is typical for one-dimensional smoothing.

• An additive partial linear model (APLM) is of the form
\[
E(Y|X, T) = X^\top\beta + c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha)\,.
\]
Here, β and c can be estimated with the parametric rate √n. While for the marginal integration estimator it is necessary to undersmooth in the suggested procedure, it is still not clear for the backfitting what to do when d > 1.


Generalized Additive Models 9-20

• A generalized additive model (GAM) has the form
\[
E(Y|T) = G\Big\{ c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha) \Big\}
\]
with a (known) link function G. To estimate this with backfitting we combine local scoring with the Gauss-Seidel algorithm; theory is lacking here. Using marginal integration we get a closed formula for the estimator, for which asymptotic theory can also be derived.

• The generalized additive partial linear model (GAPLM) is of the form
\[
E(Y|X, T) = G\Big\{ X^\top\beta + c + \sum_{\alpha=1}^{d} g_\alpha(T_\alpha) \Big\}
\]
with a (known) link function G. In the parametric part, β and


Generalized Additive Models 9-21

c can be estimated again with the √n-rate. For the backfitting we combine the estimation in the APLM with the local scoring algorithm. But again the case d > 1 is not clear and no theory has been provided. For the marginal integration approach we combine the quasi-likelihood procedure with marginal integration afterwards.

• In all considered models, we have only developed theory for the marginal integration so far. Interpretation, advantages and drawbacks stay the same as mentioned in the context of additive models (AM).

• We can perform test procedures on the additive components separately. Due to the complex structure of the estimation procedures we have to apply (wild) bootstrap methods.
