
Lecture 13: Regression Problems

Pratheepa Jeganathan

05/01/2019

Recall

- One-sample sign test, Wilcoxon signed rank test, large-sample approximation, median, Hodges-Lehmann estimator, distribution-free confidence interval.
- Jackknife for bias and standard error of an estimator.
- Bootstrap samples, bootstrap replicates.
- Bootstrap standard error of an estimator.
- Bootstrap percentile confidence interval.
- Hypothesis testing with the bootstrap (one-sample problem).
- Assessing the error in bootstrap estimates.
- Example: inference on the ratio of heart attack rates in the aspirin-intake group to the placebo group.
- The exhaustive bootstrap distribution.
- Discrete data problems (one-sample and two-sample proportion tests, test of homogeneity, test of independence).
- Two-sample problems (location problem with equal or unequal variance, exact test or Monte Carlo, large-sample approximation, H-L estimator, dispersion problem, general distribution).
- Permutation tests (permutation test for continuous data, different test statistics, accuracy of permutation tests).
- Permutation tests (discrete data problems, exchangeability).

The independence problem

Introduction

- Correlation: measures the degree to which two variables are related.
- Regression: models the stochastic relationship between a response variable and one or more predictor variables.
- Regression relationships: simple linear regression, multiple linear regression, nonlinear regression.

Correlation

- Consider random pairs (X, Y). The strength of the relationship, or association, between X and Y is our main interest.
- If X and Y are discrete, we can use the odds ratio to measure the association and the χ² goodness-of-fit test to test for association.
- If X and Y are independent, P(X = x, Y = y) = P(X = x) P(Y = y) for all levels of X and Y.
- If X and Y are continuous, from a random sample (X1, Y1), ..., (Xn, Yn) we can use the Pearson correlation coefficient or the nonparametric Kendall or Spearman statistics to measure the strength of the association.

Pearson’s correlation coefficient

- Let X and Y be continuous random variables with means μX, μY and standard deviations σX, σY.
- Pearson's correlation coefficient is

ρ = E[(X − μX)(Y − μY)] / (σX σY) = [E(XY) − E(X) E(Y)] / (σX σY).

- If X and Y are independent, E(XY) = E(X) E(Y). Thus ρ = 0; the converse is not true.
- If X and Y are bivariate normal, the converse is also true.
- If ρ ≠ 0, then X and Y are dependent.
- The Pearson correlation coefficient measures the linear association between X and Y.

Estimate Pearson’s correlation coefficient

- Sample Pearson's correlation coefficient:

ρ̂ = r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √[ Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² ].

- The slope in simple linear regression is related to the sample Pearson correlation coefficient:

β̂ = r (σ̂Y / σ̂X),

where σ̂X and σ̂Y are the sample standard deviations of X and Y, respectively, and β̂ is the least squares estimate of the slope in a simple regression of Y on X.
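This identity is easy to verify numerically; a minimal sketch in R (the data vectors here are made up purely for illustration):

```r
# Hypothetical data, for illustration only
x = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7)
y = c(2.0, 2.9, 3.8, 5.1, 4.9, 7.2)

r = cor(x, y)                          # sample Pearson correlation
beta.hat = r * sd(y) / sd(x)           # slope implied by the identity above
beta.lm = unname(coef(lm(y ~ x))[2])   # least squares slope

all.equal(beta.lm, beta.hat)           # the two computations agree exactly
```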

Pearson’s correlation coefficient

- If X and Y have a bivariate normal distribution, we can test Pearson's correlation coefficient using Student's t-distribution.
- Testing using the permutation method (assume (Xi, Y_Π(i)) is exchangeable):
  - Under the null hypothesis of independence, form the pairs (Xi, Y_Π(i)), where Π(i) is any permutation of 1, ..., n.
- Construct a confidence interval using the bootstrap method:
  - Use the nonparametric bootstrap: sample the pairs (Xi, Yi) with replacement.
- Notes:
  - If the range of the distribution is bounded, ρ is always defined.
  - ρ is not defined for the Cauchy distribution (it has undefined variance).
  - Caution is needed for heavy-tailed distributions.
  - ρ is highly sensitive to outliers and to distributional assumptions.
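The permutation test above takes only a few lines of R; a sketch using the canned tuna data of Table 8.1 (introduced later in this lecture), with Monte Carlo sampling over random permutations rather than all n! of them:

```r
set.seed(1)
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)

r.obs = cor(x, y)
B = 2000
# Under H0, any pairing (X_i, Y_Pi(i)) is equally likely, so permute the Y's
r.perm = replicate(B, cor(x, sample(y)))
p.value = mean(abs(r.perm) >= abs(r.obs))  # two-sided Monte Carlo p-value
```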

The independence problem (tests based on signs - Kendall)

- Let (Xi, Yi), i = 1, ..., n be IID bivariate observations from a joint distribution F_{X,Y}(x, y).
- Testing independence:
  - H0: F_{X,Y}(x, y) = F_X(x) F_Y(y) for all pairs (x, y) versus HA: X and Y are dependent.
- Kendall's population correlation coefficient τ:

τ = 2 P[(Y2 − Y1)(X2 − X1) > 0] − 1.

- τ measures the monotonicity between X and Y.
- If X and Y are independent, τ = 0; the converse is not true.
- If τ ≠ 0, X and Y are dependent.

The independence problem (tests based on signs - Kendall)

- P[(Y2 − Y1)(X2 − X1) > 0] = P(X2 > X1, Y2 > Y1) + P(X2 < X1, Y2 < Y1).
- Under H0, P(X2 > X1, Y2 > Y1) = (1/2)(1/2) = 1/4, and likewise P(X2 < X1, Y2 < Y1) = 1/4.
- Thus, under H0, τ = 2(1/4 + 1/4) − 1 = 0.

The independence problem (tests based on signs - Kendall)

- H0: τ = 0 versus HA: τ ≠ 0, or HA: τ > 0, or HA: τ < 0.
- Significance level α.
- Test statistic: K̄ = K / (n(n − 1)/2), where

K = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Q((Xi, Yi), (Xj, Yj)),

with

Q((Xi, Yi), (Xj, Yj)) = 1 if (Yj − Yi)(Xj − Xi) > 0, and −1 if (Yj − Yi)(Xj − Xi) < 0.

- A pair with (Yj − Yi)(Xj − Xi) > 0 is concordant.
- A pair with (Yj − Yi)(Xj − Xi) < 0 is discordant.
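K can be computed directly from the definition of Q with a double loop over pairs; a sketch, using the canned tuna data from the example that follows:

```r
# Direct computation of Kendall's K from the Q function
kendall.K = function(x, y) {
  n = length(x)
  K = 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # sign() gives Q: +1 concordant, -1 discordant, 0 for ties
      K = K + sign((y[j] - y[i]) * (x[j] - x[i]))
    }
  }
  K
}

x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)  # Table 8.1
y = c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
K = kendall.K(x, y)                              # 16 for these data
K.bar = K / (length(x) * (length(x) - 1) / 2)    # equals tau.hat = 0.444
```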

The independence problem (tests based on signs - Kendall)

cor.test(x, y,
         alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95,
         continuity = FALSE, ...)

The independence problem (tests based on signs - Kendall)

- The large-sample approximation:

K* = K / [n(n − 1)(2n + 5)/18]^{1/2} ~ N(0, 1).

- Ties: if there are tied X values and/or tied Y values, assign zero to Q.
- The resulting test is approximate.

Example (tests based on signs - Kendall)

- Hunter L measure of lightness X, along with panel scores Y, for nine lots of canned tuna (n = 9).
- It is suspected that the Hunter L value is positively associated with the panel score.

Table8.1 = data.frame(x = c(44.4, 45.9, 41.9, 53.3, 44.7,
                            44.1, 50.7, 45.2, 60.1),
                      y = c(2.6, 3.1, 2.5, 5.0, 3.6,
                            4.0, 5.2, 2.8, 3.8))

Example (tests based on signs - Kendall)

cor.test(x = Table8.1$x, y = Table8.1$y,
         method = "kendall", alternative = "greater")

## 
##  Kendall's rank correlation tau
## 
## data:  Table8.1$x and Table8.1$y
## T = 26, p-value = 0.05972
## alternative hypothesis: true tau is greater than 0
## sample estimates:
##       tau 
## 0.4444444

- T is the sum of the positive Q's (the number of concordant pairs).
- K = 2T − n(n − 1)/2. Now we can use this for the large-sample approximation.
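Carrying the large-sample approximation out for this example (a sketch; compare the approximate p-value with the exact p-value 0.05972 above):

```r
n = 9
T = 26                       # from the cor.test output
K = 2*T - n*(n - 1)/2        # K = 16
K.star = K / sqrt(n * (n - 1) * (2*n + 5) / 18)   # standardized statistic
p.approx = pnorm(K.star, lower.tail = FALSE)      # one-sided upper-tail p-value
```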

Kendall’ s sample rank correlation coefficient

- τ̂ = 2K / (n(n − 1)).

- For the example:

T = 26
n = length(Table8.1$x)
K = 2*T - n*(n-1)/2; K

## [1] 16

tau.hat = 2*K/(n*(n-1)); tau.hat

## [1] 0.4444444

cor(Table8.1$x, Table8.1$y, method = "kendall")

## [1] 0.4444444

Bootstrap confidence interval (Kendall's correlation coefficient)

- Sample the pairs (Xi, Yi) with replacement to obtain a bootstrap sample (Xi*, Yi*).
- Compute the bootstrap replicate value τ̂*.
  - It is necessary to use Q = 0 for ties.
- From the bootstrap replicates τ̂*1, τ̂*2, ..., τ̂*B, construct a (1 − α)100% confidence interval for τ.
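These steps can also be coded by hand as a percentile interval; a sketch with B = 1000 (note that R's cor() handles ties with its own convention, which may differ slightly from the Q = 0 rule):

```r
set.seed(2)
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)

n = length(x); B = 1000
tau.star = replicate(B, {
  idx = sample(n, replace = TRUE)             # resample the pairs (X_i, Y_i)
  cor(x[idx], y[idx], method = "kendall")     # bootstrap replicate of tau.hat
})
ci = unname(quantile(tau.star, c(0.025, 0.975)))  # percentile 95% CI
```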

Bootstrap confidence interval (Kendall's correlation coefficient)

library(NSM3)
kendall.ci(Table8.1$x, Table8.1$y, alpha = .05,
           type = "t", bootstrap = T, B = 1000)

[Figure: histogram of the bootstrap replicates tau.hat, with frequency on the vertical axis and tau.hat ranging from about -0.5 to 1.0]

## 
## 1 - alpha = 0.95 two-sided CI for tau:
## -0.152, 0.935

The independence problem (tests based on ranks - Spearman)

- Spearman rank correlation coefficient ρs.
- ρs measures monotonic relationships (whether linear or not).
- Spearman's sample rank correlation coefficient is rs.
- Rank the Xi's, denoting the ranks by Ri, and rank the Yi's, denoting the ranks by Si.
- rs is the Pearson product-moment sample correlation of the Ri and Si:

rs = 1 − 6 Σ_{i=1}^{n} Di² / [n(n² − 1)],

where Di = Si − Ri, i = 1, ..., n.
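The formula can be checked directly from the ranks, again with the Table 8.1 data (a sketch; there are no ties in these data, so the D-formula agrees exactly with cor()):

```r
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)

R = rank(x)                # ranks R_i of the X's
S = rank(y)                # ranks S_i of the Y's
D = S - R
n = length(x)
rs = 1 - 6 * sum(D^2) / (n * (n^2 - 1))   # 0.6, matching cor.test's rho
```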

Example (tests based on ranks - Spearman)

cor.test(x = Table8.1$x, y = Table8.1$y,
         method = "spearman", alternative = "greater")

## 
##  Spearman's rank correlation rho
## 
## data:  Table8.1$x and Table8.1$y
## S = 48, p-value = 0.0484
## alternative hypothesis: true rho is greater than 0
## sample estimates:
## rho 
## 0.6

Example (tests based on ranks - Spearman)

cor(x = Table8.1$x, y = Table8.1$y,
    method = "spearman")

## [1] 0.6

Rank-based regression analysis

Simple linear regression

- Linear regression formulation of the two-sample problem:
  - Combine Xi, i = 1, ..., m and Yj, j = 1, ..., n.
  - Z = (X1, ..., Xm, Y1, ..., Yn)^T, N = m + n.
  - Let g = (1, ..., 1, 0, ..., 0)^T, with 1's in the first m positions and 0's in the rest.
  - Two-sample problem as a linear model: Zi = β0 + Δ gi + εi, i = 1, ..., N, where ε1, ..., εN ~ F(·).
  - Estimate Δ and test for Δ.
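A quick numerical check of this dummy-variable formulation (using least squares rather than ranks, and made-up samples): regressing Z on g recovers Δ̂ as the difference of sample means.

```r
X = c(5.1, 6.2, 4.8, 5.9)                    # hypothetical sample 1 (m = 4)
Y = c(3.9, 4.4, 4.1)                         # hypothetical sample 2 (n = 3)
Z = c(X, Y)
g = c(rep(1, length(X)), rep(0, length(Y)))  # 1's mark the X positions

fit = lm(Z ~ g)
Delta.hat = unname(coef(fit)["g"])           # equals mean(X) - mean(Y)
```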

Rank-based linear regression

[Figure: dist versus speed with least squares (LS) and rank-based (R) fitted lines]

Test for slope (based on signs)

- Simple linear model: Yi = α + β Xi + εi.
  - α - intercept.
  - β - slope.
  - ε1, ..., εn ~ F(·) with median 0.
- β measures the expected increase (or decrease, depending on its sign) in the dependent (response) variable Y for every unit increase in the independent (predictor) variable X.

Test for slope (based on signs - Theil (1950))

- H0: β = β0.
- Test statistic: C = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} sign(Dj − Di), where Di = Yi − β0 xi.
- Motivation for the test statistic:
  - Dj − Di = Yj − β0 xj − Yi + β0 xi = Yj − Yi + β0 (xi − xj).
  - The median of Yj − Yi is β (xj − xi).
  - Thus the median of Dj − Di is β (xj − xi) + β0 (xi − xj) = (β − β0)(xj − xi), which is 0 under H0.
  - When β > β0, Dj − Di tends to be positive, which leads to larger values of C.

- C is Kendall's correlation statistic, and can be interpreted as a test for correlation between X and Y.
- The slope estimator associated with the Theil statistic is β̂ = median{Sij; 1 ≤ i < j ≤ n}, where

Sij = (Yj − Yi) / (xj − xi), 1 ≤ i < j ≤ n.

Rank-based intercept estimator

- Define Ai = Yi − β̂ xi, i = 1, ..., n.
- A point estimator for α is

α̂ = median{A1, ..., An}.
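Both estimators are easy to compute by hand; a sketch using the cloud seeding data of the next example, which reproduces the beta.hat = -0.056 and alpha.hat = 1.316 reported by theil() up to rounding:

```r
x = c(1, 2, 3, 4, 5)                  # Table 9.1: years seeded
y = c(1.26, 1.27, 1.12, 1.16, 1.03)   # double ratios

n = length(x)
S = numeric(0)
for (i in 1:(n - 1))
  for (j in (i + 1):n)
    S = c(S, (y[j] - y[i]) / (x[j] - x[i]))  # pairwise slopes S_ij

beta.hat = median(S)                  # Theil slope estimate, -0.05625
alpha.hat = median(y - beta.hat * x)  # rank-based intercept, 1.31625
```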

Example (Testing slope)

- Effect of cloud seeding on rainfall.
- Smith (1967) described an experiment in Australia on cloud seeding.
- Goal: investigate the effects of a particular method of cloud seeding on the amount of rainfall.
- Data:
  - Two mountain areas served as target and control.
  - The effect of seeding was measured by the double ratio: [T/Q (seeded)] / [T/Q (unseeded)].
- The slope parameter β represents the rate of change in Y per unit change in x.
- Test H0: β = 0 versus HA: β < 0.

Example (Testing slope)

Table9.1 = data.frame(x.years.seeded = c(1, 2, 3, 4, 5),
                      Y.double.ratio = c(1.26, 1.27, 1.12, 1.16, 1.03))

theil.fit = theil(Table9.1$x.years.seeded,
                  Table9.1$Y.double.ratio,
                  beta.0 = 0, slopes = TRUE,
                  type = "l", doplot = FALSE)

Example (Testing slope)

theil.fit

## Alternative: beta less than 0
## C = -6, C.bar = -0.6, P = 0.117
## beta.hat = -0.056
## alpha.hat = 1.316
## 
## All slopes:
##  i j        S.ij
##  1 2  0.01000000
##  1 3 -0.07000000
##  1 4 -0.03333333
##  1 5 -0.05750000
##  2 3 -0.15000000
##  2 4 -0.05500000
##  2 5 -0.08000000
##  3 4  0.04000000
##  3 5 -0.04500000
##  4 5 -0.13000000
## 
## 1 - alpha = 0.95 lower bound for beta:
## -0.13, Inf

Example (Testing slope)

[Figure: Y.double.ratio versus x.years.seeded with the Theil fitted line]

Example (confidence interval for slope)

theil.output = theil(Table9.1$x.years.seeded,
                     Table9.1$Y.double.ratio,
                     beta.0 = 0, slopes = TRUE,
                     type = "t", doplot = FALSE, alpha = .05)

c(theil.output$L, theil.output$U)

## [1] -0.15 0.04

General multiple linear regression

- Interest is in the regression relationship between several (p) independent (predictor) variables and one response variable.
- Yi = ζ + β1 x1i + β2 x2i + ... + βp xpi + ei, i = 1, ..., n.
- Let β_q = [β1, ..., βq]^T and β_{p−q} = [βq+1, ..., βp]^T.
- H0: β_q = 0 versus HA: β_q ≠ 0.
- Read HWC Chapter 9.5.
- Use the rfit() command in R.
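A minimal rfit() sketch (rfit() comes from the Rfit package; the cars data and the single-predictor formula are just for illustration, not from HWC):

```r
library(Rfit)   # rank-based estimation and inference (Kloke and McKean)

# Rank-based fit; the same formula interface extends to several predictors
fit = rfit(dist ~ speed, data = cars)
summary(fit)    # coefficient table with rank-based standard errors
```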

The geometry of rank-based linear models

Overview

- Reference: Hettmansperger and McKean (2010, Chapter 3) and HWC Chapter 9, page 484, comments 24, 25, and 26.
- Analysis (estimation, testing, diagnostics, outlier detection, detection of influential cases) can be based on either signs or ranks.
- The error distribution can be either asymmetric (use signs) or symmetric (use ranks).

The geometry of rank-based linear models

- The model: Yi = α + xi^T β + εi, i = 1, ..., n.
  - The location parameter of the distribution of εi is zero.
  - β - p × 1 vector of unknown parameters of interest.
  - α - intercept.
- The model in matrix form: Y = 1α + Xβ + ε.
  - X has full column rank p.
  - Let ΩF denote the column space spanned by the columns of X.
  - Y = 1α + η + ε, where η ∈ ΩF.
  - This is a coordinate-free model.
- Goals:
  - Estimate η.
  - Test a general linear hypothesis H0: Mβ = 0 versus HA: Mβ ≠ 0, where M is a q × p matrix of full row rank.

The geometry of rank-based linear models: estimation

- Estimate η by minimizing the distance between Y and the subspace ΩF.
- Define distance in terms of norms or pseudo-norms:

‖v‖φ = Σ_{i=1}^{n} a(R(vi)) vi,

where a(1) ≤ a(2) ≤ ... ≤ a(n) is a set of scores generated as a(i) = φ(i/(n + 1)) for a nondecreasing score function φ defined on u ∈ (0, 1).
- The rank estimate of η is a vector Ŷφ such that

Dφ(Y, ΩF) = ‖Y − Ŷφ‖φ = min_{η ∈ ΩF} ‖Y − η‖φ.

- β̂φ = (X^T X)^{−1} X^T Ŷφ and α̂ = median{Yi − xi^T β̂φ}.
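For intuition, the minimization can be sketched naively with Wilcoxon scores φ(u) = √12 (u − 1/2) and a one-dimensional optimizer (an illustration on the cars data only; this is not how production code such as Rfit solves the problem):

```r
# Jaeckel-type dispersion for a single-predictor model, Wilcoxon scores.
# The scores sum to (essentially) zero, so the intercept drops out of the
# dispersion and is estimated afterwards as a median of the residuals.
dispersion = function(beta, x, y) {
  e = y - beta * x
  a = sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)  # a(i) = phi(i/(n+1))
  sum(a * e)
}

x = cars$speed; y = cars$dist
beta.hat = optimize(dispersion, interval = c(-100, 100),
                    x = x, y = y)$minimum    # dispersion is convex in beta
alpha.hat = median(y - beta.hat * x)         # rank-based intercept estimate
```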

The geometry of rank-based linear models: estimation

Figure 1: Geometry of Estimation

The geometry of rank-based linear models: testing

Figure 2: Geometry of Testing

- ΩF is the column space of the full-model design matrix X.
- ΩR is the reduced-model subspace, ΩR ⊂ ΩF:

ΩR = {η ∈ ΩF : η = Xβ for some β such that Mβ = 0}.

- Ŷφ,ΩR is the estimate of η when the reduced model is fit.
- Dφ(Y, ΩR) = ‖Y − Ŷφ,ΩR‖φ denotes the distance between Y and ΩR.

The geometry of rank-based linear models: testing

- RDφ = Dφ(Y, ΩR) − Dφ(Y, ΩF) is the reduction in residual dispersion when we pass from the reduced model to the full model.
- Large values of RDφ indicate HA.

References for this lecture

HWC Chapter 8.

HWC Chapter 9.1-9.4, 9.6.

Hettmansperger, Thomas P., and Joseph W. McKean. 2010. Robust Nonparametric Statistical Methods. CRC Press.