Statistical Data Analysis 2010/2011 M. de Gunst Lecture 10

Upload: virginia-bell

Post on 06-Jan-2018





Statistical Data Analysis: Introduction

Topics
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression (continued)


Multiple linear regression (Reader: Chapter 8)

Relationship between one response variable and one or more explanatory variables

Last time:
- Statistical model
- Parameter estimation
- Selection of explanatory variables (determination coefficient, F- and t-tests)
- Model quality: global methods/diagnostics (plots)

This week: further investigation of model quality
- deviating observation points: outlier, leverage point/potential, influence point
  plots, numerical measures and tests: test for outliers, hat matrix, Cook's distance
- explanatory variables that are themselves linearly related (collinearity)
  plots, numerical measures: variance inflation factors, condition indices, variance decomposition


Statistical model

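Multiple linear regression model:

$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i, \quad i = 1, \dots, n,$

with $e_1, \dots, e_n$ independent and normally distributed, $e_i \sim N(0, \sigma^2)$; in matrix notation $Y = X\beta + e$, where $X$ is the $n \times (p+1)$ design matrix.

Issues:
1) estimate $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality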
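As an illustration of issue 1), the following is a minimal sketch of least squares estimation in the multiple linear regression model. The toy data, the variable names and the use of NumPy are assumptions made for this example; they do not come from the lecture or the Reader.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, p = 2 explanatory variables.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 12.0])

n, p = x.shape
X = np.column_stack([np.ones(n), x])      # design matrix with intercept column

# Least squares estimate of beta (lstsq avoids forming an explicit inverse).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)   # usual estimate of sigma^2
```

The residuals are orthogonal to every column of the design matrix, which is a quick sanity check on any fit of this kind.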


3) Assessment of model quality – deviating points

Consider observation point $(y_i, x_{i1}, \dots, x_{ip})$

Types of deviating observation points:
- deviating response: outlier
- deviating explanatory variable: potential or leverage point
- if the point has influence: influence point

How to detect:
- outlier: test for outliers
- leverage point: hat matrix
- influence point: Cook's distance


Example outlier

Forbes' data: boiling temperature for different pressures

A small deviation in the response may have a large effect.

Generally easy to detect in plots.


3) Assessment of model quality – outliers

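Outlier: deviating response
How to detect? Make plots. Which ones?

If possible outliers are detected, do a formal test.
Idea: if the k-th point is an outlier, then it fits the regression model up to a shift $\delta$,

i.e. it fits the mean shift outlier model

$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \delta u_i + e_i, \quad i = 1, \dots, n,$

for sufficiently large $|\delta|$, or in matrix notation $Y = X\beta + \delta u + e$, with $u = (u_1, \dots, u_n)^T$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$.

When is the k-th point an outlier in terms of $\delta$?

How to test?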


3) Assessment of model quality – outliers

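Outlier: deviating response

If the k-th point is an outlier, then it fits the mean shift outlier model $Y = X\beta + \delta u + e$ for sufficiently large $|\delta|$, with $u = (u_1, \dots, u_n)^T$ s.t. $u_k = 1$ and $u_i = 0$ for $i \neq k$.

When is the k-th point an outlier in terms of $\delta$?

If $\delta$ is significantly different from 0, then the k-th point is an outlier.

Test for outlier:
H0: $\delta = 0$, $\beta$ arbitrary
H1: $\delta \neq 0$, $\beta$ arbitrary (note: in the Reader the test is one-sided)

Test statistic: $T = \hat{\delta} / \widehat{\mathrm{se}}(\hat{\delta}) \sim t_{n-p-2}$ under H0 (for a model with $p$ explanatory variables and an intercept), where $\hat{\delta}$ is the least squares estimate of $\delta$ in the mean shift outlier model.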
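The test can be sketched numerically by adding the indicator of the k-th observation as an extra column to the design matrix and computing the t statistic of its coefficient. This is a minimal sketch on simulated data; the function name `mean_shift_t`, the planted shift and the choice of NumPy are assumptions for illustration only.

```python
import numpy as np

def mean_shift_t(X, y, k):
    """t statistic for H0: delta = 0 when a shift delta is allowed at observation k."""
    n = X.shape[0]
    u = np.zeros(n)
    u[k] = 1.0                                  # indicator of the k-th observation
    Xa = np.column_stack([X, u])                # augmented design matrix [X, u]
    coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ coef
    df = n - Xa.shape[1]                        # n - p - 2 if X has p + 1 columns
    s2 = resid @ resid / df
    cov = s2 * np.linalg.inv(Xa.T @ Xa)         # estimated covariance of the estimates
    return coef[-1] / np.sqrt(cov[-1, -1]), df

# Simulated data (hypothetical): a straight line with one shifted response.
rng = np.random.default_rng(0)
n = 20
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[12] += 5.0                                    # planted shift at observation 12

t_stat, df = mean_shift_t(X, y, 12)
```

The resulting `t_stat` would be compared with a quantile of the t distribution with `df` degrees of freedom. It coincides with the externally studentized residual of the k-th point, so the model does not have to be refitted for every candidate point in practice.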


Example leverage point

Huber's data:

A small deviation in an explanatory variable may have a large effect.

Often difficult to detect in plots: the point lies on the edge of the range of values, and its residual is often not large.


3) Assessment of model quality – leverage points

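Potential or leverage point: deviating explanatory variable
How to detect?

With the hat matrix $H = X(X^T X)^{-1} X^T$,

which stems from $\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY$.

Properties of $H$: $0 \le h_{ii} \le 1$ and $h_{ii} = \sum_j h_{ij}^2$, so if $h_{ii}$ is large then the other $h_{ij}$ are small.

We see $\hat{Y}_i = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j$ and $\mathrm{Var}(Y_i - \hat{Y}_i) = \sigma^2 (1 - h_{ii})$.

Hence, if $h_{ii}$ is large, then the i-th point has potential influence.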
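A minimal sketch of how the diagonal of the hat matrix flags a leverage point, assuming the usual definition $H = X(X^T X)^{-1}X^T$. The simulated data and the planted point at x = 20 are hypothetical and only serve as an illustration.

```python
import numpy as np

# Simulated explanatory variable (hypothetical): 14 values in [0, 5], one at 20.
rng = np.random.default_rng(1)
n = 15
x = rng.uniform(0.0, 5.0, size=n)
x[-1] = 20.0                                   # point far from the other x values
X = np.column_stack([np.ones(n), x])           # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii

most_leveraged = int(np.argmax(h))             # index of the largest h_ii
```

Note that the response values play no role here: leverage depends on the explanatory variables only, which is why such a point can go unnoticed in a residual plot.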


3) Assessment of model quality – influence points

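Influence point: a point that has influence on the fitted model

How to detect?

Check whether the point is an outlier or a leverage point. If yes, then fit the model with and without this point.

If the results are very different, the point is an influence point.

Measure based on the difference between the estimated betas:

Cook's distance for the i-th point:

$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T X^T X (\hat{\beta}_{(i)} - \hat{\beta})}{(p+1)\,\hat{\sigma}^2}$,

where $\hat{\beta}_{(i)}$ is the parameter estimate without the i-th point.

If $D_i$ is larger than 1 (roughly), then the i-th point is an influence point.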


3) Assessment of model quality – influence points

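Measure of influence based on the difference between the estimated betas:

Cook's distance for the i-th point:

$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T X^T X (\hat{\beta}_{(i)} - \hat{\beta})}{(p+1)\,\hat{\sigma}^2}$,

where $\hat{\beta}_{(i)}$ is the parameter estimate without the i-th point.

If $D_i$ is larger than 1 (roughly), then the i-th point is an influence point.

Explanation: the set

$\left\{ b : \frac{(b - \hat{\beta})^T X^T X (b - \hat{\beta})}{(p+1)\,\hat{\sigma}^2} \le F_{p+1,\,n-p-1;\,1-\alpha} \right\}$

is a confidence region with confidence $1 - \alpha$ for the parameter vector $\beta$, where $F_{p+1,\,n-p-1;\,1-\alpha}$ is the $(1-\alpha)$-quantile of the F distribution. Thus the left-hand side of the inequality defines a measure of distance from $b$ to $\hat{\beta}$.

For choices of $\alpha$ around 0.5, the values of $b$ outside this set lie "far away" from $\hat{\beta}$.
For choices of $\alpha$ around 0.5, the boundary of the set, $F_{p+1,\,n-p-1;\,1-\alpha}$, has a value around 1.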
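The "fit with and without the point" idea can be sketched directly. The example below assumes the usual form of Cook's distance, with $(p+1)\hat{\sigma}^2$ in the denominator, and uses simulated data; the function name `cooks_distances` and the planted point are hypothetical.

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's distance for each observation, computed by refitting without it."""
    n, q = X.shape                               # q = p + 1 columns incl. intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - q)                 # estimate of sigma^2 from the full fit
    XtX = X.T @ X
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # leave out the i-th point
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d = beta_i - beta
        D[i] = d @ XtX @ d / (q * s2)
    return D

# Simulated data (hypothetical): a straight line plus one deviating leverage point.
rng = np.random.default_rng(2)
n = 15
x = rng.uniform(0.0, 5.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 20.0, 10.0                        # far out in x and far off the line
X = np.column_stack([np.ones(n), x])

D = cooks_distances(X, y)
influential = np.flatnonzero(D > 1.0)            # rough rule of thumb: D_i > 1
```

Refitting n times is done here only to mirror the definition. The same values follow from a single fit through the residuals $r_i$ and leverages $h_{ii}$, as $D_i = r_i^2 h_{ii} / ((p+1)\hat{\sigma}^2 (1-h_{ii})^2)$.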


Example influence points

Cook’s distances for different data sets:


3) Assessment of model quality – collinearity

Explanatory variables that are themselves linearly related (collinearity): numerical measures such as variance inflation factors, condition indices, variance decomposition

When is it a problem?
If the variance of one or more estimators $\hat{\beta}_j$ is large, then the estimate(s) are not reliable.

How to detect?
- known methods: scatter plots, correlation coefficients (between pairs of variables), determination coefficient $R_j^2$ of $X_j$ on the others (= squared multiple linear correlation coefficient between $X_j$ and the others)
- plus several new numerical measures


3) Assessment of model quality – collinearity

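The columns $X_j$ of a matrix $X$ are exactly collinear if $\sum_j c_j X_j = 0$ for some constants $c_j$ not all equal to 0.

If there are one or more collinearities in a (general) matrix $X$, then rank($X$) is not maximal and $(X^T X)^{-1}$ does not exist.

With approximate collinearities, $(X^T X)^{-1}$ is difficult to compute.

In the design matrix $X$, one or more (approximate) collinearities can exist between its columns.

In that case $\hat{\beta} = (X^T X)^{-1} X^T Y$ is difficult to compute and/or one or more $\mathrm{Var}(\hat{\beta}_j)$ may be large.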
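The difference between exact and approximate collinearity can be made concrete with a small numerical sketch. The simulated columns and the noise level below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact collinearity: the last column equals x1 + 2 * x2.
X_exact = np.column_stack([np.ones(n), x1, x2, x1 + 2.0 * x2])

# Approximate collinearity: the same column with a little noise added.
X_approx = X_exact.copy()
X_approx[:, 3] += rng.normal(scale=1e-4, size=n)

rank_exact = np.linalg.matrix_rank(X_exact)      # numerical rank, below 4 here
rank_approx = np.linalg.matrix_rank(X_approx)    # full rank, but only just
cond_approx = np.linalg.cond(X_approx.T @ X_approx)   # very large: nearly singular
```

In the approximate case the inverse of $X^T X$ formally exists, but the huge condition number signals that computing it is numerically delicate and that some coefficient variances will be large.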


3) Assessment of model quality – collinearity

How to detect collinearity

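Scatter plots, correlation coefficients (between pairs of variables), determination coefficient $R_j^2$ of $X_j$ on all others (= squared multiple linear correlation coefficient between $X_j$ and all others)

4 new numerical measures:

i) variance inflation factors $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$

because $\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2}$, so $\mathrm{VIF}_j$ is the factor by which the variance of $\hat{\beta}_j$ increases due to the relationship between $X_j$ and all others.

If $\mathrm{VIF}_j$ is large, then the estimate $\hat{\beta}_j$ is unreliable.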
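A minimal sketch of the variance inflation factors, assuming the usual definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$ with $R_j^2$ the determination coefficient of the regression of $X_j$ on all other explanatory variables (with intercept). The function name `vif` and the simulated data are hypothetical.

```python
import numpy as np

def vif(x):
    """Variance inflation factors for the columns of x (x has no intercept column)."""
    n, p = x.shape
    out = np.empty(p)
    for j in range(p):
        # Regress column j on an intercept and all other columns.
        others = np.column_stack([np.ones(n), np.delete(x, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, x[:, j], rcond=None)
        resid = x[:, j] - others @ coef
        centered = x[:, j] - x[:, j].mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)   # R_j^2
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated explanatory variables (hypothetical): x3 is nearly x1 + x2.
rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
x_mat = np.column_stack([x1, x2, x3])

vifs = vif(x_mat)
```

The same numbers appear on the diagonal of the inverse of the correlation matrix of the explanatory variables, which gives a quick cross-check.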


3) Assessment of model quality – collinearity

How to detect collinearity

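ii) condition number (read in Reader)

iii) condition indices
make use of the singular value decomposition $X = U D V^T$, with $U^T U = V^T V = I$ and $D = \mathrm{diag}(\mu_1, \dots, \mu_{p+1})$, where $\mu_k \ge 0$ are the singular values of $X$

k-th condition index: $\kappa_k = \mu_{\max} / \mu_k$

If $\mu_k$ is small, thus $\kappa_k$ large → collinearity,

because then $X v_k = \mu_k u_k \approx 0$, i.e. $\sum_j v_{jk} X_j \approx 0$, where $u_k$ and $v_k$ are the k-th columns of $U$ and $V$;

if $v_{jk}$ is not too small, then $X_j$ is involved in the collinearity.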
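A sketch of method (iii) on simulated data, taking the k-th condition index to be the largest singular value divided by the k-th one. Scaling the columns to unit length before the decomposition and the cut-off 0.2 for "not too small" are assumptions of this example, not prescriptions from the lecture.

```python
import numpy as np

# Simulated design matrix (hypothetical): x3 is nearly x1 + x2.
rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

# Assumption: scale the columns to unit length before the decomposition.
Xs = X / np.linalg.norm(X, axis=0)

U, mu, Vt = np.linalg.svd(Xs, full_matrices=False)   # Xs = U diag(mu) Vt
cond_idx = mu.max() / mu                              # condition indices

k = int(np.argmax(cond_idx))                          # position of the smallest singular value
v_k = Vt[k]                                           # k-th right singular vector
near_zero = Xs @ v_k                                  # equals mu[k] * U[:, k], close to 0
involved = np.flatnonzero(np.abs(v_k) > 0.2)          # hypothetical cut-off for "not too small"
```

Here `involved` lists the column indices with a sizeable entry in `v_k`, so it points at the columns that take part in the near-collinearity.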


3) Assessment of model quality – collinearity

How to detect collinearity

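iv) variance decomposition proportions

$\pi_{kj} = \frac{v_{jk}^2 / \mu_k^2}{\sum_l v_{jl}^2 / \mu_l^2}$

because (from the s.v.d. $X = U D V^T$) $\mathrm{Var}(\hat{\beta}_j) = \sigma^2 \sum_k \frac{v_{jk}^2}{\mu_k^2}$, where $\mu_k$ are the singular values of $X$ and $v_{jk}$ the entries of $V$.

If the condition index $\kappa_k = \mu_{\max} / \mu_k$ is large, then investigate which terms are involved via the $\pi_{kj}$.

Write the $\pi_{kj}$ in a matrix and look in the row of a large $\kappa_k$ (= small $\mu_k$) for the $\pi_{kj}$ that are close to 1. The corresponding $X_j$ are involved in the collinearity.

Easier to see than with method (iii).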
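The matrix of variance decomposition proportions can be sketched as follows, assuming the proportions are the terms $v_{jk}^2/\mu_k^2$ divided by their sum over $k$. The function name `variance_decomposition`, the unit-length column scaling and the simulated data are hypothetical choices for this example.

```python
import numpy as np

def variance_decomposition(X):
    """Condition indices and variance decomposition proportions from the SVD of X."""
    _, mu, Vt = np.linalg.svd(X, full_matrices=False)
    phi = (Vt ** 2) / (mu ** 2)[:, None]      # phi[k, j] = v_jk^2 / mu_k^2
    pi = phi / phi.sum(axis=0)                # proportions: each column j sums to 1
    return mu.max() / mu, pi                  # row k of pi belongs to condition index k

# Simulated design matrix (hypothetical): x3 is nearly x1 + x2.
rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
Xs = X / np.linalg.norm(X, axis=0)            # assumption: unit-length columns

kappa, pi = variance_decomposition(Xs)
row = pi[int(np.argmax(kappa))]               # proportions in the row of the largest index
```

In this example the entries of `row` for the three related columns are close to 1, while the entry for the intercept column stays small, which is the pattern to look for.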


3) Assessment of model quality – collinearity

Solutions for collinearity

No general guideline exists.

Sometimes:
- leave out one or more explanatory variables
- scale explanatory variables
- center explanatory variables
Note: the variable may lose its meaning.

Always:
- try to find an explanation; this may lead to the right choice


3) Assessment of model quality – example

Now: example with the body fat data (separate document)