regression: motivation one dimensional data (summary by mean) 10 20 30 40 50


Upload: arthur-tiller

Post on 02-Apr-2015


TRANSCRIPT

Page 1:

Regression: Motivation

One dimensional data

(Summary by Mean)

10 20 30 40 50

Page 2:

X       (X - a)^2
10      (10 - a)^2
20      (20 - a)^2
30      (30 - a)^2
40      (40 - a)^2
50      (50 - a)^2
Sum:
150     T

T is minimized when a = mean = 30.
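As an illustration of the table above, the short sketch below evaluates T for a handful of candidate values of a and picks the smallest. The candidate grid (0 to 60 in steps of 5) and the function name are assumptions made for this example only.

```python
# One-dimensional data from the slide.
data = [10, 20, 30, 40, 50]


def total_squared_deviation(a, values):
    """T = sum of (X - a)^2 over all observations."""
    return sum((x - a) ** 2 for x in values)


mean = sum(data) / len(data)

# Hypothetical grid of candidate summaries a (an assumption for illustration).
candidates = range(0, 61, 5)
totals = {a: total_squared_deviation(a, data) for a in candidates}
best = min(totals, key=totals.get)

print("mean =", mean)
print("candidate with the smallest T =", best, "with T =", totals[best])
```

Among the candidates tried, T is smallest at a = 30, which is the mean of the data.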

Page 3:

Regression

Estriol (mg/24 hr) and Birth Wt (g/100)

Estriol  Birth Wt     Estriol  Birth Wt
7        25           17       30
9        25           17       32
9        25           17       36
12       27           18       35
14       27           18       37
14       30           19       31
15       32           19       34
15       34           20       38
15       34           21       30
15       35           22       40
16       27           24       28
16       24           24       43
16       30           25       32
16       31           25       39
16       32           27       34
16       35

Page 4:

Regression

• Concerns
  – Data summarization
    • (As in one-dimensional data)
  – Prediction of low-birthweight babies
    • (for special prenatal care for those at high risk)

Page 5:

Scatter plot

[Figure: scatter plot of Birth weight (y-axis, 24 to 43) against Estriol (x-axis, 7 to 27)]

Page 6:

Lines through scatter plot to represent the data

[Figure: scatter plot with candidate lines Line 2, Line 3, Line 4 and Line 5 drawn through it. x-axis: Estriol (mg/24 hr), 7 to 27; y-axis: Birthweight (g/100), 24 to 43]

Page 7:

Regression line: The best line

The best representation of data

[Figure (Fig Reg 1.6): Regression Line through Scatter Plot. x-axis: Estriol (mg/24 hr), 7 to 27; y-axis: Birthweight (g/100), 24 to 43]

Page 8:

What is it with lines and numbers anyway?

• They can be the same thing in two different forms or languages

• But lines take less space to record, are easier to remember, and are easy to comprehend

• A line can be a pictorial or a mathematical representation of numerical data

Page 9:

• A line: Y = 2 + 3X

Slope = 3

Intercept = 2

(interpretation ??)

Numbers generated by the line

x     y
0     2
1     5
2     8
…     …
50    152
…     …
…     …
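A minimal sketch of how the line Y = 2 + 3X generates the numbers in the table. The function name and its default arguments are chosen for this example only.

```python
def line(x, intercept=2, slope=3):
    """Value generated by the straight line Y = intercept + slope * X."""
    return intercept + slope * x


# The x values listed explicitly on the slide.
table = [(x, line(x)) for x in (0, 1, 2, 50)]

for x, y in table:
    print(x, y)
```

Each step of 1 in x adds the slope (3) to y, and the value at x = 0 is the intercept (2).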

Page 10:

Representation of bivariate measurements in different forms

• Equation: Y = 2 + 3X

• Data/Number:

    x     y
    0     2
    1     5
    2     8
    …     …
    50    152
    …     …
    …     …

• Picture/Graph:

[Figure: straight line plotted with X on the horizontal axis (ticks at 0 and 3) and Y on the vertical axis (ticks at 2 and 11)]

Page 11:

Straight lines

[Figure: A Straight Line, with the intercept marked on the Y axis (axes X and Y)]

[Figure: Two Straight Lines with the Same Slope but Different Intercepts (axes X and Y)]

Page 12:

Straight lines

[Figure: Two Straight Lines with the Same Intercept but Different Slopes (axes X and Y)]

[Figure: Straight Line with Zero Slope and Zero Intercept, with labels "Zero Slope" and "Zero Intercept" (axes X and Y)]

Page 13:

Regression: what line will generate the data?

Estriol  Birth Wt     Estriol  Birth Wt
7        25           17       30
9        25           17       32
9        25           17       36
12       27           18       35
14       27           18       37
14       30           19       31
15       32           19       34
15       34           20       38
15       34           21       30
15       35           22       40
16       27           24       28
16       24           24       43
16       30           25       32
16       31           25       39
16       32           27       34
16       35

Page 14:

Regression: what line will generate the data?

[Figure: scatter plot of Birth weight (y-axis, 24 to 43) against Estriol (x-axis, 7 to 27)]

Page 15:

Which is the best line?

[Figure: scatter plot with candidate lines Line 1, Line 2, Line 3, Line 4 and Line 5. x-axis: Estriol (mg/24 hr), 7 to 27; y-axis: Birthweight (g/100), 24 to 43]

Page 16:

The best line

Birthweight = 21.52 + 0.608 Estriol

[Figure: Regression Line through Scatter Plot. x-axis: Estriol (mg/24 hr), 7 to 27; y-axis: Birthweight (g/100), 24 to 43]
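The slides do not show how the best line is computed. As a sketch, the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) can be applied to the 31 estriol and birthweight pairs tabulated elsewhere in this deck. The function name fit_line is an assumption made for this example.

```python
# Estriol (mg/24 hr) and birthweight (g/100) for the 31 patients in the deck.
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(estriol, bweight)
print(f"Birthweight = {intercept:.2f} + {slope:.3f} Estriol")
```

With these data the fitted coefficients round to the values quoted on the slide.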

Page 17:

Computer output

Coefficients (a)

                 Unstandardized          Standardized                       95% Confidence
                 Coefficients            Coefficients                       Interval for B
Model            B         Std. Error    Beta            t        Sig.      Lower Bound   Upper Bound
1  (Constant)    21.523    2.620                         8.214    .000      16.164        26.883
   ESTRIOL       .608      .147          .610            4.143    .000      .308          .908

a. Dependent Variable: BWEIGHT
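The standard errors, t values and confidence limits in the output are not derived in the slides. The sketch below recomputes them from the 31 data pairs using the textbook formulas for simple linear regression. The critical value 2.045 (97.5th percentile of the t distribution with 29 degrees of freedom) is hard-coded as an assumption rather than looked up with a statistics library.

```python
import math

estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
mean_x = sum(estriol) / n
mean_y = sum(bweight) / n
sxx = sum((x - mean_x) ** 2 for x in estriol)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estriol, bweight))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(estriol, bweight))
s = math.sqrt(sse / (n - 2))

se_slope = s / math.sqrt(sxx)
se_intercept = s * math.sqrt(1 / n + mean_x ** 2 / sxx)

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

T_CRIT = 2.045  # assumed t critical value for 29 df, two-sided 95%
ci_slope = (slope - T_CRIT * se_slope, slope + T_CRIT * se_slope)
ci_intercept = (intercept - T_CRIT * se_intercept,
                intercept + T_CRIT * se_intercept)

print(f"(Constant) B={intercept:.3f} SE={se_intercept:.3f} t={t_intercept:.3f}")
print(f"ESTRIOL    B={slope:.3f} SE={se_slope:.3f} t={t_slope:.3f}")
print("95% CI for slope:", ci_slope)
print("95% CI for intercept:", ci_intercept)
```

Up to rounding, the results should agree with the B, Std. Error, t and confidence-interval columns of the output.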

Page 18:

Regression

The Saga continues

Page 19:

Out of curiosity

How did this accomplish what we wanted (i.e. data summarization and identifying women who might need special prenatal care)?

Page 20:

1. We end up with the line Birthweight = 21.52 + 0.608 Estriol, hoping that this line will generate the original data.

2. In the univariate case, the mean is closest to the data, in the sense that it minimizes the sum of squared differences. In a similar way, the regression line is the closest line to the data. In that sense it summarizes the data.

Page 21:

Recall

One dimensional data

(Summary by Mean)

10 20 30 40 50

Page 22:

Recall

X     (X - a)^2       Bweight   (Bweight - L)^2
10    (10 - a)^2      25        (25 - L)^2
20    (20 - a)^2      25        (25 - L)^2
30    (30 - a)^2      25        (25 - L)^2
40    (40 - a)^2      27        (27 - L)^2
50    (50 - a)^2      …         …

Mean = 30 minimizes the sum of the (X - a)^2 column.

L = 21.52 + 0.608 Estriol minimizes the sum of the (Bweight - L)^2 column. This is the regression line.

Page 23:

Prediction

• Women that need special care

• If low birthweight is defined as < 2500 g (25 in the g/100 units used here), then women with an estriol level < 5.72 mg/24 hr are at high risk of having low-birthweight babies, since 21.52 + 0.608 × 5.72 ≈ 25.
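The cutoff of 5.72 follows from solving the regression equation for the estriol level at which the predicted birthweight equals 25 (2500 g in units of g/100). A minimal sketch, with variable and function names chosen for this example:

```python
INTERCEPT = 21.52   # from the regression line on the slides
SLOPE = 0.608
LOW_BIRTHWEIGHT = 25  # 2500 g expressed in g/100

# Solve LOW_BIRTHWEIGHT = INTERCEPT + SLOPE * estriol for estriol.
estriol_cutoff = (LOW_BIRTHWEIGHT - INTERCEPT) / SLOPE


def predicted_birthweight(estriol):
    """Birthweight (g/100) generated by the regression line."""
    return INTERCEPT + SLOPE * estriol


print(f"estriol cutoff = {estriol_cutoff:.2f} mg/24 hr")
print(f"predicted birthweight at cutoff = {predicted_birthweight(estriol_cutoff):.2f}")
```

Because the slope is positive, estriol levels below the cutoff give predicted birthweights below 25.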

Page 24:

• So is everything fine and dandy?

• Not necessarily:
  – How closely does the regression line generate the data?
  – How much is estriol responsible for birthweight?
  – Was there something that would have better predicted women at risk?

Page 25:

                              Birthweights Generated From   Observed Difference Squared From
Obs. No.  Estriol  Observed   Line 1.1   Line 1.2           Line 1.1       Line 1.2
(a)       (b)      Data (c)   (d)        (e)                [(c)-(d)]^2    [(c)-(e)]^2

1         7        25         20.5       25.776             20.25          0.6022
2         9        25         23.5       26.992             2.25           3.9681
3         9        25         23.5       26.992             2.25           3.9681
4         12       27         28.0       28.816             1.00           3.2979
5         14       27         31.0       30.032             16.00          9.1930
6         14       30         31.0       30.032             1.00           0.0010
7         15       32         32.5       30.640             0.25           1.8496
8         15       34         32.5       30.640             2.25           11.2896
9         15       34         32.5       30.640             2.25           11.2896
10        15       35         32.5       30.640             6.25           19.0096
11        16       27         34.0       31.248             49.00          18.0455
12        16       24         34.0       31.248             100.00         52.5335
13        16       30         34.0       31.248             16.00          1.5575
14        16       31         34.0       31.248             9.00           0.0615
15        16       32         34.0       31.248             4.00           0.5655
16        16       35         34.0       31.248             1.00           14.0775
17        17       30         35.5       31.856             30.25          3.4447
18        17       32         35.5       31.856             12.25          0.0207
19        17       36         35.5       31.856             0.25           17.1727
20        18       35         37.0       32.464             4.00           6.4313
21        18       37         37.0       32.464             0.00           20.5753
22        19       31         38.5       33.072             56.25          4.2932
23        19       34         38.5       33.072             20.25          0.8612
24        20       38         40.0       33.680             4.00           18.6624
25        21       30         41.5       34.288             132.25         18.3869
26        22       40         43.0       34.896             9.00           26.0508
27        24       28         46.0       36.112             324.00         65.8045
28        24       43         46.0       36.112             9.00           47.4445
29        25       32         47.5       36.720             240.25         22.2784
30        25       39         47.5       36.720             72.25          5.1984
31        27       34         50.5       37.936             272.25         15.4921

Sum       534.00   992.00     1111.00    992.00             1419.00        423.43
Mean      17.23    32.00      35.84      32.00              -              -
Variance  22.58    22.47      50.81      8.35               -              -
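The comparison in this table can be reproduced in a few lines. In the sketch below the Line 1.1 values are copied from column (d), and the Line 1.2 values are generated from 21.52 + 0.608 × estriol, the regression line quoted earlier in the deck, which matches column (e). The helper name sum_squared_difference is an assumption made for this example.

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
observed = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
            30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

# Column (d): birthweights generated from Line 1.1, copied from the table.
line_1_1 = [20.5, 23.5, 23.5, 28.0, 31.0, 31.0, 32.5, 32.5, 32.5, 32.5,
            34.0, 34.0, 34.0, 34.0, 34.0, 34.0, 35.5, 35.5, 35.5, 37.0,
            37.0, 38.5, 38.5, 40.0, 41.5, 43.0, 46.0, 46.0, 47.5, 47.5, 50.5]

# Column (e): birthweights generated from the regression line.
line_1_2 = [21.52 + 0.608 * e for e in estriol]


def sum_squared_difference(obs, generated):
    """Sum over patients of (observed - generated)^2."""
    return sum((o - g) ** 2 for o, g in zip(obs, generated))


total_1_1 = sum_squared_difference(observed, line_1_1)
total_1_2 = sum_squared_difference(observed, line_1_2)

print(f"Line 1.1: {total_1_1:.2f}")
print(f"Line 1.2: {total_1_2:.2f}")
```

The much smaller total for Line 1.2 is what is meant by the regression line being the closest line to the data.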

Page 26:

E = estriol, BW = observed birthweight, Pred = birthweight predicted by the regression line, Diff = BW - Pred

E       BW      Pred       Diff
7.00    25.00   25.78076   -.78076
9.00    25.00   26.99714   -1.99714
9.00    25.00   26.99714   -1.99714
12.00   27.00   28.82171   -1.82171
14.00   27.00   30.03810   -3.03810
14.00   30.00   30.03810   -.03810
15.00   32.00   30.64629   1.35371
15.00   34.00   30.64629   3.35371
15.00   34.00   30.64629   3.35371
15.00   35.00   30.64629   4.35371
16.00   27.00   31.25448   -4.25448
16.00   24.00   31.25448   -7.25448
16.00   30.00   31.25448   -1.25448
16.00   31.00   31.25448   -.25448
16.00   32.00   31.25448   .74552
16.00   35.00   31.25448   3.74552
17.00   30.00   31.86267   -1.86267
17.00   32.00   31.86267   .13733
17.00   36.00   31.86267   4.13733
18.00   35.00   32.47086   2.52914
18.00   37.00   32.47086   4.52914
19.00   31.00   33.07905   -2.07905
19.00   34.00   33.07905   .92095
20.00   38.00   33.68724   4.31276
21.00   30.00   34.29543   -4.29543
22.00   40.00   34.90362   5.09638
24.00   28.00   36.12000   -8.12000
24.00   43.00   36.12000   6.88000
25.00   32.00   36.72819   -4.72819
25.00   39.00   36.72819   2.27181
27.00   34.00   37.94457   -3.94457

Page 27:

How good is the regression

[Figure (Fig Reg 1.6): Regression Line through Scatter Plot. x-axis: Estriol (mg/24 hr), 7 to 27; y-axis: Birthweight (g/100), 24 to 43]

Page 28:

How good is the regression

• R^2 = 0.372
  – Estriol explains about 37.2% of the variation in the birthweights. The remaining 62.8% is attributable to other factors.
  – At estriol 16, we have several birthweights (24, 27, 30, 31, 32 and 35). If estriol were the only factor for birthweight, we would not see this variation.
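The slides quote R^2 without showing the calculation. One common definition is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean birthweight. The sketch below applies that definition to the 31 data pairs and also lists the birthweights observed at estriol 16.

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
mean_x = sum(estriol) / n
mean_y = sum(bweight) / n
sxx = sum((x - mean_x) ** 2 for x in estriol)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estriol, bweight))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation around the mean, and variation left over after the line.
ss_total = sum((y - mean_y) ** 2 for y in bweight)
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(estriol, bweight))
r_squared = 1 - ss_residual / ss_total

# Birthweights of all patients who share the same estriol level of 16.
at_16 = sorted(y for x, y in zip(estriol, bweight) if x == 16)

print(f"R^2 = {r_squared:.3f}")
print("birthweights at estriol 16:", at_16)
```

The spread of birthweights at a single estriol level is variation that the line, which gives one value per estriol level, cannot account for.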

Page 29:

How good is the regression

[Figure: Regression line and 95% confidence intervals around predicted values. Series: Bweight, line, upper, lower. x-axis: Estriol, 7 to 27; y-axis: 22.4777 to 43]

Page 30:

Other factors

Multiple Regression

Page 31:

Regression Diagnostics

Residual Analysis

Page 32:

Diagnostics

• Residual for a patient (observation)
  – Difference between the observed birthweight and the birthweight the regression line would generate (predict)

• Example (for the first patient):
  – Observed birthweight = 25
  – Generated = 21.52 + 0.608 × estriol = 21.52 + 0.608(7) = 25.776
  – Residual = 25 - 25.776 = -0.776
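The worked example for the first patient translates directly into code. The function names below are chosen for this sketch only.

```python
def generated_birthweight(estriol, intercept=21.52, slope=0.608):
    """Birthweight (g/100) that the regression line generates for an estriol level."""
    return intercept + slope * estriol


def residual(observed_birthweight, estriol):
    """Observed birthweight minus the birthweight generated by the line."""
    return observed_birthweight - generated_birthweight(estriol)


# First patient: estriol 7, observed birthweight 25.
first_generated = generated_birthweight(7)
first_residual = residual(25, 7)

print(f"generated = {first_generated:.3f}")
print(f"residual  = {first_residual:.3f}")
```

A negative residual means the line generates a larger birthweight than was observed for that patient.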

Page 33:

Diagnostics

• Residual plots

• Plot of residuals against predicted values

• For checking assumptions
  – Normality, linearity and homoscedasticity

Page 34:

[Figures: residual plots labeled "Non normal", "Heteroscedasticity" and "Nonlinearity"]

Page 35:

Diagnostics

• Residuals for influential patients (observations)
  – Change in the estimated parameters (slope and intercept) when the analysis is redone without the patient in question

• Patients with high leverage and a large residual will have greater influence.
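Redoing the analysis without each patient in turn can be sketched as follows. The leverage formula used here, 1/n + (x - mean of x)^2 / Sxx, is the standard one for a line with a single predictor and is not stated in the slides. All names are chosen for this example.

```python
estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


full_intercept, full_slope = fit_line(estriol, bweight)
n = len(estriol)
mean_x = sum(estriol) / n
sxx = sum((x - mean_x) ** 2 for x in estriol)

# One row per patient: (obs no., leverage, residual, change in intercept,
# change in slope when the line is refitted without that patient).
changes = []
for i in range(n):
    x_rest = estriol[:i] + estriol[i + 1:]
    y_rest = bweight[:i] + bweight[i + 1:]
    intercept_i, slope_i = fit_line(x_rest, y_rest)
    leverage = 1 / n + (estriol[i] - mean_x) ** 2 / sxx
    resid = bweight[i] - (full_intercept + full_slope * estriol[i])
    changes.append((i + 1, leverage, resid,
                    intercept_i - full_intercept, slope_i - full_slope))

most_influential = max(changes, key=lambda row: abs(row[4]))
highest_leverage = max(changes, key=lambda row: row[1])

print("largest slope change: obs", most_influential[0])
print("highest leverage:     obs", highest_leverage[0])
```

The patient with the highest leverage need not be the one whose removal changes the slope most, because influence depends on leverage and residual together.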

Page 36:

Diagnostics

• Standardized and studentized (or jackknife) residuals
  – Large values of these residuals for a patient indicate an outlier
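The slides do not give formulas for these residuals, and terminology varies between textbooks and software. The sketch below uses one common convention, stated here as an assumption: the standardized residual divides each residual by s × sqrt(1 - leverage), and the jackknife residual rescales it as if the patient had been left out when estimating s.

```python
import math

estriol = [7, 9, 9, 12, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
           17, 17, 17, 18, 18, 19, 19, 20, 21, 22, 24, 24, 25, 25, 27]
bweight = [25, 25, 25, 27, 27, 30, 32, 34, 34, 35, 27, 24, 30, 31, 32, 35,
           30, 32, 36, 35, 37, 31, 34, 38, 30, 40, 28, 43, 32, 39, 34]

n = len(estriol)
p = 2  # number of estimated parameters: intercept and slope
mean_x = sum(estriol) / n
mean_y = sum(bweight) / n
sxx = sum((x - mean_x) ** 2 for x in estriol)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estriol, bweight))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(estriol, bweight)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - p))
leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in estriol]

# Assumed convention: residual scaled by its estimated standard deviation.
standardized = [e / (s * math.sqrt(1 - h))
                for e, h in zip(residuals, leverages)]

# Jackknife version obtained from the standardized residual.
jackknife = [r * math.sqrt((n - p - 1) / (n - p - r ** 2))
             for r in standardized]

largest = max(range(n), key=lambda i: abs(standardized[i]))
print("largest standardized residual: obs", largest + 1)
print(f"standardized = {standardized[largest]:.2f}, "
      f"jackknife = {jackknife[largest]:.2f}")
```

A frequently used rule of thumb, not taken from the slides, is to look more closely at patients whose values exceed about 2 in absolute size.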