
Page 1: PEARSON’S FATHER-SON DATA - University of Chicagogalton.uchicago.edu/~wichura/Stat200/Handouts/C10.pdf · PEARSON’S FATHER-SON DATA † The following scatter diagram shows the

PEARSON’S FATHER-SON DATA

• The following scatter diagram shows the heights of 1,078 fathers and their full-grown sons, in England, circa 1900. There is one dot for each father-son pair.

[Figure: "Heights of fathers and their full grown sons." Scatter diagram; horizontal axis: father’s height (inches), vertical axis: son’s height (inches), both running from 58 to 78.]

• How would you describe the relationship between the heights of the fathers and the heights of their sons?

• For a father of a given height, what height would you predict for his son?


• How tall are the sons of 6 foot fathers?

[Figure: "Father-son pairs where the father is 6 feet tall." The same scatter diagram (father’s height against son’s height, in inches, 58 to 78), with a cross (×) marked on it.]

• The points in the vertical chimney are the father-son pairs where the father is 6 feet tall, to the nearest inch.

• The cross marks the average height of the sons of these fathers.

• These sons are ____ inches tall, on average.

• This is the natural guess for the height of a son of a 6-foot father.


• How does son’s height vary with father’s height?

[Figure: "The graph of averages." The same scatter diagram (father’s height against son’s height, in inches, 58 to 78), with thirteen crosses (×) marking the averages.]

• Reading from left to right, the crosses mark the average of son’s height, for fathers who are respectively 61, 62, ..., 73 inches tall, to the nearest inch.

• These averages are the conditional means of son’s height, for fathers of the given heights.

• As fathers increase in height, on average so do their sons. The relationship is nearly linear.

• Are the bumps “real”?
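The graph of averages is simple to compute once the pairs are in hand. The sketch below is a minimal illustration, not the handout's own procedure: the function name `graph_of_averages` and the toy pairs are hypothetical, and the toy numbers are made up (they are not Pearson's data). It groups the pairs by x rounded to the nearest inch and averages y within each group.

```python
from collections import defaultdict


def graph_of_averages(pairs):
    """Average y within each group of x values rounded to the nearest unit.

    `pairs` is an iterable of (x, y), e.g. (father, son) heights in inches.
    Returns a list of (rounded_x, mean_y, count), sorted by rounded_x.
    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    groups = defaultdict(list)
    for x, y in pairs:
        groups[round(x)].append(y)
    return [(x, sum(ys) / len(ys), len(ys)) for x, ys in sorted(groups.items())]


# Made-up illustrative pairs, NOT Pearson's data.
toy_pairs = [(71.8, 70.1), (72.3, 72.0), (72.1, 69.5), (65.2, 66.0), (64.9, 67.0)]

for x, mean_y, count in graph_of_averages(toy_pairs):
    print(f"fathers about {x} in: sons average {mean_y:.2f} in (n = {count})")
```

Returning the group sizes alongside the averages is a deliberate choice: groups with only a few pairs give noisy averages, which bears on whether the bumps are real.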


• What’s a simple way to describe how son’s height varies with father’s height?

[Figure: "The graph of averages and the regression line." The same scatter diagram (father’s height against son’s height, in inches, 58 to 78), showing the crosses (×) of the graph of averages together with the regression line.]

• The so-called regression line (of son’s height on father’s height) is a smoothed version of the graph of averages.

• The average height of sons whose fathers are x inches tall is approximately equal to the y coordinate of the point on the regression line directly above x.


• How does the regression line compare to the SD line?

[Figure: "The regression line and the SD line." Scatter diagram; horizontal axis: father’s height (inches), average = 67.7, SD = 2.7; vertical axis: son’s height (inches), average = 68.7, SD = 2.7; correlation coefficient r = 0.5. Both the regression line and the SD line are drawn.]

• How is the regression line like the SD line?
  • It goes through the point of averages. Fathers of average height tend to have sons of average height.

• How is it different from the SD line?
  • It is less steeply sloped, by a factor of r.

• What does that mean?
  • For each increase of one SD in father’s height, there is an increase of only r SDs in son’s height, on average.
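The comparison of slopes can be checked numerically. This is a minimal sketch, assuming the summary statistics quoted on the diagram (averages 67.7 and 68.7, both SDs 2.7, r = 0.5); the function names `sd_line` and `regression_line` are hypothetical.

```python
def sd_line(ave_x, sd_x, ave_y, sd_y, r):
    """Return (slope, intercept) of the SD line.

    The SD line passes through the point of averages and rises one SD of y
    for each SD of x (it falls instead when r is negative).
    """
    slope = sd_y / sd_x if r >= 0 else -sd_y / sd_x
    return slope, ave_y - slope * ave_x


def regression_line(ave_x, sd_x, ave_y, sd_y, r):
    """Return (slope, intercept) of the regression line of y on x."""
    slope = r * sd_y / sd_x
    return slope, ave_y - slope * ave_x


# Summary statistics read off the father-son diagram.
stats = dict(ave_x=67.7, sd_x=2.7, ave_y=68.7, sd_y=2.7, r=0.5)
sd_slope, sd_icpt = sd_line(**stats)
reg_slope, reg_icpt = regression_line(**stats)

print(f"SD line slope:         {sd_slope:.2f} inches of son per inch of father")
print(f"regression line slope: {reg_slope:.2f} inches of son per inch of father")
```

Both lines pass through the point of averages; only the slopes differ, and the regression slope is r times the SD-line slope.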


• Where does the term “regression” come from?

[Figure: "The regression line and the SD line." Scatter diagram; horizontal axis: father’s height (inches), average = 67.7, SD = 2.7; vertical axis: son’s height (inches), average = 68.7, SD = 2.7; correlation coefficient r = 0.5. Both the regression line and the SD line are drawn.]

• Sons of fathers who are taller (shorter) than average are themselves taller (shorter) than average, but not by as much as the fathers.

• There is a “falling back” towards the mean. In Galton’s terms: a “regression towards mediocrity.”

• The term “regression” is widely used in statistics in connection with how one variable varies with respect to another.


• Why do sons of fathers who are taller than average tend themselves to be taller than average, but not by as much as their fathers?

• First answer: A person’s height depends on the heights of both the mother and father. Very tall men generally marry women who are not exceptionally tall, not because they prefer short women, but because factors other than height enter into the choice of a mate. A man who is 2 SDs taller than the average male may marry a woman who is 2 SDs taller than the average female. But because attributes other than height matter, the wife is very likely to be a woman who is less than 2 SDs taller than average, just because there are so many more such women. Thus, most very tall men marry women who are not as exceptionally tall as they are, and so have sons who are not as exceptionally tall either.

• Second answer: Diet, exercise, and other environmental factors influence height, so that observed height is not a perfect reflection of one’s genes. Someone who is very tall is much more likely to be the unusually tall result of more average genes than the unusually short result of extremely tall genes, simply because there are many more average genes in the population. Thus the observed height of a very tall person is usually an overstatement of his or her genetic height, which is what determines the expected height of the child.


THE REGRESSION METHOD

• For any football-shaped scatter diagram, the average value of y corresponding to a given value of x may be estimated by the regression line (for y on x).

[Figure: a football-shaped scatter diagram of y against x.]

[Figure: the regression line passing through the point of averages (Ave_x, Ave_y). Moving # SD_x horizontally from Ave_x to x corresponds to moving r × # SD_y vertically from Ave_y to the regression estimate.]

For however many SDs x is above or below average, the regression estimate for the conditional mean of y given x is only r times that many SDs above or below the overall average for y.
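In symbols, the rule above can be written as a single formula. This is only a restatement; the hat notation for the regression estimate is introduced here and does not appear in the handout.

```latex
\hat{y} \;=\; \mathrm{Ave}_y \;+\; r \cdot \frac{x - \mathrm{Ave}_x}{\mathrm{SD}_x} \cdot \mathrm{SD}_y
```

The fraction is the number of SDs by which x is above (positive) or below (negative) average; multiplying by r and then by SD_y converts it into a distance from Ave_y in the units of y.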


• Associated with each increase of one SD in x there is an increase of only r SDs in y, on the average.

Why is r the right factor?

• First answer: it just turns out that way. We saw it happen in the father-son example. It happens that way for all football-shaped scatter diagrams.

• Second answer: Three special cases are easy to understand: r = 0, 1, and −1.

  • If r = 0, there is no association between x and y. For a one SD increase in x, there is no increase or decrease in y, on average.

  • If r = 1, there is a perfect positive association between x and y. A one SD increase in x corresponds exactly to a one SD increase in y.

  • If r = −1, there is a perfect negative association. A one SD increase in x corresponds exactly to a one SD decrease in y.


NONLINEAR ASSOCIATION

• Beware: linear regression doesn’t make sense when the data structure isn’t linear.

[Figure: scatter diagram with a nonlinear pattern (x from 0 to 6, y from 0 to 8), showing the regression line and a smoothed version of the graph of averages.]

• You can always use the graph of averages to describe how y changes with respect to x, on average.

• It usually makes sense to smooth the graph of averages to get a simpler description of the variation.

• But when the pattern in the data is nonlinear, it doesn’t make sense to smooth the graph of averages to a straight line.

• How do you tell if the pattern in the data is nonlinear?
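One rough numerical check, offered as a sketch rather than a formal test, is to compare the graph of averages with the regression line and see whether the gaps are systematic. The data below are made up (an exact parabola, not the points in the diagram above), and `pearson_r` is a hypothetical helper.

```python
def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up U-shaped data: one y value per x, so the graph of averages is
# simply the data itself.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [(x - 3) ** 2 for x in xs]

n = len(xs)
ave_x, ave_y = sum(xs) / n, sum(ys) / n
sd_x = (sum((x - ave_x) ** 2 for x in xs) / n) ** 0.5
sd_y = (sum((y - ave_y) ** 2 for y in ys) / n) ** 0.5
r = pearson_r(xs, ys)

# Gap between each average of y and the regression estimate at the same x.
gaps = [y - (ave_y + r * (x - ave_x) / sd_x * sd_y) for x, y in zip(xs, ys)]

print(f"r = {r:.2f}")
print("gaps (average minus regression estimate):", [round(g, 1) for g in gaps])
```

In this toy example r is 0, so the regression line is flat at the average of y, while the averages run from 5 above the line at the ends to 4 below it in the middle. Gaps that keep one sign over a stretch of x and then switch, instead of scattering around zero, suggest a nonlinear pattern.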


PREDICTIONS FOR INDIVIDUALS

• Consider again the 1,078 pairs of fathers and sons.

[Figure: "Heights of fathers and sons in England, circa 1900." Scatter diagram; horizontal axis: father’s height (inches), vertical axis: son’s height (inches), both running from 58 to 78. The sons’ average, 68.7, is marked on the vertical axis.]

• I’m going to pick one of these pairs at random, and ask you to predict the son’s height.

  • If I tell you nothing about the father, your prediction would be ____. In the diagram, that corresponds to using the ____ line.
  • If I tell you the father’s height, your prediction would be ____. In the diagram, that corresponds to using the ____ line.

• Would you use the regression line to predict the height of a son:

  • For fathers in England circa 1900?
  • For fathers who are midgets?
  • For fathers in the U.S. in 1997?

• A regression line can be used to make predictions for individuals. But if you have to extrapolate far from the data, or to a different group of subjects, watch out!


USING THE REGRESSION LINE

• For however many SDs x is above or below average, the regression estimate for the conditional mean of y given x is only r times that many SDs above or below the overall average for y.

[Figure: the regression line passing through the point of averages (Ave_x, Ave_y). Moving # SD_x horizontally from the point of averages to x corresponds to moving r × # SD_y vertically to the regression estimate.]

• Consider the father-son data.

average height of fathers ≈ 68″, SD ≈ 3″
average height of sons ≈ 69″, SD ≈ 3″, r ≈ 0.5

The scatter diagram is football-shaped. On average, how tall are sons of 6-foot fathers?

  • A 6-foot father is ____ inches above average in height.
  • That’s ____ of an SD.
  • The average height of sons of 6-foot fathers is ____ the overall average by ____ of an SD.
  • That’s ____ inches.
  • So sons of 6-foot fathers are on average ____ inches tall.
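One way to work through these steps, using only the rounded summary statistics given above (68″, 69″, SDs of 3″, r = 0.5) and taking a 6-foot father to be 72 inches tall:

```latex
\begin{align*}
72 - 68 &= 4 \text{ inches above average (father)} \\
4 / 3 &\approx 1.33 \text{ SDs above average (father)} \\
0.5 \times 4/3 &= 2/3 \approx 0.67 \text{ SDs above average (sons)} \\
(2/3) \times 3 &= 2 \text{ inches} \\
69 + 2 &= 71 \text{ inches}
\end{align*}
```

The sons are above average, like their fathers, but by only half as many SDs.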


USING THE REGRESSION LINE

• For however many SDs x is above or below average, the regression estimate for the conditional mean of y given x is only r times that many SDs above or below the overall average for y.

• For the men ages 18–24 in the HANES sample, the relationship between height and systolic blood pressure can be summarized as follows:

average height ≈ 68″, SD ≈ 3″
average blood pressure ≈ 124 mm, SD ≈ 15 mm, r ≈ −0.2

The scatter diagram is football-shaped. Estimate the average blood pressure of the men who were 64 inches tall.

  • These men are ____ inches ____ average in height.
  • That’s ____ of an SD.
  • On average, these men are ____ the overall average in blood pressure by ____ of an SD.
  • That’s ____ mm.
  • So the average blood pressure of these men is estimated as ____ mm.
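A worked version of the same steps, using the rounded statistics quoted above. The point to watch is the sign: r is negative, so men who are below average in height come out above average in blood pressure.

```latex
\begin{align*}
64 - 68 &= -4 \text{ inches, i.e. 4 inches below average} \\
-4 / 3 &\approx -1.33 \text{ SDs (height)} \\
(-0.2) \times (-4/3) &= 4/15 \approx +0.27 \text{ SDs (blood pressure)} \\
(4/15) \times 15 &= 4 \text{ mm above average} \\
124 + 4 &= 128 \text{ mm}
\end{align*}
```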


USING THE REGRESSION LINE

• In general, to estimate the average value of y for a given value of x, you:

(1) Convert the given value of x to ____.

(2) Multiply by the ____. This gives the corresponding average value of y in standard units.

(3) Convert back from standard units.

[Figure: four parallel scales: original units for x (Ave. − 2 SD, Ave., Ave. + 2 SD), standard units for x (−2 to 2), standard units for y (−2 to 2), and original units for y (Ave. − 2 SD, Ave., Ave. + 2 SD). Arrows labeled Step 1, Step 2 and Step 3 lead from each scale to the next.]

In Step 1, you use the Average and SD of ____.

In Step 3, you use the Average and SD of ____.
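The three steps translate directly into a small function. This is a minimal sketch; the names `to_standard_units`, `from_standard_units` and `regression_estimate` are hypothetical, and the two calls reuse the rounded statistics from the father-son and HANES examples.

```python
def to_standard_units(value, ave, sd):
    """Number of SDs by which `value` is above (+) or below (-) average."""
    return (value - ave) / sd


def from_standard_units(z, ave, sd):
    """Convert a value in standard units back to original units."""
    return ave + z * sd


def regression_estimate(x, ave_x, sd_x, ave_y, sd_y, r):
    """Estimate the average value of y for a given value of x."""
    z_x = to_standard_units(x, ave_x, sd_x)        # Step 1: uses Ave and SD of x
    z_y = r * z_x                                  # Step 2: multiply by r
    return from_standard_units(z_y, ave_y, sd_y)   # Step 3: uses Ave and SD of y


# Sons of 6-foot (72 inch) fathers.
son_height = regression_estimate(72, ave_x=68, sd_x=3, ave_y=69, sd_y=3, r=0.5)
# Blood pressure of men 64 inches tall; note the negative r.
pressure = regression_estimate(64, ave_x=68, sd_x=3, ave_y=124, sd_y=15, r=-0.2)

print(f"sons of 6-foot fathers: about {son_height:.1f} inches")
print(f"men 64 inches tall:     about {pressure:.1f} mm")
```

Keeping the two conversions as separate helpers mirrors the diagram: each arrow between scales is one function call.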


PREDICTING PERCENTILE RANKS

• For the first-year students at Podunk University, the correlation between SAT scores and first-year GPA is 0.60. Predict the percentile rank of the first-year GPA for a student whose percentile rank on the SAT was 30%.

• Line of attack:

[Figure: four parallel scales: SAT points (Ave. − 2 SD, Ave., Ave. + 2 SD), SAT in standard units (−2 to 2), GPA in standard units (−2 to 2), and GPA points (Ave. − 2 SD, Ave., Ave. + 2 SD).]

z      Area (%)
0.30   23.58
0.35   27.37
0.40   31.08
0.45   34.73
0.50   38.29
0.55   41.77

(Area is the percentage under the normal curve between −z and z.)

  • Percentile rank for SAT = ____.
  • Standard units for SAT = ____.
  • Standard units for GPA = ____.
  • Estimated percentile rank for GPA = ____.
  • You don’t need to know the Average and SD for SAT or GPA. But you do need to assume that ____.
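The line of attack can be sketched in a few lines of code. This is an illustration under a stated assumption, namely that both SAT scores and GPAs follow the normal curve closely enough for percentile ranks to be converted to and from standard units; the function name `predict_percentile` is hypothetical. It uses `statistics.NormalDist` from the Python standard library in place of the table.

```python
from statistics import NormalDist


def predict_percentile(percentile_x, r):
    """Predict the percentile rank of y from the percentile rank of x.

    Assumes x and y are both approximately normal (an assumption of this
    sketch), so no averages or SDs are needed.
    """
    std_normal = NormalDist()                        # mean 0, SD 1
    z_x = std_normal.inv_cdf(percentile_x / 100)     # percentile -> standard units
    z_y = r * z_x                                    # regression method
    return 100 * std_normal.cdf(z_y)                 # standard units -> percentile


gpa_percentile = predict_percentile(30, r=0.60)
print(f"estimated percentile rank for GPA: about {gpa_percentile:.0f}%")
```

The estimate comes out a little under 38%, closer to the median than the SAT rank of 30%, as the regression effect suggests it should.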


THE TEST-RETEST SCENARIO

• An instructor standardizes her midterm and final so the class average is 50 and the SD is 10 on both tests. The correlation between the tests is always around 0.50. On one occasion, she took the students who scored in the lower quartile at the midterm and gave them special tutoring. On average, they scored about 6 points higher on the final than they did on the midterm. Does this show that the special tutoring was effective? Answer yes or no, and explain briefly.

[Figure: "Midterm and final scores." Scatter diagram of score on final against score on midterm, both axes running from 30 to 70; r = 0.52. The scores of the students in the lower quartile on the midterm are marked by •’s; their point of averages is marked by the ×. The scores of the remaining students are marked by ◦’s.]


• No. The improvement can be explained by chance variation, using the model

Observed test score = True score + chance error.

The data points in the preceding diagram were actually created as follows. First a set of normally distributed “true scores” were randomly assigned to the students in the class; these scores represent the students’ innate abilities. Then the midterm and final scores were constructed by adding random normally distributed “chance errors” to the true scores; these chance errors account for factors such as preparedness, concentration, and anxiety that vary from test day to test day. The following diagram shows the breakdown of the midterm scores into their innate-ability and test-day components.

[Figure: "Performance on the midterm." Random error on the midterm (−10 to 10) plotted against true score (30 to 70). Students in the lower quartile on the midterm are represented by •’s, the other students by ◦’s.]


The diagram shows that students in the lower quartile on the midterm tended to have some bad luck: their chance errors for that test were mostly negative. Their midterm scores thus tend to understate their true abilities, which is what determines their average performance on the final. In other words, you should expect this group to do better on average on the final, even without any special tutoring. The expected amount of improvement can be found by applying the regression method to estimate the group’s average score on the final from its average score on the midterm.

• What is the expected improvement?

• Suggest a design to study whether special tutoring is effective. What problems might arise?
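The expected improvement can be explored with a small simulation of the model "observed score = true score + chance error". This is a sketch under assumptions that are not in the handout: the class is made very large (100,000 students) to smooth out chance variation, the variances are split so that the two tests correlate at exactly 0.5, and the function name and seed are arbitrary. Nobody is tutored in the simulation.

```python
import random
from statistics import fmean


def simulate_lower_quartile(n=100_000, r=0.5, ave=50.0, sd=10.0, seed=0):
    """Simulate midterm and final scores as true score + chance error.

    Var(true score) = r * sd**2 and Var(chance error) = (1 - r) * sd**2 give
    each test an SD of `sd` and a correlation of r between the two tests.
    Returns the lower-quartile group's average midterm and final scores.
    """
    rng = random.Random(seed)
    sd_true = sd * r ** 0.5
    sd_error = sd * (1 - r) ** 0.5
    scores = []
    for _ in range(n):
        true_score = rng.gauss(ave, sd_true)
        midterm = true_score + rng.gauss(0.0, sd_error)
        final = true_score + rng.gauss(0.0, sd_error)
        scores.append((midterm, final))
    scores.sort()                      # tuples sort by midterm score first
    lower_quartile = scores[: n // 4]  # bottom 25% on the midterm
    mid_ave = fmean(m for m, _ in lower_quartile)
    final_ave = fmean(f for _, f in lower_quartile)
    return mid_ave, final_ave


mid_ave, final_ave = simulate_lower_quartile()
# Regression method applied to the group's midterm average (Ave 50, r 0.5).
regression_estimate = 50.0 + 0.5 * (mid_ave - 50.0)

print(f"lower quartile, midterm average: {mid_ave:.1f}")
print(f"lower quartile, final average:   {final_ave:.1f}")
print(f"regression estimate for final:   {regression_estimate:.1f}")
```

Under these assumptions the untutored lower quartile gains roughly 6 points between the tests, about what the regression method predicts and about what the instructor observed.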

THE REGRESSION EFFECT AND THE REGRESSION FALLACY

• In a typical test-retest situation, the subjects get different scores on the two tests. Take the bottom group on the first test. Some improve on the second test, others do worse; but on the average the bottom group shows an improvement. Now the top group. Some do better the second time, others fall back; but on average the top group does worse the second time. This is the regression effect, and it happens whenever the scatter diagram spreads out around the SD line into a football-shaped cloud of points.

• The regression fallacy consists in thinking that the regression effect is due to something other than spread around the SD line.


THERE ARE TWO REGRESSION LINES

• As well as regressing son’s height on father’s height, we can regress father’s height on son’s height:

[Figure: "The regression of father’s height on son’s height." Scatter diagram; horizontal axis: father’s height (FH), average = 67.7, SD = 2.7; vertical axis: son’s height (SH), average = 68.7, SD = 2.7; r = 0.5. The diagram shows the SD line, the regression line for predicting FH from SH, a horizontal strip of points, and a cross (×).]

• The points in the horizontal strip are the father-son pairs where the son is one SD above average, to the nearest inch.

• The cross at the point (69.3, 71.4) marks the average height of the fathers of these sons. The cross lies very close to the regression line for FH on SH, which predicts these fathers to be above average in height by only r × SD = 0.5 × 2.7 = 1.35 inches.
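The prediction for these fathers can be checked with a few lines of arithmetic. This is a minimal sketch using the summary statistics printed on the figure; the function name `predict_father_height` is made up for illustration.

```python
# Summary statistics read off the figure (heights in inches).
FH_AVE, FH_SD = 67.7, 2.7   # father's height
SH_AVE, SH_SD = 68.7, 2.7   # son's height
R = 0.5                     # correlation coefficient

def predict_father_height(son_height):
    """Regression estimate of father's height from son's height."""
    sds_above = (son_height - SH_AVE) / SH_SD  # son's height in standard units
    return FH_AVE + R * sds_above * FH_SD      # father is only R times as many SDs above

son = SH_AVE + SH_SD                   # a son one SD above average
estimate = predict_father_height(son)  # 67.7 + 0.5 * 2.7
print(round(son, 2), round(estimate, 2))
```

The estimate, 67.7 + 0.5 × 2.7 = 69.05 inches, is close to the observed average of 69.3 inches marked by the cross.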

10–19

• Two regression lines can be drawn across a scatter diagram:

[Figure: schematic scatter diagram with Ave x and Ave y marked on the axes, showing three lines through the point of averages: the regression line for y on x, the SD line, and the regression line for x on y.]

• The regression line for y on x is used to predict y from x.

• It involves thinking about:
  • dividing the scatter diagram into vertical strips;
  • finding the average value of y in each strip; and
  • smoothing the resulting graph of averages to a straight line.

• The line goes through the point of averages and is less steeply sloped than the SD line by a factor of |r|.
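The first two steps of this construction (vertical strips, then the average of y in each strip) can be sketched as follows. The function name `graph_of_averages`, the strip width, and the data are all made up for illustration; the smoothing step is not shown.

```python
from collections import defaultdict
from statistics import mean

def graph_of_averages(pairs, strip_width=1.0):
    """Group (x, y) pairs into vertical strips and average y in each strip.

    Returns a sorted list of (strip centre, average y) points.
    """
    strips = defaultdict(list)
    for x, y in pairs:
        centre = round(x / strip_width) * strip_width  # nearest strip centre
        strips[centre].append(y)
    return sorted((centre, mean(ys)) for centre, ys in strips.items())

# Made-up (x, y) pairs, purely for illustration.
pairs = [(63.8, 66.0), (64.2, 67.0), (66.1, 67.5), (65.9, 68.5),
         (68.0, 68.0), (68.3, 70.0), (70.1, 69.5), (69.8, 71.5)]

averages = graph_of_averages(pairs, strip_width=2.0)
for centre, ave_y in averages:
    print(centre, ave_y)
```

Swapping the roles of x and y in each pair gives the horizontal-strip version used for the regression of x on y.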

• The regression line for x on y is used to predict x from y.

• It involves thinking about:
  • dividing the scatter diagram into horizontal strips;
  • finding the average value of x in each strip; and
  • smoothing the resulting graph of averages to a straight line.

• The line goes through the point of averages and is more steeply sloped than the SD line by a factor of 1/|r|.
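The slope relationships among the three lines can be written out directly. This is a sketch with a hypothetical helper `line_slopes`; all slopes are measured in the usual (x, y) plane, and r is assumed to be nonzero so that the x-on-y slope is defined.

```python
def line_slopes(r, sd_x, sd_y):
    """Slopes, in the (x, y) plane, of the SD line and both regression lines.

    Assumes r != 0; when r == 0 the x-on-y line is vertical.
    """
    sign = 1 if r > 0 else -1
    return {
        "y on x": r * sd_y / sd_x,        # flatter than the SD line by a factor |r|
        "SD line": sign * sd_y / sd_x,    # slopes up if r > 0, down if r < 0
        "x on y": sd_y / (r * sd_x),      # steeper than the SD line by a factor 1/|r|
    }

# Father-son data: r = 0.5 and both SDs equal 2.7 inches.
slopes = line_slopes(0.5, 2.7, 2.7)
for name, slope in slopes.items():
    print(f"{name}: {slope:.2f}")
```

For the father-son data the three slopes come out as 0.5, 1, and 2, so the SD line sits between the two regression lines.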

• In general the two regression lines are different. You can screw up a prediction badly by using the wrong line!
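To see how much the choice of line matters, consider predicting a father's height from a son's height of 71.4 inches using the father-son summary statistics (averages 67.7 and 68.7, both SDs 2.7, r = 0.5). The sketch below compares the regression line for FH on SH with the mistaken approach of solving the SH-on-FH line backwards; the function names are hypothetical.

```python
FH_AVE, FH_SD = 67.7, 2.7   # father's height (inches)
SH_AVE, SH_SD = 68.7, 2.7   # son's height (inches)
R = 0.5

def father_from_son_right(sh):
    """Correct: the regression line of FH on SH."""
    return FH_AVE + R * (FH_SD / SH_SD) * (sh - SH_AVE)

def father_from_son_wrong(sh):
    """Wrong: solve the SH-on-FH line,
    SH = SH_AVE + R * (SH_SD / FH_SD) * (FH - FH_AVE), for FH."""
    return FH_AVE + (sh - SH_AVE) / (R * SH_SD / FH_SD)

sh = 71.4
right = father_from_son_right(sh)
wrong = father_from_son_wrong(sh)
print(f"right line: {right:.2f} inches, wrong line: {wrong:.2f} inches")
```

The right line gives about 69.05 inches, near the observed average of 69.3; the wrong line gives about 73.1 inches, roughly 4 inches too tall.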

• When will the two lines be the same?

10–20


SUMMARY

• Associated with each increase of one SD in x there is an increase of only r SDs in y, on the average. Plotting these regression estimates gives the regression line of y on x.
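In symbols, this rule can be written as a single equation. The notation here is assumed rather than taken from the handout: ave and SD with subscripts denote the average and standard deviation of each variable, and ŷ is the regression estimate of y at a given x.

```latex
\[
\frac{\hat{y} - \mathrm{ave}_y}{\mathrm{SD}_y}
  = r \cdot \frac{x - \mathrm{ave}_x}{\mathrm{SD}_x}
\qquad\Longleftrightarrow\qquad
\hat{y} = \mathrm{ave}_y + r \cdot \frac{\mathrm{SD}_y}{\mathrm{SD}_x}\,\bigl(x - \mathrm{ave}_x\bigr).
\]
```

The left-hand form says that a value of x that is one SD above average corresponds to an estimate only r SDs above average in y; the right-hand form is the same line written with slope r · SD_y / SD_x through the point of averages.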

• The graph of averages is often close to a straight line, but may be a little bumpy. The regression line smooths out the bumps. If the graph of averages is a straight line, then it coincides with the regression line.

• The regression line can be used to make predictions of one variable from another. But if you have to extrapolate far from the data, or to a different group of subjects, be careful.

• Beware of nonlinear structure and outliers. They can fool the regression method.
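Two tiny made-up data sets show how this can happen. In the sketch below, `correlation` is a hand-written helper that computes r as the average product of the two variables in standard units; the data are invented for illustration.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

# Outlier: four points with no association, then one far-away point added.
xs, ys = [1, 1, 3, 3], [1, 3, 1, 3]
r_without = correlation(xs, ys)
r_with = correlation(xs + [20], ys + [20])

# Nonlinear structure: y is completely determined by x, but not linearly.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x * x for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
print(f"y = x squared:   r = {r_curve:.2f}")
```

A single outlier moves r from 0 to nearly 1, while the parabola has r equal to 0 even though y is an exact function of x. In both cases a regression line would give a misleading picture, which is why it helps to look at the scatter diagram first.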

• Remember the regression effect and the regression fallacy.

• You can draw two different regression lines on a scatter diagram: one predicts y from x; the other predicts x from y.

10–21