Stat 112 -- Notes 4 • Chapter 3.5 • Chapter 3.7

Post on 22-Dec-2015



Teachers’ Salaries and Dating

• In U.S. culture, it is usually considered impolite to ask how much money a person makes.

• However, suppose that you are single and are interested in dating a particular person.

• Of course, salary isn’t the most important factor when considering whom to date, but it certainly is nice to know (especially if it is high!).

• In this case, the person you are interested in happens to be a high school teacher, so you know a high salary isn’t an issue.

• Still, you would like to know how much she or he makes, so you take an informal survey of 11 high school teachers that you know.

Distributions: Salary

[Figure: distribution plot of Salary, axis ticks at 35000, 50000, 60000]

Moments

Mean 50881.818
Std Dev 6491.1968
Std Err Mean 1957.1695
upper 95% Mean 55242.664
lower 95% Mean 46520.973
N 11

Based on these data, what can you conclude? Absent any other information, the best guess for the teacher’s salary is the mean salary, $50,882. But it is likely that this estimate will not be correct. To get an idea of how far off you might be, you can calculate the standard deviation:

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{11}(y_i - \bar{y})^2} = \sqrt{\frac{421437378}{10}} = 6491.82 \]

The standard deviation is the “typical” amount by which an observation deviates from the mean. Thus, your best estimate for your potential date’s salary is $50,882, but a typical estimate will be off by about $6,500.
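As a minimal sketch of the standard deviation calculation, the Python below applies the n - 1 divisor directly and cross-checks it against the standard library. The 11 surveyed salaries are not listed in these notes, so the `salaries` list is hypothetical illustration data, and `sample_std` is a helper name introduced here.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_sq_dev / (n - 1))


# Hypothetical salaries for illustration only (NOT the survey data).
salaries = [42000, 48000, 51000, 55000, 60000]

s = sample_std(salaries)

# The standard library's stdev uses the same n - 1 definition.
assert math.isclose(s, statistics.stdev(salaries))
print(f"mean = {sum(salaries) / len(salaries):.0f}, std dev = {s:.2f}")
```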

• You happen to know that the person you are interested in has been teaching for 8 years.

• How can you use this information to better predict your potential date’s salary?

• Regression Analysis to the Rescue!

• You go back to each of the original 11 teachers you surveyed and ask them for their years of experience.

• Simple Linear Regression Model: \(E(Y|X) = \beta_0 + \beta_1 X\); the distribution of Y given X is normal with mean \(\beta_0 + \beta_1 X\) and standard deviation \(\sigma\).

[Figure: Bivariate Fit of Salary By Years of Experience; scatterplot with fitted line, y-axis Salary (35000 to 65000), x-axis Years of Experience (0 to 12.5)]

Linear Fit

Salary = 40612.135 + 1686.0674 Years of Experience

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93
Mean of Response 50881.82
Observations (or Sum Wgts) 11

• Predicted salary of your potential date who has been a teacher for 8 years = Estimated mean salary for teachers with 8 years of experience = 40612.135 + 1686.0674*8 ≈ $54,100

• How far off will your estimate typically be? Root mean square error = Estimated standard deviation of Y|X = $4,610.93.

• Notice that the typical error of your estimate of teacher salary using experience, $4,610.93, is less than that of using only information on mean teacher salary, $6,491.20.

• Regression analysis enables you to better predict your potential date’s salary.
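The prediction and the comparison of typical errors can be reproduced with a few lines of Python. This is only a sketch built from the summary numbers reported in these notes; the names `predict_salary` and `reduction` are introduced here for illustration.

```python
# Coefficients from the Linear Fit output in the notes.
INTERCEPT = 40612.135
SLOPE = 1686.0674


def predict_salary(years_of_experience):
    """Estimated mean salary for teachers with the given experience."""
    return INTERCEPT + SLOPE * years_of_experience


predicted = predict_salary(8)

# Typical errors reported in the notes.
sd_mean_only = 6491.1968  # standard deviation of salary (mean-only guess)
rmse = 4610.93            # root mean square error of the regression

# Fractional reduction in typical error from using experience.
reduction = 1 - rmse / sd_mean_only

print(f"predicted salary at 8 years: {predicted:.2f}")
print(f"typical error shrinks by about {reduction:.0%}")
```

Under these numbers the typical error falls by a bit under a third when experience is taken into account.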

R Squared

• How much better predictions of your potential date’s salary does the simple linear regression model provide than just using the mean teacher’s salary?

• This is the question that R squared addresses.

• R squared: Number between 0 and 1 that measures how much of the variability in the response the regression model explains.

• R squared close to 0 means that using the regression to predict Y|X isn’t much better than using the mean of Y; R squared close to 1 means that the regression is much better than the mean of Y for predicting Y|X.

Summary of Fit

RSquare 0.545881
RSquare Adj 0.495423
Root Mean Square Error 4610.93

R Squared Formula

• Total sum of squares = \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\) = the sum of squared prediction errors for using the sample mean of Y to predict Y

• Residual sum of squares = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\), where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is the prediction of \(Y_i\) from the least squares line.

\[ R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} \]
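A short Python sketch of the R squared formula follows. The function `r_squared` is a helper introduced here, not part of JMP. Because the raw survey data are not listed in these notes, the second half recovers the two sums of squares from the reported summary statistics, using the standard relationships Total SS = (n - 1) × (Std Dev)² and Residual SS = (n - 2) × RMSE² for simple linear regression.

```python
def r_squared(y, y_hat):
    """(Total SS - Residual SS) / Total SS for observed y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return (total_ss - residual_ss) / total_ss


# Sanity checks on tiny made-up data: perfect predictions give 1,
# predicting the mean everywhere gives 0.
assert r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 1.0
assert r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]) == 0.0

# Recover the sums of squares from the summary output in the notes.
n = 11
std_dev_salary = 6491.1968
rmse = 4610.93

total_ss = (n - 1) * std_dev_salary ** 2
residual_ss = (n - 2) * rmse ** 2
r2_from_summary = (total_ss - residual_ss) / total_ss

print(f"R squared from summary statistics: {r2_from_summary:.4f}")
```

The value recovered this way agrees with the RSquare reported in the Summary of Fit to about four decimal places.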

What’s a good R squared?

• A good \(R^2\) depends on the context. In precise laboratory work, \(R^2\) values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, \(R^2\) values of 50% may be considered remarkably good.

• The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.

More Information About Your Potential Date’s Salary: Prediction Intervals

• From the regression model, you predict that your potential date’s salary is $54,100, and the typical error you expect to make in your prediction is $4,611.

• Suppose you want an interval that will contain your date’s salary most of the time (say, 95% of the time).

• We can find such a prediction interval by using the fact that under the simple linear regression model, the distribution of Y|X is normal. Here, the subpopulation of teachers with 8 years of experience has a normal distribution with estimated mean $54,100 and estimated standard deviation $4,611.

Prediction Interval

• A 95% prediction interval has the property that if we repeatedly take samples \(y_1, \ldots, y_n\) from a population with the simple regression model where \(x_1, \ldots, x_n\) are fixed at their current values and then sample \(y_p\) with \(x = x_p\), the prediction interval will contain \(y_p\) 95% of the time.

Best prediction of \(y_p\):

\[ \hat{Y}_p = \hat{E}(Y \mid X = X_p) = b_0 + b_1 X_p \]

\[ s_p = \mathrm{RMSE}\sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1)s_X^2}}, \qquad s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \]

95% Prediction Interval: \(\hat{Y}_p \pm t_{.025,\,n-2}\, s_p\)

Comment: For large n, the 95% prediction interval is approximately \(\hat{Y}_p \pm 2 \cdot \mathrm{RMSE}\).

Prediction Interval for Your Date’s Salary

• Suppose your date has 8 years of experience.

\[ \hat{Y}_p = 40612.14 + 1686.07 \times 8 = 54100.7 \]

\[ s_p = \mathrm{RMSE}\sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1)s_X^2}} = 4610.93\sqrt{1 + \frac{1}{11} + \frac{(8 - 6.09)^2}{10 \times 2.8445^2}} = 4914.47 \]

95% Prediction Interval:

\[ \hat{Y}_p \pm t_{.025,\,n-2}\, s_p = 54100.7 \pm 2.262 \times 4914.47 = (42984.17,\ 65217.23) \]

Your date’s salary will be in the range (42984.17, 65217.23) most of the time.

We obtain \(\bar{X}\) and \(s_X^2\) from Analyze, Distribution on the X variable.

Distributions: Years of Experience

[Figure: distribution plot of Years of Experience, axis from 0 to 12.5]

Moments

Mean 6.0909091
Std Dev 2.8444523
Std Err Mean 0.8576346
upper 95% Mean 8.0018382
lower 95% Mean 4.17998
N 11
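The prediction interval calculation can be sketched in Python as below. This is an illustrative sketch rather than JMP's implementation: `prediction_interval` is a helper name introduced here, the inputs are the summary values reported in these notes, and the critical value 2.262 (t distribution, 9 degrees of freedom) is hard-coded rather than looked up. Note that the square root covers the whole bracketed term.

```python
import math


def prediction_interval(y_hat, rmse, n, x_p, x_bar, s_x, t_crit):
    """Return (s_p, lower, upper) for an individual prediction at x_p."""
    s_p = rmse * math.sqrt(
        1 + 1 / n + (x_p - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    )
    margin = t_crit * s_p
    return s_p, y_hat - margin, y_hat + margin


# Summary values from the JMP output in the notes.
y_hat = 40612.135 + 1686.0674 * 8   # predicted salary at 8 years
s_p, lower, upper = prediction_interval(
    y_hat,
    rmse=4610.93,
    n=11,
    x_p=8,
    x_bar=6.0909091,   # mean of Years of Experience
    s_x=2.8444523,     # std dev of Years of Experience
    t_crit=2.262,      # t_{.025, 9}, hard-coded assumption
)

# Large-n rule of thumb from the notes: y_hat +/- 2 * RMSE.
approx_lower = y_hat - 2 * 4610.93
approx_upper = y_hat + 2 * 4610.93

print(f"s_p = {s_p:.2f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
print(f"rule-of-thumb interval:  ({approx_lower:.2f}, {approx_upper:.2f})")
```

With only 11 observations the rule-of-thumb interval is noticeably narrower than the exact one, which is why the t-based formula is preferred for small samples.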

Prediction Intervals in JMP

• After using Fit Line, click the red triangle next to Linear Fit and click Confid Curves Indiv.

• Use the crosshair tool (under Tools) to find the exact prediction interval for a particular x value.

[Figure: plot of Salary (35000 to 65000) against Years of Experience (0 to 12.5)]

Association vs. Causality

• A high \(R^2\) means that x has a strong linear relationship with y: there is a strong association between x and y. It does not imply that x causes y.

• Alternative explanations for high \(R^2\):

– The reverse is true: y causes x.

– There may be a lurking (confounding) variable related to both x and y which is the common cause of x and y.

[Figure: Bivariate Fit of Salary of Presbyterian Ministers in MA By Price of Rum; y-axis Salary of Presbyterian Ministers in MA (0 to 50000), x-axis Price of Rum (0 to 12.5), points labeled by year: 1886, 1926, 1954, 1982, 1998]

Are the Presbyterian ministers benefiting from the rum trade or supporting it?

Example

• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community may be able to cover the costs of increased police protection by gains in tax revenues from higher property values. Data on the average housing price and crime rate (per 1000 population) for communities in Pennsylvania near Philadelphia for 1996 are shown in housecrime.JMP.

[Figure: Bivariate Fit of HousePrice By CrimeRate; y-axis HousePrice (0 to 500000), x-axis CrimeRate (10 to 70)]

Summary of Fit

RSquare 0.184229
RSquare Adj 0.175731
Root Mean Square Error 78861.53
Mean of Response 158464.5
Observations (or Sum Wgts) 98

Parameter Estimates

Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   225233.55    16404.02    13.73     <.0001
CrimeRate   -2288.689    491.5375    -4.66     <.0001

[Figure: residual plot; y-axis Residual (-100000 to 300000), x-axis CrimeRate (10 to 70)]

[Figure: Distributions, Residuals HousePrice; axis from -100000 to 300000]

Questions

1. Can you deduce a cause-and-effect relationship from these data? What are other explanations for the association between housing prices and crime rate other than that high crime rates cause low housing prices?

2. Does the simple linear regression model appear to hold?

Extrapolation

• When constructing estimates of \(E(Y \mid X_{new})\) or predicting individual values of a dependent variable based on \(x_{new}\), caution must be used if \(x_{new}\) is outside the range of the observed x’s. The data do not provide information about whether the simple linear regression model continues to hold outside of the range of the observed x’s.

• Example: The crime rate in Center City Philadelphia is 366.1. Does the simple linear regression model fit from housecrimerate.JMP provide an accurate prediction of the average house price in Center City?
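To see what extrapolation does in the Center City example, the sketch below plugs a crime rate of 366.1 into the fitted line from the Parameter Estimates. The function names are introduced here for illustration, and the observed range of 10 to 70 is an assumption read off the CrimeRate axis of the scatterplot, not the exact minimum and maximum in the data.

```python
# Coefficients from the Parameter Estimates in the notes.
INTERCEPT = 225233.55
SLOPE = -2288.689

# ASSUMPTION: approximate observed range, taken from the plot's axis ticks.
OBSERVED_MIN, OBSERVED_MAX = 10, 70


def predict_house_price(crime_rate):
    """Estimated mean house price from the fitted least squares line."""
    return INTERCEPT + SLOPE * crime_rate


def is_extrapolation(crime_rate):
    """True when crime_rate lies outside the (assumed) observed range."""
    return not (OBSERVED_MIN <= crime_rate <= OBSERVED_MAX)


center_city_rate = 366.1
price = predict_house_price(center_city_rate)

print(f"extrapolating: {is_extrapolation(center_city_rate)}")
print(f"predicted average house price: {price:.2f}")
```

The fitted line returns a negative average house price at this crime rate, which is impossible and shows that the linear model cannot be trusted this far outside the observed x’s.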