StatR Session 14, Jan 11
TRANSCRIPT
Correlation and Regression Analysis: Learning Objectives
• Explain the purpose of regression analysis and the meaning of independent versus dependent variables.
• Compute the equation of a simple regression line from a sample of data, and interpret the slope and intercept of the equation.
• Estimate values of Y to forecast outcomes using the regression model.
• Use residual analysis to test the assumptions underlying the regression model and to examine the fit of the regression line.
• Compute a standard error of the estimate and interpret its meaning.
• Compute a coefficient of determination and interpret it.
Correlation
• Correlation is a measure of the degree of relatedness of variables.
• Coefficient of Correlation (r) - applicable only if both variables being analyzed have at least an interval level of data.
Three Degrees of Correlation
[Figure: three scatter plots illustrating negative correlation (r < 0), no correlation (r = 0), and positive correlation (r > 0)]
Degree of Correlation
• The term r is a measure of the linear correlation of two variables.
• The number ranges from -1 to 0 to +1.
– Positive correlation: as one variable increases, the other variable increases.
– Negative correlation: as one variable increases, the other one decreases.
– No correlation: the value of r is close to 0.
– The closer r is to +1 or -1, the higher the correlation between the two variables.
Pearson Product-Moment Correlation Coefficient

r = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum(X - \bar{X})^2 \sum(Y - \bar{Y})^2}} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}
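As a quick illustration (in Python rather than the R used in the course lab), the raw-score form of the formula above can be sketched directly; the data values here are made up for the example.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation via the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den

# Perfectly linear increasing data gives r = +1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Note that this requires interval-level data for both variables, as stated above.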
Regression Analysis
• Regression analysis is the process of constructing a mathematical model or function that can be used to predict or determine one variable by another variable or variables.
Simple Regression Analysis
• Bivariate (two variables) linear regression -- the most elementary regression model
– Dependent variable: the variable to be predicted, usually called Y
– Independent variable: the predictor or explanatory variable, usually called X
– Usually the first step in this analysis is to construct a scatter plot of the data
• Nonlinear relationships and regression models with more than one independent variable can be explored by using multiple regression models
Regression Models
• Deterministic Regression Model -- produces an exact output:

y = \beta_0 + \beta_1 x

• Probabilistic Regression Model -- includes an error term:

y = \beta_0 + \beta_1 x + \epsilon

• \beta_0 and \beta_1 are population parameters
• \beta_0 and \beta_1 are estimated by the sample statistics b_0 and b_1
Equation of the Simple Regression Line
The equation of the simple regression line is

\hat{y} = b_0 + b_1 x

where b_1 is the slope and b_0 is the y-intercept.

[Figure: a typical regression line plotted against X and Y, showing the y-intercept b_0 and the slope]
Least Squares Analysis
• Least squares analysis is a process whereby a regression model is developed by producing the minimum sum of the squared error values.
• The vertical distance from each point to the line is the error of the prediction.
• The least squares regression line is the regression line that results in the smallest sum of errors squared.
Least Squares Analysis

b_1 = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}}

b_0 = \bar{Y} - b_1\bar{X} = \frac{\sum Y}{n} - b_1\frac{\sum X}{n}
Least Squares Analysis

SS_{XY} = \sum(X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n}

SS_{XX} = \sum(X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}

b_1 = \frac{SS_{XY}}{SS_{XX}}

b_0 = \bar{Y} - b_1\bar{X} = \frac{\sum Y}{n} - b_1\frac{\sum X}{n}
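The SS_XY / SS_XX shortcut formulas translate directly into code. A minimal Python sketch (the data set here is invented purely to check the arithmetic against a known line y = 1 + 2x):

```python
def least_squares(x, y):
    """Slope b1 and intercept b0 via the SS_XY / SS_XX shortcut formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Data generated from y = 1 + 2x, so the fit recovers those coefficients
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```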
Solving for b0 and b1 of the Regression Line: Airline Cost Data
The airline cost data include the costs and associated numbers of passengers for twelve 500-mile commercial airline flights using Boeing 737s during the same season of the year.

Number of Passengers   Cost ($1,000)
61                     4.280
63                     4.080
67                     4.420
69                     4.170
70                     4.480
74                     4.300
76                     4.820
81                     4.700
86                     5.110
91                     5.130
95                     5.640
97                     5.560
Solving for b0 and b1 of the Regression Line: Airline Cost Example (Part 1)

Number of Passengers (x)   Cost ($1,000) (y)   x²      xy
61                         4.28                3,721   261.08
63                         4.08                3,969   257.04
67                         4.42                4,489   296.14
69                         4.17                4,761   287.73
70                         4.48                4,900   313.60
74                         4.30                5,476   318.20
76                         4.82                5,776   366.32
81                         4.70                6,561   380.70
86                         5.11                7,396   439.46
91                         5.13                8,281   466.83
95                         5.64                9,025   535.80
97                         5.56                9,409   539.32
Σx = 930                   Σy = 56.69          Σx² = 73,764   Σxy = 4,462.22
Solving for b0 and b1 of the Regression Line: Airline Cost Example (Part 2)

SS_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 4{,}462.22 - \frac{(930)(56.69)}{12} = 68.745

SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n} = 73{,}764 - \frac{(930)^2}{12} = 1{,}689

b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{68.745}{1{,}689} = .0407

b_0 = \bar{Y} - b_1\bar{X} = \frac{\sum Y}{n} - b_1\frac{\sum X}{n} = \frac{56.69}{12} - (.0407)\frac{930}{12} = 1.57

\hat{Y} = 1.57 + .0407X
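The worked example can be reproduced in a few lines of Python (the course lab uses R, but the arithmetic is the same):

```python
# Airline cost data: passengers and cost (in $1,000s) for 12 flights
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70,
        5.11, 5.13, 5.64, 5.56]

n = len(passengers)
sum_x, sum_y = sum(passengers), sum(cost)
sum_xy = sum(x * y for x, y in zip(passengers, cost))
sum_x2 = sum(x * x for x in passengers)

ss_xy = sum_xy - sum_x * sum_y / n    # 68.745
ss_xx = sum_x2 - sum_x ** 2 / n       # 1,689
b1 = ss_xy / ss_xx                    # slope
b0 = sum_y / n - b1 * sum_x / n       # intercept

print(round(b1, 4), round(b0, 2))  # 0.0407 1.57
```

This confirms the fitted line ŷ = 1.57 + .0407x from the slide.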
Residual Analysis
• A residual is the difference between an actual value and the predicted value, i.e., Y - \hat{Y}.
• Reflects the error of the regression line at any given point.
Residual Analysis: Airline Cost Example

Number of Passengers (X)   Cost ($1,000) (Y)   Predicted Value (Ŷ)   Residual (Y - Ŷ)
61                         4.28                4.053                  .227
63                         4.08                4.134                 -.054
67                         4.42                4.297                  .123
69                         4.17                4.378                 -.208
70                         4.48                4.419                  .061
74                         4.30                4.582                 -.282
76                         4.82                4.663                  .157
81                         4.70                4.867                 -.167
86                         5.11                5.070                  .040
91                         5.13                5.274                 -.144
95                         5.64                5.436                  .204
97                         5.56                5.518                  .042
Σ(Y - Ŷ) = -.001
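A small Python sketch of the residual calculation, using the fitted line ŷ = 1.57 + .0407x from the worked example; the sum is approximately zero, as a least squares fit requires (the -.001 reflects rounding in the coefficients):

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70,
        5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407  # rounded coefficients from the worked example

# Residual = actual Y minus predicted Y at each point
residuals = [y - (b0 + b1 * x) for x, y in zip(passengers, cost)]
print(round(sum(residuals), 3))  # -0.001
```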
Residual Analysis: Airline Cost Example
Outliers: Data points that lie apart from the rest of the points. They can produce large residuals and affect the regression line.
Using Residuals to Test the Assumptions of the Regression Model
• The assumptions of the regression model:
– The model is linear
– The error terms have constant variances
– The error terms are independent
– The error terms are normally distributed
Using Residuals to Test the Assumptions of the Regression Model
• [Figure: residual plot with a curved, nonlinear pattern] The assumption that the regression model is linear does not hold for a residual plot showing such a pattern.
• [Figure: two funnel-shaped residual plots] In figure (a), the error variance is greater for smaller values of x and smaller for larger values of x, and vice versa in figure (b). This is a case of heteroscedasticity.
Standard Error of the Estimate
• Residuals represent errors of estimation for individual points.
• A more useful measurement of error is the standard error of the estimate.
• The standard error of the estimate, denoted by s_e, is a standard deviation of the error of the regression model.
Standard Error of the Estimate

Sum of squares error:

SSE = \sum(Y - \hat{Y})^2 = \sum Y^2 - b_0\sum Y - b_1\sum XY

Standard error of the estimate:

s_e = \sqrt{\frac{SSE}{n - 2}}
Determining SSE for the Airline Cost Data Example

Number of Passengers (X)   Cost ($1,000) (Y)   Residual (Y - Ŷ)   (Y - Ŷ)²
61                         4.28                 .227               .05153
63                         4.08                -.054               .00292
67                         4.42                 .123               .01513
69                         4.17                -.208               .04326
70                         4.48                 .061               .00372
74                         4.30                -.282               .07952
76                         4.82                 .157               .02465
81                         4.70                -.167               .02789
86                         5.11                 .040               .00160
91                         5.13                -.144               .02074
95                         5.64                 .204               .04162
97                         5.56                 .042               .00176
Σ(Y - Ŷ) = -.001           Σ(Y - Ŷ)² = .31434

Sum of squares of error = SSE = .31434
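The slide stops at SSE, but the standard error of the estimate follows in one more step. A Python sketch of both, using the fitted line from the example (small differences from .31434 come from rounding the predicted values in the table to three decimals):

```python
import math

passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70,
        5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407  # fitted line from the worked example

# SSE = sum of squared residuals
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))

# Standard error of the estimate: s_e = sqrt(SSE / (n - 2))
se = math.sqrt(sse / (len(passengers) - 2))

print(round(sse, 3), round(se, 3))  # 0.314 0.177
```

So a typical prediction error for this model is about $177 (0.177 thousand dollars).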
Coefficient of Determination (r²)
• The coefficient of determination is the proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x)
• The coefficient of determination ranges from 0 to 1.
• An r² of zero means that the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x.
• An r² of 1 means perfect prediction of y by x and that 100% of the variability of y is accounted for by x.
Coefficient of Determination (r²)

Total variation:

SS_{YY} = \sum(Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}

Total variation = explained variation + unexplained variation:

SS_{YY} = SSR + SSE

r^2 = \frac{SSR}{SS_{YY}} = 1 - \frac{SSE}{SS_{YY}}, \quad 0 \le r^2 \le 1
Coefficient of Determination (r²) for the Airline Cost Example

SSE = .31434

SS_{YY} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 270.9251 - \frac{(56.69)^2}{12} = 3.11209

r^2 = 1 - \frac{SSE}{SS_{YY}} = 1 - \frac{.31434}{3.11209} = .899

89.9% of the variability of the cost of flying a Boeing 737 is accounted for by the number of passengers.
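The r² calculation for the airline data can be sketched in Python as well, using r² = 1 - SSE/SS_YY as above:

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70,
        5.11, 5.13, 5.64, 5.56]
b0, b1 = 1.57, 0.0407  # fitted line from the worked example
n = len(cost)

# Unexplained variation: sum of squared residuals
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))

# Total variation: SS_YY = sum(Y^2) - (sum(Y))^2 / n
ss_yy = sum(y * y for y in cost) - sum(cost) ** 2 / n

r2 = 1 - sse / ss_yy
print(round(r2, 3))  # 0.899
```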
Relation between r and r²
• The coefficient of determination (r²) is the square of the coefficient of correlation (r).
• r² is always positive.
• r may be positive or negative.
• The researcher must examine the sign of the slope of the regression line to determine whether a positive or negative relationship exists between the variables.
Exercise in R: Linear Regression
Open URL: www.openintro.org
Go to Labs in R and select 7 - Linear Regression