1
Data Analysis
Linear Regression
Ernesto A. Diaz
Department of Mathematics
Redwood High School
2
Let us pause for a few moments…
What are we working on in this chapter?
3
Problem Statement
If we have a scatter plot that seems “linear”, can we find an equation that generates similar data? How accurate will it be?
4
Regression
One important branch of inferential statistics, called regression analysis, is used to compare quantities or variables, to discover relationships that exist between them, and to formulate those relationships in useful ways.
5
Linear Regression
Once a scatter diagram has been produced, we can draw a curve that best fits the pattern exhibited by the sample points.
The best-fitting curve for the sample points is called an estimated regression curve. If the points in the scatter diagram seem to lie approximately along a straight line, the relationship is assumed to be linear, and the line that best fits the data points is called the estimated regression line.
6
Linear Regression
Linear regression is the process of determining the linear relationship between two variables.
If we assume that the best-fitting curve is a line, then the equation of that line will take the form
y = ax + b,
where a is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” a and b.
7
Regression, 1st approach
8
2nd Approach: Med-Med Line
9
How do we evaluate accuracy?
Root Mean Square Error (RMS)
$$\mathrm{RMS} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$
Sum of Squares of Residuals (SSres)
$$SS_{RES} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
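The two accuracy measures can be sketched in a few lines of Python. This is only an illustration: the function names and the small data set are hypothetical, and the RMS function assumes the n - 2 denominator that the slide's formula appears to use.

```python
import math

def ss_res(ys, y_hats):
    """Sum of squared residuals: the sum of (y_i - y_hat_i)^2."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))

def rms_error(ys, y_hats):
    """Root mean square error, assuming an n - 2 denominator."""
    n = len(ys)
    return math.sqrt(ss_res(ys, y_hats) / (n - 2))

# Hypothetical observed values and model predictions, for illustration only.
ys = [2.0, 4.0, 5.0, 8.0]
y_hats = [2.5, 3.5, 6.0, 7.0]

print(ss_res(ys, y_hats))               # 2.5
print(round(rms_error(ys, y_hats), 3))  # 1.118
```

A smaller value of either measure means the predicted values sit closer to the observed ones.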
10
3rd Approach: Least Squares
For each x-value in the data set, the corresponding y-value usually differs from the value it would have if the data point were exactly on the line. These differences are shown in the figure by vertical line segments. The most common procedure is to choose the line where the sum of the squares of all these differences is minimized. This is called the method of least squares, and the resulting line is called the least squares line.
11
Linear Regression
Linear regression is the process of determining the linear relationship between two variables.
The line of best fit (regression line or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points (on a scatter diagram) is a minimum.
12
Linear Regression Formulas
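The least squares line (regression line) that provides the best fit to the data points (x1, y1), (x2, y2), …, (xn, yn) has the equation
$$y = mx + b, \quad \text{where} \quad m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad \text{and} \quad b = \frac{\sum y - m\sum x}{n}.$$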
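For readers who know some calculus, the formulas for m and b can be recovered by minimizing the sum of squared vertical distances. The derivation below is a standard sketch, not something taken from the slides; the name S(m, b) is introduced here only for illustration.

```latex
% S(m, b) is the sum of squared vertical distances from the line to the points.
S(m, b) = \sum_{i=1}^{n} (y_i - m x_i - b)^2

% Setting both partial derivatives to zero gives two equations in m and b.
\frac{\partial S}{\partial b} = -2 \sum (y_i - m x_i - b) = 0
  \;\Longrightarrow\; \sum y = m \sum x + n b
\frac{\partial S}{\partial m} = -2 \sum x_i (y_i - m x_i - b) = 0
  \;\Longrightarrow\; \sum xy = m \sum x^2 + b \sum x

% Solve the first equation for b, substitute into the second, multiply by n.
b = \frac{\sum y - m \sum x}{n}
n \sum xy = n m \sum x^2 + \sum x \sum y - m \left(\sum x\right)^2
m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}
```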
13
Med-Med vs. Least Squares
The Median-Median Line is sometimes called the resistant line because it is not very influenced by one or two “bad” data points.
The Least Squares Line uses every point in its calculation, so it is affected by outliers.
14
Example 1: Regression
Suppose that we wish to get an idea of how the number of hours spent preparing for a final exam relates to the score on the exam. The data collected are shown below.
Hours 1 2 3 4 5 6 7 8 9 10
Score 50 62 62 74 70 86 78 90 96 94
15
Linear Regression
The first step in analyzing these data is to graph the results as shown in the scatter diagram on the next slide.
16
Scatter Diagram
[Figure: scatter diagram of Exam Score (vertical axis, 0 to 120) against Hours Studying (horizontal axis, 0 to 15)]
17
Linear Regression
If we let x denote hours studying and y denote exam score in the data of the previous slide and assume that the best-fitting curve is a line, then the equation of that line will take the form
y = mx + b,
where m is the slope of the line and b is the y-coordinate of the y-intercept. To identify the estimated regression line, we must find the values of the “regression coefficients” m and b.
18
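Example 1: Computing a Least Squares Line
Solution
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(4592) - (55)(762)}{10(385) - (55)^2} \approx 4.86$$
$$b = \frac{\sum y - m\sum x}{n} = \frac{762 - (4.86)(55)}{10} \approx 49.47$$
The equation is y = 4.86x + 49.47.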
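The sums used in this solution can be checked with a short Python sketch. The function name `least_squares_line` is hypothetical; the data are the hours and scores from Example 1.

```python
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 62, 62, 74, 70, 86, 78, 90, 96, 94]

def least_squares_line(xs, ys):
    """Return slope m and intercept b of the least squares line."""
    n = len(xs)
    sum_x = sum(xs)                               # 55 for the exam data
    sum_y = sum(ys)                               # 762
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # 4592
    sum_x2 = sum(x * x for x in xs)               # 385
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

m, b = least_squares_line(hours, scores)
print(round(m, 2), round(b, 2))  # 4.86 49.47
```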
19
Estimated Regression Line
[Figure: estimated regression line on the scatter diagram of Exam Score (vertical axis, 0 to 120) against Hours Studying (horizontal axis, 0 to 15)]
20
Example: Med-Med vs. Best Fit
Using Dobbie, find the estimated regression line using both methods.
Hours 1 2 3 4 5 6 7 8 9 10
Score 50 62 62 74 70 86 78 90 96 94
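The slides do not spell out the Med-Med procedure, so the following is a sketch of one common version of it: sort the points by x, split them into three groups, take the median x and median y of each group, use the two outer summary points for the slope, and average the three intercepts. The function name `med_med_line` and the group-splitting rule are assumptions made here.

```python
from statistics import median

def med_med_line(xs, ys):
    """Median-median line: returns slope m and intercept b."""
    pts = sorted(zip(xs, ys))
    n = len(pts)
    base, extra = divmod(n, 3)
    # Assumed convention: with one leftover point the middle group gets it;
    # with two leftover points each outer group gets one.
    outer = base + (1 if extra == 2 else 0)
    groups = [pts[:outer], pts[outer:n - outer], pts[n - outer:]]
    # Summary point of each group: (median of x-values, median of y-values).
    (x1, y1), (x2, y2), (x3, y3) = [
        (median(x for x, _ in g), median(y for _, y in g)) for g in groups
    ]
    m = (y3 - y1) / (x3 - x1)
    # Averaging the three intercepts moves the line one third of the way
    # from the outer summary points toward the middle one.
    b = ((y1 - m * x1) + (y2 - m * x2) + (y3 - m * x3)) / 3
    return m, b

hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 62, 62, 74, 70, 86, 78, 90, 96, 94]
m, b = med_med_line(hours, scores)
print(round(m, 2), round(b, 2))  # 4.57 52.19
```

For this data set the groups contain 3, 4, and 3 points, with summary points (2, 62), (5.5, 76), and (9, 94).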
21
Example 2: Predicting from a Regression Line
Use the result from the previous example to predict the exam score for a student who studied 6.5 hours.
I) Med-Med: Use the equation $\hat{y} = 4.57x + 52.19$ and replace x with 6.5.
$\hat{y} = 4.57(6.5) + 52.19 \approx 81.90$
Based on the given data, the student should score about 82%.
II) Best Fit: Use the equation $\hat{y} = 4.86x + 49.47$ and replace x with 6.5.
$\hat{y} = 4.86(6.5) + 49.47 \approx 81.06$
Based on the given data, the student should score about 81%.
22
Copyright © 2005 Pearson Education, Inc.
13.8
Linear Correlation and Regression
Slide 13-23
Linear Correlation
Linear correlation is used to determine whether there is a relationship between two quantities and, if so, how strong the relationship is.
The linear correlation coefficient, r, is a unitless measure that describes the strength of the linear relationship between two variables.
If the value is positive, as one variable increases, the other increases.
If the value is negative, as one variable increases, the other decreases.
The variable, r, will always be a value between –1 and 1 inclusive.
Slide 13-24
Scatter Diagrams
A visual aid used with correlation is the scatter diagram, a plot of points (bivariate data).
The independent variable, x, generally is a quantity that can be controlled. The dependent variable, y, is the other variable.
The value of r is a measure of how far a set of points varies from a straight line. The greater the spread, the weaker the correlation and the closer the r value is to 0.
Slide 13-25
Correlation
Slide 13-26
Correlation
Slide 13-27
Linear Correlation Coefficient
The formula to calculate the correlation coefficient (r) is as follows:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
Slide 13-28
Example: Words Per Minute versus Mistakes
There are five applicants applying for a job as a medical transcriptionist. The following shows the results of the applicants when asked to type a chart. Determine the correlation coefficient between the words per minute typed and the number of mistakes.
Applicant   Words per Minute   Mistakes
Ellen       24                 8
George      67                 11
Phillip     53                 12
Kendra      41                 10
Nancy       34                 9
Slide 13-29
Solution
We will call the words typed per minute x and the mistakes y. List the values of x and y and calculate the necessary sums.
WPM (x)    Mistakes (y)   x²             y²          xy
24         8              576            64          192
67         11             4489           121         737
53         12             2809           144         636
41         10             1681           100         410
34         9              1156           81          306
Σx = 219   Σy = 50        Σx² = 10,711   Σy² = 510   Σxy = 2,281
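Slide 13-30
Solution continued
The n in the formula represents the number of pieces of data. Here n = 5.
$$\begin{aligned}
r &= \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \\
  &= \frac{5(2281) - (219)(50)}{\sqrt{5(10{,}711) - (219)^2}\,\sqrt{5(510) - (50)^2}} \\
  &= \frac{11{,}405 - 10{,}950}{\sqrt{53{,}555 - 47{,}961}\,\sqrt{2550 - 2500}} \\
  &= \frac{455}{\sqrt{5594}\,\sqrt{50}} \approx 0.86
\end{aligned}$$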
Slide 13-31
Solution continued
Since 0.86 is fairly close to 1, there is a fairly strong positive correlation.
This result suggests that, for these applicants, the more words typed per minute, the more mistakes made.
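The calculation of r can be reproduced with a short Python sketch. The function name `correlation` is hypothetical; the data are the words per minute and mistakes from the example.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r, computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Words per minute and mistakes for the five applicants.
wpm = [24, 67, 53, 41, 34]
mistakes = [8, 11, 12, 10, 9]

r = correlation(wpm, mistakes)
print(round(r, 2))  # 0.86
```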
Slide 13-32
Linear Regression
Linear regression is the process of determining the linear relationship between two variables.
The line of best fit (line of regression or the least squares line) is the line such that the sum of the squares of the vertical distances from the line to the data points is a minimum.
Slide 13-33
The Line of Best Fit
Equation:
$$y = mx + b, \quad \text{where} \quad m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \quad \text{and} \quad b = \frac{\sum y - m\sum x}{n}.$$
Slide 13-34
Example
Use the data in the previous example to find the equation of the line that relates the number of words per minute and the number of mistakes made while typing a chart.
Graph the equation of the line of best fit on a scatter diagram that illustrates the set of bivariate points.
Slide 13-35
Solution
From the previous results, we know that
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(2{,}281) - (219)(50)}{5(10{,}711) - (219)^2} = \frac{455}{5594} \approx 0.081$$
Now we find the y-intercept, b.
$$b = \frac{\sum y - m\sum x}{n} = \frac{50 - 0.081(219)}{5} = \frac{32.261}{5} \approx 6.452$$
Therefore the line of best fit is y = 0.081x + 6.452.
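The intercept above is computed from the rounded slope 0.081. As a hedged aside, the sketch below repeats the arithmetic from the sums in the example and also shows what happens if the unrounded slope is carried through; the variable names are chosen here for illustration.

```python
# Sums from the words-per-minute example.
n, sum_x, sum_y, sum_xy, sum_x2 = 5, 219, 50, 2281, 10711

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # 455 / 5594

# Intercept as on the slide, using the slope rounded to three decimals.
b_from_rounded_m = (sum_y - round(m, 3) * sum_x) / n
# Intercept when the full-precision slope is used instead.
b_from_full_m = (sum_y - m * sum_x) / n

print(round(m, 3))                 # 0.081
print(round(b_from_rounded_m, 3))  # 6.452
print(round(b_from_full_m, 3))     # 6.437
```

The two intercepts differ in the second decimal place, which is one reason software output may not match a hand calculation exactly.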
Slide 13-36
Solution continued
To graph y = 0.081x + 6.452, plot at least two points and draw the graph.
x    y
10   7.262
20   8.072
30   8.882
Slide 13-37
Solution continued