Post on 15-Jan-2016
1
Chapter 2: Looking at Data - Relationships
http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/
2
General Procedure
1. Plot the data.
2. Look for the overall pattern.
3. Calculate a numeric summary.
4. Answer the question (which will be defined shortly).
3
2.1: Relationships - Goals
• Be able to define what is meant by an association between variables.
• Be able to categorize whether a variable is a response variable or an explanatory variable.
• Be able to identify the key characteristics of a data set.
4
Questions
• What objects do the data describe?
• What variables are present and how are they measured?
• Are all of the variables quantitative?
• Are the variables associated with each other?
5
Association (cont.)
Two variables are associated if knowing the values of one of the variables tells you something about the values of the other variable.
1. Do you want to explore the association?
2. Do you want to show causality?
6
Variable Types
• Response variable (Y): outcome of the study
• Explanatory variable (X): explains or causes changes in the response variable
7
Key Characteristics of Data
• Cases: Identify what they are and how many
• Label: Identify what the label variable is (if present)
• Categorical or quantitative: Classify each variable as categorical or quantitative.
• Values: Identify the possible values for each variable.
• Explanatory or response: Classify each variable as explanatory or response.
8
2.2: Scatterplots - Goals
• Be able to create a scatterplot (lab)
• Be able to interpret a scatterplot
  – Pattern
  – Outliers
  – Form, direction and strength of a relationship
• Be able to interpret scatterplots that include categorical variables.
9
Scatterplot - Procedure
1. Decide which variable is the explanatory variable and put it on the X axis. The response variable goes on the Y axis.
2. Label and scale your axes.
3. Plot the (x,y) pairs.
10
Example: Scatterplot
The following data are used to examine the relationship between age and change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.
a) Draw a scatterplot of these data.
Obs   1   2   3   4   5   6   7   8   9  10  11
Age  70  51  65  70  48  70  45  48  35  48  30
BP  -28 -10  -8 -15  -8 -10 -12   3   1  -5   8
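As a rough illustration of part (a), the sketch below places the eleven (Age, BP) pairs on a character grid using only the Python standard library. The helper `ascii_scatter` and its grid size are assumptions made for this example; in the lab a plotting package would normally be used instead.

```python
# Data copied from the example table above.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]   # explanatory variable (x)
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]  # response variable (y)


def ascii_scatter(xs, ys, width=46, height=13):
    """Return text rows with '*' marking each (x, y) pair.

    Hypothetical helper for illustration only. Points that land in the
    same character cell share a single marker.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        # Row 0 is the top of the plot, so larger y values get smaller rows.
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return ["".join(cells) for cells in grid]


rows = ascii_scatter(age, bp)
print("BP (vertical) versus Age (horizontal)")
for line in rows:
    print("|" + line)
print("+" + "-" * len(rows[0]))
```

Age goes on the horizontal axis because it is the explanatory variable; the change in BP is the response.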
11
Example: Scatterplot (cont)
[Figure: scatterplot of BP (vertical axis) versus Age (horizontal axis)]
12
Pattern
• Form• Direction• Strength• Outliers
13
Pattern
[Figure panels: Linear; Nonlinear; No relationship]
14
Outliers
15
Example: Scatterplot (cont)
[Figure: scatterplot of BP (vertical axis) versus Age (horizontal axis)]
16
Scatterplot with Categorical Variables
http://statland.org/Software_Help/Minitab/MTBpul2.htm
17
I am a Turkey, not Tukey! Thank you for not eating me!
18
2.3: Correlation - Goals
• Be able to use (and calculate) the correlation to describe the direction and strength of a linear relationship.
• Be able to recognize the properties of the correlation.
• Be able to determine when (and when not) you can use correlation to measure the association.
19
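Sample correlation, r (Pearson’s Sample Correlation Coefficient)

$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}
  = \frac{SS_{xy}}{\sqrt{SS_{xx}}\,\sqrt{SS_{yy}}}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}
  = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
$$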
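A minimal sketch of this calculation, applied to the Age and BP data from the scatterplot example. The function name `correlation` is an assumption made for this illustration; it follows the sums-of-squares form of the formula.

```python
from math import sqrt

# Data copied from the scatterplot example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def correlation(xs, ys):
    """Pearson sample correlation via SS_xy / (sqrt(SS_xx) * sqrt(SS_yy))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


r = correlation(age, bp)
print(round(r, 5))  # -0.76951, the value quoted later in the regression example
```

The negative sign says that larger ages tend to go with larger drops in blood pressure in this sample.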
20
Sum of Squares
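A sketch of the quantities this heading refers to, assuming the standard definitions that make the equalities in the correlation formula on the previous slide consistent:

```latex
SS_{xx} = \sum (x_i - \bar{x})^2 = (n-1)\,s_x^2, \qquad
SS_{yy} = \sum (y_i - \bar{y})^2 = (n-1)\,s_y^2, \qquad
SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})
```

With these definitions, $\sqrt{SS_{xx}}\,\sqrt{SS_{yy}} = (n-1)\,s_x s_y$, which is why the second and third forms of $r$ agree.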
21
Properties of Correlation
• r > 0 ==> positive association
  r < 0 ==> negative association
• r is always a number between -1 and 1.
• The strength of the linear relationship increases as |r| moves to 1.
  – |r| = 1 only occurs if there is a perfect linear relationship
  – r = 0 ==> x and y are uncorrelated.
22
Positive/Negative Correlation
23
Example: Positive/Negative Correlation
1) Would the correlation between the age of a used car and its price be positive or negative? Why?
2) Would the correlation between the weight of a vehicle and miles per gallon be positive or negative? Why?
25
Variety of Correlation Values
26
Value of r
28
Comments about Correlation
• Correlation makes no distinction between explanatory and response variables.
• r has no units and does not change when the units of x and y change.
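Both comments can be checked numerically on the Age and BP data from the scatterplot example. In this sketch the function `correlation` is a name chosen for the illustration, and the unit changes (years to months, mm Hg to kPa) are arbitrary choices.

```python
from math import sqrt

age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def correlation(xs, ys):
    """Pearson sample correlation computed from sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


r_original = correlation(age, bp)

# Swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(bp, age)

# Changing units (years -> months, mm Hg -> kPa) leaves r unchanged.
age_months = [12 * a for a in age]
bp_kpa = [0.133322 * b for b in bp]  # 1 mm Hg is about 0.133322 kPa
r_rescaled = correlation(age_months, bp_kpa)

print(r_original, r_swapped, r_rescaled)
```

The three printed values agree up to floating-point rounding.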
29
Cautions about Correlation
• Correlation requires that both variables be quantitative.
• Correlation measures the strength of LINEAR relationships only.
• The correlation is not resistant to outliers.
• Correlation is not a complete summary of bivariate data.
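To illustrate the lack of resistance to outliers, the sketch below uses a small made-up data set (not from the slides) that is almost perfectly linear, then adds a single hypothetical outlying point and recomputes r.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson sample correlation computed from sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# Hypothetical data, roughly y = 2x (made up for this illustration).
x_clean = [1, 2, 3, 4, 5, 6]
y_clean = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# The same data with one hypothetical outlier appended.
x_outlier = x_clean + [7]
y_outlier = y_clean + [0.0]

r_clean = correlation(x_clean, y_clean)
r_outlier = correlation(x_outlier, y_outlier)
print(round(r_clean, 3), round(r_outlier, 3))
```

One point out of seven is enough to pull r from nearly 1 down to a weak positive value, which is why the data should be plotted before r is trusted.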
30
Datasets with r = 0.816
31
Questions about Correlation
• Does a small r indicate that x and y are NOT associated?
• Does a large r indicate that x and y are linearly associated?
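Two made-up data sets (assumptions for this sketch, not from the slides) suggest that the answer to both questions is "not necessarily": an exact quadratic relationship on symmetric x values has r = 0, and a clearly curved relationship can still have r close to 1.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson sample correlation computed from sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# y is completely determined by x, yet the linear correlation is zero.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sym = [x * x for x in x_sym]
r_sym = correlation(x_sym, y_sym)

# The relationship is curved (y = x^2), yet r is large.
x_pos = list(range(1, 11))
y_pos = [x * x for x in x_pos]
r_pos = correlation(x_pos, y_pos)

print(r_sym, round(r_pos, 3))
```

A small r rules out a linear pattern only, and a large r does not guarantee that a straight line is the right form; a scatterplot settles both questions.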
32
2.4: Least-Squares Regression - Goals
• Be able to generally describe the method of ‘Least Squares Regression’
• Be able to calculate and interpret the regression line.
• Using the least-squares regression line, be able to predict the value of y for any appropriate value of x.
• Be able to calculate $r^2$.
• Be able to explain the meaning of $r^2$.
  – Be able to discern what $r^2$ does NOT explain.
33
Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
We can use a regression line to predict the value of y for a given value of x.
34
Idea of Linear Regression
35
Linear Regression
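$$
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} = r\,\frac{s_y}{s_x}
$$

$$
b_0 = \bar{y} - b_1 \bar{x}
$$

$$
\hat{y} = b_0 + b_1 x
$$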
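The slope and intercept formulas can be sketched as a short function. The name `fit_line` and the four test points are assumptions for this illustration; the points lie exactly on y = 1 + 2x, so the fit should recover that line.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx        # slope: SS_xy / SS_xx
    b0 = y_bar - b1 * x_bar   # intercept: y-bar - b1 * x-bar
    return b0, b1


# Hypothetical points lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # 1.0 2.0
```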
36
Example: Regression Line
[Figure: scatterplot of BP (vertical axis) versus Age (horizontal axis) with the fitted line y = 20.11 - 0.526x]
37
Example: Regression Line
The following data are used to examine the relationship between age and change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.
$\bar{x} = 52.727$, $\bar{y} = -7.636$, $s_x = 14.164$, $s_y = 9.688$, $r = -0.76951$
b) What is the regression line for these data?
c) What would the predicted value be for someone who is 51 years old?
Obs   1   2   3   4   5   6   7   8   9  10  11
Age  70  51  65  70  48  70  45  48  35  48  30
BP  -28 -10  -8 -15  -8 -10 -12   3   1  -5   8
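One way to work parts (b) and (c) is sketched below, assuming the slope is computed as SS_xy / SS_xx and the intercept as the mean of y minus the slope times the mean of x. The function name `fit_line` is hypothetical.

```python
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# (b) Regression line.
b0, b1 = fit_line(age, bp)
print(round(b0, 3), round(b1, 3))

# Cross-check of the slope from the summary statistics: b1 = r * s_y / s_x.
b1_from_summary = -0.76951 * 9.688 / 14.164

# (c) Predicted change in BP for a 51-year-old.
predicted_51 = b0 + b1 * 51
print(round(predicted_51, 2))  # -6.73
```

The fitted line is close to the y = 20.11 - 0.526x shown on the previous slide, and the predicted change for a 51-year-old is a drop of roughly 6.7 mm Hg.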
38
Facts about Least Square Regression
1. Slope: A change of one standard deviation in x corresponds to a change of r standard deviations in y.
2. Intercept: the value of y when x = 0.
3. The line passes through the point $(\bar{x}, \bar{y})$.
4. There is an inherent difference between x and y.
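Facts 1 and 3 follow in one line each from the least-squares formulas $b_1 = r\,s_y/s_x$ and $b_0 = \bar{y} - b_1\bar{x}$. A short derivation, writing $\hat{y}(x) = b_0 + b_1 x$ for the predicted value at x:

```latex
% Fact 1: increasing x by one standard deviation s_x changes the prediction by r s_y.
\hat{y}(x + s_x) - \hat{y}(x) = b_1 s_x = r\,\frac{s_y}{s_x}\,s_x = r\,s_y

% Fact 3: the prediction at x-bar is y-bar.
\hat{y}(\bar{x}) = b_0 + b_1\bar{x} = (\bar{y} - b_1\bar{x}) + b_1\bar{x} = \bar{y}
```

Fact 4 shows up in the same formulas: regressing x on y would give slope $r\,s_x/s_y$, which is not the reciprocal of $b_1$ unless $|r| = 1$.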
39
$r^2$
• Coefficient of determination.
• Fraction of the variation of the values of y that is explained by the least-squares regression of y on x.
40
Example: Regression Line
The following data are used to examine the relationship between age and change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.
d) What percent of variation of Y is due to the regression line?
Obs   1   2   3   4   5   6   7   8   9  10  11
Age  70  51  65  70  48  70  45  48  35  48  30
BP  -28 -10  -8 -15  -8 -10 -12   3   1  -5   8
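A sketch of part (d), assuming the usual decomposition in which the explained fraction is 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares of y. The names `fit_line`, `sse`, and `sst` are choices made for this illustration.

```python
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(age, bp)
y_bar = sum(bp) / len(bp)

# Total variation of y about its mean.
sst = sum((y - y_bar) ** 2 for y in bp)
# Variation left over after fitting the line (sum of squared residuals).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, bp))

r_squared = 1 - sse / sst
print(round(100 * r_squared, 1))  # 59.2 (percent)
```

The same number comes from squaring the correlation: (-0.76951)^2 is about 0.592, so roughly 59% of the variation in BP change is explained by the regression on age.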
41
Beware of interpretation of $r^2$
• Linearity
• Outliers
• Good prediction
42
2.5: Cautions about Correlation and Regression - Goals
• Be able to calculate the residuals.
• Be able to use a residual plot to assess the fit of a regression line.
• Be able to identify outliers and influential observations by looking at scatterplots and residual plots.
• Be able to determine when you can predict a new value.
• Be able to identify lurking variables that can influence the relationship between two variables.
• Be able to explain the difference between association and causation.
43
Residuals
44
Example: Regression Line
The following data are used to examine the relationship between age and change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.
e) What is the residual for someone who is 51 years old?
Obs   1   2   3   4   5   6   7   8   9  10  11
Age  70  51  65  70  48  70  45  48  35  48  30
BP  -28 -10  -8 -15  -8 -10 -12   3   1  -5   8
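A sketch of part (e), assuming the usual definition of a residual as observed y minus predicted y. Only observation 2 has age 51, so that is the residual in question; `fit_line` is a hypothetical helper name.

```python
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(age, bp)

# Residual = observed - predicted, one per observation.
residuals = [y - (b0 + b1 * x) for x, y in zip(age, bp)]

i = age.index(51)            # position of the 51-year-old (observation 2)
residual_51 = residuals[i]
print(round(residual_51, 2))  # -3.27
```

The observed change (-10) is about 3.3 mm Hg below the predicted change (about -6.7), so the residual is negative. Least-squares residuals always sum to zero, up to rounding.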
45
Residual Plots
[Figure panels: Good; Linearity violation]
46
Residual Plots
[Figure panels: Good; Constant variance violation]
47
Residual Plots – BP
[Figure: residuals (vertical axis) versus Age (horizontal axis). Panels: Original; Y outlier]
48
Residual Plots – BP
[Figure: residuals (vertical axis) versus Age (horizontal axis). Panels: Original; X outlier]
49
Influential Point
• An outlier is an observation that lies outside the overall pattern of the other observations.
• An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
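The definition of an influential observation suggests a direct check: remove each observation in turn and see how much the result changes. The sketch below does this for the regression slope on the Age and BP data from the earlier example; the leave-one-out approach and the name `slope` are assumptions of this illustration, not a procedure given in the slides.

```python
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def slope(xs, ys):
    """Least-squares slope b1 = SS_xy / SS_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    return ss_xy / ss_xx


full_slope = slope(age, bp)

# Slope obtained when each observation is left out in turn.
loo_slopes = []
for i in range(len(age)):
    xs = age[:i] + age[i + 1:]
    ys = bp[:i] + bp[i + 1:]
    loo_slopes.append(slope(xs, ys))

for obs, s in enumerate(loo_slopes, start=1):
    print(f"without obs {obs:2d}: slope = {s:.3f} (change {s - full_slope:+.3f})")
```

Observations whose removal changes the slope markedly are candidates for being influential; what counts as "markedly" is a judgment call.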
50
Cautions about Correlation and Regression: Extrapolation
[Figure: scatterplot with BP on the vertical axis, illustrating extrapolation]
51
Cautions about Correlation and Regression:
• Both describe linear relationships.
• Both are affected by outliers.
• Always PLOT the data.
• Beware of extrapolation.
• Beware of lurking variables
  – Lurking variables are important in the study, but are not included.
  – Confounding variables confuse the issue.
• Correlation (association) does NOT imply causation!
52
Lurking Variables
In each of these cases, identify the lurking variable.
1. For children, there is an extremely strong correlation between shoe size and math scores.
2. There is a very strong correlation between ice cream sales and number of deaths by drowning.
3. There is a very strong correlation between the number of churches in a town and the number of bars in a town.
53
What is the lurking variable?
http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/
54
2.6: Data Analysis for Two-Way Tables - Goals
Statements
• The distribution of two random variables (bivariate) is called a joint distribution.
• Two random variables are similar to two events in that they can have conditional probabilities and be independent of each other.
Goal
• Interpret examples of Simpson’s paradox
55
Simpson’s Paradox
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
56
Simpson’s Paradox
Consider the acceptance rates for the following groups of men and women who applied to college.
Counts    Accepted   Not accepted   Total
Men          198         162         360
Women         88         112         200
Total        286         274         560

Percents  Accepted   Not accepted
Men          55%         45%
Women        44%         56%
57
Simpson’s Paradox
• Business School

Counts    Accepted   Not accepted   Total
Men           18         102         120
Women         24          96         120
Total         42         198         240

Percents  Accepted   Not accepted
Men          15%         85%
Women        20%         80%

• Art School

Counts    Accepted   Not accepted   Total
Men          180          60         240
Women         64          16          80
Total        244          76         320

Percents  Accepted   Not accepted
Men          75%         25%
Women        80%         20%
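The reversal can be checked directly from the counts. The sketch below assumes the first pair of tables belongs to the Business School and the second to the Art School, following the order of the bullets; the dictionary layout and the name `rate` are choices made for this illustration.

```python
# (accepted, not accepted) counts by school and group, copied from the tables.
counts = {
    "Business": {"Men": (18, 102), "Women": (24, 96)},
    "Art": {"Men": (180, 60), "Women": (64, 16)},
}


def rate(accepted, not_accepted):
    """Acceptance rate as a fraction of all applicants."""
    return accepted / (accepted + not_accepted)


# Within each school, women have the higher acceptance rate.
by_school = {
    school: {group: rate(*pair) for group, pair in groups.items()}
    for school, groups in counts.items()
}

# Combining the schools reverses the comparison.
combined = {}
for group in ("Men", "Women"):
    accepted = sum(counts[school][group][0] for school in counts)
    not_accepted = sum(counts[school][group][1] for school in counts)
    combined[group] = rate(accepted, not_accepted)

for school, rates in by_school.items():
    print(school, {g: f"{100 * v:.0f}%" for g, v in rates.items()})
print("Combined", {g: f"{100 * v:.0f}%" for g, v in combined.items()})
```

Women are accepted at a higher rate in each school (20% vs 15%, 80% vs 75%), yet at a lower rate overall (44% vs 55%). The school applied to acts as the lurking variable: most men applied to the school with the high acceptance rate, while most women applied to the one with the low rate.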
58
2.7: The Question of Causation - Goals
• Be able to explain an association
  – Causation
  – Common response
  – Confounding variables
• Apply the criteria for establishing causation.
59
Causation
Association does not mean causation!
60
Establishing Causation
Perform an experiment!
What do we need for causation?
1. The association is strong.
2. The association is consistent.
   – The connection happens in repeated trials
   – The connection happens under varying conditions
3. Higher doses are associated with stronger responses.
4. Alleged cause precedes the effect.
5. The alleged cause is plausible.