STAT 211 – 019 Dan Piett
West Virginia University
Lecture 2
Last Lecture
- Population/Sample
- Variable Types
  - Discrete/Continuous Numeric & Ranked/Unranked Categorical
- Displaying Small Sets of Numbers
  - Dot Plots, Stem and Leaf, Pie Charts
- Histograms
  - Frequency/Density and Symmetric vs Right/Left Skewed
- Measures of Center
  - Mean/Median
Overview
- 2.3 Measures of Dispersion
- 2.5 Boxplots
- 3.1 Scatterplots
- 3.2 Correlation
- 3.3 Regression
Section 2.3
Measures of Dispersion
Descriptive Statistics
- Describing the Data
- How do we describe data?
  - Graphs (Last Class)
  - Measures
    - Center (Last Class): Mean/Median
    - Dispersion/Spread (This Class): Variance, Standard Deviation, IQR
Spread of Data
- Example: Spread
  - Data 1: 8, 8, 9, 9, 10, 11, 11, 12, 12
  - Data 2: -30, -20, -10, 0, 10, 20, 30, 40, 50
  - Data 1: Mean = Median = 10
  - Data 2: Mean = Median = 10
- Both have the same measure of center, but how do they differ?
  - Data 2 is much more spread out.
Sample Standard Deviation
- The sample standard deviation (S) is a measure of how spread out the data is
- S can be any number >= 0
- A larger S indicates a larger spread
- The unit associated with S is the same unit as the variable
  - Example: mean of 110 lb, standard deviation of 10 lb
- The square of the sample standard deviation is called the sample variance
Standard Deviation Example
- Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
  - S = 1.58
- Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
  - S = 27.39
- As you can see, the standard deviation of Data 2 is much larger than that of Data 1.
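The two values of S can be checked with a short Python sketch. It assumes the usual sample formula (sum of squared deviations divided by n - 1, then a square root), which these slides do not state but which reproduces both numbers. The names `data1`, `data2`, and `sample_sd` are illustrative only.

```python
import math
import statistics

data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]


def sample_sd(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


for name, data in [("Data 1", data1), ("Data 2", data2)]:
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "S =", round(sample_sd(data), 2))
```

Both data sets report a mean and median of 10, while S rounds to 1.58 for Data 1 and 27.39 for Data 2. The standard library's `statistics.stdev` uses the same n - 1 formula and gives the same results.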
Population Variance/Standard Deviation
- Much like the sample mean (xbar) estimates the population mean (mu), the sample variance (s^2) and sample standard deviation (s) can be used to estimate the true population variance (sigma^2) and standard deviation (sigma)
Linear Transformations and Changes of Scale
- By adding a constant to, or subtracting a constant from, every value in a data set
  - The mean is increased/decreased by the same amount
  - The median is increased/decreased by the same amount
  - The standard deviation is unchanged
- By multiplying each value by a constant
  - The mean is multiplied by the same amount
  - The median is multiplied by the same amount
  - The standard deviation is multiplied by the absolute value of that amount (a standard deviation is never negative)
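A quick way to see these rules is to transform Data 1 and recompute the summaries. This is only a sketch: the shift of 5 and the multiplier of -2 are arbitrary choices, and `summarize` is a hypothetical helper. The negative multiplier is used on purpose, to show that the standard deviation scales by the size of the constant, not its sign.

```python
import statistics

data = [8, 8, 9, 9, 10, 11, 11, 12, 12]  # Data 1 from the earlier slides


def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))


shifted = [x + 5 for x in data]   # add a constant (5 is arbitrary)
scaled = [x * -2 for x in data]   # multiply by a constant (-2 is arbitrary)

print("original:", summarize(data))
print("shifted: ", summarize(shifted))
print("scaled:  ", summarize(scaled))
```

Adding 5 moves the mean and median from 10 to 15 and leaves S at about 1.58. Multiplying by -2 moves the mean and median to -20, while S doubles to about 3.16.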
Section 2.5
Boxplots
Quartiles
- Quartiles are numbers which partition the data into 4 subgroups (i.e., 4 quarters in a dollar)
- Q1
  - The value separating the lowest 25% of the data values from the rest
- Q2, a.k.a. the median
  - The value separating the lowest 50% of the data values from the rest
- Q3
  - The value separating the lowest 75% of the data values from the rest
- Q4, a.k.a. the maximum
  - The largest data value
Quartiles Example
- You can think of Q1 as the median of the bottom half of the data and Q3 as the median of the top half of the data
Interquartile Range (IQR)
- The IQR is another measure of spread, much like S.
- A larger IQR indicates more spread-out data
- IQR is calculated as Q3 - Q1
- Example
  - Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
    - IQR = 11.5 - 8.5 = 3
  - Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
    - IQR = 35 - (-15) = 50
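The quartiles behind these IQR values can be reproduced with the median-of-halves idea from the Quartiles Example slide. The sketch below assumes that, when the number of values is odd, the middle value is left out of both halves; that assumption matches the numbers on this slide, but other textbooks and software use slightly different quartile rules. `quartiles_by_halves` and `iqr` are hypothetical helper names, and the data set needs at least two values.

```python
import statistics

data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: with an odd number of values, the middle value is
    excluded from both the bottom half and the top half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # bottom half
    upper = ordered[len(ordered) - half:]    # top half
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles_by_halves(values)
    return q3 - q1


for name, data in [("Data 1", data1), ("Data 2", data2)]:
    q1, q2, q3 = quartiles_by_halves(data)
    print(name, "Q1 =", q1, "median =", q2, "Q3 =", q3, "IQR =", iqr(data))
```

For Data 1 this gives Q1 = 8.5 and Q3 = 11.5 (IQR 3); for Data 2 it gives Q1 = -15 and Q3 = 35 (IQR 50).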
Boxplots
- Boxplots are a graphical representation of the quartiles.
Using IQR to Find Potential Outliers
- One method to find potential outliers is as follows:
  1. Find the IQR
  2. Add 1.5*IQR to Q3
     - Anything larger than this value can be flagged as a potential outlier
  3. Likewise, subtract 1.5*IQR from Q1
     - Anything smaller than this value can be flagged as a potential outlier
- Example
  - Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
  - Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
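The three steps can be sketched in Python as follows. The quartiles are computed with the median-of-halves rule (middle value excluded when the count is odd), which is an assumption consistent with the IQR slide. `outlier_fences` and `potential_outliers` are hypothetical helper names, and the extra value 25 in the last data set is made up purely to show a flagged point.

```python
import statistics


def outlier_fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1                                  # step 1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr          # steps 3 and 2


def potential_outliers(values):
    """Values falling outside the fences."""
    low, high = outlier_fences(values)
    return [x for x in values if x < low or x > high]


data1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]
data1_plus_25 = data1 + [25]   # hypothetical extra value, not from the slides

for name, data in [("Data 1", data1), ("Data 2", data2),
                   ("Data 1 + 25", data1_plus_25)]:
    print(name, "fences =", outlier_fences(data),
          "flagged =", potential_outliers(data))
```

Under this rule the fences are 4 and 16 for Data 1 and -90 and 110 for Data 2, so neither original data set has a potential outlier. In the made-up third set, 25 lies above the upper fence of 16.5 and is flagged.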
Section 3.1
Scatterplots
Bivariate Data
- Bivariate data is data consisting of two variables measured on the same individual
- Examples
  - Height and Weight
  - Classes skipped and GPA
- Graphed using a scatterplot
Scatterplot Example
Section 3.2
Correlation
Pearson Correlation Coefficient
- We have discussed ways to describe data of one variable. This section will discuss how to describe two variables measured on the same individual together.
- The correlation coefficient, r, is a measure of the strength of a linear (straight line) relationship between bivariate data. (You will not need to know the formula for r)
- To say two variables are correlated is to say that an increase/decrease in one tends to correspond to an increase/decrease in the other.
More on r
- r can take on values between -1 and 1
- The strength of the correlation depends on how close r is to the extreme values of -1 or 1
  - r = -0.78 is a stronger correlation than r = 0.50
- There are three types of correlation
  - Positive
  - Negative
  - No Correlation
Positive Correlation
- Positive Correlation exists when r is between 0 and 1.
- The closer r is to 1, the stronger the relationship
- This implies that as one of the variables increases, the other one tends to increase as well.
- Examples:
  - Height and Weight, Temperature and Ice Cream Sales
Negative Correlation
- Negative Correlation exists when r is between -1 and 0.
- The closer r is to -1, the stronger the relationship
- This implies that as one of the variables increases, the other one tends to decrease.
- Example:
  - Temperature and Hot Chocolate Sales
No Correlation
- No Correlation exists when r is approximately 0
- This implies that as one of the variables increases, the other one shows no tendency to increase or decrease
- Example:
  - Temperature and Cookie Sales
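The three cases can be illustrated numerically. The slides skip the formula for r, so the sketch below uses the standard Pearson formula only as an illustration, not as something to memorize. All temperature and sales figures are made up for the example, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one row per day
temperature = [60, 70, 80, 90, 100]
ice_cream_sales = [20, 25, 35, 40, 50]      # rises with temperature
hot_chocolate_sales = [50, 40, 35, 25, 20]  # falls with temperature
cookie_sales = [30, 10, 20, 10, 30]         # no linear trend

print("ice cream:    ", round(pearson_r(temperature, ice_cream_sales), 2))
print("hot chocolate:", round(pearson_r(temperature, hot_chocolate_sales), 2))
print("cookies:      ", round(pearson_r(temperature, cookie_sales), 2))
```

With these made-up numbers, r is close to 1 for ice cream, close to -1 for hot chocolate, and 0 for cookies.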
Interpretation of r
- Although we may find that two variables are correlated, this does not necessarily mean that there is a causal relationship.
- Example:
  - High school teachers who are paid less tend to have students who do better on the SATs than teachers who are paid more. It has been found that there is a negative correlation between teacher salary and students' SAT scores. Therefore, we should pay our teachers less so students score higher.
- Clearly this is not a causal relationship. There is likely a third variable that explains this. One possibility may be the age of the teacher.
Section 3.3
Regression
Regression Intro
- Once we have decided that two variables are correlated, we can use the value of one of the variables, “x”, to predict the value of the other variable, “y”.
- Example:
  - Use height (x) to predict weight (y)
  - Use temperature (x) to predict ice cream sales (y)
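As a rough illustration of using x to predict y, the sketch below fits a straight line to made-up temperature and ice cream sales figures. It assumes the standard least-squares formulas (slope = Sxy / Sxx, intercept = ybar - slope * xbar); the course's own regression equation is not reproduced in this transcript, so the notation and the helper names `fit_line` and `predict` are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares line through paired data: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted y-value on the fitted line for a given x."""
    return intercept + slope * x


# Hypothetical data, not from the lecture
temperature = [60, 70, 80, 90, 100]       # x
ice_cream_sales = [20, 25, 35, 40, 50]    # y

intercept, slope = fit_line(temperature, ice_cream_sales)
print("intercept =", intercept, "slope =", slope)
print("predicted sales at 85 degrees =", predict(intercept, slope, 85))
```

For this made-up data the fitted line has slope 0.75 and intercept -26, so the predicted sales at a temperature of 85 are 37.75.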
Regression Equation
Calculating a Regression Equation Given the slope and intercept
Plotting a Regression Line
Notes on Regression Lines
Residuals
- A residual is the signed vertical distance between a point (observed y-value) and the regression line (predicted y-value)
  - Formula: Observed Value - Predicted Value
- Using the Cholesterol Example:
  - For TV Hours = 3, our predicted value was 212.2
  - The actual value on the graph is 220.
  - The residual for this particular point is 220 - 212.2 = 7.8
- A residual may be positive or negative
  - The interpretation is that the observed y-value is 7.8 units larger than the predicted y-value for TV Hours = 3
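The residual arithmetic can be written as a one-line function. The first call uses the numbers from the cholesterol example; the second uses a made-up observed value of 205 (not from the lecture) to show a negative residual. `residual` is a hypothetical helper name.

```python
def residual(observed, predicted):
    """Residual = observed value - predicted value."""
    return observed - predicted


# Cholesterol example from the slide: TV Hours = 3
above_line = residual(observed=220, predicted=212.2)
print(round(above_line, 1))   # positive: the point lies above the line

# Hypothetical observation at the same TV Hours value
below_line = residual(observed=205, predicted=212.2)
print(round(below_line, 1))   # negative: the point lies below the line
```

The first residual rounds to 7.8, matching the slide. The second rounds to -7.2, meaning that the made-up observation sits 7.2 units below the predicted value.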