STA 291 Summer 2010
Lecture 4
Dustin Lueker

Population Distribution
The population distribution for a continuous variable is usually represented by a smooth curve
◦ Like a histogram that gets finer and finer
◦ Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate
Symmetric distributions
◦ Bell-shaped
◦ U-shaped
◦ Uniform
Not symmetric distributions
◦ Left-skewed
◦ Right-skewed
◦ Skewed

Summarizing Data Numerically
Center of the data
◦ Mean
◦ Median
◦ Mode
Dispersion of the data (sometimes referred to as spread)
◦ Variance, Standard deviation
◦ Interquartile range
◦ Range

Measures of Central Tendency
Mean
◦ Arithmetic average
Median
◦ Midpoint of the observations when they are arranged in order, smallest to largest
Mode
◦ Most frequently occurring value

Sample Mean
Sample size n
Observations x1, x2, …, xn
Sample mean “x-bar” (Σ means SUM)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$$

Population Mean
Population size N
Observations x1, x2, …, xN
Population mean “mu” (Σ means SUM)
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}$$
Note: This is for a finite population of size N

Mean
Requires numerical values
◦ Only appropriate for quantitative data
◦ Does not make sense to compute the mean for nominal variables
◦ Can be calculated for ordinal variables, but this does not always make sense
    ◦ Should be careful when using the mean on ordinal variables
    ◦ Example: “Weather” (on an ordinal scale)
      Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5
      Mean (average) weather = 2.8
    ◦ Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale

Mean
Center of gravity for the data set
Sum of the differences from values above the mean is equal to the sum of the differences from values below the mean
◦ 3+2+2 = 3+4
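The balance property can be checked numerically. The sketch below uses hypothetical data (not taken from the slide's figure), chosen so that the distances above and below the mean match the slide's 3+2+2 = 3+4.

```python
# Hypothetical observations: mean is 10, distances above are 2, 2, 3,
# distances below are 4, 3 (matching the slide's 3+2+2 = 3+4).
values = [6, 7, 12, 12, 13]
mean = sum(values) / len(values)

# Total distance of the values above the mean, and of the values below it.
above = sum(x - mean for x in values if x > mean)
below = sum(mean - x for x in values if x < mean)

print("mean:", mean)
print("sum of distances above the mean:", above)
print("sum of distances below the mean:", below)
```

The two totals agree for any data set, which is why the mean acts as the balance point.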

Mean (Average)
Mean
◦ Sum of observations divided by the number of observations
Example
◦ {7, 12, 11, 18}
◦ Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12
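A minimal sketch of the same calculation in Python; the function name `sample_mean` is only illustrative.

```python
def sample_mean(observations):
    """Return the sum of the observations divided by the number of observations."""
    n = len(observations)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(observations) / n


example = [7, 12, 11, 18]
print(sample_mean(example))  # 48 divided by 4
```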

Mean
Highly influenced by outliers
◦ Data points that are far from the rest of the data
◦ Example: monthly income for five people
    1,000  2,000  3,000  4,000  100,000
    Average monthly income = 110,000/5 = 22,000
What is the problem with using the average to describe this data set?
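One way to see the effect of the outlier is to compute the mean with and without the largest income. This is only a sketch; dropping the value is for comparison, not a recommendation to discard data.

```python
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

# Mean of all five incomes.
mean_all = sum(incomes) / len(incomes)

# Mean after setting aside the single extreme value (comparison only).
without_outlier = [x for x in incomes if x != 100_000]
mean_without = sum(without_outlier) / len(without_outlier)

print("mean with the outlier:   ", mean_all)
print("mean without the outlier:", mean_without)
```

Four of the five people earn far less than the mean of the full data set, so that mean does not describe a typical income.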

Median
Measurement that falls in the middle of the ordered sample
When the sample size n is odd, there is a middle value
◦ It has the ordered index (n+1)/2
    ◦ Ordered index is where that value falls when the sample is listed from smallest to largest
    ◦ An index of 2 means the second smallest value
◦ Example: 1.7, 4.6, 5.7, 6.1, 8.3
    n = 5, (n+1)/2 = 6/2 = 3, index = 3
    Median = 3rd smallest observation = 5.7

Median
When the sample size n is even, average the two middle values
◦ Example: 3, 5, 6, 9
    n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    Median = midpoint between 2nd and 3rd smallest observations = (5+6)/2 = 5.5
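The odd and even cases can be sketched in one function built around the ordered index (n+1)/2. The name `sample_median` is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def sample_median(observations):
    """Median via the ordered index (n + 1) / 2, counting from 1."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so one middle value exists.
        index = (n + 1) // 2
        return ordered[index - 1]  # convert the 1-based index to 0-based
    # Even n: (n + 1) / 2 ends in .5, so average the two neighbouring values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(sample_median([1.7, 4.6, 5.7, 6.1, 8.3]))  # n = 5, index 3
print(sample_median([3, 5, 6, 9]))               # n = 4, index 2.5
```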

Mean and Median
For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
The median usually better describes a “typical value” when the sample distribution is highly skewed
Example
◦ Monthly income for five people
    1,000  2,000  3,000  4,000  100,000
◦ Median monthly income: 3,000 (the 3rd smallest of the 5 observations)
Why is the median better to use with this data than the mean?

Measures of Central Tendency
Mean - Arithmetic average
◦ Mean of a sample: $\bar{x}$
◦ Mean of a population: μ
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value
Notation: subscripted variables
◦ n = # of units in the sample
◦ N = # of units in the population
◦ x = variable to be measured
◦ xi = measurement of the ith unit

Median for Grouped or Ordinal Data
Example: Highest Degree Completed

Highest Degree                                         Frequency   Percentage
Not a high school graduate                                38,012         21.4
High school only                                          65,291         36.8
Some college, no degree                                   33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional      41,124         23.2
Total                                                    177,618        100

Calculate the Median
n = 177,618
(n+1)/2 = 88,809.5
Median = midpoint between the 88,809th smallest and 88,810th smallest observations
◦ Both are in the category “High school only”
Mean wouldn’t make sense here since the variable is ordinal
Median
◦ Can be used for interval data and for ordinal data
◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
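The median category can be located by accumulating frequencies in category order until the middle positions are reached. The sketch below uses the frequencies from the Highest Degree Completed example; the function name `median_category` is an illustrative choice.

```python
# Categories listed in their natural (ordinal) order with their frequencies.
categories = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]


def median_category(ordered_counts):
    """Return the category of each middle observation (one if n is odd, two if even)."""
    n = sum(count for _, count in ordered_counts)
    if n % 2 == 1:
        middle_positions = [(n + 1) // 2]
    else:
        middle_positions = [n // 2, n // 2 + 1]

    found = []
    for position in middle_positions:
        cumulative = 0
        for label, count in ordered_counts:
            cumulative += count
            if position <= cumulative:  # first category whose running total covers it
                found.append(label)
                break
    return found


print(median_category(categories))
```

With n = 177,618 the middle positions are 88,809 and 88,810. The running total after the first category is 38,012 and after the second is 103,303, so both positions fall in the second category.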

Mean vs. Median
Mean
◦ Interval data with an approximately symmetric distribution
Median
◦ Interval data
◦ Ordinal data
Mean is sensitive to outliers, median is not

Mean vs. Median
Symmetric distribution
◦ Mean = Median
Skewed distribution
◦ Mean lies more toward the direction in which the distribution is skewed
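A small sketch of the skewness rule using the standard library's `statistics` module. The right-skewed sample is the monthly income example from this lecture; the left-skewed sample is a hypothetical mirror image constructed only for illustration.

```python
import statistics

right_skewed = [1_000, 2_000, 3_000, 4_000, 100_000]  # long tail to the right
left_skewed = [101_000 - x for x in right_skewed]      # hypothetical mirror image

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    side = "above" if mean > median else "below"
    print(f"{name}: mean = {mean}, median = {median}, mean is {side} the median")
```

In the right-skewed sample the mean is pulled above the median, and in the left-skewed sample it is pulled below.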

Median
While the median is better than the mean for skewed distributions, there is one large disadvantage to using the median
◦ Insensitive to changes within the lower or upper half of the data
◦ Example
    1, 2, 3, 4, 5
    1, 2, 3, 100, 100
◦ Sometimes, the mean is more informative even when the distribution is skewed
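The two data sets from the example can be compared directly. This sketch uses Python's `statistics` module.

```python
import statistics

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

median_first = statistics.median(first)
median_second = statistics.median(second)
mean_first = statistics.mean(first)
mean_second = statistics.mean(second)

print("medians:", median_first, median_second)
print("means:  ", mean_first, mean_second)
```

The medians are identical even though the upper halves differ greatly; only the mean registers the change.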

Example: Keeneland Sales

Deviations
The deviation of the ith observation xi from the sample mean $\bar{x}$ is the difference between them, $(x_i - \bar{x})$
◦ Sum of all deviations is zero
◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation
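A short sketch of why the raw deviations cannot measure variation on their own. It reuses the sample {7, 12, 11, 18} from the earlier mean example.

```python
observations = [7, 12, 11, 18]
mean = sum(observations) / len(observations)

# Deviation of each observation from the sample mean.
deviations = [x - mean for x in observations]

sum_deviations = sum(deviations)                   # positives and negatives cancel
sum_absolute = sum(abs(d) for d in deviations)     # ignores the sign
sum_squared = sum(d ** 2 for d in deviations)      # squares remove the sign

print("deviations:", deviations)
print("sum of deviations:", sum_deviations)
print("sum of absolute deviations:", sum_absolute)
print("sum of squared deviations:", sum_squared)
```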

Sample Variance
Variance of n observations is the sum of the squared deviations, divided by n-1
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Example

Observation   Mean   Deviation   Squared Deviation
1
3
4
7
10

Sum of the Squared Deviations
n-1
Sum of the Squared Deviations / (n-1)
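The table can be filled in step by step. The sketch below assumes the observation column reads 1, 3, 4, 7, 10; that reading is an assumption, since the extracted slide runs the digits together.

```python
import statistics

# Assumption: the slide's observations are 1, 3, 4, 7, 10.
observations = [1, 3, 4, 7, 10]
n = len(observations)
mean = sum(observations) / n

# One row per observation: (observation, mean, deviation, squared deviation).
rows = []
for x in observations:
    deviation = x - mean
    rows.append((x, mean, deviation, deviation ** 2))

sum_squared = sum(row[3] for row in rows)
sample_variance = sum_squared / (n - 1)

print("Observation  Mean  Deviation  Squared Deviation")
for x, m, d, d2 in rows:
    print(f"{x:>11}  {m:>4}  {d:>9}  {d2:>17}")
print("Sum of the squared deviations:", sum_squared)
print("n - 1:", n - 1)
print("Sample variance:", sample_variance)

# Cross-check against the standard library, which also divides by n - 1.
assert statistics.variance(observations) == sample_variance
```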

Interpreting Variance
Approximately the average of the squared deviations
◦ “average squared distance from the mean”
Unit
◦ Square of the unit for the original data
Difficult to interpret
◦ Solution: take the square root of the variance, and the unit is the same as for the original data
    ◦ Standard Deviation

Properties of Standard Deviation
s ≥ 0
◦ s = 0 only when all observations are the same
If data is collected for the whole population instead of a sample, then n-1 is replaced by N
s is sensitive to outliers
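
Variance and Standard Deviation
Sample
◦ Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
◦ Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Population
◦ Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
◦ Standard Deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$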
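Python's `statistics` module provides both versions, which makes the n-1 versus N distinction easy to see. The data below are hypothetical and serve only as an illustration.

```python
import statistics

data = [1, 3, 4, 7, 10]  # hypothetical observations, for illustration only

# Treating the data as a sample: divide by n - 1.
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Treating the same data as the whole population: divide by N.
population_var = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

print("sample variance:", sample_var, " sample SD:", round(sample_sd, 4))
print("population variance:", population_var, " population SD:", round(population_sd, 4))
```

Because the sample formulas divide by the smaller number n-1, the sample variance and standard deviation are larger than the population versions computed from the same values.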

Population Parameters and Sample Statistics
Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
◦ They are unknown constants that we would like to estimate
Sample mean and sample standard deviation are denoted by $\bar{x}$ and s
◦ They are random variables, because their values vary according to the random sample that has been selected

Empirical Rule
If the data is approximately symmetric and bell-shaped, then
◦ About 68% of the observations are within one standard deviation from the mean
◦ About 95% of the observations are within two standard deviations from the mean
◦ About 99.7% of the observations are within three standard deviations from the mean

Example
Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
◦ About 68% of the scores are between 850 and 1150 (1000 ± 150)
◦ About 95% of the scores are between 700 and 1300 (1000 ± 2·150)
◦ If you have a score above 1300, you are in the top 2.5% (the 5% outside two standard deviations is split evenly between the two tails)
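The empirical-rule intervals for this example can be computed as mean ± k standard deviations. This is a sketch under the rule's assumption of a symmetric, bell-shaped distribution; the function name `empirical_rule_interval` is illustrative.

```python
def empirical_rule_interval(mean, sd, k):
    """Return (low, high) for the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd


mean, sd = 1000, 150

within_one = empirical_rule_interval(mean, sd, 1)  # about 68% of scores
within_two = empirical_rule_interval(mean, sd, 2)  # about 95% of scores

# A score of 1300 is two standard deviations above the mean. About 5% of
# scores lie outside two standard deviations, and by symmetry half of
# those are in the upper tail.
top_percent = (100 - 95) / 2

print("about 68% between", within_one)
print("about 95% between", within_two)
print("a score above 1300 is in roughly the top", top_percent, "percent")
```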