Chapter 5
Describing Distributions Numerically
Describing the Distribution
Center
• Median (.5 quantile, 2nd quartile, 50th percentile)
• Mean
Spread
• Range
• Interquartile Range
• Standard Deviation
Median
Literally = middle number (data value)
Has the same units as the data
When n (number of observations) is odd:
• Order the data from smallest to largest
• Median is the middle number on the list: the (n+1)/2-th number from the smallest value
• Ex: If n=11, median is the (11+1)/2 = 6th number from the smallest value
• Ex: If n=37, median is the (37+1)/2 = 19th number from the smallest value
Example – Frank Thomas
Career Home Runs (season totals, ordered): 4 7 15 18 24 28 29 32 35 38 40 40 41 42 43
Remember to order the values if they aren’t already in order!
• 15 observations: (15+1)/2 = 8th observation from the bottom
• Median = 32 HRs
Median
When n is even:
• Order the data from smallest to largest
• Median is the average of the two middle numbers
• (n+1)/2 will be halfway between these two numbers
• Ex: If n=10, (10+1)/2 = 5.5, median is the average of the 5th and 6th numbers from the smallest value
Example – Ryne Sandberg
Career Home Runs (season totals, ordered): 0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40
Remember to order the values if they aren’t already in order!
• 16 observations: (16+1)/2 = 8.5, so average the 8th and 9th observations from the bottom
• Median = average of 16 and 19
• Median = 17.5 HRs
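The two median rules (n odd, n even) can be sketched in a few lines of Python. This is only an illustration of the (n+1)/2 position rule from the slides; the function name `median` and the variable names are chosen here for the example.

```python
def median(values):
    """Return the median using the (n+1)/2 position rule."""
    ordered = sorted(values)      # always order the data first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th value, which is index n//2 counting from 0
        return ordered[middle]
    # n even: average the two values on either side of position (n+1)/2
    return (ordered[middle - 1] + ordered[middle]) / 2


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(median(thomas))    # 32
print(median(sandberg))  # 17.5
```

Sorting inside the function means the "remember to order the values" step cannot be forgotten.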
Mean
Ordinary average
• Add up all observations
• Divide by the number of observations
Has the same units as the data
Formula
• n = number of observations
• y1, y2, y3, …, yn are the values
Mean
$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum y}{n} = \frac{1}{n}\sum y$$
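Examples
Thomas
$$\bar{y} = \frac{1}{15}(4 + 7 + 15 + 18 + \cdots + 43) = \frac{436}{15} \approx 29.07 \text{ HRs}$$
Sandberg
$$\bar{y} = \frac{1}{16}(0 + 5 + 7 + 8 + \cdots + 40) = \frac{282}{16} = 17.625 \text{ HRs}$$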
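As a quick check of the formula, the mean can be computed directly as "add up, then divide by n". The sketch below uses the Sandberg values listed earlier; the helper name `mean_by_hand` is made up for this example, and `statistics.mean` from the Python standard library is used only as a cross-check.

```python
import statistics


def mean_by_hand(values):
    """Ordinary average: add up all observations, divide by how many there are."""
    return sum(values) / len(values)


sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(sum(sandberg), len(sandberg))   # 282 16
print(mean_by_hand(sandberg))         # 17.625
print(statistics.mean(sandberg))      # 17.625
```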
Mean vs. Median
Median = middle number
Mean = value where histogram balances
Mean and median similar when
• Data are symmetric
Mean and median different when
• Data are skewed
• There are outliers
Mean vs. Median
Mean influenced by unusually high or unusually low values
Example: Income in a small town of 6 people
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000
• The mean income is about $31,833
• The median income is $32,000
Mean vs. Median
Bill Gates moves to town
$25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
• The mean income is $5,741,571
• The median income is $35,000
Mean is pulled by the outlier
Median is not
Mean is not a good center of these data
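The small-town income example is easy to reproduce. The sketch below uses `mean` and `median` from Python's standard `statistics` module; the variable names are chosen for illustration.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_gates = incomes + [40_000_000]   # one very large outlier joins the town

for label, data in (("before", incomes), ("after", with_gates)):
    # Round the mean to whole dollars for display only.
    print(f"{label}: mean = ${round(mean(data)):,}, median = ${round(median(data)):,}")
```

One extra value moves the mean by millions of dollars, while the median shifts by only $3,000.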
Mean vs. Median
Skewness pulls the mean in the direction of the tail
• Skewed to the right = mean > median
• Skewed to the left = mean < median
Outliers pull the mean in their direction
• Large outlier = mean > median
• Small outlier = mean < median
Spread
Range = maximum – minimum
Thomas
• Min = 4, Max = 43, Range = 43 – 4 = 39 HRs
Sandberg
• Min = 0, Max = 40, Range = 40 – 0 = 40 HRs
Spread
Range is a very basic measure of spread
• It is highly affected by outliers
• A single outlier can make the spread appear larger than it really is
Ex. The annual numbers of deaths from tornadoes in the U.S. from 1990 to 2000:
53 39 39 33 69 30 25 67 130 94 40
• Range with outlier: 130 – 25 = 105 deaths
• Range without outlier: 94 – 25 = 69 deaths
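The effect of the single outlier on the range can be checked in a few lines. This is a minimal sketch; `data_range` is a name invented for the example, and treating 130 as the outlier follows the slide.

```python
def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
without_outlier = [d for d in deaths if d != 130]   # drop the outlying year

print(data_range(deaths))           # 105
print(data_range(without_outlier))  # 69
```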
Spread
Interquartile Range (IQR)
First Quartile (Q1)
• Larger than about 25% of the data
Third Quartile (Q3)
• Larger than about 75% of the data
IQR = Q3 – Q1
• Spread of the center (middle) 50% of the values
Finding Quartiles
Order the data
Split into two halves at the median
• When n is odd, include the median in both halves
• When n is even, do not include the median in either half
Q1 = median of the lower half
Q3 = median of the upper half
Example – Frank Thomas
Order the values (15 values)
4 7 15 18 24 28 29 32 35 38 40 40 41 42 43
Lower Half = 4 7 15 18 24 28 29 32
• Q1 = Median of lower half = (18 + 24)/2 = 21 HRs
Upper Half = 32 35 38 40 40 41 42 43
• Q3 = Median of upper half = (40 + 40)/2 = 40 HRs
IQR = Q3 – Q1 = 40 – 21 = 19 HRs
Example – Ryne Sandberg
Order the values (16 values)
0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40
Lower Half = 0 5 7 8 9 12 14 16
• Q1 = Median of lower half = (8 + 9)/2 = 8.5 HRs
Upper Half = 19 19 25 26 26 26 30 40
• Q3 = Median of upper half = (26 + 26)/2 = 26 HRs
IQR = Q3 – Q1 = 26 – 8.5 = 17.5 HRs
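The "split at the median" rule for quartiles can be sketched as follows. It implements only the convention described in these slides (median included in both halves when n is odd); statistical software often uses other quartile conventions and may report slightly different values. The function names are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2          # n odd: the median lands in both halves
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median_of_sorted(lower), median_of_sorted(upper)


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

for name, data in (("Thomas", thomas), ("Sandberg", sandberg)):
    q1, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```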
Five Number Summary
Minimum, Q1, Median, Q3, Maximum
Examples
Thomas
• Min = 4 HRs, Q1 = 21 HRs, Median = 32 HRs, Q3 = 40 HRs, Max = 43 HRs
Sandberg
• Min = 0 HRs, Q1 = 8.5 HRs, Median = 17.5 HRs, Q3 = 26 HRs, Max = 40 HRs
Graph of Five Number Summary
Boxplot
• Box between Q1 and Q3
• Line in the box marks the median
• Lines extend out to minimum and maximum
Best used for comparisons
Use this simpler method
Example – Thomas & Sandberg
Boxplot of Thomas Home Runs
• Box from 21 to 40
• Line in box at 32
• Lines extend out from box to 4 and 43
Boxplot of Sandberg Home Runs
• Box from 8.5 to 26
• Line in box at 17.5
• Lines extend out from box to 0 and 40
Side by Side Boxplots of Thomas & Sandberg Home Runs
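Since the plots themselves are not part of this transcript, here is a rough text-only stand-in. It is a sketch, not a plotting library: `ascii_boxplot` is a hypothetical helper, the axis limits 0 to 44 are an arbitrary choice, and the five-number summaries are the ones computed above. `|` marks the minimum and maximum, `=` fills the box from Q1 to Q3, and `M` marks the median.

```python
def ascii_boxplot(summary, lo, hi, width=45):
    """Draw one text boxplot row from (minimum, Q1, median, Q3, maximum)."""
    mn, q1, med, q3, mx = summary

    def col(value):
        # Map a data value onto a character column of the shared axis.
        return round((value - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    for c in range(col(mn), col(mx) + 1):
        row[c] = "-"              # lines run from minimum to maximum
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="              # box spans Q1 to Q3
    row[col(mn)] = "|"
    row[col(mx)] = "|"
    row[col(med)] = "M"           # median marker inside the box
    return "".join(row)


thomas = (4, 21, 32, 40, 43)
sandberg = (0, 8.5, 17.5, 26, 40)

# Use the same lo/hi for both rows so the plots line up for comparison.
thomas_row = ascii_boxplot(thomas, lo=0, hi=44)
sandberg_row = ascii_boxplot(sandberg, lo=0, hi=44)
print(f"Thomas   {thomas_row}")
print(f"Sandberg {sandberg_row}")
```

Drawing both rows on one shared axis is what makes the comparison readable: the Thomas box sits further to the right than the Sandberg box.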
Spread
Standard deviation
• “Average” spread from the mean
• Most common measure of spread (although it is influenced by skewness and outliers)
• Denoted by the letter s
• Make a table when calculating by hand
Standard Deviation
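$$s = \sqrt{\frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}} = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\sum (y - \bar{y})^2}$$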
Example – Deaths from Tornadoes
Mean: ȳ = 619/11 ≈ 56.27 deaths

  y     y - ȳ                      (y - ȳ)²
 53     53 - 56.27 =  -3.27          10.69
 39     39 - 56.27 = -17.27         298.25
 39     39 - 56.27 = -17.27         298.25
 33     33 - 56.27 = -23.27         541.49
 69     69 - 56.27 =  12.73         162.05
 30     30 - 56.27 = -26.27         690.11
 25     25 - 56.27 = -31.27         977.81
 67     67 - 56.27 =  10.73         115.13
130    130 - 56.27 =  73.73        5436.11
 94     94 - 56.27 =  37.73        1423.55
 40     40 - 56.27 = -16.27         264.71

$$s = \sqrt{\frac{10.69 + 298.25 + \cdots + 264.71}{11 - 1}} = 31.97 \text{ deaths}$$
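The by-hand table translates directly into a loop: one row per observation, a running total of the squared deviations, then divide by n - 1 and take the square root. This is a sketch with variable names chosen for the example; it keeps full precision, so the squared deviations can differ in the last decimal from the rounded values in a hand table.

```python
from math import sqrt

deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
n = len(deaths)
y_bar = sum(deaths) / n                 # mean, about 56.27

total = 0.0
print(f"{'y':>4} {'y - ybar':>9} {'(y - ybar)^2':>13}")
for y in deaths:
    deviation = y - y_bar
    squared = deviation ** 2
    total += squared                    # running sum of squared deviations
    print(f"{y:>4} {deviation:>9.2f} {squared:>13.2f}")

s = sqrt(total / (n - 1))               # divide by n - 1, not n
print(f"s = {s:.2f}")                   # s = 31.97
```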
Example – Frank Thomas
Find the standard deviation of the number of home runs given the following statistic:
$$\sum (y - \bar{y})^2 = 2329.74$$
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = \sqrt{\frac{2329.74}{15 - 1}} = 12.9 \text{ HRs}$$
Properties of s
s = 0 only when all observations are equal; otherwise, s > 0
s has the same units as the data
s is not resistant
• Skewness and outliers affect s, just like the mean
Tornado Example:
• s with outlier: 31.97 deaths
• s without outlier: 21.70 deaths
Which summaries should you use?
What numbers are affected by outliers?
• Mean
• Standard deviation
• Range
What numbers are not affected by outliers?
• Median
• IQR
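To see which summaries react to an outlier, the tornado data can be summarized with and without the value 130. The sketch below uses `mean`, `median`, and `stdev` from Python's standard `statistics` module (`stdev` divides by n - 1, matching the formula for s). The IQR is left out to keep the example short.

```python
from statistics import mean, median, stdev

deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
trimmed = [d for d in deaths if d != 130]    # same data without the outlier

for label, data in (("with outlier", deaths), ("without outlier", trimmed)):
    print(
        f"{label:>15}: mean = {mean(data):.2f}, s = {stdev(data):.2f}, "
        f"range = {max(data) - min(data)}, median = {median(data)}"
    )
```

The mean, s, and range all drop noticeably once 130 is removed, while the median moves only from 40 to 39.5.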
Which summaries should you use?
Five Number Summary
• Skewed data
• Data with outliers
Mean and Standard Deviation
• Symmetric data
ALWAYS PLOT YOUR DATA!!