9/14/15
Statistics: Unlocking the Power of Data Lock5
STAT 250 Dr. Kari Lock Morgan
Describing Data: Two Variables
SECTIONS 2.4, 2.5
• One quantitative variable (2.4)
• One quantitative and one categorical (2.4)
• Two quantitative (2.5)
z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
• ACT: μ = 21, σ = 5
• SAT: μ = 1500, σ = 325
• Assume ACT and SAT scores have approximately bell-shaped distributions
a) ACT score of 28 b) SAT score of 2100 c) I don’t know
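One way to compare the two scores is to put each on the same scale with a z-score, (value − mean) / standard deviation. The sketch below uses only the means and standard deviations given above; the helper name `z_score` is made up for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Means and standard deviations are the ones stated on the slide.
act_z = z_score(28, mean=21, sd=5)
sat_z = z_score(2100, mean=1500, sd=325)

print(f"ACT 28   -> z = {act_z:.2f}")
print(f"SAT 2100 -> z = {sat_z:.2f}")
print("Higher relative standing:", "SAT" if sat_z > act_z else "ACT")
```

The score with the larger z-score sits further above its own mean, measured in standard deviations.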
Honeybee Waggle Dances
https://www.youtube.com/watch?v=-7ijI-g4jHg
Honeybee Waggle Dance
• Honeybee scouts investigate new home or food source options; the scouts communicate the information to the hive with a “waggle dance”
• Scientists took bees to an island with only two possible options for nesting: one of very high quality and one of low quality. They recorded
  ◦ Quality of nesting site
  ◦ Distance to nesting site
  ◦ Number of waggle dance circuits performed
  ◦ Duration of waggle dance
Seeley, T., Honeybee Democracy, Princeton University Press, Princeton, NJ, 2010, p. 128
Questions of the Day
How many circuits of the waggle dance do honey bees do?
How is this related to quality of a nesting site?
How is duration of the dance related to distance to a nesting site?
Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles (m = median):
Q1 = median of the values below m
Q3 = median of the values above m
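Five Number Summary
• Five Number Summary: Min, Q1, m, Q3, Max
• These five values split the data into four parts, each holding about 25% of the values: Min to Q1, Q1 to m, m to Q3, Q3 to Max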
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics
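For readers without Minitab, the five number summary can be sketched in a few lines of Python using the median-of-halves rule for quartiles described earlier. The function name and the data values are hypothetical, chosen only for illustration (they are not the study's data), and software packages may use slightly different quartile conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + (n % 2):]   # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical circuit counts, for illustration only.
circuits = [0, 7, 12, 20, 35, 60, 80, 100, 440]
print(five_number_summary(circuits))
```

When the number of values is odd, this sketch leaves the median itself out of both halves, which matches "values below m" and "values above m" as long as there are no ties at the median.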
Five Number Summary
The distribution of number of circuits is
a) Symmetric b) Right-skewed c) Left-skewed d) Impossible to tell
Percentile
The Pth percentile is the value which is greater than P% of the data
• We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better
• We could also have used percentiles:
  ◦ ACT score of 28: 91st percentile
  ◦ SAT score of 2100: 97th percentile
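If the scores are treated as exactly normal (an assumption; the slides only say "approximately bell-shaped"), percentiles can be approximated from the stated means and standard deviations with Python's `statistics.NormalDist`. This is a sketch under that assumption, so the results should land close to, but not necessarily exactly on, the percentiles quoted above.

```python
from statistics import NormalDist

# Normal models built from the means and standard deviations given earlier.
act = NormalDist(mu=21, sigma=5)
sat = NormalDist(mu=1500, sigma=325)

# cdf(x) is the proportion of the distribution below x.
act_pct = 100 * act.cdf(28)
sat_pct = 100 * sat.cdf(2100)

print(f"ACT 28 is above about {act_pct:.1f}% of scores")
print(f"SAT 2100 is above about {sat_pct:.1f}% of scores")
```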
Five Number Summary
• Five Number Summary as percentiles:
  ◦ Min = 0th percentile
  ◦ Q1 = 25th percentile
  ◦ m = 50th percentile
  ◦ Q3 = 75th percentile
  ◦ Max = 100th percentile
• About 25% of the data lies between each consecutive pair of values
Measures of Spread
• Range = Max – Min
• Interquartile Range (IQR) = Q3 – Q1
• Is the range resistant to outliers? a) Yes b) No
• Is the IQR resistant to outliers? a) Yes b) No
Comparing Statistics
• Measures of Center:
  ◦ Mean (not resistant)
  ◦ Median (resistant)
• Measures of Spread:
  ◦ Standard deviation (not resistant)
  ◦ IQR (resistant)
  ◦ Range (not resistant)
• Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information
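Resistance can be checked numerically by changing one value into an extreme outlier and seeing which statistics move. The sketch below uses made-up data and a hypothetical helper `summarize`; quartiles follow the median-of-halves rule from earlier in these notes.

```python
from statistics import mean, median, stdev

def iqr(values):
    """Interquartile range, with quartiles as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[half + (n % 2):]) - median(data[:half])

def summarize(values):
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": iqr(values),
        "range": max(values) - min(values),
    }

# Hypothetical values; the second list replaces 9 with an outlier of 90.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

before = summarize(clean)
after = summarize(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this example the median and IQR are unchanged by the outlier, while the mean, standard deviation, and range all increase.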
Boxplot
[Figure: boxplot with the median, Q1, and Q3 labeled; the box covers the middle 50% of the data; two outliers are marked]
Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier
Minitab: Graph -> Boxplot -> One Y -> Simple
*For boxplots, outliers are defined as any point more than 1.5 IQRs beyond the quartiles (although you don’t have to know that)
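The 1.5 IQR rule in the footnote can be sketched as follows. The function name and data are hypothetical, and the quartiles use the median-of-halves rule from earlier, so the fences may differ slightly from what a given software package reports.

```python
from statistics import median

def boxplot_outliers(values, k=1.5):
    """Return (outliers, (low_whisker, high_whisker)) under the k * IQR rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + (n % 2):])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return outliers, (inside[0], inside[-1])

# Hypothetical circuit counts, for illustration only.
circuits = [0, 7, 12, 20, 35, 60, 80, 100, 440]
outliers, whiskers = boxplot_outliers(circuits)
print("Outliers:", outliers)
print("Whiskers end at:", whiskers)
```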
Boxplot
This boxplot shows a distribution that is
a) Symmetric b) Left-skewed c) Right-skewed
One Quantitative and One Categorical
• How is number of waggle circuits related to the quality of the nesting site?
• Two variables
  ◦ One quantitative (number of circuits)
  ◦ One categorical (quality: low or high)
• Can do anything for one quantitative variable, broken down by categorical groups
Side-by-Side Boxplots
Minitab: Graph -> Boxplot -> One Y -> With Groups
Stacked Dotplots
Minitab: Graph -> Dotplot -> One Y -> With Groups
Overlaid Histograms
Minitab: Graph -> Histogram -> With Groups
Quantitative Statistics by a Categorical Variable
• Any of the statistics we use for a quantitative variable can be looked at separately for each level of a categorical variable
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics -> By variables
Difference in Means
• Often, when comparing a quantitative variable across two categories, we compute the difference in means
$\bar{x}_H - \bar{x}_L = 90.5 - 30 = 60.5$
Honeybees perform 60.5 circuits more, on average, for the high quality site as opposed to the low quality site.
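A difference in means can be computed by splitting the quantitative values by category and averaging each group. The records below are hypothetical (the study's raw data are not reproduced in these slides), and `group_means` is an illustrative helper, not a Minitab feature.

```python
from statistics import mean

def group_means(rows):
    """Average the quantitative value separately for each category label."""
    groups = {}
    for label, value in rows:
        groups.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in groups.items()}

# Hypothetical (site quality, number of circuits) records, for illustration only.
records = [
    ("High", 80), ("High", 100), ("High", 90),
    ("Low", 20), ("Low", 40), ("Low", 30),
]

means = group_means(records)
diff = means["High"] - means["Low"]
print(means)
print("Difference in means (High - Low):", diff)

# Arithmetic check of the two group means reported on the slide.
slide_diff = 90.5 - 30
print("Slide's difference in means:", slide_diff)
```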
Association?
Does there appear to be an association between number of waggle circuits and quality of potential nesting site?
a) Yes b) No
Summary: One Quantitative and One Categorical
• Summary Statistics
  ◦ Any summary statistics for quantitative variables, broken down by groups
  ◦ Difference in means
• Visualization
  ◦ Side-by-side graphs
Two Quantitative Variables
• How is duration of the dance related to distance to a nesting site?
• Two quantitative variables
• Summary Statistics: correlation
• Visualization: scatterplot
Scatterplot
A scatterplot is the graph of the relationship between two quantitative variables.
Minitab: Graph -> Scatterplot -> Simple
Direction of Association
• A positive association means that values of one variable tend to be higher when values of the other variable are higher
• A negative association means that values of one variable tend to be lower when values of the other variable are higher
• Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable
Correlation
The correlation is a measure of the strength and direction of linear association between two quantitative variables
• Sample correlation: r
• Population correlation: ρ (“rho”)
Minitab: Stat -> Basic Statistics -> Correlation
r = 0.994 for duration of dance and distance to site
Correlation
1. -1 ≤ r ≤ 1
2. The sign indicates the direction of association
   • positive association: r > 0
   • negative association: r < 0
   • no linear association: r ≈ 0
3. The closer r is to ±1, the stronger the linear association
4. r has no units and does not depend on the units of measurement
5. The correlation between X and Y is the same as the correlation between Y and X
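These properties can be checked with a direct implementation of the sample (Pearson) correlation. The data are hypothetical distance and duration pairs, not the measurements behind r = 0.994, and the function name `correlation` is an illustrative choice.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample (Pearson) correlation r between paired lists xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements, for illustration only.
distance_m = [200, 250, 500, 950, 1950, 3500]
duration_s = [0.4, 0.5, 0.9, 1.6, 3.0, 5.5]

r = correlation(distance_m, duration_s)
r_swapped = correlation(duration_s, distance_m)      # property 5: order does not matter
distance_km = [d / 1000 for d in distance_m]
r_rescaled = correlation(distance_km, duration_s)    # property 4: units do not matter

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, in km = {r_rescaled:.3f}")
```

All three values agree up to floating-point rounding, and r stays between -1 and 1.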
Correlation Guessing Game
http://www.istics.net/Correlations/
Enter PennState for the group ID. Highest scorer in the class by the first exam gets one extra credit point on Exam 1!
Correlation
[Figure: scatterplot for NFL teams; x-axis: Malevolence Rating of Uniform; y-axis: z-score for Penalty Yards; r = 0.43]
Correlation Cautions
1. Correlation can be heavily affected by outliers. Always plot your data!
Testosterone Levels and Time
What is the correlation between testosterone levels and hour of the day? a) Positive b) Negative c) About 0
Are testosterone level and hour of the day associated? a) Yes b) No
Correlation Cautions
1. Correlation can be heavily affected by outliers. Always plot your data!
2. r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!
TVs and Life Expectancy
[Figure: scatterplot "TV and Life Expectancy"; x-axis: TVs per 1000 People; y-axis: Life Expectancy; points are labeled by country (for example Angola, Japan, United States); r = 0.74]
Correlation Cautions
1. Correlation can be heavily affected by outliers. Always plot your data!
2. r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!
3. Correlation does not imply causation!
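The first two cautions can be illustrated numerically. The sketch below uses small made-up data sets and a hand-written `correlation` helper (an illustrative name): one data set has a perfect curved relationship but r = 0, and another shows a single outlier turning an r of about 0 into a strong positive r.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Sample (Pearson) correlation r between paired lists xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Caution 2: y is completely determined by x, but the relationship is not linear.
xs_curve = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs_curve]
r_curve = correlation(xs_curve, ys_curve)

# Caution 1: hypothetical points with no linear association ...
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 1, 3]
r_without = correlation(xs, ys)
# ... and the same points plus one extreme outlier at (30, 30).
r_with = correlation(xs + [30], ys + [30])

print(f"curved relationship: r = {r_curve:.2f}")
print(f"without outlier:     r = {r_without:.2f}")
print(f"with outlier:        r = {r_with:.2f}")
```

A scatterplot of either data set would reveal what the single number r hides, which is why the slides keep repeating "Always plot your data!"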
Summary: Two Quantitative Variables
• Summary Statistics: correlation
• Visualization: scatterplot
To Do
• Read Sections 2.4 and 2.5
• Do HW 2.2, 2.3, 2.4, 2.5 (due Friday, 9/18)