describing data: one variable
DESCRIPTION
STAT 101, Dr. Kari Lock Morgan, 9/6/12. Describing Data: One Variable. Sections 2.1, 2.2, 2.3, 2.4: one categorical variable (2.1), one quantitative variable (2.2, 2.3, 2.4). The Big Picture: population, sample, sampling, statistical inference, descriptive statistics.
Statistics: Unlocking the Power of Data Lock5
STAT 101
Dr. Kari Lock Morgan
9/6/12
Describing Data: One Variable
SECTIONS 2.1, 2.2, 2.3, 2.4
• One categorical variable (2.1)
• One quantitative variable (2.2, 2.3, 2.4)
The Big Picture
[Diagram: sampling takes a sample from the population; descriptive statistics summarize the sample; statistical inference uses the sample to draw conclusions about the population]
Descriptive Statistics
• To make sense of data, we need ways to summarize and visualize it
• Summarizing and visualizing variables, and relationships between two variables, is often known as descriptive statistics (also known as exploratory data analysis)
• The appropriate summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative)
One Categorical Variable
A random sample of US adults in 2012 was surveyed regarding the type of cell phone owned:
Android? iPhone? Blackberry? Non-smartphone? No cell phone?
Frequency Table
R: table(x)
• A frequency table shows the number of cases that fall in each category:

Category          Count
Android             458
iPhone              437
Blackberry          141
Non Smartphone      924
No cell phone       293
Total              2253
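As a minimal sketch of how `table(x)` produces these counts, the R code below rebuilds one value per respondent from the counts on the slide. The names `counts`, `phone`, and `freq` are illustrative, and the vector is a reconstruction, not the original survey data.

```r
# Counts taken from the frequency table on the slide
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Reconstruct one value per case (an illustrative stand-in for the raw data)
phone <- factor(rep(names(counts), times = counts), levels = names(counts))

freq <- table(phone)   # number of cases in each category
print(freq)
print(sum(freq))       # total number of cases
```

Setting `levels` explicitly keeps the categories in the slide's order instead of alphabetical order.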
Proportion
The proportion in a category is found by
$\text{proportion} = \frac{\text{number of cases in that category}}{\text{total number of cases}}$
Proportion for a sample: $\hat{p}$ (“p-hat”)
Proportion for a population: $p$
Proportion
What proportion of adults sampled do not own a cell phone?

Category          Count
Android             458
iPhone              437
Blackberry          141
Non Smartphone      924
No cell phone       293
Total              2253

$\hat{p} = \frac{293}{2253} = 0.13$, or 13%
Proportions and percentages can be used interchangeably
Relative Frequency Table
A relative frequency table shows the proportion of cases that fall in each category
R: table(x)/length(x)

Category          Proportion
Android           0.203
iPhone            0.194
Blackberry        0.063
Non Smartphone    0.410
No cell phone     0.130

All the numbers in a relative frequency table sum to 1
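The relative frequencies can be checked directly from the counts. This is a small sketch that starts from the slide's counts, not from the raw responses; the names `counts` and `rel_freq` are illustrative.

```r
# Counts from the frequency table on the slide
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

rel_freq <- counts / sum(counts)   # proportion of cases in each category
print(round(rel_freq, 3))
print(sum(rel_freq))               # the proportions sum to 1
```

With a raw vector `x` of responses, `table(x)/length(x)` gives the same proportions, assuming `x` has no missing values.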
Bar Chart/Plot/Graph
In a barplot, the height of each bar corresponds to the number of cases falling in that category
R: barplot(table(x))
Pie Chart
In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category
R: pie(table(x))
StatKey
www.lock5stat.com/statkey
Summary: One Categorical Variable
Summary Statistics
• Proportion
• Frequency table
• Relative frequency table
Visualization
• Bar chart
• Pie chart
One Quantitative Variable
World gross for all 2011 Hollywood movies
HollywoodMovies2011
More graphics on profits for Hollywood movies
Dotplot
In a dotplot, each case is represented by a dot and dots are stacked. It is an easy way to see each case.
Histogram
The height of each bar corresponds to the number of cases within that range of the variable
R: hist(x)
Histogram vs Bar Chart
This is a
a) Histogram
b) Bar chart
c) Other
d) I have no idea
Histogram vs Bar Chart
• A bar chart is for categorical data, and the x-axis has no numeric scale
• A histogram is for quantitative data, and the x-axis is numeric
• For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed
• For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars
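To illustrate the last point, here is a rough sketch that bins the same values two ways. The data are simulated and purely hypothetical, and `wide_breaks` and `narrow_breaks` are illustrative names; `plot = FALSE` returns the bin counts without drawing.

```r
set.seed(1)
x <- rexp(200, rate = 1 / 100)   # hypothetical right-skewed values

# Explicit break points: wide bins (200 units) versus narrow bins (25 units)
wide_breaks   <- seq(0, ceiling(max(x) / 200) * 200, by = 200)
narrow_breaks <- seq(0, ceiling(max(x) / 25) * 25, by = 25)

h_wide   <- hist(x, breaks = wide_breaks, plot = FALSE)
h_narrow <- hist(x, breaks = narrow_breaks, plot = FALSE)

print(length(h_wide$counts))    # number of bars with wide bins
print(length(h_narrow$counts))  # number of bars with narrow bins
print(sum(h_wide$counts) == sum(h_narrow$counts))  # same cases either way
```

Dropping `plot = FALSE` draws the two histograms, which show the same data with visibly different levels of detail.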
Shape
• Symmetric
• Left-skewed
• Right-skewed (long right tail)
Notation
The sample size, the number of cases in the sample, is denoted by $n$
We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$
$x_1 = 97.009$, $x_2 = 201.897$, $x_3 = 216.196$, …
Mean
The mean or average of the data values is
$\text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$
Sample mean: $\bar{x}$
Population mean: $\mu$ (“mu”)
R: mean(x)
Median
The median, m, is the middle value when the data are ordered.
If there are an even number of values, the median is the average of the two middle values.
The median splits the data in half.
R: median(x)
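A small worked example of the mean and of both median cases (odd and even n). The values are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
x_odd  <- c(2, 9, 4, 7, 5)       # hypothetical values, n = 5
x_even <- c(2, 9, 4, 7, 5, 30)   # the same values plus one more, n = 6

print(mean(x_odd))     # (2 + 9 + 4 + 7 + 5) / 5 = 5.4
print(median(x_odd))   # ordered: 2 4 5 7 9, so the middle value is 5
print(mean(x_even))    # 57 / 6 = 9.5
print(median(x_even))  # ordered: 2 4 5 7 9 30, so the average of 5 and 7 is 6
```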
Measures of Center
World Gross (in millions)
$m = 76.66$
$\bar{x} = 150.74$
The mean is “pulled” in the direction of skewness
Skewness and Center
A distribution is left-skewed. Which measure of center would you expect to be higher?
a) Mean
b) Median
Answer: b) Median. The mean will be pulled down in the direction of the skew (toward the long left tail).
Outlier
An outlier is an observed value that is notably distinct from the other values in a dataset.
Outliers
[Plot: World Gross (in millions), with labeled outliers Harry Potter, Transformers, and Pirates of the Caribbean]
Resistance
A statistic is resistant if it is relatively unaffected by extreme values.
The median is resistant while the mean is not.

                        Mean            Median
With Harry Potter       $150,742,300    $76,658,500
Without Harry Potter    $141,889,900    $75,009,000
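The same effect can be reproduced on a toy dataset. The values below are hypothetical (they are not the HollywoodMovies2011 data), with one extreme value added to play the role of a blockbuster.

```r
gross <- c(20, 45, 60, 75, 90, 110, 150)   # hypothetical grosses, in millions
gross_with_outlier <- c(gross, 1300)       # add one hypothetical extreme value

mean_shift   <- mean(gross_with_outlier) - mean(gross)
median_shift <- median(gross_with_outlier) - median(gross)

print(mean_shift)    # the mean moves a lot (about 152.7)
print(median_shift)  # the median barely moves (7.5)
```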
Outliers
When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake
If not, you have to decide whether the outlier is part of your population of interest or not
Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results
Standard Deviation
The standard deviation for a quantitative variable measures the spread of the data
Sample standard deviation: $s$
Population standard deviation: $\sigma$ (“sigma”)
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
R: sd(x)
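As a check on the formula, this sketch computes the sample standard deviation step by step and compares it with `sd(x)`. The data values are hypothetical.

```r
x <- c(2, 4, 4, 5, 7, 8)   # hypothetical data values

n     <- length(x)
x_bar <- mean(x)                                   # 30 / 6 = 5
s_by_hand <- sqrt(sum((x - x_bar)^2) / (n - 1))    # sqrt(24 / 5)

print(s_by_hand)
print(sd(x))   # sd() also divides by n - 1, so the two agree
```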
Standard Deviation
The standard deviation gives a rough estimate of the typical distance of a data value from the mean
The larger the standard deviation, the more variability there is in the data and the more spread out the data are
Standard Deviation
[Figure: two histograms of Frequency on the same horizontal scale (-15 to 15), one with $s = 1$ and one with $s = 4$]
Both of these distributions are bell-shaped
95% Rule
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.
For a population, about 95% of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$
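A quick simulation gives a feel for the rule. This is only a sketch on simulated bell-shaped data; the mean of 50 and standard deviation of 10 are arbitrary assumptions, and `prop_within` is an illustrative name.

```r
set.seed(101)
x <- rnorm(10000, mean = 50, sd = 10)   # simulated bell-shaped data

lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of values within two standard deviations of the mean
prop_within <- mean(x >= lower & x <= upper)
print(prop_within)   # should be close to 0.95
```

For skewed or otherwise non-bell-shaped data the proportion can differ noticeably from 95%.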
StatKey
The 95% Rule
[Figure: two histograms of Frequency, one with $s = 1$ (horizontal scale -3 to 3) and one with $s = 4$ (horizontal scale -15 to 15)]
The 95% Rule
The standard deviation for hours of sleep per night is closest to
a) ½
b) 1
c) 2
d) 4
e) I have no idea
$s = 2.03$
z-score
The z-score for a data value, $x$, is
$z = \frac{x - \bar{x}}{s}$
For a population, $\bar{x}$ is replaced with $\mu$ and $s$ is replaced with $\sigma$
Values farther from 0 are more extreme
z-score
• A z-score puts values on a common scale
• A z-score is the number of standard deviations a value falls from the mean
• 95% of all z-scores fall between what two values? -2 and 2
• z-scores beyond -2 or 2 can be considered extreme
z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
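ACT: $\mu = 21$, $\sigma = 5$
SAT: $\mu = 1500$, $\sigma = 325$
Assume ACT and SAT scores have approximately bell-shaped distributions
a) ACT score of 28
b) SAT score of 2100
c) I don’t know
ACT: $z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4$
SAT: $z = \frac{2100 - 1500}{325} = \frac{600}{325} \approx 1.85$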
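The comparison can be reproduced in a few lines of R. The helper `z_score` is a hypothetical function written for this sketch, not a built-in; the centers and spreads are the ACT and SAT values given on the slide.

```r
# Hypothetical helper: number of standard deviations a value is from the center
z_score <- function(x, center, spread) (x - center) / spread

z_act <- z_score(28, center = 21, spread = 5)        # 7 / 5 = 1.4
z_sat <- z_score(2100, center = 1500, spread = 325)  # 600 / 325, about 1.85

print(z_act)
print(round(z_sat, 2))
print(z_sat > z_act)   # the SAT score is further above its mean
```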
Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
• Q1 = median of the values below m
• Q3 = median of the values above m
Five Number Summary
Five Number Summary: Min, Q1, m, Q3, Max
[Diagram: Min | 25% | Q1 | 25% | m | 25% | Q3 | 25% | Max, with 25% of the data between each adjacent pair of values]
R: summary(x)
Five Number Summary
The distribution of number of hours spent studying each week is
a) Symmetric
b) Right-skewed
c) Left-skewed
d) Impossible to tell
> summary(study_hours)
   Min. 1st Qu.  Median 3rd Qu.    Max.
   2.00   10.00   15.00   20.00   69.00
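A sketch of computing a five number summary in R. The vector below is hypothetical (the `study_hours` data are not included in these notes) and was chosen so that it yields the same five values as the output above. Note that software uses slightly different quartile conventions, so results on other data may not match the hand rule exactly.

```r
# Hypothetical values chosen to reproduce the five numbers shown above
x <- c(2, 5, 10, 12, 15, 18, 20, 25, 69)

five <- fivenum(x)   # Min, lower hinge (Q1), median, upper hinge (Q3), Max
print(five)

# summary(x) reports the same kind of values and additionally the mean
print(summary(x))
```

Here the gap from the median to the maximum (15 to 69) is far larger than the gap from the minimum to the median (2 to 15).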
Percentile
The Pth percentile is the value which is greater than P% of the data
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better
We could also have used percentiles:
• ACT score of 28: 91st percentile
• SAT score of 2100: 97th percentile
Five Number Summary
Five Number Summary:
• Min: 0th percentile
• Q1: 25th percentile
• m: 50th percentile
• Q3: 75th percentile
• Max: 100th percentile
25% of the data fall between each adjacent pair of these values
Measures of Spread
Range = Max – Min
Interquartile Range (IQR) = Q3 – Q1
Is the range resistant to outliers?
a) Yes
b) No
Answer: b) No. The range depends entirely on the most extreme values.
Is the IQR resistant to outliers?
a) Yes
b) No
Answer: a) Yes. The IQR is based on the middle 50% of the data, which will not contain outliers.
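A small sketch comparing the two measures of spread with and without an extreme value. The data are hypothetical, and `range_width` is an illustrative helper (R's built-in `range()` returns the minimum and maximum, not their difference).

```r
x     <- c(2, 5, 10, 12, 15, 18, 20, 25, 30)   # hypothetical data
x_out <- replace(x, length(x), 300)            # make the largest value extreme

# Hypothetical helper: Range = Max - Min
range_width <- function(v) max(v) - min(v)

print(range_width(x))      # 28
print(range_width(x_out))  # 298: the range changes dramatically
print(IQR(x))              # 10
print(IQR(x_out))          # 10: the IQR is unchanged
```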
Comparing Statistics
Measures of Center:
• Mean (not resistant)
• Median (resistant)
Measures of Spread:
• Standard deviation (not resistant)
• IQR (resistant)
• Range (not resistant)
Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information
Outliers
Outliers can be informally identified by looking at a plot, but one rule of thumb is to flag data values more than 1.5 IQRs beyond the quartiles
A data value is an outlier if it is
• Smaller than Q1 - 1.5(IQR)
or
• Larger than Q3 + 1.5(IQR)
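Applying this rule of thumb to the study-hours summary shown earlier (Q1 = 10, Q3 = 20, Min = 2, Max = 69) gives a concrete check. The names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative.

```r
q1 <- 10   # first quartile from the study-hours summary
q3 <- 20   # third quartile from the study-hours summary

iqr <- q3 - q1                    # 10
lower_fence <- q1 - 1.5 * iqr     # -5
upper_fence <- q3 + 1.5 * iqr     # 35

# Hypothetical helper: TRUE for values beyond either fence
is_outlier <- function(v) v < lower_fence | v > upper_fence

print(c(lower_fence, upper_fence))
print(is_outlier(c(2, 69)))   # the minimum is not an outlier, the maximum is
```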
Boxplot
[Boxplot diagram with labels: Q1, Median, Q3, the box covering the middle 50% of data, and Outliers]
• Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier
R: boxplot(x)
Boxplot
Which boxplot goes with the histogram of waiting times for the bus?
[Figure: Histogram of Bus (x-axis: Bus, 0 to 20; y-axis: Frequency) with three candidate boxplots (a), (b), (c)]
The data do not show any low outliers.
StatKey
www.lock5stat.com/statkey
Summary: One Quantitative Variable
Summary Statistics
• Center: mean, median
• Spread: standard deviation, range, IQR
• Percentiles
• 5 number summary
Visualization
• Dotplot
• Histogram
• Boxplot
Other concepts
• Shape: symmetric, skewed, bell-shaped
• Outliers, resistance
• z-scores
To Do
• Read Sections 2.1, 2.2, 2.3, 2.4
• Do Homework 1 (due Tuesday, 9/11)
If you haven’t already…
• Get the textbook (at bookstore now)
• Get a clicker and register it (due Tuesday, 9/11)