wednesday, august 11 (131 minutes) - stevewillott.comstevewillott.com/17-18 ap stats notes in word/1...
Name _____________________________
Chapter 1 Learning Objectives
Learning Objective | Section | Related Example on Page(s) | Relevant Chapter Review Exercise(s) | Can I do this?
Identify the individuals and variables in a set of data. | Intro | 3 | R1.1 |
Classify variables as categorical or quantitative. | Intro | 3 | R1.1 |
Display categorical data with a bar graph. Decide whether it would be appropriate to make a pie chart. | 1.1 | 9 | R1.2, R1.3 |
Identify what makes some graphs of categorical data deceptive. | 1.1 | 10 | R1.3 |
Calculate and display the marginal distribution of a categorical variable from a two-way table. | 1.1 | 13 | R1.4 |
Calculate and display the conditional distribution of a categorical variable for a particular value of the other categorical variable in a two-way table. | 1.1 | 15 | R1.4 |
Describe the association between two categorical variables by comparing appropriate conditional distributions. | 1.1 | 17 | R1.5 |
Make and interpret dotplots and stemplots of quantitative data. | 1.2 | Dotplots: 25; Stemplots: 31 | R1.6 |
Describe the overall pattern (shape, center, and spread) of a distribution and identify any major departures from the pattern (outliers). | 1.2 | Dotplots: 26 | R1.6, R1.9 |
Identify the shape of a distribution from a graph as roughly symmetric or skewed. | 1.2 | 28 | R1.6, R1.7, R1.8, R1.9 |
Make and interpret histograms of quantitative data. | 1.2 | 33 | R1.7, R1.8 |
Compare distributions of quantitative data using dotplots, stemplots, or histograms. | 1.2 | 30 | R1.8, R1.10 |
Calculate measures of center (mean, median). | 1.3 | Mean: 49; Median: 52 | R1.6 |
Calculate and interpret measures of spread (range, IQR, standard deviation). | 1.3 | IQR: 55; Std. dev: 60 | R1.9 |
Choose the most appropriate measure of center and spread in a given setting. | 1.3 | 65 | R1.7 |
Identify outliers using the 1.5 × IQR rule. | 1.3 | 56 | R1.6, R1.7, R1.9 |
Make and interpret boxplots of quantitative data. | 1.3 | 57 | R1.7 |
Use appropriate graphs and numerical summaries to compare distributions of quantitative variables. | 1.3 | 65 | R1.8, R1.10 |
1.1 Analyzing Categorical Data
Read 2–4
Fr/Soph/Jr/Sr
G.P.A.
Email address
Name
Bus route
Phone number
Days absent
Address
Credits earned
Allergies
Current on immunizations
Exterior color
Mileage
Total car length
Number of cylinders
Cost
Model
VIN
Type of sound system
Size of fuel tank
What do we call these two kinds of variables? What’s the difference?
Why do people sometimes confuse the two kinds of variables?
What is a distribution? It tells us what values a variable takes and how often it takes each value.
Alternate Example: Willott’s music
Here is information about 12 randomly selected songs in Willott’s music library.
Song Title | Artist | Album year | Track Length | Genre | Tracks on the album | Track Number
Double Dare | Bauhaus | 1980 | 4:54 | Gothic | 9 | 1
Carpe Noctum | Tiesto | 2007 | 7:03 | Dance/Electronic | 12 | 4
She Wolf | Shakira | 2009 | 3:10 | Latin | 12 | 1
Come as You Are | Nirvana | 1991 | 3:39 | Alternative | 12 | 3
The Heinrich Maneuver | Interpol | 2007 | 3:35 | Alternative | 11 | 4
Shake It Out | Florence + The Machine | 2011 | 4:38 | Alternative | 12 | 2
My Songs Know What You Did in the Dark (Light Em Up) | Fall Out Boy | 2013 | 3:07 | Alternative | 11 | 2
Locked Out of Heaven | Bruno Mars | 2012 | 3:53 | Pop | 10 | 2
Womanizer | Britney Spears | 2008 | 3:44 | Pop | 13 | 1
Iceolate | Front Line Assembly | 1990 | 5:13 | Industrial | 10 | 7
I Bet You Look Good On The Dancefloor | Arctic Monkeys | 2006 | 2:54 | Indie | 13 | 2
Meat is Murder | The Smiths | 1985 | 6:06 | Alternative | 9 | 9
(a) Who are the individuals in this data set?
(b) What variables are measured? Identify each as categorical or quantitative. In what units were the
quantitative variables measured?
(c) Describe the individual in the first row.
Read 7–11
What's the difference between a data table, a frequency table, and a relative frequency table?
Data table | Frequency table | Relative frequency table
tells values of variables for individuals | tells distribution of 1 variable in table form | tells distribution of 1 variable as a %, decimal, or fraction
Which one was the previous example?
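To make the last two kinds of table concrete, here is a minimal Python sketch that builds a frequency table and a relative frequency table from the Genre column of the song data. The variable names are illustrative only, and the genre list is copied by hand from the 12 songs above.

```python
from collections import Counter

# Genre column copied from the 12 songs in the example above.
genres = [
    "Gothic", "Dance/Electronic", "Latin", "Alternative", "Alternative",
    "Alternative", "Alternative", "Pop", "Pop", "Industrial", "Indie",
    "Alternative",
]

# Frequency table: how many songs fall in each genre.
frequency = Counter(genres)

# Relative frequency table: each count as a fraction of all 12 songs.
total = len(genres)
relative_frequency = {genre: count / total for genre, count in frequency.items()}

for genre, count in frequency.most_common():
    print(f"{genre:<18}{count:>3}{relative_frequency[genre]:>8.1%}")
```

The counts add up to 12 and the relative frequencies add up to 1, which is a quick check that no song was missed.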
When making pie charts and bar graphs, what do people often mess up?
 | Bar Graphs | Pie Charts
Pros | Quick & easy | Show part-whole relationships well
Cons | Part-whole relationships are hard to see | They're hard to make by hand. Don't use when percents don't add up to 100%.
Let's search "misleading graph" and see some examples.
Identify some particular problems many of these graphs share.
HW #11: page 7 (1, 3, 5, 7, 8), page 22 (11, 13, 15, 17, 18)
Read 12–18
Examples of:
…two-way table (2 variables are shown with counts or frequencies)
Senior Non-senior
Boy 8 3
Girl 15 4
…marginal distribution (totals for rows & columns; the distribution for each variable)
Senior Non-senior Totals
Boy 8 3 11
Girl 15 4 19
Totals 23 7 30
…conditional distribution (distribution of one variable, as percents, within each category of the other variable)
Senior Non-senior
Boy 35% 43%
Girl 65% 57%
Totals 100% 100%
How do we know which variable to condition on? Divide by the explanatory variable totals.
Senior Non-senior Totals
Boy 73% 27% 100%
Girl 79% 21% 100%
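The percents in the tables above can be reproduced with a short Python sketch. It is only an illustration: the dictionary layout and names are assumptions, and the counts are the ones from the boy/girl, senior/non-senior two-way table.

```python
# Counts from the two-way table above: rows are sex, columns are class.
counts = {
    "Boy": {"Senior": 8, "Non-senior": 3},
    "Girl": {"Senior": 15, "Non-senior": 4},
}
columns = ["Senior", "Non-senior"]

# Marginal distributions: the row totals and the column totals.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {col: sum(counts[row][col] for row in counts) for col in columns}
grand_total = sum(row_totals.values())

# Conditional distribution of sex given class: divide by the column totals.
sex_given_class = {
    col: {row: counts[row][col] / col_totals[col] for row in counts}
    for col in columns
}

# Conditional distribution of class given sex: divide by the row totals.
class_given_sex = {
    row: {col: counts[row][col] / row_totals[row] for col in columns}
    for row in counts
}

print(row_totals, col_totals, grand_total)
for col in columns:
    print(col, {row: f"{p:.0%}" for row, p in sex_given_class[col].items()})
for row in counts:
    print(row, {col: f"{p:.0%}" for col, p in class_given_sex[row].items()})
```

Dividing by column totals makes each column sum to 100%, and dividing by row totals makes each row sum to 100%, matching the two conditional tables shown.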
Died Survived
Hospital A
Hospital B
What is a segmented (or stacked) bar graph?
Use a segmented bar graph to compare conditional distributions, to look for differences, and to look for
patterns.
When knowing the value of one variable helps predict the value of the other, we say that the variables are
associated. Association appears in a segmented bar graph when we see big differences in the proportions. The
proportions may be “flipped” or reversed.
Careful! An association does NOT automatically mean that there is a cause-and-effect relationship.
The boy/girl senior/non-senior graphs did not show much association.
Alternate Example: Horseshoe Crabs
Two members of the University of Florida at Gainesville Department of Zoology collected data on Horseshoe
Crabs on a Delaware beach during 4 days in the late spring of 1992. Based on the color of the shells, they
classified each crab as Young, Intermediate, or Old and whether the crabs could right themselves when flipped
on their backs or whether they were stranded for at least a certain period of time. Here are the results.
Young Intermediate Old Total
Stranded 214 384 295 893
Not Stranded 1668 1204 216 3088
Total 1882 1588 511 3981
(a) Explain what it would mean if there were no association between age and strandedness.
(b) Does there appear to be an association between age and strandedness in this sample? Justify.
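One way to approach this example is to condition on age and compare the percent stranded in each age group. The Python sketch below does that arithmetic with the counts from the table; the names are illustrative, and treating age as the explanatory variable is an assumption about how the question is meant to be read.

```python
# Counts from the horseshoe crab table above.
stranded = {"Young": 214, "Intermediate": 384, "Old": 295}
not_stranded = {"Young": 1668, "Intermediate": 1204, "Old": 216}

# Condition on age: divide each stranded count by that age group's total.
proportion_stranded = {}
for age in stranded:
    age_total = stranded[age] + not_stranded[age]
    proportion_stranded[age] = stranded[age] / age_total

for age, proportion in proportion_stranded.items():
    print(f"{age:<13}{proportion:.1%} stranded")
```

If there were no association, the three percents would be roughly equal, so the size of the differences between them is what to discuss in part (b).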
HW #12: page 22 (19, 21, 23, 25, 27–34)
And now, we change from categorical data to quantitative data…
1.2 Displaying Quantitative Data with Graphs
Elmer and Ethel have retired and want to move someplace warm. The couple is considering nine different cities. The dotplots below show the distribution of average daily high temperatures in December, January, and February for each of these cities. Help them pick a city by answering the questions below, based on the data shown in the graph.
1. What is the typical high temperature for these months in Phoenix, Orlando, and San Juan? Which of those 3 cities is
most similar in this respect to Palm Springs? (Look for the center: the average, median, or typical value.)
2. Are daily high temperatures for these months more predictable in Palm Springs or in Orlando? (Look at the spread:
the variation, including the range.)
3. What might be unique to Atlanta, San Diego, and Honolulu? (Look for outliers: unusual values.)
4. What makes San Juan and San Diego somewhat similar to one another? Likewise, Palm Springs, Phoenix, and
Orlando are similar to one another in this way, but different from the first group. (Look at the shape: symmetry vs.
asymmetry.)
[Dotplots: Average High Temperatures, axis from 60 to 90, one dotplot each for Palm Springs, Atlanta, Phoenix, San Diego, Orlando, Miami, Key West, Honolulu, and San Juan]
Read 25–27 Notice that we are now looking at quantitative data!
How should we describe the distribution of a quantitative variable? Use "SOCS" (Shape, Outliers, Center, Spread).
Center: Typical value, such as the mean or the median
Spread: Range for now (we'll also use standard deviation and interquartile range, "IQR")
Outliers: Unusual values for now (we'll eventually use the "1.5 × IQR rule")
Shape: Address the graph's number of peaks and its symmetry (unimodal, bimodal, multimodal, uniform, symmetric, asymmetric, skewed left, skewed right)
Read 27–29
Examples and descriptions of various shapes of distributions:
Unimodal Symmetric
Curve: Heights of adult women
Dotplot: Expected sums on 36 rolls of two 6-sided dice
Histogram: Length of growing seasons in St. Louis
Bimodal
Curve: Heights of men and women
Dotplot: Observed sums on 35 rolls of a 4-sided die and an 8-sided die
Histogram: Maximum angle of a sample of roller coasters
Unimodal Skewed Left
Curve: Heights of kids at a middle school dance
Dotplot: Time to finish a difficult test
Histogram: Heights in my extended family
Unimodal Skewed Right
Curve: Salaries of MLB players
Dotplot: Selling prices of homes in a new subdivision
Histogram: Scores on a multiple choice pre-test over completely new material
Uniform
Curve: Expected outcomes of spins of a spinner with equally-sized spaces numbered 1-10
Dotplot: Outcomes of 36 rolls of a 6-sided die
Histogram: Ages of students in a school district
Here are the numbers of calories per item for 16 convenience store sandwiches, along with a dotplot of the data.
360 430 440 440 440 450 450 460
470 480 480 490 490 490 500 510
Describe the shape, center, and spread of the distribution. Are there any outliers?
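Since a dotplot is just a count of how often each value occurs, a rough text version can be sketched in Python. This is an illustration only: it draws the plot sideways (one row per value) and assumes a step of 10 calories between rows, which happens to fit this data set.

```python
from collections import Counter

# Calories for the 16 sandwiches listed above.
calories = [360, 430, 440, 440, 440, 450, 450, 460,
            470, 480, 480, 490, 490, 490, 500, 510]

counts = Counter(calories)

# One row per 10-calorie step so that gaps in the data stay visible.
for value in range(min(calories), max(calories) + 1, 10):
    print(f"{value:>4} | {'*' * counts[value]}")
```

Printing the empty rows matters: the stretch with no stars between 360 and 430 is what makes the lowest value stand apart from the rest.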
Read 29–30
When asked to compare two distributions, be sure that you compare and don’t just describe!
Be sure that you use “less”, “more”, and “-er” words.
How does the annual energy consumption (kWh/year) compare for top-loading washing machines and front-loading washers? The data below is from the Home Depot website. There are 26 front-loaders and 32 top-loaders included.
Read 31–32
Caution! Remember to include a key when making a stemplot (stem-and-leaf plot).
If you write "19 | 7", is that 197, 19.7, 1970, ...?
How do gas prices in St. Charles County compare to those in Madison County, where Alton, Illinois is located? A sample of gas prices was taken on several days in July 2015. Make a back-to-back stemplot and compare the distributions.
St. Charles Co.: 2.56, 2.56, 2.57, 2.57, 2.58, 2.58, 2.58, 2.58, 2.59, 2.59, 2.59, 2.59, 2.60, 2.60, 2.61
Madison Co.: 2.67, 2.68, 2.69, 2.69, 2.70, 2.70, 2.70, 2.71, 2.71, 2.71, 2.71, 2.72, 2.72, 2.73, 2.74
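A back-to-back stemplot for these prices can be sketched in Python as a check on a hand-drawn one. The choice of stems (the first two digits, such as 25 for 2.56) and the helper name leaves_by_stem are assumptions made for this illustration; other stem choices are possible.

```python
st_charles = [2.56, 2.56, 2.57, 2.57, 2.58, 2.58, 2.58, 2.58,
              2.59, 2.59, 2.59, 2.59, 2.60, 2.60, 2.61]
madison = [2.67, 2.68, 2.69, 2.69, 2.70, 2.70, 2.70, 2.71,
           2.71, 2.71, 2.71, 2.72, 2.72, 2.73, 2.74]


def leaves_by_stem(prices):
    """Group each price's last digit (leaf) under its first two digits (stem)."""
    groups = {}
    for price in sorted(prices):
        cents = round(price * 100)  # whole cents avoid floating-point surprises
        groups.setdefault(cents // 10, []).append(cents % 10)
    return groups


left = leaves_by_stem(st_charles)
right = leaves_by_stem(madison)
stems = sorted(set(left) | set(right))
width = max(len(leaves) for leaves in left.values())

for stem in range(stems[0], stems[-1] + 1):
    # Left-side leaves read outward from the stem, so reverse them.
    left_leaves = "".join(str(d) for d in reversed(left.get(stem, [])))
    right_leaves = "".join(str(d) for d in right.get(stem, []))
    print(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
print("Key: 25 | 6 means 2.56")
```

St. Charles Co. leaves are printed to the left of the stems and Madison Co. leaves to the right, and the last line is the key that the caution above asks for.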
HW #13: page 41 (37, 39, 43, 45, 47)
1.2 Histograms
The following table presents the total number of triples (3B) for the 30 MLB teams in the 2014 regular season. Make a dotplot to display the distribution of triples for the season. Then, use your dotplot to make a histogram of the distribution.
Team 3B Team 3B Team 3B
Arizona 47 Pittsburgh 30 Toronto 24
San Francisco 42 San Diego 30 Tampa Bay 24
Colorado 41 Kansas City 29 Cleveland 23
LA Dodgers 38 Milwaukee 28 Atlanta 22
Miami 36 Texas 28 St. Louis 21
Oakland 33 Minnesota 27 Boston 20
Chicago Sox 32 Washington 27 Cincinnati 20
Seattle 32 Philadelphia 27 Houston 19
LA Angels 31 Detroit 26 NY Mets 19
Chicago Cubs 31 NY Yankees 26 Baltimore 16
Read 33–36
When you make a histogram...
...you can turn a dotplot into a histogram.
... be consistent with "fence sitters".
... be consistent with spacing and bin width.
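The points above can be illustrated with a small Python sketch that bins the triples data. The bin width of 5, the starting edge of 15, and the rule that a value on a boundary goes into the higher bin are all assumptions chosen for this example; what matters is applying one rule consistently.

```python
# Triples (3B) for the 30 MLB teams in the table above.
triples = [47, 42, 41, 38, 36, 33, 32, 32, 31, 31,
           30, 30, 29, 28, 28, 27, 27, 27, 26, 26,
           24, 24, 23, 22, 21, 20, 20, 19, 19, 16]

BIN_WIDTH = 5   # assumed bin width; other widths are equally valid
START = 15      # assumed left edge of the first bin

# Fence-sitter rule used here: a value on a boundary goes in the higher bin,
# so 20 is counted in the bin starting at 20, not the bin starting at 15.
bin_counts = {}
for value in triples:
    left_edge = START + ((value - START) // BIN_WIDTH) * BIN_WIDTH
    bin_counts[left_edge] = bin_counts.get(left_edge, 0) + 1

for left_edge in range(START, max(triples) + 1, BIN_WIDTH):
    count = bin_counts.get(left_edge, 0)
    print(f"{left_edge} to <{left_edge + BIN_WIDTH}: {'#' * count}")
```

Every bin has the same width and every team lands in exactly one bin, so the bar heights add up to 30.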
Read 38–41
When might we want a relative frequency histogram rather than a frequency histogram?
…to see part-whole relationships or to compare 2 groups
HW #14: page 45 (51, 53, 55, 59–62)
1.3 Describing Quantitative Data with Numbers
Read 48–50
x̄ is a statistic; "x bar" is the sample mean. μ is a parameter; "mu" is the population mean.
If adding a very large or very small data value to a data set (or changing a data value to something very large or very small) changes the value of a statistic very little, or not at all, we say that the statistic is resistant.
The mean is not a resistant measure of center. Adding an extreme value, or altering a value to make it extreme,
will change the value of the mean quite a bit. Think about what happens to the average age of people in the
classroom when Mr. Willott walks in.
The mean is the balancing point.
Approximately where will the mean be located, when looking at a histogram or dotplot?
Read 51–53
The median is a resistant measure of center. Adding an extreme value, or altering a value to make it extreme,
will not change the value of the median much, if at all. Think about what happens to the median age of people
in the classroom when Mr. Willott walks in.
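The classroom thought experiment can be tried with numbers. The ages in this Python sketch are hypothetical, made up purely for illustration (five students and one teacher); they are not data from the notes.

```python
from statistics import mean, median

# Hypothetical ages, made up for illustration: five students, then a teacher.
student_ages = [16, 17, 17, 17, 18]
teacher_age = 45
with_teacher = student_ages + [teacher_age]

print("students only:", mean(student_ages), median(student_ages))
print("with teacher: ", round(mean(with_teacher), 2), median(with_teacher))
```

With these made-up ages the mean rises from 17 to about 21.67 once the teacher is included, while the median stays at 17, which is what resistant and not resistant mean in practice.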
If we know the shape of a distribution, as shown below, then where are the mean and the median located in
relation to one another?
roughly symmetric exactly symmetric skewed
[Dotplot of StL_winter_Avg_High_Temps: Average High Temperatures, axis from 36 to 50]
Read 53–55
The range = highest data value minus lowest data value. The range is a single number and it is not a resistant
measure of spread. An extreme value will affect the value of the range. Think about what happens to the range
of ages of people in the classroom when Mr. Willott walks in.
The median divides an ordered list of data into two equal groups.
The quartiles divide an ordered list of data into four equal groups.
The interquartile range (IQR) is the spread of the middle 50% of the data. The IQR is a resistant measure of
spread. Think about what happens to the range of the middle 50% of ages of people in the classroom when Mr.
Willott walks in.
Here are data on the amount of fat (in grams) in 9 different Taco Bell menu items. Calculate the median,
quartiles, and IQR.
Read 57–58
What is the 1.5 IQR Rule for identifying outliers?
Illustration by
Kelly Boles
Item Fat (g)
Crunchy Taco 10
Nachos Supreme 24
Cheese Quesadilla 26
Chicken Quesadilla 27
Mexican Pizza 31
Taco Salad (steak) 37
Nachos BellGrande 39
XXL Grilled Stuft Burrito – Beef 41
Taco Salad (original) 42
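As a check on hand calculations for the Taco Bell data, here is a minimal Python sketch. It assumes the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when the number of values is odd; software packages sometimes use other conventions and can give slightly different quartiles.

```python
from statistics import median

# Fat (g) for the 9 Taco Bell items in the table above.
fat_grams = [10, 24, 26, 27, 31, 37, 39, 41, 42]


def quartiles(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    When the number of values is odd, the overall median is left out of
    both halves (an assumed convention, see the note above).
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(ordered), median(upper)


q1, med, q3 = quartiles(fat_grams)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

The IQR is simply Q3 minus Q1, so it describes the spread of the middle half of the items and ignores the very low Crunchy Taco value.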
How many fat grams would qualify as an outlier for the Taco Bell items?
Are there outliers among the 9 Taco Bell items?
Here are data for the calories for 12 McDonald’s menu items. Are there any outliers?
Read 56–58
The five-number summary: Minimum, Q1, Median, Q3, Maximum
A boxplot is a graph that is related to the five-number summary.
Draw a boxplot for the Taco Bell data. Check yours against the one that the graphing calculator makes.
Here are parallel boxplots for the heights of baseball players for 5 of the 2005 MLB teams. Compare these
distributions.
Item | Calories
32 oz. Chocolate Shake | 1160
Big Breakfast® | 740
Big Mac® | 540
Sausage Biscuit with Egg | 510
McRib® | 500
10 pc. McNuggets® | 460
Double Cheeseburger | 440
Quarter Pounder® | 410
Filet-O-Fish® | 380
McChicken® | 360
Large Caramel Latte | 330
Large Vanilla Iced Coffee | 270
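The 1.5 × IQR rule calls a value an outlier if it lies more than 1.5 × IQR below Q1 or above Q3. The Python sketch below applies that rule to the McDonald's calories; it assumes quartiles are found as the medians of the lower and upper halves of the ordered data, and the names low_fence and high_fence are illustrative.

```python
from statistics import median

# Calories for the 12 McDonald's items in the table above.
calories = [1160, 740, 540, 510, 500, 460, 440, 410, 380, 360, 330, 270]

ordered = sorted(calories)
half = len(ordered) // 2
q1 = median(ordered[:half])    # median of the lower half
q3 = median(ordered[-half:])   # median of the upper half
iqr = q3 - q1

# 1.5 x IQR rule: flag values beyond these two fences.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < low_fence or x > high_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {low_fence} and {high_fence}; outliers: {outliers}")
```

Under this convention the fences are 137.5 and 757.5 calories, so the 1160-calorie shake is flagged while the 740-calorie Big Breakfast is not.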
Item Fat (g)
Crunchy Taco 10
Nachos Supreme 24
Cheese Quesadilla 26
Chicken Quesadilla 27
Mexican Pizza 31
Taco Salad (steak) 37
Nachos BellGrande 39
XXL Grilled Stuft Burrito – Beef 41
Taco Salad (original) 42
HW #15: page 47 (65, 69–74), page 69 (79, 81, 83, 85, 86, 87, 89, 91, 93)
1.3 Standard Deviation
Arnold ran each afternoon for 5 days. His distances (in miles) were 10, 10, 10, 10, and 10.
Find the mean (or average) number of miles that Arnold ran each day. ____________________
Complete the table:
Table for Arnold's distances
Distances | Difference from the mean | Square of difference from the mean
10
10
10
10
10
Sum of squared differences:
Sum of squared differences divided by 4 (since there were 5 distances):
Square root of the sum of squared differences divided by 4:
That last value is the standard deviation for the distances Arnold ran. What are the units? ____________
The number above it is the variance for the distances. What are the units? ____________
Becky ran each afternoon for 5 days. Her distances (in miles) were 8, 9, 10, 11, and 12.
Find the mean (or average) number of miles that Becky ran each day. ____________________
Complete the table:
Table for Becky's distances
Distances | Difference from the mean | Square of difference from the mean
8
9
10
11
12
Sum of squared differences:
Sum of squared differences divided by 4 (since there were 5 distances):
Square root of the sum of squared differences divided by 4:
That last value is the standard deviation for the distances Becky ran. What are the units? ____________
The number above it is the variance for the distances. What are the units? ______________
Caleb ran each afternoon for 5 days. His distances (in miles) were 7, 9, 10, 11, and 13.
Find the mean (or average) number of miles that Caleb ran each day. ____________________
Complete the table:
Table for Caleb's distances
Distances | Difference from the mean | Square of difference from the mean
7
9
10
11
13
Sum of squared differences:
Sum of squared differences divided by 4 (since there were 5 distances):
Square root of the sum of squared differences divided by 4:
That last value is the standard deviation for the distances Caleb ran. What are the units? _____________
The number above it is the variance for the distances. What are the units? _________________
Donna ran each afternoon for 5 days. Her distances (in miles) were 3, 3, 4, 5, and 35.
Find the mean (or average) number of miles that Donna ran each day. ____________________
Complete the table:
Table for Donna's distances
Distances | Difference from the mean | Square of difference from the mean
3
3
4
5
35
Sum of squared differences:
Sum of squared differences divided by 4 (since there were 5 distances):
Square root of the sum of squared differences divided by 4:
That last value is the standard deviation for the distances Donna ran. What are the units? ___________
The number above it is the variance for the distances. What are the units? ____________
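The table steps used for the four runners can be written as a short Python sketch, useful for checking the hand calculations. The function name sample_sd_steps is made up for this illustration; the steps themselves follow the tables above (differences from the mean, squares, sum, divide by one less than the number of distances, square root).

```python
from math import sqrt


def sample_sd_steps(distances):
    """Follow the table: deviations, squares, sum, divide by n - 1, square root."""
    n = len(distances)
    mean = sum(distances) / n
    deviations = [x - mean for x in distances]
    squared = [d ** 2 for d in deviations]
    variance = sum(squared) / (n - 1)   # 5 distances, so divide by 4
    return mean, variance, sqrt(variance)


runners = {
    "Arnold": [10, 10, 10, 10, 10],
    "Becky": [8, 9, 10, 11, 12],
    "Caleb": [7, 9, 10, 11, 13],
    "Donna": [3, 3, 4, 5, 35],
}

for name, distances in runners.items():
    mean, variance, sd = sample_sd_steps(distances)
    print(f"{name}: mean = {mean}, variance = {variance}, sd = {sd:.2f}")
```

All four runners have the same mean, so any difference in the standard deviations comes entirely from how far the individual distances sit from that mean.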
The standard deviation measures the typical distance of the data values from the mean.
The range, IQR, and standard deviation all measure variation or spread, but only the IQR is resistant.
Read 60–62
If s = 4, then s² = 16. If s² = 9, then s = 3. If σ² = 25, then σ = 5. If σ = 6, then σ² = 36.
Four important properties of the standard deviation:
Standard deviation ≥ 0. (0 means no variation, a large number means lots of variation.)
Standard deviation units are the same as the units for the data.
Standard deviation is not resistant.
Standard deviation measures spread around the mean.
s=5 s=6.22 s=9.52 s=10.7
A random sample of 5 students was asked how many minutes they spent listening to music outside school hours
the previous day. They responded: 20, 30, 60, 90, 120. Calculate and interpret the standard deviation.
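For this example, Python's standard statistics module can serve as a check on a hand or calculator computation. Its variance and stdev functions use the sample formulas (dividing by n - 1), which matches a random sample of 5 students; the variable names here are illustrative.

```python
from statistics import mean, stdev, variance

# Minutes of music reported by the 5 sampled students.
minutes = [20, 30, 60, 90, 120]

x_bar = mean(minutes)            # sample mean
s_squared = variance(minutes)    # sample variance (divides by n - 1)
s = stdev(minutes)               # sample standard deviation

print(f"mean = {x_bar}, variance = {s_squared}, s = {s:.1f} minutes")
```

One reasonable way to phrase the interpretation: the listening times of these students typically differ from the mean of 64 minutes by about 41.6 minutes.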
Read 63–66
Of mean, median, IQR, and standard deviation, which summary statistics will we typically use for each
situation?
 | Symmetric | Skewed
Center | ____________ | ____________
Spread | ____________ | ____________
Standard deviation | Variance
Square root of variance | Square of standard deviation
s = sample standard deviation | s² = sample variance
σ = population standard deviation | σ² = population variance