Chapter 2: Descriptive Statistics, Part 3
FIGURE 2.13
For some sets of data, some of the five values (Min, Q1, Median, Q3, Max) may be the same.
If the Median is also Q1 or Q3, the diagram may not have a dotted line inside the box displaying the median.
The right side of the box would display both the third quartile and the median.
For example, if the smallest value and the first quartile were both one, the median and the third quartile were both five, and the largest value was seven, the box plot would look like this:
[Figure: box plot with Min = Q1 = 1, Median = Q3 = 5, Max = 7]
In this case, at least 25% of the values are equal to one. (Why?)
Also, at least 25% of the values are equal to five. (Why?)
The top 25% of values fall between five and seven, inclusive. (Why?)
Measures of the Center of the Data
The “center” of a data set is a way of describing location.
Usually the Mean, Median, and Mode are used here.
The Mean is the most common.
The Median is better when there are extreme values or outliers, since the median is not affected by them.
We will talk about two different kinds of means:
Sample Mean: indicated by $\bar{x}$ (pronounced “x-bar”)
Population Mean: indicated by $\mu$ (pronounced “mew”)
Measures of the Center of the Data
To find the mean, add up all the values in your data and divide by n, the number of values in your data set.
Alternatively, you can multiply each distinct value by its frequency, sum these products, and then divide by n.
To quickly find the location of the median in the ordered data, use the expression $\frac{n + 1}{2}$.
If n is odd, you will get a whole number, and this is the location of the middle value of the ordered data.
If n is even, you will get a decimal value.
Suppose n = 50; then the expression gives 25.5, meaning that the median lies halfway between the 25th and 26th values.
To find the median in this case, take the mean of these two middle values.
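The two ways of finding the mean, and the median-location expression, can be sketched in Python. The shoe sizes below are hypothetical values chosen for illustration, not the class data set stored in L1.

```python
from collections import Counter

# Hypothetical shoe sizes (an assumption, not the class data).
sizes = [7, 8, 8, 9, 9, 9, 10, 11]
n = len(sizes)

# Method 1: add every value and divide by n.
mean_direct = sum(sizes) / n

# Method 2: multiply each distinct value by its frequency, sum, divide by n.
freq = Counter(sizes)
mean_by_freq = sum(value * count for value, count in freq.items()) / n

# Location of the median in the ordered data: (n + 1) / 2.
location = (n + 1) / 2
ordered = sorted(sizes)
if location.is_integer():
    # n is odd: the location is a whole number (positions count from 1).
    median = ordered[int(location) - 1]
else:
    # n is even: average the two values on either side of the location.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    median = (lower + upper) / 2

print(mean_direct, mean_by_freq, location, median)
```

Here n = 8 is even, so the location comes out as 4.5 and the median is the mean of the 4th and 5th ordered values.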
We should still have the list of shoe sizes in L1 on our calculators; find the mean and median using the directions above.
Don’t clear L1 just yet; we’re going to use the data we had in there…
Measures of the Center of the Data
The Mode is the most frequent value in your set of data.
You can have more than one mode, as long as those values have the same frequency and that frequency is the highest.
A set with two modes is called bimodal.
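As a small illustration of a data set with more than one mode, here is a sketch using the `multimode` function from Python's standard `statistics` module (available in Python 3.8 and later). The data values are made up for the example.

```python
from statistics import multimode

# Hypothetical data: 2 and 4 each appear twice, more often than any other value.
data = [1, 2, 2, 3, 4, 4, 5]

modes = multimode(data)      # every value that shares the highest frequency
is_bimodal = len(modes) == 2

print(modes, is_bimodal)
```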
The Law of Large Numbers and the Mean
The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean, $\bar{x}$, of the sample is very likely to get closer and closer to $\mu$.
We will discuss this in more detail later on in the semester
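One way to see the Law of Large Numbers in action is a quick simulation. The sketch below assumes a population of fair six-sided die rolls (so the population mean is 3.5); that population, the seed, and the helper name `sample_mean` are all choices made for this illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

POPULATION_MEAN = 3.5  # mean of a fair six-sided die

def sample_mean(n):
    """Mean of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Larger samples tend to give sample means closer to the population mean.
for n in (10, 1_000, 100_000):
    x_bar = sample_mean(n)
    print(n, x_bar, abs(x_bar - POPULATION_MEAN))
```

The law says "very likely", not "always": any single small sample can still land close to 3.5 by chance, and a large one can miss slightly.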
A Sampling Distribution is a relative frequency distribution of a particular statistic (e.g. mean, proportion, median) with many, many samples.
Suppose thirty randomly selected students were asked the number of movies they watched in the previous week.
If you let the number of samples get very large (say, 300 million or more), the relative frequency table becomes a relative frequency distribution.
The Law of Large Numbers and the Mean
A statistic is a number calculated from the sample
Sample statistic examples include the mean, the median, and the mode, as well as others
The sample mean $\bar{x}$ is an example of a statistic; it estimates the population mean $\mu$.
Calculating the Mean of Grouped Frequency Tables
When only grouped data is available, you do not know the individual data values (we only know intervals and interval frequencies).
Therefore, you cannot compute an exact mean for the data set.
In this case we must estimate the actual mean by calculating the mean of a frequency table:
Find the midpoint of each interval
Multiply midpoint by frequency
Sum and divide by n
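The three steps above can be sketched as follows. The intervals and frequencies are hypothetical, invented only to show the arithmetic.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound, frequency).
table = [
    (50, 60, 4),
    (60, 70, 10),
    (70, 80, 6),
]

n = sum(freq for _, _, freq in table)  # total number of data values

# Step 1: midpoint of each interval; Step 2: midpoint times frequency.
products = [((low + high) / 2) * freq for low, high, freq in table]

# Step 3: sum the products and divide by n.
estimated_mean = sum(products) / n

print(n, products, estimated_mean)
```

The result is only an estimate, because each value is treated as if it sat at the midpoint of its interval.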
Skewness and the Mean, Median, and Mode
This histogram is symmetric.
You can draw a vertical line at some point in the graph, and the left and right sides are mirror images of each other.
Notice that in this data set, the median, mean, and mode are the same.
In a perfectly symmetrical distribution, the mean and median are the same.
It is possible to have a bimodal symmetric distribution.
The two modes would be different from the mean and the median.
Skewness and the Mean, Median, and Mode
This histogram is not symmetric.
The right hand side seems to be ‘chopped off’ compared to the left hand side.
We call this skewed to the left because it is pulled out to the left
The mean is 6.4, the median 6.5, and the mode is 7
Notice that the mean is less than the median, and they are both less than the mode
The mean and the median both reflect the skewing, but the mean reflects it more so
Skewness and the Mean, Median, and Mode
This histogram is not symmetric.
The left hand side seems to be ‘chopped off’ compared to the right hand side.
We call this skewed to the right because it is pulled out to the right
The mean is 7.7, the median 7.5, and the mode is 7
Notice that the mean is the largest, and the mode is the smallest
The mean and the median both reflect the skewing, but the mean reflects it more so
Skewness and the Mean, Median, and Mode
Generally, if the distribution of data is skewed to the left:
The mean is less than the median, which is often less than the mode.
If the distribution is skewed to the right:
The mode is often less than the median, which is less than the mean.
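To check the right-skewed ordering numerically, here is a short sketch. The data set is hypothetical: it was chosen so that its mean, median, and mode come out to 7.7, 7.5, and 7, the values quoted for the right-skewed histogram, but it is not necessarily the data behind that histogram.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data (an assumption for illustration).
data = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]

x_bar = mean(data)
med = median(data)
mo = mode(data)

print(x_bar, med, mo)

# For data skewed to the right we expect: mode < median < mean.
print(mo < med < x_bar)
```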
Measures of the Spread of the Data
An important characteristic of any set of data is the variation in the data.
In some data sets, the data values are concentrated closely near the mean.
In other data sets, the data values are more widely spread out from the mean.
The most common measure of variation, or spread, is the standard deviation.
Standard Deviation – a number that measures how far data values are from their mean
The Standard Deviation
Provides a numerical measure of the overall amount of variation in a data set, and
Can be used to determine whether a particular data value is close to or far from the mean
Standard Deviation
The standard deviation:
provides a numerical measure of the overall variation in a data set
is always positive or zero
can be used to determine whether a particular data value is close to or far from the mean
is small when the data are concentrated close to the mean, exhibiting little variation or spread
is larger when the data values are more spread out from the mean, exhibiting more variation
Wait times at In N Out
Let x be the number of minutes a person waits for an order at In N Out at lunchtime Monday through Friday
x: 23, 18, 21, 32, 19
We can say that the range of the wait times is 32 - 18 = 14 minutes, but this doesn’t really describe the variation; it’s simply the largest value minus the smallest value.
One really big order in front of you could change the range drastically!
We’re going to do most of this with calculators, but let’s build it up by hand so we can see what it’s all about.
We’re going to need the mean of the data set; it’s $\bar{x} = \frac{23 + 18 + 21 + 32 + 19}{5} = 22.6$.
And then we’re going to add another column to the table: the difference between each data value and the mean, $x - \bar{x}$.
The values $x - \bar{x}$ are called deviations from the mean, because they express how far each datum is from the mean.
If x is less than the mean, its deviation from the mean is negative.
Notice that the column adds up to zero.
This is how it should be, since the mean is like a balancing point.
Next, we’ll add the final column, the $(x - \bar{x})^2$ values.
x     $x - \bar{x}$
23     0.4
18    -4.6
21    -1.6
32     9.4
19    -3.6
Sum:   0

x     $x - \bar{x}$    $(x - \bar{x})^2$
23     0.4              0.16
18    -4.6             21.16
21    -1.6              2.56
32     9.4             88.36
19    -3.6             12.96
This column is the squared deviations from the mean (obviously!)
Remember, the numbers in this column will be positive
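The two columns can be rebuilt in a few lines of Python, using the five wait times from the table. This is only a sketch of the by-hand calculation; the variable names are arbitrary.

```python
waits = [23, 18, 21, 32, 19]          # minutes, from the table
x_bar = sum(waits) / len(waits)       # mean of the sample

# Deviations from the mean, then squared deviations.
deviations = [x - x_bar for x in waits]
squared = [d ** 2 for d in deviations]

for x, d, sq in zip(waits, deviations, squared):
    print(x, round(d, 1), round(sq, 2))

# The deviations balance out to zero (up to floating-point rounding).
print(round(sum(deviations), 10))
```

Rounding is used when printing because floating-point arithmetic can leave tiny errors such as 0.3999999999999986 in place of 0.4.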
Now we do various things with this third column, $(x - \bar{x})^2$.
First, we add it up: $\sum (x - \bar{x})^2 = 125.2$
Then, we divide this sum by one less than the sample size, n - 1:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{125.2}{4} = 31.3$
Why minus one? It turns out that dividing by n - 1 gives a better estimate of the variance of the population of which this data set is a sample (try not to worry about this).
We label this quotient $s^2$, and call it the sample variance.
So, 31.3 is the sample variance.
Next, we find the sample standard deviation, which we label s, because $s = \sqrt{s^2}$.
That is, the sample standard deviation is the square root of the sample variance.
$s = \sqrt{31.3} \approx 5.5946$
We’ll round this to the nearest tenth: $s = 5.6$
In summary, the formula for the sample standard deviation is:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Memorize this.
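As a check on the by-hand work, the sample variance and sample standard deviation can be computed directly from the formula and compared with `statistics.stdev` from Python's standard library, which also divides by n - 1. This is a sketch, with variable names chosen for the example.

```python
import math
import statistics

waits = [23, 18, 21, 32, 19]   # minutes, from the table
n = len(waits)
x_bar = sum(waits) / n

# Sum of squared deviations, divided by n - 1, gives the sample variance.
sum_sq = sum((x - x_bar) ** 2 for x in waits)
s_squared = sum_sq / (n - 1)

# The sample standard deviation is the square root of the sample variance.
s = math.sqrt(s_squared)

print(round(sum_sq, 1), round(s_squared, 1), round(s, 1))
print(statistics.stdev(waits))  # library value for comparison
```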
So, the sample variance is $s^2 = 31.3$
And the sample standard deviation is s = 5.6
But what does this mean?
In general, a data value that is two standard deviations from the average is on the borderline for what many statisticians would consider to be far from average.
In this sample, $\bar{x} = 22.6$
One standard deviation from the mean: 22.6 ± 5.6 gives 17 and 28.2 minutes
Two standard deviations from the mean: 22.6 ± 2(5.6) gives 11.4 and 33.8 minutes
If you were to wait less than 11.4 minutes, or more than 33.8 minutes, that would be far from average.
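The two-standard-deviation borderline can be turned into a small check. The function name `is_far_from_average` and the default cutoff of two standard deviations are choices made for this sketch; the slide only describes two standard deviations as a borderline, not a strict rule.

```python
x_bar = 22.6   # sample mean, in minutes
s = 5.6        # sample standard deviation, rounded to the nearest tenth

def is_far_from_average(x, mean=x_bar, sd=s, k=2):
    """True if x lies more than k standard deviations from the mean."""
    return abs(x - mean) > k * sd

low = x_bar - 2 * s    # lower borderline
high = x_bar + 2 * s   # upper borderline
print(round(low, 1), round(high, 1))

for wait in (10, 32, 35):
    print(wait, is_far_from_average(wait))
```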
In general:
Sample: $x = \bar{x} + (\#\text{ of STDEV})(s)$
Population: $x = \mu + (\#\text{ of STDEV})(\sigma)$, where $\mu$ is the population mean and $\sigma$ (“sigma”) is the population standard deviation
# of STDEV does not need to be an integer.
Formulas for the Standard Deviations
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
If the sample has the same characteristics as the population, then s should be a good estimate of $\sigma$.
$\sigma^2$ represents the population variance, just as $s^2$ represents the sample variance.
Also note that if we have a census (so, the whole population), we divide by N, the number of items in the population.
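The difference between dividing by n - 1 and dividing by N can be seen with Python's standard `statistics` module, which offers `stdev` (sample) and `pstdev` (population). Treating the five wait times as an entire population is an assumption made purely to compare the two formulas on the same numbers.

```python
import statistics

waits = [23, 18, 21, 32, 19]

# Sample versions divide the sum of squared deviations by n - 1.
s_squared = statistics.variance(waits)
s = statistics.stdev(waits)

# Population versions divide by N (here we PRETEND the five values are a census).
sigma_squared = statistics.pvariance(waits)
sigma = statistics.pstdev(waits)

print(round(s_squared, 2), round(s, 2))
print(round(sigma_squared, 2), round(sigma, 2))
```

Because N is larger than n - 1, the population formula gives the smaller result on the same data.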
Sampling Variability of a Statistic
How much the statistic varies from one sample to another is known as the sampling variability of a statistic.
◦ You typically measure the sampling variability of a statistic by its standard error.
The standard error of the mean is an example of a standard error.
◦ It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean.
◦ We will cover the standard error of the mean in the chapter on The Central Limit Theorem.
◦ The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$.
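As a preview of the notation only, the standard error of the mean is the population standard deviation divided by the square root of the sample size. The numbers below are hypothetical, and the function name is just a label for this sketch.

```python
import math

def standard_error_of_mean(sigma, n):
    """Population standard deviation divided by the square root of n."""
    return sigma / math.sqrt(n)

# Hypothetical values: sigma = 6, samples of size 36.
se = standard_error_of_mean(6, 36)
print(se)

# Larger samples give a smaller standard error for the same sigma.
print(standard_error_of_mean(6, 144))
```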
Comparing Values from Different Data Sets
The standard deviation is useful when comparing data values that come from different data sets.
If the data sets have different means and standard deviations, then comparing the data values directly can be misleading.
For each data value, calculate the number of standard deviations between it and the mean.
Use the formula: value = mean + (# of STDEVs)(standard deviation)
Solve for # of STDEVs: # of STDEVs = (value - mean) / (standard deviation)
Compare the results of this calculation.
# of STDEVs is often called a “z-score”; we can use the symbol z.
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
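Here is a sketch of a z-score comparison between two data sets with different scales. The two students, their scores, and the school means and standard deviations are all hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical student A: GPA 2.85 at a school with mean 3.0 and SD 0.7.
z_a = z_score(2.85, 3.0, 0.7)

# Hypothetical student B: score 77 at a school with mean 80 and SD 10.
z_b = z_score(77, 80, 10)

print(round(z_a, 3), round(z_b, 3))

# Both are below their school's mean; the larger z-score is the better
# result relative to that student's own school.
print("A" if z_a > z_b else "B")
```

The raw numbers 2.85 and 77 cannot be compared directly, but the z-scores can, because both are measured in standard deviations from the mean.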
A few facts about what the Standard Deviation tells us
For ANY data set, no matter what the distribution of the data is:
At least 75% of the data is within two standard deviations of the mean
At least 89% of the data is within three standard deviations of the mean
At least 95% of the data is within 4.5 standard deviations of the mean
This is known as Chebyshev’s Rule.
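The three percentages come from the general form of Chebyshev's Rule: at least $1 - \frac{1}{k^2}$ of the data lies within k standard deviations of the mean, for any k greater than 1. A short sketch (the function name is chosen for this example):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4.5):
    print(k, round(chebyshev_lower_bound(k) * 100, 1))
```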
A few facts about what the Standard Deviation tells us
For data having a distribution that is Bell-Shaped and Symmetric:
Approximately 68% of the data is within one standard deviation of the mean
Approximately 95% of the data is within two standard deviations of the mean
More than 99% of the data is within three standard deviations of the mean
This is known as the Empirical Rule.
It is important to note that this rule only applies when the shape of the distribution of the data is bell shaped and symmetric.
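The Empirical Rule percentages can be checked against the normal distribution, the standard example of a bell-shaped, symmetric distribution. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 and later); using the normal curve as the model is an assumption of the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 1))
```

The exact normal values are slightly different from the rounded 68% and 95% figures, which is why the rule says "approximately".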
Homework! Beginning on page 136:
74, 78, 79, 80, 83, 90, 104, 115, 116, 118