chapter 2 part 3

Chapter 2: Descriptive Statistics, Part 3

FIGURE 2.13

For some sets of data, some of the five values (Min, Q1, Median, Q3, Max) may be the same.

If the Median is also Q1 or Q3, the diagram may not have a dotted line inside the box displaying the median.

The right side of the box would display both the third quartile and the median.

For example, if the smallest value and the first quartile were both one, the median and the third quartile were both five, and the largest value was seven, the box plot would look like this:

In this case, at least 25% of the values are equal to one. (Why?)

Also, at least 25% of the values are equal to five. (why?)

The top 25% of values fall between five and seven, inclusive. (Why?)

Measures of the Center of the Data

The “center” of a data set is a way of describing its location.

The mean, median, and mode are the measures usually used here.

The mean is the most common.

The median is better when there are extreme values or outliers, since the median is not affected by the precise size of those values.

We will talk about two different kinds of means:

Sample Mean: indicated by $\bar{x}$ (pronounced “x-bar”)

Population Mean: indicated by $\mu$ (pronounced “mew”)


To find the mean, add up all values in your data, and divide by n, the number of values in your data set

Alternatively, you can multiply each distinct value by its frequency, sum these values, and then divide by n
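Both routes to the mean can be checked with a few lines of Python. This is only a sketch: the data values below are made up for illustration and are not the class shoe-size list.

```python
from collections import Counter

# Hypothetical data set, invented for illustration only
data = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
n = len(data)

# Method 1: add up every value and divide by n
mean_direct = sum(data) / n

# Method 2: multiply each distinct value by its frequency,
# sum those products, then divide by n
freq = Counter(data)  # maps each distinct value to how often it appears
mean_from_freq = sum(value * count for value, count in freq.items()) / n

print(mean_direct, mean_from_freq)
```

The two methods always agree, because the frequency version just groups equal terms of the same sum.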

To quickly find the location of the median, use the expression $\frac{n + 1}{2}$

If n is odd, you will get a whole number answer, and this is the location of the middle value of the ordered data.

If n is even, you will get a decimal value ending in .5.

Suppose n = 50, then the above equation would give you 25.5, meaning that the median value lies halfway between the 25th and 26th values.

To find the median in this case, take the mean of these two middle values.
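The location rule can be sketched in Python as follows. The function name `median_by_location` is made up for this illustration; positions are counted starting at 1, as on the slide, while Python lists count from 0.

```python
def median_by_location(values):
    """Find the median using the (n + 1) / 2 location rule."""
    ordered = sorted(values)       # the rule applies to ordered data
    n = len(ordered)
    location = (n + 1) / 2         # 1-based position of the median

    if n % 2 == 1:
        # Odd n: location is a whole number, so take that value directly
        return ordered[int(location) - 1]

    # Even n: location ends in .5 (e.g. 25.5), so average the two
    # neighbouring values (e.g. the 25th and 26th)
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


print(median_by_location([3, 1, 2]))
print(median_by_location(list(range(1, 51))))  # n = 50, location 25.5
```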


We should still have the list of shoe sizes in list L1 on our calculators; find the mean and median using the directions above.

Don’t clear L1 just yet; we’re going to use the data we had in there…


The Mode is the most frequent value in your set of data.

You can have more than one mode, as long as those values have the same frequency and that frequency is the highest.

A set with two modes is called bimodal.
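As a quick check of the idea, Python's standard `statistics.multimode` (available in Python 3.8 and later) returns every value tied for the highest frequency. The two data sets are invented for illustration.

```python
import statistics

# Hypothetical data: 7 appears most often, so there is one mode
one_mode = [5, 6, 7, 7, 7, 8, 9]

# Hypothetical data: 8 and 9 each appear twice, more than any other value
two_modes = [7, 8, 8, 9, 9, 10]

modes_a = statistics.multimode(one_mode)
modes_b = statistics.multimode(two_modes)

print(modes_a)  # a list with a single mode
print(modes_b)  # a list with two modes, so the set is bimodal
```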

The Law of Large Numbers and the Mean

The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean, $\bar{x}$, of the sample is very likely to get closer and closer to $\mu$

We will discuss this in more detail later on in the semester
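A small simulation can make this concrete. The sketch below assumes a population that is not from these notes: rolls of a fair six-sided die, whose population mean is 3.5. The seed is fixed only so the run is repeatable.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
population_mean = 3.5    # mean of a fair six-sided die (assumed population)


def sample_mean(n):
    """Roll the die n times and return the sample mean."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n


# Sample means for larger and larger sample sizes
sample_means = {n: sample_mean(n) for n in (10, 100, 10_000, 100_000)}

for n, xbar in sample_means.items():
    print(n, xbar, abs(xbar - population_mean))
```

The gap between the sample mean and 3.5 does not have to shrink at every single step, but it is very likely to be small once the sample is large.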

A Sampling Distribution is a relative frequency distribution of a particular statistic (e.g. mean, proportion, median) with many, many samples.

Suppose thirty randomly selected students were asked the number of movies they watched in the previous week.

If you let the number of samples get very large (say, 300 million or more), the relative frequency table becomes a relative frequency distribution.

The Law of Large Numbers and the Mean

A statistic is a number calculated from the sample

Sample statistic examples include the mean, the median, and the mode, as well as others

The sample mean $\bar{x}$ is an example of a statistic which estimates the population mean $\mu$

Calculating the Mean of Grouped Frequency Tables

When only grouped data is available, you do not know the individual data values (we only know intervals and interval frequencies).

Therefore, you cannot compute an exact mean for the data set.

In this case we must estimate the actual mean by calculating the mean of a frequency table:

Find the midpoint of each interval

Multiply midpoint by frequency

Sum and divide by n
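The three steps above can be sketched as follows. The grouped table is hypothetical (made-up intervals and frequencies), and the result is only an estimate because each value is treated as if it sat at the midpoint of its interval.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound, frequency)
table = [
    (0, 10, 3),
    (10, 20, 5),
    (20, 30, 2),
]

# n is the total number of data values, i.e. the sum of the frequencies
n = sum(freq for _, _, freq in table)

# Step 1: midpoint of each interval
# Step 2: multiply each midpoint by its frequency
products = [((low + high) / 2) * freq for low, high, freq in table]

# Step 3: sum the products and divide by n
estimated_mean = sum(products) / n

print(estimated_mean)
```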

Skewness and the Mean, Median, and Mode

This histogram is symmetric.

You can draw a vertical line at some point in the graph, and the left and right sides are mirror images of each other.

Notice that in this data set, the median, mean, and mode are the same.

In a perfectly symmetrical distribution, the mean and median are the same.

It is possible to have a bimodal symmetric distribution.

The two modes would be different from the mean and the median.


This histogram is not symmetric.

The right hand side seems to be ‘chopped off’ compared to the left hand side.

We call this skewed to the left because it is pulled out to the left

The mean is 6.4, the median 6.5, and the mode is 7

Notice that the mean is less than the median, and they are both less than the mode

The mean and the median both reflect the skewing, but the mean reflects it more so


This histogram is not symmetric.

The left hand side seems to be ‘chopped off’ compared to the right hand side.

We call this skewed to the right because it is pulled out to the right

The mean is 7.7, the median 7.5, and the mode is 7

Notice that the mean is the largest, and the mode is the smallest

The mean and the median both reflect the skewing, but the mean reflects it more so


Generally, if the distribution of data is skewed to the left:

The mean is less than the median, which is often less than the mode.

If the distribution is skewed to the right:

The mode is often less than the median, which is less than the mean.
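These orderings can be checked on small data sets. The two lists below were chosen for illustration only; they are not necessarily the data behind the histograms in these notes.

```python
import statistics

# Illustrative data with a longer tail on the left
left_skewed = [4, 5, 6, 6, 6, 7, 7, 7, 7, 8]

# Illustrative data with a longer tail on the right
right_skewed = [6, 7, 7, 7, 7, 8, 8, 8, 9, 10]


def summary(data):
    """Return (mean, median, mode) for a data set."""
    return (statistics.mean(data),
            statistics.median(data),
            statistics.mode(data))


left_mean, left_median, left_mode = summary(left_skewed)
right_mean, right_median, right_mode = summary(right_skewed)

print(left_mean, left_median, left_mode)      # mean < median < mode
print(right_mean, right_median, right_mode)   # mode < median < mean
```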

Measures of the Spread of the Data

An important characteristic of any set of data is the variation in the data.

In some data sets, the data values are concentrated closely near the mean.

In other data sets, the data values are more widely spread out from the mean.

The most common measure of variation, or spread, is the standard deviation

Standard Deviation – a number that measures how far data values are from their mean

The Standard Deviation:

Provides a numerical measure of the overall amount of variation in a data set, and

Can be used to determine whether a particular data value is close to or far from the mean

Standard Deviation

The standard deviation:

provides a numerical measure of the overall variation in a data set

is always positive or zero

can be used to determine whether a particular data value is close to or far from the mean

is small when the data are concentrated close to the mean, exhibiting little variation or spread

is larger when the data values are more spread out from the mean, exhibiting more variation

Wait times at In N Out

Let x be the number of minutes a person waits for an order at In N Out at lunchtime Monday through Friday

x: 23, 18, 21, 32, 19

We can say that the range of time they waited is 32 - 18 = 14 minutes, but this doesn’t really describe the variation; it’s simply the largest value minus the smallest value.

One really big order in front of you could really change the range drastically!

We’re going to do most of this with calculators, but let’s build it up by hand so we can see what it’s all about.

We’re going to need the mean of the data set; it’s 22.6 ($\bar{x} = 113/5 = 22.6$)

And then we’re going to add another column to the table, which is the difference between each data value and the mean, $x - \bar{x}$



The values $x - \bar{x}$ are called deviations from the mean, because they express how far each datum is from the mean

If x is less than the mean, its deviation from the mean is negative

Notice that the column adds up to zero?

This is how it should be, since the mean is like a balancing point

Next, we’ll add the final column, the $(x - \bar{x})^2$ values.

x     $x - \bar{x}$
23    0.4
18    -4.6
21    -1.6
32    9.4
19    -3.6
Sum:  0

x     $x - \bar{x}$    $(x - \bar{x})^2$
23    0.4              0.16
18    -4.6             21.16
21    -1.6             2.56
32    9.4              88.36
19    -3.6             12.96

This column is the squared deviations from the mean (obviously!)

Remember, the numbers in this column will never be negative




Now we do various things with this third column, $(x - \bar{x})^2$.

First, we add it up:

$\sum (x - \bar{x})^2 = 125.2$

Then, we divide this sum by one less than the sample size, n - 1:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{125.2}{4} = 31.3$

Why minus one? It turns out that dividing by n - 1 gives a better fit for the population of which this data set is a sample (try not to worry about this)

We label this quotient $s^2$, and call it the sample variance




So, 31.3 is the sample variance.

Next, we find the sample standard deviation, which we label s, because…

$s = \sqrt{s^2}$

That is, the sample standard deviation is the square root of the sample variance.

$s = \sqrt{31.3} \approx 5.5946$

We’ll round this to the nearest tenth…

$s \approx 5.6$

In summary, the formula for the sample standard deviation is:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Memorize this
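The by-hand build-up for the wait-time data can be mirrored column by column in Python. This is just a sketch of the same arithmetic; the variable names are made up for the illustration.

```python
import math

waits = [23, 18, 21, 32, 19]   # wait times in minutes, from the table
n = len(waits)

# Mean of the data set
xbar = sum(waits) / n

# Second column: deviations from the mean, x - xbar
deviations = [x - xbar for x in waits]

# Third column: squared deviations, (x - xbar)^2
squared_deviations = [d ** 2 for d in deviations]

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance
s = math.sqrt(s2)

print(xbar, sum(squared_deviations), s2, round(s, 1))
```

Because of floating-point rounding, the deviations may add up to a number extremely close to zero rather than exactly zero.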




So, the sample variance, $s^2$ = 31.3

And the sample standard deviation, s = 5.6

But what does this mean?

In general, a data value that is two standard deviations from the average is on the borderline for what many statisticians would consider to be far from average

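In this sample, $\bar{x}$ = 22.6

So, 22.6 ± 5.6 gives 17.0 and 28.2 minutes (one standard deviation either side of the mean)

22.6 ± 2(5.6) gives 11.4 and 33.8 minutes (two standard deviations either side of the mean)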

If you were to wait less than 11.4 minutes, or more than 33.8 minutes, that would be far from average
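The two-standard-deviation guideline can be written as a small check. The helper names `interval` and `is_far_from_average` are hypothetical, and the cutoff of two standard deviations is the rule of thumb from these notes, not a strict definition.

```python
xbar = 22.6   # sample mean, in minutes
s = 5.6       # sample standard deviation, rounded to the nearest tenth


def interval(k):
    """Return the values k standard deviations below and above the mean."""
    return (xbar - k * s, xbar + k * s)


def is_far_from_average(x, k=2):
    """True if x lies more than k standard deviations from the mean."""
    low, high = interval(k)
    return x < low or x > high


print(interval(1))   # roughly 17.0 and 28.2
print(interval(2))   # roughly 11.4 and 33.8
print(is_far_from_average(32), is_far_from_average(35))
```

A 32-minute wait is the longest in the sample, yet it still falls inside the two-standard-deviation interval.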




In general:

Sample: $x = \bar{x} + (\#\text{ of STDEVs})(s)$

Population: $x = \mu + (\#\text{ of STDEVs})(\sigma)$

Here $\mu$ is the Population Mean and $\sigma$ (‘sigma’) is the Population Standard Deviation

The # of STDEVs does not need to be an integer

Formulas for the Standard Deviations

Sample Standard Deviation

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Population Standard Deviation

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

If the sample has the same characteristics as the population…

then s should be a good estimate of $\sigma$

$\sigma^2$ represents the population variance just as $s^2$ represents the sample variance

Also note that if we have a census (so, the whole population), we divide by N, the number of items in the population
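Python's standard `statistics` module has both versions: `stdev` divides by n - 1 and `pstdev` divides by N. The sketch below applies both to the wait-time data; treating those five values as a whole population is an assumption made only to show the difference.

```python
import statistics

waits = [23, 18, 21, 32, 19]   # wait times in minutes

# Sample standard deviation: divides the sum of squares by n - 1
s = statistics.stdev(waits)

# Population standard deviation: divides the sum of squares by N
# (only appropriate if these five values were the entire population)
sigma = statistics.pstdev(waits)

print(round(s, 4), round(sigma, 4))
```

For the same data, dividing by N always gives a smaller result than dividing by n - 1.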

Sampling Variability of a Statistic

How much the statistic varies from one sample to another is known as the sampling variability of a statistic

◦ You typically measure the sampling variability of a statistic by its standard error

The standard error of the mean is an example of a standard error

◦ It is a special standard deviation and is known as the standard deviation of the sampling distribution of the mean

◦ We will cover the standard error of the mean in the chapter on The Central Limit Theorem

◦ The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard deviation of the population and n is the sample size

Comparing Values from Different Data Sets

The standard deviation is useful when comparing data values that come from different data sets

If the data sets have different means and standard deviations, then comparing the data values directly can be misleading

For each data value, calculate the number of standard deviations between it and the mean

Use the formula: value = mean + (# of STDEVs)(standard deviation)

Solve for # of STDEVs

Compare the results of this calculation

The # of STDEVs is often called a “z-score”

We can use the symbol z

Sample: $z = \frac{x - \bar{x}}{s}$

Population: $z = \frac{x - \mu}{\sigma}$
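Here is a minimal sketch of comparing values from two different data sets with z-scores. The scenario and all of its numbers are invented for illustration: two students at different schools, each school with its own mean and standard deviation.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations between a value and the mean."""
    return (value - mean) / std_dev


# Hypothetical student A: GPA of 2.85 at a school with mean 3.0, SD 0.7
z_a = z_score(2.85, 3.0, 0.7)

# Hypothetical student B: GPA of 77 at a school with mean 80, SD 10
z_b = z_score(77, 80, 10)

print(round(z_a, 2), round(z_b, 2))

# The larger z-score is the better result relative to that school
better = "A" if z_a > z_b else "B"
print(better)
```

The raw values 2.85 and 77 cannot be compared directly because the scales differ, but the z-scores can: both students are below their school's mean, and student A is closer to it.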

A few facts about what the Standard Deviation tells us

For ANY data set, no matter what the distribution of the data is:

At least 75% of the data is within two standard deviations of the mean

At least 89% of the data is within three standard deviations of the mean

At least 95% of the data is within 4.5 standard deviations of the mean

This is known as Chebyshev’s Rule
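The three percentages come from the general form of Chebyshev's Rule, which these notes do not state: for any k greater than 1, at least 1 - 1/k² of the data lies within k standard deviations of the mean. A short sketch, with a made-up function name:

```python
def chebyshev_bound(k):
    """Minimum share of data within k standard deviations (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4.5):
    # Prints the guaranteed minimum as a percentage
    print(k, round(100 * chebyshev_bound(k), 1))
```

For k = 3 the exact bound is about 88.9%, which the slide rounds to 89%.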

A few facts about what the Standard Deviation tells us

For data having a distribution that is Bell-Shaped and Symmetric:

Approximately 68% of the data is within one standard deviation of the mean

Approximately 95% of the data is within two standard deviations of the mean

More than 99% of the data is within three standard deviations of the mean

This is known as the Empirical Rule

It is important to note that this rule only applies when the shape of the distribution of the data is bell shaped and symmetric
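The Empirical Rule can be checked on simulated bell-shaped data. Everything in this sketch is an assumption for illustration: the data are random draws from a normal distribution with mean 100 and standard deviation 15, and the seed is fixed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable

# Simulated bell-shaped, symmetric data (assumed mean 100, SD 15)
data = [rng.gauss(100, 15) for _ in range(20_000)]

xbar = statistics.mean(data)
s = statistics.stdev(data)


def share_within(k):
    """Fraction of the data within k standard deviations of the mean."""
    inside = sum(1 for x in data if abs(x - xbar) <= k * s)
    return inside / len(data)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(100 * share, 1))
```

The shares should land near 68%, 95%, and above 99%, with small differences from run to run if the seed is changed.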

Homework! Beginning on page 136:

74, 78, 79, 80, 83, 90, 104, 115, 116, 118
