m03 sull8028 03 se c03 - higher education | pearson your purchase, you decide to collect as much...

64
Numerically Summarizing Data 128 Outline 3.1 Measures of Central Tendency 3.2 Measures of Dispersion 3.3 Measures of Central Tendency and Dispersion from Grouped Data 3.4 Measures of Position and Outliers 3.5 The Five-Number Summary and Boxplots Informed Decisions Using Data Suppose that you are in the market for a used car. To make an informed decision regarding your purchase, you decide to collect as much information as possible. What information is im- portant in helping you make this decision? See the Deci- sions project on page 189. PUTTING IT TOGETHER When we look at a distribution of data, we should consider three characteristics of the distribution: shape, center, and spread. In the last chapter, we discussed methods for organizing raw data into tables and graphs. These graphs (such as the histogram) allow us to identify the shape of the distribution. Recall that we de- scribe the shape of a distribution as symmetric (in particular, bell shaped or uniform), skewed right, or skewed left. The center and spread are numerical summaries of the data. The center of a data set is commonly called the average. There are many ways to describe the average value of a distribution. In addition, there are many ways to measure the spread of a distribution. The most appropriate measure of center and spread depends on the shape of the distribution. Once these three characteristics of the distribution are known, we can analyze the data for interesting features, including unusual data values, called outliers. 3

Upload: dangkiet

Post on 27-Mar-2018

239 views

Category:

Documents


10 download

TRANSCRIPT

Page 1: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

NumericallySummarizing Data

128

Outline3.1 Measures of Central

Tendency

3.2 Measures of Dispersion

3.3 Measures of CentralTendency andDispersion fromGrouped Data

3.4 Measures of Position and Outliers

3.5 The Five-Number Summaryand Boxplots

Informed DecisionsUsing

DataSuppose that you are in the market for aused car. To make an informed decision

regarding your purchase, youdecide to collect as much

information as possible.What information is im-

portant in helping you makethis decision? See the Deci-sions project on page 189.

PUTTING IT TOGETHERWhen we look at a distribution of data, we should consider three characteristics of the distribution: shape,center, and spread. In the last chapter, we discussed methods for organizing raw data into tables and graphs.These graphs (such as the histogram) allow us to identify the shape of the distribution. Recall that we de-scribe the shape of a distribution as symmetric (in particular, bell shaped or uniform), skewed right, orskewed left.

The center and spread are numerical summaries of the data. The center of a data set is commonly calledthe average.There are many ways to describe the average value of a distribution. In addition, there are manyways to measure the spread of a distribution. The most appropriate measure of center and spread dependson the shape of the distribution.

Once these three characteristics of the distribution are known, we can analyze the data for interestingfeatures, including unusual data values, called outliers.

3M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 128

Page 2: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.1 Measures of Central Tendency 129

3.1 MEASURES OF CENTRAL TENDENCYPreparing for This Section Before getting started, review the following:

• Population versus sample (Section 1.1, p. 5) • Qualitative data (Section 1.1, p. 9)

• Parameter versus statistic (Section 1.1, p. 5) • Simple random sampling (Section 1.3, pp. 22–27)

• Quantitative data (Section 1.1, p. 9)

Objectives 1 Determine the arithmetic mean of a variable from raw data2 Determine the median of a variable from raw data3 Explain what it means for a statistic to be resistant4 Determine the mode of a variable from raw data

A measure of central tendency numerically describes the average or typical datavalue. We hear the word average in the news all the time:

• The average miles per gallon of gasoline of the 2008 Chevrolet Corvette in citydriving is 15 miles.

• According to the U.S. Census Bureau, the national average commute time towork in 2006 was 24.0 minutes.

• According to the U.S. Census Bureau, the average household income in 2006was $48,201.

• The average American woman is tall and weighs 142 pounds.

In this chapter, we discuss three measures of central tendency: the mean, themedian, and the mode. While other measures of central tendency exist, these threeare the most widely used. When the word average is used in the media (newspapers,reporters, and so on) it usually refers to the mean. But beware! Some reporters usethe term average to refer to the median or mode. As we shall see, these three meas-ures of central tendency can give very different results!

1 Determine the Arithmetic Mean of a Variable from Raw DataWhen used in everyday language, the word average often represents the arithmeticmean.To compute the arithmetic mean of a set of data, the data must be quantitative.

Definitions The arithmetic mean of a variable is computed by determining the sum of allthe values of the variable in the data set and dividing by the number of observa-tions. The population arithmetic mean, (pronounced “mew”), is computedusing all the individuals in a population. The population mean is a parameter.

The sample arithmetic mean, (pronounced “x-bar”), is computed usingsample data. The sample mean is a statistic.

While other types of means exist (see Problems 45 and 46), the arithmetic meanis generally referred to as the mean. We will follow this practice for the remainder ofthe text.

Typically, Greek letters are used to represent parameters, and Roman letters areused to represent statistics. Statisticians use mathematical expressions to describethe method for computing means.

If are the N observations of a variable from a population, then thepopulation mean, is

(1)m =

x1 + x2 +Á

+ xN

N=axi

N

m,x1, x2 , Á , xN

x

m

5¿4–

Whenever you hear the wordaverage, be aware that the word maynot always be referring to the mean.One average could be used to supportone position, while another averagecould be used to support a different position.

Note to InstructorDiscuss with students that we describedistributions using three characteris-tics: shape, center, and spread. Shapecan be determined from pictures of data,like the histogram. Center is discussed inthis section, and spread is discussed inSection 3.2.

Note to InstructorIf you like, you can print out and distributethe Preparing for this Section quiz locatedin the Instructor’s Resource Center.

In Other WordsTo find the mean of a set of data, addup all the observations and divide by thenumber of observations.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 129

Page 3: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

130 Chapter 3 Numerically Summarizing Data

Table 1

Student Score

1. Michelle 82

2. Ryanne 77

3. Bilal 90

4. Pam 71

5. Jennifer 62

6. Dave 68

7. Joel 74

8. Sam 84

9. Justine 94

10. Juan 88

If are n observations of a variable from a sample, then the samplemean, is

(2)

Note that N represents the size of the population, while n represents the size ofthe sample. The symbol (the Greek letter capital sigma) tells us the terms are tobe added. The subscript i is used to make the various values distinct and does notserve as a mathematical operation. For example, is the first data value, is thesecond, and so on.

Let’s look at an example to help distinguish the population mean and samplemean.

EXAMPLE 1 Computing a Population Mean and a Sample Mean

Problem: The data in Table 1 represent the first exam score of 10 students enrolledin a section of Introductory Statistics.(a) Compute the population mean.(b) Find a simple random sample of size students.(c) Compute the sample mean of the sample obtained in part (b).

Approach(a) To compute the population mean, we add up all the data values (test scores) andthen divide by the number of individuals in the population.(b) Recall from Section 1.3 that we can use either Table I in Appendix A, a calcula-tor with a random-number generator, or computer software to obtain simple randomsamples. We will use a TI-84 Plus graphing calculator.(c) The sample mean is found by adding the data values that correspond to theindividuals selected in the sample and then dividing by the sample size.

Solution(a) We compute the population mean by adding the scores of all 10 students:

Divide this result by 10, the number of students in the class.

Although it was not necessary in this problem, we will agree to round the mean toone more decimal place than that in the raw data.

(b) To find a simple random sample of size from a population whose size iswe will use the TI-84 Plus random-number generator with a seed of 54.

(Recall that this gives the starting point that the calculator uses to generate thelist of random numbers.) Figure 1 shows the students in the sample. Bilal (90),Ryanne (77), Pam (71), and Michelle (82) are in the sample.(c) We compute the sample mean by first adding the scores of the individuals in thesample.

= 320

= 90 + 77 + 71 + 82

axi = x1 + x2 + x3 + x4

N = 10,n = 4

m =axi

N=

79010

= 79

= 790

= 82 + 77 + 90 + 71 + 62 + 68 + 74 + 84 + 94 + 88

axi = x1 + x2 + x3 +Á

+ x10

n = 4,

n = 4

x2x1

©

x =

x1 + x2 +Á

+ xn

n=axi

n

x,x1 , x2 , Á , xn

Figure 1

Note to InstructorIf you decided to postpone thediscussion of Sections 1.2–1.6, letstudents know they should not dwellon how the sample is obtained.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 130

Page 4: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.1 Measures of Central Tendency 131

Divide this result by 4, the number of individuals in the sample.

x =axi

n=

3204

= 80

Note to InstructorThe purpose of this activity is to introducethe students to sampling variability. Itshould take about 15 to 20 minutes. It canbe used to illustrate the ideas in Example 1.Keep the data collected so you can use itto illustrate other concepts later in thecourse. If you decided to postpone thediscussion of Sections 1.2–1.6, do notattempt this activity.

Population Mean versus Sample MeanTreat the students in the class as a population. All the students in theclass should determine their pulse rates.(a) Compute the population mean pulse rate.(b) Obtain a simple random sample of students and compute thesample mean. Does the sample mean equal the population mean?(c) Obtain a second simple random sample of students andcompute the sample mean. Does the sample mean equal the populationmean?(d) Are the sample means the same? Why?

n = 4

n = 4

It is helpful to think of the mean of a data set as the center of gravity. In otherwords, the mean is the value such that a histogram of the data is perfectly balanced,with equal weight on each side of the mean. Figure 2 shows a histogram of the datain Table 1 with the mean labeled. The histogram balances at m = 79.

Freq

uenc

y

3

2

1

60 65 70 75 80 85 90 95

Scores on First Exam

m � 79Score

Figure 2

The Mean as the Center of GravityFind a yardstick, a fulcrum, and three objects of equal weight (maybe1-kilogram weights from the physics department). Place the fulcrum at18 inches so that the yardstick balances like a teeter-totter. Now placeone weight on the yardstick at 12 inches, another at 15 inches, and thethird at 27 inches. See Figure 3.

Does the yardstick balance? Now compute the mean of the location ofthe three weights. Compare this result with the location of the fulcrum.Conclude that the mean is the center of gravity of the data set.

18 271512

Figure 3

2 Determine the Median of a Variable from Raw DataA second measure of central tendency is the median. To compute the median ofa set of data, the data must be quantitative.

Definition The median of a variable is the value that lies in the middle of the data whenarranged in ascending order. We use M to represent the median.

Now Work Problem 23

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 131

Page 5: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

132 Chapter 3 Numerically Summarizing Data

In Other WordsTo help remember the idea behind themedian, think of the median of a highway;it divides the highway in half. So themedian divides the data in half, with atmost half the data below the medianand at most half above it.

To determine the median of a set of data, we use the following steps:

Note to InstructorSuppose that a data set has 10 observa-tions. In class, have students explain intheir own words how to find the median.What if the data set has 11 observations?

EXAMPLE 2 Determining the Median of a Data Set with an Odd Number of Observations

Problem: The data in Table 2 represent the length (in seconds) of a random sam-ple of songs released in the 1970s. Find the median length of the songs.

Approach: We will follow the steps listed above.

SolutionStep 1: Arrange the data in ascending order:

179, 201, 206, 208, 217, 222, 240, 257, 284

Step 2: There are observations.

Step 3: Since there are an odd number of observations, the median will be theobservation exactly in the middle of the data set. The median, M, is 217 seconds

(the data value). We list the data in ascending order, with

the median in blue.

179, 201, 206, 208, 217, 222, 240, 257, 284

Notice there are four observations to the left and four observations to the right ofthe median.

EXAMPLE 3 Determining the Median of a Data Set with an Even Number of Observations

Problem: Find the median score of the data in Table 1 on page 130.

Approach: We will follow the steps listed above.

SolutionStep 1: Arrange the data in ascending order:

62, 68, 71, 74, 77, 82, 84, 88, 90, 94

Step 2: There are observations.Step 3: Because there are observations, the median will be the mean of the

two middle observations. The median is the mean of the fifth andan

2=

102

= 5bn = 10

n = 10

n + 1 2

=

9 + 1 2

= 5th

n = 9

Steps in Finding the Median of a Data Set

Step 1: Arrange the data in ascending order.Step 2: Determine the number of observations, n.Step 3: Determine the observation in the middle of the data set.• If the number of observations is odd, then the median is the data value that

is exactly in the middle of the data set. That is, the median is the observation

that lies in the position.

• If the number of observations is even, then the median is the mean of thetwo middle observations in the data set. That is, the median is the mean of

the observations that lie in the position and the position.n

2+ 1

n

2

n + 1 2

Table 2

Song Name Length

“Sister Golden Hair” 201

“Black Water” 257

“Free Bird” 284

“The Hustle” 208

“Southern Nights” 179

“Stayin’ Alive” 222

“We Are Family” 217

“Heart of Glass” 206

“My Sharona” 240

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 132

Page 6: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.1 Measures of Central Tendency 133

sixth observations with the data written in ascending

order. So the median is the mean of 77 and 82:

Notice that there are five observations to the left and five observations to the rightof the median, as follows:

We conclude that 50% (or half) of the students scored less than 79.5 and 50%(or half) of the students scored above 79.5.

Now compute the median of the data in Problem 15 by hand

M = 79.5æ

62, 68, 71, 74, 77, 82, 84, 88, 90, 94

M =

77 + 82 2

= 79.5

an

2+ 1 =

102

+ 1 = 6b

EXAMPLE 4 Finding the Mean and Median Using Technology

Problem: Use statistical software or a calculator to determine the population meanand median of the student test score data in Table 1 on page 130.

Approach: We will use Excel to obtain the mean and median. The steps for calcu-lating measures of central tendency using the TI-83/84 Plus graphing calculator,MINITAB, or Excel are given in the Technology Step-by-Step on page 142.

Solution: Figure 4 shows the output obtained from Excel.

3 Explain What It Means for a Statistic to be ResistantThus far, we have discussed two measures of central tendency, the mean and themedian. Perhaps you are asking yourself which measure is better. It depends.

EXAMPLE 5 Comparing the Mean and the Median

Problem: Yolanda wants to know how much time she typically spends on her cellphone. She goes to her phone’s website and records the phone call length for a ran-dom sample of 12 calls and obtains the data in Table 3. Find the mean and medianlength of a cell phone call. Which measure of central tendency better describes thelength of a typical phone call?

Approach: We will find the mean and median using MINITAB. To help judgewhich is the better measure of central tendency, we will also draw a dot plot of thedata using MINITAB.

Solution: Figure 5 indicates that the mean talk time is minutes andthe median talk time is 3.5 minutes. Figure 6 shows a dot plot of the data usingMINITAB.

x = 7.3

Student Scores

MeanStandard ErrorMedianMode

793.272783389

79.5#N/A

Figure 4

Descriptive Statistics: TalkTimeVariableTalkTime

N12

N*0Mean7.25

SE Mean3.74

StDev12.96

Minimum1.00

Q12.25

Q35.75

Median3.50

Maximum48.00

Figure 5

Table 3

1 7 4 1

2 4 3 48

3 5 3 6

Source: Yolanda Sullivan’s cell phonerecords

Note to InstructorIf you collected pulse data, change thelargest observation to something extreme,like 120. Ask students to recompute themean and median. Discuss the results.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 133

Page 7: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

134 Chapter 3 Numerically Summarizing Data

Which measure of central tendency do you think better describes the typical amountof time Yolanda spends on a phone call? Given the fact that only one phone call has atalk time greater than the mean, we conclude that the mean is not representative ofthe typical talk time. So the median is the better measure of central tendency.

Look back at the data in Table 3. Suppose Yolanda’s 48-minute phone call wasactually a 5-minute phone call.Then the mean talk time would be 3.7 minutes and themedian talk time would still be 3.5 minutes. So the one extreme observation (48 min-utes) can cause the mean to increase substantially, but have no effect on the median. Inother words, the mean is sensitive to extreme values while the median is not. In fact, ifYolanda’s 48-minute phone call had actually been a 148-minute phone call, the medianwould still be 3.5 minutes, but the mean would increase to 15.6 minutes.The median isunchanged because it is based on the value of the middle observation, so the value ofthe largest observation does not play a role in its computation.Because extreme valuesdo not affect the value of the median, we say that the median is resistant.

Definition A numerical summary of data is said to be resistant if extreme values (verylarge or small) relative to the data do not affect its value substantially.

So the median is resistant, while the mean is not resistant.When data are either skewed left or skewed right, there are extreme values in

the tail, which tend to pull the mean in the direction of the tail. For example, inskewed-right distributions, there are large observations in the right tail. These ob-servations tend to increase the value of the mean, while having little effect on themedian. Similarly, in distributions that are skewed left, the mean will tend to besmaller than the median. In distributions that are symmetric, the mean and the medi-an are close in value. We summarize these ideas in Table 4 and Figure 7.

*This idea is discussed in “Mean, Median, and Skew: Correcting a Textbook Rule” by Paul T. von Hippel.Journal of Statistics Education, Volume 13, Number 2 (2005).

Note to InstructorIn-class concept check: Ask students todiscuss how extreme observations affectthe mean and median. Which measureof center is best when extreme valuesexist? Which is a better measure of cen-tral tendency for the following variables:income, price of housing, height, or IQ?

Figure 6

Table 4

Relation Between the Mean, Median, and Distribution Shape

Distribution Shape Mean versus Median

Skewed left Mean substantially smaller than median

Symmetric Mean roughly equal to median

Skewed right Mean substantially larger than median

(a) Skewed LeftMean � Median

(b) SymmetricMean � Median

(c) Skewed RightMean � Median

Mean

Median

MeanMedian

Mean � MedianFigure 7

Mean or median versus skewness

A word of caution is in order.The relation between the mean, median, and skew-ness are guidelines.The guidelines tend to hold up well for continuous data,but,whenthe data are discrete, the rules can be easily violated. See Problem 47.*

A question you may be asking yourself is, “Why would I ever compute themean?” After all, the mean and median are close in value for symmetric data, and

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 134

Page 8: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.1 Measures of Central Tendency 135

the median is the better measure of central tendency for skewed data. The reasonwe compute the mean is that much of the statistical inference that we perform isbased on the mean. We will have more to say about this in Chapter 8.

EXAMPLE 6 Describing the Shape of a Distribution

Problem: The data in Table 5 represent the birth weights (in pounds) of 50 ran-domly sampled babies.(a) Find the mean and the median.(b) Describe the shape of the distribution.(c) Which measure of central tendency better describes the average birth weight?

Approach(a) This can be done either by hand or technology. We will use a TI-84 Plus to com-pute the mean and the median.(b) We will draw a histogram to identify the shape of the distribution.(c) If the data are roughly symmetric, the mean is a good measure of central ten-dency. If the data are skewed, the median is a good measure of central tendency.

Solution(a) Using a TI-84 Plus, we find and See Figure 8.M = 7.35.x = 7.49

(a) (b)

Figure 8

(b) See Figure 9 for the frequency histogram with the mean and median labeled.The distribution is bell shaped. We have further evidence of the shape because themean and median are close to each other.

5.7 6.2 6.7 7.2 7.7 8.2 8.7 9.2 9.7

Mean.The balancing pointof the histogram.

MedianBirth Weights of Babies

Freq

uenc

y

Weight (in pounds)

15

10

5

0

Figure 9Birth weights of 50 randomly

selected babies

(c) Because the mean and median are close in value, we use the mean as the measureof central tendency.

Table 5

5.8 7.4 9.2 7.0 8.5 7.6

7.9 7.8 7.9 7.7 9.0 7.1

8.7 7.2 6.1 7.2 7.1 7.2

7.9 5.9 7.0 7.8 7.2 7.5

7.3 6.4 7.4 8.2 9.1 7.3

9.4 6.8 7.0 8.1 8.0 7.5

7.3 6.9 6.9 6.4 7.8 8.7

7.1 7.0 7.0 7.4 8.2 7.2

7.6 6.7

Now Work Problem 27

4 Determine the Mode of a Variable from Raw DataA third measure of central tendency is the mode. The mode can be computed foreither quantitative or qualitative data.

Definition The mode of a variable is the most frequent observation of the variable thatoccurs in the data set.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 135

Page 9: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

136 Chapter 3 Numerically Summarizing Data

In Other WordsRemember, nominal data are qualitativedata that cannot be written in anymeaningful order.

Table 6

Back Back Hand Neck Knee Knee

Wrist Back Groin Shoulder Shoulder Back

Elbow Back Back Back Back Back

Back Shoulder Shoulder Knee Knee Back

Hip Knee Hip Hand Back Wrist

Source: Krystal Catton, student at Joliet Junior College

To compute the mode, tally the number of observations that occur for each datavalue. The data value that occurs most often is the mode. A set of data can have nomode, one mode, or more than one mode. If no observation occurs more than once,we say the data have no mode.

EXAMPLE 7 Finding the Mode of Quantitative Data

Problem: The following data represent the number of O-ring failures on the shuttleColumbia for its 17 flights prior to its fatal flight:

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 3

Find the mode number of O-ring failures.

Approach: We tally the number of times we observe each data value. The datavalue with the highest frequency is the mode.

Solution: The mode is 0 because it occurs most frequently (11 times).

EXAMPLE 8 Finding the Mode of Quantitative Data

Problem: Find the mode of the exam score data listed in Table 1 on page 130.

Approach: Tally the number of times we observe each data value. The data valuewith the highest frequency is the mode.

Solution: For convenience, the data are listed next:

82, 77, 90, 71, 62, 68, 74, 84, 94, 88

Since each data value occurs only once, there is no mode.

Now compute the mode of the data in Problem 15

A data set can have more than one mode. For example, suppose the instructorrecorded the scores of Pam and Sam incorrectly and they actually scored 77 and 88,respectively. The data set in Table 1 would now have two modes: 77 and 88. In thiscase,we say the data are bimodal. If a data set has three or more data values that occurwith the highest frequency, the data set is multimodal.The mode is usually not report-ed for multimodal data because it is not representative of a typical value. Figure 10(a)shows a distribution with one mode.Figure 10(b) shows a distribution that is bimodal.

We cannot determine the value of the mean or median of data that are nominal.The only measure of central tendency that can be determined for nominal data isthe mode.

EXAMPLE 9 Determining the Mode of Qualitative Data

Problem: The data in Table 6 represent the location of injuries that required reha-bilitation by a physical therapist. Determine the mode location of injury.

(a) One modeMode

(b) BimodalMode Mode

Figure 10

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 136

Page 10: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.1 Measures of Central Tendency 137

Approach: Determine the location of injury that occurs with the highest frequency.

Solution: The mode location of injury is the back, with 12 instances.

Measure of Central Tendency Computation Interpretation When to Use

Mean Population mean:

Sample mean: x =

©xi

n

m =

©xi

N Center of gravity When data are quantitative and the

frequency distribution is roughlysymmetric

Median Arrange data in ascendingorder and divide the dataset in half

Divides the bottom 50% of thedata from the top 50%

When the data are quantitative andthe frequency distribution is skewedleft or skewed right

Mode Tally data to determinemost frequent observation

Most frequent observation When the most frequent observationis the desired measure of centraltendency or the data are qualitative

3.1 ASSESS YOUR UNDERSTANDING

Concepts and Vocabulary1. What does it mean if a statistic is resistant? Why is the medi-

an resistant, but the mean is not?

2. In the 2000 census conducted by the U.S. Census Bureau,two average household incomes were reported: $41,349and $55,263. One of these averages is the mean and theother is the median. Which is the mean? Support youranswer. Mean

3. The U.S. Department of Housing and Urban Development(HUD) uses the median to report the average price of ahome in the United States. Why do you think HUD uses themedian? Data skewed right

4. A histogram of a set of data indicates that the distribution ofthe data is skewed right. Which measure of central tendencywill likely be larger, the mean or the median? Why? Mean

5. If a data set contains 10,000 values arranged in increasingorder, where is the median located?

6. True or False: A data set will always have exactly one mode.False

Skill BuildingIn Problems 7–10, find the population mean or sample mean asindicated.

7. Sample: 20, 13, 4, 8, 10

8. Sample: 83, 65, 91, 87, 84

9. Population: 3, 6, 10, 12, 14

10. Population: 1, 19, 25, 15, 12, 16, 28, 13, 6

11. For Super Bowl XL, CBS television sold 65 ad slots for atotal revenue of roughly $162.5 million. What was the meanprice per ad slot? $2.5 million

m � 15m � 9

x � 82x � 11

� $55,263

12. The median for the given set of six ordered data values is 26.5.What is the missing value? 7 12 21 32 41 50

13. Crash Test Results The Insurance Institute for HighwaySafety crashed the 2007 Audi A4 four times at 5 milesper hour. The costs of repair for each of the four crasheswere

$976, $2038, $918, $1899

Compute the mean, median, and mode cost of repair.

14. Cell Phone Use The following data represent the monthlycell phone bill for my wife’s phone for six randomly selectedmonths.

$35.34, $42.09, $39.43, $38.93, $43.39, $49.26

Compute the mean, median, and mode phone bill.

15. Concrete Mix A certain type of concrete mix is designed towithstand 3,000 pounds per square inch (psi) of pressure.The strength of concrete is measured by pouring the mixinto casting cylinders 6 inches in diameter and 12 inches tall.The concrete is allowed to set for 28 days. The concrete’sstrength is then measured. The following data represent thestrength of nine randomly selected casts (in psi).

3960, 4090, 3200, 3100, 2940, 3830, 4090, 4040, 3780

Compute the mean, median, and mode strength of the con-crete (in psi). 3670; 3830; 4090

16. Flight Time The following data represent the flight time(in minutes) of a random sample of seven flights from LasVegas, Nevada, to Newark, New Jersey, on ContinentalAirlines.

282, 270, 260, 266, 257, 260, 267

Compute the mean, median, and mode flight time.

5. The median is between the and ordered values5001st5000th

Now Work Problem 33

SummaryWe conclude this section with the following chart, which addresses the circumstancesunder which each measure of central tendency should be used.

13. $1,457.75; $1,437.50; none 14. $41.41; $40.76; none16. 266; 266; 260

NW

Note to Instructor If you postponed discussion of Sections 1.2–1.6, do not assign Problems 23 or 24.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 137

Page 11: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

138 Chapter 3 Numerically Summarizing Data

6 12 18 24 3030 9 15 21 27 33

(a)

Freq

uenc

y 15

10

5

4 8 12 16 20

(b)

Freq

uenc

y

9876543210

5 15 2510 20 30

(c)

Freq

uenc

y

15

10

5

0

15 25 35 45 555

(a)

Freq

uenc

y

15

10

5

010 30 50 70 90�10

(b)

Freq

uenc

y

9876543210

26 38 50 62 7414

(d)

Freq

uenc

y

15

10

5

07.5 22.5 37.5 52.5 67.5�7.5

(c)

Freq

uenc

y

10

5

0

Tap 7.64 7.45 7.47 7.50 7.68 7.69

7.45 7.10 7.56 7.47 7.52 7.47

Bottled 5.15 5.09 5.26 5.20 5.02 5.23

5.28 5.26 5.13 5.26 5.21 5.24

Source: Emily McCarney, student at Joliet Junior College

17. For each of the three histograms shown,determine whether themean is greater than, less than, or approximately equal to themedian. Justify your answer. (a) (b) (c) x<Mx � Mx>M

19. Mean versus Median Applet Load the mean versus medianapplet that is located on the CD that accompanies the text.Change the lower limit to 0 and the upper limit to 10 andclick Update.(a) Create a data set of at least eight observations such that

the mean and median are roughly 2.(b) Add a single observation near 9. How does this new

value affect the mean? How does this new value affectthe median?

(c) Change the upper limit to 25 and click Update. Removethe single value added from part (b).Add a single obser-vation near 24. How does this new value affect the mean?the median?

(d) Refresh the page. Change the lower limit to 0 and theupper limit to 50 and click Update. Create a data set ofat least eight observations such that the mean and medianare roughly 40.

(e) Add a single observation near 35. How does this newvalue affect the mean? the median? Now “grab” thispoint with your mouse cursor and drag it toward 0.Whathappens to the value of the mean? What happens to thevalue of the median? Why?

20. Mean versus Median Applet Load the mean versus medianapplet that is located on the CD that accompanies the text.Change the lower limit to 0 and the upper limit to 10 andclick Update.(a) Create a data set of at least 10 observations such that the

mean equals the median.(b) Create a data set of at least 10 observations such that the

mean is greater than the median.(c) Create a data set of at least 10 observations such that the

mean is less than the median.(d) Comment on the shape of each distribution from parts

(a)–(c).(e) Can you create a distribution that is skewed left, but has

a mean that is greater than the median?

Applying the Concepts21. pH in Water The acidity or alkalinity of a solution is meas-

ured using pH.A pH less than 7 is acidic; a pH greater than 7is alkaline. The following data represent the pH in samplesof bottled water and tap water.

18. Match the histograms shown to the summary statistics:

21. (a) Tap: ; M 7.485; mode 7.47Bottled: x � 5.194, M � 5.22, mode � 5.26

��x � 7.50

Mean Median

I 42 42

II 31 36

III 31 26

IV 31 32

(a) Compute the mean, median, and mode pH for each typeof water. Comment on the differences between the twowater types.

(b) Suppose the pH of 7.10 in tap water was incorrectlyrecorded as 1.70. How does this affect the mean? themedian? What property of the median does this illus-trate? , M 7.485; resistance

22. Reaction Time In an experiment conducted online at theUniversity of Mississippi, study participants are asked toreact to a stimulus. In one experiment, the participant must

�x � 7.05

(a) IV (b) III (c) II (d) I

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 138

Page 12: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

press a key upon seeing a blue screen. The time (in seconds)to press the key is measured. The same person is thenasked to press a key upon seeing a red screen, again with thetime to react measured. The table shows the results for sixstudy participants.(a) Compute the mean, median, and mode reaction time for

both blue and red.(b) Does there appear to be a difference in the reaction time?

What might account for any difference? How might thisinformation be used?

(c) Suppose the reaction time of 0.841 for blue was incor-rectly recorded as 8.41. How does this affect the mean?the median? What property of the median does thisillustrate? ; M 0.5315; resistance�x � 1.8125

Section 3.1 Measures of Central Tendency 139

(c) Which samples result in a sample mean that overesti-mates the population mean? Which samples result in asample mean that underestimates the population mean?Do any samples lead to a sample mean that equals thepopulation mean?

25. Carbon Dioxide Emissions The given data represent thecarbon dioxide (CO2) emissions (in thousands of metrictons) of the top 10 emitters in 2004.

(a) Determine the mean CO2 emissions of the top 10countries. 1,813,654.5 thousand metric tons

(b) Explain why the total emissions of a country is likely notthe best gauge of CO2 emissions; instead, the emissionsper capita (total emissions divided by population size) isthe better gauge.

(c) Determine the mean and median per capita CO2 emis-sions of the top 10 countries. Which measure would anenvironmentalist likely use to support the position thatper capita CO2 emissions are too high? Why?

26. Tour de Lance Lance Armstrong won the Tour de Franceseven consecutive years (1999–2005). The following tablegives the winning times, distances, speeds, and margin ofvictory.

1 0.582 0.408

2 0.481 0.407

3 0.841 0.542

4 0.267 0.402

5 0.685 0.456

6 0.450 0.533

Source: PsychExperiments at the University ofMississippi (www.olemiss.edu/psychexps)

Participant Number

Reaction Time to Blue

Reaction Time to Red

(a) Compute the population mean pulse. 72.2(b) Determine three simple random samples of size 3 and

compute the sample mean pulse of each sample.(c) Which samples result in a sample mean that overesti-

mates the population mean? Which samples result in asample mean that underestimates the population mean?Do any samples lead to a sample mean that equals thepopulation mean?

24. Travel Time The following data (table on top of next col-umn) represent the travel time (in minutes) to school fornine students enrolled in Sullivan’s College Algebra course.Treat the nine students as a population.(a) Compute the population mean for travel time. 26.4 min(b) Determine three simple random samples of size 4

and compute the sample mean for travel time of eachsample.

22. (a) Blue: 0.551 sec; 0.5315 sec; none; Red: 0.458 sec; 0.432 sec; none

23. Pulse Rates The following data represent the pulse rates(beats per minute) of nine students enrolled in a section ofSullivan’s Introductory Statistics course. Treat the nine stu-dents as a population.

Student Pulse

Perpectual Bempah 76

Megan Brooks 60

Jeff Honeycutt 60

Clarice Jefferson 81

Crystal Kurtenbach 72

Janette Lantka 80

Kevin McCarthy 80

Tammy Ohm 68

Kathy Wojdyla 73

Student Travel Time Student Travel Time

Amanda 39 Scot 45

Amber 21 Erica 11

Tim 9 Tiffany 12

Mike 32 Glenn 39

Nicole 30

2.814 thousand metric tons; 2.67 thousand metric tons; mean

Country Emissions Per Capita Emissions

United States 6,049,435 5.61

China 5,010,170 1.05

Russia 1,524,993 2.89

India 1,342,962 0.34

Japan 1,257,963 2.69

Germany 808,767 2.67

Canada 639,403 5.46

United Kingdom 587,261 2.67

South Korea 465,643 2.64

Italy 449,948 2.12

Source: Carbon Dioxide Information Analysis Center

NW

1999 91.538 3687 40.28 7.617

2000 92.552 3662 39.56 6.033

2001 86.291 3453 40.02 6.733

2002 82.087 3278 39.93 7.283

2003 83.687 3427 40.94 1.017

2004 83.601 3391 40.56 6.317

2005 86.251 3593 41.65 4.667

Source: cyclingnews.com

YearWinningTime (h)

Distance(km)

Winning Speed (km/h)

WinningMargin (min)

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 139

Page 13: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

140 Chapter 3 Numerically Summarizing Data

of central tendency better describes the “center” of thedistribution? Skewed right; median

29. M&Ms The following data represent the weights (in grams)of a simple random sample of 50 M&M plain candies.

Determine the shape of the distribution of hours worked bydrawing a frequency histogram. Compute the mean andmedian. Which measure of central tendency better describeshours worked? Skewed left; median

32. A Dealer’s Profit The following data represent the profit(in dollars) of a new-car dealer for a random sample of40 sales.

x � 22 hrs; M � 25 hrs;

(a) Compute the mean and median of his winning times forthe seven races.

(b) Compute the mean and median of the distances for theseven races.

(c) Compute the mean and median of his winning timemargins.

(d) Compute the mean winning speed by finding the meanof the data values in the table. Next, compute the meanwinning speed by finding the total of the seven distancesand dividing by the total of the seven winning times.Finally, compute the mean winning speed by dividing themean distance by the mean winning time. Do the threevalues agree or are there differences?

27. Connection Time A histogram of the connection time, inseconds, to an Internet service provider for 30 randomlyselected connections is shown. The mean connection time is39.007 seconds and the median connection time is 39.065seconds. Identify the shape of the distribution. Which meas-ure of central tendency better describes the “center” of thedistribution? Symmetric; mean

Mean � 5.667 min; median � 6.317 min

Mean � 3498.7 km; median � 3453 km

Mean � 86.572 h; median � 86.251 h

26. (d) Method 1: method 2: method 3: mean � 40.414 km/hmean � 40.414 km/h;mean � 40.420 km/h;

Source: Nicole Spreitzer, student at Joliet Junior College

Source: Carol Wesolowski, student at Joliet Junior College

0

1

2

3

4

5

6

7

Freq

uenc

y

36 38 40 42 44

Time (seconds)

Connection Time

28. Journal Costs A histogram of the annual subscription cost(in dollars) for 26 biology journals is shown. The mean sub-scription cost is $1846 and the median subscription cost is$1142. Identify the shape of the distribution.Which measure

0

2

4

6

8

10

12

14

Freq

uenc

y

0 2000 4000 6000 8000

Cost (dollars)

Cost of Biology Journals

0.87 0.88 0.82 0.90 0.90 0.84 0.84

0.91 0.94 0.86 0.86 0.86 0.88 0.87

0.89 0.91 0.86 0.87 0.93 0.88

0.83 0.95 0.87 0.93 0.91 0.85

0.91 0.91 0.86 0.89 0.87 0.84

0.88 0.88 0.89 0.79 0.82 0.83

0.90 0.88 0.84 0.93 0.81 0.90

0.88 0.92 0.85 0.84 0.84 0.86

Source: Michael Sullivan

Symmetric; meanDetermine the shape of the distribution of weights ofM&Ms by drawing a frequency histogram. Compute themean and median.Which measure of central tendency betterdescribes the weight of a plain M&M?

30. Old Faithful We have all heard of the Old Faithful geyserin Yellowstone National Park. However, there is another,less famous, Old Faithful geyser in Calistoga, California.The following data represent the length of eruption (in sec-onds) for a random sample of eruptions of the CaliforniaOld Faithful.

x � 0.875 g; M � 0.875 g;

108 108 99 105 103 103 94

102 99 106 90 104 110 110

103 109 109 111 101 101

110 102 105 110 106 104

104 100 103 102 120 90

113 116 95 105 103 101

100 101 107 110 92 108

Source: Ladonna Hansen, Park Curator

0 0 15 20 30

40 30 20 35 35

28 15 20 25 25

30 5 0 30 24

28 30 35 15 15

NW

Symmetric; meanDetermine the shape of the distribution of length of eruptionby drawing a frequency histogram. Compute the mean andmedian. Which measure of central tendency better describesthe length of eruption?

31. Hours Working A random sample of 25 college studentswas asked, “How many hours per week typically do youwork outside the home?” Their responses were as follows:

x � 104.1 s, M � 104 s;

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 140

Page 14: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Sample of Size 30

106 92 98 103 100 102

98 124 83 70 108 121

102 87 121 107 97 114

140 93 130 72 81 90

103 97 89 98 88 103

Section 3.1 Measures of Central Tendency 141

(a) Determine the mode political view. Moderate(b) Do you think it would be a good idea to rotate the

choices conservative, moderate, or liberal in the ques-tion? Why? Yes, to avoid response bias

34. Hospital Admissions The following data represent the diag-nosis of a random sample of 20 patients admitted to a hospital.

781 1,038 453 1,446 3,082

501 451 1,826 1,348 3,001

1,342 1,889 580 0 2,909

2,883 480 1,664 1,064 2,978

149 1,291 507 261 540

543 87 798 673 2,862

1,692 1,783 2,186 398 526

730 2,324 2,823 1,676 4,148

Source: Ashley Hudson, student at Joliet Junior College

Determine the shape of the distribution of new-car profit bydrawing a frequency histogram. Compute the mean and me-dian. Which measure of central tendency better describes theprofit? Skewed right; median

33. Political Views A sample of 30 registered voters was sur-veyed in which the respondents were asked, “Do you consi-der your political views to be conservative, moderate, orliberal?” The results of the survey are shown in the table.

x � $1,392.83; M � $1,177.50;

Liberal

Moderate

Liberal

Moderate

Moderate

Liberal

Conservative

Liberal

Liberal

Conservative

Conservative

Liberal

Moderate

Conservative

Moderate

Moderate

Moderate

Conservative

Conservative

Moderate

Moderate

Moderate

Conservative

Moderate

Liberal

Liberal

Moderate

Liberal

Liberal

Conservative

NW

Source: Based on data from the General Social Survey

Cancer Motor vehicle accident

Congestive heartfailure

Gunshot wound

Fall Gunshot wound

Gunshot wound

Motor vehicle accident

Gunshot wound

Assault Motor vehicle accident

Gunshot wound

Motor vehicle accident

Motor vehicle accident

Gunshot wound

Motor vehicle accident

Gunshot wound Motor vehicle accident

Fall Gunshot wound

Source: Tamela Ohm, student at Joliet Junior College

Determine the mode diagnosis. Gunshot wound

35. Resistance and Sample Size Each of the following threedata sets represents the IQ scores of a random sample ofadults. IQ scores are known to have a mean and median

Sample of Size 12

106 92 98 103 100 102

98 124 83 70 108 121

36. Mr. Zuro finds the mean height of all 14 students in hisstatistics class to be 68.0 inches. Just as Mr. Zuro finishes ex-plaining how to get the mean, Danielle walks in late. Danielleis 65 inches tall.What is the mean height of the 15 students inthe class? 67.8 inches

37. A researcher with the Department of Energy wants todetermine the mean natural gas bill of householdsthroughout the United States. He knows the mean naturalgas bill of households for each state, so he adds togetherthese 50 values and divides by 50 to arrive at his estimate.Is this a valid approach? Why or why not? No

38. Net Worth According to the Statistical Abstract of the UnitedStates, the mean net worth of all households in the UnitedStates in 2004 was $448,200, while the median net worth was$93,100.(a) Which measure do you believe better describes the

typical U.S. household’s net worth? Support youropinion. Median

(b) What shape would you expect the distribution of networth to have? Why? Skewed right

(c) What do you think causes the disparity in the two meas-ures of central tendency?

39. You are negotiating a contract for the Players Association ofthe NBA. Which measure of central tendency will you use tosupport your claim that the average player’s salary needs to beincreased?Why?As the chief negotiator for the owners, whichmeasure would you use to refute the claim made by thePlayers Association?

40. In January 2008, the mean amount of money lost per visitorto a local riverboat casino was $135. Do you think the medianwas more than, less than, or equal to this amount? Why?

41. Missing Exam Grade A professor has recorded examgrades for 20 students in his class, but one of the grades is nolonger readable. If the mean score on the exam was 82 andthe mean of the 19 readable scores is 84, what is the value ofthe unreadable score? 44

42. For each of the following situations, determine which meas-ure of central tendency is most appropriate and justify yourreasoning.

of 100. For each data set, compute the mean and median.For each data set recalculate the mean and median, assum-ing that the individual whose IQ is 106 is accidentallyrecorded as 160. For each sample size, state what happens tothe mean and the median? Comment on the role the num-ber of observations plays in resistance.

Sample of Size 5

106 92 98 103 100

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 141

Page 15: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

142 Chapter 3 Numerically Summarizing Data

(a) Average price of a home sold in Pittsburgh, Pennsylvaniain 2009 Median

(b) Most popular major for students enrolled in a statisticscourse Mode

(c) Average test score when the scores are distributedsymmetrically Mean

(d) Average test score when the scores are skewed right(e) Average income of a player in the National Football

League Median(f) Most requested song at a radio station Mode

43. Linear Transformations Benjamin owns a small Internetbusiness. Besides himself, he employs nine other people. Thesalaries earned by the employees are given next in thou-sands of dollars (Benjamin’s salary is the largest, of course):

30, 30, 45, 50, 50, 50, 55, 55, 60, 75(a) Determine the mean, median, and mode for salary.(b) Business has been good!As a result,Benjamin has a total of

$25,000 in bonus pay to distribute to his employees. Oneoption for distributing bonuses is to give each employee(includinghimself)$2,500. Addthebonusesunder thisplanto the original salaries to create a new data set.Recalculatethe mean, median, and mode. How do they compare to theoriginals?

(c) As a second option, Benjamin can give each employee abonus of 5% of his or her original salary.Add the bonus-es under this second plan to the original salaries to cre-ate a new data set. Recalculate the mean, median, andmode. How do they compare to the originals?

(d) As a third option, Benjamin decides not to give his em-ployees a bonus at all. Instead, he keeps the $25,000 forhimself. Use this plan to create a new data set. Recalcu-late the mean, median, and mode. How do they compareto the originals?

44. Linear Transformations Use the five test scores of 65, 70, 71,75, and 95 to answer the following questions:(a) Find the sample mean. 75.2(b) Find the median. 71(c) Which measure of central tendency best describes the

typical test score? Median

mode � 50median � 50;mean � 52.5;

mode � 52.5median � 52.5;mean � 52.5;

(d) Suppose the professor decides to curve the exam byadding 4 points to each test score. Compute the samplemean based on the adjusted scores. 79.2

(e) Compare the unadjusted test score mean with thecurved test score mean. What effect did adding 4 to eachscore have on the mean?

45. Trimmed Mean Another measure of central tendency is thetrimmed mean. It is computed by determining the mean of adata set after deleting the smallest and largest observed val-ues. Compute the trimmed mean for the data in Problem 29.Is the trimmed mean resistant? Explain. 0.875

46. Midrange The midrange is also a measure of central tendency.It is computed by adding the smallest and largest observedvalues of a data set and dividing the result by 2; that is,

Compute the midrange for the data in Problem 29. Is themidrange resistant? Explain. 0.87; no

47. Putting It Together: Shape, Mean and Median As part of asemester project in a statistics course, Carlos surveyed asample of 40 high school students and asked, “How manydays in the past week have you consumed an alcoholic bev-erage?” The results of the survey are shown next.

Midrange =

largest data value + smallest data value

2

42. (d) Median 43. (a) mode � 50median � 50;Mean � 50;

0 0 1 4 1 1 1 5 1 3

0 1 0 1 0 4 0 1 0 1

0 0 0 0 2 0 0 0 0 0

1 0 2 0 0 0 1 2 1 1

2 0 1 0 1 3 1 1 0 3

43. (c) mode � 52.5median � 52.5;mean � 52.5;

TI-83/84 Plus1. Enter the raw data in L1 by pressing STATand selecting 1:Edit.2. Press STAT, highlight the CALC menu, andselect 1:1-Var Stats.3. With 1-Var Stats appearing on theHOME screen, press 2nd then 1 to insert L1on the HOME screen. Press ENTER.

MINITAB1. Enter the data in C1.2. Select the Stat menu, highlight BasicStatistics, and then highlight DisplayDescriptive Statistics.3. In the Variables window, enter C1. Click OK.

TECHNOLOGY STEP-BY-STEP Determining the Mean and Median

Excel1. Enter the data in column A.2. Select the Tools menu and highlight DataAnalysis3. In the Data Analysis window, highlightDescriptive Statistics and click OK.4. With the cursor in the Input Range window,use the mouse to highlight the data incolumn A.5. Select the Summary statistics option andclick OK.

Á

(a) Is this data discrete or continuous? Discrete(b) Draw a histogram of the data and describe its shape.(c) Based on the shape of the histogram, do you expect

the mean to be more than, equal to, or less than themedian?

(d) Compute the mean and the median. What does this tellyou? Mean: 0.94; Median: 1

(e) Determine the mode. 0(f) Do you believe that Carlos’ survey suffers from sampling

bias? Why? Yes

x 7 M

47. (b) Skewed right

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 142

Page 16: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 143

3.2 MEASURES OF DISPERSION

Objectives 1 Compute the range of a variable from raw data2 Compute the variance of a variable from raw data3 Compute the standard deviation of a variable from raw data4 Use the Empirical Rule to describe data that are bell shaped5 Use Chebyshev’s inequality to describe any set of data

In Section 3.1, we discussed measures of central tendency. The purpose of thesemeasures is to describe the typical value of a variable. In addition to measuring thecentral tendency of a variable, we would also like to know the amount of dispersionin the variable. By dispersion, we mean the degree to which the data are spread out.An example should help to explain why measures of central tendency are not suffi-cient in describing a distribution.

EXAMPLE 1 Comparing Two Sets of Data

Problem: The data in Table 7 represent the IQ scores of a random sample of100 students from two different universities. For each university, compute the meanIQ score and draw a histogram, using a lower class limit of 55 for the first class anda class width of 15. Comment on the results.

Table 7

University A University B

73 103 91 93 136 108 92 104 90 78 86 91 107 94 105 107 89 96 102 96

108 93 91 78 81 130 82 86 111 93 92 109 103 106 98 95 97 95 109 109

102 111 125 107 80 90 122 101 82 115 93 91 92 91 117 108 89 95 103 109

103 110 84 115 85 83 131 90 103 106 110 88 97 119 90 99 96 104 98 95

71 69 97 130 91 62 85 94 110 85 87 105 111 87 103 92 103 107 106 97

102 109 105 97 104 94 92 83 94 114 107 108 89 96 107 107 96 95 117 97

107 94 112 113 115 106 97 106 85 99 98 89 104 99 99 87 91 105 109 108

102 109 76 94 103 112 107 101 91 107 116 107 90 98 98 92 119 96 118 98

107 110 106 103 93 110 125 101 91 119 97 106 114 87 107 96 93 99 89 94

118 85 127 141 129 60 115 80 111 79 104 88 99 97 106 107 112 97 94 107

Approach: We will use MINITAB to compute the mean and draw a histogram foreach university.

Solution: We enter the data into MINITAB and determine that the mean IQ scoreof both universities is 100.0. Figure 11 shows the histograms.

Figure 11

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 143

Page 17: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

144 Chapter 3 Numerically Summarizing Data

Both universities have the same mean IQ, but the histograms indicate the IQsfrom University A are more spread out, that is, more dispersed.While an IQ of 100.0is typical for both universities, it appears to be a more reliable description of the typ-ical student from University B than from University A. That is, a higher proportionof students from University B have IQ scores within, say, 15 points of the mean of100.0 than students from University A.

Our goal in this section is to discuss numerical measures of dispersion so that wecan quantify the spread of data. In this section, we discuss three numerical measuresfor describing the dispersion, or spread, of data: the range, variance, and standarddeviation. In Section 3.4, we will discuss another measure of dispersion, theinterquartile range (IQR).

1 Compute the Range of a Variable from Raw DataThe simplest measure of dispersion is the range. To compute the range, the datamust be quantitative.

Definition The range, R, of a variable is the difference between the largest data value andthe smallest data value. That is,

EXAMPLE 2 Computing the Range of a Set of Data

Problem: The data in Table 8 represent the scores on the first exam of 10 studentsenrolled in a section of Introductory Statistics. Compute the range.

Approach: The range is found by computing the difference between the largestand smallest data values.

Solution: The highest test score is 94 and the lowest test score is 62. The range is

All the students in the class scored between 62 and 94 on the exam. The differ-ence between the best score and the worst score is 32 points.

Now compute the range of the data in Problem 19

Notice that the range is affected by extreme values in the data set, so the rangeis not resistant. If Jennifer did not study and scored 28, the range becomes

In addition, the range is computed using only two values in thedata set (the largest and smallest). The variance and the standard deviation, on theother hand, use all the data values in the computations.

2 Compute the Variance of a Variable from Raw DataJust as there is a population mean and sample mean, we also have a populationvariance and a sample variance. Measures of dispersion are meant to describehow spread out data are. Another way to think about this is to describe how far,on average, each observation is from the mean. Variance is based on the deviation

R = 94 - 28 = 66.

R = 94 - 62 = 32

Range = R = largest data value - smallest data value

Note to InstructorYou may want to share with studentsthat the smallest data value is also re-ferred to as the minimum and the largestdata value is referred to as the maximum,as a preview of the five-number summary.

Note to InstructorConsider continuing with the data collectedfrom the population mean versus samplemean activity to illustrate Examples 2through 6.

Table 8

Student Score

1. Michelle 82

2. Ryanne 77

3. Bilal 90

4. Pam 71

5. Jennifer 62

6. Dave 68

7. Joel 74

8. Sam 84

9. Justine 94

10. Juan 88

Note to InstructorStress to the students that the varianceis difficult to interpret because it is mea-sured in units squared (such as pointssquared). However, the variance plays animportant role in statistical inference.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 144

Page 18: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 145

about the mean. For a population, the deviation about the mean for the ith obser-vation is For a sample, the deviation about the mean for the ith observa-tion is The further an observation is from the mean, the larger theabsolute value of the deviation.

The sum of all deviations about the mean must equal zero. That is,

In other words, observations greater than the mean are offset by observations lessthan the mean. Because the sum of deviations about the mean is zero, we cannot usethe average deviation about the mean as a measure of spread. However, squaringa nonzero number always results in a positive number, so we could find the averagesquared deviation.

Definition The population variance of a variable is the sum of the squared deviationsabout the population mean divided by the number of observations in the popu-lation, N. That is, it is the mean of the squared deviations about the populationmean. The population variance is symbolically represented by (lowercaseGreek sigma squared).

(1)

where are the N observations in the population and is the pop-ulation mean.

A formula that is equivalent to Formula (1), called the computational formulafor determining the population variance, is

(2)

where means to square each observation and then sum these squared values,and means to add up all the observations and then square the sum.

We illustrate how to use both formulas for computing the variance in the nextexample.

1©xi22©xi

2

s2=

axi2

-

Aaxi B2N

N

mx1, x2 , Á , xN

s2=

1x1 - m22 + 1x2 - m22 +Á

+ 1xN - m22N

= a1xi - m22

N

s2

a 1xi - m2 = 0 and a 1xi - x2 = 0

xi - x.xi - m.

Note to InstructorBe sure to let students know which, if any,formula you prefer.

EXAMPLE 3 Computing a Population Variance

Problem: Compute the population variance of the test scores presented inTable 8.

Approach Using Formula (1)Step 1: Create a table with four columns. Enter the populationdata in the first column. In the second column, enter the popula-tion mean.Step 2: Compute the deviation about the mean for each datavalue. That is, compute for each data value. Enter thesevalues in column 3.Step 3: Square the values in column 3, and enter the results incolumn 4.Step 4: Sum the squared deviations in column 4, and divide thisresult by the size of the population, N.

xi - m

Approach Using Formula (2)Step 1: Create a table with two columns. Enterthe population data in the first column. Squareeach value in the first column and enter the resultin the second column.

Step 2: Sum the entries in the first column. Thatis, find Sum the entries in the second column.That is, find

Step 3: Substitute the values found in Step 2 andthe value for N into the computational formulaand simplify.

©xi2.

©xi .

In using Formula (1), donot round until the last computation.Use as many decimal places asallowed by your calculator to avoidround-off errors.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 145

Page 19: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

146 Chapter 3 Numerically Summarizing Data

Step 2: Compute the deviations about the mean for each obser-vation, as shown in column 3. For example, the deviation aboutthe mean for Michelle is It is a good idea to add upthe entries in this column to make sure they sum to 0.Step 3: Column 4 shows the squared deviations about the mean.Notice the further an observation is from the mean, the larger thesquared deviation.Step 4: We sum the entries in column 4 to obtain the numera-tor of Formula (1). We compute the population variance by di-viding the sum of the entries in column 4 by the number ofstudents, 10:

s2=a 1xi - m22

N=

96410

= 96.4 points2

82 - 79 = 3.

Table 9

SolutionStep 1: See Table 9. Column 1 lists the observations in the dataset, and column 2 contains the population mean.

Table 10

Score, Score Squared,

82

77

90 8,100

71 5,041

62 3,844

68 4,624

74 5,476

84 7,056

94 8,836

88 7,744

axi2

= 63,374axi = 790

772= 5,929

822= 6,724

xi2xi

Score,xii

PopulationMean, M

Deviationabout theMean, xi � M

SquaredDeviations about the Mean, 1xi � M22

82 79 82 - 79 = 3 32= 9

77 79 77 - 79 = -2 (-2)2= 4

90 79 11 121

71 79 -8 64

62 79 -17 289

68 79 -11 121

74 79 -5 25

84 79 5 25

94 79 15 225

88 79 9 81

a 1xi - m22 = 964a 1xi - m2 = 0

SolutionStep 1: See Table 10. Column 1 lists the observa-tions in the data set, and column 2 contains thevalues in column 1 squared.

Step 2: The last row of columns 1 and 2 show thatand

Step 3: We substitute 790 for 63,374 forand 10 for N into the computational

Formula (2).

= 96.4 points2

=

96410

s2=

axi2

-

Aaxi B2N

N=

63,374 -

17902210

10

©xi2,

©xi ,©xi

2= 63,374.©xi = 790

The unit of measure of the variance in Example 3 is points squared. Thisunit of measure results from squaring the deviations about the mean. Becausepoints squared does not have any obvious meaning, the interpretation of thevariance is limited.

The sample variance is computed using sample data.

Definition The sample variance, is computed by determining the sum of the squared de-viations about the sample mean and dividing this result by The formulafor the sample variance from a sample of size n is

(3)

where are the n observations in the sample and is the samplemean.

xx1 , x2 , Á , xn

s2=a 1xi - x22

n - 1=

1x1 - x22 + 1x2 - x22 +Á

+ 1xn - x22n - 1

n - 1.s2,

When using Formula (3), besure to use with as many decimalplaces as possible to avoid round-offerror.

x

Note to InstructorIn-class concept check: Ask students toexplain why When calcu-lating why does it sometimesturn out to be a little more or less than 0?

©1xi - x2,©1xi - x2 = 0?

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 146

Page 20: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 147

A computational formula that is equivalent to Formula (3) for computing thesample variance is

(4)

Notice that the sample variance is obtained by dividing by If we dividedby n, as we might expect, the sample variance would consistently underestimatethe population variance. Whenever a statistic consistently overestimates or under-estimates a parameter, it is called biased. To obtain an unbiased estimate of thepopulation variance, we divide the sum of the squared deviations about the sam-ple mean by

To help understand the idea of a biased estimator, consider the followingsituation: Suppose you work for a carnival in which you must guess a person’sage. After 20 people come to your booth, you notice that you have a tendency tounderestimate people’s age. (You guess too low.) What would you do about this?In all likelihood, you would adjust your guesses higher so that you don’t underes-timate anymore. In other words, before the adjustment, your guesses were biased.To remove the bias, you increase your guess. This is what dividing by in thesample variance formula accomplishes. Dividing by n results in an underestimate,so we divide by a smaller number to increase our “guess.”

Although a proof that establishes why we divide by is beyond the scopeof the text, we can provide an explanation that has intuitive appeal. We alreadyknow that the sum of the deviations about the mean, , must equal zero.Therefore, if the sample mean is known and the first observations are known,then the nth observation must be the value that causes the sum of the deviations toequal zero. For example, suppose based on a sample of size 3. In addition, if

and then we can determine

We call the degrees of freedom because the first observationshave freedom to be whatever value they wish, but the nth value has no freedom. Itmust be whatever value forces the sum of the deviations about the mean to equalzero.

Again, you should notice that typically Greek letters are used for parameters,while Roman letters are used for statistics. Do not use rounded values of the samplemean in Formula (3).

EXAMPLE 4 Computing a Sample Variance

Problem: Compute the sample variance of the sample obtained in Example 1(b)on page 130 from Section 3.1.

Approach: We follow the same approach that we used to compute the populationvariance, but this time using the sample data. In looking back at Example 1(b) fromSection 3.1, we see that Bilal (90), Ryanne (77), Pam (71), and Michelle (82) are inthe sample.

n - 1n - 1

x3 = 7

5 + x3 = 12

x1 = 2, x2 = 3, x = 42 + 3 + x3

3= 4

x1 + x2 + x3

3= x

x3 .x2 = 3,x1 = 2x = 4

n - 1©1xi - x2

n - 1

n - 1

n - 1.

n - 1.

s2=

axi2

-

Aaxi B2n

n - 1

When computing the samplevariance, be sure to divide by not n.

n - 1,

Note to InstructorA visual to help students understand de-grees of freedom: Suppose there are sixdoughnuts in a box. The first five peopleto pick a doughnut have a choice (or free-dom) about which doughnut to pick. Thesixth person does not.

In Other WordsWe have degrees of freedom in thecomputation of because an unknownparameter, , is estimated with Foreach parameter estimated, we lose1 degree of freedom.

x.m

s2n - 1

Note to InstructorTo help students understand the conceptof bias, draw a dart board on the board.Have all the darts congregate in theupper-left quadrant of the board.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 147

Page 21: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

148 Chapter 3 Numerically Summarizing Data

Notice that the sample variance obtained for this sample is an underestimate ofthe population variance we found in Example 3. This discrepancy does not violateour definition of an unbiased estimator, however. A biased estimator is one thatconsistently under- or overestimates.

3 Compute the Standard Deviation of a Variable from Raw DataThe standard deviation and the mean are the most popular methods for numericallydescribing the distribution of a variable.This is because these two measures are usedfor most types of statistical inference.

Definitions The population standard deviation, is obtained by taking the square root ofthe population variance. That is,

The sample standard deviation, s, is obtained by taking the square root of thesample variance. That is,

EXAMPLE 5 Obtaining the Standard Deviation for a Population and a Sample

Problem: Use the results obtained in Examples 3 and 4 to compute the populationand sample standard deviation score on the statistics exam.

s = 3s2

s = 3s2

S,

Solution Using Formula (3)Step 1: Create a table with four columns. Enter the sample data inthe first column. In the second column, enter the sample mean. SeeTable 11.

Solution Using Formula (4)Step 1: See Table 12. Column 1 lists the observa-tions in the data set, and column 2 contains thevalues in column 1 squared.

Table 11

Score, xi

SampleMean, x

Deviation about the Mean, xi � x

Squared Deviations about the Mean, (xi � x)2

90 80

77 80 9

71 80 81

82 80 4

a 1xi - x 22 = 194a 1xi - x 2 = 0

2

-9

-3

102= 10090 - 80 = 10

Step 2: Compute the deviations about the mean for each observa-tion, as shown in column 3. For example, the deviation about themean for Bilal is It is a good idea to add up theentries in this column to make sure they sum to 0.Step 3: Column 4 shows the squared deviations about the mean.Step 4: We sum the entries in column 4 to obtain the numerator ofFormula (3). We compute the population variance by dividing thesum of the entries in column 4 by one fewer than the number ofstudents,

s2=a 1xi - x22

n - 1=

1944 - 1

= 64.7

4 - 1:

90 - 80 = 10.

Table 12

Score, Score squared,

90

77

71 5,041

82 6,724

ax i2 = 25,794axi = 320

77 2= 5,929

90 2= 8,100

x i2xi

Step 2: The last rows of columns 1 and 2 showthat and Step 3: We substitute 320 for 25,794 for and 4 for n into the computational Formula (4).

= 64.7

=

1943

s2=

axi2

-

Aaxi B2n

n - 1=

25,794 -

1320224

4 - 1

©xi2,©xi ,

©xi2

= 25,794.©xi = 320

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 148

Page 22: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 149

The Sample Standard DeviationUsing the pulse data from Section 3.1, page 131, do the following:(a) Obtain a simple random sample of students and computethe sample standard deviation.(b) Obtain a second simple random sample of students andcompute the sample standard deviation.(c) Are the sample standard deviations the same? Why?

n = 4

n = 4

Approach: The population standard deviation is the square root of the populationvariance. The sample standard deviation is the square root of the sample variance.

Solution: The population standard deviation is

The sample standard deviation for the sample obtained in Example 1 fromSection 3.1 is

To avoid round-off error, never use the rounded value of the variance to computethe standard deviation.

s = 3s2= C

©1xi - x22n - 1

= A194

4 - 1= 8.0 points

s = 3s2= Ca

1xi - m22N

= A96410

= 9.8 points

Never use the roundedvariance to compute the standarddeviation.

Now Work Problem 25

EXAMPLE 6 Determining the Variance and Standard Deviation Using Technology

Problem: Use statistical software or a calculator to determine the population stan-dard deviation of the data listed in Table 8. Also determine the sample standarddeviation of the sample data from Example 4.

Approach: We will use a TI-84 Plus graphing calculator to obtain the populationstandard deviation and sample standard deviation score on the statistics exam. Thesteps for determining the standard deviation using the TI-83 or TI-84 Plus graphingcalculator, MINITAB, or Excel are given in the Technology Step-by-Step on page 159.

Solution: Figure 12(a) shows the population standard deviation, and Figure 12(b)shows the sample standard deviation. Notice the TI graphing calculators provideboth a population and sample standard deviation as output. This is because thecalculator does not know whether the data entered are population data or sampledata. It is up to the user of the calculator to choose the correct standard devia-tion. The results agree with those obtained in Example 5. To get the variance, weneed to square the standard deviation. For example, the population variance is9.8183501672

= 96.4 points2.

(a) (b)

Samplestandarddeviation

Populationstandarddeviation

Figure 12

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 149

Page 23: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

150 Chapter 3 Numerically Summarizing Data

Is the standard deviation resistant? To help determine the answer, sup-pose Jennifer’s exam score (Table 8 on page 144) is 26, not 62. The populationstandard deviation increases from 9.8 to 18.3. Clearly, the standard deviation isnot resistant.

Interpretations of the Standard DeviationThe standard deviation is used in conjunction with the mean to numerically describedistributions that are bell shaped and symmetric. The mean measures the center ofthe distribution, while the standard deviation measures the spread of the distribu-tion. So how does the value of the standard deviation relate to the dispersion of thedistribution?

Loosely, we can interpret the standard deviation as the typical deviation fromthe mean. So, in the data from Table 8, a typical deviation from the mean,points, is 9.8 points.

If we are comparing two populations, then the larger the standard deviation,the more dispersion the distribution has provided that the variable of interest fromthe two populations has the same unit of measure. The units of measure must be thesame so that we are comparing apples with apples. For example, $100 is not the sameas 100 Japanese yen, because $1 is equivalent to about 109 yen. This means astandard deviation of $100 is substantially higher than a standard deviation of100 yen.

EXAMPLE 7 Comparing the Standard Deviation of Two Data Sets

Problem: Refer to the data in Example 1. Use the standard deviation to determinewhether University A or University B has more dispersion in the IQ scores of itsstudents.

Approach: We will use MINITAB to compute the standard deviation of IQ foreach university.The university with the higher standard deviation will be the univer-sity with more dispersion in IQ scores. Recall that, on the basis of the histograms, itwas apparent that University A had more dispersion. Therefore, we would expectUniversity A to have a higher sample standard deviation.

Solution: We enter the data into MINITAB and compute the descriptive statistics.See Figure 13.

m = 79

Note to InstructorTo help students understand the conceptof standard deviation, ask them whetherthey would prefer to invest in a stock witha standard deviation rate of return of 8%or one of 20%. Why? Or would they ratherstand in a line with a standard deviationwait time of 3 minutes or 6 minutes?Why? In both examples, assume themeans are equal.

Descriptive statisticsVariableUniv AUniv B

VariableUniv AUniv B

N100100

N*00

Mean100.00100.00

SE Mean1.610.83

StDev16.088.35

Minimum6086

Q19094

Median10298

Q3110107

Maximum141119

Figure 13

The sample standard deviation is larger for University A (16.1) than for Uni-versity B (8.4). Therefore, University A has IQ scores that are more dispersed.

4 Use the Empirical Rule to Describe DataThat Are Bell ShapedIf data have a distribution that is bell shaped, the Empirical Rule can be used to de-termine the percentage of data that will lie within k standard deviations of the mean.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 150

Page 24: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 151

The Empirical RuleIf a distribution is roughly bell shaped, then

• Approximately 68% of the data will lie within 1 standard deviation of themean. That is, approximately 68% of the data lie between and

• Approximately 95% of the data will lie within 2 standard deviations of themean. That is, approximately 95% of the data lie between and

• Approximately 99.7% of the data will lie within 3 standard deviations ofthe mean.That is, approximately 99.7% of the data lie between and

Note: We can also use the empirical rule based on sample data with used inplace of and s used in place of >s.m

x

m + 3s.m - 3s

m + 2s.m - 2s

m + 1s.m - 1s

Figure 14 illustrates the Empirical Rule.

68% within1 standarddeviation

95% within2 standard deviations

99.7% of data are within3 standard deviations of

the mean (m � 3s to m � 3s)

34%34%

13.5% 13.5%2.35%2.35% 0.15%0.15%

m � 3s m � 3sm � 2s m � 2sm � s m � sm

Figure 14

Let’s revisit the data from University A in Table 7.

EXAMPLE 8 Using the Empirical Rule

Problem: Use the data from University A in Table 7.(a) Determine the percentage of students who have IQ scores within 3 standarddeviations of the mean according to the Empirical Rule.(b) Determine the percentage of students who have IQ scores between 67.8 and132.2 according to the Empirical Rule.(c) Determine the actual percentage of students who have IQ scores between 67.8and 132.2.(d) According to the Empirical Rule, what percentage of students will haveIQ scores above 132.2?

Approach: To use the Empirical Rule, a histogram of the data must be roughlybell shaped. Figure 15 shows the histogram of the data from University A.

Solution: The histogram of the data drawn in Figure 15 is roughly bell shaped.From Example 7 we know that the mean IQ score of the students enrolled in Uni-versity A is 100 and the standard deviation is 16.1. To help organize our thoughtsand make the analysis easier, we draw a bell-shaped curve like the one in Figure 14,with the and See Figure 16.s = 16.1.x = 100

40

30

20

10

0

Freq

uenc

y

0 55 70 85 100 115 160130 145

IQ Scores

University A IQ Scores

Figure 15

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 151

Page 25: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

152 Chapter 3 Numerically Summarizing Data

For example, at least of all observations will lie within

standard deviations of the mean and at least of all

observations will lie within standard deviations of the mean.Notice the result does not state that exactly 75% of all observations lie within

2 standard deviations of the mean, but instead states that 75% or more of the obser-vations will lie within 2 standard deviations of the mean.

k = 3

a1 -

1

32 b100% = 88.9%k = 2

a1 -

1

22 b100% = 75%

0.15%2.35%

13.5% 13.5%

34% 34%

0.15%2.35%

100 – 3(16.1)= 51.7

100 – 2(16.1)= 67.8

100 – 16.1= 83.9

100 + 16.1= 116.1

100 + 2(16.1)= 132.2

100 + 3(16.1)= 148.3

100

Figure 16

(a) According to the Empirical Rule, approximately 99.7% of the IQ scores will bewithin 3 standard deviations of the mean. That is, approximately 99.7% of the datawill be greater than or equal to and less than or equal to

(b) Since 67.8 is exactly 2 standard deviations below the meanand 132.2 is exactly 2 standard deviations above the mean we use the Empirical Rule to determine that approximately 95% of all IQ scores liebetween 67.8 and 132.2.(c) Of the 100 IQ scores listed in Table 7, 96, or 96%, are between 67.8 and 132.2.This is very close to the approximation given by the Empirical Rule.(d) Based on Figure 16, approximately of students at Uni-versity A will have IQ scores above 132.2.

2.35% + 0.15% = 2.5%

[100 + 2116.12 = 132.2],67.8][100 - 2116.12 =

100 + 3116.12 = 148.3.100 - 3116.12 = 51.7

Now Work Problem 35

5 Use Chebyshev’s Inequality to Describe Any Set of DataThe Russian mathematician Pafnuty Chebyshev (1821–1894) developed an in-equality that is used to determine a lower bound on the percentage of observa-tions that lie within k standard deviations of the mean, where What’samazing about this result is that the bound is obtained regardless of the basicshape of the distribution (skewed left, skewed right, or symmetric).

k 7 1.

Chebyshev’s Inequality

For any data set, regardless of the shape of the distribution, at least

of the observations will lie within k standard deviations of the

mean, where k is any number greater than 1. That is, at least of

the data will lie between and for

Note: We can also use Chebyshev’s Inequality based on sample data. >

k 7 1.m + ksm - ks

a1 -

1

k2 b100%

a1 -

1

k2 b100%

Note to InstructorIf you are pressed for time, Chebyshev’sInequality can be skipped without loss ofcontinuity.

The Empirical Rule holdsonly if the distribution is bell shaped.Chebyshev’s Inequality holdsregardless of the shape of thedistribution.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 152

Page 26: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 153

Now Work Problem 39

EXAMPLE 9 Using Chebyshev’s Inequality

Problem: Using the data from University A in Table 7.(a) Determine the minimum percentage of students who have IQ scores within3 standard deviations of the mean according to Chebyshev’s Inequality.(b) Determine the minimum percentage of students who have IQ scores between67.8 and 132.2, according to Chebyshev’s Inequality.(c) Determine the actual percentage of students who have IQ scores between 67.8and 132.2.

Approach(a) We use Chebyshev’s Inequality with (b) We have to determine the number of standard deviations 67.8 and 132.2 arefrom the mean of 100.0. We then substitute this value of k into Chebyshev’sInequality.(c) We refer to Table 7 and count the number of observations between 67.8 and132.2. We divide this result by 100, the number of observations in the data set.

Solution(a) We use Chebyshev’s Inequality with and determine that at least

of all students have IQ scores within 3 standard devia-

tions of the mean. Since the mean of the data set is 100.0 and the standard devia-tion is 16.1, at least 88.9% of the students have IQ scores between

and (b) Since 67.8 is exactly 2 standard deviations below the meanand 132.2 is exactly 2 standard deviations above the meanwe use Chebyshev’s Inequality with to determine that at least

of all IQ scores lie between 67.8 and 132.2.

(c) Of the 100 IQ scores listed, 96 or 96% is between 67.8 and 132.2. Notice thatChebyshev’s inequality provides a rather conservative result.

a1 -

1

22 b100% = 75%

k = 2[100 + 2116.12 = 132.2],

[100 - 2116.12 = 67.8]x + ks = 100 + 3116.12 = 148.3.x - ks = 100.0 - 3116.12 = 51.7

a1 -

1

32 b100% = 88.9%

k = 3

k = 3.

Because the Empirical Rule requires that the distribution be bell shaped, whileChebyshev’s Inequality applies to all distributions, the Empirical Rule providesresults that are more precise.

3.2 ASSESS YOUR UNDERSTANDING

Concepts and Vocabulary1. Would it be appropriate to say that a distribution with a

standard deviation of 10 centimeters is more dispersedthan a distribution with a standard deviation of 5 inches?Support your position.

2. What is meant by the phrase degrees of freedom as itpertains to the computation of the sample variance?

3. Are any of the measures of dispersion mentioned in this sec-tion resistant? Explain.

4. The sum of the deviations about the mean always equals______.

5. What does it mean when a statistic is biased?6. What makes the range less desirable than the standard

deviation as a measure of dispersion?7. In one of Sullivan’s statistics sections, the standard deviation

of the heights of all students was 3.9 inches. The standard

deviation of the heights of males was 3.4 inches and thestandard deviation of females was 3.3 inches. Why isthe standard deviation of the entire class more than thestandard deviation of the males and females consideredseparately?

8. The standard deviation is used in conjunction with the______ to numerically describe distributions that are bellshaped. The ______ measures the center of the distribution,while the standard deviation measures the ______ of the dis-tribution.

9. True or False: When comparing two populations, the largerthe standard deviation, the more dispersion the distributionhas, provided that the variable of interest from the twopopulations has the same unit of measure. True

10. True or False: Chebyshev’s Inequality applies to all distribu-tions regardless of shape, but the Empirical Rule holds onlyfor distributions that are bell shaped. True

HistoricalNotes

Pafnuty Chebyshev wasborn on May 16, 1821, in Okatovo,Russia. In 1847, he began teachingmathematics at the University of St.Petersburg. Some of his more famouswork was done on prime numbers.In particular, he discovered a wayto determine the number of primenumbers less than or equal to a givennumber. Chebyshev also studiedmechanics, including rotary motion.Chebyshev was elected a Fellow of theRoyal Society in 1877. He died onNovember 26, 1894, in St. Petersburg.

zero

meanmean

spread

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 153

Page 27: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

154 Chapter 3 Numerically Summarizing Data

Freq

uenc

y

987654321

40 44 48 52 56 60

10

8

6

4

2Freq

uenc

y

30 35 40 45 50 6555 60 7570

(a) (b)

22. Match the histograms to the summary statistics given.

Mean MedianStandard Deviation

I 53 53 1.8

II 60 60 11

III 53 53 10

IV 53 53 22

Fre

quen

cy

20

10

048 50 52 54 56

20

10

0

Fre

quen

cy

32 42 52 62 72 82

Fre

quen

cy

10

5

023 43 63 83 103 123

Fre

quen

cy

42 52 62 72 82

15

10

5

0

(a)

(c)

(b)

(d)

Applying the Concepts 23. (a) Tap water (0.59 vs. 0.26)

23. pH in Water The acidity or alkalinity of a solution is meas-ured using pH. A pH less than 7 is acidic, while a pH greaterthan 7 is alkaline. The following data represent the pH insamples of bottled water and tap water.(a) Which type of water has more dispersion in pH using the

range as the measure of dispersion?(b) Which type of water has more dispersion in pH using the

standard deviation as the measure of dispersion?Tap water (0.154 vs. 0.081)

24. Reaction Time In an experiment conducted online at theUniversity of Mississippi, study participants are asked toreact to a stimulus. In one experiment, the participant mustpress a key upon seeing a blue screen. The time (in seconds)to press the key is measured. The same person is thenasked to press a key upon seeing a red screen, again with thetime to react measured. The results for six study participantsare listed in the table.(a) Which color has more dispersion using the range as the

measure of dispersion? Blue(b) Which color has more dispersion using the standard

deviation as the measure of dispersion? Blue

1 0.582 0.408

2 0.481 0.407

3 0.841 0.542

4 0.267 0.402

5 0.685 0.456

6 0.450 0.533

Source: PsychExperiments at the University of Mississippi (www.olemiss.edu/psychexps)

ParticipantNumber

Reaction Time to Blue

Reaction Time to Red

Skill Building 19. 1150; 211,275; 459.6

In Problems 11–16, find the population variance and standard de-viation or the sample variance and standard deviation as indicated.

11. Sample: 20, 13, 4, 8, 10

12. Sample: 83, 65, 91, 87, 84

13. Population: 3, 6, 10, 12, 14

14. Population: 1, 19, 25, 15, 12, 16, 28, 13, 6

15. Sample: 6, 52, 13, 49, 35, 25, 31, 29, 31, 29

16. Population: 4, 10, 12, 12, 13, 21

17. Crash Test Results The Insurance Institute for Highway Safetycrashed the 2007 Audi A4 four times at 5 miles per hour. Thecosts of repair for each of the four crashes are as follows:

$976, $2038, $918, $1899

Compute the range, sample variance, and sample standarddeviation cost of repair. $1120; 351,601.6; $593.0

18. Cell Phone Use The following data represent the monthlycell phone bill for my wife’s phone for six randomly selectedmonths:

$35.34, $42.09, $39.43, $38.93, $43.39, $49.26

Compute the range, sample variance, and sample standarddeviation phone bill. $13.92; 22.584; $4.752

19. Concrete Mix A certain type of concrete mix is designed towithstand 3,000 pounds per square inch (psi) of pressure.The strength of concrete is measured by pouring the mixinto casting cylinders 6 inches in diameter and 12 inchestall. The concrete is allowed to set up for 28 days. The con-crete’s strength is then measured (in psi). The followingdata represent the strength of nine randomly selected casts:

3960, 4090, 3200, 3100, 2940, 3830, 4090, 4040, 3780

Compute the range, sample variance, and sample standard de-viation for the strength of the concrete (in psi).

20. Flight Time The following data represent the flight time (inminutes) of a random sample of seven flights from Las Vegas,Nevada, to Newark, New Jersey, on Continental Airlines.

282, 270, 260, 266, 257, 260, 267

Compute the range, sample variance, and sample standarddeviation of flight time. 25; 71 min2; 8.4 min

21. Which histogram depicts a higher standard deviation?Justify your answer. (b)

s � 5s2 � 25;s � 14s2 � 196;s � 8s2 � 64;

s2 � 16; s � 4s � 10s2 � 100;

s � 6s2 � 36;

(a) III (b) I (c) IV (d) II

NW

Tap 7.64 7.45 7.47 7.50 7.68 7.69

7.45 7.10 7.56 7.47 7.52 7.47

Bottled 5.15 5.09 5.26 5.20 5.02 5.23

5.28 5.26 5.13 5.26 5.21 5.24

Source: Emily McCarney, student at Joliet Junior College

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 154

Page 28: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 155

(a) Compute the population standard deviation.(b) Determine three simple random samples of size 3, and

compute the sample standard deviation of each sample.(c) Which samples underestimate the population standard

deviation? Which overestimate the population standarddeviation?

26. Travel Time The following data represent the travel time(in minutes) to school for nine students enrolled inSullivan’s College Algebra course. Treat the nine studentsas a population.

25. Pulse Rates The following data represent the pulse rates(beats per minute) of nine students enrolled in a section ofSullivan’s course in Introductory Statistics. Treat the ninestudents as a population. 25. (a) 7.7 beats per minute

Student Travel Time Student Travel Time

Amanda 39 Scot 45

Amber 21 Erica 11

Tim 9 Tiffany 12

Mike 32 Glenn 39

Nicole 30

Student Pulse

Perpectual Bempah 76

Megan Brooks 60

Jeff Honeycutt 60

Clarice Jefferson 81

Crystal Kurtenbach 72

Janette Lantka 80

Kevin McCarthy 80

Tammy Ohm 68

Kathy Wojdyla 73

(a) Compute the population standard deviation. 12.8 minutes(b) Determine three simple random samples of size 4, and

compute the sample standard deviation of each sample.(c) Which samples underestimate the population standard

deviation? Which overestimate the population standarddeviation?

27. A Fish Story Ethan and Drew went on a 10-day fishing trip.The number of smallmouth bass caught and released by thetwo boys each day was as follows:

NW28. Soybean Yield The following data represent the number of

pods on a sample of soybean plants for two different plottypes.Which plot type do you think is superior? Why? Liberty

(a) Find the population mean and the range for the numberof smallmouth bass caught per day by each fisherman. Dothese values indicate any differences between the twofishermen’s catches per day? Explain.

(b) Find the population standard deviation for the number ofsmallmouth bass caught per day by each fisherman. Dothese values present a different story about the two fisher-men’s catches per day? Which fisherman has the moreconsistent record? Explain. Ethan: Drew:

(c) Discuss limitations of the range as a measure ofdispersion.

s � 7.9s � 4.9;

range � 19m � 10;

29. The Empirical Rule The following data represent the weights(in grams) of a random sample of 50 M&M plain candies.

(a) Determine the sample standard deviation weight. Expressyour answer rounded to three decimal places. 0.036 gram

(b) On the basis of the histogram drawn in Section 3.1,Problem 29, comment on the appropriateness of usingthe Empirical Rule to make any general statementsabout the weights of M&Ms. Okay to use

(c) Use the Empirical Rule to determine the percentage ofM&Ms with weights between 0.803 and 0.947 gram.Hint: 95%

(d) Determine the actual percentage of M&Ms that weighbetween 0.803 and 0.947 gram, inclusive. 96%

(e) Use the Empirical Rule to determine the percentage ofM&Ms with weights more than 0.911 gram. 16%

(f) Determine the actual percentage of M&Ms that weighmore than 0.911 gram. 12%

30. The Empirical Rule The following data represent the lengthof eruption for a random sample of eruptions at the OldFaithful geyser in Calistoga, California.

x = 0.875.

Plot Type Pods

Liberty 32 31 36 35 44 31 39 37 38

No Till 35 31 32 30 43 33 37 42 40

Source: Andrew Dieter and Brad Schmidgall, students at Joliet JuniorCollege

0.87 0.88 0.82 0.90 0.90 0.84 0.84

0.91 0.94 0.86 0.86 0.86 0.88 0.87

0.89 0.91 0.86 0.87 0.93 0.88

0.83 0.95 0.87 0.93 0.91 0.85

0.91 0.91 0.86 0.89 0.87 0.84

0.88 0.88 0.89 0.79 0.82 0.83

0.90 0.88 0.84 0.93 0.81 0.90

0.88 0.92 0.85 0.84 0.84 0.86

Source: Michael Sullivan

(a) Determine the sample standard deviation length oferuption. Express your answer rounded to the nearestwhole number. 6 seconds

(b) On the basis of the histogram drawn in Section 3.1,Problem 30, comment on the appropriateness of usingthe Empirical Rule to make any general statementsabout the length of eruptions. Okay to use

108 108 99 105 103 103 94

102 99 106 90 104 110 110

103 109 109 111 101 101

110 102 105 110 106 104

104 100 103 102 120 90

113 116 95 105 103 101

100 101 107 110 92 108

Source: Ladonna Hansen, Park Curator

Ethan 9 24 8 9 5 8 9 10 8 10

Drew 15 2 3 18 20 1 17 2 19 3

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 155

Page 29: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

156 Chapter 3 Numerically Summarizing Data

Describe each data set. That is, determine the shape, center,and spread. Which car would you buy and why?

32. Which Investment Is Better? You have received a year-endbonus of $5,000. You decide to invest the money in the stockmarket and have narrowed your investment options downto two mutual funds.The following data represent the histor-ical quarterly rates of return of each mutual fund for the past20 quarters (5 years).

(c) Use the Empirical Rule to determine the percentageof eruptions that last between 92 and 116 seconds.Hint: 95%

(d) Determine the actual percentage of eruptions that lastbetween 92 and 116 seconds, inclusive. 93%

(e) Use the Empirical Rule to determine the percentage oferuptions that last less than 98 seconds. 16%

(f) Determine the actual percentage of eruptions that lastless than 98 seconds. 11%

31. Which Car Would You Buy? Suppose that you are in themarket to purchase a car. With gas prices on the rise, youhave narrowed it down to two choices and will let gasmileage be the deciding factor. You decide to conduct a littleexperiment in which you put 10 gallons of gas in the car anddrive it on a closed track until it runs out gas. You conductthis experiment 15 times on each car and record the numberof miles driven.

x = 104.

Car 1

228 223 178 220 220

233 233 271 219 223

217 214 189 236 248

Car 2

277 164 326 215 259

217 321 263 160 257

239 230 183 217 230

Describe each data set. That is, determine the shape, center,and spread. Which mutual fund would you invest in andwhy?

33. Rates of Return of Stocks Stocks may be categorized by in-dustry. The following data represent the 5-year rates of re-turn (in percent) for a sample of financial stocks and energystocks ending December 3, 2007.(a) Compute the mean and the median rate of return for each

industry.Which sector has the higher mean rate of return?Which sector has the higher median rate of return?

risk. Typically, an investor “pays” for a higher return byaccepting more risk. Is the investor paying for higher re-turns in this instance? Do you think the higher returnsare worth the cost?

34. Temperatures It is well known that San Diego has milderweather than Chicago, but which city has more dispersion intemperatures over the course of a month? In particular, whichcity has more dispersion in high temperatures in the month ofNovember? Use the following data, which represent the dailyhigh temperatures for each day in November, 2007. In whichcity would you rather be a meteorologist? Why?

35. The Empirical Rule One measure of intelligence is theStanford–Binet Intelligence Quotient (IQ). IQ scores have abell-shaped distribution with a mean of 100 and a standarddeviation of 15.(a) What percentage of people has an IQ score between

70 and 130? 95%(b) What percentage of people has an IQ score less than

70 or greater than 130? 5%(c) What percentage of people has an IQ score greater than

130? 2.5%

Mutual Fund B

6.7 11.9 4.3 4.3

3.5 10.5 2.9 3.8 5.9

1.4 8.9 0.3

3.4 7.7 12.9-1.1-4.7

-2.4-6.7

-5.4

Mutual Fund A

1.3 0.6 6.8 5.0

5.2 4.8 2.4 3.0 1.8

7.3 8.6 3.4 3.8

6.4 1.9 3.1-2.3-0.5

-1.3

-0.3

Financial Stocks

16.01 29.66 8.58 61.90 16.27

18.30 47.16 31.40 26.57 12.15

0.86 25.95 16.54 15.21 25.89

14.44 77.82 50.75 15.52 1.05

7.82 8.30 10.01 7.24 13.53

Energy Stocks

44.39 7.22 18.82 106.05 41.29

42.79 23.10 44.87 34.18 40.67

29.32 39.08 20.35 52.94 42.56

39.89 13.07 71.03 28.68 33.76

24.75 17.39 41.29 26.61 22.26

Source: morningstar.com

Daily High Temperature, Chicago (°F)

54.0 43.0 52.0 39.9 51.8 39.2

57.2 42.8 66.0 44.6 41.0 39.0

55.4 52.0 60.8 42.8 30.9 48.2

53.6 48.0 55.4 61.0 37.9 48.0

53.6 51.1 46.4 61.0 37.9 35.6

Daily High Temperature, San Diego (°F)

66.0 63.0 64.4 64.9 64.4 71.1

64.9 63.0 70.0 63.0 64.0 71.1

66.2 63.0 80.6 63.0 62.6 73.9

64.4 63.0 84.0 62.6 69.8 73.9

64.9 66.0 84.0 62.1 72.0 71.1

Source: National Climatic Data Center

NW

33. (a) Energy; energy 33. (b) yesEnergy: s � 20.080%;

Financial: s � 18.936%;M � 34.18%;Energy: x � 36.254%,M � 16.01%; Financial: x � 22.357%,

(b) Compute the standard deviation for each industry. Infinance, the standard deviation rate of return is called

34. ; F San Diego: ; ; Chicago has more dispersions � 6.14°FR � 21.9°F°s � 8.54Chicago: R � 35.1°F

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 156

Page 30: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 157

36. The Empirical Rule SAT Math scores have a bell-shaped dis-tribution with a mean of 515 and a standard deviation of 114.Source: College Board, 2007 36. (a) 68%(a) What percentage of SAT scores is between 401 and 629?(b) What percentage of SAT scores is less than 401 or

greater than 629? 32%(c) What percentage of SAT scores is greater than 743? 2.5%

37. The Empirical Rule The weight, in grams, of the pair of kid-neys in adult males between the ages of 40 and 49 have abell-shaped distribution with a mean of 325 grams and astandard deviation of 30 grams. 37. (a) 265 and 385 grams(a) About 95% of kidney pairs will be between what weights?(b) Whatpercentageofkidneypairsweighsbetween235grams

and 415 grams? 99.7%(c) What percentage of kidney pairs weighs less than 235 grams

or more than 415 grams? 0.3%(d) Whatpercentageofkidneypairsweighsbetween295grams

and 385 grams? 81.5%38. The Empirical Rule The distribution of the length of bolts

has a bell shape with a mean of 4 inches and a standard devi-ation of 0.007 inch.(a) About 68% of bolts manufactured will be between what

lengths? 3.993 and 4.007 inches(b) What percentage of bolts will be between 3.986 inches

and 4.014 inches? 95%(c) If the company discards any bolts less than 3.986 inches

or greater than 4.014 inches, what percentage of boltsmanufactured will be discarded? 5%

(d) What percentage of bolts manufactured will be between4.007 inches and 4.021 inches? 15.85%

39. Chebyshev’s Inequality In December 2007, the average priceof regular unleaded gasoline excluding taxes in the UnitedStates was $3.06 per gallon, according to the Energy Informa-tion Administration.Assume that the standard deviation priceper gallon is $0.06 per gallon to answer the following.(a) What minimum percentage of gasoline stations had

prices within 3 standard deviations of the mean? 88.9%(b) What minimum percentage of gasoline stations had

prices within 2.5 standard deviations of the mean? Whatare the gasoline prices that are within 2.5 standard devi-ations of the mean? 84%; $2.91 to $3.21

(c) What is the minimum percentage of gasoline stationsthat had prices between $2.94 and $3.18? 75%

40. Chebyshev’s Inequality According to the U.S. Census Bu-reau, the mean of the commute time to work for a residentof Boston, Massachusetts, is 27.3 minutes. Assume that thestandard deviation of the commute time is 8.1 minutes to an-swer the following: 40. (a) 75%(a) What minimum percentage of commuters in Boston has a

commute time within 2 standard deviations of the mean?(b) What minimum percentage of commuters in Boston has a

commute time within 1.5 standard deviations of the mean?What are the commute times within 1.5 standard devia-tions of the mean? 55.6%; 15.15 minutes to 39.45 minutes

(c) What is the minimum percentage of commuters who havecommute times between 3 minutes and 51.6 minutes?

41. Comparing Standard Deviations The standard deviation ofbatting averages of all teams in the American League is0.008. The standard deviation of all players in the AmericanLeague is 0.02154. Why is there less variability in team bat-ting averages?

40. (c) 88.9% 42. (a) 42. (c) s � 13.3s2 � 176.4,R � 47.25,s � $12.6s2 � 160,R � $45,42. Linear Transformations Benjamin owns a small Internet

business. Besides himself, he employs nine other people. Thesalaries earned by the employees are given next in thou-sands of dollars (Benjamin’s salary is the largest, of course):

30, 30, 45, 50, 50, 50, 55, 55, 60, 75

(a) Determine the range, population variance, and popula-tion standard deviation for the data.

(b) Business has been good! As a result, Benjamin has atotal of $25,000 in bonus pay to distribute to his employ-ees. One option for distributing bonuses is to give eachemployee (including himself) $2,500. Add the bonusesunder this plan to the original salaries to create a newdata set. Recalculate the range, population variance, andpopulation standard deviation. How do they compare tothe originals?

(c) As a second option, Benjamin can give each employee abonus of 5% of his or her original salary.Add the bonusesunder this second plan to the original salaries to create anew data set. Recalculate the range, population vari-ance, and population standard deviation. How do theycompare to the originals?

(d) As a third option, Benjamin decides not to give his em-ployees a bonus at all. Instead, he keeps the $25,000 forhimself. Use this plan to create a new data set. Recalcu-late the range, population variance, and populationstandard deviation. How do they compare to theoriginals?

43. Resistance and Sample Size Each of the following threedata sets represents the IQ scores of a random sample ofadults. IQ scores are known to have a mean and median of100. For each data set, determine the sample standard devia-tion. Then recompute the sample standard deviation assum-ing that the individual whose IQ is 106 is accidentallyrecorded as 160. For each sample size, state what happens tothe standard deviation. Comment on the role that the num-ber of observations plays in resistance.

s � $18.5s2 � 341.3,R � $70,

s � $12.6s2 � 160,R � $45,

NW

Sample of Size 5

106 92 98 103 100

Sample of Size 12

106 92 98 103 100 102

98 124 83 70 108 121

Sample of Size 30

106 92 98 103 100 102

98 124 83 70 108 121

102 87 121 107 97 114

140 93 130 72 81 90

103 97 89 98 88 103

44. Identical Values Compute the sample standard deviation ofthe following test scores: 78, 78, 78, 78. What can be saidabout a data set in which all the values are identical?

45. Blocking and Variability Recall that blocking refers to theidea that we can reduce the variability in a variable bysegmenting the data by some other variable. The given data

s � 0

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 157

Page 31: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

158 Chapter 3 Numerically Summarizing Data

represent the recumbent length (in centimeters) of a sampleof 10 males and 10 females who are 40 months of age.

Males Females

104.0 94.4 102.5 100.8

93.7 97.6 100.4 96.3

98.3 100.6 102.7 105.0

86.2 103.0 98.1 106.5

90.7 100.9 95.4 114.5

Source: National Center for Health Statistics

(a) Determine the standard deviation of recumbent lengthfor all 20 observations. 6.11 cm

(b) Determine the standard deviation of recumbent lengthfor the males. 5.67 cm

(c) Determine the standard deviation of recumbent lengthfor the females. 5.59 cm

(d) What effect does blocking by gender have on the standarddeviation of recumbent length for each gender?

46. Coefficient of Variation The coefficient of variation is ameasure that allows for the comparison of two or more vari-ables measured on a different scale. It measures the relativevariability in terms of the mean and does not have a unit ofmeasure.The lower the coefficient of variation is, the less thedata vary. The coefficient of variation is defined as

For example, suppose the senior class of a school district hasa mean ACT score of 23 with a standard deviation of 4 and amean SAT score of 1,100 with a standard deviation of 150.Which test has more variability? The coefficient of variation

for the ACT exam is the coefficient of

variation for the SAT exam is The ACT has more variability.(a) Suppose the systolic blood pressure of a random sample

of 100 students before exercising has a sample mean of121, with a standard deviation of 14.1. The systolic bloodpressure after exercising is 135.9, with a standard devia-tion of 18.1. Is there more variability in systolic bloodpressure before exercise or after exercise? After

(b) An investigation of intracellular calcium and blood pres-sure was conducted by Erne, Bolli, Buergisser, andBuehler in 1984 and published in the New England Journalof Medicine, 310:1084–1088. The researchers measuredthe free calcium concentration in the blood platelets of38 people with normal blood pressure and 45 peoplewith high blood pressure. The mean and standard devia-tion of the normal blood pressure group were 107.9 and16.1, respectively. The mean and standard deviation ofthe high blood pressure group were 168.2 and 31.7, re-spectively. Is there more variability in free calcium con-centration in the normal blood pressure or the highblood pressure group? High blood pressure group

47. Mean Absolute Deviation Another measure of variation isthe mean absolute deviation. It is computed using the formula

Compute the mean absolute deviation of the data in Prob-lem 17 and compare the results with the sample standarddeviation. MAD � $510.75; s � $592.96

MAD =

a ƒ xi - x ƒ

n

1501100

# 100% = 13.6%.

423

# 100% = 17.4%;

CV =

standard deviationmean

# 100%

48. Coefficient of Skewness Karl Pearson developed a measurethat describes the skewness of a distribution, called thecoefficient of skewness. The formula is

The value of this measure generally lies between and The closer the value lies to the more the distribution isskewed left.The closer the value lies to the more the dis-tribution is skewed right. A value close to 0 indicates a sym-metric distribution. Find the coefficient of skewness of thefollowing distributions and comment on the skewness.(a) 3(b) standard 0(c) standard (d) Compute the coefficient of skewness for the data in

Problem 29.(e) Compute the coefficient of skewness for the data in

Problem 30. 0.05

49. Diversification A popular theory in investment states thatyou should invest a certain amount of money in foreign in-vestments to reduce your risk. The risk of a portfolio is de-fined as the standard deviation of the rate of return. Refer tothe following graph, which depicts the relation between risk(standard deviation of rate of return) and reward (mean rateof return).

0

�2.50deviation = 120median = 500,Mean = 400,deviation = 15median = 100,Mean = 100,

standard deviation = 10median = 40,Mean = 50,

+3,-3,

+3.-3

Skewness =

31mean - median2standard deviation

18.5

18

17.5

17

16.5

16

15.5

15

14.5

14

13.513.5 14 14.5 15 15.5 16 16.5

30% (30% foreign/70% U.S.)

(100% foreign)

100%90%

80%70%

60%

50%

40%

20%

10%

Less riskthan a100% U.S.portfolio.

0% (100% U.S.)

17 17.5 18 18.5 19 19.5 20

Level of Risk (Standard Deviation)

Source: T. Rowe Price

How Foreign StocksBenefit a Domestic Portfolio

Ave

rage

Ann

ual R

etur

ns (

1984

–199

4)

(a) Determine the average annual return and level of risk ina portfolio that is 10% foreign.

(b) Determine the percentage that should be invested inforeign stocks to best minimize risk. 30%

(c) Why do you think risk initially decreases as the percentof foreign investments increases?

(d) A portfolio that is 30% foreign and 70% American hasa mean rate of return of about 15.8%, with a standarddeviation of 14.3%. According to Chebyshev’s inequali-ty, at least 75% of returns will be between what values?According to Chebyshev’s inequality, at least 88.9% ofreturns will be between what two values? Should an in-vestor be surprised if she has a negative rate of return?Why? Between and 44.4%; between and58.7%; no

-27.1%- 12.8%

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 158

Page 32: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.2 Measures of Dispersion 159

Basement Waterproofing Coatings

A waterproofing coating can be an inexpensive and easy wayto deal with leaking basements. But how effective are they?In a study, Consumer Reports tested nine waterproofers torate their effectiveness in controlling water seepage thoughconcrete foundations.

To compare the products’ ability to control water seep-age, we applied two coats of each product to slabs cut fromconcrete block. For statistical validity, this process was re-peated at least six times. In each test run, four blocks (eachcoated with a different product) were simultaneously placedin a rectangular aluminum chamber. See the picture.

The chamber was sealed and filled with water and theblocks were subjected to progressively increasing hydrostaticpressures. Water that leaked out during each period waschanneled to the bottom of the chamber opening, collected,and weighed.

The table contains a subset of the data collected for twoof the products tested. Using these data:

(a) Calculate the mean, median, and mode weight of watercollected for product A.(b) Calculate the standard deviation of the weight of watercollected for product A.(c) Calculate the mean, median, and mode weight of watercollected for product B.(d) Calculate the standard deviation of the weight of watercollected for product B.

Does there appear to be a difference in these two prod-ucts’ abilities to mitigate water seepage? Why?

Note to Readers: In many cases, our test protocol and analyticalmethods are more complicated than described in these examples. Thedata and discussions have been modified to make the material moreappropriate for the audience.

Basement Waterproofer Test Chamber

Product ReplicateWeight of Collected Water (grams)

A 1 91.2A 2 91.2A 3 90.9A 4 91.3A 5 90.8A 6 90.8B 1 87.1B 2 87.2B 3 86.8B 4 87.0B 5 87.2B 6 87.0

© 2002 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057, a nonprofit organization. Reprinted with permission fromthe June 2002 issue of CONSUMER REPORTS® for educationalpurposes only. No commercial use or photocopying permitted.To learn more about Consumers Union, log onto www.ConsumerReports.org.

The same steps followed to obtain the measuresof central tendency from raw data can be used toobtain the measures of dispersion.

TECHNOLOGY STEP-BY-STEP Determining the Range, Variance,and Standard Deviation

(e) Construct back-to-back stem-and-leaf diagrams for thesedata.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 159

Page 33: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

160 Chapter 3 Numerically Summarizing Data

3.3 MEASURES OF CENTRAL TENDENCY AND DISPERSION FROM GROUPED DATA

Preparing for This Section Before getting started, review the following:

EXAMPLE 1 Approximating the Mean for Continuous Quantitative Data from a Frequency Distribution

Problem: The frequency distribution in Table 13 represents the 3-year rate ofreturn of a random sample of 40 small-capitalization growth mutual funds. Approx-imate the mean 3-year rate of return.

Objectives 1 Approximate the mean of a variable from grouped data2 Compute the weighted mean3 Approximate the variance and standard deviation of a variable from

grouped data

We have discussed how to compute descriptive statistics from raw data, but manytimes the data that we have access to have already been summarized in frequencydistributions (grouped data). While we cannot obtain exact values of the mean orstandard deviation without raw data, these measures can be approximated using thetechniques discussed in this section.

1 Approximate the Mean of a Variable from Grouped DataSince raw data cannot be retrieved from a frequency table, we assume that, withineach class, the mean of the data values is equal to the class midpoint. We then multi-ply the class midpoint by the frequency. This product is expected to be close to thesum of the data that lie within the class. We repeat the process for each class andsum the results. This sum approximates the sum of all the data.

Definition Approximate Mean of a Variable from a Frequency Distribution

Population Mean

(1a)

Sample Mean

(1b)

where is the midpoint or value of the ith class

is the frequency of the ith class

n is the number of classes

In Formula (1), approximates the sum of all the data values in the firstclass, approximates the sum of all the data values in the second class, and so on.Notice that the formulas for the population mean and sample mean are essentiallyidentical, just as they were for computing the mean from raw data.

x2f2

x1f1

fi

xi

x =axifi

afi=

x1f1 + x2f2 +Á

+ xnfn

f1 + f2 +Á

+ fn

m =

axifi

afi=

x1f1 + x2f2 +Á

+ xNfN f1 + f2 +

Á+ fN

Note to InstructorThis section can be skipped without lossof continuity. If covered, the formulasused in this section can be overwhelming.Therefore, it is recommended that “byhand” computations be de-emphasized.

• Organizing discrete data in tables (Section 2.2, pp. 82–83)

• Organizing continuous data in tables (Section 2.2, pp. 84–87)

• Class midpoint (Section 2.3, p. 102)

Note to InstructorIf you like, you can print out and distributethe Preparing for this Section quiz locatedin the Instructor’s Resource Center.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 160

Page 34: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.3 Measures of Central Tendency and Dispersion from Grouped Data 161

Table 13

Class (3-yearrrate of return) Frequency

10–11.99 6

12–13.99 14

14–15.99 10

16–17.99 4

18–19.99 3

20–21.99 0

22–23.99 3

Table 14

Class (3-year rate of return) Class Midpoint, Frequency,

10–11.99 6

12–13.99 13 14

14–15.99 15 10 150

16–17.99 17 4 68

18–19.99 19 3 57

20–21.99 21 0 0

22–23.99 23 3 69

ax ifi = 592afi = 40

(13)(14) = 182

(11)(6) = 66 10 + 12

2= 11

xi fi fi xi

Step 5: Substituting into Formula (1b), we obtain

The approximate mean 3-year rate of return is 14.8%.

The mean 3-year rate of return from the raw data listed in Example 3 onpage 84 from Section 2.2 is 14.8%. The approximate mean from grouped data isequal to the actual mean!

Now compute the mean of the frequency distribution in Problem 3

x =axifi

afi =

592 40

= 14.8

We computed the mean fromgrouped data in Example 1 eventhough the raw data is available. Thereason for doing this was to illustratehow close the two values can be. Inpractice, use raw data wheneverpossible.

Approach: We perform the following steps to approximate the mean.Step 1: Determine the class midpoint of each class. The class midpoint is found byadding consecutive lower class limits and dividing the result by 2.Step 2: Compute the sum of the frequencies,Step 3: Multiply the class midpoint by the frequency to obtain for each class.Step 4: Compute Step 5: Substitute into Formula (1b) to obtain the mean from grouped data.

Solution

Step 1: The lower class limit of the first class is 10. The lower class limit of the sec-

ond class is 12. Therefore, the class midpoint of the first class is so

The remaining class midpoints are listed in column 2 of Table 14.

Step 2: We add the frequencies in column 3 to obtain

Step 3: Compute the values of by multiplying each class midpoint by thecorresponding frequency and obtain the results shown in column 4 of Table 14.

Step 4: We add the values in column 4 of Table 14 to obtain ©xifi = 592.

xifi

6 + 14 +Á

+ 3 = 40.©fi =

x1 = 11.

10 + 12 2

= 11,

©xifi.xifi

©fi.

2 Compute the Weighted MeanSometimes, certain data values have a higher importance or weight associated withthem. In this case, we compute the weighted mean. For example, your grade-pointaverage is a weighted mean, with the weights equal to the number of credit hours ineach course.The value of the variable is equal to the grade converted to a point value.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 161

Page 35: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

162 Chapter 3 Numerically Summarizing Data

Definition Approximate Variance of a Variable from a Frequency Distribution

Population Variance Sample Variance

(3)

where is the midpoint or value of the ith class

is the frequency of the ith class

An algebraically equivalent formula for the population variance is

axi2fi -

Aaxifi B2 afi

afi

fi

xi

s2=a 1xi - m22fi

afi s2

=a 1xi - x22fi

Aafi B - 1

Note to InstructorIdea for a classroom discussion: Supposea company owns three cars that get20 miles per gallon, two cars that get22 miles per gallon, and one car that gets24 miles per gallon. Would the mean milesper gallon be 22? Discuss.

Definition The weighted mean, of a variable is found by multiplying each value of thevariable by its corresponding weight, summing these products, and dividing theresult by the sum of the weights. It can be expressed using the formula

(2)

where is the weight of the ith observation

is the value of the ith observationxi

wi

xw =awixi

awi =

w1x1 + w2x2 +Á

+ wnxn w1 + w2 +

Á+ wn

xw,

EXAMPLE 2 Computing the Weighted Mean

Problem: Marissa just completed her first semester in college. She earned an A inher 4-hour statistics course, a B in her 3-hour sociology course, an A in her 3-hourpsychology course, a C in her 5-hour computer programming course, and an A in her1-hour drama course. Determine Marissa’s grade-point average.

Approach: We must assign point values to each grade. Let an A equal 4 points, a Bequal 3 points, and a C equal 2 points. The number of credit hours for each coursedetermines its weight. So a 5-hour course gets a weight of 5, a 4-hour course gets aweight of 4, and so on.We multiply the weight of each course by the points earned inthe course, sum these products, and divide the sum by the number of credit hours.

Solution

Marissa’s grade-point average for her first semester is 3.19.

GPA = xw =awixi

awi=

4142 + 3132 + 3142 + 5122 + 11424 + 3 + 3 + 5 + 1

=

5116

= 3.19

Now Work Problem 11

3 Approximate the Variance and StandardDeviation of a Variable from Grouped DataThe procedure for approximating the variance and standard deviation fromgrouped data is similar to that of finding the mean from grouped data. Again,because we do not have access to the original data, the variance is approximate.

We approximate the standard deviation by taking the square root of thevariance.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 162

Page 36: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.3 Measures of Central Tendency and Dispersion from Grouped Data 163

EXAMPLE 3 Approximating the Variance and Standard Deviation froma Frequency Distribution

Problem: The data in Table 13 on page 161 represent the 3-year rate of return of arandom sample of 40 small-capitalization growth mutual funds. Approximate thevariance and standard deviation of the 3-year rate of return.

Approach: We will use the sample variance Formula (3).Step 1: Create a table with the class in the first column, the class midpoint in thesecond column, the frequency in the third column, and the unrounded mean in thefourth column.Step 2: Compute the deviation about the mean, for each class, where is theclass midpoint of the ith class and is the sample mean. Enter the results in column 5.Step 3: Square the deviation about the mean and multiply this result by the fre-quency to obtain Enter the results in column 6.Step 4: Add the entries in columns 3 and 6 to obtain and Step 5: Substitute the values obtained in Step 4 into Formula (3) to obtain anapproximate value for the sample variance.

SolutionStep 1: We create Table 15. Column 1 contains the classes. Column 2 contains theclass midpoint of each class. Column 3 contains the frequency of each class. Column4 contains the unrounded sample mean obtained in Example 1.

Step 2: Column 5 of Table 15 contains the deviation about the mean, foreach class.

Step 3: Column 6 contains the values of the squared deviation about the mean mul-tiplied by the frequency,

Step 4: We add the entries in columns 3 and 6 and obtain and©1xi - x 22 fi = 406.4.

©fi = 40

1xi - x 22 fi .

xi - x,

©1xi - x 22 fi .©fi

1xi - x 22 fi .

x xixi - x,

Table 15

Class (3-year Class Midpoint, Frequency,rate of return)

10–11.99 11 6 14.8 86.64

12–13.99 13 14 14.8 45.36

14–15.99 15 10 14.8 0.4

16–17.99 17 4 14.8 19.36

18–19.99 19 3 14.8 4.2 52.92

20–21.99 21 0 14.8 6.2 0

22–23.99 23 3 14.8 8.2 201.72

a 1xi - x 22 fi = 406.4afi = 40

2.2

0.2

-1.8

-3.8

1xi � x22 fixi � xxfixi

Step 5: Substitute these values into Formula (3) to obtain an approximate value forthe sample variance.

Take the square root of the unrounded estimate of the sample variance to obtain anapproximation of the sample standard deviation.

We approximate the sample standard deviation 3-year rate of return to be 3.23%.

s = 3s2= A

406.439

L 3.23

s2=a 1xi - x22 fi

Aafi B - 1=

406.4 39

L 10.42

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 163

Page 37: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

164 Chapter 3 Numerically Summarizing Data

Approximatesample standarddeviation

Approximatemean

Figure 17

3.3 ASSESS YOUR UNDERSTANDING

Concepts and Vocabulary1. Explain the role of the class midpoint in the formulas to

approximate the mean and the standard deviation.

2. In Section 3.1, the mean is given by Explain how

this is a special case of the weighted mean,

Applying the Concepts3. Birth Weight The following frequency distribution repre-

sents the birth weight of all babies born in the United Statesin 2004.Approximate the mean and standard deviation birthweight.M � 3289.5 g; S � 657.2 g

xw.

x =

©xi

n.

5. Household Winter Temperature Often, frequency distribu-tions are reported using unequal class widths because the fre-quencies of some groups would otherwise be small or very

4. Square Footage of Housing The following frequency distri-bution represents the square footage of a random sampleof 500 houses that are owner occupied year round. Approxi-mate the mean and standard deviation square footage.

sq ft; s � 907.0 sq ftx � 2419

Weight Number of Babies (grams) (thousands)

0–999 30

1,000–1,999 97

2,000–2,999 935

3,000–3,999 2,698

4,000–4,999 344

5,000–5,999 5

Source: National Vital Statistics Report,Volume 55, No. 1

NW

Square Footage Frequency

0–499 5

500–999 17

1,000–1,499 36

1,500–1,999 121

2,000–2,499 119

2,500–2,999 81

3,000–3,499 47

3,500–3,999 45

4,000–4,499 22

4,500–4,999 7

Source: Based on data from the U.S.Census Bureau

From the output, we can see that the approximate mean is 14.8% and the approxi-mate standard deviation is 3.23%. The results agree with our by-hand solutions.

From the raw data listed in Example 3 in Section 2.2, we find that the samplestandard deviation is 3.41%. The approximate sample standard deviation fromgrouped data is close to, but a little lower than, the sample standard deviation fromthe raw data!

Now compute the standard deviation from the frequency distribution in Problem 3

EXAMPLE 4 Approximating the Mean and Standard Deviation of Grouped Data Using Technology

Problem: Approximate the mean and standard deviation of the 3-year rate ofreturn data in Table 13 using a TI-83/84 Plus graphing calculator.

Approach: The steps for approximating the mean and standard deviation ofgrouped data using the TI-83�84 Plus graphing calculator are given in the TechnologyStep-by-Step on page 166.

Solution: Figure 17 shows the result from the TI-84 Plus.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 164

Page 38: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.3 Measures of Central Tendency and Dispersion from Grouped Data 165

large. Consider the following data, which represent the day-time household temperature the thermostat is set to whensomeone is home for a random sample of 750 households.Determine the class midpoint, if necessary, for each class andapproximate the mean and standard deviation temperature.x � 70 .6°F; s � 3 .5°F

Temperature (°F) Frequency

61–64 31

65–67 67

68–69 198

70 195

71–72 120

73–76 89

77–80 50

Source: Based on data from the U.S.Department of Energy

6. Living in Poverty (See Problem 5) The following frequencydistribution represents the age of people living in poverty in2006 (in thousands). In this frequency distribution, the classwidths are not the same for each class. Approximate themean and standard deviation age of a person living in pover-ty. For the open-ended class 65 and older use 70 as the classmidpoint. M � 30.5 years; S � 20.8 years

AgeNumber of Multiple Births

15–19 84

20–24 431

25–29 1,753

30–34 2,752

35–39 1,785

40–44 378

45–49 80

50–54 9

Source: NationalVital Statistics Reports,Vol.55,No.10,September 29,2006

Age Frequency

0–17 12,827

18–24 5,047

25–34 4,920

35–44 4,049

45–54 3,399

55–59 1,468

60–64 1,357

65 and older 3,394

Source: U.S. Census Bureau

7. Multiple Births The following data represent the number oflive multiple-delivery births (three or more babies) in 2004for women 15 to 54 years old.

(a) Approximate the mean and standard deviation for age.(b) Draw a frequency histogram of the data to verify that

the distribution is bell shaped.(c) According to the Empirical Rule, 95% of mothers of

multiple births will be between what two ages? and 43.3 years21.7 years

SAT Math Score Number

200–249 13,159

250–299 23,083

300–349 59,672

350–399 124,616

400–449 202,883

450–499 243,569

500–549 247,435

550–599 213,880

600–649 156,057

650–699 120,933

700–749 54,108

750–800 35,136

Source: The College Board

(a) Approximate the mean and standard deviation of thescore.

(b) Draw a frequency histogram of the data to verify thatthe distribution is bell shaped.

(c) According to the Empirical Rule, 95% of these studentswill have SAT Mathematics scores between what twovalues? 290.1 and 748.1

9. Serum HDL Use the frequency distribution whose classwidth is 10 obtained in Problem 35 in Section 2.2 to approxi-mate the mean and standard deviation for serum HDL.Compare these results to the actual mean and standarddeviation.

10. Dividend Yield Use the frequency distribution whose classwidth is 0.4 obtained in Problem 36 in Section 2.2 to approx-imate the mean and standard deviation of the dividend yield.Compare these results to the actual mean and standarddeviation. ;

11. Grade-Point Average Marissa has just completed her sec-ond semester in college. She earned a B in her 5-hour calcu-lus course, an A in her 3-hour social work course, an A in her4-hour biology course, and a C in her 3-hour American liter-ature course. Assuming that an A equals 4 points, a B equals3 points, and a C equals 2 points, determine Marissa’s grade-point average for the semester. 3.27

12. Computing Class Average In Marissa’s calculus course, at-tendance counts for 5% of the grade,quizzes count for 10% ofthe grade, exams count for 60% of the grade, and the finalexam counts for 25% of the grade. Marissa had a 100% aver-age for attendance, 93% for quizzes, 86% for exams, and 85%on the final. Determine Marissa’s course average. 87.15%

13. Mixed Chocolates Michael and Kevin want to buy chocolates.They can’t agree on whether they want chocolate-coveredalmonds, chocolate-covered peanuts, or chocolate-coveredraisins. They agree to create a mix. They bought 4 pounds ofchocolate-covered almonds at $3.50 per pound, 3 poundsof chocolate-covered peanuts for $2.75 per pound, and 2pounds of chocolate-covered raisins for $2.25 per pound.Determine the cost per pound of the mix. $2.97

14. Nut Mix Michael and Kevin return to the candy store,but thistime they want to purchase nuts. They can’t decide amongpeanuts, cashews, or almonds. They again agree to create amix. They bought 2.5 pounds of peanuts for $1.30 per pound,

s L 0.865%x L 1 .214 %

x � 51.8; s � 12.1

M � 519.1; S « 114.5

8. SAT Scores The following data represent SAT Mathematicsscores for 2006.

(a) s « 5.4 yearsM � 32.5 years;

NW

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 165

Page 39: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

166 Chapter 3 Numerically Summarizing Data

4 pounds of cashews for $4.50 per pound, and 2 pounds ofalmonds for $3.75 per pound. Determine the price per poundof the mix. $3.38

15. Population The following data represent the male and fe-male population, by age, of the United States in July 2006.

(a) Approximate the population mean and standard devia-tion of age for mothers in 1980.

(b) Approximate the population mean and standard devia-tion of age for mothers in 2004.

(c) Which year has the higher mean age? 2004(d) Which year has more dispersion in age? 2004

Problems 17–20 use the following steps to approximate the medianfrom grouped data.

17. Approximate the median of the frequency distribution inProblem 3. 3367.9 g

18. Approximate the median of the frequency distributionin Problem 4. 2298.3 ft2

19. Approximate the median of the frequency distribution inProblem 7. 32.5 years

20. Approximate the median of the frequency distribution inProblem 8. 516.2

Problems 21 and 22, use the following definition of the modal class.The modal class of a variable can be obtained from data in a fre-quency distribution by determining the class that has the largestfrequency.

21. Determine the modal class of the frequency distribution inProblem 3. 3000–3999 g

22. Determine the modal class of the frequency distribution inProblem 4. 1500–1999 ft2

Age

Male Female Resident Pop Resident Pop (in thousands) (in thousands)

0–9 20,518 19,607

10–19 22,496 20,453

20–29 21,494 20,326

30–39 20,629 20,261

40–49 22,461 22,815

50–59 18,872 19,831

60–69 11,216 12,519

70–79 6,950 8,970

80–89 3,327 5,678

90–99 524 1,356

100–109 14 59

Source: U.S. Census Bureau

(a) Approximate the population mean and standard devia-tion of age for males.

(b) Approximate the population mean and standard devia-tion of age for females.

(c) Which gender has the higher mean age? Female(d) Which gender has more dispersion in age? Female16. Age of Mother The following data represent the age of

the mother at childbirth for 1980 and 2004.

M « 38.5 years; S « 23.1 years

M « 35.9 years; s « 21.8 years

Age1980 Births 2004 Births (thousands) (thousands)

10–14 9.8 6.8

15–19 551.9 415.3

20–24 1226.4 1034.5

25–29 1108.2 1104.5

30–34 549.9 965.7

35–39 140.7 475.6

40–44 23.2 103.7

45–49 1.1 5.7

Source: National Vital Statistics Reports, Vol. 55, No. 10

TI-83/84 Plus1. Enter the class midpoint in L1 and thefrequency or relative frequency in L2 bypressing STAT and selecting 1:Edit.2. Press STAT, highlight the CALC menu, andselect 1:1-Var Stats3. With 1-Var Stats appearing on the HOMEscreen, press 2nd then 1 to insert L1 on the

TECHNOLOGY STEP-BY-STEP Determining the Mean and Standard Deviationfrom Grouped Data

HOME screen. Then press the comma and press2nd 2 to insert L2 on the HOME screen. So theHOME screen should have the following:

1-Var Stats L1, L2

Press ENTER to obtain the mean andstandard deviation.

Approximating the Median from Grouped Data

Step 1: Construct a cumulative frequency distribution.Step 2: Identify the class in which the median lies.Remember, the median can be obtained by determiningthe observation that lies in the middle.Step 3: Interpolate the median using the formula

where L is the lower class limit of the class containingthe mediann is the number of data values in the frequencydistributionCF is the cumulative frequency of the class imme-diately preceding the class containing the medianf is the frequency of the median classi is the class width of the class containing the median

Median = M = L +

n 2

- CF

f 1i2

16. (a) 16. (b) M « 27.9 years; S « 6.3 yearsM « 25.5 years; s « 5.4 years

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 166

Page 40: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.4 Measures of Position and Outliers 167

3.4 MEASURES OF POSITION AND OUTLIERS

Objectives 1 Determine and interpret z-scores2 Interpret percentiles3 Determine and interpret quartiles4 Determine and interpret the interquartile range5 Check a set of data for outliers

In Section 3.1, we determined measures of central tendency. Measures of centraltendency are meant to describe the “typical” data value. Section 3.2 discussed mea-sures of dispersion, which describe the amount of spread in a set of data. In thissection, we discuss measures of position; that is, we wish to describe the relativeposition of a certain data value within the entire set of data.

1 Determine and Interpret z-ScoresAt the end of the 2007 season, the New York Yankees led the American League with968 runs scored, while the Philadelphia Phillies led the National League with892 runs scored.A quick comparison might lead one to believe that the Yankees arethe better run-producing team. However, this comparison is unfair because the twoteams play in different leagues. The Yankees play in the American League, wherethe designated hitter bats for the pitcher, whereas the Phillies play in the NationalLeague, where the pitcher must bat (pitchers are typically poor hitters). To comparethe two teams’ scoring of runs we need to determine their relative standings in theirrespective leagues. This can be accomplished using a z-score.

Definition The z-score represents the distance that a data value is from the mean in termsof the number of standard deviations. It is obtained by subtracting the meanfrom the data value and dividing this result by the standard deviation. There isboth a population z-score and a sample z-score; their formulas follow:

(1)

The z-score is unitless. It has mean 0 and standard deviation 1.

If a data value is larger than the mean, the z-score will be positive. If a datavalue is smaller than the mean, the z-score will be negative. If the data value equalsthe mean, the z-score will be zero. z-Scores measure the number of standard devia-tions an observation is above or below the mean. For example, a z-score of 1.24 is in-terpreted as “the data value is 1.24 standard deviations above the mean.” A z-scoreof is interpreted as “the data value is 2.31 standard deviations below themean.”

We are now prepared to determine whether the Yankees or Phillies had a betteryear in run production.

EXAMPLE 1 Comparing z-Scores

Problem: Determine whether the New York Yankees or the Philadelphia Phillieshad a relatively better run producing season. The Yankees scored 968 runs and playin the American League, where the mean number of runs scored was andthe standard deviation was The Phillies scored 892 runs and play inthe National League, where the mean number of runs scored was and thestandard deviation was s = 58.9 runs.

m = 763.0s = 73.5 runs.

m = 793.9

-2.31

Population z-Score Sample z-Score

z =

x - m

s z =

x - x s

In Other Wordsz-Scores provide a way to compareapples to oranges by converting variableswith different centers and/or spreads tovariables with the same center (0) andspread (1).

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 167

Page 41: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Approach: To determine which team had the relatively better run-producing sea-son with respect to each league, we compute each team’s z-score. The team with thehigher z-score had the better season. Because we know the values of the populationparameters, we will compute the population z-score.

Solution: First, we compute the z-score for the Yankees. z-Scores are typicallyrounded to two decimal places.

Next we compute the z-score for the Phillies.

So the Yankees had run production 2.37 standard deviations above the mean, whilethe Phillies had run production 2.19 standard deviations above the mean.Therefore,the Yankees had a relatively better year at scoring runs, with respect to the AmericanLeague, than did the Phillies, with respect to the National League.

z-score =

x - m

s =

892 - 763.0 58.9

= 2.19

z-score =

x - m

s =

968 - 793.9 73.5

= 2.37

168 Chapter 3 Numerically Summarizing Data

SmallestData Value

LargestData Value

Bottom1%

Bottom2%

. . .P1 P2 P98 P99

Top1%

Top2%

Figure 18

Percentiles are used to give the relative standing of an observation. Many stan-dardized exams, such as the SAT college entrance exam, use percentiles to providestudents with an understanding of how they scored on the exam in relation to allother students who took the exam.

EXAMPLE 2 Interpret a Percentile

Problem: Jennifer just received the results of her SAT exam. Her SAT Compositeof 1,710 is at the 73rd percentile. What does this mean?

Approach: In general, the kth percentile of an observation means that k percentof the observations are less than or equal to the observation.

Interpretation: A percentile rank of 73% means that 73% of SAT Compositescores are less than or equal to 1,710 and 27% of the scores are greater than 1,710.So about 27% of the students who took the exam scored better than Jennifer.

Now Work Problem 9

2 Interpret PercentilesRecall that the median divides the lower 50% of a set of data from the upper 50%.The median is a special case of a general concept called the percentile.

Definition The kth percentile, denoted , of a set of data is a value such that k percent ofthe observations are less than or equal to the value.

So percentiles divide a set of data that is written in ascending order into100 parts; thus 99 percentiles can be determined. For example, divides the bottom1% of the observations from the top 99%, divides the bottom 2% of the obser-vations from the top 98%, and so on. Figure 18 displays the 99 possible percentiles.

P2

P1

Pk

Note to InstructorIt really does not make sense to talkabout percentiles (other than quartiles)for small data sets. After all, how can onedivide a data set with 40 observationsinto 100 parts? When data sets arelarge, it makes more sense to determinepercentiles using a model, such as thenormal model. For that reason, we willpostpone determining percentiles until wecan use the normal model to find theirvalues.

Now Work Problem 17

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 168

Page 42: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.4 Measures of Position and Outliers 169

$180 $189 $370 $618 $735 $802 $1,185 $1,414 $1,657$1,953 $2,332 $2,336 $3,461 $4,668 $6,751 $9,908 $10,034 $21,147

3 Determine and Interpret QuartilesThe most common percentiles are quartiles. Quartiles divide data sets into fourths,or four equal parts. The first quartile, denoted divides the bottom 25% of thedata from the top 75%. Therefore, the first quartile is equivalent to the 25th per-centile. The second quartile divides the bottom 50% of the data from the top 50%,so the second quartile is equivalent to the 50th percentile, which is equivalent to themedian. Finally, the third quartile divides the bottom 75% of the data from the top25%, so that the third quartile is equivalent to the 75th percentile. Figure 19 illus-trates the concept of quartiles.

Q1 ,

SmallestData Value

Median LargestData Value

25% ofthe data

25% ofthe data

25% ofthe data

25% ofthe data

Q1 Q2 Q3

Figure 19

EXAMPLE 3 Finding and Interpreting Quartiles

Problem: The Highway Loss Data Institute routinely collects data on collisioncoverage claims. Collision coverage insures against physical damage to an insuredindividual’s vehicle. The data in Table 16 represent a random sample of 18 collisioncoverage claims based on data obtained from the Highway Loss Data Institute for2004 models. Find and interpret the first, second, and third quartiles for collisioncoverage claims.

Finding QuartilesStep 1: Arrange the data in ascending order.Step 2: Determine the median, M, or second quartile, .Step 3: Determine the first and third quartiles, and , by dividing the dataset into two halves; the bottom half will be the observations below (to the leftof) the location of the median and the top half will be the observations above(to the right of) the location of the median. The first quartile is the median ofthe bottom half and the third quartile is the median of the top half.

Q3Q1

Q2

In Other WordsTo find , determine the median ofthe data set. To find , determine themedian of the “lower half” of the dataset. To find , determine the medianof the “upper half” of the data set.

Q3

Q1

Q2

Using TechnologyThe technique presented here agreeswith the method used by TIgraphing calculators. It will agreewith MINITAB output when thenumber of observations is odd.When the number of observations iseven, MINITAB gives differentresults.

Note to InstructorAsk your students whether quartiles areresistant.

In Other WordsThe first quartile, Q1, is equivalent to the25th percentile, P25. The 2nd quartile, Q2,is equivalent to the 50th percentile, P50,which is equivalent to the median,M. Finally, the third quartile, Q3, isequivalent to the 75th percentile, P75.

Table 16

$6,751 $9,908 $3,461 $2,336 $21,147 $2,332

$189 $1,185 $370 $1,414 $4,668 $1,953

$10,034 $735 $802 $618 $180 $1,657

Approach: We follow the steps given above.

SolutionStep 1: The data written in ascending order are given as follows:

Step 2: There are observations, so the median, or second quartile, , is the

mean of the 9th and 10th observations. Therefore,

.$1,805

M = Q2 =

$1,657 + $1,9532

=

Q2n = 18

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 169

Page 43: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

170 Chapter 3 Numerically Summarizing Data

Step 3: The median of the bottom half of the data is the first quartile, .The bottomhalf of the data is the observations less than the median, or

$180 $189 $370 $618 $735 $802 $1185 $1414 $1657qQ1

The median of these data is the 5th observation. Therefore, .The median of the top half of the data is the third quartile, . The top half of

the data is the observations greater than the median, or

$1,953 $2,332 $2,336 $3,461 $4,668 $6,751 $9,908 $10,034 $21,147qQ3

The median of these data is the 5th observation. Therefore,

Interpretation: We interpret the quartiles as percentiles. For example, 25% of thecollision claims are less than or equal to the first quartile, $735, and 75% of the col-lision claims are greater than $735. Also, 50% of the collision claims are less than orequal to $1,805, the second quartile, and 50% of the collision claims are greater than$1,805. Finally, 75% of the collision claims are less than or equal to $4,668, the thirdquartile, and 25% of the collision claims are greater than $4,668.

EXAMPLE 4 Finding Quartiles Using Technology

Problem: Find the quartiles of the collision coverage claims data in Table 16.

Approach: We will use both a TI-84 Plus graphing calculator and MINITAB toobtain the quartiles. The steps for obtaining quartiles using a TI-83/84 Plus graphingcalculator,MINITAB,or Excel are given in theTechnology Step-by-Step on page 175.

Solution: Figure 20(a) shows the results obtained from a TI-84 Plus graphing cal-culator. Figure 20(b) shows the results obtained from MINITAB. The results pre-sented by the TI-84 agree with our “by hand” solution. Notice that the first quartile,706, and the third quartile, 5,189, reported by MINITAB disagree with our “byhand” and TI-84 result. This difference is due to the fact that the TI-84 Plus andMINITAB use different algorithms for obtaining quartiles.

Q3 = $4,668.

Q3

Q1 = $735

Q1

(a)

Descriptive statistics: ClaimVariableClaim

N18

N*0

Mean3874

SE Mean1250

StDev5302

Minimum180

Maximum2114

Median1805

Q1706

Q35189

(b)

Figure 20

Note to InstructorAsk students to conjecture the methodthat MINITAB uses in finding the quartiles.

Using TechnologyStatistical packages may usedifferent formulas for obtainingthe quartiles, so results may differslightly.

NowWorkProblem21(b)

4 Determine and Interpret the Interquartile RangeIn Section 3.2 we introduced two measures of dispersion, the range and the standarddeviation, neither of which are resistant to extreme values. Quartiles, on the otherhand, are resistant to extreme values. For this reason, it would be preferable to finda measure of dispersion that is based on quartiles.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 170

Page 44: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.4 Measures of Position and Outliers 171

Definition The interquartile range, denoted IQR, is the range of the middle 50% of theobservations in a data set. That is, the IQR is the difference between the firstand third quartiles and is found using the formula

The interpretation of the interquartile range is similar to that of the range andstandard deviation. That is, the more spread a set of data has, the higher the in-terquartile range will be.

IQR = Q3 - Q1

EXAMPLE 5 Determining and Interpreting the Interquartile Range

Problem: Determine and interpret the interquartile range of the collision claimdata from Example 3.

Approach: We will use the quartiles found by hand in Example 3. The interquar-tile range, IQR, is found by computing the difference between the first and thirdquartiles. It represents the range of the middle 50% of the observations.

Solution: The interquartile range is

Interpretation: The range of the middle 50% of the observations in the collisionclaim data is $3,933.

= $3,933

= $4,668 - $735

IQR = Q3 - Q1

For the remainder of this text, the direction describe the distribution will meanto describe its shape (skewed left, skewed right, symmetric), its center (mean ormedian), and its spread (standard deviation or interquartile range).

Summary: Which Measures to Report

Shape of Distribution

Measure of Central Tendency

Measure of Dispersion

Symmetric Mean Standard deviation

Skewed left or skewed right Median Interquartile range

5 Check a Set of Data for OutliersWhenever performing any type of data analysis, we should always check for extremeobservations in the data set. Extreme observations are referred to as outliers. When-ever outliers are encountered, their origin must be investigated. They can occur by

Now Work Problem 21(c)

Let’s compare the measures of central tendency and dispersion discussed thusfar for the collision claim data.The mean collision claim is $3874.4 and the median is$1,805. The median is more representative of the “center” because the data areskewed to the right (only 5 of the 18 observations are greater than the mean). Therange is $21,147 � $180 � $20,967.The standard deviation is $5,301.6 and the inter-quartile range is $3,933.The values of the range and standard deviation are affectedby the extreme claim of $21,147. In fact, if this claim had been $120,000 (let’s say theclaim was for a totaled Mercedes S-class AMG), then the range and standard devia-tion would increase to $119,820 and $27,782.5, respectively. The interquartile rangeis not affected.Therefore, when the distribution of data is highly skewed or containsextreme observations, it is best to use the interquartile range as the measure of dis-persion because it is resistant.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 171

Page 45: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

172 Chapter 3 Numerically Summarizing Data

Outliers distort both the meanand the standard deviation, becauseneither is resistant. Because thesemeasures often form the basis formost statistical inference, anyconclusions drawn from a set of datathat contains outliers can be flawed.

Checking for Outliers by Using Quartiles

Step 1: Determine the first and third quartiles of the data.Step 2: Compute the interquartile range.Step 3: Determine the fences. Fences serve as cutoff points for determiningoutliers.

Step 4: If a data value is less than the lower fence or greater than the upperfence, it is considered an outlier.

Upper fence = Q3 + 1.51IQR2 Lower fence = Q1 - 1.51IQR2

Note to InstructorA second method for identifying outliersuses z-scores. If the z-score is less than

or greater than 2, the data value canbe considered to be an outlier. However,this method should be used only if thedistribution is symmetric, because it isbased on the Empirical Rule.

-2

chance, because of error in the measurement of a variable, during data entry, or fromerrors in sampling. For example, in the 2000 presidential election, a precinct in NewMexico accidentally recorded 610 absentee ballots for Al Gore as 110.Workers in theGore camp discovered the data-entry error through an analysis of vote totals.

Sometimes extreme observations are common within a population. For exam-ple, suppose we wanted to estimate the mean price of a European car. We mighttake a random sample of size 5 from the population of all European automobiles. Ifour sample included a Ferrari F430 Spider (approximately $175,000), it probablywould be an outlier, because this car costs much more than the typical Europeanautomobile. The value of this car would be considered unusual because it is not atypical value from the data set.

We can use the following steps to check for outliers using quartiles.

EXAMPLE 6 Checking for Outliers

Problem: Check the data that represent the collision coverage claims for outliers.

Approach: We follow the preceding steps. Any data value that is less than thelower fence or greater than the upper fence will be considered an outlier.

SolutionStep 1: We will use the quartiles found by hand in Example 3. So and

Step 2: The interquartile range, IQR, is

Step 3: The lower fence, LF, is

The upper fence, UF, is

Step 4: There are no outliers below the lower fence. However, there is an outlierabove the upper fence, corresponding to the claim of $21,147.

= $10,567.5

= $4,668 + 1.5($3,933)

UF = Q3 + 1.51IQR2

= -$5,164.5

= $735 - 1.5($3,933)

LF = Q1 - 1.51IQR2

= $3,933

= $4,668 - $735

IQR = Q3 - Q1

Q3 = $4,668.Q1 = $735

NowWorkProblem21(d)

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 172

Page 46: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.4 Measures of Position and Outliers 173

13. ERA Champions In 2007, Jake Peavy of the San DiegoPadres had the lowest earned-run average (ERA is themean number of runs yielded per nine innings pitched) ofany starting pitcher in the National League, with an ERAof 2.54. Also in 2007, John Lackey of the Los Angeles An-gels had the lowest ERA of any starting pitcher in theAmerican League with an ERA of 3.01. In the NationalLeague, the mean ERA in 2007 was 4.119 and the standarddeviation was 0.697. In the American League, the meanERA in 2007 was 4.017 and the standard deviation was0.678. Which player had the better year relative to his peers,Peavy or Lackey? Why? Peavy

14. Batting Champions The highest batting average ever recordedin Major League Baseball was by Ted Williams in 1941 whenhe hit 0.406. That year, the mean and standard deviation forbatting average were 0.2806 and 0.0328. In 2007, MagglioOrdonez was the American League batting champion, with abatting average of 0.363. In 2007, the mean and standard devi-ation for batting average were 0.2848 and 0.0273.Who had thebetter year relative to his peers,Williams or Ordonez? Why?

15. School Admissions A highly selective boarding school willonly admit students who place at least 1.5 standard devia-tions above the mean on a standardized test that has amean of 200 and a standard deviation of 26. What is theminimum score that an applicant must make on the test tobe accepted? 239

16. Quality Control A manufacturer of bolts has a quality-control policy that requires it to destroy any bolts that aremore than 2 standard deviations from the mean. The quality-control engineer knows that the bolts coming off the assemblyline have a mean length of 8 cm with a standard deviation of0.05 cm. For what lengths will a bolt be destroyed?

17. You Explain It! Percentiles Explain the meaning of thefollowing percentiles.

3.4 ASSESS YOUR UNDERSTANDING

Concepts and Vocabulary1. Write a paragraph that explains the meaning of percentiles.

2. Suppose you received the highest score on an exam. Yourfriend scored the second-highest score, yet you both were inthe 99th percentile. How can this be?

3. Morningstar is a mutual fund rating agency. It ranks a fund’sperformance by using one to five stars. A one-star mutualfund is in the bottom 20% of its investment class; a five-starmutual fund is in the top 20% of its investment class. Inter-pret the meaning of a four-star mutual fund.

4. When outliers are discovered, should they always be re-moved from the data set before further analysis?

5. Mensa is an organization designed for people of high intelli-gence. One qualifies for Mensa if one’s intelligence is meas-ured at or above the 98th percentile. Explain what this means.

6. Explain the advantage of using z-scores to compare observa-tions from two different data sets.

7. Explain the circumstances for which the interquartile rangeis the preferred measure of dispersion. What is an advantagethat the standard deviation has over the interquartile range?

8. Explain what each quartile represents.

Applying the Concepts9. Birth Weights In 2005, babies born after a gestation period of

32 to 35 weeks had a mean weight of 2,600 grams and a stan-dard deviation of 660 grams. In the same year, babies bornafter a gestation period of 40 weeks had a mean weight of3,500 grams and a standard deviation of 470 grams. Suppose a34-week gestation period baby weighs 2,400 grams and a40-week gestation period baby weighs 3,300 grams. What isthe z-score for the 34-week gestation period baby? What isthe z-score for the 40-week gestation period baby? Whichbaby weighs less relative to the gestation period?

10. Birth Weights In 2005, babies born after a gestation period of32 to 35 weeks had a mean weight of 2,600 grams and a stan-dard deviation of 660 grams. In the same year, babies bornafter a gestation period of 40 weeks had a mean weight of3,500 grams and a standard deviation of 470 grams. Suppose a34-week gestation period baby weighs 3,000 grams and a40-week gestation period baby weighs 3,900 grams. What isthe z-score for the 34-week gestation period baby? What isthe z-score for the 40-week gestation period baby? Whichbaby weighs less relative to the gestation period?

11. Men versus Women The average 20- to 29-year old man is69.6 inches tall, with a standard deviation of 3.0 inches, whilethe average 20- to 29-year old woman is 64.1 inches tall, witha standard deviation of 3.8 inches. Who is relatively taller, a75-inch man or a 70-inch woman? Man

Source: CDC Vital and Health Statistics, Advance Data,Number 361, July 5, 2005

12. Men versus Women The average 20- to 29-year-old man is69.6 inches tall, with a standard deviation of 3.0 inches, whilethe average 20- to 29-year-old woman is 64.1 inches tall, witha standard deviation of 3.8 inches. Who is relatively taller, a67-inch man or a 62-inch woman? WomanSource: CDC Vital and Health Statistics, Advance Data,Number 361, July 5, 2005

Age Percentile

10th 25th 50th 75th 90th

20–29 166.8 171.5 176.7 181.4 186.8

30–39 166.9 171.3 176.0 181.9 186.2

40–49 167.9 172.1 176.9 182.1 186.0

50–59 166.0 170.8 176.0 181.2 185.4

60–69 165.3 170.1 175.1 179.5 183.7

70–79 163.2 167.5 172.9 178.1 181.7

80 or older 161.7 166.1 170.5 175.3 179.4

NW

NW

9. �0.30; �0.43; 40-week gestation 10. 0.61; 0.85; 34-week gestation 14. Williams 16.<7.9 cm or>8.1 cm

Source: Advance Data from Vital and Health Statistics,Number 361, July 7, 2005.(a) The 15th percentile of the head circumference of males 3

to 5 months of age is 41.0 cm.(b) The 90th percentile of the waist circumference of

females 2 years of age is 52.7 cm.(c) Anthropometry involves the measurement of the human

body. One goal of these measurements is to assess howbody measurements may be changing over time. The fol-lowing table represents the standing height of malesaged 20 years or older for various age groups. Based onthe percentile measurements of the different age groups,what might you conclude?

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 173

Page 47: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

174 Chapter 3 Numerically Summarizing Data

18. You Explain It! Percentiles Explain the meaning of thefollowing percentiles.Source: National Center for Health Statistics.(a) The 5th percentile of the weight of males 36 months of

age is 12.0 kg.(b) The 95th percentile of the length of newborn females is

53.8 cm.

19. You Explain It! Quartiles A violent crime includes rape,robbery, assault, and homicide. The following is a summaryof the violent-crime rate (violent crimes per 100,000 popula-tion) for all 50 states in the United States plus Washington,D.C., in 2005.

(a) Provide an interpretation of these results.(b) Determine and interpret the interquartile range. 256.9(c) The violent-crime rate in Washington, D.C., in 2005 was

1,459. Would this be an outlier? Yes(d) Do you believe that the distribution of violent-crime

rates is skewed or symmetric? Why? Skewed right20. You Explain It! Quartiles One variable that is measured by

online homework systems is the amount of time a studentspends on homework for each section of the text. The fol-lowing is a summary of the number of minutes a studentspends for each section of the text for the fall 2007 semesterin a College Algebra class at Joliet Junior College.

(a) Provide an interpretation of these results.(b) Determine and interpret the interquartile range. 30.5(c) Suppose a student spent 2 hours doing homework for a

section. Would this be an outlier? Yes(d) Do you believe that the distribution of time spent doing

homework is skewed or symmetric? Why? Skewed right

21. April Showers The following data represent the number ofinches of rain in Chicago, Illinois, during the month of Aprilfor 20 randomly selected years.

Q1 = 42 Q2 = 51.5 Q3 = 72.5

Q1 = 272.8 Q2 = 387.4 Q3 = 529.7

0.97 2.47 3.94 4.11 5.79

1.14 2.78 3.97 4.77 6.14

1.85 3.41 4.00 5.22 6.28

2.34 3.48 4.02 5.50 7.69

Source: NOAA, Climate Diagnostics Center

(a) Compute the z-score corresponding to the 1971 rainfallof 0.97 inch. Interpret this result.

(b) Determine the quartiles. 2.625 in.; 3.985 in.; 5.36 in.(c) Compute and interpret the interquartile range, IQR.(d) Determine the lower and upper fences.Are there any out-

liers, according to this criterion? 9.463 in.; No

22. Hemoglobin in Cats The following data represent the he-moglobin (in g/dL) for 20 randomly selected cats.

- 1.478 in.;

2.735

- 1.70

5.7 8.9 9.6 10.6 11.7

7.7 9.4 9.9 10.7 12.9

7.8 9.5 10.0 11.0 13.0

8.7 9.6 10.3 11.2 13.4

Source: Joliet Junior College Veterinarian TechnologyProgram

(a) Compute the z-score corresponding to the hemoglobinof Blackie, 7.8 g/dL. Interpret this result.

(b) Determine the quartiles. 9.15 g/dL; 9.95 g/dL; 11.1 g/dL(c) Compute and interpret the interquartile range, IQR. 1.95(d) Determine the lower and upper fences.Are there any out-

liers, according to this criterion?

23. Rate of Return of Google The following data represent themonthly rate of return of Google common stock from itsinception in August 2004 through November 2007.(a) Determine and interpret the quartiles.(b) Check the data set for outliers. 0.47

- 1.21

0.26 0.26 0.04 0.06 0.06

0.47 0.06 �0.16 0.19 0.05

�0.05 �0.02 0.08 0.02 �0.02

0.06 �0.01 0.07 �0.05 0.01

0.01 0.11 �0.11 0.09 0.10

NW

24. Emissions The following data represent the carbondioxide emissions per capita (total carbon dioxide emissions,in tons, divided by total population) for the countries ofWestern Europe in 2004.(a) Determine and interpret the quartiles.(b) Is the observation corresponding to Luxembourg, 6.81,

an outlier? Yes

CO2

25. Fraud Detection As part of its “Customers First” program, acellular phone company monitors monthly phone usage. Thegoal of the program is to identify unusual use and alert thecustomer that their phone may have been used by an un-scrupulous individual. The following data represent themonthly phone use in minutes of a customer enrolled in thisprogram for the past 20 months.

346 345 489 358 471

442 466 505 466 372

442 461 515 549 437

480 490 429 470 516

The phone company decides to use the upper fence as thecutoff point for the number of minutes at which the customershould be contacted. What is the cutoff point? 574 minutes

26. Stolen Credit Card A credit card company decides to enacta fraud-detection service.The goal of the credit card companyis to determine if there is any unusual activity on the creditcard. The company maintains a database of daily charges ona customer’s credit card. Days when the card was inactiveare excluded from the database. If a day’s worth of chargesappears unusual, the customer is contacted to make sure thatthe credit card has not been compromised. Use the followingdaily charges (rounded to the nearest dollar) to determinethe amount the daily charges must exceed before the cus-tomer is contacted. $219

22. (d) 6.23 g/dL; 14.03 g/dL; Yes, 5.7 g/dL 23. (a) �0.02; 0.035; 0.095 24. (a) 1.61; 2.165; 2.68

�0.04

0.18

0.13

�0.10

0.25

�0.04

0.09

�0.08

0.02

�0.02

0.22

0.02

�0.02

0.03

0.03Source: Yahoo!Finance

2.34 1.34 3.86 1.40 2.08 2.39 2.68

2.64 3.44 2.07 1.67 1.61 6.81 2.21

2.09 1.64 2.87 2.38 1.47 1.53

1.44 3.65 2.12 5.22 2.67 1.01

Source: Carbon Dioxide Information Analysis Center

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 174

Page 48: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.5 The Five-Number Summary and Boxplots 175

143 166 113 188 133

90 89 98 95 112

111 79 46 20 112

70 174 68 101 212

27. Student Survey of Income A survey of 50 randomly selectedfull-time Joliet Junior College students was conducted dur-ing the Fall 2007 semester. In the survey, the students wereasked to disclose their weekly income from employment. Ifthe student did not work, $0 was entered.

students were asked to disclose their weekly spending onentertainment. The results of the survey are as follows:

0 262 0 635 0

244 521 476 100 650

12,777 567 310 527 0

83 159 0 547 188

719 0 367 316 0

479 0 82 579 289

375 347 331 281 628

0 203 149 0 403

0 454 67 389 0

671 95 736 300 181

(a) Check the data set for outliers. $12,777(b) Draw a histogram of the data and label the outliers on

the histogram.(c) Provide an explanation for the outliers.

28. Student Survey of Entertainment Spending A survey of 40randomly selected full-time Joliet Junior College studentswas conducted in the Fall 2007 semester. In the survey, the

21 54 64 33 65

22 39 67 54 22

115 7 80 59 20

36 10 12 101 1,000

28 28 75 50 27

(a) Check the data set for outliers. $115 and $1,000(b) Draw a histogram of the data and label the outliers on

the histogram.(c) Provide an explanation for the outliers.

29. Pulse Rate Use the results of Problem 23 in Section 3.1 andProblem 25 in Section 3.2 to compute the z-scores for all thestudents. Compute the mean and standard deviation of thesez-scores.

30. Travel Time Use the results of Problem 24 in Section 3.1and Problem 26 in Section 3.2 to compute the z-scores for allthe students. Compute the mean and standard deviation ofthese z-scores.

31. Fraud Detection Revisited Use the fraud-detection datafrom Problem 25 to do the following.(a) Determine the standard deviation and interquartile

range of the data. 58.0 min; 56.5 min(b) Suppose the month in which the customer used 346 min-

utes was not actually that customer’s phone. That partic-ular month, the customer did not use her phone at all, so0 minutes were used. How does changing the observa-tion from 346 to 0 affect the standard deviation and in-terquartile range? What property does this illustrate?

3.5 THE FIVE-NUMBER SUMMARY AND BOXPLOTS

Objectives 1 Compute the five-number summary2 Draw and interpret boxplots

Let’s pause for a minute and consider where we have come up to this point. InChapter 2, we discussed techniques for graphically representing data. These sum-maries included bar graphs, pie charts, histograms, stem-and-leaf plots, and timeseriesgraphs. In Sections 3.1–3.4, we presented techniques for measuring the center of adistribution, spread in a distribution, and relative position of observations in a distri-bution of data. Why do we want these summaries? What purpose do they serve?

32

51

33

26

35

21

26

13

38

9

16

14

36

8

48

TI-83/84 PlusFollow the same steps given to compute the meanand median from raw data. (Section 3.1)

MINITABFollow the same steps given to compute the meanand median from raw data. (Section 3.1)

TECHNOLOGY STEP-BY-STEP Determining Quartiles

Excel1. Enter the raw data into column A.2. With the data analysis Tool Pak enabled,select the Tools menu and highlight Data Analysis3. Select Rank and Percentile from the Data Analysis window.4. With the cursor in the Input Range cell,highlight the data. Press OK.

. Á

S = 115.0 min; IQR = 56.5 min

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 175

Page 49: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Table 17

19.95 23.25 23.32 25.55 25.83 26.28 42.47

28.58 28.72 30.18 30.35 30.95 32.13 49.17

33.23 33.53 36.68 37.05 37.43 41.42 54.63

Source: Laura Gillogly, student at Joliet Junior College

EXAMPLE 1 Obtaining the Five-Number Summary

Problem: The data shown in Table 17 show the finishing times (in minutes) of themen in the 60- to 64-year-old age group in a 5-kilometer race. Determine the five-number summary of the data.

Approach: The five-number summary requires that we determine the minimumdata value, , M (the median), , and the maximum data value. We need toarrange the data in ascending order and then use the procedures introduced inSection 3.4 to obtain M, and Q3 .Q1 ,

Q3Q1

176 Chapter 3 Numerically Summarizing Data

Historical Note

John Tukey was born onJuly 16, 1915, in New Bedford,Massachusetts. His parents graduatednumbers 1 and 2 from Bates Collegeand were elected “the couple mostlikely to give birth to a genius.” In1936, Tukey graduated from BrownUniversity with an undergraduatedegree in chemistry. He went on toearn a master’s degree in chemistryfrom Brown. In 1939, Tukey earnedhis doctorate in mathematics fromPrinceton University. He remainedat Princeton and, in 1965, becamethe founding chair of theDepartment of Statistics.Among his manyaccomplishments, Tukey is creditedwith coining the terms software and bit.In the early 1970s, he discussed thenegative effects of aerosol cans onthe ozone layer. In December 1976 hepublished Exploratory Data Analysis,from which the following quoteappears.“Exploratory data analysis cannever be the whole story, but nothingelse can serve as the foundation stone—as the first step.” (p. 3). Tukey alsorecommended that the 1990 Census beadjusted by means of statisticalformulas. John Tukey died in NewBrunswick, New Jersey, on July 26,2000.

Well, the reason we want these summaries is to gain a sense as to what the dataare trying to tell us. We are exploring the data to see if they contain any interestinginformation that may be useful in our research.The summaries make this explorationmuch easier. In fact, because these summaries represent an exploration, a famousstatistician, named John Tukey, called this material exploratory data analysis.

Tukey defined exploratory data analysis as “detective work—numerical detec-tive work—or graphical detective work.” He believed it is best to treat the explo-ration of data as a detective would search for evidence when investigating a crime.Our goal is only to collect and present evidence. Drawing conclusions (or inference)is like the deliberations of the jury. So, what we have done up to this point fallsunder the category of exploratory data analysis. We have not come to any conclu-sions; we have only collected information and presented summaries.

We have already looked at one of Tukey’s graphical summaries, the stem-and-leaf plot. In this section, we look at two more summaries: the five-number summaryand the boxplot.

1 Compute the Five-Number SummaryRemember that the median is a measure of central tendency that divides the lower50% of the data from the upper 50% of the data. It is resistant to extreme valuesand is the preferred measure of central tendency when data are skewed right or left.

The three measures of dispersion presented in Section 3.2 (range, variance, andstandard deviation) are not resistant to extreme values. However, the interquartilerange, – , is resistant. It measures the spread of the data by determining the dif-ference between the 25th and 75th percentiles. It is interpreted as the range of themiddle 50% of the data. However, the median, and do not provide informa-tion about the extremes of the data. To get this information, we need to know thesmallest and largest values in the data set.

The five-number summary of a set of data consists of the smallest data value,the median, and the largest data value. Symbolically, the five-number sum-

mary is presented as follows:Q3 ,Q1 ,

Q3Q1

Q1Q3

Five-Number SummaryMINIMUM Q1 M Q3 MAXIMUM

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 176

Page 50: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.5 The Five-Number Summary and Boxplots 177

Descriptive statistics: TimesVariableTimes

N21

N*0

Mean33.37

SE Mean2.20

StDev10.10

Minimum19.95

Q126.06

Median 30.95

Q337.24

Maximum 54.63

Figure 21

EXAMPLE 2 Obtaining the Five-Number Summary Using Technology

Problem: Using statistical software or a graphing calculator, determine the five-number summary of the data presented in Table 17.

Approach: We will use MINITAB to obtain the five-number summary.

Solution: Figure 21 shows the output supplied by MINITAB. The five-numbersummary is highlighted.

2 Draw and Interpret BoxplotsThe five-number summary can be used to create another graph, called the boxplot.

Drawing a Boxplot

Step 1: Determine the lower and upper fences:

Remember,Step 2: Draw vertical lines at M, and Enclose these vertical lines in abox.Step 3: Label the lower and upper fences.Step 4: Draw a line from to the smallest data value that is larger than thelower fence. Draw a line from to the largest data value that is smaller thanthe upper fence. These lines are called whiskers.Step 5: Any data values less than the lower fence or greater than the upperfence are outliers and are marked with an asterisk 1*2.

Q3

Q1

Q3 .Q1 ,IQR = Q3 - Q1 .

Upper fence = Q3 + 1.51IQR2 Lower fence = Q1 - 1.51IQR2

Solution: The data in ascending order are as follows:

19.95, 23.25, 23.32, 25.55, 25.83, 26.28, 28.58, 28.72, 30.18, 30.35, 30.95,32.13, 33.23, 33.53, 36.68, 37.05, 37.43, 41.42, 42.47, 49.17, 54.63

The smallest number (the fastest time) in the data set is 19.95.The largest number inthe data set is 54.63.The first quartile, is 26.06.The median, M, is 30.95.The thirdquartile, is 37.24. The five-number summary is

19.95 26.06 30.95 37.24 54.63

Q3 ,Q1 ,

Note to InstructorSome call the boxplot described here themodified boxplot.

EXAMPLE 3 Constructing a Boxplot

Problem: Use the results from Example 1 to a construct a boxplot of the finishingtimes of the men in the 60- to 64-year-old age group.

Approach: Follow the steps presented above.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 177

Page 51: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

0 10 20 30 40 50 60

[ ]Figure 22(c)

Step 4: The smallest data value that is larger than 9.29 (the lower fence) is 19.95.The largest data value that is smaller than 54.01 (the upper fence) is 49.17. We drawhorizontal lines from to 19.95 and from to 49.17. See Figure 22(c).Q3Q1

Step 5: Plot any values less than 9.29 (the lower fence) or greater than 54.01 (theupper fence) using an asterisk These values are outliers. So 54.63 is an outlier.Remove the brackets from the graph. See Figure 22(d).

1*2.

0 10 20 30 40 50 60

Figure 22(d)

178 Chapter 3 Numerically Summarizing Data

Step 3: Temporarily mark the location of the lower and upper fence with brackets([ and ]). See Figure 22(b).

0 10 20 30 40 50 60

Q1 M Q3Figure 22(a)

0 10 20 30 40 50 60

[ ]Lower Fence Upper FenceFigure 22(b)

Solution: From the results of Example 1, we know that andTherefore, the interquartile

The difference between the 75th percentile and 25th percentile is a time of11.18 minutes.

Step 1: We compute the lower and upper fences:

Step 2: Draw a horizontal number line with a scale that will accommodate ourgraph. Draw vertical lines at and Enclose theselines in a box. See Figure 22(a).

Q3 = 37.24.M = 30.95,Q1 = 26.06,

Upper fence = Q3 + 1.51IQR2 = 37.24 + 1.5111.182 = 54.01

Lower fence = Q1 - 1.51IQR2 = 26.06 - 1.5111.182 = 9.29

11.18.range = IQR = Q3 - Q1 = 37.24 - 26.06 =Q3 = 37.24.

M = 30.95,Q1 = 26.06,

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 178

Page 52: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.5 The Five-Number Summary and Boxplots 179

0

10

5

15

20

25

Freq

uenc

y

50 52 54 56 58 60 62 64 66 68

(a) Skewed right

MaxMMinQ1 Q3

48 52 56 60 64 68

Figure 23

6 7 8 9 10 11 12 13 14 15

(c) Skewed left

MaxMMin Q1 Q3

140

120

100

80

60

40

20

0

Freq

uenc

y

6.5 8.5 10.5 12.5 14.5

20 30 40 50 60 70 80

(b) Symmetric

MaxMin Q1 Q3

0

10

5

15

20

Freq

uenc

y

30 40 50 60 70 80

The boxplot in Figure 22(d) suggests that the distribution is skewed right, sincethe right whisker is longer than the left whisker and the median is left of center inthe box. We can also assess the shape using the quartiles. The distance from the me-dian to the first quartile is , while the distance from the medi-an to the third quartile is . Plus, the distance from the medianto the minimum value is , while the distance from the median tothe maximum value is .23.68 (= 54.63 - 30.95)

11 (= 30.95 - 19.95)6.29 (= 37.24 - 30.95)4.89 (= 30.95 - 26.06)

Using a Boxplot and Quartiles to Describe the Shape of a DistributionFigure 23 shows three histograms and their corresponding boxplots with the five-number summary labeled. We should notice the following from the figure.

• In Figure 23(a), the histogram shows the distribution is skewed right. Noticethat the median is left of center in the box, and the right whisker is longerthan the left whisker. Further notice the distance from the median to the firstquartile is less than the distance from the median to the third quartile. Also, thedistance from the median to the minimum value in the data set is less than thedistance from the median to the maximum value in the data set.

• In Figure 23(b), the histogram shows the distribution is symmetric. Noticethat the median is in the center of the box, and the left and right whiskers areroughly the same length. Further notice the distance from the median to thefirst quartile is the same as the distance from the median to the third quartile.Also, the distance from the median to the minimum value in the data setis the same as the distance from the median to the maximum value in thedata set.

• In Figure 23(c), the histogram shows the distribution is skewed left. Notice thatthe median is right of center in the box, and the left whisker is longer than theright whisker. Further notice the distance from the median to the first quartile ismore than the distance from the median to the third quartile. Also, the distancefrom the median to the minimum value in the data set is more than the distancefrom the median to the maximum value in the data set.

Now Work Problem 11

Identifying the shape of adistribution from a boxplot (or froma histogram, for that matter) issubjective. When identifying the shapeof a distribution from a graph, be sureto support your opinion.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 179

Page 53: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Now Work Problem 15

Flight versus Control

Figure 24

180 Chapter 3 Numerically Summarizing Data

EXAMPLE 4 Comparing Two Distributions by Using Boxplots

Problem: In the Spacelab Life Sciences 2, 14 male rats were sent to space. Upontheir return, the red blood cell mass (in milliliters) of the rats was determined. Acontrol group of 14 male rats was held under the same conditions (except for space-flight) as the space rats, and their red blood cell mass was also determined when thespace rats returned. The project was led by Paul X. Callahan. The data in Table 18were obtained. Construct side-by-side boxplots for red blood cell mass for the flightgroup and control group. Does it appear that the flight to space affected the redblood cell mass of the rats?

Table 18

Flight Control

7.43 7.21 8.59 8.64 8.65 6.99 8.40 9.66

9.79 6.85 6.87 7.89 7.62 7.44 8.55 8.70

9.30 8.03 7.00 8.80 7.33 8.58 9.88 9.94

6.39 7.54 7.14 9.14

Source: NASA Life Sciences Data Archive

Approach: When comparing two data sets, we draw side-by-side boxplots on thesame horizontal number line to make the comparison easy. Graphing calculatorswith advanced statistical features, as well as statistical spreadsheets such asMINITAB and Excel, have the ability to draw boxplots. We will use MINITAB todraw the boxplots. The steps for drawing boxplots using a TI-83/84 Plus graphingcalculator, MINITAB, or Excel are given in the Technology Step-by-Step onpage 183.

Solution: Figure 24 shows the side-by-side boxplots drawn in MINITAB. From theboxplots, it appears that the space flight has reduced the red blood cell mass of therats since the median for the flight group is less than the median forthe control group . The spread, as measured by the interquartile range,appears to be similar for both groups.

(M L 8.6)(M L 7.7)

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 180

Page 54: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.5 The Five-Number Summary and Boxplots 181

0 5 10 15 20

Concepts and Vocabulary1. Explain how to determine the shape of a distribution using

the boxplot and quartiles.

2. In a boxplot, if the median is to the left of the center of thebox and the right whisker is substantially longer than the leftwhisker, the distribution is skewed _________.

(a) To the nearest integer,what is the median of variable x?40(b) To the nearest integer, what is the third quartile of

variable y? 52(c) Which variable has more dispersion? Why? y(d) Describe the shape of the variable x. Support your

position. Symmetric(e) Describe the shape of the variable y. Support your

position. Skewed right

6. Use the side-by-side boxplots shown to answer the questionsthat follow.

3.5 ASSESS YOUR UNDERSTANDING

�5 0 5 10 15

5. Use the side-by-side boxplots shown to answer the questionsthat follow.

(a) To the nearest integer, what is the median of variable x? 16(b) To the nearest integer, what is the first quartile of vari-

able y? 22(c) Which variable has more dispersion? Why? y(d) Does the variable x have any outliers? If so, what is the

value of the outlier? Yes; 30(e) Describe the shape of the variable y. Support your

position. Skewed left

7. Exam Scores After giving a statistics exam, Professor Dangdetermined the following five-number summary for her classresults: 60 68 77 89 98. Use this information to draw a box-plot of the exam scores.

8. Speed Reading Jessica enrolled in a course that promised toincrease her reading speed. To help judge the effectivenessof the course, Jessica measured the number of words perminute she could read prior to enrolling in the course. Sheobtained the following five-number summary: 110 140 157173 205. Use this information to draw a boxplot of the read-ing speed.

Applying the Concepts9. Age at Inauguration The following data represent the age

of U.S. presidents on their respective inauguration days.

y

x

100 30 50 10020 40 60 80 90 110 12070

y

x

0 5 10 15 20 25 30 35 40

right

Skill BuildingIn Problems 3 and 4, (a) identify the shape of the distribution and(b) determine the five-number summary. Assume that each num-ber in the five-number summary is an integer.

3. (a) skewed right (b) 0, 1, 3, 6, 16

4. (a) symmetric(b) �1, 2, 5, 8, 11

42 48 51 52 54 56 57 61 65

43 49 51 54 55 56 57 61 68

46 49 51 54 55 56 58 62 69

46 50 51 54 55 57 60 64

47 50 52 54 55 57 61 64

Source: factmonster.com

7.2 8.5 9.0 9.4 10.0 10.3 11.2 11.5 13.8

7.8 8.6 9.1 9.6 10.0 10.3 11.2 11.5 14.4

7.8 8.6 9.2 9.7 10.0 10.3 11.2 11.7 16.4

7.9 8.6 9.2 9.7 10.1 10.7 11.3 12.4

8.1 8.7 9.2 9.9 10.2 10.7 11.3 12.5

8.3 8.8 9.4 9.9 10.3 10.9 11.3 13.6

Source: 2004 American Community Survey by the U.S. Census Bureau

(a) Find the five-number summary. 7.2, 9.0, 10.0, 11.2, 16.4(b) Construct a boxplot.(c) Comment on the shape of the distribution. Skewed right

(a) Find the five-number summary. 42, 51, 55, 58, 69(b) Construct a boxplot.(c) Comment on the shape of the distribution. symmetric

10. Carpoolers The following data represent the percentageof workers who carpool to work for the fifty states plusWashington D.C. Note: The minimum observation of 7.2%corresponds to Maine and the maximum observation of16.4% corresponds to Hawaii.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 181

Page 55: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

(a) Draw side-by-side boxplots for each vitamin type.(b) Which vitamin type has more dispersion? Generic(c) Which vitamin type appears to dissolve faster? Centrum

16. Chips per Cookie Do store-brand chocolate chip cookieshave fewer chips per cookie than Keebler’s Chips DeluxeChocolate Chip Cookies? To find out, a student randomlyselected 21 cookies of each brand and counted the numberof chips in the cookies. The results are shown next.

(a) Draw side-by-side boxplots for each brand of cookie.(b) Does there appear to be a difference in the number of

chips per cookie? Not really(c) Does one brand have a more consistent number of chips

per cookie? Keebler

Keebler Store Brand

32 23 28 21 23 24

28 28 29 24 25 27

25 20 25 26 26 21

22 21 24 18 16 24

21 24 21 21 30 17

26 28 24 23 28 31

33 20 31 27 33 29

Source: Trina McNamara, student at Joliet Junior College

182 Chapter 3 Numerically Summarizing Data

13. M&Ms In Problem 29 from Section 3.1, we drew a his-togram of the weights of M&Ms and found that the distribu-tion is symmetric. Draw a boxplot of these data. Use theboxplot and quartiles to confirm the distribution is symmet-ric. For convenience, the data are displayed again.

0.608 0.601 0.606 0.602 0.611

0.608 0.610 0.610 0.607 0.600

0.608 0.608 0.605 0.609 0.605

0.610 0.607 0.611 0.608 0.610

0.612 0.598 0.600 0.605 0.603

Source: Kelly Roe, student at Joliet Junior College

(a) Construct a boxplot.(b) Use the boxplot and quartiles to describe the shape of

the distribution. Skewed left

12. Emissions Revisited The following data represent thecarbon dioxide emissions per capita (total carbon dioxideemissions, in tons, divided by total population) for the coun-tries of Western Europe in 2004. In Problem 24 from Sec-tion 3.4, you found the quartiles.(a) Construct a boxplot.(b) Use the boxplot and quartiles to describe the shape of

the distribution. Skewed right

CO2

2.34 1.34 3.86 1.40 2.08

2.64 3.44 2.07 1.67 1.61

2.09 1.64 2.87 2.38 1.47

1.44 3.65 2.12 5.22 2.67

2.68 2.39 6.81 1.53 1.01

2.21

Source: Carbon Dioxide Information Analysis Center

0.87 0.88 0.82 0.90 0.90 0.84 0.84

0.91 0.94 0.86 0.86 0.86 0.88 0.87

0.89 0.91 0.86 0.87 0.93 0.88

0.83 0.95 0.87 0.93 0.91 0.85

0.91 0.91 0.86 0.89 0.87 0.84

0.88 0.88 0.89 0.79 0.82 0.83

0.90 0.88 0.84 0.93 0.81 0.90

0.88 0.92 0.85 0.84 0.84 0.86

Source: Michael Sullivan

108 108 99 105 103 103 94

102 99 106 90 104 110 110

103 109 109 111 101 101

110 102 105 110 106 104

104 100 103 102 120 90

113 116 95 105 103 101

100 101 107 110 92 108

Source: Ladonna Hansen, Park Curator

Centrum Generic Brand

2.73 3.07 3.30 3.35 3.12 6.57 6.23 6.17 7.17 5.77

2.95 2.15 2.67 2.80 2.25 6.73 5.78 5.38 5.25 5.55

2.60 2.57 4.02 3.02 2.15 5.50 6.50 7.42 6.47 6.30

3.03 3.53 2.63 2.30 2.73 6.33 7.42 5.57 6.35 5.92

3.92 2.38 3.25 4.00 3.63 5.35 7.25 7.58 6.50 4.97

3.02 4.17 4.33 3.85 2.23 7.13 5.98 6.60 5.03 7.18

Source: Amanda A. Sindewald, student at Joliet Junior College

15. Dissolving Rates of Vitamins A student wanted to knowwhether Centrum vitamins dissolve faster than the corre-sponding generic brand. The student used vinegar as a proxyfor stomach acid and measured the time (in minutes) it tookfor a vitamin to completely dissolve. The results are shownnext.

14. Old Faithful In Problem 30 from Section 3.1, we drew a his-togram of the length of eruption of California’s Old Faithfulgeyser and found that the distribution is symmetric. Draw aboxplot of these data. Use the boxplot and quartiles to con-firm the distribution is symmetric. For convenience, the dataare displayed again.

NW

11. Got a Headache? The following data represent the weight(in grams) of a random sample of 25 Tylenol tablets.NW

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 182

Page 56: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Section 3.5 The Five-Number Summary and Boxplots 183

17. Putting It Together: Paternal Smoking It is well-documentedthat active maternal smoking during pregnancy is associatedwith lower birth weight babies. Researchers Fernando D.Martinez and associates wanted to determine if there is arelationship between paternal smoking habits and birthweight. The researchers administered a questionnaire toeach parent of newborn infants. One question askedwhether the individual smoked regularly. Because the surveywas administered within 15 days of birth, it was assumed thatany regular smokers were also regular smokers during preg-nancy. Birthweights for the babies (in grams) of nonsmokingmothers were obtained and divided into two groups, non-smoking fathers and smoking fathers.The given data are rep-resentative of the data collected by the researchers. Theresearchers concluded that the birth weight of babies whosefather smoked was less than the birth weight of babies whosefather did not smoke.

Source: “The Effect of Paternal Smoking on the Birthweightof Newborns Whose Mothers Did Not Smoke” Fernando D.

Martinez, Anne L. Wright, Lynn M. Taussig, American Jour-nal of Public Health Vol. 84, No. 9(a) Is this an observational study or a designed experiment?

Why? Observational study(b) What is the explanatory variable? What is the response

variable? Whether the father smoked or not; birth weight(c) Can you think of any lurking variables that may affect

the results of the study?(d) In the article, the researchers stated that, “birthweights

were adjusted for possible confounders . . .”. What doesthis mean?

(e) Determine summary statistics (mean, median, standarddeviation, quartiles) for each group.

(f) Interpret the first quartile for both the nonsmoker andsmoker group.

(g) Draw a side-by-side boxplot of the data. Does the side-by-side boxplot confirm the conclusions of the study?

Nonsmokers Smokers

4194 3128 3522 3290 3454 3423 3998 3686 3455 2851 3066 3145

3062 3471 3771 4354 3783 3544 3150 3807 2986 3548 2918 4104

3544 3994 3746 2976 4019 4067 4216 3963 3502 3892 3457 2768

4054 3732 3518 3823 3884 3302 3493 3769 3255 3509 3234 3629

4248 3436 3719 3976 3668 3263 2860 4131 3282 3129 2746 4263

TI-83/84 Plus1. Enter the raw data into L1.2. Press 2nd Y= and select 1:Plot 1.3. Turn the plots ON. Use the cursor tohighlight the modified boxplot icon. Yourscreen should look as follows:

TECHNOLOGY STEP-BY-STEP Drawing Boxplots Using Technology

Excel1. Load the DDXL Add-in.2. Enter the raw data into column A. If you aredrawing side-by-side boxplots, enter all the datain column A and use index values to identifywhich group the data belongs to in column B.Consider Example 4, we might use 1 for theflight group and 2 for the control group.Highlight the data.3. Select the DDXL menu and highlight Chartsand Plots. If you are drawing a single boxplot,select “Boxplot” from the pull-down menu; ifyou drawing side-by-side boxplots, select“Boxplot by Groups” from the pull-downmenu.4. Put the cursor in the “Quantitative Variable”window. From the Names and Columnswindow, select the column of the data and clickthe � arrow. If you are drawing side-by-sideboxplots, place the cursor in the “GroupVariable” window. From the Names andColumns window, select the column of theindices and click the � arrow. If the first rowcontains the variable name, check the “Firstrow is variable names” box. Click OK.

4. Press ZOOM and select 9:ZoomStat.

MINITAB1. Enter the raw data into column C1.2. Select the Graph menu and highlight Boxplot3. For a single boxplot, select One Y, simple.For two or more boxplots, select MultipleY’s, simple.4. Select the data to be graphed. If you wantthe boxplot to be horizontal rather thanvertical, select the Scale button, then transposevalue and category scales. Click OK.

Á

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 183

Page 57: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Arithmetic mean (p. 129)Population arithmetic mean (p. 129)Sample arithmetic mean (p. 129)Mean (p. 129)Median (p. 131)Resistant (p. 134)Mode (p. 135)No mode (p. 136)Bimodal (p. 136)Multimodal (p. 136)Dispersion (p. 143)Range (p. 144)

Deviation about the mean (pp. 144–145)Population variance (p. 145)Computational formula (p. 145)Sample variance (p. 146)Biased (p. 147)Degrees of freedom (p. 147)Population standard deviation (p. 148)Sample standard deviation (p. 148)The Empirical Rule (p. 150)Chebyshev’s Inequality (p. 152)Weighted mean (p. 162)z-score (p. 167)

kth percentile (p. 168)Quartiles (p. 169)Interquartile range (p. 171)Describe the distribution (p. 171)Outlier (p. 171)Fences (p. 172)Exploratory data analysis (p. 176)Five-number summary (p. 176)Boxplot (p. 177)Whiskers (p. 177)

Vocabulary

184 Chapter 3 Numerically Summarizing Data

We can determine the relative position of an observation ina data set using z-scores and percentiles. A z-score denotes howmany standard deviations an observation is from the mean. Per-centiles determine the percent of observations that lie above andbelow an observation. The interquartile range is a resistant meas-ure of dispersion. The upper and lower fences can be used toidentify potential outliers. Any potential outlier must be investi-gated to determine whether it was the result of a data-entry erroror some other error in the data-collection process, or of an un-usual value in the data set.

The five-number summary provides an idea about thecenter and spread of a data set through the median and theinterquartile range. The length of the tails in the distributioncan be determined from the smallest and largest data values.The five-number summary is used to construct boxplots. Box-plots can be used to describe the shape of the distributionand to visualize outliers.

CHAPTER 3 REVIEW

SummaryThis chapter concentrated on describing distributions numerically.Measures of central tendency are used to indicate the typicalvalue in a distribution. Three measures of central tendency werediscussed. The mean measures the center of gravity of the distri-bution. The median separates the bottom 50% of the data fromthe top 50%.The data must be quantitative to compute the mean.The data must be at least ordinal to compute the median. Themode measures the most frequent observation. The data can beeither quantitative or qualitative to compute the mode. Themedian is resistant to extreme values, while the mean is not.

Measures of dispersion describe the spread in the data. Therange is the difference between the highest and lowest datavalues. The variance measures the average squared deviationabout the mean. The standard deviation is the square root of thevariance. The range, variance, and standard deviation are notresistant. The mean and standard deviation are used in manytypes of statistical inference.

The mean, median, and mode can be approximated fromgrouped data. The variance and standard deviation can also beapproximated from grouped data.

FormulasPopulation Mean Sample Mean

x =

axi

nm =

axi

N

Population Variance

Sample Variance

Population Standard Deviation

s = 3s2

s2=

a 1xi - x22n - 1

=

axi2

-

Aaxi B2n

n - 1

s2=

a 1xi - m22N

=

axi2

-

Aaxi B2N

N

Sample Standard Deviation

Weighted Mean

Population Mean from Grouped Data

Sample Mean from Grouped Data

x =

axifi

afi

m =

axifi

afi

xw =

awixi

awi

Range = Largest Data Value - Smallest Data Value

s = 3s2

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 184

Page 58: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Chapter 3 Review 185

Review Exercises1. Muzzle Velocity The following data represent the muzzle

velocity (in meters per second) of rounds fired from a155-mm gun.

Interquartile Range

Lower and Upper Fences

Upper Fence = Q3 + 1.51IQR2 Lower Fence = Q1 - 1.51IQR2

IQR = Q3 - Q1

Population Variance from Grouped Data

Sample Variance from Grouped Data

Population z-Score

z =

x - m

s

s2=

a 1xi - x22fi

Aafi B - 1

s2=

a 1xi - m22fi

afi

ObjectivesSection You should be able to Example Review Exercises

3.1 1 Determine the arithmetic mean of a variable from raw data (p. 129) 1, 4, 6 1(a), 2(a), 3(a), 4(c), 10(a)2 Determine the median of a variable from raw data (p. 131) 2, 3, 4, 6 1(a), 2(a), 3(a), 4(c), 10(a)3 Explain what it means for a statistic to be resistant (p. 133) 5 2(c), 10(h), 10(i), 124 Determine the mode of a variable from raw data (p. 135) 7, 8, 9 3(a), 4(d)

3.2 1 Compute the range of a variable from raw data (p. 144) 2 1(b), 2(b), 3(b)2 Compute the variance of a variable from raw data (p. 144) 3, 4, 6 1(b)3 Compute the standard deviation of a variable from raw data (p. 148) 5, 6, 7 1(b), 2(b), 3(b), 10(d)4 Use the Empirical Rule to describe data that are bell shaped (p. 150) 8 5(a)–(d)5 Use Chebyshev’s inequality to describe any set of data (p. 152) 9 5(e)–(f)

3.3 1 Approximate the mean of a variable from grouped data (p. 160) 1, 4 6(a)2 Compute the weighted mean (p. 161) 2 73 Approximate the variance and standard deviation of a variable

from grouped data (p. 162) 3, 4 6(b)

3.4 1 Determine and interpret z-scores (p. 167) 1 82 Interpret percentiles (p. 168) 2 113 Determine and interpret quartiles (p. 169) 3, 4 10(b)4 Determine and interpret the interquartile range (p. 170) 5 2(b), 10(d)5 Check a set of data for outliers (p. 171) 6 10(e)

3.5 1 Compute the five-number summary (p. 176) 1, 2 10(c)2 Draw and interpret boxplots (p. 177) 3, 4 9, 10(f)–10(g)

Á

793.8 793.1 792.4 794.0 791.4

792.4 791.7 792.3 789.6 794.4

Source: Christenson, Ronald, and Blackwood, Larry, “Tests for Precisionand Accuracy of Multiple Measuring Devices,” Technometrics, 35(4):411–421, 1993.

14,050 13,999 12,999 10,995 9,980

8,998 7,889 7,200 5,500

Source: cars.com

1. (a)(a) Compute the sample mean and median muzzle velocity.(b) Compute the range, sample variance, and sample standard

deviation.

2. Price of Chevy Cobalts The following data represent thesales price in dollars for nine 2-year-old Chevrolet Cobalts inthe Los Angeles area.

Range � 4.8 m>s, s2 � 2.03, s � 1.42 m>s

x � 792.51 m>s, M � 792.40 m>s

2. (a)(a) Determine the sample mean and median price.(b) Determine the range, sample standard deviation,and interquar-

tile range. , ,(c) Redo (a) and (b) if the data value 14,050 was incorrectly

entered as 41,050. How does this change affect themean? the median? the range? the standard deviation?the interquartile range? Which of these values isresistant? Median and IQR are resistant.

3. Chief Justices The following data represent the ages ofchief justices of the U.S. Supreme Court when they wereappointed.

IQR � $5,954.5s � $3,074.9Range � $8,550

x � $10,178.9; M � $9,980

Sample z-Score

z =

x - xs

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 185

Page 59: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

186 Chapter 3 Numerically Summarizing Data

(c) Assuming the data are bell shaped, what percentage oflight bulbs will last between 547 and 706 hours? 81.5%

(d) If the company that manufactures the light bulb guaran-tees to replace any bulb that does not last at least441 hours, what percentage of light bulbs can the firmexpect to have to replace, according to the EmpiricalRule? 0.15%

(e) Use Chebyshev’s Inequality to determine the minimumpercentage of light bulbs with a life within 2.5 standarddeviations of the mean. 84%

(f) Use Chebyshev’s Inequality to determine the minimumpercentage of light bulbs with a life between 494 and706 hours. 75%

6. Travel Time to Work The frequency distribution listed in thetable represents the travel time to work (in minutes) for arandom sample of 895 U.S. adults.

Justice Age

John Jay 44

John Rutledge 56

Oliver Ellsworth 51

John Marshall 46

Roger B. Taney 59

Salmon P. Chase 56

Morrison R. Waite 58

Melville W. Fuller 55

Edward D. White 65

William H. Taft 64

Charles E. Hughes 68

Harlan F. Stone 69

Frederick M. Vinson 56

Earl Warren 62

Warren E. Burger 62

William H. Rehnquist 62

John G. Roberts 50

Source: Information Please Almanac

(a) Compute the population mean, median, and modeages. , , bimodal: 56 and 62

(b) Compute the range and population standard deviationages. Range � 25, � 7.0

(c) Obtain two simple random samples of size 4, and computethe sample mean and sample standard deviation ages.

4. Number of Tickets Issued As part of a statistics project, astudent surveys 30 randomly selected students and asks themhow many speeding tickets they have been issued in the pastmonth. The results of the survey are as follows:

S

M � 58.0m � 57.8

1 1 0 1 0 0

0 0 0 1 0 1

0 1 2 0 1 1

0 0 0 0 1 1

0 0 0 0 1 0

(a) Draw a frequency histogram of the data and describe itsshape. Skewed right

(b) Based on the shape of the histogram, do you expectthe mean to be more than, equal to, or less than themedian?

(c) Compute the mean and the median of the number oftickets issued. ;

(d) Determine the mode of the number of tickets issued. 0

5. Chebyshev’s Inequality and the Empirical Rule Supposethat a random sample of 200 light bulbs has a mean life of600 hours and a standard deviation of 53 hours.(a) A histogram of the data indicates the sample data follow

a bell-shaped distribution. According to the EmpiricalRule, 99.7% of light bulbs have lifetimes between ______and ______ hours.

(b) Assuming the data are bell shaped, determine the per-centage of light bulbs that will have a life between 494and 706 hours? 95%

M � 0x � 0.4

Mean>median

Travel Time (minutes) Frequency

0–9 28

10–19 151

20–29 177

30–39 206

40–49 94

50–59 93

60–69 88

70–79 45

80–89 13

Source: Based on data from the2006 American Community Survey

(a) Approximate the mean travel time to work for U.S.adults in 2006. 37.5 minutes

(b) Approximate the standard deviation travel time to workfor U.S. adults in 2006. 19.2 minutes

7. Weighted Mean Michael has just completed his first semes-ter in college. He earned an A in his 5-hour calculus course, aB in his 4-hour chemistry course, an A in his 3-hour speechcourse, and a C in his 3-hour psychology course.Assuming anA equals 4 points, a B equals 3 points, and a C equals 2 points,determine Michael’s grade-point average if grades areweighted by class hours. 3.33

8. Weights of Males versus Females According to the NationalCenter for Health Statistics, the mean weight of a 20- to29-year-old female is 156.5 pounds, with a standard deviationof 51.2 pounds. The mean weight of a 20- to 29-year-old maleis 183.4 pounds, with a standard deviation of 40.0 pounds.Who is relatively heavier: a 20- to 29-year-old female whoweights 160 pounds or a 20- to 29-year-old male who weighs185 pounds? Female

9. Population Density The side-by-side boxplot shows the den-sity (population per square kilometer) of the 50 UnitedStates (USA) and the fifteen countries in the EuropeanUnion (EU).(a) Is the USA or the EU more densely populated? Why? EU(b) Does the USA or the EU have more dispersion in

density? Why?

441759

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 186

Page 60: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Chapter Test 187

Population Density

EU

USA

0 100 200 300 400 500

� � � �

Density (people per km2)

(c) Are there any outliers in the density for the EU? If so,approximate their values. 490

(d) What is the shape of each distribution? Skewed right

(a) Determine the mean and median number of words in apresidential inaugural address.

(b) Determine and interpret the quartiles for the number ofwords in a presidential inaugural address.

(c) Determine the five-number summary for the number ofwords in a presidential inaugural address.

(d) Determine the standard deviation and interquartilerange for the number of words in a presidential inaugu-ral address.

(e) Are there any outliers in the data set? If so, what is (are)the value(s)? Yes, 5433 and 8445

(f) Draw a boxplot of the data.(g) Describe the shape of the distribution. Support your po-

sition using the boxplot and quartiles. Skewed right(h) Which measure of central tendency do you think better

describes the typical number of words in an inauguraladdress? Why? Median

(i) Which measure of dispersion do you think better de-scribes the spread of the typical number of words in aninaugural address? Why? IQR

11. You Explain It! Percentiles According to the National Cen-ter for Health Statistics, a 19-year-old female whose height is67.1 inches has a height that is at the 85th percentile. Explainwhat this means.

12. Skinfold Thickness Procedure One method of estimatingbody fat is through skinfold thickness measurement. Themeasurement can use from three to nine different standardanatomical sites around the body. The right side is usuallyonly measured (for consistency). The tester pinches the skinat the appropriate site to raise a double layer of skin andthe underlying adipose tissue, but not the muscle. Calipersare then applied 1 centimeter below and at right angles to thepinch and a reading is taken 2 seconds later.The mean of twomeasurements should be taken. If the two measurements dif-fer greatly, a third should be done and then the median valuetaken. Explain why a median is used as the measure of cen-tral tendency when three measures are taken, rather than themean. Resistance

10. (b) 1,355; 2,144; 2,978 words10. (c) 135, 1355, 2144, 2978, 8445

S � 1,418 words; IQR � 1,623 words

10. Presidential Inaugural Addresses Ever wonder how manywords are in a typical inaugural address? The following datarepresent the lengths of all the inaugural addresses (measuredin word count) for all presidents up to George W. Bush’s sec-ond address. (a) � 2,363 words; M � 2,144 words m

1,425 1,125 1,128 5,433 2,242 2,283

135 1,172 1,337 1,802 2,446 1,507

2,308 3,838 2,480 1,526 2,449 2,170

1,729 8,445 2,978 3,318 1,355 1,571

2,158 4,776 1,681 4,059 1,437

1,175 996 4,388 3,801 2,130

1,209 3,319 2,015 1,883 1,668

3,217 2,821 3,967 1,807 1,087

4,467 3,634 2,217 1,340 2,463

2,906 698 985 559 2,546

Source: infoplease.com

1. The following data represent the amount of snowfall (ininches) received at Mammoth Mountain in the Sierra Moun-tains from the 1996–1997 ski season to the 2005–2006 ski sea-son. Treat these data as a sample of all ski seasons.

2. The Federal Bureau of Investigation classifies various larce-nies. The following data represent the type of larcenies basedon a random sample of 15 larcenies.What is the mode type oflarceny? From motor vehicles

CHAPTER TEST.

399 542 347 381 400

289 358 439 602 661

Source: Mammoth Mountain

(a) Determine the mean amount of snowfall. 441.8 inches(b) Determine the median amount of snowfall. 399.5 inches(c) Suppose the observation 661 inches was incorrectly

recorded at 1661 inches. Recompute the mean and themedian.What do you notice? What property of the medi-an does this illustrate? ; inches; resistance

M � 399.5x � 541.8 inches

Pocket picking and purse snatching

From motor vehicles From motorvehicles

Bicycles From motor vehicles

From buildings

From buildings Shoplifiting Motor vehicleaccessories

From motor vehicles Shoplifiting From motor vehicles

From motor vehicles Pocket pickingand pursesnatching

From motor vehicles

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 187

Page 61: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

9. The following data represent the weights (in grams) of 50randomly selected quarters. Determine and interpret thequartiles. Does the data set contain any outliers?5.58 g; 5.60 g; 5.67 g; yes, 5.84 g

10. Armando is filling out a college application. The applicationrequires that Armando supply either his SAT math score, orhis ACT math score. Armando scored 610 on the SAT mathand 27 on the ACT math. Which score should Armandoreport, given that the mean SAT math score is 515 with astandard deviation of 114, and the mean ACT math scores is21.0 with a standard deviation of 5.1? Why? ACT

11. According to the National Center for Health Statistics, a10-year-old male whose height is 53.5 inches has a height thatis at the 15th percentile. Explain what this means.

12. The distribution of income tends to be skewed to the right.Suppose you are running for a congressional seat and wish toportray that the average income in your district is low. Whichmeasure of central tendency, the mean or the median, wouldyou report? Why? Median

5.49 5.58 5.60 5.62 5.68

5.52 5.58 5.60 5.63 5.70

5.53 5.58 5.60 5.63 5.71

5.53 5.58 5.60 5.63 5.71

5.53 5.58 5.60 5.65 5.72

5.56 5.58 5.60 5.66 5.73

5.57 5.59 5.60 5.66 5.73

5.57 5.59 5.61 5.66 5.73

5.57 5.59 5.62 5.67 5.74

5.57 5.59 5.62 5.67 5.84

188 Chapter 3 Numerically Summarizing Data

3. Determine the range of the snowfall data from Problem 1.

4. (a) Determine the standard deviation of the snowfall datafrom Problem 1. 120.4 inches

(b) By hand, determine and interpret the interquartile rangeof the snowfall data from Problem 1. 184 inches

(c) Which of these two measures of dispersion is resistant?Why? IQR

5. In a random sample of 250 toner cartridges, the mean num-ber of pages a toner cartridge can print is 4,302 and the stan-dard deviation is 340.(a) Suppose a histogram of the data indicates that the sam-

ple data follow a bell-shaped distribution. According tothe Empirical Rule, 99.7% of toner cartridges will printbetween ______ and ______ pages.

(b) Assuming that the distribution of the data are bellshaped, determine the percentage of toner cartridgeswhose print total is between 3,622 and 4,982 pages. 95%

(c) If the company that manufactures the toner cartridgesguarantees to replace any cartridge that does not print atleast 3,622 pages, what percent of cartridges can the firmexpect to be responsible for replacing, according to theEmpirical Rule? 2.5%

(d) Use Chebyshev’s inequality to determine the minimumpercentage of toner cartridges with a page count within1.5 standard deviations of the mean. 55.6%

(e) Use Chebyshev’s inequality to determine the minimumpercentage of toner cartridges that print between 3,282and 5,322 pages. 88.9%

6. The following data represent the length of time (in minutes)between eruptions of Old Faithful in Yellowstone NationalPark.

Time (minutes) Frequency

40–49 8

50–59 44

60–69 23

70–79 6

80–89 107

90–99 11

100–109 1

(a) Approximate the mean length of time betweeneruptions. 74.9 minutes

(b) Approximate the standard deviation length of time be-tween eruptions. 14.7 minutes

7. Yolanda wishes to develop a new type of meatloaf to sell ather restaurant. She decides to combine 2 pounds of groundsirloin (cost $2.70 per pound), 1 pound of ground turkey

(cost $1.30 per pound), and pound of ground pork (cost

$1.80 per pound). What is the cost per pound of the meatloaf?$2.17

8. An engineer is studying bearing failures for two differentmaterials in aircraft gas turbine engines. The following dataare failure times (in millions of cycles) for samples of the twomaterial types.

12

(a) Compute the sample mean failure time for each material .(b) By hand, compute the median failure time for each

material.(c) Compute the sample standard deviation of the failure

times for each material. Which material has its failuretimes more dispersed?

(d) By hand, compute the five-number summary for eachmaterial.

(e) On the same graph, draw boxplots for the two materials.Annotate the graph with some general remarks compar-ing the failure times.

(f) Describe the shape of the distribution of each materialusing the boxplot and quartiles. Skewed right

Material A

3.17 5.88

4.31 6.91

4.52 8.01

4.66 8.97

5.69 11.92

Material B

5.78 9.65

6.71 13.44

6.84 14.71

7.23 16.39

8.20 24.37

8. (c) ; Material B 8. (d) A (in million cycles); 3.17, 4.52, 5.785, 8.01, 11.92 B (in million cycles); 5.78, 6.84, 8.925, 14.71, 24.37

sA � 2.626 million cycles; sB � 5.900 million cycles8. (b) MB � 8.925 million cyclesMA � 5.785 million cycles;

372 inches

3,282 5,322

8. (a) ; xB � 11.332 million cyclesxA � 6.404 million cycles

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 188

Page 62: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Chapter Test 189

Freq

uenc

y

0

20

15

10

5

300 500 700 800200 400 600I

Freq

uenc

y

0

18

16

14

12

10

8

6

4

2

400 480 560440 520 600II

13. Answer the following based on the histograms shown.(a) Which measure of central tendency would you recommend reporting for the data whose histogram is shown in Figure I?

Why? Mean(b) Which one has more dispersion? Explain. I

What Car Should IBuy?

Suppose you are in the market topurchase a used car. To make an in-

formed decision regarding your purchase,you would like to collect as much information aspossible.Among the information you might consid-er are the typical price of the car, the typical num-ber of miles the car should have, and its crash testresults, insurance costs, and expected repair costs.

1. Make a list of at least three cars that youwould consider purchasing. To be fair, the carsshould be in the same class (such as compact, mid-size, and so on).They should also be of the same age.2. Collect information regarding the three cars inyour list by finding at least eight cars of eachtype that are for sale. Obtain such information asthe asking price and the number of miles thecar has. Sources of data include your local news-paper, classified ads, and car websites (such as www.cars.com and www.vehix.com). Compute

summary statistics for asking price, number of miles, and other variables of interest. Using the samescale, draw boxplots of each variable considered.3. Go to the Insurance Institute for HighwaySafety website (www.hwysafety.org). Select theVehicle Ratings link. Choose the make and modelfor each car you are considering. Obtain informa-tion regarding crash testing for each car under con-sideration. Compare cars in the same class. Howdoes each car compare? Is one car you are consid-ering substantially safer than the others? Whatabout repair costs? Compute summary statistics forcrash tests and repair costs.4. Obtain information about insurance costs. Con-tact various insurance companies to determine thecost of insuring the cars you are considering. Com-pute summary statistics for insurance costs anddraw boxplots.5. Write a report supporting your conclusion re-garding which car you would purchase.

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 189

Page 63: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

Which colonial patriot penned the VINDEX essay in the January 8, 1770, issue ofThe Boston Gazette and Country Journal? Who wrote the 12 contested Federalistpapers? Such questions about authorship of unattributed documents are one of themany problems confronting historians. Statistical analyses, when combined withother historical facts, often help resolve these mysteries.

One such historical conundrum is the identity of the writer known simply as AMOURNER. A letter appeared in the February 26, 1770, issue of The BostonGazette and Country Journal. The text of the letter is as follows:

The general Sympathy and Concern for the Murder of the Lad by the baseand infamous Richardson on the 22d Instant, will be sufficient Reason foryour Notifying the Publick that he will be buried from his Father’s House inFrogg Lane, opposite Liberty-Tree, on Monday next, when all the Friends ofLiberty may have an Opportunity of paying their last Respects to the Re-mains of this little Hero and first Martyr to the noble Cause—Whose manlySpirit (after this Accident happened) appear’d in his discreet Answers to hisDoctor, his Thanks to the Clergymen who prayed with him, and Parents,and while he underwent the greatest Distress of bodily Pain; and with whichhe met the King of Terrors. These Things, together with the several heroicPieces found in his Pocket, particularly Wolfe’s Summit of human Glory,gives Reason to think he had a martial Genius, and would have made aclever Man.

A MOURNER.

The Lad the writer refers to is Christopher Sider, an 11-year-old son of a poorGerman immigrant. Young Sider was shot and killed on February 22, 1770, in acivil disturbance involving schoolboys, patriot supporters of the nonimportationagreement, and champions of the English crown. This event preceded the bloodyBoston Massacre by just a couple of weeks. Of Sider’s funeral, John Adams wrote,“My eyes never beheld such a funeral. The Procession extended further than canbe well imagined.” (Diary of John Adams entry for 1770. MONDAY FEB. 26 ORTHEREABOUTS.)

From a historical perspective, the identity of A MOURNER remains a mystery.However, it seems clear from the letter’s text that the author supported the patriotposition. This assumption somewhat narrows the field of possible writers.

Ordinarily, a statistical analysis of the frequencies of various words containedin the contested document, when compared with frequency analyses for knownauthors, would permit an identity inference to be drawn. Unfortunately, in this case,the letter is too short for this strategy to be useful. Another possibility is based onthe frequencies of word lengths. In this instance, a simple count of letters is generat-ed for each word in the document. Proper names, numbers, abbreviations, and titlesare removed from consideration, because these do not represent normal vocabularyuse. Care must be taken in choosing comparison texts, since colonial publishers ha-bitually incorporated their own spellings into the essays they printed.Therefore, it isdesirable to get comparison texts from the same printer the contested material camefrom.

The following table contains the summary analysis for six passages printed inThe Boston Gazette, and Country Journal in early 1770. The Tom Sturdy (probably apseudonym) text was included because of the use of the phrase “Friends of Liberty,”which appears in A MOURNER’s letter, as opposed to the more familiar “Sons ofLiberty.” John Hancock, James Otis, and Samuel Adams are included because theyare well-known patriots who frequently wrote articles that were carried by theBoston papers. In the case of Samuel Adams, the essay used in this analysis wassigned VINDEX, one of his many pseudonyms. The table presents three summaries

CA

SE S

TUD

Y

190

Who Was “A MOURNER”?

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 190

Page 64: M03 SULL8028 03 SE C03 - Higher Education | Pearson your purchase, you decide to collect as much information as possible. What information is im-portant in helping you make this decision?

JohnHancock

JamesOtis

SamuelAdams-1

SamuelAdams-2

Mean 4.08 4.69 4.58 4.60 4.52 4.56

Median 4 4 4 3 4 4

Mode 2 3 2 2 2 2

Standard deviation 2.17 2.60 2.75 2.89 2.70 2.80

Sample variance 4.70 6.76 7.54 8.34 7.30 7.84

Range 13 9 14 12 16 16

Minimum 1 1 1 1 1 1

Maximum 14 10 15 13 17 17

Sum 795 568 842 810 682 1492

Count 195 121 184 176 151 27

Summary Statistics of Word Length from Sample Passages by Various Potential Authors of the Letter Signed A MOURNER

Tom Sturdy

SamuelAdams-1 & 2

of work penned by Adams. The first two originate from two separate sections of theVINDEX essay. The last summary is a compilation of the first two.

1. Acting as a historical detective, generate a data set consisting of the length ofeach word used in the letter signed by A MOURNER. Be sure to disregard anytext that uses proper names, numbers, abbreviations, or titles.2. Calculate the summary statistics for the letter’s word lengths. Compare yourfindings with those of the known authors, and speculate about the identity of AMOURNER.3. Compare the two Adams summaries. Discuss the viability of word-lengthanalysis as a tool for resolving disputed documents.4. What other information would be useful to identify A MOURNER?

191

M03_SULL8028_03_SE_C03.QXD 9/9/08 11:02 AM Page 191