doane chapter 04b

7/30/2019 Doane Chapter 04B


Descriptive Statistics (Part 2)

Standardized Data

Percentiles and Quartiles

Box Plots

Grouped Data

Skewness and Kurtosis (optional)

Chapter 4


For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean must be at least $100[1 - 1/k^2]$ (for $k > 1$).

Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

Standardized Data

Chebyshev's Theorem


For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$. So, at least 75.0% will lie within $\mu \pm 2\sigma$.

For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$. So, at least 88.9% will lie within $\mu \pm 3\sigma$.

Although applicable to any data set, these limits tend to be too wide to be useful.

Standardized Data

Chebyshev's Theorem
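The Chebyshev bound for k = 2 and k = 3 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `chebyshev_min_percent` is made up for illustration and is not from the text.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

The bound makes no assumption about the shape of the distribution, which is why it is so much looser than limits derived for a specific distribution.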


The Empirical Rule states that for data from a normal distribution, we expect that for

k = 1, about 68.26% will lie within $\mu \pm 1\sigma$
k = 2, about 95.44% will lie within $\mu \pm 2\sigma$
k = 3, about 99.73% will lie within $\mu \pm 3\sigma$

The normal or Gaussian distribution was named for Karl Gauss (1777-1855).

The normal distribution is symmetric and is also known as the bell-shaped curve.

Standardized Data

The Empirical Rule
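The Empirical Rule percentages can be reproduced from the normal distribution itself. The sketch below uses the standard identity that the probability of falling within k standard deviations of the mean is erf(k/√2); the function name `normal_percent_within` is an illustrative choice. The computed values may differ from the slide's figures in the last digit, because the slide's figures come from a rounded normal table.

```python
import math


def normal_percent_within(k: float) -> float:
    """Percent of a normal distribution lying within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_percent_within(k):.2f}%")
```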


Note: no upper bound is given.

Data values outside $\mu \pm 3\sigma$ are rare.

Distance from the mean is measured in terms of the number of standard deviations.

Standardized Data

The Empirical Rule


If 80 students take an exam, how many will score within 2 standard deviations of the mean?

Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within $\mu \pm 2\sigma$, so 95.44% × 80 ≈ 76 students will score within $\pm 2\sigma$ of $\mu$.

How many students will score more than 2 standard deviations from the mean?

Standardized Data

Example: Exam Scores


Unusual observations are those that lie beyond $\mu \pm 2\sigma$. Outliers are observations that lie beyond $\mu \pm 3\sigma$.

Standardized Data

Unusual Observations


For example, the P/E ratio data contains several large data values. Are they unusual or outliers?

7 8 8 10 10 10 10 12 13 13 13 13
13 13 13 14 14 14 15 15 15 15 15 16
16 16 17 18 18 18 18 19 19 19 19 19
20 20 20 21 21 21 22 22 23 23 23 24
25 26 26 26 26 27 29 29 30 31 34 36
37 40 41 45 48 55 68 91

Standardized Data

Unusual Observations


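If the sample came from a normal distribution, then the Empirical Rule states that about 68.26%, 95.44%, and 99.73% of the values should lie within the following intervals, respectively:

$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$
$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$

Standardized Data

The Empirical Rule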


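Are there any unusual values or outliers?

7 8 . . . 48 55 68 91

[Figure: number line for the P/E data with tick marks at the mean and at 1, 2, and 3 standard deviations on either side. The regions beyond 2 standard deviations are labeled "Unusual" and the regions beyond 3 standard deviations are labeled "Outliers".]

Standardized Data

The Empirical Rule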


A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population:

$z_i = \dfrac{x_i - \mu}{\sigma}$

Standardization formula for a sample:

$z_i = \dfrac{x_i - \bar{x}}{s}$

Standardized Data

Defining a Standardized Variable


$z_i$ tells how far away the observation is from the mean.

For example, for the P/E data, the first value is $x_1 = 7$. The associated z value is

$z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{7 - 22.72}{14.08} = -1.12$

Standardized Data


A negative z value means the observation is below the mean.

Positive z means the observation is above the mean. For $x_{68} = 91$,

$z_{68} = \dfrac{x_{68} - \bar{x}}{s} = \dfrac{91 - 22.72}{14.08} = 4.85$

Standardized Data


Here are the standardized z values for the P/E data:

Standardized Data

What do you conclude for these four values?
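The standardization of the P/E data can be sketched in Python as follows. The variable names are illustrative. The code uses the sample formula (mean and sample standard deviation with an n - 1 divisor), then lists the values more than 2 and more than 3 standard deviations from the mean.

```python
import statistics

# P/E ratios for the 68 stocks listed in the example
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
    13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
    16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
    25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
    37, 40, 41, 45, 48, 55, 68, 91,
]

mean = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation (n - 1 divisor)

# Standardize every observation: z_i = (x_i - mean) / s
z = [(x - mean) / s for x in pe]

# Values more than 2 (unusual) and more than 3 (outlier) standard deviations away
beyond_2 = [x for x, zi in zip(pe, z) if abs(zi) > 2]
beyond_3 = [x for x, zi in zip(pe, z) if abs(zi) > 3]

print(f"mean = {mean:.2f}, s = {s:.2f}")
print(f"z for 7 = {z[0]:.2f}, z for 91 = {z[-1]:.2f}")
print("beyond 2 standard deviations:", beyond_2)
print("beyond 3 standard deviations:", beyond_3)
```

Under the definitions given earlier, every outlier is also an unusual observation, so the second list is a subset of the first.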


In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate a standardized z value for the observation x.

MegaStat calculates standardized values and also checks for outliers.

Standardized Data



What do we do with outliers in a data set?

If due to erroneous data, then discard.

An outrageous observation (one completely outside of an expected range) is certainly invalid.

Recognize unusual data points and outliers and their potential impact on your study.

Research books and articles on how to handle outliers.

Standardized Data

Outliers


For a normal distribution, the range of values is about $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).

If you know the range $R$ (high - low), you can estimate the standard deviation as $\sigma \approx R/6$.

Useful for approximating the standard deviation when only $R$ is known.

This estimate depends on the assumption of normality.

Standardized Data

Estimating Sigma
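As a quick arithmetic illustration of the R/6 estimate, the sketch below applies it to the smallest and largest P/E ratios from this chapter's example (7 and 91). The function name is made up for illustration. The P/E data are right-skewed, so the normality assumption does not really hold here and any agreement with the sample standard deviation (14.08) should not be taken as typical.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Rough estimate of the standard deviation as range / 6 (assumes normality)."""
    return (high - low) / 6


pe_low, pe_high = 7, 91  # minimum and maximum of the P/E example data
print(estimate_sigma_from_range(pe_low, pe_high))
```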


Percentiles are data that have been divided into 100 groups.

For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

Deciles are data that have been divided into 10 groups.

Quintiles are data that have been divided into 5 groups.

Quartiles are data that have been divided into 4 groups.

Percentiles


Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use the 5th, 25th, 50th, 75th, and 90th percentiles).

Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

Percentiles are used in employee merit evaluation and salary benchmarking.

Percentiles


Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

Quartiles


The second quartile Q2 is the median, an important indicator of central tendency.

Lower 50% | Q2 | Upper 50%

Q1 and Q3 measure dispersion since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.

Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%

Quartiles


The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.

Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

For the first half of the data, 50% lie above and 50% lie below Q1.

For the second half of the data, 50% lie above and 50% lie below Q3.

Quartiles


Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

Quartiles


For small data sets, find quartiles using the method of medians:

Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie below Q2.
Step 4. Find the median of the data values that lie above Q2.

Method of Medians
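The four steps can be sketched in Python as below. One assumption is made explicit in the code: "below Q2" and "above Q2" are taken by position in the sorted data (lower half and upper half, leaving out the middle observation when n is odd). The function name is illustrative. The sample data are the two small data sets that appear in the Caution slide of this chapter.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians (needs at least 2 values)."""
    x = sorted(data)                      # Step 1: sort the observations
    n = len(x)
    q2 = statistics.median(x)             # Step 2: median of all values
    half = n // 2
    lower = x[:half]                      # observations positioned below the median
    # Assumption: when n is odd, the middle observation belongs to neither half.
    upper = x[half:] if n % 2 == 0 else x[half + 1:]
    q1 = statistics.median(lower)         # Step 3: median of the lower half
    q3 = statistics.median(upper)         # Step 4: median of the upper half
    return q1, q2, q3


print(quartiles_by_medians([1, 2, 4, 4, 8, 8, 8, 8]))
print(quartiles_by_medians([0, 3, 3, 6, 6, 6, 10, 15]))
```

Other conventions for the odd-n case exist (for example, including the median in both halves), so results for small samples can differ slightly between tools.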


Use Excel function =QUARTILE(Array, k) to return the kth quartile.

Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:

=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)

Excel calculates the quartile positions as:

Position of Q1 = 0.25n + 0.75
Position of Q2 = 0.50n + 0.50
Position of Q3 = 0.75n + 0.25

Excel Quartiles


Consider the following P/E ratios for 68 stocks in a portfolio.

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

Example: P/E Ratios and Quartiles


Using Excel's method of interpolation, the quartile positions are:

Quartile | Position Formula | Interpolate Between
Q1 | 0.25(68) + 0.75 = 17.75 | X17 and X18
Q2 | 0.50(68) + 0.50 = 34.50 | X34 and X35
Q3 | 0.75(68) + 0.25 = 51.25 | X51 and X52


The quartiles are:

Quartile | Formula
First (Q1) | Q1 = X17 + 0.75 (X18 - X17) = 14 + 0.75 (14 - 14) = 14
Second (Q2) | Q2 = X34 + 0.50 (X35 - X34) = 19 + 0.50 (19 - 19) = 19
Third (Q3) | Q3 = X51 + 0.25 (X52 - X51) = 26 + 0.25 (26 - 26) = 26
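The position-and-interpolate calculation can be sketched in Python. The function below follows the position formulas given for Excel in this chapter (0.25n + 0.75, 0.50n + 0.50, 0.75n + 0.25); the name `excel_quartile` is a label chosen here and is not an Excel or Python library function.

```python
def excel_quartile(data, k):
    """k-th quartile (k = 1, 2, 3) by linear interpolation between sorted values."""
    x = sorted(data)
    n = len(x)
    pos = (k / 4) * n + (1 - k / 4)   # 1-based position, e.g. 0.25n + 0.75 for Q1
    lo = int(pos)                     # 1-based index of the value just below pos
    frac = pos - lo                   # fractional distance toward the next value
    if lo >= n:                       # only possible when n == 1
        return float(x[-1])
    return x[lo - 1] + frac * (x[lo] - x[lo - 1])


# P/E ratios for the 68 stocks in the example
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

for k in (1, 2, 3):
    print(f"Q{k} = {excel_quartile(pe, k)}")
```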


So, to summarize:

Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios

These quartiles express central tendency and dispersion. What is the interquartile range?

Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.




Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Tip


Quartiles generally resist outliers.

However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.

Data set A: 1, 2, 4, 4, 8, 8, 8, 8    Q1 = 3, Q2 = 6, Q3 = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15    Q1 = 3, Q2 = 6, Q3 = 8

Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

Caution


Some robust measures of central tendency and dispersion using quartiles are:

Statistic | Formula | Excel | Pro | Con
Midhinge | (Q1 + Q3)/2 | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.

Dispersion Using Quartiles


Statistic | Formula | Excel | Pro | Con
Midspread | Q3 - Q1 | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.
Coefficient of quartile variation (CQV) | 100 × (Q3 - Q1)/(Q3 + Q1) | None | Relative variation in percent so we can compare data sets. | Less familiar to non-statisticians.

Dispersion Using Quartiles


The mean of the first and third quartiles.

Midhinge = $\dfrac{Q_1 + Q_3}{2}$

For the 68 P/E ratios,

Midhinge = $\dfrac{Q_1 + Q_3}{2} = \dfrac{14 + 26}{2} = 20$

A robust measure of central tendency since quartiles ignore extreme values.

Midhinge


A robust measure of dispersion.

Midspread = $Q_3 - Q_1$

Midspread = $Q_3 - Q_1 = 26 - 14 = 12$

Midspread (Interquartile Range)


Measures relative dispersion by expressing the midspread as a percent of the sum of the quartiles (twice the midhinge).

$CQV = 100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

$CQV = 100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \dfrac{26 - 14}{26 + 14} = 30.0\%$

Similar to the CV, CQV can be used to compare data sets measured in different units or with different ...

Coefficient of Quartile Variation (CQV)
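The three quartile-based statistics (midhinge, midspread, and CQV) can be checked together with a short Python sketch. The function names are illustrative, and the inputs are the P/E quartiles from this chapter (Q1 = 14, Q3 = 26).

```python
def midhinge(q1: float, q3: float) -> float:
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1: float, q3: float) -> float:
    """Interquartile range Q3 - Q1."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(f"midhinge  = {midhinge(q1, q3)}")
print(f"midspread = {midspread(q1, q3)}")
print(f"CQV       = {cqv(q1, q3):.1f}%")
```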


A useful tool of exploratory data analysis (EDA).

Also called a box-and-whisker plot.

Based on a five-number summary:

Xmin, Q1, Q2, Q3, Xmax

Consider the five-number summary for the 68 P/E ratios:

Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91

Box Plots


[Figure: box plot of the P/E ratios with the Minimum, Q1, Median (Q2), Q3, and Maximum marked and the Box and Whiskers labeled. The plot is right-skewed. The center of the box is the midhinge.]

Box Plots


Use quartiles to detect unusual data points.

The cutoff points are called fences and can be found using the following formulas:

Fence | Inner fences | Outer fences
Lower fence | Q1 - 1.5 (Q3 - Q1) | Q1 - 3.0 (Q3 - Q1)
Upper fence | Q3 + 1.5 (Q3 - Q1) | Q3 + 3.0 (Q3 - Q1)

Values outside the inner fences are unusual while those outside the outer fences are outliers.

Box Plots

Fences and Unusual Data Values


For example, consider the P/E ratio data:

Fence | Inner fences | Outer fences
Lower fence | 14 - 1.5 (26 - 14) = -4 | 14 - 3.0 (26 - 14) = -22
Upper fence | 26 + 1.5 (26 - 14) = +44 | 26 + 3.0 (26 - 14) = +62

Ignore the lower fence since it is negative and P/E ratios are only positive.

Box Plots


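Truncate the whisker at the fences and display unusual values and outliers as dots.

[Figure: box plot of the P/E ratios with the Inner Fence and Outer Fence marked. Points between the fences are labeled Unusual and points beyond the outer fence are labeled Outliers.]

Box Plots

Based on these fences, there are three unusual P/E values and two outliers.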
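The fence calculation and the resulting classification of the P/E data can be sketched in Python. The function and variable names are illustrative. Here "unusual" is taken to mean outside an inner fence but not outside the outer fence, so that the two lists do not overlap.

```python
def fences(q1: float, q3: float):
    """Return ((inner_low, inner_high), (outer_low, outer_high)) from the quartiles."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


# P/E ratios for the 68 stocks in the example
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

inner, outer = fences(14, 26)  # Q1 = 14, Q3 = 26 for the P/E data

# Outside an inner fence but still inside the outer fences
unusual = [x for x in pe
           if (outer[0] <= x < inner[0]) or (inner[1] < x <= outer[1])]
# Outside an outer fence
outliers = [x for x in pe if x < outer[0] or x > outer[1]]

print("inner fences:", inner)
print("outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```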


Although some information is lost, grouped data are easier to display than raw data.

When bin limits are given, the mean and standard deviation can be estimated.

Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies

Grouped Data

Nature of Grouped Data


Consider the frequency distribution for prices of Lipitor for three cities:

Where
$m_j$ = class midpoint, $f_j$ = class frequency
$k$ = number of classes, $n$ = sample size

Grouped Data

Mean and Standard Deviation


Estimate the mean and standard deviation by

$\bar{x} = \sum_{j=1}^{k} \dfrac{f_j m_j}{n} = \dfrac{3427.5}{47} = 72.92552$

$s = \sqrt{\sum_{j=1}^{k} \dfrac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{2091.48936}{47 - 1}} = 6.74293$

Note: don't round off too soon.

Grouped Data
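The grouped-data formulas can be sketched in Python as below. The Lipitor frequency table is not reproduced in this text, so the bins and frequencies in the example are hypothetical values chosen only to make the arithmetic easy to follow. The function name is also illustrative.

```python
import math


def grouped_mean_sd(bins, freqs):
    """Estimate the mean and sample standard deviation from binned data.

    bins  : list of (lower, upper) class limits
    freqs : list of class frequencies f_j
    """
    mids = [(lo + hi) / 2 for lo, hi in bins]            # class midpoints m_j
    n = sum(freqs)                                       # sample size
    mean = sum(f * m for f, m in zip(freqs, mids)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids))
    return mean, math.sqrt(ss / (n - 1))


# Hypothetical frequency table (NOT the Lipitor data)
bins = [(5, 15), (15, 25), (25, 35)]
freqs = [1, 2, 1]

mean, s = grouped_mean_sd(bins, freqs)
cv = 100 * s / mean  # coefficient of variation, in percent
print(f"mean = {mean}, s = {s:.5f}, CV = {cv:.1f}%")
```

Intermediate values are kept at full precision and only rounded when printed, in line with the note about not rounding off too soon.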


Now estimate the coefficient of variation

$CV = 100 (s / \bar{x}) = 100 (6.74293 / 72.92552) = 9.2\%$

How accurate are grouped estimates compared to ungrouped estimates?

For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.

Grouped Data

Accuracy Issues


For this example, very little information was lost due to grouping.

However, accuracy could be lost due to the nature of the grouping (i.e., if the data values were not evenly spaced within bins).

Grouped Data

Accuracy Issues


The dot plot shows a relatively even distribution within the bins.

Effects of uneven distributions within bins tend to average out unless there is systematic skewness.

Grouped Data

Accuracy Issues


Accuracy tends to improve as the number of bins increases.

If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).

Assume a lower limit of zero for the first class when the data are nonnegative.

You may be able to assume an upper limit for some variables (e.g., age).

Median and quartiles may be estimated even with open-ended classes.

Grouped Data

Accuracy Issues


Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median.

This visual indicator is imprecise and does not take into consideration sample size n.

Skewness



Skewness

Skewness is a unit-free statistic.

The coefficient can be used to compare two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).

Calculate the sample's skewness coefficient as:

$\text{Skewness} = \dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^3$
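The skewness coefficient can be sketched in Python directly from the formula. The function name `sample_skewness` and the two small data sets are illustrative and are not taken from the text.

```python
import statistics


def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total


print(sample_skewness([1, 2, 3, 4, 5]))    # symmetric data: essentially zero
print(sample_skewness([1, 2, 3, 4, 10]))   # long right tail: positive
```

Because each deviation is divided by s before cubing, the result has no units, which is what allows comparison across samples measured on different scales.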



In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(array).

Skewness



Consider the following table showing the 90% range for the sample skewness coefficient.

Coefficients within the 90% range may be attributed to random variation.

Coefficients outside the range suggest the sample came from a nonnormal population.

As n increases, the range of chance variation narrows.

Skewness



Kurtosis is the relative length of the tails and the degree of concentration in the center.

Consider three kurtosis prototype shapes.

Kurtosis



A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.

Excel and MINITAB calculate kurtosis as:

$\text{Kurtosis} = \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}$

Kurtosis
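The kurtosis formula can be sketched in Python in the same way. The function name `sample_kurtosis` and the test data are illustrative. With this formula a normal population has an expected coefficient near zero, so negative values indicate a flatter shape and positive values a more peaked, heavier-tailed shape.

```python
import statistics


def sample_kurtosis(data):
    """Sample kurtosis using the bias-adjusted formula shown above."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - mean) / s) ** 4 for x in data)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second


print(sample_kurtosis([1, 2, 3, 4, 5]))
```

For the evenly spaced values 1 through 5 the result works out to -1.2 by hand: the sum of fourth powers of the standardized deviations is 5.44, the first term is 1.25 × 5.44 = 6.8, and the second term is 8.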



Consider the following table of expected 90% range for sample kurtosis coefficient.

A sample coefficient within the ranges may be attributed to chance variation.

Coefficients outside the range would suggest the sample differs from a normal population.

As sample size increases, the chance range narrows.

Inferences about kurtosis are risky for n < 50.

Kurtosis


Applied Statistics in Business and Economics

End of Chapter 4