doane chapter 04b
TRANSCRIPT
-
7/30/2019 Doane Chapter 04B
Descriptive Statistics (Part 2)
Standardized Data
Percentiles and Quartiles
Box Plots
Grouped Data
Skewness and Kurtosis (optional)
Chapter 4
-
For any population with mean $\mu$ and standard
deviation $\sigma$, the percentage of observations that lie
within $k$ standard deviations of the mean must be at
least $100\,[1 - 1/k^2]$, for any $k > 1$.
Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
Standardized Data
Chebyshev's Theorem
-
For $k = 2$ standard deviations,
$100\,[1 - 1/2^2] = 75\%$
So, at least 75.0% will lie within $\mu \pm 2\sigma$.
For $k = 3$ standard deviations,
$100\,[1 - 1/3^2] = 88.9\%$
So, at least 88.9% will lie within $\mu \pm 3\sigma$.
Although applicable to any data set, these limits
tend to be too wide to be useful.
Standardized Data
Chebyshev's Theorem
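The Chebyshev limits above can be checked with a few lines of Python. This is a minimal sketch; the function name `chebyshev_lower_bound` is an illustrative choice, not something from the slides.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100.0 * (1.0 - 1.0 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}%")
```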
-
The normal or Gaussian distribution was named for
Karl Gauss (1777-1855).
The normal distribution is symmetric and is also known as the bell-shaped curve.
The Empirical Rule states that for data from a
normal distribution, we expect that for
$k = 1$, about 68.26% will lie within $\mu \pm 1\sigma$
$k = 2$, about 95.44% will lie within $\mu \pm 2\sigma$
$k = 3$, about 99.73% will lie within $\mu \pm 3\sigma$
Standardized Data
The Empirical Rule
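The Empirical Rule percentages can be reproduced from the normal distribution itself. The sketch below uses the standard identity P(|Z| < k) = erf(k / sqrt(2)); the function name `normal_within` is illustrative. The computed values agree with the percentages quoted above to within about 0.01 percentage point.

```python
import math


def normal_within(k: float) -> float:
    """Percentage of a normal population within k standard deviations of the mean."""
    return 100.0 * math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_within(k):.2f}%")
```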
-
Note: no upper bound is given.
Data values outside $\mu \pm 3\sigma$ are rare.
Distance from the mean is measured in terms of
the number of standard deviations.
Standardized Data
The Empirical Rule
-
If 80 students take an exam, how many will score
within 2 standard deviations of the mean?
Assuming exam scores follow a normal distribution, the Empirical Rule states that
about 95.44% will lie within $\mu \pm 2\sigma$, so 95.44% × 80 ≈ 76 students will score within $\pm 2\sigma$ of $\mu$.
How many students will score more than 2
standard deviations from the mean?
Standardized Data
Example: Exam Scores
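The arithmetic for this example can be sketched as follows, assuming normally distributed scores as the slide does. The variable names are illustrative.

```python
n_students = 80
share_within_2s = 0.9544  # Empirical Rule percentage for k = 2

within = n_students * share_within_2s  # expected count within 2 standard deviations
beyond = n_students - within           # expected count more than 2 standard deviations away

print(f"within 2 standard deviations: about {round(within)} students")
print(f"beyond 2 standard deviations: about {round(beyond)} students")
```

Under these assumptions, about 76 students fall within two standard deviations and about 4 fall outside.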
-
Unusual observations are those that lie beyond
$\mu \pm 2\sigma$. Outliers are observations that lie beyond
$\mu \pm 3\sigma$.
Standardized Data
Unusual Observations
-
For example, the P/E ratio data contains several
large data values. Are they unusual or outliers?
7 8 8 10 10 10 10 12 13 13 13 13
13 13 13 14 14 14 15 15 15 15 15 16
16 16 17 18 18 18 18 19 19 19 19 19
20 20 20 21 21 21 22 22 23 23 23 24
25 26 26 26 26 27 29 29 30 31 34 36
37 40 41 45 48 55 68 91
Standardized Data
Unusual Observations
-
If the sample came from a normal distribution, then
the Empirical Rule states
$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$
$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$
Standardized Data
The Empirical Rule
-
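[Figure: number line for the P/E data centered at the mean 22.72, marking the 1, 2, and 3 standard deviation limits; values beyond the 2s limits are labeled Unusual and values beyond the 3s limits are labeled Outliers]
Standardized Data
The Empirical Rule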
Are there any unusual values or outliers?
7 8 . . . 48 55 68 91
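One way to answer the question is to compute the sample mean and standard deviation of the 68 P/E ratios and flag values by their distance from the mean. This is a sketch using Python's standard `statistics` module; the names `pe`, `unusual`, and `outliers` are illustrative.

```python
import statistics

pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13,
      13, 13, 13, 14, 14, 14, 15, 15, 15, 15, 15, 16,
      16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
      20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24,
      25, 26, 26, 26, 26, 27, 29, 29, 30, 31, 34, 36,
      37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation (divisor n - 1)

# Unusual: more than 2 but not more than 3 standard deviations from the mean.
unusual = [x for x in pe if 2 * s < abs(x - mean) <= 3 * s]
# Outliers: more than 3 standard deviations from the mean.
outliers = [x for x in pe if abs(x - mean) > 3 * s]

print(f"mean = {mean:.2f}, s = {s:.2f}")
print("unusual:", unusual)
print("outliers:", outliers)
```

With these data, 55 is unusual and 68 and 91 are outliers.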
-
A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.
Standardization formula for a population:
$z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample:
$z_i = \frac{x_i - \bar{x}}{s}$
Standardized Data
Defining a Standardized Variable
-
$z_i$ tells how far away the observation is from the
mean.
For example, for the P/E data, the first value is $x_1 = 7$.
The associated $z$ value is
$z_i = \frac{x_i - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
Standardized Data
Defining a Standardized Variable
-
A negative $z$ value means the observation is below
the mean.
Positive $z$ means the observation is above the
mean. For $x_{68} = 91$,
$z_i = \frac{x_i - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
Standardized Data
Defining a Standardized Variable
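The two standardized values for the P/E data can be reproduced with a small helper. This is a sketch; the function name `standardize` is illustrative, and the mean 22.72 and standard deviation 14.08 are the rounded sample statistics for the P/E data.

```python
def standardize(x: float, mean: float, sd: float) -> float:
    """Return the number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


x_bar, s = 22.72, 14.08  # rounded sample mean and standard deviation of the P/E data

z_first = standardize(7, x_bar, s)   # smallest P/E ratio
z_last = standardize(91, x_bar, s)   # largest P/E ratio

print(f"z for x = 7:  {z_first:.2f}")
print(f"z for x = 91: {z_last:.2f}")
```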
-
Here are the standardized $z$ values for the P/E
data:
Standardized Data
Defining a Standardized Variable
What do you conclude for these four values?
-
In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a
standardized $z$ value.
MegaStat calculates standardized values and also
checks for outliers.
Standardized Data
Defining a Standardized Variable
-
What do we do with outliers in a data set?
If due to erroneous data, then discard.
An outrageous observation (one completely outside
of an expected range) is certainly invalid.
Recognize unusual data points and outliers and
their potential impact on your study.
Research books and articles on how to handle
outliers.
Standardized Data
Outliers
-
For a normal distribution, the range of values is about $6\sigma$
(from $\mu - 3\sigma$ to $\mu + 3\sigma$).
If you know the range $R$ (high - low), you can
estimate the standard deviation as $\sigma = R/6$.
Useful for approximating the standard deviation
when only $R$ is known.
This estimate depends on the assumption of
normality.
Standardized Data
Estimating Sigma
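As a quick illustration, the R/6 rule can be applied to the P/E data from earlier in the chapter, whose smallest and largest values are 7 and 91. This is a sketch; the function name is illustrative. The result happens to be close to the sample standard deviation of 14.08, but since the estimate assumes normality, such agreement should not be expected in general.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Rough estimate of the standard deviation as range / 6 (assumes normality)."""
    return (high - low) / 6.0


sigma_hat = estimate_sigma_from_range(7, 91)
print(f"estimated sigma = {sigma_hat:.2f}")
```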
-
Percentiles are data that have been divided into
100 groups.
For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
Deciles are data that have been divided into
10 groups.
Quintiles are data that have been divided into
5 groups.
Quartiles are data that have been divided into
4 groups.
Percentiles and Quartiles
Percentiles
-
Percentiles are used to establish benchmarks for
comparison purposes (e.g., health care,
manufacturing and banking industries use 5, 25,
50, 75 and 90 percentiles).
Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
Percentiles are used in employee merit evaluation
and salary benchmarking.
Percentiles and Quartiles
Percentiles
-
Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.
The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
Percentiles and Quartiles
Quartiles
-
The second quartile Q2 is the median, an important
indicator of central tendency.
Q1 and Q3 measure dispersion since the
interquartile range Q3 - Q1 measures the degree of
spread in the middle 50 percent of data values.
Lower 50% | Q2 | Upper 50%
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
Percentiles and Quartiles
Quartiles
-
The first quartile Q1 is the median of the data
values below Q2, and the third quartile Q3 is the
median of the data values above Q2.
Q1 Q2 Q3
Lower 25% | Second 25% | Third 25% | Upper 25%
For first half of data,
50% above,
50% below Q1.
For second half of data,
50% above,
50% below Q3.
Percentiles and Quartiles
Quartiles
-
Depending on n, the quartiles Q1, Q2, and Q3 may
be members of the data set or may lie between
two of the sorted data values.
Percentiles and Quartiles
Quartiles
-
For small data sets, find quartiles using the method of
medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie
below Q2.
Step 4. Find the median of the data values that lie
above Q2.
Percentiles and Quartiles
Method of Medians
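The four steps can be sketched in Python as below. One assumption is made that the slide does not spell out: the values "below" and "above" Q2 are taken by position in the sorted list (the lower and upper halves, leaving out the middle value when n is odd), which matters when several values tie with the median. The function name is illustrative, and at least two observations are required.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                  # Step 1: sort the observations
    n = len(xs)
    half = n // 2
    q2 = statistics.median(xs)         # Step 2: median of all values
    lower = xs[:half]                  # positions below the median
    upper = xs[half + (n % 2):]        # positions above the median
    q1 = statistics.median(lower)      # Step 3: median of the lower half
    q3 = statistics.median(upper)      # Step 4: median of the upper half
    return q1, q2, q3


sample = [1, 2, 4, 4, 8, 8, 8, 8]     # a small eight-value sample
print(quartiles_by_medians(sample))
```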
-
Use Excel function =QUARTILE(Array, k) to return
the kth quartile.
Excel treats quartiles as a special case of
percentiles. For example, to calculate Q3, use either
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
Excel calculates the quartile positions as:
Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25
Percentiles and Quartiles
Excel Quartiles
-
Consider the following P/E ratios for 68 stocks in a
portfolio.
Use quartiles to define benchmarks for stocks that
are low-priced (bottom quartile) or high-priced (top
quartile).
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
-
Using Excel's method of interpolation, the quartile
positions are:
Quartile | Position Formula | Interpolate Between
Q1 | 0.25(68) + 0.75 = 17.75 | X17 and X18
Q2 | 0.50(68) + 0.50 = 34.50 | X34 and X35
Q3 | 0.75(68) + 0.25 = 51.25 | X51 and X52
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
-
The quartiles are:
Quartile | Formula
First (Q1) | Q1 = X17 + 0.75 (X18 - X17) = 14 + 0.75 (14 - 14) = 14
Second (Q2) | Q2 = X34 + 0.50 (X35 - X34) = 19 + 0.50 (19 - 19) = 19
Third (Q3) | Q3 = X51 + 0.25 (X52 - X51) = 26 + 0.25 (26 - 26) = 26
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
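The interpolation can be sketched in Python using the position formulas given for Excel, written generally as position = q·n + (1 - q) for q = 0.25, 0.50, 0.75. The function name `excel_style_quartile` is illustrative and is not an Excel or Python library function; the input must already be sorted.

```python
def excel_style_quartile(sorted_values, q):
    """Interpolated quantile at 1-based position q*n + (1 - q), for 0 <= q <= 1."""
    n = len(sorted_values)
    position = q * n + (1 - q)      # e.g. 0.25n + 0.75 for Q1
    lower = int(position)           # index (1-based) of the value just below
    fraction = position - lower     # how far to move toward the next value
    if lower >= n:                  # position is the last observation
        return float(sorted_values[-1])
    x_lo = sorted_values[lower - 1]
    x_hi = sorted_values[lower]
    return x_lo + fraction * (x_hi - x_lo)


pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

quartiles = [excel_style_quartile(pe, q) for q in (0.25, 0.50, 0.75)]
print(quartiles)
```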
-
So, to summarize:
Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios
These quartiles express central tendency and
dispersion. What is the interquartile range?
Because of clustering of identical data values,
these quartiles do not provide clean cut points
between groups of observations.
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
-
Whether you use the method of
medians or Excel, your quartiles will be
about the same. Small differences in calculation techniques typically do not
lead to different conclusions in
business applications.
Percentiles and Quartiles
Tip
-
Quartiles generally resist outliers.
However, quartiles do not provide clean cut points
in the sorted data, especially in small samples with
repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8 | Q1 = 3, Q2 = 6, Q3 = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15 | Q1 = 3, Q2 = 6, Q3 = 8
Although they have identical quartiles, these two
data sets are not similar. The quartiles do not
represent either data set well.
Percentiles and Quartiles
Caution
-
Some robust measures of central tendency and
dispersion using quartiles are:
Statistic | Formula | Excel | Pro | Con
Midhinge | $\frac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.
Percentiles and Quartiles
Dispersion Using Quartiles
-
Statistic | Formula | Excel | Pro | Con
Midspread | Q3 - Q1 | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.
Coefficient of quartile variation (CQV) | $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to non-statisticians.
Percentiles and Quartiles
Dispersion Using Quartiles
-
The mean of the first and third quartiles.
Midhinge = $\frac{Q_1 + Q_3}{2}$
For the 68 P/E ratios,
Midhinge = $\frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
A robust measure of central tendency since
quartiles ignore extreme values.
Percentiles and Quartiles
Midhinge
-
A robust measure of dispersion.
Midspread = Q3 - Q1
For the 68 P/E ratios,
Midspread = Q3 - Q1 = 26 - 14 = 12
Percentiles and Quartiles
Midspread (Interquartile Range)
-
Measures relative dispersion; expresses the
midspread as a percent of the sum of the quartiles (twice the midhinge).
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
For the 68 P/E ratios,
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
Similar to the CV, CQV can be used to compare
data sets measured in different units or with
different means.
Percentiles and Quartiles
Coefficient of Quartile Variation (CQV)
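The three quartile-based statistics can be computed together. This is a minimal sketch using the P/E quartiles Q1 = 14 and Q3 = 26; the function names are illustrative.

```python
def midhinge(q1: float, q3: float) -> float:
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1: float, q3: float) -> float:
    """Interquartile range, Q3 - Q1."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26
print(midhinge(q1, q3), midspread(q1, q3), cqv(q1, q3))
```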
-
A useful tool of exploratory data analysis (EDA).
Also called a box-and-whisker plot.
Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
Consider the five-number summary for the
68 P/E ratios:
Xmin, Q1, Q2, Q3, Xmax
7, 14, 19, 26, 91
Box Plots
-
[Figure: box plot labeled with Minimum, Q1, Median (Q2), Q3, and Maximum, showing the Box and Whiskers; the distribution is right-skewed]
Center of Box is Midhinge
Box Plots
-
Use quartiles to detect unusual data points.
The cutoff points are called fences and can be found
using the following formulas:
Fence | Inner fences | Outer fences
Lower fence | Q1 - 1.5 (Q3 - Q1) | Q1 - 3.0 (Q3 - Q1)
Upper fence | Q3 + 1.5 (Q3 - Q1) | Q3 + 3.0 (Q3 - Q1)
Values outside the inner fences are unusual while
those outside the outer fences are outliers.
Box Plots
Fences and Unusual Data Values
-
For example, consider the P/E ratio data:
Fence | Inner fences | Outer fences
Lower fence | 14 - 1.5 (26 - 14) = -4 | 14 - 3.0 (26 - 14) = -22
Upper fence | 26 + 1.5 (26 - 14) = +44 | 26 + 3.0 (26 - 14) = +62
Ignore the lower fence since it is negative and P/E
ratios are only positive.
Box Plots
Fences and Unusual Data Values
-
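Truncate the whisker at the fences and display
unusual values and outliers as dots.
[Figure: box plot with the Inner Fence and Outer Fence marked; points beyond the inner fence are labeled Unusual and points beyond the outer fence are labeled Outliers]
Box Plots
Fences and Unusual Data Values
Based on these fences, there are three unusual
P/E values and two outliers.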
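The fence calculation and the resulting counts can be checked with a short script. This is a sketch using the P/E quartiles Q1 = 14 and Q3 = 26; the variable names are illustrative.

```python
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

q1, q3 = 14, 26
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences (lower, upper)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences (lower, upper)

# Outliers lie beyond an outer fence; unusual values lie beyond an inner
# fence but not beyond an outer fence.
outliers = [x for x in pe if x < outer[0] or x > outer[1]]
unusual = [x for x in pe
           if (x < inner[0] or x > inner[1]) and outer[0] <= x <= outer[1]]

print("inner fences:", inner)
print("outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```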
-
Although some information is lost, grouped data
are easier to display than raw data.
When bin limits are given, the mean and standard
deviation can be estimated.
Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies
Grouped Data
Nature of Grouped Data
-
Consider the frequency distribution for prices of
Lipitor for three cities:
Grouped Data
Mean and Standard Deviation
Where
$m_j$ = class midpoint, $f_j$ = class frequency,
$k$ = number of classes, $n$ = sample size
-
Estimate the mean and standard deviation by
$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92552$
$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
Note: don't round off too soon.
Grouped Data
Nature of Grouped Data
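The grouped-data formulas can be sketched in Python as follows. The Lipitor frequency table is not reproduced in this transcript, so the midpoints and frequencies below are hypothetical values chosen only to show the calculation; the function name is also illustrative.

```python
import math


def grouped_mean_sd(midpoints, freqs):
    """Estimate the mean and sample standard deviation from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
    sd = math.sqrt(ss / (n - 1))
    return mean, sd


# Hypothetical frequency distribution (NOT the Lipitor data).
midpoints = [65, 75, 85]   # class midpoints m_j
freqs = [2, 5, 3]          # class frequencies f_j

mean, sd = grouped_mean_sd(midpoints, freqs)
print(f"mean = {mean:.5f}, s = {sd:.5f}")
```

Rounding is left to the print statement, in line with the advice not to round off too soon.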
-
Now estimate the coefficient of variation
$CV = 100\,(s / \bar{x}) = 100\,(6.74293 / 72.92552) = 9.2\%$
How accurate are grouped estimates compared to
ungrouped estimates?
For the previous example, we can compare the
grouped data statistics to the ungrouped data
statistics.
Grouped Data
Nature of Grouped Data
Accuracy Issues
-
For this example, very little information was lost
due to grouping.
However, accuracy could be lost due to the nature
of the grouping (i.e., if the data values were not evenly
spaced within bins).
Grouped Data
Accuracy Issues
-
The dot plot shows a relatively even distribution
within the bins.
Effects of uneven distributions within bins tend to
average out unless there is systematic skewness.
Grouped Data
Accuracy Issues
-
Accuracy tends to improve as the number of bins
increases.
If the first or last class is open-ended, there will be
no class midpoint (no mean can be estimated).
Assume a lower limit of zero for the first class
when the data are nonnegative.
You may be able to assume an upper limit for some variables (e.g., age).
Median and quartiles may be estimated even with
open-ended classes.
Grouped Data
Accuracy Issues
-
Generally, skewness may be indicated by looking
at the sample histogram or by comparing the mean
and median.
This visual indicator is imprecise and does not take
into consideration sample size n.
Skewness and Kurtosis
Skewness
-
Skewness and Kurtosis
Skewness
Skewness is a unit-free statistic.
The coefficient can be used to compare two samples measured
in different units or one sample with a known
reference distribution (e.g., symmetric normal
distribution).
Calculate the sample's skewness coefficient as:
Skewness = $\frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
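The sample skewness coefficient, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, can be sketched in Python as below. The function name `sample_skewness` is illustrative; the formula needs at least three observations and a nonzero standard deviation.

```python
import statistics


def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


print(sample_skewness([1, 2, 3, 4, 5]))   # symmetric data: essentially zero
print(sample_skewness([0, 0, 3]))         # long right tail: positive
```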
-
In Excel, go to
Tools | Data Analysis |
Descriptive Statistics or
use the function =SKEW(array)
Skewness and Kurtosis
Skewness
-
Consider the following table showing the 90%
range for the sample skewness coefficient.
Skewness and Kurtosis
Skewness
-
Coefficients within the 90% range may be
attributed to random variation.
Skewness and Kurtosis
Skewness
-
Coefficients outside the range suggest the sample
came from a nonnormal population.
Skewness and Kurtosis
Skewness
-
As n increases, the range of chance variation
narrows.
Skewness and Kurtosis
Skewness
-
Kurtosis is the relative length of the tails and the
degree of concentration in the center.
Consider three kurtosis prototype shapes.
Skewness and Kurtosis
Kurtosis
-
A histogram is an unreliable guide to kurtosis since
scale and axis proportions may differ.
Excel and MINITAB calculate kurtosis as:
Kurtosis = $\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
Skewness and Kurtosis
Kurtosis
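The sample kurtosis coefficient, n(n + 1) / ((n - 1)(n - 2)(n - 3)) times the sum of fourth-power standardized deviations, minus 3(n - 1)^2 / ((n - 2)(n - 3)), can be sketched in Python as below. The function name `sample_kurtosis` is illustrative; the formula needs at least four observations and a nonzero standard deviation.

```python
import statistics


def sample_kurtosis(data):
    """Sample (excess) kurtosis with small-sample corrections; zero is the normal reference."""
    n = len(data)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction


print(sample_kurtosis([1, 2, 3, 4]))   # flat data: negative
print(sample_kurtosis([0, 0, 0, 4]))   # one extreme value: positive
```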
-
Consider the following table of expected 90%
range for sample kurtosis coefficient.
Skewness and Kurtosis
Kurtosis
-
A sample coefficient within the ranges may be
attributed to chance variation.
Skewness and Kurtosis
Kurtosis
-
Coefficients outside the range would suggest the
sample differs from a normal population.
Skewness and Kurtosis
Kurtosis
-
As sample size increases, the chance range
narrows.
Inferences about kurtosis are risky for n < 50.
Kurtosis
-
Applied Statistics in Business and Economics
End of Chapter 4