4b-1. descriptive statistics (part 2) standardized data standardized data percentiles and quartiles...

Descriptive Statistics (Part 2)

Standardized Data

Percentiles and Quartiles

Box Plots

Chapter 4B

McGraw-Hill/Irwin © 2008 The McGraw-Hill Companies, Inc. All rights reserved.

4B-3

Standardized Data

Chebyshev’s Theorem

• For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean must be at least $100[1 - 1/k^2]$.

• Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

4B-4

Chebyshev’s Theorem

• For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$.

• So, at least 75.0% will lie within $\mu \pm 2\sigma$.

• For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$.

• So, at least 88.9% will lie within $\mu \pm 3\sigma$.

• Although applicable to any data set, these limits tend to be too wide to be useful.
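The Chebyshev bound for any k can be checked with a few lines of Python. This is a minimal sketch; the function name `chebyshev_lower_bound` is an illustrative choice, not something from the slides.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives 0% or less, so it says nothing useful.
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}% within k standard deviations")
```

The guard for k ≤ 1 is an added assumption: the formula still evaluates there, but the resulting bound is trivially true.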

4B-5

The Empirical Rule

• The normal or Gaussian distribution was named for Karl Gauss (1777-1855).

• The normal distribution is symmetric and is also known as the bell-shaped curve.

• The Empirical Rule states that for data from a normal distribution, we expect that for

k = 1: about 68.26% will lie within $\mu \pm 1\sigma$

k = 2: about 95.44% will lie within $\mu \pm 2\sigma$

k = 3: about 99.73% will lie within $\mu \pm 3\sigma$
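The Empirical Rule percentages can be reproduced from the normal distribution itself. The sketch below uses the identity P(|Z| < k) = erf(k/√2), which is a standard result rather than something stated on the slides; the helper name `normal_coverage` is illustrative. The last digit can differ slightly from the table-based figures quoted above.

```python
import math


def normal_coverage(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {100 * normal_coverage(k):.2f}% within k standard deviations")
```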

4B-6

Note: no upper bound is given. Data values outside $\mu \pm 3\sigma$ are rare.

• Distance from the mean is measured in terms of the number of standard deviations.



4B-7

Example: Exam Scores

• If 80 students take an exam, how many will score within 2 standard deviations of the mean?

• Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within $\mu \pm 2\sigma$, so 95.44% × 80 ≈ 76 students will score within $\pm 2\sigma$ of $\mu$.

• How many students will score more than 2 standard deviations from the mean?
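The arithmetic for the exam example can be sketched as follows, assuming (as the slide does) that scores are normally distributed. The variable names are illustrative.

```python
n_students = 80
share_within_2sd = 0.9544  # empirical-rule figure for k = 2

# 80 * 0.9544 = 76.352, which rounds to 76 students
within = round(n_students * share_within_2sd)
# Everyone else lies more than 2 standard deviations from the mean
beyond = n_students - within

print(f"within 2 standard deviations: {within}")
print(f"beyond 2 standard deviations: {beyond}")
```

Under these assumptions, about 4 of the 80 students would be expected to score more than 2 standard deviations from the mean.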

4B-8

Unusual Observations

• Unusual observations are those that lie beyond $\mu \pm 2\sigma$.

• Outliers are observations that lie beyond $\mu \pm 3\sigma$.

4B-9

Unusual Observations

• For example, the P/E ratio data contains several large data values. Are they unusual or outliers?

7 8 8 10 10 10 10 12 13 13 13 13

13 13 13 14 14 14 15 15 15 15 15 16

16 16 17 18 18 18 18 19 19 19 19 19

20 20 20 21 21 21 22 22 23 23 23 24

25 26 26 26 26 27 29 29 30 31 34 36

37 40 41 45 48 55 68 91

4B-10

• If the sample came from a normal distribution, then the Empirical Rule states

$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$

$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$

$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$

4B-11

• Are there any unusual values or outliers?

[Figure: number line centered at $\bar{x} = 22.72$ with the sorted P/E values 7, 8, . . ., 48, 55, 68, 91. Values beyond the $\pm 2s$ limits (-5.4 and 50.9) are labeled Unusual; values beyond the $\pm 3s$ limits (-19.5 and 65.0) are labeled Outliers.]
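One way to answer the question is to apply the 2-standard-deviation and 3-standard-deviation limits to the 68 P/E ratios directly. The sketch below recomputes the sample mean and standard deviation with Python's `statistics` module; the helper `beyond` and the variable names are illustrative.

```python
from statistics import mean, stdev

pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

x_bar = mean(pe)
s = stdev(pe)  # sample standard deviation (n - 1 divisor)


def beyond(k):
    """Observations more than k sample standard deviations from the mean."""
    return [x for x in pe if abs(x - x_bar) > k * s]


outliers = beyond(3)
# "Unusual" here means beyond 2s but not already counted as an outlier
unusual = [x for x in beyond(2) if x not in outliers]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print("unusual:", unusual)
print("outliers:", outliers)
```

With these limits, 55 is unusual and 68 and 91 are outliers. The classification assumes the normal-distribution limits are appropriate for these data.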

4B-12

Defining a Standardized Variable

• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population:

$z_i = \dfrac{x_i - \mu}{\sigma}$

Standardization formula for a sample:

$z_i = \dfrac{x_i - \bar{x}}{s}$

4B-13

• $z_i$ tells how far away the observation is from the mean.

• For example, for the P/E data, the first value $x_1 = 7$. The associated $z$ value is

$z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{7 - 22.72}{14.08} = -1.12$

4B-14

• A negative $z$ value means the observation is below the mean.

• Positive $z$ means the observation is above the mean. For $x_{68} = 91$,

$z_{68} = \dfrac{x_{68} - \bar{x}}{s} = \dfrac{91 - 22.72}{14.08} = 4.85$
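The two standardized values for the P/E data can be verified with a short sketch that uses the rounded summary statistics from the slides (mean 22.72, standard deviation 14.08). The function name `standardize` is illustrative.

```python
def standardize(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


x_bar, s = 22.72, 14.08  # rounded summary statistics from the slides

z_first = standardize(7, x_bar, s)   # smallest P/E ratio
z_last = standardize(91, x_bar, s)   # largest P/E ratio

print(f"z for x = 7:  {z_first:.2f}")
print(f"z for x = 91: {z_last:.2f}")
```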

4B-15

• Here are the standardized $z$ values for the P/E data:



• What do you conclude for these four values?

4B-16

• In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a standardized $z$ value.

• MegaStat calculates standardized values as well as checks for outliers.



4B-17

Outliers

• What do we do with outliers in a data set?

• If due to erroneous data, then discard.

• An outrageous observation (one completely outside of an expected range) is certainly invalid.

• Recognize unusual data points and outliers and their potential impact on your study.

• Research books and articles on how to handle outliers.

4B-18

Estimating Sigma

• For a normal distribution, nearly all values lie within a range of $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).

• If you know the range $R$ (high – low), you can estimate the standard deviation as $\sigma = R/6$.

• Useful for approximating the standard deviation when only $R$ is known.

• This estimate depends on the assumption of normality.
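A minimal sketch of the range-based estimate follows. The low and high values are hypothetical exam scores chosen for illustration, and the function name is not from the slides.

```python
def estimate_sigma_from_range(low, high):
    """Rough standard deviation estimate R/6; only sensible for roughly normal data."""
    return (high - low) / 6


# Hypothetical example: scores range from 40 to 100, so R = 60
print(estimate_sigma_from_range(40, 100))
```

The estimate is only as good as the normality assumption. For skewed data such as the P/E ratios it should be treated as a rough guide.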

4B-19

Percentiles and Quartiles

Percentiles

• Percentiles are data that have been divided into 100 groups.

• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

• Deciles are data that have been divided into 10 groups.

• Quintiles are data that have been divided into 5 groups.

• Quartiles are data that have been divided into 4 groups.

4B-20

Percentiles

• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use 5, 25, 50, 75 and 90 percentiles).

• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.

• Percentiles are used in employee merit evaluation and salary benchmarking.

4B-21

Quartiles

• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

• The three values that separate the four groups are called $Q_1$, $Q_2$, and $Q_3$, respectively.

Q1 Q2 Q3

Lower 25% | Second 25% | Third 25% | Upper 25%

4B-22

Quartiles

• The second quartile $Q_2$ is the median, an important indicator of central tendency.

Q2

Lower 50% | Upper 50%

• $Q_1$ and $Q_3$ measure dispersion since the interquartile range $Q_3 - Q_1$ measures the degree of spread in the middle 50 percent of data values.

Q1 Q3

Lower 25% | Middle 50% | Upper 25%

4B-23

Quartiles

• The first quartile $Q_1$ is the median of the data values below $Q_2$, and the third quartile $Q_3$ is the median of the data values above $Q_2$.

Q1 Q2 Q3

Lower 25% | Second 25% | Third 25% | Upper 25%

For first half of data, 50% above, 50% below $Q_1$.

For second half of data, 50% above, 50% below $Q_3$.

4B-24

Quartiles

• Depending on $n$, the quartiles $Q_1$, $Q_2$, and $Q_3$ may be members of the data set or may lie between two of the sorted data values.

4B-25

Method of Medians

• For small data sets, find quartiles using the method of medians:

Step 1. Sort the observations.

Step 2. Find the median $Q_2$.

Step 3. Find the median of the data values that lie below $Q_2$.

Step 4. Find the median of the data values that lie above $Q_2$.
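The four steps can be sketched in Python as follows. This sketch reads "below" and "above" by position in the sorted list, so the middle observation is left out of both halves when n is odd. That reading is an assumption, and the sample values are hypothetical.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)             # Step 1: sort the observations
    n = len(data)
    q2 = median(data)                 # Step 2: median of all observations
    lower = data[: n // 2]            # Step 3: values below the median position
    upper = data[(n + 1) // 2:]       # Step 4: values above the median position
    return median(lower), q2, median(upper)


sample = [21, 3, 14, 8, 5, 18, 12, 7, 13]  # hypothetical small data set
print(quartiles_by_medians(sample))
```

For this sample the sorted values are 3, 5, 7, 8, 12, 13, 14, 18, 21, so the median is 12, the lower half is 3, 5, 7, 8, and the upper half is 13, 14, 18, 21.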

4B-26

Excel Quartiles

• Use Excel function =QUARTILE(Array, k) to return the kth quartile.

• Excel treats quartiles as a special case of percentiles. For example, to calculate $Q_3$:

=QUARTILE(Array, 3)

=PERCENTILE(Array, 0.75)

• Excel calculates the quartile positions as:

Position of $Q_1$: $0.25n + 0.75$

Position of $Q_2$: $0.50n + 0.50$

Position of $Q_3$: $0.75n + 0.25$

4B-27

Example: P/E Ratios and Quartiles

• Consider the following P/E ratios for 68 stocks in a portfolio.

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14

14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19

19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26

26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

• Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

4B-28

• Using Excel’s method of interpolation, the quartile positions are:

Quartile | Position Formula | Interpolate Between

$Q_1$ | 0.25(68) + 0.75 = 17.75 | $X_{17}$ and $X_{18}$

$Q_2$ | 0.50(68) + 0.50 = 34.50 | $X_{34}$ and $X_{35}$

$Q_3$ | 0.75(68) + 0.25 = 51.25 | $X_{51}$ and $X_{52}$

4B-29

• The quartiles are:

Quartile | Formula

First ($Q_1$) | $Q_1 = X_{17} + 0.75(X_{18} - X_{17}) = 14 + 0.75(14 - 14) = 14$

Second ($Q_2$) | $Q_2 = X_{34} + 0.50(X_{35} - X_{34}) = 19 + 0.50(19 - 19) = 19$

Third ($Q_3$) | $Q_3 = X_{51} + 0.25(X_{52} - X_{51}) = 26 + 0.25(26 - 26) = 26$
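The position-and-interpolate calculation can be sketched in Python. The function below follows the position rule shown on the slides (0.25n + 0.75 for the first quartile, and so on). It is an illustrative reimplementation, not Excel's own code, and the name `excel_style_quartile` is hypothetical.

```python
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]


def excel_style_quartile(values, k):
    """k-th quartile (k = 1, 2, 3) by linear interpolation between sorted values."""
    data = sorted(values)
    n = len(data)
    position = (k / 4) * (n - 1) + 1       # 1-based; equals 0.25n + 0.75 when k = 1
    whole = int(position)                  # index of the lower neighbor (1-based)
    fraction = position - whole            # how far to move toward the upper neighbor
    lower = data[whole - 1]
    upper = data[min(whole, n - 1)]        # guard against running past the last value
    return lower + fraction * (upper - lower)


for k in (1, 2, 3):
    print(f"Q{k} = {excel_style_quartile(pe, k)}")
```

Because the neighboring values are identical at each of the three positions (14 and 14, 19 and 19, 26 and 26), the interpolation fraction has no effect for these data.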

4B-30

• So, to summarize:

Lower 25% of P/E Ratios | $Q_1 = 14$ | Second 25% of P/E Ratios | $Q_2 = 19$ | Third 25% of P/E Ratios | $Q_3 = 26$ | Upper 25% of P/E Ratios

• These quartiles express central tendency and dispersion. What is the interquartile range?

• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.



4B-31

Tip

Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

4B-32

Caution

• Quartiles generally resist outliers.

• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.

Data set A: 1, 2, 4, 4, 8, 8, 8, 8   $Q_1 = 3$, $Q_2 = 6$, $Q_3 = 8$

Data set B: 0, 3, 3, 6, 6, 6, 10, 15   $Q_1 = 3$, $Q_2 = 6$, $Q_3 = 8$

• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.
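The caution can be illustrated by computing the quartiles of both data sets alongside two other summaries. The sketch uses the method of medians, read by position in the sorted list; the mean and range are added here only for comparison and are not on the slide.

```python
from statistics import mean, median


def quartiles(values):
    """(Q1, Q2, Q3) by the method of medians."""
    data = sorted(values)
    n = len(data)
    return median(data[: n // 2]), median(data), median(data[(n + 1) // 2:])


set_a = [1, 2, 4, 4, 8, 8, 8, 8]
set_b = [0, 3, 3, 6, 6, 6, 10, 15]

for name, data in (("A", set_a), ("B", set_b)):
    # Same quartiles, but the mean and range tell the two sets apart
    print(name, quartiles(data), "mean:", mean(data), "range:", max(data) - min(data))
```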

4B-33

Dispersion Using Quartiles

• Some robust measures of central tendency and dispersion using quartiles are:

Statistic | Formula | Excel | Pro | Con

Midhinge | $\dfrac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.

4B-34

Dispersion Using Quartiles

Statistic | Formula | Excel | Pro | Con

Midspread | $Q_3 - Q_1$ | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.

Coefficient of quartile variation (CQV) | $100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to non-statisticians.

4B-35

Midhinge

• The mean of the first and third quartiles.

Midhinge = $\dfrac{Q_1 + Q_3}{2}$

• For the 68 P/E ratios,

Midhinge = $\dfrac{Q_1 + Q_3}{2} = \dfrac{14 + 26}{2} = 20$

• A robust measure of central tendency since quartiles ignore extreme values.

4B-36

Midspread (Interquartile Range)

• A robust measure of dispersion

Midspread = $Q_3 - Q_1$

• For the 68 P/E ratios,

Midspread = $Q_3 - Q_1 = 26 - 14 = 12$

4B-37

Coefficient of Quartile Variation (CQV)

• Measures relative dispersion by expressing the midspread as a percent of the sum of the quartiles (equivalently, half the midspread as a percent of the midhinge).

$CQV = 100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

• For the 68 P/E ratios,

$CQV = 100 \times \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \dfrac{26 - 14}{26 + 14} = 30.0\%$

• Similar to the CV, CQV can be used to compare data sets measured in different units or with different means.
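The three quartile-based statistics (midhinge, midspread, and CQV) can be computed together from the quartiles alone. This is a minimal sketch using the P/E quartiles 14 and 26; the function names are illustrative.

```python
def midhinge(q1, q3):
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1, q3):
    """Interquartile range."""
    return q3 - q1


def cqv(q1, q3):
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(f"midhinge  = {midhinge(q1, q3)}")
print(f"midspread = {midspread(q1, q3)}")
print(f"CQV       = {cqv(q1, q3):.1f}%")
```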

4B-38

Box Plots

• A useful tool of exploratory data analysis (EDA).

• Also called a box-and-whisker plot.

• Based on a five-number summary:

$X_{min}$, $Q_1$, $Q_2$, $Q_3$, $X_{max}$

• Consider the five-number summary for the 68 P/E ratios:

$X_{min}$, $Q_1$, $Q_2$, $Q_3$, $X_{max}$

7 14 19 26 91

4B-39

[Figure: box plot of the P/E ratios with the minimum, $Q_1$, median ($Q_2$), $Q_3$, and maximum marked. The box spans $Q_1$ to $Q_3$, the whiskers extend to the minimum and maximum, the center of the box is the midhinge, and the distribution is right-skewed.]


4B-40

Fences and Unusual Data Values

• Use quartiles to detect unusual data points.

• The cutoff points are called fences and can be found using the following formulas:

 | Inner fences | Outer fences

Lower fence | $Q_1 - 1.5(Q_3 - Q_1)$ | $Q_1 - 3.0(Q_3 - Q_1)$

Upper fence | $Q_3 + 1.5(Q_3 - Q_1)$ | $Q_3 + 3.0(Q_3 - Q_1)$

• Values outside the inner fences are unusual while those outside the outer fences are outliers.

4B-41

• For example, consider the P/E ratio data:

 | Inner fences | Outer fences

Lower fence: | 14 – 1.5 (26 – 14) = –4 | 14 – 3.0 (26 – 14) = –22

Upper fence: | 26 + 1.5 (26 – 14) = +44 | 26 + 3.0 (26 – 14) = +62

• Ignore the lower fence since it is negative and P/E ratios are only positive.



4B-42

• Truncate the whisker at the fences and display unusual values and outliers as dots.

[Figure: box plot of the P/E ratios with the inner fence and outer fence marked. Dots between the two fences are labeled Unusual; dots beyond the outer fence are labeled Outliers.]

• Based on these fences, there are three unusual P/E values and two outliers.
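The fence calculation and the resulting classification can be sketched as follows, taking the quartiles 14 and 26 as given. Values past an outer fence are counted as outliers only, so "unusual" here means between the inner and outer fences. The variable names are illustrative.

```python
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

q1, q3 = 14, 26          # quartiles of the P/E data
iqr = q3 - q1            # midspread

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences (lower, upper)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences (lower, upper)

outliers = [x for x in pe if x < outer[0] or x > outer[1]]
unusual = [x for x in pe
           if (x < inner[0] or x > inner[1]) and x not in outliers]

print("inner fences:", inner)
print("outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```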

4B-43

Grouped Data

Nature of Grouped Data

• Although some information is lost, grouped data are easier to display than raw data.

• When bin limits are given, the mean and standard deviation can be estimated.

• Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies
