lecture 4: measure of dispersion - university of new...

59
Lecture 4: Measure of Dispersion Donglei Du ([email protected]) Faculty of Business Administration, University of New Brunswick, NB Canada Fredericton E3B 9Y2 Donglei Du (UNB) ADM 2623: Business Statistics 1 / 59

Upload: doanthuy

Post on 17-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Lecture 4 Measure of Dispersion

Donglei Du(dduunbedu)

Faculty of Business Administration University of New Brunswick NB Canada FrederictonE3B 9Y2

Donglei Du (UNB) ADM 2623 Business Statistics 1 59

Table of contents1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 2 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 3 59

Introduction

Dispersion (aka variability scatter or spread)) characterizes howstretched or squeezed of the data

A measure of statistical dispersion is a nonnegative real number thatis zero if all the data are the same and increases as the data becomemore diverse

Dispersion is contrasted with location or central tendency andtogether they are the most used properties of distributions

There are many types of dispersion measures

RangeMean Absolute DeviationVarianceStandard Deviation

Donglei Du (UNB) ADM 2623 Business Statistics 4 59

Range

Range = maxminusmin

Donglei Du (UNB) ADM 2623 Business Statistics 5 59

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 2: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Table of contents1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 2 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 3 59

Introduction

Dispersion (aka variability scatter or spread)) characterizes howstretched or squeezed of the data

A measure of statistical dispersion is a nonnegative real number thatis zero if all the data are the same and increases as the data becomemore diverse

Dispersion is contrasted with location or central tendency andtogether they are the most used properties of distributions

There are many types of dispersion measures

RangeMean Absolute DeviationVarianceStandard Deviation

Donglei Du (UNB) ADM 2623 Business Statistics 4 59

Range

Range = maxminusmin

Donglei Du (UNB) ADM 2623 Business Statistics 5 59

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 3: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 3 59

Introduction

Dispersion (aka variability scatter or spread)) characterizes howstretched or squeezed of the data

A measure of statistical dispersion is a nonnegative real number thatis zero if all the data are the same and increases as the data becomemore diverse

Dispersion is contrasted with location or central tendency andtogether they are the most used properties of distributions

There are many types of dispersion measures

RangeMean Absolute DeviationVarianceStandard Deviation

Donglei Du (UNB) ADM 2623 Business Statistics 4 59

Range

Range = maxminusmin

Donglei Du (UNB) ADM 2623 Business Statistics 5 59

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 4: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Introduction

Dispersion (aka variability scatter or spread)) characterizes howstretched or squeezed of the data

A measure of statistical dispersion is a nonnegative real number thatis zero if all the data are the same and increases as the data becomemore diverse

Dispersion is contrasted with location or central tendency andtogether they are the most used properties of distributions

There are many types of dispersion measures

RangeMean Absolute DeviationVarianceStandard Deviation

Donglei Du (UNB) ADM 2623 Business Statistics 4 59

Range

Range = maxminusmin

Donglei Du (UNB) ADM 2623 Business Statistics 5 59

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 5: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Range

Range = maxminusmin

Donglei Du (UNB) ADM 2623 Business Statistics 5 59

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 6: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example Find the range of the following data 10 7 6 2 5 4 8 1

Range = 10minus 1 = 9

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtrange(sec_01A)

[1] 130 189

Range = 189minus 130 = 59

Donglei Du (UNB) ADM 2623 Business Statistics 6 59

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 7: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Properties of Range

Only two values are used in its calculation

It is influenced by an extreme value (non-robust)

It is easy to compute and understand

Donglei Du (UNB) ADM 2623 Business Statistics 7 59

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 8: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Mean Absolute Deviation

The Mean Absolute Deviation of a set of n numbers

MAD =|x1 minus micro|+ + |xn minus micro|

n

|x ‐ ||x1 ‐ |

|x ‐ |

|x3 ‐ |

|x4 ‐ |

x1 x4x2 x3

|x2 |

Donglei Du (UNB) ADM 2623 Business Statistics 8 59

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 9: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example A sample of four executives received the following bonuseslast year ($000) 140 150 170 160

Problem Determine the MAD

Solution

x =14 + 15 + 17 + 16

4=

62

4= 155

MAD =|14minus 155|+ |15minus 155|+ |17minus 155|+ |16minus 155|

4

=4

4= 1

Donglei Du (UNB) ADM 2623 Business Statistics 9 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 10: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

gtmad(sec_01A center=mean(sec_01A)constant = 1)

[1] 65

Donglei Du (UNB) ADM 2623 Business Statistics 10 59

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 11: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Properties of MAD

All values are used in the calculation

It is not unduly influenced by large or small values (robust)

The absolute values are difficult to manipulate

Donglei Du (UNB) ADM 2623 Business Statistics 11 59

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 12: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Variance

The variance of a set of n numbers as population

Var = σ2 =(x1 minus micro)2 + + (xn minus micro)2

n(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

n(computational formula)

The variance of a set of n numbers as sample

S2 =(x1 minus micro)2 + + (xn minus micro)2

nminus 1(conceptual formula)

=

nsumi=1

x2i minus

(nsum

i=1xi

)2

n

nminus 1(computational formula)

Donglei Du (UNB) ADM 2623 Business Statistics 12 59

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 13: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Standard Deviation

The Standard Deviation is the square root of variance

Other notations σ for population and S for sample

Donglei Du (UNB) ADM 2623 Business Statistics 13 59

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 14: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example Conceptual formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the mean variance and Standard Deviation

Solution Mean and variance

micro =10 + 11 + 13

3=

34

3asymp 1133333333333

σ2 asymp (10minus 1133)2 + (11minus 1133)2 + (13minus 1133)2

3

=1769 + 01089 + 27889

3=

46668

3= 15556

Standard Deviationσ asymp 1247237

Donglei Du (UNB) ADM 2623 Business Statistics 14 59

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 15: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example Computational formula

Example The hourly wages earned by three students are $10 $11$13

Problem Find the variance and Standard Deviation

Solution

Variance

σ2 =(102 + 112 + 132)minus (10+11+13)2

3

3

=390minus 1156

3

3=

390minus 38533

3=

467

3= 1555667

Standard Deviationσ asymp 1247665

If the above is sample then σ2 asymp 233335 and σ asymp 1527531

Donglei Du (UNB) ADM 2623 Business Statistics 15 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 16: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Sample Variance

gtvar(sec_01A)

[1] 1463228

if you want the population variance

gtvar(sec_01A)(length(sec_01A)-1)(length(sec_01A))

[1] 1439628

Sample standard deviation

gtsd(sec_01A)

[1] 120964

if you want the population SD

gtsd(sec_01A)sqrt((length(sec_01A)-1)(length(sec_01A)))

[1] 1199845

Donglei Du (UNB) ADM 2623 Business Statistics 16 59

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 17: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Note

Conceptual formula may have accumulated rounding error

Computational formula only has rounding error towards the end

Donglei Du (UNB) ADM 2623 Business Statistics 17 59

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 18: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Properties of Variancestandard deviation

All values are used in the calculation

It is not extremely influenced by outliers (non-robust)

The units of variance are awkward the square of the original unitsTherefore standard deviation is more natural since it recovers heoriginal units

Donglei Du (UNB) ADM 2623 Business Statistics 18 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 19: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 19 59

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 20: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

VarianceStandard Deviation for Grouped Data

The variance of a sample of data organized in a frequency distributionis computed by the following formula

S2 =

ksumi=1

fix2i minus

(ksumi1fixi

)2

n

nminus 1

where fi is the class frequency and xi is the class midpoint for Classi = 1 k

Donglei Du (UNB) ADM 2623 Business Statistics 20 59

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 21: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example Recall the weight example from Chapter 2

class freq (fi) mid point (xi) fixi fix2i

[130 140) 3 135 405 54675

[140 150) 12 145 1740 252300

[150 160) 23 155 3565 552575

[160 170) 14 165 2310 381150

[170 180) 6 175 1050 183750

[180 190] 4 185 740 136900

62 9810 1561350

Solution The VarianceStandard Deviation are

S2 =1 561 350minus 98102

62

62minus 1asymp 1500793

S asymp 1225069

The real sample varianceSD for the raw data is 1463228120964

Donglei Du (UNB) ADM 2623 Business Statistics 21 59

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 22: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Range for grouped data

The range of a sample of data organized in a frequency distribution iscomputed by the following formula

Range = upper limit of the last class - lower limit of the first class

Donglei Du (UNB) ADM 2623 Business Statistics 22 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 23: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 23 59

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 24: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Variation (CV)

The Coefficient of Variation (CV) for a data set defined as the ratioof the standard deviation to the mean

CV =

σmicro for populationSx for sample

It is also known as unitized risk or the variation coefficient

The absolute value of the CV is sometimes known as relative standarddeviation (RSD) which is expressed as a percentage

It shows the extent of variability in relation to mean of the population

It is a normalized measure of dispersion of a probability distribution orfrequency distributionCV is closely related to some important concepts in other disciplines

The reciprocal of CV is related to the Sharpe ratio in FinanceWilliam Forsyth Sharpe is the winner of the 1990 Nobel Memorial Prizein Economic Sciences

The reciprocal of CV is one definition of the signal-to-noise ratio inimage processing

Donglei Du (UNB) ADM 2623 Business Statistics 24 59

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 25: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example Rates of return over the past 6 years for two mutual fundsare shown below

Fund A 83 -60 189 -57 236 20

Fund B 12 -48 64 102 253 14

Problem Which fund has higher riskSolution The CVrsquos are given as

CVA =1319

985= 134

CVB =1029

842= 122

There is more variability in Fund A as compared to Fund BTherefore Fund A is riskierAlternatively if we look at the reciprocals of CVs which are theSharpe ratios Then we always prefer the fund with higher Sharperatio which says that for every unit of risk taken how much returnyou can obtainDonglei Du (UNB) ADM 2623 Business Statistics 25 59

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 26: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Properties of Coefficient of Variation

All values are used in the calculation

It is only defined for ratio level of data

The actual value of the CV is independent of the unit in which themeasurement has been taken so it is a dimensionless number

For comparison between data sets with different units orwidely different means one should use the coefficient of variationinstead of the standard deviation

Donglei Du (UNB) ADM 2623 Business Statistics 26 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 27: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 27 59

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 28: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Skewness

Skewness is a measure of the extent to which a probability distributionof a real-valued random variable rdquoleansrdquo to one side of the mean Theskewness value can be positive or negative or even undefined

The Coefficient of Skewness for a data set

Skew = E[(Xminusmicro

σ

)3 ]=micro3

σ3

where micro3 is the third moment about the mean micro σ is the standarddeviation and E is the expectation operator

Donglei Du (UNB) ADM 2623 Business Statistics 28 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 29: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

Skewness

gtskewness(sec_01A)

[1] 04267608

Donglei Du (UNB) ADM 2623 Business Statistics 29 59

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 30: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Skewness Risk

Skewness risk is the risk that results when observations are not spreadsymmetrically around an average value but instead have a skeweddistribution

As a result the mean and the median can be different

Skewness risk can arise in any quantitative model that assumes asymmetric distribution (such as the normal distribution) but is appliedto skewed data

Ignoring skewness risk by assuming that variables are symmetricallydistributed when they are not will cause any model to understate therisk of variables with high skewness

Reference Mandelbrot Benoit B and Hudson Richard L The(mis)behaviour of markets a fractal view of risk ruin and reward2004

Donglei Du (UNB) ADM 2623 Business Statistics 30 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 31: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 31 59

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 32: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Excess Kurtosis Kurt

Kurtosis (meaning curved arching) is a measure of the rdquopeakednessrdquoof the probability distribution

The Excess Kurtosis for a data set

Kurt =E[(X minus micro)4]

(E[(X minus micro)2])2minus 3 =

micro4

σ4minus 3

where micro4 is the fourth moment about the mean and σ is the standarddeviationThe Coefficient of excess kurtosis provides a comparison of the shapeof a given distribution to that of the normal distribution

Negative excess kurtosis platykurticsub-Gaussian A low kurtosisimplies a more rounded peak and shorterthinner tails such asuniform Bernoulli distributionsPositive excess kurtosis leptokurticsuper-Gaussian A high kurtosisimplies a sharper peak and longerfatter tails such as the Studentrsquost-distribution exponential Poisson and the logistic distributionZero excess kurtosis are called mesokurtic or mesokurtotic such asNormal distribution

Donglei Du (UNB) ADM 2623 Business Statistics 32 59

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 33: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example the weight example (weightcsv)

The R code

gtweight lt- readcsv(weightcsv)

gtsec_01Alt-weight$Weight01A2013Fall

kurtosis

gtkurtosis(sec_01A)

[1] -006660501

Donglei Du (UNB) ADM 2623 Business Statistics 33 59

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 34: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Kurtosis Risk

Kurtosis risk is the risk that results when a statistical model assumesthe normal distribution but is applied to observations that do notcluster as much near the average but rather have more of a tendencyto populate the extremes either far above or far below the average

Kurtosis risk is commonly referred to as rdquofat tailrdquo risk The rdquofat tailrdquometaphor explicitly describes the situation of having moreobservations at either extreme than the tails of the normaldistribution would suggest therefore the tails are rdquofatterrdquo

Ignoring kurtosis risk will cause any model to understate the risk ofvariables with high kurtosis

Donglei Du (UNB) ADM 2623 Business Statistics 34 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 35: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 35 59

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 36: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Chebyshevrsquos Theorem

Given a data set with mean micro and standard deviation σ for anyconstant k the minimum proportion of the values that lie within kstandard deviations of the mean is at least

1minus 1

k2

2

11k

+k‐k

kk

+k k

Donglei Du (UNB) ADM 2623 Business Statistics 36 59

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 37: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Given a data set of five values 140 150 170 160 150

Calculate mean and standard deviation micro = 154 and σ asymp 10198

For k = 11 there should be at least

1minus 1

112asymp 17

within the interval

[microminus kσ micro+ kσ] = [154minus 11(10198) 154 + 11(10198)]

= [1428 1652]

Let us count the real percentage Among the five three of which(namely 150 150 160) are in the above interval Therefore thereare three out of five ie 60 percent which is greater than 17percent So the theorem is correct though not very accurate

Donglei Du (UNB) ADM 2623 Business Statistics 37 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 38: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 38 59

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 39: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Example

Example The average of the marks of 65 students in a test was704 The standard deviation was 24

Problem 1 About what percent of studentsrsquo marks were between656 and 752

Solution Based on Chebychevrsquos Theorem 75 since656 = 704minus (2)(24) and 752 = 704 + (2)(24)

Problem 2 About what percent of studentsrsquo marks were between632 and 776

Solution Based on Chebychevrsquos Theorem 889 since632 = 704minus (3)(24) and 776 = 704 + (3)(24)

Donglei Du (UNB) ADM 2623 Business Statistics 39 59

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 40: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

The Empirical rule

We can obtain better estimation if we know more

For any symmetrical bell-shaped distribution (ie Normaldistribution)

About 68 of the observations will lie within 1 standard deviation ofthe meanAbout 95 of the observations will lie within 2 standard deviation ofthe meanVirtually all the observations (997) will be within 3 standarddeviation of the mean

Donglei Du (UNB) ADM 2623 Business Statistics 40 59

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 41: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

The Empirical rule

Donglei Du (UNB) ADM 2623 Business Statistics 41 59

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 42: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Comparing Chebyshevrsquos Theorem with The Empirical rule

k Chebyshevrsquos Theorem The Empirical rule

1 0 682 75 953 889 997

Donglei Du (UNB) ADM 2623 Business Statistics 42 59

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 43: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

An Approximation for Range

Based on the last claim in the empirical rule we can approximate therange using the following formula

Range asymp 6σ

Or more often we can use the above to estimate standard deviationfrom range (such as in Project Management or QualityManagement)

σ asymp Range

6

Example If the range for a normally distributed data is 60 given aapproximation of the standard deviation

Solution σ asymp 606 = 10

Donglei Du (UNB) ADM 2623 Business Statistics 43 59

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 44: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Why you should check the normal assumption first

Model risk is everywhere

Black Swans are more than you think

Donglei Du (UNB) ADM 2623 Business Statistics 44 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 45: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 45 59

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 46: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Correlation Scatterplot for the weight vsheight collected form the guessing game in the first class

Coefficient of Correlation is a measure of the linear correlation(dependence) between two sets of data x = (x1 xn) andy = (y1 yn)

A scatter plot pairs up values of two quantitative variables in a dataset and display them as geometric points inside a Cartesian diagram

Donglei Du (UNB) ADM 2623 Business Statistics 46 59

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 47: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Scatterplot Example

160 170 180 190

130

140

150

160

170

180

190

Weight

Hei

ght

The R code

gtweight_height lt- readcsv(weight_height_01Acsv)

gtplot(weight_height$weight_01A weight_height$height_01A

xlab=Weight ylab=Height)

Donglei Du (UNB) ADM 2623 Business Statistics 47 59

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 48: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Correlation r

Pearsonrsquos correlation coefficient between two variables is defined asthe covariance of the two variables divided by the product of theirstandard deviations

rxy =

sumni=1(Xi minus X)(Yi minus Y )radicsumn

i=1(Xi minus X)2radicsumn

i=1(Yi minus Y )2

=

nnsumi=1

xiyi minusnsumi=1

xinsumi=1

yiradicn

nsumi=1

xi minus(n

nsumi=1

xi

)2radicn

nsumi=1

yi minus(n

nsumi=1

yi

)2

Donglei Du (UNB) ADM 2623 Business Statistics 48 59

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 49: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Correlation Interpretation

Donglei Du (UNB) ADM 2623 Business Statistics 49 59

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 50: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Coefficient of Correlation Example

For the the weight vs height collected form the guessing game in thefirst class

The Coefficient of Correlation is

r asymp 02856416

Indicated weak positive linear correlation

The R code

gt weight_height lt- readcsv(weight_height_01Acsv)

gt cor(weight_height$weight_01A weight_height$height_01A

use = everything method = pearson)

[1] 02856416

Donglei Du (UNB) ADM 2623 Business Statistics 50 59

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 51: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Layout1 Measure of Dispersion scale parameter

IntroductionRangeMean Absolute DeviationVariance and Standard DeviationRange for grouped dataVarianceStandard Deviation for Grouped DataRange for grouped data

2 Coefficient of Variation (CV)3 Coefficient of Skewness (optional)

Skewness Risk4 Coefficient of Kurtosis (optional)

Kurtosis Risk5 Chebyshevrsquos Theorem and The Empirical rule

Chebyshevrsquos TheoremThe Empirical rule

6 Correlation Analysis7 Case study

Donglei Du (UNB) ADM 2623 Business Statistics 51 59

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 52: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 1 Wisdom of the Crowd

The aggregation of information in groups resultsin decisions that are often better than could havebeen made by any single member of the group

We collect some data from the guessing game in the first class Theimpression is that the crowd can be very wise

The question is when the crowd is wise We given a mathematicalexplanation based on the concepts of mean and standard deviation

For more explanations please refer to Surowiecki James (2005)The Wisdom of Crowds Why the Many Are Smarter Than the Fewand How Collective Wisdom Shapes Business Economies Societiesand Nations

Donglei Du (UNB) ADM 2623 Business Statistics 52 59

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 53: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 1 Wisdom of the Crowd Mathematical explanation

Let xi (i = 1 n) be the individual predictions and let θ be thetrue value

The crowdrsquos overall prediction

micro =

nsumi=1

xi

n

The crowdrsquos (squared) error

(microminus θ)2 =

nsumi=1

xi

nminus θ

2

=

nsumi=1

(xi minus θ)2

nminus

nsumi=1

(xi minus micro)2

n

= Biasminus σ2 = Bias - Variance

Diversity of opinion Each person should have private informationindependent of each other

Donglei Du (UNB) ADM 2623 Business Statistics 53 59

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 54: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 Friendship paradox Why are your friends morepopular than you

On average most people have fewer friends thantheir friends have Scott L Feld (1991)

References Feld Scott L (1991) rdquoWhy your friends have morefriends than you dordquo American Journal of Sociology 96 (6)1464-147

Donglei Du (UNB) ADM 2623 Business Statistics 54 59

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 55: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 Friendship paradox An example

References [hyphen]httpwwweconomistcomblogseconomist-explains201304economist-explains-why-friends-more-popular-paradox

Donglei Du (UNB) ADM 2623 Business Statistics 55 59

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 56: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 Friendship paradox An example

Person degree

Alice 1Bob 3

Chole 2Dave 2

Individualrsquos average number of friends is

micro =8

4= 2

Individualrsquos friendsrsquo average number of friends is

12 + 22 + 22 + 32

1 + 2 + 2 + 3=

18

8= 225

Donglei Du (UNB) ADM 2623 Business Statistics 56 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 57: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 57 59

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 58: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 Friendship paradox Mathematical explanation

Let xi (i = 1 n) be the degrees of a network

Individualrsquos average number of friends is

micro =

nsumi=1

xi

n

Individualrsquos friendsrsquo average number of friends is

nsumi=1

x2i

nsumi=1

xi

= micro+σ2

micro

Donglei Du (UNB) ADM 2623 Business Statistics 58 59

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study
Page 59: Lecture 4: Measure of Dispersion - University of New …ddu/2623/Lecture_notes/Lecture4_student.pdf · 2016-10-03 · 6 Correlation Analysis 7 Case study Donglei Du (UNB) ... Range

Case 2 A video by Nicholas Christakis How socialnetworks predict epidemics

Here is the youtube link

Donglei Du (UNB) ADM 2623 Business Statistics 59 59

  • Measure of Dispersion scale parameter
    • Introduction
    • Range
    • Mean Absolute Deviation
    • Variance and Standard Deviation
    • Range for grouped data
    • VarianceStandard Deviation for Grouped Data
    • Range for grouped data
      • Coefficient of Variation (CV)
      • Coefficient of Skewness (optional)
        • Skewness Risk
          • Coefficient of Kurtosis (optional)
            • Kurtosis Risk
              • Chebyshevs Theorem and The Empirical rule
                • Chebyshevs Theorem
                • The Empirical rule
                  • Correlation Analysis
                  • Case study