numerical descriptive techniques
DESCRIPTION
Numerical Descriptive Techniques. Chapter 4. 4.2 Measures of Central Location. Usually, we focus our attention on two types of measures when describing population characteristics: Central location (e.g. average) Variability or spread. - PowerPoint PPT PresentationTRANSCRIPT
1
Numerical Descriptive Techniques
Chapter 4
2
4.2 Measures of Central LocationUsually, we focus our attention on two types of
measures when describing population characteristics: Central location (e.g. average) Variability or spread
The measure of central location reflects the locations of all the actual data points.
3
With one data pointclearly the central location is at the pointitself.
4.2 Measures of Central LocationThe measure of central location reflects the
locations of all the actual data points.How?
But if the third data point appears on the left hand-sideof the midrange, it should “pull”the central location to the left.
With two data points,the central location should fall in the middlebetween them (in order to reflect the location ofboth of them).
4
Sum of the observationsNumber of observations
Mean =
This is the most popular and useful measure of central location
The Arithmetic Mean
5
nx
x in
1i
Sample mean Population mean
Nx i
N1i
Sample size Population size
nx
x in
1i
The Arithmetic Mean
6
10
...
101021
101 xxxx
x ii
• Example 4.1The reported time on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
00 77 222211.011.0
• Example 4.2
Suppose the telephone bills of Example 2.1 representthe population of measurements. The population mean is
200x...xx
200x 20021i
2001i 42.1942.19 38.4538.45 45.7745.77
43.5943.59
The Arithmetic Mean
The arithmetic mean
7
Odd number of observations
0, 0, 5, 7, 8 9, 12, 14, 220, 0, 5, 7, 8, 9, 12, 14, 22, 330, 0, 5, 7, 8, 9, 12, 14, 22, 33
Even number of observations
Example 4.3
Find the median of the time on the internetfor the 10 adults of example 4.1
The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
The Median
Suppose only 9 adults were sampled (exclude, say, the longest time (33))
Comment
8.5, 8
8
The Mode of a set of observations is the value that occurs most frequently.
Set of data may have one mode (or modal class), or two or more modes.
The modal classFor large data setsthe modal class is much more relevant than a single-value mode.
The Mode
9
Example 4.5Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observation except “0” occur once. There are two “0”. Thus, the mode is zero.
• Is this a good measure of central location?• The value “0” does not reside at the center of this set
(compare with the mean = 11.0 and the mode = 8.5).
The Mode
The Mode The Mean, Median, Mode
10
Relationship among Mean, Median, and Mode
If a distribution is symmetrical, the mean, median and mode coincide
If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.A positively skewed distribution
(“skewed to the right”)
MeanMedian
Mode
11
If a distribution is symmetrical, the mean, median and mode coincide
If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ.
A positively skewed distribution(“skewed to the right”)
MeanMedian
Mode MeanMedian
Mode
A negatively skewed distribution(“skewed to the left”)
Relationship among Mean, Median, and Mode
12
This is a measure of the average growth rate.Let Ri denote the the rate of return in period i
(i=1,2…,n). The geometric mean of the returns R1, R2, …,Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
The Geometric Mean
13
If the rate of return was Rg in everyperiod, the nth period return wouldbe calculated by:
ng )R1( )R1)...(R1)(R1( n21
For the given series of rate of returns the nth period return iscalculated by:
Rg is selected such that…
1)R1)...(R1)(R1(R nn21g 1)R1)...(R1)(R1(R n
n21g
The Geometric MeanThe Geometric Mean
14
4.3 Measures of variability
Measures of central location fail to tell the whole story about the distribution.
A question of interest still remains unanswered:
How much are the observations spread outaround the mean value?
15
4.3 Measures of variability
Observe two hypothetical data sets:
The average value provides a good representation of theobservations in the data set.
Small variability
This data set is now changing to...
16
4.3 Measures of variability
Observe two hypothetical data sets:
The average value provides a good representation of theobservations in the data set.
Small variability
Larger variability
The same average value does not provide as good representation of theobservations in the data set as before.
17
The range of a set of observations is the difference between the largest and smallest observations.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
? ? ?
But, how do all the observations spread out?
Smallestobservation
Largestobservation
The range cannot assist in answering this questionRange
The range
18
This measure reflects the dispersion of all the observations
The variance of a population of size N x1, x2,…,xN
whose mean is is defined as
The variance of a sample of n observationsx1, x2, …,xn whose mean is is defined asx
N
)x( 2i
N1i2
N
)x( 2i
N1i2
1n
)xx(s
2i
n1i2
1n
)xx(s
2i
n1i2
The Variance
19
Why not use the sum of deviations?
Consider two small populations:
1098
74 10
11 12
13 16
8-10= -2
9-10= -111-10= +1
12-10= +2
4-10 = - 6
7-10 = -3
13-10 = +3
16-10 = +6
Sum = 0
Sum = 0
The mean of both populations is 10...
…but measurements in Bare more dispersedthen those in A.
A measure of dispersion Should agrees with this observation.
Can the sum of deviationsBe a good measure of dispersion?
A
B
The sum of deviations is zero for both populations, therefore, is not a good measure of dispersion.
20
Let us calculate the variance of the two populations
185
)1016()1013()1010()107()104( 222222B
25
)1012()1011()1010()109()108( 222222A
Why is the variance defined as the average squared deviation?Why not use the sum of squared deviations as a measure of variation instead?
After all, the sum of squared deviations increases in magnitude when the variationof a data set increases!!
The Variance
21
Which data set has a larger dispersion?Which data set has a larger dispersion?
1 3 1 32 5
A B
Data set Bis more dispersedaround the mean
Let us calculate the sum of squared deviations for both data sets
The Variance
22
1 3 1 32 5
A B
SumA = (1-2)2 +…+(1-2)2 +(3-2)2 +… +(3-2)2= 10SumB = (1-3)2 + (5-3)2 = 8
SumA > SumB. This is inconsistent with the observation that set B is more dispersed.
The Variance
23
1 3 1 32 5
A B
However, when calculated on “per observation” basis (variance), the data set dispersions are properly ranked.
A2 = SumA/N = 10/5 = 2
B2 = SumB/N = 8/2 = 4
The Variance
24
Example 4.7 The following sample consists of the number of jobs
six students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance
Solution
2
2222
in
1i2
jobs2.33
)1413...()1415()1417(16
11n
)xx(s
jobs146
846
13972315176
xx i
61i
The Variance
25
2
2222
2i
n1i2
i
n
1i
2
jobs2.33
613...1517
13...151716
1
n)x(
x1n
1s
The Variance – Shortcut method
26
The standard deviation of a set of observations is the square root of the variance .
2
2
:deviationandardstPopulation
ss:deviationstandardSample
2
2
:deviationandardstPopulation
ss:deviationstandardSample
Standard Deviation
27
Example 4.8 To examine the consistency of shots for a new
innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
The distances were recorded. Which 7-iron is more consistent?
Standard Deviation
28
Example 4.8 – solution
Standard Deviation
Excel printout, from the “Descriptive Statistics” sub-menu.
Current Innovation
Mean 150.5467 Mean 150.1467Standard Error 0.668815 Standard Error 0.357011Median 151 Median 150Mode 150 Mode 149Standard Deviation 5.792104 Standard Deviation 3.091808Sample Variance 33.54847 Sample Variance 9.559279Kurtosis 0.12674 Kurtosis -0.88542Skewness -0.42989 Skewness 0.177338Range 28 Range 12Minimum 134 Minimum 144Maximum 162 Maximum 156Sum 11291 Sum 11261Count 75 Count 75
The innovation club is more consistent, and because the means are close, is considered a better club
The Standard Deviation
29
Interpreting Standard Deviation
The standard deviation can be used to compare the variability of several distributions make a statement about the general shape of a
distribution. The empirical rule: If a sample of observations has a
mound-shaped distribution, the interval
tsmeasuremen the of 68%ely approximat contains )sx,sx(
tsmeasuremen the of 95%ely approximat contains )s2x,s2x( tsmeasuremen the of 99.7%ely approximat contains )s3x,s3x(
30
Example 4.9A statistics practitioner wants to describe the way returns on investment are distributed. The mean return = 10% The standard deviation of the return = 8% The histogram is bell shaped.
Interpreting Standard Deviation
31
Example 4.9 – solutionThe empirical rule can be applied (bell shaped histogram)Describing the return distribution
Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]
Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]
Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]
Interpreting Standard Deviation
32
The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1-1/k2 for k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!!
K Interval Chebysheff Empirical Rule1 at least 0% approximately 68%2 at least 75% approximately 95%3 at least 89% approximately 99.7%
s2x,s2x sx,sx
s3x,s3x
The Chebysheff’s Theorem
(1-1/12)
(1-1/22)
(1-1/32)
33
Example 4.10 The annual salaries of the employees of a chain of computer
stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000,respectively. What can you say about the salaries at this chain?SolutionAt least 75% of the salaries lie between $22,000 and $34,000 28000 – 2(3000) 28000 + 2(3000)At least 88.9% of the salaries lie between $$19,000 and $37,000 28000 – 3(3000) 28000 + 3(3000)
The Chebysheff’s Theorem
34
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
This coefficient provides a proportionate measure of variation.
CV : variationoft coefficien Population
x
scv : variationoft coefficien Sample
A standard deviation of 10 may be perceivedlarge when the mean value is 100, but only moderately large when the mean value is 500
The Coefficient of Variation
35Your score
4.4 Measures of Relative Standing and Box Plots
Percentile The pth percentile of a set of measurements is the
value for which • p percent of the observations are less than that value• 100(1-p) percent of all the observations are greater than
that value. Example
• Suppose your score is the 60% percentile of a SAT test. Then 60% of all the scores lie here 40%
36
Commonly used percentiles First (lower)decile = 10th percentile First (lower) quartile, Q1, = 25th percentile Second (middle)quartile,Q2, = 50th percentile Third quartile, Q3, = 75th percentile Ninth (upper)decile = 90th percentile
Quartiles
37
Quartiles
Example
Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8
38
SolutionSort the observations2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30
At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.
At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.
At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.
At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.
The first quartileThe first quartile
Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.
15 observations
Quartiles
39
Find the location of any percentile using the formula
Example 4.11Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1
Location of Percentiles
percentilePtheoflocationtheisLwhere100P
)1n(L
thP
P
percentilePtheoflocationtheisLwhere100P
)1n(L
thP
P
40
2 3
0 5
1
0
LocationLocation
Values
Location 3
Example 4.11 – solution After sorting the data we have 0, 0, 5, 7, 8, 9, 22, 33.
75.210025
)110(L25 3.75
The 2.75th locationTranslates to the value(.75)(5 – 0) = 3.75
2.75
Location of Percentiles
41
Example 4.11 – solution continued
The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.
Location of Percentiles
5.510050
)110(L50
42
Example 4.11 – solution continued
The 75th percentile is one quarter of the distance between the eighth and ninth observation that is14+.25(22 – 14) = 16.
Location of Percentiles
25.810075
)110(L75
Eighth observation
Ninthobservation
43
Quartiles and Variability
Quartiles can provide an idea about the shape of a histogram
Q1 Q2 Q3
Positively skewedhistogram
Q1 Q2 Q3
Negatively skewedhistogram
44
This is a measure of the spread of the middle 50% of the observations
Large value indicates a large spread of the observations
Interquartile range = Q3 – Q1
Interquartile Range
45
1.5(Q3 – Q1) 1.5(Q3 – Q1)
This is a pictorial display that provides the main descriptive measures of the data set:
• L - the largest observation• Q3 - The upper quartile
• Q2 - The median
• Q1 - The lower quartile• S - The smallest observation
S Q1 Q2 Q3 LWhisker Whisker
Box Plot
46
Example 4.14 (Xm02-01)
Box Plot
Bills42.1938.4529.2389.35118.04110.46...
Smallest = 0Q1 = 9.275Median = 26.905Q3 = 84.9425Largest = 119.63IQR = 75.6675Outliers = ()
Left hand boundary = 9.275–1.5(IQR)= -104.226Right hand boundary=84.9425+ 1.5(IQR)=198.4438
9.2750 84.9425 198.4438119.63-104.22626.905
No outliers are found
47
Additional Example - GMAT scoresCreate a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS)
Box Plot
GMAT512531461515...
Smallest = 449Q1 = 512Median = 537Q3 = 575Largest = 788IQR = 63Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )
537512449 575417.5512-1.5(IQR) 575+1.5(IQR)
669.5 788
48
Interpreting the box plot results• The scores range from 449 to 788.• About half the scores are smaller than 537, and about half are larger than
537.• About half the scores lie between 512 and 575.• About a quarter lies below 512 and a quarter above 575.
Q1
512Q2
537Q3
575
25% 50% 25%
449 669.5
Box PlotGMAT - continued
49
50%
25% 25%
The histogram is positively skewed
Q1
512Q2
537Q3
575
25% 50% 25%
449 669.5
Box PlotGMAT - continued
50
Example 4.15 (Xm04-15) A study was organized to compare the quality of
service in 5 drive through restaurants. Interpret the results
Example 4.15 – solution Minitab box plot
Box Plot
51
100 200 300
1
2
3
4
5
C6
C7
Wendy’s service time appears to be the shortest and most consistent.
Hardee’s service time variability is the largest
Jack in the box is the slowest in service
Box Plot
Jack in the Box
Hardee’s
McDonalds
Wendy’s
Popeyes
52
100 200 300
1
2
3
4
5
C6
C7
Popeyes
Wendy’s
Hardee’s
Jack in the Box
Wendy’s service time appears to be the shortest and most consistent.
McDonalds
Hardee’s service time variability is the largest
Jack in the box is the slowest in service
Box Plot
Times are positively skewed
Times are symmetric
53
4.5 Measures of Linear Relationship
The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables. Covariance - is there any pattern to the way two
variables move together? Coefficient of correlation - how strong is the linear
relationship between two variables
54
N
)y)((xY)COV(X,covariance Population yixi
N
)y)((xY)COV(X,covariance Population yixi
x (y) is the population mean of the variable X (Y).N is the population size.
1-n)yy)(x(x
y) cov(x,covariance Sample ii
1-n)yy)(x(x
y) cov(x,covariance Sample ii
Covariance
x (y) is the sample mean of the variable X (Y).n is the sample size.
55
Compare the following three sets
Covariance
xi yi (x – x) (y – y) (x – x)(y – y)267
132027
-312
-707
21014
x=5 y =20 Cov(x,y)=17.5
xi yi (x – x) (y – y) (x – x)(y – y)267
272013
-312
70-7
-210-14
x=5 y =20 Cov(x,y)=-17.5
xi yi
267
202713
Cov(x,y) = -3.5
x=5 y =20
56
If the two variables move in opposite directions, (one increases when the other one decreases), the covariance is a large negative number.
If the two variables are unrelated, the covariance will be close to zero.
If the two variables move in the same direction, (both increase or both decrease), the covariance is a large positive number.
Covariance
57
This coefficient answers the question: How strong is the association between X and Y.
yx
)Y,X(COV
ncorrelatio oft coefficien Population
yx
)Y,X(COV
ncorrelatio oft coefficien Population
yxss)Y,Xcov(
r
ncorrelatio oft coefficien Sample
yxss
)Y,Xcov(r
ncorrelatio oft coefficien Sample
The coefficient of correlation
58
COV(X,Y)=0 or r =
+1
0
-1
Strong positive linear relationship
No linear relationship
Strong negative linear relationship
or
COV(X,Y)>0
COV(X,Y)<0
The coefficient of correlation
59
If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
No straight line relationship is indicated by a coefficient close to zero.
The coefficient of correlation
60
Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.
Solution We believe GMAT affects GPA. Thus
• GMAT is labeled X• GPA is labeled Y
The coefficient of correlation and the covariance – Example 4.16
61
1 599 9.6 358801 92.16 5750.4
2 689 8.8 474721 77.44 6063.2
3 584 7.4 341056 54.76 4321.6
4 631 10 398161 100 6310
11 593 8.8 351649 77.44 5218.4
12 683 8 466489 64 5464
Total 7,587 106.4 4,817,755 957.2 67,559.2
Student x y x2 y2 xy
………………………………………………….
n
xx
ns
n
yxyx
n
yx
i
iiii
222
1
1
1
1
),cov(
FormulasShortcut
The coefficient of correlation and the covariance – Example 4.16
cov(x,y)=(1/12-1)[67,559.2-(7587)(106.4)/12]=26.16
Sx = {(1/12-1)[4,817,755-(7587)2/12)]}.5=43.56Sy = similar to Sx = 1.12
r = cov(x,y)/SxSy = 26.16/(43.56)(1.12) = .5362
62
Use the Covariance option in Data Analysis If your version of Excel returns the population covariance and
variances, multiply each one by n/n-1 to obtain the corresponding sample values.
Use the Correlation option to produce the correlation matrix.
The coefficient of correlation and the covariance – Example 4.16 – Excel
GPA GMAT
GPA 1.15
GMAT 23.98 1739.52
GPA GMAT
GPA 1.25
GMAT 26.16 1897.66
1212-1
Variance-Covariance MatrixPopulation values
Sample values
Population values
Sample values
63
Interpretation The covariance (26.16) indicates that GMAT score
and performance in the MBA program are positively related.
The coefficient of correlation (.5365) indicates that there is a moderately strong positive linear relationship between GMAT and MBA GPA.
The coefficient of correlation and the covariance – Example 4.16 – Excel
64
We are seeking a line that best fits the data when two variables are (presumably) related to one another.
We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized. 2
ii
n
1i)yy(Minimize
The actual y value of point i The y value of point icalculated from the equation i10i xbby
The Least Squares Method
65
The least Squares Method
Errors
Different lines generate different errors, thus different sum of squares of errors.
X
Y
There is a line that minimizes the sum of squared errors
Errors
66
The least Squares Method
The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.
n
x
xandn
y
ywhere
xbyb
xx
yyxx
s
yxb
n
ii
n
ii
n
ii
n
iii
x
11
10
1
2
121 ,
)(
))((),cov(
n
x
xandn
y
ywhere
xbyb
xx
yyxx
s
yxb
n
ii
n
ii
n
ii
n
iii
x
11
10
1
2
121 ,
)(
))((),cov(
67145.)25.632)(0138(.87.8
87.812
4.106
25.63212
587,7
0138.2.1897
16.26),cov(
10
21
xbybn
yy
n
xx
s
yxb
i
i
x
145.)25.632)(0138(.87.8
87.812
4.106
25.63212
587,7
0138.2.1897
16.26),cov(
10
21
xbybn
yy
n
xx
s
yxb
i
i
x
Scatter Diagram
y = 0.1496 + 0.0138x
6
8
10
12
500 600 700 800
Scatter Diagram
y = 0.1496 + 0.0138x
6
8
10
12
500 600 700 800
Example 4.17 Find the least squares line for Example 4.16 (Xm04-16.xls)
The Least Squares Method