Chapter 2 Statistical description of quantitative variables
Teaching contents
In this chapter, we shall study descriptive
techniques for quantitative variables.
Section 1 Frequency distribution table and
frequency distribution graph
Section 2 Measures of central tendency
Section 3 Measures of dispersion tendency
Teaching aims
To learn how to use frequency tables
and graphs.
To master the application of the different
indexes.
Department of Health Statistics
Section 1 Frequency distribution
table and frequency distribution
graph
part 1 Frequency distribution table and
graph of qualitative variable
part 2 Frequency distribution table and
graph of quantitative variable
part 3 Usage of frequency distribution
graph
[Example 1.1] University officials
periodically review the distribution of
undergraduate majors to help determine
a fair allocation of resources. The
following data were obtained:
Table 1.1 The distribution of undergraduate majors
College              Number of majors
Agriculture          1500
Arts and sciences    3200
Education            1200
Engineering          4100
Fig 1.1 The distribution of undergraduate majors (bar chart; vertical axis: number of majors, 0 to 5000; bars: engineering, arts and sciences, agriculture, education)
[Example 1.2] The techniques will be
illustrated using the Scottish Heart
Health Study, but for simplicity we shall
now take only one variable recorded on
50 subjects.
5.75 6.29 6.13 6.78 6.46
6.76 5.98 6.25 6.31 5.99
6.47 5.71 5.19 4.35 5.35
7.11 6.89 6.05 7.01 5.86
5.42 4.92 7.12 5.85 5.64
7.04 6.23 5.71 6.74 6.36
5.75 7.71 6.19 7.55 6.76
7.14 5.73 6.73 7.86 5.51
6.02 6.54 5.34 6.92 7.15
6.55 7.16 4.79 6.64 6.83
Table 1.2 Serum total cholesterol (mmol/L) of 50 subjects from the Scottish Heart Health
Study
How to describe the data in Table 1.2?
List all the data one by one, but then it is
difficult for the reader to grasp the
distribution characteristics of the 50 individuals.
Summarize the data using specific indexes, which
is economical in space and easier for the
reader to understand.
FREQUENCY DISTRIBUTION TABLE and FREQUENCY DISTRIBUTION GRAPH
Step 1 Find MIN and MAX, and
compute the range
Step 2 Set up class intervals
Step 3 Assign each observation to one of the
class intervals and count the frequencies
Step 1
MIN 4.35
MAX 7.86
RANGE 3.51
Range is the difference between MAX
and MIN: 7.86 - 4.35 = 3.51
Step 2
Divide the range by the approximate
number of class intervals.
Generally we wish to have 7 to 15
class intervals, depending on the
sample size: the larger the sample size,
the more class intervals there are.
Suppose we wish to have 7 class
intervals; then the interval width is
3.51 (range) / 7 ≈ 0.5
So we choose 0.5 as the interval
width.
Step 3
Note: The first class interval
must contain MIN, and the last one
must contain MAX. Starting from 4.0 with a
width of 0.5, eight class intervals (4.0 to 8.0)
cover all the data.
Construct the frequency distribution by
keeping a tally of the number of
measurements falling in each
interval.
Note: Each class interval
includes its lower limit (L), but not
its upper limit (U).
For example, a value of 5.5
belongs to the fourth group (5.5-6.0).
Cholesterol
(mmol/L)
4.0-4.5
4.5-5.0
5.0-5.5
5.5-6.0
6.0-6.5
6.5-7.0
7.0-7.5
7.5-8.0
Table 1.3 Frequency distribution table for serum total cholesterol
Cholesterol (mmol/L)   Tally                   Frequency   Percentage   Cumulative percentage
4.0-4.5                |                        1           2%           2%
4.5-5.0                | |                      2           4%           6%
5.0-5.5                | | | |                  4           8%          14%
5.5-6.0                | | | | | | | | | | |   11          22%          36%
6.0-6.5                | | | | | | | | | | |   11          22%          58%
6.5-7.0                | | | | | | | | | | |   11          22%          80%
7.0-7.5                | | | | | | |            7          14%          94%
7.5-8.0                | | |                    3           6%         100%
Total                                          50         100%
In each class interval, the first value is the lower limit and the second is the upper limit.
Percentage is frequency divided by sample size (50).
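The three steps above can be sketched in Python. This is only an illustrative sketch: the names `class_index`, `frequencies` and `table` are made up for this example, and the starting point 4.0 and width 0.5 are the choices made in Step 2.

```python
from collections import Counter

# Serum total cholesterol (mmol/L) of the 50 subjects in Table 1.2.
cholesterol = [
    5.75, 6.29, 6.13, 6.78, 6.46, 6.76, 5.98, 6.25, 6.31, 5.99,
    6.47, 5.71, 5.19, 4.35, 5.35, 7.11, 6.89, 6.05, 7.01, 5.86,
    5.42, 4.92, 7.12, 5.85, 5.64, 7.04, 6.23, 5.71, 6.74, 6.36,
    5.75, 7.71, 6.19, 7.55, 6.76, 7.14, 5.73, 6.73, 7.86, 5.51,
    6.02, 6.54, 5.34, 6.92, 7.15, 6.55, 7.16, 4.79, 6.64, 6.83,
]

LOWER = 4.0     # lower limit of the first class interval (assumed start)
WIDTH = 0.5     # class interval width chosen in Step 2
N_CLASSES = 8   # 4.0-4.5 up to 7.5-8.0


def class_index(x):
    """Index of the interval [L, U) that contains x.

    Working in integer hundredths avoids floating-point surprises
    for values that sit exactly on a class limit.
    """
    return (round(x * 100) - round(LOWER * 100)) // round(WIDTH * 100)


# Step 1: MIN, MAX and range.
data_range = max(cholesterol) - min(cholesterol)

# Step 3: tally the observations in each class interval.
counts = Counter(class_index(x) for x in cholesterol)
frequencies = [counts.get(k, 0) for k in range(N_CLASSES)]

# Percentage and cumulative percentage for each interval.
n = len(cholesterol)
table = []
cumulative = 0
for k, f in enumerate(frequencies):
    cumulative += f
    low = LOWER + k * WIDTH
    table.append((low, low + WIDTH, f, 100 * f / n, 100 * cumulative / n))

for low, high, f, pct, cum_pct in table:
    print(f"{low:.1f}-{high:.1f}  {f:3d}  {pct:5.1f}%  {cum_pct:5.1f}%")
```

The printed rows should agree with the frequency, percentage and cumulative percentage columns of Table 1.3.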
Fig 1.2 Frequency distribution graph for serum total cholesterol (horizontal axis: serum cholesterol, class midpoints 4.25 to 7.75; vertical axis: frequency, 0 to 12; Mean = 6.29, Std. Dev = 0.76, N = 50)
The difference: compare the frequency distribution graph of a quantitative variable (serum cholesterol, Fig 1.2) with the graph of a qualitative variable (number of majors by college, Fig 1.1).
Usage of frequency distribution graph
1 To describe the distribution
characteristics of the frequencies.
From Table 1.3 and Fig 1.2, we can see that the
serum total cholesterol of most people (37 of 50, 74%)
is between 5.0 and 7.0 mmol/L; the proportion
outside this range is small.
How to describe the distribution characteristics of data?
Central tendency
Dispersion tendency
Describe How Data Are Distributed
Positive-Skewed   Negative-Skewed   Symmetric
Table 2 Mercury concentration of hair in 238 healthy people
Mercury concentration (μg/g)   Number
<0.3                             3
0.3~                            17
0.7~                            66
1.1~                            60
1.5~                            48
1.9~                            18
2.3~                            16
2.7~                             6
3.1~                             1
3.5~                             1
3.9~                             2
Total                          238
Fig Mercury concentration of hair (horizontal axis: hair mercury value, μg/g; vertical axis: number of people): Positive-Skewed
Table 3 Myoglobin concentration in blood serum of 101 normal people
Myoglobin concentration (μg/mL)   Number
0~                                  2
5~                                  3
10~                                 7
15~                                 9
20~                                10
25~                                22
30~                                23
35~                                14
40~                                 9
45~50                               2
Total                             101
Fig Myoglobin concentration in blood serum (horizontal axis: serum myoglobin, μg/mL; vertical axis: number of people): Negative-Skewed
2 From the frequency distribution, we can
easily find outliers (values that are too large or too small).
For instance, all the serum total cholesterol
values lie between 4.0 and 8.0 mmol/L; if one value were 28
(implausibly large), we would call it an
outlier and should check whether it was recorded correctly.
3 It is a way of describing data.
Section 2 Measures of
central tendency
arithmetic mean
geometric mean
Median and Percentile
Mode
Central tendency reflects the average
level of a series of measurements.
The arithmetic mean
[Definition] The arithmetic mean,
also called the mean, is defined as the
sum of the measurements divided by
the total number of measurements.
[Symbols] The population mean is denoted by the Greek letter μ (read “mu”) and the sample mean is denoted by the symbol $\bar{X}$ (read “X-bar”).
[Sample mean]
$\bar{X} = \frac{\sum X}{n}$
n is the total number of observations in the sample.
X is a particular value.
$\sum$ (read “sigma”) indicates the operation
of adding.
[Population mean]
$\mu = \frac{\sum X}{N}$
N is the total number of observations in the population.
[example2.1] The mean score on a given
test can be found for an entire class. Take
a look at this American History class :
[Solution] We find the mean score by
adding all the scores together and
dividing by 10 (the number of
scores).
$\bar{X} = \frac{\sum X}{n} = \frac{90 + 75 + \dots + 85}{10} = 82.4$
[Properties of the Arithmetic Mean]
All the values are included in
computing the mean.
The mean is easily affected by extremely
large or small values.
The sum of the deviations from the mean is zero:
$\sum (X - \bar{X}) = 0$
[Notice]
The mean should only be used for homogeneous
data.
For example, we can compute the mean
height of ten-year-old boys, but it is
not meaningful to calculate the mean height
of boys aged 1 to 14 years.
The mean is an appropriate measure of central
tendency only when the distribution is
approximately normal (symmetric).
Geometric Mean
[Definition]
The geometric mean is defined as the
nth root of the product of the n
numbers.
[Symbol] G
[Formula]
$G = \sqrt[n]{X_1 X_2 \cdots X_n}$
or
$G = \lg^{-1}\left(\frac{\lg X_1 + \lg X_2 + \dots + \lg X_n}{n}\right) = \lg^{-1}\left(\frac{\sum \lg X}{n}\right)$
[Example 2.3] The serum antibody levels of
six patients are listed below.
1:10, 1:20, 1:40, 1:80, 1:80, 1:160
Calculate the geometric mean.
[Solution]
$G = \lg^{-1}\left(\frac{\sum \lg X}{n}\right) = \lg^{-1}\left(\frac{\lg 10 + \lg 20 + \dots + \lg 160}{6}\right) = \lg^{-1}(1.6522) \approx 45$
So the geometric mean is 1:45.
X is the reciprocal of the antibody level, and lg X is the base-10 logarithm of the reciprocal.
n is the sample size.
$\lg^{-1}$ is the inverse logarithm.
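The logarithmic form of the formula can be checked with a few lines of Python. This is a minimal sketch; the function name `geometric_mean` and the variable `reciprocal_titers` are chosen for this example only.

```python
import math


def geometric_mean(values):
    """Inverse logarithm of the mean of the base-10 logarithms."""
    logs = [math.log10(x) for x in values]
    return 10 ** (sum(logs) / len(logs))


# Reciprocals of the antibody levels 1:10, 1:20, ..., 1:160 in Example 2.3.
reciprocal_titers = [10, 20, 40, 80, 80, 160]

g = geometric_mean(reciprocal_titers)
print(f"mean of lg X = {math.log10(g):.4f}")   # about 1.6522
print(f"geometric mean = 1:{g:.0f}")
```

Taking logarithms first keeps the intermediate numbers small, which matters when the product of many values would otherwise be very large.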
[Usage of G]
The geometric mean is often used for data
that form a geometric progression (a constant
ratio between successive values),
such as 1:2, 1:4, 1:8, 1:16, 1:32.
Median
[Definition]
The median, also called the 50th percentile,
is the midpoint of the observations when
they are arranged in ascending order.
[Formula]
When n is odd, the median is the middle value when the data are arranged in ascending order:
$M = X_{\frac{n+1}{2}}$
When n is even, the median is the mean of the middle two values when the data are arranged in ascending order:
$M = \left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right)/2$
[Example 2.5]
Each of 7 children in the second grade
was given a reading aptitude test; the
scores were as shown below.
95 86 64 81 75 76 69
Determine the median test score.
[Solution]
First, we arrange the scores in
ascending order:
64 69 75 76 81 86 95
There are 7 measurements, and the
fourth is the middle value, so the
median is 76. Using the formula:
$M = X_{\frac{n+1}{2}} = X_4 = 76$
[Example 2.6]
An experiment was conducted to measure the
effectiveness of a new procedure for pruning grapes.
Ten workers were each assigned the task of pruning an acre of
grapes. The productivity, measured in worker-
hours/acre, was recorded for each person:
4.4 4.9 3.8 5.2 4.7 4.6 5.4 3.8 4.0 4.3
Determine the median productivity for the group.
[Solution]
Arrange the data in ascending order:
3.8 3.8 4.0 4.3 4.4 4.6 4.7 4.9 5.2 5.4
Compute the mean of the 5th and 6th values:
$M = \left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right)/2 = (X_5 + X_6)/2 = (4.4 + 4.6)/2 = 4.5$
[Exercise]
Exercise capacity (in seconds) was
determined for each of 11 patients
being treated for chronic heart failure.
906 684 897 1320 1200 882
711 837 1008 1170 1056
Determine the median and the mean.
Answer
Mean 970
Median 906
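The odd and even cases of the median formula can be written as one short Python function. This is an illustrative sketch (the function name `median` and the variable names are chosen here); it is applied to Example 2.5, Example 2.6 and the exercise above.

```python
def median(values):
    """Median of ungrouped data: the middle value when n is odd,
    the mean of the two middle values when n is even."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # X_{(n+1)/2} in 1-based numbering
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of X_{n/2} and X_{n/2+1}


reading_scores = [95, 86, 64, 81, 75, 76, 69]                         # Example 2.5
productivity = [4.4, 4.9, 3.8, 5.2, 4.7, 4.6, 5.4, 3.8, 4.0, 4.3]     # Example 2.6
exercise_capacity = [906, 684, 897, 1320, 1200, 882,
                     711, 837, 1008, 1170, 1056]                      # exercise

mean_capacity = sum(exercise_capacity) / len(exercise_capacity)

print(median(reading_scores))
print(median(productivity))
print(median(exercise_capacity), round(mean_capacity))
```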
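When the sample size is very large, or for
grouped data, we can use another
formula to compute the median (P50).
The xth percentile Px divides the ordered data so that x% of the observations lie below it and (100-x)% lie above it; P0 = Min, P50 = M (the median) and P100 = Max.
$P_x = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right)$
$P_{50} = L + \frac{i}{f_m}\left(n \cdot 50\% - \sum f_L\right)$
$f_x$ ($f_m$ for the median): frequency of the group containing the percentile (the median).
i: interval width.
L: lower limit of the group containing the percentile (the median).
$\sum f_L$: cumulative frequency below
the group containing the percentile (the median).
n: sample size.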
[Example 2.7]
Determine the median in Example 1.2.
Cholesterol (mmol/L)   Frequency   Percentage   Cumulative frequency   Cumulative percentage
4.0-4.5                 1           2%           1                      2%
4.5-5.0                 2           4%           3                      6%
5.0-5.5                 4           8%           7                     14%
5.5-6.0                11          22%          18                     36%
6.0-6.5                11          22%          29                     58%
6.5-7.0                11          22%          40                     80%
7.0-7.5                 7          14%          47                     94%
7.5-8.0                 3           6%          50                    100%
Total                  50         100%
In each class interval, the first value is the lower limit and the second is the upper limit.
To determine which interval the median
belongs to,
we must find the first interval for
which the cumulative percentage
reaches 50%. This interval will be the
one containing the median.
For these data, the interval from 6.0
to 6.5 is the first interval for which the
cumulative percentage reaches 50% (58%, as
shown in the last column of the table). So this
interval contains the median. Then,
L = 6.0, $f_m$ = 11, n = 50, i = 0.5, $\sum f_L$ = 18
$P_{50} = L + \frac{i}{f_m}\left(n \cdot 50\% - \sum f_L\right) = 6.0 + \frac{0.5}{11}(25 - 18) = 6.32$
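The grouped-data percentile formula can be sketched as a small Python function. The name `grouped_percentile` and its parameters are hypothetical, introduced only for this illustration; the function assumes equal-width class intervals, as in Table 1.3.

```python
def grouped_percentile(x, lower_limits, frequencies, width):
    """P_x = L + (i / f_x) * (n * x% - cumulative frequency below the group)."""
    n = sum(frequencies)
    target = n * x / 100              # n * x%
    cumulative_below = 0              # cumulative frequency below the current group
    for low, f in zip(lower_limits, frequencies):
        # The first group whose cumulative frequency reaches the target
        # is the one that contains the percentile.
        if f > 0 and cumulative_below + f >= target:
            return low + width / f * (target - cumulative_below)
        cumulative_below += f
    raise ValueError("x must be between 0 and 100")


# Frequency distribution of serum total cholesterol (mmol/L), Example 1.2.
lower_limits = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
frequencies = [1, 2, 4, 11, 11, 11, 7, 3]

p50 = grouped_percentile(50, lower_limits, frequencies, 0.5)
print(f"P50 = {p50:.2f}")   # 6.0 + 0.5/11 * (25 - 18)
```

The same function can be called with any other x, for example 25 or 75, to obtain the quartiles.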
[Exercise]
Calculate P25 and P75 in Example 1.2.
$P_{25} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 5.5 + \frac{0.5}{11}(50 \times 25\% - 7) = 5.75$
$P_{75} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 6.5 + \frac{0.5}{11}(50 \times 75\% - 29) \approx 6.89$
[Properties of the Median]
It is not affected by extreme values.
It is the best index when there is no
exact value at one or both ends of the
distribution.
[Exercise]
A doctor measured the delitescence (incubation period, in days) of an infectious disease in 10 patients. The outcomes are as follows:
6, 13, 5, 9, 12, 10, 8, 11, 8, >14
Calculate the average delitescence.
[Answer]
There is no exact value at the right end of the
distribution, so we should choose the median.
First, we sort the data from the smallest
to the largest:
5 6 8 8 9 10 11 12 13 >14
The median is the mean of the 5th and 6th values, 9 and 10, which is 9.5.
So the average delitescence is 9.5 days.
[Usage of median]
• The median can be used for any type of quantitative variable: not only for data with a normal distribution, but also for data with a skewed distribution or when some values at the ends of the distribution are not known exactly.
• In symmetrical data, the mean theoretically equals the median.
Mode
[Definition] The mode of a set of
measurements is defined to be the
measurement that occurs most
often (with the highest frequency).
[Example 2.8]
Find the mode of 9
undergraduates’ English scores:
76 87 69 76 85 80 79 81 83
The value 76 occurs twice and every other score occurs once,
so the mode is 76.
The mode is the value that
occurs most often. In some cases
there may be more than one
mode.
[Example 2.9]
Find the mode of 10 boys’ heights
(m):
1.45, 1.50, 1.32, 1.37, 1.45, 1.60,
1.48, 1.41, 1.35, 1.50
There are two modes in
this example: 1.45 and 1.50.
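Finding the mode amounts to counting how often each value occurs. The following Python sketch (the function name `modes` is chosen for this illustration) returns every value that shares the highest frequency, so it handles both Example 2.8 and Example 2.9.

```python
from collections import Counter


def modes(values):
    """All values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


english_scores = [76, 87, 69, 76, 85, 80, 79, 81, 83]                     # Example 2.8
heights = [1.45, 1.50, 1.32, 1.37, 1.45, 1.60, 1.48, 1.41, 1.35, 1.50]    # Example 2.9

print(modes(english_scores))   # a single mode
print(modes(heights))          # two modes
```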
Summary
In a normal distribution, the mean,
median, and mode are identical.
For normal distributions, the mean is the
most efficient measure and reflects the information
in all the measurements.
Section 3 Measures of
dispersion tendency
Central tendency reflects the
average level of a quantitative variable.
But it is not enough to know only the central
tendency of the distribution; we
should also describe the variation of
the observations.
Group A: 3 4 5 6 7
Group B: 1 3 5 7 9
Mean of group A=(3+4+5+6+7)/5=5
Mean of group B=(1+3+5+7+9)/5=5
The dispersions of the two groups are
different.
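A short Python check makes the point concrete. It is only a sketch with illustrative helper names (`mean`, `value_range`), and it uses the range (MAX minus MIN) as the simplest measure of spread.

```python
group_a = [3, 4, 5, 6, 7]
group_b = [1, 3, 5, 7, 9]


def mean(values):
    return sum(values) / len(values)


def value_range(values):
    """Range: the difference between MAX and MIN."""
    return max(values) - min(values)


for name, group in (("A", group_a), ("B", group_b)):
    print(f"Group {name}: mean = {mean(group)}, range = {value_range(group)}")
```

Both groups have the same mean, yet the observations in group B are spread over twice the range of group A.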
Range
Interquartile range
Variance or standard deviation
Coefficient of variation
Dispersion tendency reflects the
degree of variability of different
measurements.
Range
[Definition]
Range = Value(max) - Value(min)
The range is the difference between MAX
and MIN.
[Example 3.1]
Determine the range of the following data set.
1, 6, 2, 3, 9, 7, 5
[Solution 3.1]
RANGE = 9 - 1 = 8.
Merit of range
It is the simplest
measure of
data variability.
Limitation of range
It is the least informative measure, because it
reflects only the
difference between
MAX and MIN, and it is
easily affected by
extreme values.
Interquartile Range
[Definition]
The interquartile range is the distance
between the third quartile Q3 (P75) and the
first quartile Q1 (P25).
This distance includes the middle 50
percent of the observations.
Interquartile range = Q3 - Q1
The quartiles Q1, Q2 and Q3 divide the ordered data between the lower end L and the upper end U into four parts, each containing 25% of the observations.
[Example 3.2]
Calculate the IQR in Example 1.2
using the following table.
Cholesterol (mmol/L)   Frequency   Percentage   Cumulative frequency   Cumulative percentage
4.0-4.5                 1           2%           1                      2%
4.5-5.0                 2           4%           3                      6%
5.0-5.5                 4           8%           7                     14%
5.5-6.0                11          22%          18                     36%
6.0-6.5                11          22%          29                     58%
6.5-7.0                11          22%          40                     80%
7.0-7.5                 7          14%          47                     94%
7.5-8.0                 3           6%          50                    100%
Total                  50         100%
In each class interval, the first value is the lower limit and the second is the upper limit.
[Solution 3.2]
First, we calculate P25 and P75:
$P_{25} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 5.5 + \frac{0.5}{11}(50 \times 25\% - 7) = 5.75$
$P_{75} = L + \frac{i}{f_x}\left(n \cdot x\% - \sum f_L\right) = 6.5 + \frac{0.5}{11}(50 \times 75\% - 29) \approx 6.89$
IQR = 6.89 - 5.75 = 1.14
[Properties]
The IQR (Q), although more sensitive to
data pile-up about the midpoint than
the range, is still not sufficient for our
purpose. It reflects only the
variability of the middle 50% of the
measurements, and it is of limited use
in interpreting the variability of a
single set of measurements.
Variance
[Definition]
The population variance of a set of
N measurements X1, X2, … with
arithmetic mean μ is the sum of
the squared deviations divided by
N.
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
[Definition]
The sample variance of a set of n
measurements X1, X2, … with arithmetic
mean $\bar{X}$ is the sum of the squared
deviations divided by n-1.
$s^2 = \frac{\sum (X - \bar{X})^2}{n-1}$
$\bar{X}$ is the sample mean.
n-1 is the degrees of freedom.
$(X - \bar{X})^2$ is the squared deviation.
[Example 3.3]
The time between an electric light stimulus and a bar press to avoid a shock was noted for each of five conditioned rats. Use the data below to compute the sample variance.
Shock avoidance times (seconds): 5,4,3,1,3
[Solution 3.3]
The deviations and the squared deviations are shown below. The sample mean is 16/5 = 3.2.
$X_i$      $X_i - \bar{X}$    $(X_i - \bar{X})^2$
5          1.8                3.24
4          0.8                0.64
3         -0.2                0.04
1         -2.2                4.84
3         -0.2                0.04
TOTAL 16   0                  8.80
Using the total of the squared deviations column, we find the sample variance to be
$s^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{8.8}{4} = 2.2$
[Properties]
All values are used in the calculation.
Unlike the range, it does not depend only on the two extreme values, although extreme values still influence it.
The units of the variance are difficult to
interpret, because they are the square of the original
units.
Standard deviation
[Definition]
The standard deviation is the positive
square root of the variance.
[Symbol]
Population standard deviation σ
Sample standard deviation S
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$
[Example 3.4]
Calculate the sample standard deviation in Example 3.3.
[Solution 3.4]
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} = \sqrt{\frac{8.8}{4}} = \sqrt{2.2} = 1.48$
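The calculations in Examples 3.3 and 3.4 can be reproduced with a short Python sketch. The function name `sample_variance` and the variable names are chosen for this illustration; the divisor n-1 follows the sample variance definition.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Shock avoidance times (seconds) of the five rats in Example 3.3.
times = [5, 4, 3, 1, 3]

s2 = sample_variance(times)   # 8.8 / 4
s = math.sqrt(s2)             # positive square root of the variance

print(f"sample variance = {s2:.2f}")
print(f"sample standard deviation = {s:.2f}")
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n-1 divisor and can serve as a cross-check.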
[Properties]
– It is the best measure for describing
the variability of a quantitative variable,
because it reflects the variability of all
the observations.
– It is appropriate only when the data come from an
approximately normal distribution.
Coefficient of Variation
[Definition]
The coefficient of variation is the ratio of
the standard deviation to the arithmetic
mean, expressed as a percentage:
$CV = \frac{s}{\bar{X}} \times 100\%$
[Usage]
To compare the variability of measurements with different units,
such as height
(cm) and weight (kg).
To compare variability when the means of two groups are quite
different, one very small and the other
very large, such as the weights of
elephants and infants.
[Example 3.6]
A doctor measured the heights and
weights of 50 people; the outcome is
Height: $\bar{X}$ = 165 cm, S = 8.5 cm
Weight: $\bar{X}$ = 64 kg, S = 7 kg
Which has the larger variability,
height or weight?
[Solution 3.6]
Height: CV = 8.5/165 × 100% = 5.15%
Weight: CV = 7/64 × 100% = 10.9%
So the variability of weight is much larger.
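The comparison in Example 3.6 takes only a few lines of Python. This is a minimal sketch; the function name `coefficient_of_variation` is chosen for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV = s / mean * 100, expressed as a percentage."""
    return sd / mean * 100


cv_height = coefficient_of_variation(8.5, 165)   # height in cm
cv_weight = coefficient_of_variation(7, 64)      # weight in kg

print(f"CV of height = {cv_height:.2f}%")
print(f"CV of weight = {cv_weight:.1f}%")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though height and weight are measured in different units.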
Description of data from a normal distribution: $\bar{X} \pm S$
Description of data from a skewed distribution: $M\ (P_{25}, P_{75})$