Source: web.cjcu.edu.tw/~jdwu/biostat01/lect003.pdf
Chapter 3. Data Description
Graphical Methods
• Pie chart
– It displays the percentage of the total number of measurements falling into each category of the variable by partitioning a circle.
• Bar chart
– It uses the height of each bar to represent the number of observations in a category.
• Frequency histogram
– Divide the range of the measurements by the approximate number of class intervals desired to obtain the class interval width.
– Construct the frequency table, which lists the number of measurements (the frequency) in each class interval, and then draw a bar over each interval to represent its frequency.
– Frequency histogram versus relative frequency histogram: the relative frequency histogram plots each frequency divided by the total number of measurements.
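The construction of a frequency table can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `frequency_table`, the toy measurements, and the convention that each class interval includes its lower limit but not its upper limit are all assumptions.

```python
def frequency_table(data, lower, width, num_classes):
    """Return (lower limit, upper limit, frequency, relative frequency) per class."""
    counts = [0] * num_classes
    for y in data:
        # Assumed convention: each class is [lower limit, upper limit).
        index = int((y - lower) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{y} falls outside the class intervals")
        counts[index] += 1
    n = len(data)
    return [
        (lower + i * width, lower + (i + 1) * width, f, f / n)
        for i, f in enumerate(counts)
    ]


# Hypothetical measurements, chosen only to illustrate the mechanics.
measurements = [1, 2, 2, 3, 5, 7, 8, 8, 9]
table = frequency_table(measurements, lower=0, width=5, num_classes=2)
for low, high, freq, rel in table:
    # One text "bar" per class: its length is the frequency.
    print(f"{low:>2} - {high:<2} {'#' * freq}  f = {freq}, f/n = {rel:.2f}")
```

Plotting the `f` column gives the frequency histogram, and plotting the `f/n` column gives the relative frequency histogram; the two have the same shape and differ only in the vertical scale.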
Statistical Distribution (1)
• Probability
– The chance of an event occurring.
• Probability distribution
– The probability distribution of a discrete random variable is a table, graph, formula, or other device used to specify all possible values of the random variable along with their respective probabilities.
• Unimodal distribution
– A distribution whose histogram has one major peak is called unimodal.
• Bimodal distribution
– A distribution whose histogram has two major peaks is called bimodal.
Statistical Distribution (2)
• Uniform distribution
– Every class interval of equal width contains approximately the same number of observations.
• Normal distribution
– The relative frequency histogram or probability distribution is a smooth bell-shaped curve.
• Lognormal distribution
– After taking the logarithm of the values of a random variable, the relative frequency histogram shows a smooth bell-shaped curve.
Exploratory Data Analysis
• The aim of data analysis is to explore and understand the characteristics of the data.
• The basic tools for this analysis are graphical techniques.
– Stem-and-leaf plot: a clever, simple device for constructing a histogram-like picture of a frequency distribution.
Describing Data on a Single Variable
• Measures of central tendency
– They describe the center of the distribution of measurements.
• Measures of variability
– They describe how the measurements vary about the center of the distribution.
• Parameter versus statistic
– Parameters are the numerical descriptive measures for a population.
– Statistics are the numerical descriptive measures for a sample.
Measures of Central Tendency (1)
• Mode
– The mode of a set of measurements is the measurement that occurs most often (with the highest frequency).
– The mode is also commonly used as a measure of popularity that reflects central tendency or opinion.
• Median
– The middle value when the measurements are arranged from lowest to highest.
– For an even number of measurements, the median is the average of the two middle values when the measurements are arranged from lowest to highest.
– For an odd number of measurements, the median is the middle value.
Measures of Central Tendency (2)
• Grouped data median
– For grouped data, the median is calculated as:
$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b)$$
where
$L$ = lower class limit of the interval that contains the median
$n$ = total frequency
$cf_b$ = the sum of frequencies (cumulative frequency) for all classes before the median class
$f_m$ = frequency of the class interval containing the median
$w$ = interval width
Measures of Central Tendency (3)
• Arithmetic mean (or mean)
– The sum of the measurements divided by the total number of measurements.
– Population mean: $\mu$
– Sample mean: $\bar{y}$
$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$
– Grouped data mean
$$\bar{y} \cong \frac{\sum_{i} f_i y_i}{n}$$
where the sum runs over the class intervals and
$f_i$ = frequency of the $i$-th class interval
$y_i$ = midpoint of the $i$-th class interval
$n$ = the total number of measurements
Measures of Central Tendency (4)
• Trimmed mean
– A variation of the mean.
– It drops the highest and lowest extreme values and averages the rest.
– A 5% trimmed mean drops the highest 5% and the lowest 5% of the measurements and averages the rest.
– In a limiting sense, the median is a 50% trimmed mean.
• Outlier
– An extreme value among the measurements.
– The mean is subject to distortion due to the presence of one or more outliers.
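A short Python sketch may help show how trimming limits the influence of an outlier. The function name `trimmed_mean`, the rule of rounding the number of dropped values down, and the data are assumptions made for illustration; they do not come from the slides.

```python
def trimmed_mean(data, percent):
    """Drop the lowest and highest `percent`% of the values, then average the rest.

    `percent` is an integer below 50; the count dropped from each end is
    rounded down (an assumed convention).
    """
    ordered = sorted(data)
    k = len(ordered) * percent // 100  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one outlier (98).
values = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]
plain_mean = sum(values) / len(values)  # 16.0, pulled upward by the outlier
trimmed = trimmed_mean(values, 10)      # 7.5, after dropping 2 and 98
print(plain_mean, trimmed)
```

With ten values, a 10% trim removes exactly one value from each end, so the outlier 98 no longer contributes to the average.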
Measures of Central Tendency (5)
• Relationship among the mean ($\mu$), the trimmed mean (TM), the median ($M_d$), and the mode ($M_o$) for distributions with different skewness.
• We are not restricted to using only one measure of central tendency. For some data sets, more than one of these measures is needed to provide an accurate descriptive summary of central tendency.
Major Characteristics of Each Measure of Central Tendency (1)
• Mode
– It is the most frequent or probable measurement in the data set.
– There can be more than one mode for a data set.
– It is not influenced by extreme measurements.
– Modes of subsets cannot be combined to determine the mode of the complete data set.
– For grouped data, its value can change depending on the categories used.
– It is applicable to both qualitative and quantitative data.
Major Characteristics of Each Measure of Central Tendency (2)
• Median
– It is the central value; 50% of the measurements lie above it and 50% fall below it.
– There is only one median for a data set.
– It is not influenced by extreme measurements.
– Medians of subsets cannot be combined to determine the median of the complete data set.
– For grouped data, its value is rather stable even when the data are organized into different categories.
– It is applicable to quantitative data only.
Major Characteristics of Each Measure of Central Tendency (3)
• Mean
– It is the arithmetic average of the measurements in a data set.
– There is only one mean for a data set.
– Its value is influenced by extreme measurements; trimming can help to reduce the degree of influence.
– Means of subsets can be combined to determine the mean of the complete data set.
– It is applicable to quantitative data only.
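The claim that subset means can be combined, while subset medians cannot, can be checked with a small Python sketch. The helper name `combined_mean` and the two subsets are hypothetical.

```python
import statistics


def combined_mean(means, sizes):
    """Weighted average of subset means, weighting each by its subset size."""
    total = sum(m * n for m, n in zip(means, sizes))
    return total / sum(sizes)


# Two hypothetical subsets of one data set.
subset_a = [3, 4, 5]       # mean 4, median 4
subset_b = [6, 8, 10, 12]  # mean 9, median 9

combined = combined_mean(
    [statistics.mean(subset_a), statistics.mean(subset_b)],
    [len(subset_a), len(subset_b)],
)
overall_mean = statistics.mean(subset_a + subset_b)
overall_median = statistics.median(subset_a + subset_b)
print(combined, overall_mean, overall_median)
```

The combined value equals the mean of the pooled data (48/7), because the subset means and sizes carry all the information the mean needs. The pooled median is 6, which cannot be recovered from the subset medians 4 and 9 alone.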
Measures of Variability (1)
• Variability
– The dispersion or spread of the measurements.
[Figure: three relative frequency histograms of y (horizontal axis from 0 to 30; vertical axis relative frequency from 0.0 to 0.6).]
Measures of Variability (2)
• Range
– The difference between the largest and the smallest measurements of the data set.
– The range is the simplest but least useful measure of data variability.
– For grouped data, because the individual measurements are not known, the range is taken to be the difference between the upper limit of the last interval and the lower limit of the first interval.
Measures of Variability (3)
• Percentile
– The p-th percentile of a set of n measurements arranged in order of magnitude is the value that has at most p% of the measurements below it and at most (100 − p)% above it.
– Specific percentiles of interest are the 25th, 50th, and 75th percentiles, often called the lower quartile, the middle quartile (median), and the upper quartile, respectively.
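[Figure: relative frequency histogram (horizontal axis from 5 to 25; vertical axis relative frequency from 0.00 to 0.14) marking the lower quartile, the median, and the upper quartile. The quartiles split the measurements into four parts of 25% each, and the IQR spans the lower to the upper quartile.]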
Measures of Variability (4)
• For grouped data, the following formula can be used to approximate the percentiles for the original data:
$$P = L + \frac{w}{f_p}\left(\frac{p}{100}\cdot n - cf_b\right)$$
where
$P$ = percentile of interest (the $p$-th percentile)
$L$ = lower limit of the class interval that includes the percentile of interest
$n$ = total frequency
$cf_b$ = cumulative frequency for all class intervals before the percentile class
$f_p$ = frequency of the class interval that includes the percentile of interest
$w$ = interval width
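The grouped-data percentile formula can be sketched in Python as follows. The function name `grouped_percentile` and the rule for locating the percentile class (the first class whose cumulative frequency reaches n·p/100) are assumptions. The frequencies are those of the chick weight-gain table used in Example 2 of this chapter.

```python
def grouped_percentile(p, lower_limits, frequencies, width):
    """Approximate the p-th percentile from a frequency table."""
    n = sum(frequencies)
    target = n * p / 100
    cf_b = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cf_b + f >= target:
            # P = L + (w / f_p) * (n * p / 100 - cf_b)
            return lower + (width / f) * (target - cf_b)
        cf_b += f
    raise ValueError("p must be between 0 and 100")


# Chick weight-gain classes: 3.55-3.65, 3.65-3.75, ..., 4.85-4.95.
class_width = 0.1
lower_limits = [round(3.55 + class_width * i, 2) for i in range(14)]
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

q1 = grouped_percentile(25, lower_limits, frequencies, class_width)
median = grouped_percentile(50, lower_limits, frequencies, class_width)
q3 = grouped_percentile(75, lower_limits, frequencies, class_width)
print(f"{q1:.2f} {median:.2f} {q3:.2f}")
```

With p = 50 the function reduces to the grouped-data median formula and gives about 4.28 for the chick data, matching Example 2. The lower and upper quartiles come out at about 4.06 and 4.51.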
Measures of Variability (5)
• Interquartile range (IQR)
– The difference between the upper and lower quartiles.
– IQR = 75th percentile − 25th percentile
– The IQR ignores the extremes in the data set completely.
– The IQR does not provide a lot of useful information about the variability of a single set of measurements, but it can be quite useful when comparing the variabilities of two or more data sets.
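As a hedged illustration of using the IQR to compare two data sets, the sketch below relies on `statistics.quantiles` from the Python standard library (Python 3.8 or later). Its default interpolation rule is only one of several quartile conventions and may differ from the one used in the course textbook. Both data sets are hypothetical.

```python
import statistics


def interquartile_range(data):
    """IQR = upper quartile - lower quartile (default 'exclusive' method)."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


# Two hypothetical data sets with the same median (6) but different spread.
wide = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
narrow = [4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8]

iqr_wide = interquartile_range(wide)      # quartiles 3 and 9
iqr_narrow = interquartile_range(narrow)  # quartiles 5 and 7
print(iqr_wide, iqr_narrow)
```

The two sets share the same center, yet the first has an IQR three times as large as the second, which is the kind of comparison the IQR is suited for.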
Measures of Variability (6)
• Variance
– Many different measures of variability can be constructed by using the deviation $y - \bar{y}$.
– The measure that involves the sum of the squared deviations of the measurements from their mean is called the variance.
– Population variance, where $\mu$ is the population mean and $n$ is the number of measurements in the population:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$$
– Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
– The use of $(n-1)$ as the denominator of $s^2$ is not arbitrary. This definition of the sample variance makes it an unbiased estimator of the population variance $\sigma^2$.
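The two variance definitions differ only in the divisor, which a small Python sketch makes concrete. The function names and the data are hypothetical; `statistics.pvariance` and `statistics.variance` are the standard-library counterparts and are used here only as a cross-check.

```python
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    mu = sum(values) / len(values)
    return sum((y - mu) ** 2 for y in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    y_bar = sum(values) / len(values)
    return sum((y - y_bar) ** 2 for y in values) / (len(values) - 1)


# Hypothetical measurements: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma2 = population_variance(data)  # 32 / 8
s2 = sample_variance(data)          # 32 / 7

# Cross-check against the standard library.
assert abs(sigma2 - statistics.pvariance(data)) < 1e-12
assert abs(s2 - statistics.variance(data)) < 1e-12
print(sigma2, s2)
```

The sample variance is always slightly larger than the divide-by-n version computed from the same data, and the gap shrinks as n grows.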
Measures of Variability (7)
• Standard deviation
– The positive square root of the variance.
– One reason for defining the standard deviation is that it yields a measure of variability having the same units of measurement as the original data, whereas the units for variance are the square of the measurement units.
– The other reason for using the standard deviation is to apply the empirical rule to mound-shaped or bell-shaped distributions.
– Empirical rule: given a set of n measurements possessing a mound-shaped distribution,
• the interval $\bar{y} \pm s$ contains about 68% of the measurements;
• the interval $\bar{y} \pm 2s$ contains about 95% of the measurements;
• the interval $\bar{y} \pm 3s$ contains about 99.7% of the measurements.
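The percentages in the empirical rule match the areas under a normal curve within 1, 2, and 3 standard deviations of the mean. The sketch below computes those areas with `statistics.NormalDist` (Python 3.8 or later) and adds a hypothetical helper, `fraction_within`, for checking a data set. The small sample is made up, so it follows the rule only loosely.

```python
import statistics
from statistics import NormalDist

# Area under the standard normal curve within k standard deviations of the mean.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}


def fraction_within(data, k):
    """Fraction of the measurements inside y_bar +/- k * s."""
    y_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    inside = [y for y in data if y_bar - k * s <= y <= y_bar + k * s]
    return len(inside) / len(data)


# Hypothetical sample of eight measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1, 2, 3):
    print(k, round(coverage[k], 4), fraction_within(sample, k))
```

The normal-curve areas round to 68%, 95%, and 99.7%. For real data the observed fractions are expected to be close to these values only when the histogram is roughly mound-shaped and the sample is reasonably large.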
Measures of Variability (8)
• Approximating s
– The empirical rule states that approximately 95% of the measurements lie in the interval $\bar{y} \pm 2s$. The length of this interval is therefore $4s$. Because the range of the measurements is approximately $4s$, we can obtain an approximate value for $s$ by dividing the range by 4:
$$\text{approximate value of } s = \frac{\text{range}}{4}$$
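As a rough check of the range/4 approximation, the sketch below uses the totals from the chick weight-gain tables in Examples 2 and 3 of this chapter. The grouped-data standard deviation with divisor n − 1 is an assumption (the slides give the squared-deviation column but not this formula), and the range is the grouped-data range defined earlier.

```python
import math

# Totals taken from the chick weight-gain tables in Examples 2 and 3.
n = 100                   # total number of measurements
sum_sq_dev = 9.4936       # total of the f_i * (y_i - y_bar)^2 column
first_lower_limit = 3.55  # lower limit of the first class interval
last_upper_limit = 4.95   # upper limit of the last class interval

# Grouped-data range, then the range / 4 approximation of s.
grouped_range = last_upper_limit - first_lower_limit
s_approx = grouped_range / 4

# Assumption: grouped-data standard deviation with divisor n - 1.
s_grouped = math.sqrt(sum_sq_dev / (n - 1))

print(f"range / 4 = {s_approx:.3f}")
print(f"grouped s = {s_grouped:.3f}")
```

The approximation gives about 0.35 against about 0.31 from the table, which is the level of agreement to expect from a quick check rather than an estimate.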
Measures of Variability (9)
• Coefficient of variation (CV)
– It measures the variability in the values in a population relative to the magnitude of the population mean:
$$CV = \frac{\sigma}{\mu}$$
– The CV is a unit-free number because the standard deviation and the mean are measured using the same units. It can be used to compare the variability in two considerably different processes or populations.
– In many applications, the CV is expressed as a percentage:
$$CV = 100\left(\frac{\sigma}{\mu}\right)\%$$
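A minimal Python sketch of the CV comparison follows. The function name `coefficient_of_variation`, its `as_percent` flag, and the two populations are hypothetical and serve only to show why a unit-free measure is convenient.

```python
def coefficient_of_variation(sigma, mu, as_percent=False):
    """CV = sigma / mu, optionally expressed as a percentage."""
    if mu == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    cv = sigma / mu
    return 100 * cv if as_percent else cv


# Hypothetical populations measured in different units.
cv_weights = coefficient_of_variation(sigma=5.0, mu=50.0, as_percent=True)      # kilograms
cv_lengths = coefficient_of_variation(sigma=100.0, mu=2000.0, as_percent=True)  # millimetres

print(f"weights: {cv_weights:.1f}%  lengths: {cv_lengths:.1f}%")
```

The lengths have the larger standard deviation in raw terms (100 versus 5), yet relative to their mean they vary less: about 5% against about 10% for the weights.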
Examination of the Shape of a Distribution
• Stem-and-leaf plot
• Boxplot
– The boxplot is more concerned with the symmetry of the distribution and incorporates numerical measures of central tendency and location to study the variability of the measurements and the measurements in the tails of the distribution.
• Box-and-whiskers plot
– Any value beyond an inner fence on either side is called a mild outlier, and any value beyond an outer fence on either side is called an extreme outlier.
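The outlier rule can be sketched in Python using the fence definitions from the Boxplot (1) slide of this chapter: inner fences at Q1 − 1.5(IQR) and Q3 + 1.5(IQR), outer fences at Q1 − 3(IQR) and Q3 + 3(IQR). The function name `classify` and the quartile values are hypothetical.

```python
def classify(value, q1, q3):
    """Label a measurement using the inner and outer fences of a boxplot."""
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr
    if value < lower_outer or value > upper_outer:
        return "extreme outlier"
    if value < lower_inner or value > upper_inner:
        return "mild outlier"
    return "not an outlier"


# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10.
# Inner fences: -5 and 35. Outer fences: -20 and 50.
labels = {y: classify(y, q1=10, q3=20) for y in (15, 36, -6, 51)}
print(labels)
```

Here 15 lies inside the inner fences, 36 and −6 fall between an inner and an outer fence (mild outliers), and 51 lies beyond the upper outer fence (extreme outlier).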
Homework
• 3.42 (p.94)
• 3.44 (p.94)
• 3.65 (p.112)
• 3.72, 3.73, 3.74 (p.113)
Pie Chart
[Figure: pie chart with the following slices.]
Cola            63.6%
Lemon-lime      12.7%
Dr Pepper-type   6.3%
Root beer        3.0%
Orange           5.8%
Other            8.6%
Bar Chart
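[Figure: bar chart of the number of workers (vertical axis, 0 to 6000) for Great Britain, West Germany, Japan, Netherlands, and Ireland (horizontal axis labeled "City"). Bar labels, in extraction order: 138, 6500, 1450, 1200, 200.]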
Frequency Histogram
[Figure: frequency histogram. Horizontal axis: class interval for weight gain (3.55 to 4.95); vertical axis: frequency (0 to 12).]
Relative Frequency Histogram
[Figure: relative frequency histogram. Horizontal axis: class intervals for weight gain (3.55 to 4.95); vertical axis: relative frequency (0.00 to 0.12).]
Influence of the Number of Class Intervals on the Appearance of Histograms
[Figure: three frequency histograms of chick weight gain (grams), each drawn with a different number of class intervals. Horizontal axes run from 3.6 to about 5.0; vertical axes show frequency.]
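Uniform Distribution
[Figure: frequency histogram of Y (horizontal axis from 0.0 to 1.0; vertical axis frequency from 0 to 150).]
Normal Distribution
[Figure: frequency histogram of Y (horizontal axis from 2 to 14; vertical axis frequency from 0 to 80).]
Unimodal Distribution (right-skewed)
[Figure: frequency histogram of Y (horizontal axis from 0 to 15; vertical axis frequency from 0 to 50).]
Unimodal Distribution (left-skewed)
[Figure: frequency histogram of Y (horizontal axis from 0 to 14; vertical axis frequency from 0 to 50).]
Bimodal Distribution
[Figure: frequency histogram of Y (horizontal axis from 0 to 20; vertical axis frequency from 0 to 30).]
Bimodal Distribution (left-skewed)
[Figure: frequency histogram of Y (horizontal axis from 0 to 25; vertical axis frequency from 0 to 30).]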
Stem-and-leaf Plot
N = 200   Median = 3.565317   Quartiles = 2.39919, 5.23789
Decimal point is at the colon
1 : 00002222333444
1 : 6666777888999999
2 : 0001111122222233333444444
2 : 55555555555667778889999
3 : 00112222223333444
3 : 555566677778888999
4 : 0000000111222222233333444
4 : 5566789
5 : 00022333334
5 : 5566777889999
6 : 00124
6 : 6888
7 : 01124
7 : 57
8 : 24
8 : 559
9 : 02
9 : 9
High: 10.96026 10.99154 11.00564 11.76325 11.86953
High: 12.47704 12.84776
Example 1. Median
Each of 10 children in the second grade was given a reading aptitude test. The scores were as follows:
id      1   2   3   4   5   6   7   8   9  10
grade  95  86  78  90  62  73  89  92  84  76
Determine the median test score.
Solution: Sort the scores:
62 73 76 78 84 86 89 90 92 95
Because there is an even number of measurements, the median is the average of the two middle scores:
$$\text{median} = \frac{84 + 86}{2} = 85$$
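The same calculation can be written as a short Python function. The name `median` is just an illustrative helper that follows the odd/even rule stated earlier in this chapter; the scores are those of Example 1.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd n: the single middle value
    # Even n: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
median_score = median(scores)
print(median_score)  # 85.0
```

With ten scores the two middle values after sorting are 84 and 86, so the function returns 85.0, in agreement with the hand calculation.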
Example 2. Grouped Data Median
• The frequency table for the chick data is given below. Compute the median weight gain for these data.
Class Interval    fi   cum(fi)   fi/n   cum(fi/n)
3.55 - 3.65        1       1     0.01     0.01
3.65 - 3.75        1       2     0.01     0.02
3.75 - 3.85        6       8     0.06     0.08
3.85 - 3.95        6      14     0.06     0.14
3.95 - 4.05       10      24     0.10     0.24
4.05 - 4.15       10      34     0.10     0.34
4.15 - 4.25       13      47     0.13     0.47
4.25 - 4.35       11      58     0.11     0.58
4.35 - 4.45       13      71     0.13     0.71
4.45 - 4.55        7      78     0.07     0.78
4.55 - 4.65        6      84     0.06     0.84
4.65 - 4.75        7      91     0.07     0.91
4.75 - 4.85        5      96     0.05     0.96
4.85 - 4.95        4     100     0.04     1.00
Totals       n = 100             1.00
$$L = 4.25,\quad n = 100,\quad cf_b = 47,\quad f_m = 11,\quad w = 0.1$$
$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b) = 4.25 + \frac{0.1}{11}\,(50 - 47) = 4.28$$
Example 3. Grouped Data Mean
• The actual value of the sample mean is:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{429.2}{100} = 4.292$$
• To use the grouped data formula to calculate the mean:
Class Interval    fi    yi    fiyi    fi(yi − ȳ)²
3.55 - 3.65        1   3.6     3.6     0.478864
3.65 - 3.75        1   3.7     3.7     0.350464
3.75 - 3.85        6   3.8    22.8     1.452384
3.85 - 3.95        6   3.9    23.4     0.921984
3.95 - 4.05       10   4.0    40.0     0.852640
4.05 - 4.15       10   4.1    41.0     0.368640
4.15 - 4.25       13   4.2    54.6     0.110032
4.25 - 4.35       11   4.3    47.3     0.000704
4.35 - 4.45       13   4.4    57.2     0.151632
4.45 - 4.55        7   4.5    31.5     0.302848
4.55 - 4.65        6   4.6    27.6     0.569184
4.65 - 4.75        7   4.7    32.9     1.165248
4.75 - 4.85        5   4.8    24.0     1.290320
4.85 - 4.95        4   4.9    19.6     1.478656
Totals       n = 100         429.2     9.493600
$$\bar{y} \cong \frac{\sum_{i} f_i y_i}{n} = \frac{429.2}{100} = 4.292$$
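The grouped-data column totals can be reproduced with a few lines of Python. The variable names are illustrative; the frequencies and midpoints are those of the chick weight-gain table in this example.

```python
# Frequencies and class midpoints from the chick weight-gain table.
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
midpoints = [round(3.6 + 0.1 * i, 1) for i in range(14)]  # 3.6, 3.7, ..., 4.9

n = sum(frequencies)

# Grouped data mean: sum of f_i * y_i divided by n.
total_fy = sum(f * y for f, y in zip(frequencies, midpoints))
grouped_mean = total_fy / n

# Total of the last column: sum of f_i * (y_i - y_bar)^2.
sum_sq_dev = sum(f * (y - grouped_mean) ** 2 for f, y in zip(frequencies, midpoints))

print(n, round(total_fy, 1), round(grouped_mean, 3), round(sum_sq_dev, 4))
```

The totals agree with the table: n = 100, a column total of 429.2, a grouped mean of 4.292, and a squared-deviation total of 9.4936.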
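Relation Among Mean, Trimmed Mean, Median and Mode (1)
[Figure: distribution curve (horizontal axis from −4 to 4; vertical axis from 0.0 to 0.3) with the positions of $\mu$, TM, $M_d$, and $M_o$ marked.]
Relation Among Mean, Trimmed Mean, Median and Mode (2)
[Figure: distribution curve (horizontal axis from 2 to 12; vertical axis from 0.0 to 0.3) with the positions of $\mu$, TM, $M_d$, and $M_o$ marked.]
Relation Among Mean, Trimmed Mean, Median and Mode (3)
[Figure: distribution curve (vertical axis from 0.00 to 0.20) with the positions of $\mu$, TM, $M_d$, and $M_o$ marked.]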
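Boxplot (1)
[Figure: boxplot on a vertical axis running from 3.6 to 4.8, with Q1, the median, and Q3 labeled, and with the fences defined as follows.]
Upper inner fence: Q3 + 1.5(IQR)
Lower inner fence: Q1 − 1.5(IQR)
Upper outer fence: Q3 + 3(IQR)
Lower outer fence: Q1 − 3(IQR)
Boxplot (2)
[Figure: boxplot on a vertical axis running from 20 to 140.]
Boxplot (3)
[Figure: boxplot on a vertical axis running from 20 to 140.]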
Scatterplot
[Figure: scatterplot of age (vertical axis, 15 to 40) against base (horizontal axis, 5 to 155).]