chapter 3. data descriptionweb.cjcu.edu.tw/~jdwu/biostat01/lect003.pdf– for grouped data, because...

Chapter 3. Data Description

Graphical Methods

• Pie chart
– Displays the percentage of the total number of measurements falling into each category of the variable by partitioning a circle.

• Bar chart
– Uses the height of each bar to represent the number of observations in each category.

• Frequency histogram
– Divide the range of the measurements by the approximate number of class intervals desired to obtain the class width. Construct a frequency table that lists the number of measurements (the frequency) in each class interval, then draw bars whose heights represent the frequencies.
– Frequency histogram versus relative frequency histogram
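The construction of the frequency table can be sketched in a few lines of Python. This is a minimal illustration assuming equal-width class intervals that span the data range; the function name `frequency_table` and the sample values are hypothetical and not part of the lecture.

```python
def frequency_table(values, n_classes):
    """Count measurements in n_classes equal-width intervals spanning the data range."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes  # range divided by the number of classes
    counts = [0] * n_classes
    for v in values:
        # The largest value is placed in the last interval instead of opening a new one.
        idx = min(int((v - low) / width), n_classes - 1)
        counts[idx] += 1
    n = len(values)
    # Each row: (lower limit, upper limit, frequency, relative frequency)
    return [(low + i * width, low + (i + 1) * width, c, c / n)
            for i, c in enumerate(counts)]

# Hypothetical measurements, chosen only for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
table = frequency_table(data, 4)
for lo, hi, freq, rel in table:
    print(f"{lo:5.2f} - {hi:5.2f}  f = {freq}  f/n = {rel:.2f}")
```

Plotting the third column gives a frequency histogram; plotting the fourth gives the relative frequency histogram, which has the same shape but a vertical scale that sums to 1.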

Statistical Distribution (1)

• Probability
– The chance of an event occurring.

• Probability distribution
– The probability distribution of a discrete random variable is a table, graph, formula, or other device used to specify all possible values of the discrete random variable along with their respective probabilities.

• Unimodal distribution
– A histogram with one major peak is called unimodal.

• Bimodal distribution
– A histogram with two major peaks is called bimodal.

Statistical Distribution (2)

• Uniform distribution
– Every interval of equal width has the same number of observations.

• Normal distribution
– The relative frequency histogram or probability distribution is a smooth bell-shaped curve.

• Lognormal distribution
– After taking the logarithm of the values of a random variable, the relative frequency histogram shows a smooth bell-shaped curve.

Exploratory Data Analysis

• The aim of exploratory data analysis is to explore and understand the characteristics of the data.

• The basic tools for this analysis are graphical techniques.
– Stem-and-leaf plot

• The stem-and-leaf plot is a clever, simple device for constructing a histogram-like picture of a frequency distribution.

Describing Data on a Single Variable

• Measures of central tendency
– Describe the center of the distribution of measurements.

• Measures of variability
– Describe how the measurements vary about the center of the distribution.

• Parameter versus statistic
– Parameters are the numerical descriptive measures for a population.
– Statistics are the numerical descriptive measures for a sample.

Measures of Central Tendency (1)

• Mode
– The mode of a set of measurements is the measurement that occurs most often (with the highest frequency).
– The mode is also commonly used as a measure of popularity that reflects central tendency or opinion.

• Median
– The middle value when the measurements are arranged from lowest to highest.
– For an even number of measurements, the median is the average of the two middle values when the measurements are arranged from lowest to highest.
– For an odd number of measurements, the median is the middle value.
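The even and odd cases can be made concrete with a short Python sketch. The function below is an illustration written for these notes, not code from the lecture; the scores are the ten reading-test scores used in Example 1 of this chapter.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]  # reading scores from Example 1
print(median(scores))        # 85.0, the average of 84 and 86
print(median([62, 73, 76]))  # 73, the middle of three values
```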


• Grouped data median
– For grouped data, the median is calculated as:

$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b)$$

where
$L$ = lower class limit of the interval that contains the median
$n$ = total frequency
$cf_b$ = the sum of frequencies (cumulative frequency) for all classes before the median class
$f_m$ = frequency of the class interval containing the median
$w$ = interval width
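As a quick check of the grouped-data median formula, the sketch below evaluates it in Python. The function name `grouped_median` is an illustrative choice; the input values are those read from the chick weight-gain frequency table in Example 2 of this chapter.

```python
def grouped_median(L, n, cf_b, f_m, w):
    """Approximate the median from a frequency table.

    L    : lower class limit of the interval containing the median
    n    : total frequency
    cf_b : cumulative frequency of all classes before the median class
    f_m  : frequency of the class interval containing the median
    w    : interval width
    """
    return L + (w / f_m) * (0.5 * n - cf_b)

# Values from the chick weight-gain table (Example 2).
median_estimate = grouped_median(L=4.25, n=100, cf_b=47, f_m=11, w=0.1)
print(round(median_estimate, 2))  # 4.28
```

The formula interpolates linearly inside the median class: 3 of its 11 measurements are needed to reach the 50th measurement, so the estimate sits 3/11 of the way across the interval.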


• Arithmetic mean (or mean)
– The sum of the measurements divided by the total number of measurements.
– Population mean: $\mu$
– Sample mean: $\bar{y}$

$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$

– Grouped data mean

$$\bar{y} \cong \frac{\sum_{i} f_i\, y_i}{n}$$

where the sum runs over the class intervals and
$f_i$ = frequency of the $i$-th class interval
$y_i$ = midpoint of the $i$-th class interval
$n$ = the total number of measurements
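The grouped-data mean can be sketched in Python as a frequency-weighted average of the class midpoints. The function name `grouped_mean` is illustrative; the midpoints and frequencies are taken from the chick weight-gain table used in Example 3 of this chapter.

```python
def grouped_mean(midpoints, freqs):
    """Approximate the mean as sum(f_i * y_i) / n, with n the total frequency."""
    n = sum(freqs)
    return sum(f * y for f, y in zip(freqs, midpoints)) / n

# Class midpoints and frequencies from the chick weight-gain table (Example 3).
midpoints = [3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9]
freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

print(round(grouped_mean(midpoints, freqs), 3))  # 4.292
```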


• Trimmed mean
– A variation of the mean.
– It drops the highest and lowest extreme values and averages the rest.
– A 5% trimmed mean drops the highest 5% and the lowest 5% of the measurements and averages the rest.
– In a limiting sense, the median is a 50% trimmed mean.

• Outlier
– An extreme value among the measurements.
– The mean is subject to distortion due to the presence of one or more outliers.
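The effect of trimming on an outlier can be illustrated with a small Python sketch. The function `trimmed_mean`, the choice to round the number of dropped values down, and the data are assumptions made for illustration only.

```python
def trimmed_mean(values, proportion):
    """Drop the given proportion of values from each end, then average the rest.

    Assumes 0 <= proportion < 0.5; the count dropped per end is rounded down.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)        # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value (100).
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
print(sum(data) / len(data))      # 15.4, the mean is pulled up by the outlier
print(trimmed_mean(data, 0.10))   # 6.5, after dropping 2 and 100
```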


• Relationship among the mean ($\mu$), the trimmed mean (TM), the median ($M_d$), and the mode ($M_o$) for distributions with different skewness.

• We are not restricted to using only one measure of central tendency. For some data sets, it will be necessary to use more than one of these measures to provide an accurate descriptive summary of central tendency for the data.

Major Characteristics of Each Measure of Central Tendency (1)

• Mode
– It is the most frequent or most probable measurement in the data set.
– There can be more than one mode for a data set.
– It is not influenced by extreme measurements.
– Modes of subsets cannot be combined to determine the mode of the complete data set.
– For grouped data, its value can change depending on the categories used.
– It is applicable to both qualitative and quantitative data.


• Median
– It is the central value; 50% of the measurements lie above it and 50% fall below it.
– There is only one median for a data set.
– It is not influenced by extreme measurements.
– Medians of subsets cannot be combined to determine the median of the complete data set.
– For grouped data, its value is rather stable even when the data are organized into different categories.
– It is applicable to quantitative data only.


• Mean
– It is the arithmetic average of the measurements in a data set.
– There is only one mean for a data set.
– Its value is influenced by extreme measurements; trimming can help to reduce the degree of influence.
– Means of subsets can be combined to determine the mean of the complete data set.
– It is applicable to quantitative data only.
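The statement that means of subsets can be combined can be checked with a short Python sketch: weighting each subset mean by its subset size reproduces the mean of the complete data set. The function name `combined_mean` and the two small subsets are hypothetical.

```python
def combined_mean(groups):
    """groups: list of (n_i, mean_i) pairs, one per subset."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# Two hypothetical subsets of one data set.
subset_a = [1, 2, 3]    # n = 3, mean = 2.0
subset_b = [10, 20]     # n = 2, mean = 15.0

groups = [(len(s), sum(s) / len(s)) for s in (subset_a, subset_b)]
pooled = subset_a + subset_b

print(combined_mean(groups))      # 7.2
print(sum(pooled) / len(pooled))  # 7.2, the same value computed directly
```

No comparable shortcut exists for the median or the mode, which is why the slides state that those cannot be combined from subsets.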

Measures of Variability (1)

• Variability
– The description of the dispersion or spread of the measurements.

[Figure: three relative frequency histograms drawn on the same scales; x-axis: y (0 to 30), y-axis: relative frequency (0.0 to 0.6).]


• Range
– The difference between the largest and the smallest measurements of the data set.
– The range is the simplest but least useful measure of data variability.
– For grouped data, because the individual measurements are not known, the range is taken to be the difference between the upper limit of the last interval and the lower limit of the first interval.


• Percentile
– The p-th percentile of a set of n measurements arranged in order of magnitude is the value that has at most p% of the measurements below it and at most (100 − p)% above it.
– Specific percentiles of interest are the 25th, 50th, and 75th percentiles, often called the lower quartile, the middle quartile (median), and the upper quartile, respectively.

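[Figure: relative frequency histogram (x-axis from 5 to 25, relative frequency from 0.00 to 0.14) marking the lower quartile, the median, and the upper quartile. Each of the four regions contains 25% of the measurements; the IQR spans the lower quartile to the upper quartile.]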


• For grouped data, the following formula can be used to approximate the percentiles for the original data.

$$P = L + \frac{w}{f_p}\left(\frac{p}{100}\cdot n - cf_b\right)$$

where
$P$ = percentile of interest (the $p$-th percentile)
$L$ = lower limit of the class interval that includes the percentile of interest
$n$ = total frequency
$cf_b$ = cumulative frequency for all class intervals before the percentile class
$f_p$ = frequency of the class interval that includes the percentile of interest
$w$ = interval width
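The grouped-data percentile formula can be sketched in Python as follows. The function name `grouped_percentile` and its argument layout are illustrative assumptions; the class limits and frequencies are those of the chick weight-gain table in Example 2, and equal-width classes are assumed.

```python
def grouped_percentile(lower_limits, freqs, w, p):
    """Approximate the p-th percentile (0 < p <= 100) from a frequency table.

    lower_limits : lower class limits in increasing order
    freqs        : class frequencies, in the same order
    w            : common interval width
    """
    n = sum(freqs)
    target = p / 100 * n      # number of measurements at or below the percentile
    cf_b = 0                  # cumulative frequency before the current class
    for L, f_p in zip(lower_limits, freqs):
        if f_p > 0 and cf_b + f_p >= target:
            return L + (w / f_p) * (target - cf_b)
        cf_b += f_p
    raise ValueError("p must satisfy 0 < p <= 100")

# Chick weight-gain table (Example 2): lower class limits and frequencies.
lower_limits = [3.55, 3.65, 3.75, 3.85, 3.95, 4.05, 4.15,
                4.25, 4.35, 4.45, 4.55, 4.65, 4.75, 4.85]
freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

q1 = grouped_percentile(lower_limits, freqs, 0.1, 25)
q2 = grouped_percentile(lower_limits, freqs, 0.1, 50)
q3 = grouped_percentile(lower_limits, freqs, 0.1, 75)
print(round(q1, 2), round(q2, 2), round(q3, 2))  # 4.06 4.28 4.51
print(round(q3 - q1, 2))                         # 0.45, the approximate IQR
```

With p = 50 the formula reduces to the grouped-data median formula, so the middle value agrees with the median computed in Example 2.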


• Interquartile range (IQR)
– The difference between the upper and lower quartiles.
– IQR = 75th percentile − 25th percentile
– The IQR ignores the extremes in the data set completely.
– The IQR does not provide a lot of useful information about the variability of a single set of measurements, but it can be quite useful when comparing the variabilities of two or more data sets.


• Variance
– Many different measures of variability can be constructed by using the deviations $y - \bar{y}$.
– The measure that involves the sum of the squared deviations of the measurements from their mean is called the variance.
– Population variance (deviations taken about the population mean $\mu$, with $n$ the number of measurements in the population):

$$\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$$

– Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

– The use of $(n - 1)$ as the denominator of $s^2$ is not arbitrary. This definition of the sample variance makes it an unbiased estimator of the population variance $\sigma^2$.
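The two variance definitions differ only in the denominator, which a short Python sketch makes visible. The hand-written functions below are illustrative; they are compared with `statistics.pvariance` and `statistics.variance` from the Python standard library. The data are the ten reading scores from Example 1, treated first as a whole population and then as a sample purely for the sake of the comparison.

```python
import statistics

def population_variance(values):
    """Sum of squared deviations about the mean, divided by n."""
    mu = sum(values) / len(values)
    return sum((y - mu) ** 2 for y in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    ybar = sum(values) / len(values)
    return sum((y - ybar) ** 2 for y in values) / (len(values) - 1)

scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]  # reading scores from Example 1

print(population_variance(scores))          # 93.25   (932.5 / 10)
print(round(sample_variance(scores), 4))    # 103.6111 (932.5 / 9)

# The standard library gives the same results.
print(statistics.pvariance(scores), round(statistics.variance(scores), 4))
```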


• Standard deviation
– The positive square root of the variance.
– One reason for defining the standard deviation is that it yields a measure of variability having the same units of measurement as the original data, whereas the units for the variance are the square of the measurement units.
– The other reason for using the standard deviation is to apply the empirical rule to mound-shaped or bell-shaped distributions.
– Empirical rule: given a set of n measurements possessing a mound-shaped distribution,
• the interval $\bar{y} \pm s$ contains about 68% of the measurements;
• the interval $\bar{y} \pm 2s$ contains about 95% of the measurements;
• the interval $\bar{y} \pm 3s$ contains about 99.7% of the measurements.
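The empirical rule can be checked numerically with simulated mound-shaped data. The sketch below is an illustration only: the helper `fraction_within`, the seed, and the choice of a normal sample with mean 10 and standard deviation 2 are all assumptions, not part of the lecture.

```python
import random
import statistics

def fraction_within(values, k):
    """Fraction of measurements inside the interval ybar ± k*s."""
    ybar = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    inside = sum(1 for y in values if ybar - k * s <= y <= ybar + k * s)
    return inside / len(values)

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(10, 2) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"within {k}s: {fraction_within(sample, k):.3f}")
```

For a sample like this the three fractions should land close to 0.68, 0.95, and 0.997; for skewed or bimodal data they can differ noticeably, which is why the rule is stated only for mound-shaped distributions.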


• Approximating s
– The empirical rule states that approximately 95% of the measurements lie in the interval $\bar{y} \pm 2s$. The length of this interval is, therefore, $4s$. Because the range of the measurements is approximately $4s$, we can obtain an approximate value for $s$ by dividing the range by 4:

$$\text{approximate value of } s = \frac{\text{range}}{4}$$


• Coefficient of variation (CV)
– It measures the variability in the values in a population relative to the magnitude of the population mean:

$$CV = \frac{\sigma}{\mu}$$

– The CV is a unit-free number because the standard deviation and the mean are measured using the same units. It can be used to compare the variability in two considerably different processes or populations.
– In many applications, the CV is expressed as a percentage:

$$CV = 100\left(\frac{\sigma}{\mu}\right)\%$$
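That the CV is unit-free can be demonstrated with a small Python sketch: the same measurements expressed in centimetres and in metres give different standard deviations but the same CV. The function name and the height values are hypothetical; `statistics.pstdev` is the population standard deviation from the Python standard library, matching the $\sigma / \mu$ definition.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (sigma / mu)."""
    return statistics.pstdev(values) / statistics.mean(values)

# Hypothetical heights, first in centimetres and then the same values in metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(round(statistics.pstdev(heights_cm), 3), round(statistics.pstdev(heights_m), 3))  # 14.142 0.141
print(f"{100 * cv_cm:.2f}%  {100 * cv_m:.2f}%")  # 8.32%  8.32%
```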

Examination of the Shape of a Distribution

• Stem-and-leaf plot

• Boxplot
– The boxplot is more concerned with the symmetry of the distribution and incorporates numerical measures of central tendency and location to study the variability of the measurements and the measurements in the tails of the distribution.

• Box-and-whiskers plot
– Any value beyond an inner fence on either side is called a mild outlier, and a value beyond an outer fence on either side is called an extreme outlier.
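The mild/extreme classification can be sketched in Python using the fences given on the Boxplot (1) slide: inner fences at Q1 − 1.5(IQR) and Q3 + 1.5(IQR), outer fences at Q1 − 3(IQR) and Q3 + 3(IQR). The function name `classify` is illustrative. The quartiles are those reported on the stem-and-leaf slide of this chapter; the value 15.0 is hypothetical and included only to show the extreme case.

```python
def classify(value, q1, q3):
    """Label a value using the inner (1.5 * IQR) and outer (3 * IQR) fences."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme outlier"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 2.39919, 5.23789  # quartiles reported on the stem-and-leaf slide

for y in (4.0, 10.96026, 12.84776, 15.0):
    print(y, classify(y, q1, q3))
# 4.0       not an outlier
# 10.96026  mild outlier    (beyond the upper inner fence, about 9.50)
# 12.84776  mild outlier    (still inside the upper outer fence, about 13.75)
# 15.0      extreme outlier (hypothetical value beyond the outer fence)
```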

Homework

• 3.42 (p.94)
• 3.44 (p.94)
• 3.65 (p.112)
• 3.72, 3.73, 3.74 (p.113)

Pie Chart

[Figure: pie chart with slices Cola 63.6%, Lemon-lime 12.7%, Other 8.6%, Dr Pepper-type 6.3%, Orange 5.8%, Root beer 3.0%.]

Bar Chart

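[Figure: bar chart of the number of workers (y-axis, 0 to 6000) for Great Britain, West Germany, Japan, Netherlands, and Ireland (x-axis labeled "City"). Bar value labels in extracted order: 138, 6500, 1450, 1200, 200.]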

Frequency Histogram

[Figure: frequency histogram; x-axis: class interval for weight gain (3.55 to 4.95), y-axis: frequency (0 to 12).]

Relative Frequency Histogram

[Figure: relative frequency histogram; x-axis: class intervals for weight gain (3.55 to 4.95), y-axis: relative frequency (0.00 to 0.12).]

Influence of the Number of Class Intervals on the Appearance of Histograms

[Figure: three frequency histograms of chick weight gain (grams) drawn with different numbers of class intervals; x-axis from 3.6 to about 5.0, y-axis: frequency.]

Uniform Distribution

[Figure: frequency histogram of a uniform distribution; x-axis: Y (0.0 to 1.0), y-axis: frequency (0 to 150).]

Normal Distribution

[Figure: frequency histogram of a normal distribution; x-axis: Y (2 to 14), y-axis: frequency (0 to 80).]

Unimodal Distribution (right-skewed)

[Figure: frequency histogram of a right-skewed unimodal distribution; x-axis: Y (0 to 15), y-axis: frequency (0 to 50).]

Unimodal Distribution (left-skewed)

[Figure: frequency histogram of a left-skewed unimodal distribution; x-axis: Y (0 to 14), y-axis: frequency (0 to 50).]

Bimodal Distribution

[Figure: frequency histogram of a bimodal distribution; x-axis: Y (0 to 20), y-axis: frequency (0 to 30).]

Bimodal Distribution (left-skewed)

[Figure: frequency histogram of a left-skewed bimodal distribution; x-axis: Y (0 to 25), y-axis: frequency (0 to 30).]

Stem-and-leaf Plot

N = 200   Median = 3.565317   Quartiles = 2.39919, 5.23789
Decimal point is at the colon

 1 : 00002222333444
 1 : 6666777888999999
 2 : 0001111122222233333444444
 2 : 55555555555667778889999
 3 : 00112222223333444
 3 : 555566677778888999
 4 : 0000000111222222233333444
 4 : 5566789
 5 : 00022333334
 5 : 5566777889999
 6 : 00124
 6 : 6888
 7 : 01124
 7 : 57
 8 : 24
 8 : 559
 9 : 02
 9 : 9

High: 10.96026 10.99154 11.00564 11.76325 11.86953
High: 12.47704 12.84776

Example 1. Median

Each of 10 children in the second grade was given a reading aptitude test. The scores were as follows:

id      1   2   3   4   5   6   7   8   9  10
grade  95  86  78  90  62  73  89  92  84  76

Determine the median test score.

Solution: Sort the scores:

62 73 76 78 84 86 89 90 92 95

Because there is an even number of measurements, the median is the average of the two middle scores:

$$\text{median} = \frac{84 + 86}{2} = 85$$

Example 2. Grouped Data Median

• The frequency table for the chick data is given below. Compute the median weight gain for these data.

Class Interval    fi    cum(fi)    fi/n    cum(fi/n)
3.55 - 3.65        1       1       0.01      0.01
3.65 - 3.75        1       2       0.01      0.02
3.75 - 3.85        6       8       0.06      0.08
3.85 - 3.95        6      14       0.06      0.14
3.95 - 4.05       10      24       0.10      0.24
4.05 - 4.15       10      34       0.10      0.34
4.15 - 4.25       13      47       0.13      0.47
4.25 - 4.35       11      58       0.11      0.58
4.35 - 4.45       13      71       0.13      0.71
4.45 - 4.55        7      78       0.07      0.78
4.55 - 4.65        6      84       0.06      0.84
4.65 - 4.75        7      91       0.07      0.91
4.75 - 4.85        5      96       0.05      0.96
4.85 - 4.95        4     100       0.04      1.00
Totals         n = 100             1.00

$$L = 4.25,\quad f_m = 11,\quad n = 100,\quad w = 0.1,\quad cf_b = 47$$

$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b) = 4.25 + \frac{0.1}{11}\,(50 - 47) = 4.28$$

Example 3. Grouped Data Mean

• The actual value of the sample mean is:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{429.2}{100} = 4.292$$

• Using the grouped data formula to calculate the mean:

Class Interval    fi     yi     fi·yi    fi(yi − ȳ)²
3.55 - 3.65        1     3.6      3.6     0.478864
3.65 - 3.75        1     3.7      3.7     0.350464
3.75 - 3.85        6     3.8     22.8     1.452384
3.85 - 3.95        6     3.9     23.4     0.921984
3.95 - 4.05       10     4.0     40.0     0.852640
4.05 - 4.15       10     4.1     41.0     0.368640
4.15 - 4.25       13     4.2     54.6     0.110032
4.25 - 4.35       11     4.3     47.3     0.000704
4.35 - 4.45       13     4.4     57.2     0.151632
4.45 - 4.55        7     4.5     31.5     0.302848
4.55 - 4.65        6     4.6     27.6     0.569184
4.65 - 4.75        7     4.7     32.9     1.165248
4.75 - 4.85        5     4.8     24.0     1.290320
4.85 - 4.95        4     4.9     19.6     1.478656
Totals         n = 100          429.2     9.493600

$$\bar{y} \cong \frac{\sum_{i} f_i\, y_i}{n} = \frac{429.2}{100} = 4.292$$

Relation Among Mean, Trimmed Mean, Median and Mode (1)

[Figure: three distribution curves with different skewness, each marking the positions of the mean (µ), the trimmed mean (TM), the median (Md), and the mode (Mo).]

Boxplot (1)

[Figure: boxplot on an axis from 3.6 to 4.8, marking Q1, the median, and Q3, together with the fences listed below.]

• Upper inner fence: Q3 + 1.5(IQR)
• Lower inner fence: Q1 − 1.5(IQR)
• Upper outer fence: Q3 + 3(IQR)
• Lower outer fence: Q1 − 3(IQR)

Boxplot (2)

[Figure: boxplot on an axis from 20 to 140.]

Boxplot (3)

[Figure: boxplot on an axis from 20 to 140.]

Scatterplot

[Figure: scatterplot; x-axis: base (5 to 155), y-axis: age (15 to 40).]
