Source: web.cjcu.edu.tw/~jdwu/biostat01/lect003.pdf

Chapter 3. Data Description


Post on 31-Mar-2020


Graphical Methods

• Pie chart
– Partitions a circle to display the percentage of the total number of measurements falling into each category of the variable.

• Bar chart
– Uses the height of the bars to represent the number of observations in each category.

• Frequency histogram
– Divide the range of the measurements by the approximate number of class intervals desired to obtain the class width. Construct a frequency table that lists the number of measurements (the frequency) in each class interval, then draw bars whose heights represent the frequencies.
– Frequency histogram versus relative frequency histogram
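The frequency-table construction described above can be sketched in Python. This is only a minimal sketch, assuming equal-width class intervals that are closed on the left and data that are not all equal; the function name `frequency_table` and the measurements are hypothetical and do not come from the chapter.

```python
def frequency_table(values, num_classes):
    """Return (lower, upper, frequency, relative frequency) rows for equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes      # range divided by the desired number of classes
    counts = [0] * num_classes
    for v in values:
        # Index of the class containing v; the largest value is placed in the last class.
        k = min(int((v - lo) / width), num_classes - 1)
        counts[k] += 1
    n = len(values)
    return [(lo + k * width, lo + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]


# Hypothetical measurements, used only to illustrate the construction.
data = [2, 3, 3, 5, 6, 6, 7, 8, 9, 10]
table = frequency_table(data, 4)
for lower, upper, freq, rel in table:
    print(f"{lower:5.1f} - {upper:5.1f}   f = {freq}   f/n = {rel:.2f}")
```

The frequency column gives the bar heights of a frequency histogram; the relative frequency column gives those of a relative frequency histogram.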

Statistical Distribution (1)

• Probability
– The chance of an event occurring.

• Probability distribution
– The probability distribution of a discrete random variable is a table, graph, formula, or other device used to specify all possible values of the discrete random variable along with their respective probabilities.

• Unimodal distribution
– A distribution whose histogram has one major peak is called unimodal.

• Bimodal distribution
– A distribution whose histogram has two major peaks is called bimodal.

Statistical Distribution (2)

• Uniform distribution
– Every class interval contains about the same number of observations.

• Normal distribution
– The relative frequency histogram or probability distribution is a smooth bell-shaped curve.

• Lognormal distribution
– After taking the logarithm of the values of a random variable, the relative frequency histogram shows a smooth bell-shaped curve.

Exploratory Data Analysis

• The aim of exploratory data analysis is to explore and understand the characteristics of the data.
• The basic tools for this analysis are graphical techniques.
– Stem-and-leaf plot
• It is a clever, simple device for constructing a histogram-like picture of a frequency distribution.

Describing Data on a Single Variable

• Measures of central tendency
– Describe the center of the distribution of measurements.

• Measures of variability
– Describe how the measurements vary about the center of the distribution.

• Parameter versus statistic
– Parameters are the numerical descriptive measures for a population.
– Statistics are the numerical descriptive measures for a sample.

Measures of Central Tendency (1)

• Mode
– The mode of a set of measurements is the measurement that occurs most often (with the highest frequency).
– The mode is also commonly used as a measure of popularity that reflects central tendency or opinion.

• Median
– The middle value when the measurements are arranged from lowest to highest.
– For an even number of measurements, the median is the average of the two middle values when the measurements are arranged from lowest to highest.
– For an odd number of measurements, the median is the middle value.
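The definitions above can be checked with Python's standard `statistics` module. This is a small sketch: the scores are the ten reading-aptitude scores from Example 1 of this chapter, while the list of colors is hypothetical and serves only to show a data set with more than one mode.

```python
import statistics

# Reading-aptitude scores from Example 1 of this chapter (n = 10, an even count).
scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
median_even = statistics.median(scores)      # average of the two middle values, 84 and 86

# Dropping the last score leaves an odd count, so the median is the single middle value.
median_odd = statistics.median(scores[:-1])

# Hypothetical qualitative data; multimode returns every value tied for the highest frequency.
colors = ["red", "blue", "red", "green", "blue"]
modes = statistics.multimode(colors)

print(median_even, median_odd, modes)
```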

Measures of Central Tendency (2)

• Grouped data median
– For grouped data, the median is calculated as:

$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b)$$

where
L = lower class limit of the interval that contains the median
n = total frequency
cf_b = the sum of frequencies (cumulative frequency) for all classes before the median class
f_m = frequency of the class interval containing the median
w = interval width

Measures of Central Tendency (3)

• Arithmetic mean (or mean)
– The sum of the measurements divided by the total number of measurements.
– Population mean: $\mu$
– Sample mean: $\bar{y}$

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n}$$

– Grouped data mean (the sum runs over the class intervals)

$$\bar{y} \cong \frac{\sum_{i} f_i y_i}{n}$$

where
f_i = frequency of the i-th class interval
y_i = midpoint of the i-th class interval
n = the total number of measurements

Measures of Central Tendency (4)

• Trimmed mean
– A variation of the mean.
– It drops the highest and lowest extreme values and averages the rest.
– A 5% trimmed mean drops the highest 5% and the lowest 5% of the measurements and averages the rest.
– In a limiting sense, the median is a 50% trimmed mean.

• Outlier
– An extreme value among the measurements.
– The mean is subject to distortion due to the presence of one or more outliers.
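The effect of trimming can be sketched as follows. The function name `trimmed_mean` and the data are hypothetical, and rounding the number of dropped values down is an assumption; conventions differ when the trimming proportion does not give a whole number of measurements.

```python
def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the sorted values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)     # values dropped from each end (rounded down)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion too large: no values left to average")
    return sum(kept) / len(kept)


# Hypothetical data with one outlier (95).
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
plain = sum(data) / len(data)          # ordinary mean, pulled upward by the outlier
trimmed = trimmed_mean(data, 0.10)     # 10% trimmed mean: drops 12 and 95

print(plain, trimmed)
```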

Measures of Central Tendency (5)

• Relationship among the mean (µ), the trimmed mean (TM), the median (Md), and the mode (Mo) for distributions with different skewness.

• We are not restricted to using only one measure of central tendency. For some data sets, it will be necessary to use more than one of these measures to provide an accurate descriptive summary of central tendency for the data.

Major Characteristics of Each Measure of Central Tendency (1)

• Mode
– It is the most frequent or probable measurement in the data set.
– There can be more than one mode for a data set.
– It is not influenced by extreme measurements.
– Modes of subsets cannot be combined to determine the mode of the complete data set.
– For grouped data, its value can change depending on the categories used.
– It is applicable to both qualitative and quantitative data.

Major Characteristics of Each Measure of Central Tendency (2)

• Median
– It is the central value; 50% of the measurements lie above it and 50% fall below it.
– There is only one median for a data set.
– It is not influenced by extreme measurements.
– Medians of subsets cannot be combined to determine the median of the complete data set.
– For grouped data, its value is rather stable even when the data are organized into different categories.
– It is applicable to quantitative data only.

Major Characteristics of Each Measure of Central Tendency (3)

• Mean
– It is the arithmetic average of the measurements in a data set.
– There is only one mean for a data set.
– Its value is influenced by extreme measurements; trimming can help to reduce the degree of influence.
– Means of subsets can be combined to determine the mean of the complete data set.
– It is applicable to quantitative data only.
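The statement that subset means can be combined, while subset medians cannot, can be illustrated with a short sketch. The helper `combined_mean` and both subsets are hypothetical; the combination is the usual size-weighted average of the subset means.

```python
import statistics


def combined_mean(groups):
    """groups: list of (size, mean) pairs; returns the mean of the pooled data."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n


# Hypothetical subsets of one data set.
a = [2, 4, 6]
b = [10, 20]

pooled = combined_mean([(len(a), statistics.mean(a)), (len(b), statistics.mean(b))])
direct = statistics.mean(a + b)        # mean computed from all measurements at once

# The subset medians (4 and 15) do not determine the median of the pooled data (6).
median_all = statistics.median(a + b)

print(pooled, direct, median_all)
```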

Measures of Variability (1)

• Variability
– The description of the dispersion or spread of the measurements.

[Figure: three relative frequency histograms; horizontal axis: y (0 to 30); vertical axis: relative frequency (0.0 to 0.6)]

Measures of Variability (2)

• Range
– It is defined as the difference between the largest and the smallest measurements of the data set.
– The range is the simplest but least useful measure of data variability.
– For grouped data, because the individual measurements are not known, the range is taken to be the difference between the upper limit of the last interval and the lower limit of the first interval.

Measures of Variability (3)

• Percentile
– The p-th percentile of a set of n measurements arranged in order of magnitude is the value that has at most p% of the measurements below it and at most (100-p)% above it.
– Specific percentiles of interest are the 25th, 50th, and 75th percentiles, often called the lower quartile, the middle quartile (median), and the upper quartile, respectively.

[Figure: relative frequency histogram (horizontal axis 5 to 25; vertical axis: relative frequency, 0.00 to 0.14) marking the lower quartile, the median, and the upper quartile, which split the measurements into four parts of 25% each; the IQR spans the lower quartile to the upper quartile]

Measures of Variability (4)

• For grouped data, the following formula can be used to approximate the percentiles of the original data:

$$P = L + \frac{w}{f_p}\left(\frac{p}{100}\,n - cf_b\right)$$

where
P = percentile of interest (the p-th percentile)
L = lower limit of the class interval that includes the percentile of interest
n = total frequency
cf_b = cumulative frequency for all class intervals before the percentile class
f_p = frequency of the class interval that includes the percentile of interest
w = interval width
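The grouped-data percentile formula can be sketched in Python as below. The function name `grouped_percentile` is hypothetical; the class intervals and frequencies are those of the chick weight-gain table in Example 2 of this chapter. With p = 50 the function reduces to the grouped data median.

```python
def grouped_percentile(classes, p):
    """Approximate the p-th percentile from (lower limit, upper limit, frequency) rows."""
    n = sum(f for _, _, f in classes)
    target = p / 100 * n                   # (p/100) * n
    cf_b = 0                               # cumulative frequency before the current class
    for lower, upper, f in classes:
        if f > 0 and cf_b + f >= target:   # this class includes the percentile
            w = upper - lower
            return lower + (w / f) * (target - cf_b)
        cf_b += f
    raise ValueError("p must be between 0 and 100")


# Chick weight-gain frequency table (Example 2 of this chapter).
chick = [(3.55, 3.65, 1), (3.65, 3.75, 1), (3.75, 3.85, 6), (3.85, 3.95, 6),
         (3.95, 4.05, 10), (4.05, 4.15, 10), (4.15, 4.25, 13), (4.25, 4.35, 11),
         (4.35, 4.45, 13), (4.45, 4.55, 7), (4.55, 4.65, 6), (4.65, 4.75, 7),
         (4.75, 4.85, 5), (4.85, 4.95, 4)]

q1 = grouped_percentile(chick, 25)
median = grouped_percentile(chick, 50)
q3 = grouped_percentile(chick, 75)
print(round(q1, 2), round(median, 2), round(q3, 2), round(q3 - q1, 2))
```

The last printed value is the grouped-data approximation of the interquartile range.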

Measures of Variability (5)

• Interquartile range (IQR)
– The difference between the upper and lower quartiles.
– IQR = 75th percentile − 25th percentile
– The IQR ignores the extremes in the data set completely.
– The IQR does not provide a lot of useful information about the variability of a single set of measurements, but it can be quite useful when comparing the variabilities of two or more data sets.

Measures of Variability (6)

• Variance
– Many different measures of variability can be constructed by using the deviations $y_i - \bar{y}$.
– The measure that involves the sum of the squared deviations of the measurements from their mean is called the variance.
– Population variance (n here counts all measurements in the population, and µ is the population mean):

$$\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$$

– Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

– The use of (n-1) as the denominator of s² is not arbitrary. This definition of the sample variance makes it an unbiased estimator of the population variance σ².
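The two denominators can be compared numerically. This is a minimal sketch on a hypothetical sample; `statistics.pvariance` and `statistics.variance` are the standard-library functions that divide by n and by n-1, respectively.

```python
import statistics

# Hypothetical measurements.
y = [4, 7, 8, 9, 12]
n = len(y)
ybar = sum(y) / n
ss = sum((yi - ybar) ** 2 for yi in y)    # sum of squared deviations from the mean

divide_by_n = ss / n                      # population formula, treating y as the whole population
divide_by_n_minus_1 = ss / (n - 1)        # sample variance s^2

print(divide_by_n, statistics.pvariance(y))
print(divide_by_n_minus_1, statistics.variance(y))
```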

Measures of Variability (7)

• Standard deviation
– The positive square root of the variance.
– One reason for defining the standard deviation is that it yields a measure of variability having the same units of measurement as the original data, whereas the units for variance are the square of the measurement units.
– The other reason for using the standard deviation is that it allows the empirical rule to be applied to mound-shaped or bell-shaped distributions.
– Empirical rule
• Given a set of n measurements possessing a mound-shaped distribution, the interval $\bar{y} \pm s$ contains about 68% of the measurements;
• the interval $\bar{y} \pm 2s$ contains about 95% of the measurements;
• the interval $\bar{y} \pm 3s$ contains about 99.7% of the measurements.
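The empirical rule can be checked on simulated mound-shaped data. This is a sketch under stated assumptions: the data are drawn from a normal distribution with a hypothetical mean of 50 and standard deviation of 10, and the helper `share_within` is not from the chapter.

```python
import random
import statistics

random.seed(0)   # fixed seed so that repeated runs give the same sample
# Simulated mound-shaped measurements (hypothetical mean 50, standard deviation 10).
y = [random.gauss(50, 10) for _ in range(10_000)]
ybar = statistics.mean(y)
s = statistics.stdev(y)


def share_within(k):
    """Fraction of the measurements inside the interval ybar ± k*s."""
    return sum(ybar - k * s <= v <= ybar + k * s for v in y) / len(y)


for k in (1, 2, 3):
    print(f"within {k} s: {share_within(k):.3f}")
```

The printed fractions should be close to 0.68, 0.95, and 0.997; they are approximate because the sample is finite.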

Measures of Variability (8)

• Approximating s
– The empirical rule states that approximately 95% of the measurements lie in the interval $\bar{y} \pm 2s$. The length of this interval is, therefore, 4s. Because the range of the measurements is approximately 4s, we can obtain an approximate value for s by dividing the range by 4:

$$\text{approximate value of } s = \frac{\text{range}}{4}$$
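As a rough check of the range/4 approximation, the sketch below uses numbers read from the chick weight-gain tables in Examples 2 and 3 of this chapter. Computing s from the column total with an n-1 denominator is an assumption here, since the chapter does not state a grouped-data variance formula.

```python
import math

# Values read from the chick weight-gain tables (Examples 2 and 3 of this chapter).
lower_limit_first = 3.55    # lower limit of the first class interval
upper_limit_last = 4.95     # upper limit of the last class interval
n = 100                     # total number of measurements
sum_sq_dev = 9.4936         # total of the f_i * (y_i - ybar)^2 column

grouped_range = upper_limit_last - lower_limit_first
s_from_range = grouped_range / 4

# Assumed grouped-data analogue of the sample standard deviation (n - 1 denominator).
s_grouped = math.sqrt(sum_sq_dev / (n - 1))

print(round(s_from_range, 3), round(s_grouped, 3))
```

The two values agree only roughly, which is all the range/4 rule is meant to provide.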

Measures of Variability (9)

• Coefficient of variation (CV)
– It measures the variability in the values in a population relative to the magnitude of the population mean:

$$CV = \frac{\sigma}{\mu}$$

– The CV is a unit-free number because the standard deviation and the mean are measured using the same units. It can be used to compare the variability in two considerably different processes or populations.
– In many applications, the CV is expressed as a percentage:

$$CV = 100\left(\frac{\sigma}{\mu}\right)\%$$
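The unit-free comparison can be sketched as follows. The function name and both data sets are hypothetical; each list is treated as a complete population, so `statistics.pstdev` is used for σ.

```python
import statistics


def coefficient_of_variation(values):
    """CV as a percentage: 100 * (standard deviation / mean), treating values as a population."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)


# Hypothetical measurements recorded in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"heights: {cv_heights:.1f}%   weights: {cv_weights:.1f}%")
```

Both lists have the same standard deviation, yet the weights vary more relative to their mean, which is what the CV is designed to show.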

Examination of the Shape of a Distribution

• Stem-and-leaf plot

• Boxplot
– The boxplot is more concerned with the symmetry of the distribution and incorporates numerical measures of central tendency and location to study the variability of the measurements and the measurements in the tails of the distribution.

• Box-and-whiskers plot
– Any value beyond an inner fence on either side is called a mild outlier, and a value beyond an outer fence on either side is called an extreme outlier.

Homework

• 3.42 (p.94)
• 3.44 (p.94)
• 3.65 (p.112)
• 3.72, 73, 74 (p.113)

Pie Chart

[Figure: pie chart with the following category percentages]

Cola: 63.6%
Lemon-lime: 12.7%
Dr Pepper-type: 6.3%
Root beer: 3.0%
Orange: 5.8%
Other: 8.6%

Bar Chart

[Figure: bar chart; horizontal axis: City (Great Britain, West Germany, Japan, Netherlands, Ireland); vertical axis: Number of Workers (0 to 6000); bar labels: 138, 6500, 1450, 1200, 200]

Frequency Histogram

[Figure: frequency histogram; horizontal axis: class interval for weight gain (3.55 to 4.95); vertical axis: frequency (0 to 12)]

Relative Frequency Histogram

[Figure: relative frequency histogram; horizontal axis: class intervals for weight gain (3.55 to 4.95); vertical axis: relative frequency (0.00 to 0.12)]

Influence of the Number of Class Intervals on the Appearance of Histograms

[Figure: three frequency histograms of the same data drawn with different numbers of class intervals; horizontal axis: chick weight gain (grams); vertical axis: frequency]

Uniform Distribution

[Figure: frequency histogram; horizontal axis: Y (0.0 to 1.0); vertical axis: frequency (0 to 150)]

Normal Distribution

[Figure: frequency histogram; horizontal axis: Y (2 to 14); vertical axis: frequency (0 to 80)]

Unimodal Distribution (right-skewed)

[Figure: frequency histogram; horizontal axis: Y (0 to 15); vertical axis: frequency (0 to 50)]

Unimodal Distribution (left-skewed)

[Figure: frequency histogram; horizontal axis: Y (0 to 14); vertical axis: frequency (0 to 50)]

Bimodal Distribution

[Figure: frequency histogram; horizontal axis: Y (0 to 20); vertical axis: frequency (0 to 30)]

Bimodal Distribution (left-skewed)

[Figure: frequency histogram; horizontal axis: Y (0 to 25); vertical axis: frequency (0 to 30)]

Stem-and-leaf Plot

N = 200   Median = 3.565317
Quartiles = 2.39919, 5.23789

Decimal point is at the colon

 1 : 00002222333444
 1 : 6666777888999999
 2 : 0001111122222233333444444
 2 : 55555555555667778889999
 3 : 00112222223333444
 3 : 555566677778888999
 4 : 0000000111222222233333444
 4 : 5566789
 5 : 00022333334
 5 : 5566777889999
 6 : 00124
 6 : 6888
 7 : 01124
 7 : 57
 8 : 24
 8 : 559
 9 : 02
 9 : 9

High: 10.96026 10.99154 11.00564 11.76325 11.86953
High: 12.47704 12.84776
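A display of this kind can be produced with a few lines of Python. This is only a sketch: the function name `stem_and_leaf` and the data are hypothetical, the values are assumed to be non-negative with one decimal place, and each stem is split into two rows (leaves 0 to 4 and 5 to 9) to mimic the layout of the plot above.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines; stem = integer part, leaf = first decimal digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(round(v * 10), 10)
        # Leaves 0-4 go on the first row of a stem, leaves 5-9 on the second.
        rows[(stem, leaf >= 5)].append(str(leaf))
    return [f"{stem:2d} : {''.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]


# Hypothetical measurements with one decimal place.
data = [1.0, 1.2, 1.6, 2.3, 2.5, 2.5, 3.1]
lines = stem_and_leaf(data)
print("\n".join(lines))
```

Rows with no leaves are simply omitted in this sketch.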

Example 1. Median

Each of 10 children in the second grade was given a reading aptitude test. The scores were as follows:

id      1   2   3   4   5   6   7   8   9  10
grade  95  86  78  90  62  73  89  92  84  76

Determine the median test score.

Solution: Sort the scores:

62 73 76 78 84 86 89 90 92 95

Because there is an even number of measurements, the median is the average of the two middle scores:

$$\text{median} = \frac{84 + 86}{2} = 85$$

Example 2. Grouped Data Median

• The frequency table for the chick data is given below. Compute the median weight gain for these data.

Class Interval   f_i   cum(f_i)   f_i/n   cum(f_i/n)
3.55 - 3.65        1       1      0.01      0.01
3.65 - 3.75        1       2      0.01      0.02
3.75 - 3.85        6       8      0.06      0.08
3.85 - 3.95        6      14      0.06      0.14
3.95 - 4.05       10      24      0.10      0.24
4.05 - 4.15       10      34      0.10      0.34
4.15 - 4.25       13      47      0.13      0.47
4.25 - 4.35       11      58      0.11      0.58
4.35 - 4.45       13      71      0.13      0.71
4.45 - 4.55        7      78      0.07      0.78
4.55 - 4.65        6      84      0.06      0.84
4.65 - 4.75        7      91      0.07      0.91
4.75 - 4.85        5      96      0.05      0.96
4.85 - 4.95        4     100      0.04      1.00
Totals       n = 100              1.00

$$L = 4.25,\quad n = 100,\quad f_m = 11,\quad w = 0.1,\quad cf_b = 47$$

$$\text{median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b) = 4.25 + \frac{0.1}{11}\,(50 - 47) = 4.28$$

Example 3. Grouped Data Mean

• The actual value of the sample mean is:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{429.2}{100} = 4.292$$

• Using the grouped data formula to calculate the mean:

Class Interval   f_i   y_i   f_i y_i   f_i(y_i - ȳ)^2
3.55 - 3.65        1   3.6      3.6      0.478864
3.65 - 3.75        1   3.7      3.7      0.350464
3.75 - 3.85        6   3.8     22.8      1.452384
3.85 - 3.95        6   3.9     23.4      0.921984
3.95 - 4.05       10   4.0     40.0      0.852640
4.05 - 4.15       10   4.1     41.0      0.368640
4.15 - 4.25       13   4.2     54.6      0.110032
4.25 - 4.35       11   4.3     47.3      0.000704
4.35 - 4.45       13   4.4     57.2      0.151632
4.45 - 4.55        7   4.5     31.5      0.302848
4.55 - 4.65        6   4.6     27.6      0.569184
4.65 - 4.75        7   4.7     32.9      1.165248
4.75 - 4.85        5   4.8     24.0      1.290320
4.85 - 4.95        4   4.9     19.6      1.478656
Totals       n = 100          429.2      9.493600

$$\bar{y} \cong \frac{\sum_{i} f_i y_i}{n} = \frac{429.2}{100} = 4.292$$
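The column totals in the table above can be reproduced with a short script. This is a sketch in which the midpoints and frequencies are copied from the table; the variable names are hypothetical.

```python
# (midpoint y_i, frequency f_i) pairs copied from the table above.
classes = [(3.6, 1), (3.7, 1), (3.8, 6), (3.9, 6), (4.0, 10), (4.1, 10), (4.2, 13),
           (4.3, 11), (4.4, 13), (4.5, 7), (4.6, 6), (4.7, 7), (4.8, 5), (4.9, 4)]

n = sum(f for _, f in classes)                 # total number of measurements
sum_fy = sum(f * y for y, f in classes)        # total of the f_i * y_i column
ybar = sum_fy / n                              # grouped data mean

# Total of the f_i * (y_i - ybar)^2 column.
sum_sq_dev = sum(f * (y - ybar) ** 2 for y, f in classes)

print(n, round(sum_fy, 1), round(ybar, 3), round(sum_sq_dev, 4))
```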

Relation Among Mean, Trimmed Mean, Median and Mode (1)

[Figure: distribution curve (horizontal axis -4 to 4; vertical axis 0.0 to 0.3) with µ, TM, Md, and Mo marked]

Relation Among Mean, Trimmed Mean, Median and Mode (2)

[Figure: distribution curve (horizontal axis 2 to 12; vertical axis 0.0 to 0.3) with µ, TM, Md, and Mo marked]

Relation Among Mean, Trimmed Mean, Median and Mode (3)

[Figure: distribution curve (vertical axis 0.00 to 0.20) with µ, TM, Md, and Mo marked]

Boxplot (1)

[Figure: boxplot on a scale from 3.6 to 4.8, marking Q1, the median, and Q3]

Upper inner fence: Q3 + 1.5(IQR)
Lower inner fence: Q1 − 1.5(IQR)
Upper outer fence: Q3 + 3(IQR)
Lower outer fence: Q1 − 3(IQR)
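The fence definitions can be applied in code to separate mild from extreme outliers. This is a sketch under stated assumptions: the function name `classify_outliers` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the quartile convention used in the textbook.

```python
import statistics


def classify_outliers(values):
    """Return the inner fences, the outer fences, the mild outliers, and the extreme outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)    # quartile convention is an assumption
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Mild: beyond an inner fence but not beyond the outer fence on that side.
    mild = [v for v in values
            if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]]
    # Extreme: beyond an outer fence.
    extreme = [v for v in values if v < outer[0] or v > outer[1]]
    return inner, outer, mild, extreme


# Hypothetical measurements with two unusually large values.
data = [10, 12, 13, 14, 15, 16, 17, 18, 20, 35, 60]
inner, outer, mild, extreme = classify_outliers(data)
print(inner, outer, mild, extreme)
```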

Boxplot (2)

[Figure: boxplot on a scale from 20 to 140]

Boxplot (3)

[Figure: boxplot on a scale from 20 to 140]

Scatterplot

[Figure: scatterplot; horizontal axis: base (5 to 155); vertical axis: age (15 to 40)]