Chapter 1: Introduction to Biostatistics
eacademic.ju.edu.jo/oalkam/material/biostat lectures...
University of Jordan, Department of Mathematics, Fall 2009/2010
Biostatistics handouts Part 1 (Chapter 1 - Chapter 5), Dr. Osama Alkam

Chapter 1
Introduction to Biostatistics

Introduction; Some Basic Concepts
Statistics is a science concerned with making decisions in the face of uncertainty. It comprises the following:
1) Descriptive statistics: concerned with the collection, organization, summarization and analysis of a body of data.
2) Inferential statistics: concerned with drawing inferences about a large body of data (called a population) by examining a part of that body (called a sample).

The performance of statistical activities is motivated by the need to answer a question about a certain population. The usual setup starts with selecting a sample that resembles the population, in the sense that it has all the characteristics and properties of the population (such a sample is said to be an unbiased sample). Information is then collected from the sample and used to answer the question about the population. If the question (and hence the data) relates to a medical, biological, or nutritional problem, we use the term biostatistics to distinguish this particular kind of statistical tools.

Now we introduce some of the vocabulary and concepts that are widely used in any statistics course.

Random Variable: The information or data collected from the subjects cannot be predicted exactly in advance; such data are referred to as random variables.

Random variables are of two kinds:

Qualitative Variables: They divide the subjects into groups or categories. The value of a qualitative variable cannot be measured or counted, for example the birthplace, gender, or marital status of an individual. Qualitative random variables are either “nominal” or “ordinal”. The possible values of a nominal random variable do not have a natural order, for example “gender”, “marital status”, “nationality”. The possible values of an ordinal random variable can be ordered naturally, for example “rank”, “letter grade”, or “degree of improvement” (such as low, weak, good, very good and excellent).

Quantitative Variables: The value of a quantitative variable can be measured or counted. We distinguish between two kinds of quantitative variables:
1. Discrete Variables: if the value of the variable can be counted, it is called a discrete random variable. Examples of discrete random variables are the number of admissions to a general hospital and the number of family members of an individual. Discrete random variables are characterized by gaps or interruptions in the values they assume.
2. Continuous Variables: if the value of the variable can be measured, it is called a continuous random variable. An example of a continuous random variable is the period of treatment of a tuberculosis patient. A continuous random variable can assume any value within a specified relevant interval of values.
Sources of Data: The information about the subjects is usually collected from one or more of the following sources:
1. Routinely kept records or archives: for example, the medical history of a patient.
2. Surveys: if the data needed are not available in the kept records, then it is logical to think of a survey. For example, information about whether the patient received good treatment is not usually kept in the hospital records but can be surveyed.
3. Experiments: frequently the data needed to answer a question are available only as the result of an experiment. For example, a pediatrician or a dentist may try different motivation strategies with different children to find the strategy that maximizes children's compliance.
4. External Sources: the data needed to answer a question may already exist in the form of a published report. International organizations like WHO, or health ministries, usually publish reports that make a good source of data.
The Simple Random Sample (SRS)
If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, then the sample is called a simple random sample.

One method of selecting a simple random sample uses random number generators or random number tables. The procedure is the following:
1. Get a list of all subjects in the population.
2. Obtain random numbers from a random number generator or a table.
3. Select the subjects whose numbers in the list match the obtained random numbers.

Note: The above method is ideal, but it is practically inapplicable to some data. In particular, it is difficult to implement when we need to draw a sample from a very large population.

Reading Assignment: Chapter 1 (1.1, 1.2, 1.4) in W.W. Daniel.
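The three-step procedure for a simple random sample can be sketched in Python. This is a minimal illustration and not part of the original handout: the population list `subjects` (IDs 1 to 500), the function name `simple_random_sample`, and the seed are assumptions made for the example, and the standard library's `random.sample` plays the role of the random number table.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct subjects; every subset of size n is equally likely."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Step 1: a (hypothetical) list of all subjects in the population
subjects = list(range(1, 501))

# Steps 2 and 3: obtain random positions and select the matching subjects
sample = simple_random_sample(subjects, 10, seed=1)
print(sample)
```

Because the draw is made without replacement from the full list, no subject can appear twice in the sample.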
Chapter 2
Descriptive Statistics

Introduction
In this chapter we learn several techniques for organizing and presenting data so that we may easily determine what information they contain.

The Ordered Array
An ordered array is a listing of the values of a collection of data in order of magnitude from the smallest value to the largest value. An ordered array enables one to determine quickly the value of the smallest measurement and the value of the largest measurement.
Example: The following data are the ages, rounded to the nearest year, of 30 people who were discharged from a general hospital last Friday:

51 70 79 75 55 25 38 74 54 72
37 15 56 17 77 43 16 15 72 92
25 30 24 46 47 46 38 81 49 45

To put the above data in an ordered array we just list the measurements from the smallest to the largest:

15 15 16 17 24 25 25 30 37 38
38 43 45 46 46 47 49 51 54 55
56 70 72 72 74 75 77 79 81 92
Frequency tables without classes:
Such tables can be used to organize all types of data.
Example: the following table shows the “letter grades” of 150 students.
X (letter grade) Frequency (number of students)
F 12
D 15
D+ 20
C 35
C+ 30
B 18
B+ 12
A 8
Grouped Data – Frequency tables with classes
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each observation belongs to exactly one interval. These intervals are called class intervals. Class intervals need not have the same width. All class intervals are listed in a table which is referred to as a frequency table. A typical frequency table consists of the following:
Class intervals: a column in which all class intervals are listed.
Midpoints: a column in which the midpoints of the class intervals are computed. The midpoint of a class interval equals (left side + right side)/2.
Frequency: the frequency of a class interval is the number of observations that belong to the class interval.
Cumulative Frequency: the cumulative frequency of a class interval is the number of observations that are less than or equal to the right-hand side of that class interval.
Relative Frequency: the relative frequency of a class interval equals (the frequency of the class interval / total frequency).
Cumulative Relative Frequency: it equals (cumulative frequency / total frequency).

A natural question is: how many class intervals should be included in a frequency table? A rule of thumb states that the number of class intervals k should be between 5 and 15. We may use the following rule given by Sturges as a guide for computing k:

The number of class intervals is the closest integer k to $1 + 3.322 \log_{10}(n)$, where n is the total number of observations.

The number of class intervals specified by the rule can be increased or decreased for more convenience or better presentation. After deciding on the number of class intervals we decide on the class widths. If we give all classes the same width, then we compute the class width using the formula (largest value – smallest value) / k, rounded up to the nearest number with the same accuracy unit as the data.

Example:
Put the data mentioned in the previous example in a frequency table.
We start with computing the number of classes: $n = 30$ and $1 + 3.322 \log_{10}(30) \approx 5.907$. Thus we should have 6 class intervals.
To obtain the class width we compute $\frac{92 - 15}{6} \approx 12.833$. Since the observations are integers, we round 12.833 up to 13. Thus the class width is 13.
Now we are ready to construct the first class interval, which has the smallest observation, namely 15, as its left-hand side and (left-hand side + the class width – one accuracy unit) as its right-hand side. The second class's left-hand side is the first class's right-hand side + one accuracy unit. The right-hand side of each class is the left-hand side of the class + the class width – one accuracy unit. We construct the other class intervals similarly.
Class intervals   Midpoint   Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
15 - 27           21         7           7                      0.233                0.233
28 - 40           34         4           11                     0.133                0.367
41 - 53           47         7           18                     0.233                0.600
54 - 66           60         3           21                     0.100                0.700
67 - 79           73         7           28                     0.233                0.933
80 - 92           86         2           30                     0.067                1.000
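The construction of this frequency table (Sturges's rule, equal class widths rounded up to the accuracy unit, then counting) can be sketched in Python. The function names `sturges_k` and `frequency_table` and the `unit` parameter are illustrative assumptions, not part of the handout; the data are the 30 ages from the example.

```python
import math

ages = [51, 70, 79, 75, 55, 25, 38, 74, 54, 72,
        37, 15, 56, 17, 77, 43, 16, 15, 72, 92,
        25, 30, 24, 46, 47, 46, 38, 81, 49, 45]

def sturges_k(n):
    """Closest integer to 1 + 3.322 * log10(n)."""
    return round(1 + 3.322 * math.log10(n))

def frequency_table(data, unit=1):
    """Rows of (left, right, midpoint, freq, cum freq, rel freq, cum rel freq)."""
    n = len(data)
    k = sturges_k(n)
    # class width: (largest - smallest) / k, rounded up to the accuracy unit
    width = math.ceil((max(data) - min(data)) / k / unit) * unit
    rows, cumulative, left = [], 0, min(data)
    for _ in range(k):
        right = left + width - unit
        freq = sum(left <= x <= right for x in data)
        cumulative += freq
        rows.append((left, right, (left + right) / 2, freq, cumulative,
                     round(freq / n, 3), round(cumulative / n, 3)))
        left = right + unit            # next class starts one unit later
    return rows

rows = frequency_table(ages)
for row in rows:
    print(row)
```

For these data the sketch produces six classes of width 13 starting at 15, with the same frequencies as the table.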
Example: Consider the following cumulative frequency distribution.

Class     Cumulative Frequency
10 - 15   6
16 - 21   13
22 - 27   38
28 - 33   42
34 - 39   50

a) What is the width (or length) of each class?
b) Find the relative frequency of the second class.
c) Find the proportion of observations that are greater than or equal to 16 and less than or equal to 33.

Solution:
a) The class width equals 16 – 10 = 6 (or you may say 15 – 10 + one accuracy unit = 5 + 1 = 6).
b) The frequency of the second class equals 13 – 6 = 7 and the total frequency is 50. Thus the relative frequency of the second class equals 7/50 = 0.14.
c) The observations that are greater than or equal to 16 and less than or equal to 33 are those in the second, third and fourth classes, and their frequencies are 13 – 6 = 7, 38 – 13 = 25 and 42 – 38 = 4, respectively. Thus their proportion is (7 + 25 + 4)/50 = 0.72.
The Histogram; The Frequency Polygon:
The histogram is a graphical representation of the frequency distribution (or the relative frequency distribution). It reveals the shape of the data, for example the presence or absence of symmetry. When we construct the histogram, the boundaries of the class intervals are represented on the horizontal axis, while the vertical axis has as its scale the frequency (or the relative frequency). Above each class interval on the horizontal axis a rectangle is constructed whose height equals the frequency (or relative frequency) of that class interval. All rectangles must be contiguous.

The frequency (or relative frequency) polygon is another graphical representation of the frequency (or relative frequency) distribution. To draw a frequency polygon we place a dot above the midpoint of each class interval represented on the horizontal axis, in addition to two extra dots on the horizontal axis at the midpoints of two additional class intervals: one located to the left of the first class and the other located to the right of the last class. The height of each dot equals the frequency of the relevant class interval, and the heights of the extra dots are zero. Connecting the dots with line segments produces a frequency polygon.

Example: Construct the frequency histogram and the frequency polygon of the following part of a frequency table.
Class Intervals Frequency Midpoint Actual limits
15 - 27 7 21 14.5 – 27.5
28 - 40 4 34 27.5 – 40.5
41 - 53 7 47 40.5 – 53.5
54 - 66 3 60 53.5 – 66.5
67 - 79 7 73 66.5 – 79.5
80 - 92 2 86 79.5 – 92.5
The following is the histogram
[Figure: frequency histogram of the table above. Horizontal axis: Actual Limits (14.5, 27.5, 40.5, 53.5, 66.5, 79.5, 92.5). Vertical axis: Frequency (0 to 8).]
The following is the frequency polygon
[Figure: frequency polygon of the same table. Horizontal axis: Midpoint (8, 21, 34, 47, 60, 73, 86, 99). Vertical axis: Frequency (0 to 8).]

Stem-and-Leaf Display (optional):
The stem-and-leaf display is similar to the histogram and has the same purpose. Its main advantage over the histogram is that it preserves the information contained in the individual data items. It is effective with relatively small data sets. To construct a stem-and-leaf plot we:
1. partition each datum into two parts: the leaf, which consists of the units digit, and the stem, which consists of the remaining digits of the datum;
2. write down the stems on the left-hand side of the page;
3. draw a line to the right of these stems;
4. on the other side of the line, write down the leaves of all data with the same stem next to that stem.

The stems of the data should form an ordered column with the smallest stem at the top and the largest at the bottom. All the stems within the range are included in the stem column, even if no data item has that stem. Decimals, when present in the original data, are omitted in the stem-and-leaf display. If all data items are fractions less than one, then we can magnify the data by multiplying each data item by a number (10, 100, 1000, etc.) before we display the data in a stem-and-leaf plot.

Example:
Display the following data in a stem-and-leaf plot
2, 3, 6, 7, 12, 15, 15, 15, 17, 20, 20, 21, 29, 29, 34, 51, 56, 60, 65, 69, 80, 89

Solution:
Stems | Leaves
0 | 2 3 6 7
1 | 2 5 5 5 7
2 | 0 0 1 9 9
3 | 4
4 |
5 | 1 6
6 | 0 5 9
7 |
8 | 0 9
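The stem-and-leaf construction can be sketched in a few lines of Python. This is an illustrative sketch assuming non-negative integer data; the function name `stem_and_leaf` is not from the handout.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return the display as a list of text lines, one per stem."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)      # stem = leading digits, leaf = units digit
    # every stem in the range is listed, even when it has no leaves
    stems = range(min(leaves), max(leaves) + 1)
    return [(f"{s} | " + " ".join(str(d) for d in leaves[s])).rstrip()
            for s in stems]

data = [2, 3, 6, 7, 12, 15, 15, 15, 17, 20, 20, 21, 29, 29,
        34, 51, 56, 60, 65, 69, 80, 89]
lines = stem_and_leaf(data)
print("\n".join(lines))
```

Sorting first guarantees that the leaves on each row appear in ascending order, which is how the display in the solution is written.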
Reading Assignment: Chapter 2 (2.1, 2.2, 2.3) in W.W. Daniel.

Descriptive Statistics – Measures of Central Tendency
A descriptive measure is a single number that is used to summarize the data. Descriptive measures may be computed from the data of a sample or the data of a population.
Definition:
1. A descriptive measure computed from the data of a sample is called a statistic.
2. A descriptive measure computed from the data of a population is called a parameter.

Arithmetic Mean: The arithmetic mean of a sample is denoted by $\bar{x}$ and that of a population is denoted by $\mu$. From now on we will just say "the mean" for the arithmetic mean.

1) For raw (unorganized) data:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_1, x_2, \ldots, x_n$ are the observations in the sample and n is their number (the sample size).
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $x_1, x_2, \ldots, x_N$ are the observations in the population and N is their number (the population size).

2) For frequency tables:
$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$, where $x_1, x_2, \ldots, x_n$ are the observations (or midpoints) and $f_1, f_2, \ldots, f_n$ are their corresponding frequencies.

Properties of the Mean:
1. Uniqueness: for a given set of data there is one and only one mean.
2. Simplicity: the mean of any sample is easy to compute.
3. The value of each data item influences the mean; thus the mean is affected by extreme values. In some cases this makes the mean a poor representative of the tendency of the majority of the data.
Example:
The mean of the data 50, 49, 53, 48, 54, 420 equals (50+49+53+48+54+420)/6 = 112.333, a number which does not represent the tendency of the data. However, if we trim out the observation 420 then the mean becomes (50+49+53+48+54)/5 = 50.8. Notice the influence of the observation 420 on the value of the mean.

Example (part 1 is optional):
Compute the mean for the following two data sets.
1)
Stem | Leaf
0 | 1, 2, 5
1 | 0, 1
2 | 1, 1, 1, 2
3 | 0, 1, 2, 2

2)
Class      Frequency
0 – 2      3
3 – 5      2
6 – 8      1
9 – 11     2
12 – 14    2

Solution:
1) mean = sum of observations / number of observations
= (1+2+5+10+11+21+21+21+22+30+31+32+32)/13 = 239/13 = 18.3846
2)
Midpoint ($x_i$)   Frequency ($f_i$)   $f_i x_i$
1                  3                   3
4                  2                   8
7                  1                   7
10                 2                   20
13                 2                   26
Total              10                  64

mean = 64/10 = 6.4

The Median:
The median of a finite set of observations is the value which divides the set into two equal parts, such that the number of values equal to or greater than the median is equal to the number of values equal to or less than the median. The median is the middle value (or the average of the two middle values) when all values have been arranged in order of magnitude.

Example:
Find the median of the following observations
45, 78, 23, 54, 61, 12, 90, 46, 68, 45, 11
Solution:
The first step is to arrange the data in order of magnitude
11, 12, 23, 45, 45, 46, 54, 61, 68, 78, 90
Notice that 46 is located exactly in the middle of all ordered values, thus the median is 46.

Example:
Find the median of 78, 94, 25, 23, 56, 66, 38, 78, 23, 80
Solution:
We order the data as a first step
23, 23, 25, 38, 56, 66, 78, 78, 80, 94
Notice that no single datum is located in the middle of the ordered data because the number of data items is even. However, the two values 56 and 66 are located in the middle, thus the median equals (56+66)/2 = 61.

Properties of the Median:
1. Uniqueness
2. Simplicity
3. Unlike the mean, it is not drastically affected by extreme values.
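The contrast between the mean and the median under an extreme value, and the frequency-table formula for the mean, can be checked with a short Python sketch. It uses the standard `statistics` module; the helper `grouped_mean` is an illustrative name, and the midpoints and frequencies are those of the solution table above.

```python
from statistics import mean, median

raw = [50, 49, 53, 48, 54, 420]
print(mean(raw))       # pulled upward by the extreme value 420
print(median(raw))     # stays close to the bulk of the data
print(mean(raw[:-1]))  # mean after trimming out 420

def grouped_mean(midpoints, freqs):
    """Mean from a frequency table: sum(f_i * x_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)

print(grouped_mean([1, 4, 7, 10, 13], [3, 2, 1, 2, 2]))
```

The median of the six raw values is the average of the two middle ordered values, 50 and 53, so it is barely affected by 420.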
The Mode:
A mode of a set of observations is an observation that has the largest frequency. If all observations have the same frequency, the data set has no mode. A data set may have more than one mode. The mode may be used to describe qualitative data. A mode of grouped data is estimated by the midpoint of a class with the highest frequency.

Example:
The following list gives the nationalities of a sample of 10 patients who had psychotherapy last year in a private clinic:
British, French, American, American, Dutch, British, Spanish, South African, French, American
To find the mode of the above nationalities we make the following table

Nationality      Frequency
American         3
British          2
Dutch            1
French           2
Spanish          1
South African    1

Notice that the most frequently occurring nationality is American, thus the mode is American.

Example:
Find the mode of the data 28, 28, 28, 28, 28, 29, 30, 31, 32, 32, 32, 32, 32, 36, 39, 42, 44, 44, 45

Solution:
There are two modes for the above data, namely 28 and 32, because they share the highest frequency.

Reading Assignment: Chapter 2 (2.4) in W.W. Daniel.
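Finding the mode amounts to counting frequencies and keeping every value that attains the largest count. The following Python sketch does this with `collections.Counter`; the function name `modes` is an illustrative choice, and the data are those of the two examples above.

```python
from collections import Counter

def modes(values):
    """Return every value whose frequency equals the largest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    # note: when all counts are equal, this list contains every distinct value
    return [v for v, c in counts.items() if c == top]

nationalities = ["British", "French", "American", "American", "Dutch",
                 "British", "Spanish", "South African", "French", "American"]
numbers = [28, 28, 28, 28, 28, 29, 30, 31, 32, 32, 32, 32, 32,
           36, 39, 42, 44, 44, 45]

print(Counter(nationalities))
print(modes(nationalities))
print(modes(numbers))
```

The same function works for qualitative and quantitative data, since it only compares frequencies.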
Descriptive Statistics – Measures of Dispersion

The dispersion of a set of data (or observations) refers to the variety that they exhibit. A measure of dispersion provides information about the amount of variability present in a set of data. When the dispersion is "small", the values of the data items are "close" together. The following graph represents two frequency polygons for population A and population B with the same mean.
[Figure: frequency polygons of population A and population B with the same mean.]
Notice that population B exhibits more dispersion because the values of its observations are more spread out. Dispersion can be measured using one of the following measures:

The Range:
The range of a set of values is given by $R = \text{largest value} - \text{smallest value}$.
The range is simple to compute, but it is not usually used as a reliable measure of dispersion because it is drastically affected by extreme values.

The Variance:
1) For raw data: the variance of the sample $x_1, x_2, \ldots, x_n$ is given by
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the mean of the sample.
One can easily show that the above formula for the variance also has the following form
$s^2 = \frac{1}{n - 1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$,
which is easier for computations with calculators.
The population variance is given by
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{1}{N}\left(\sum_{i=1}^{N} x_i^2 - N\mu^2\right)$,
where N is the population size and $\mu$ is the mean of the population.
2) For frequency tables:
$s^2 = \frac{1}{\left(\sum_{i=1}^{n} f_i\right) - 1}\left(\sum_{i=1}^{n} f_i x_i^2 - \left(\sum_{i=1}^{n} f_i\right)\bar{x}^2\right)$,
where $x_1, x_2, \ldots, x_n$ are the observations (or the midpoints) and $f_1, f_2, \ldots, f_n$ are their corresponding frequencies.

The Standard Deviation:
The variance represents squared units and, therefore, is not an appropriate measure of dispersion when we want to express it in terms of the original units. To obtain a measure of dispersion in the original units, we take the square root of the variance, which we refer to as the standard deviation.
The standard deviations of a sample and of a population are denoted by $s$ and $\sigma$, respectively: $s = \sqrt{s^2}$ and $\sigma = \sqrt{\sigma^2}$.

Example: Find the mean and standard deviation of each of the following samples
i) 42, 28, 28, 61, 31, 23, 50, 34, 32, 37
ii)
Class       Frequency
–2 – 0      1
1 – 3       2
4 – 6       3
7 – 9       2
10 – 12     1

Solution: i)
$x_i$     $x_i^2$
42        1764
28        784
28        784
61        3721
31        961
23        529
50        2500
34        1156
32        1024
37        1369
366       14592   ← Total
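Thus, $\bar{x} = \frac{366}{10} = 36.6$ and $s = \sqrt{\frac{1}{9}\left(14592 - 10 \times 36.6^2\right)} = 11.5297$.

ii)
Midpoint ($x_i$)   Frequency ($f_i$)   $f_i x_i$   $x_i^2$       $f_i x_i^2$
–1                 1                   –1          1             1
2                  2                   4           4             8
5                  3                   15          25            75
8                  2                   16          64            128
11                 1                   11          121           121
Total              9                   45          not needed    333

$\bar{x} = \frac{45}{9} = 5$ and $s = \sqrt{\frac{1}{8}\left(333 - 9 \times 5^2\right)} = 3.67423$.

The Coefficient of Variation
The coefficient of variation, denoted by C.V., is a unit-free measure that is used to compare the amount of dispersion between two different sets of data with (possibly) different means and different units. The coefficient of variation is given by $C.V. = \frac{s}{\bar{x}} \times 100$.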
Example:
The following table summarizes the data collected about the weights of two samples of human males

                      Sample 1      Sample 2
Age                   25 years      11 years
Mean Weight           145 pounds    80 pounds
Standard Deviation    10 pounds     10 pounds

Which of the samples is more dispersed?
Solution:
To compare dispersion we compute the C.V. for each sample.
C.V. for sample 1 = $\frac{10}{145} \times 100 = 6.9$
C.V. for sample 2 = $\frac{10}{80} \times 100 = 12.5$
Since the C.V. of sample 2 is greater than the C.V. of sample 1, sample 2 is more dispersed.
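The computational forms of the sample variance, the standard deviation and the coefficient of variation can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and the inputs are the raw sample, the midpoint/frequency table and the weight summaries used in the examples of this section.

```python
import math

def sample_variance(xs):
    """s^2 = (sum(x^2) - n * xbar^2) / (n - 1) for raw data."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

def grouped_sample_variance(midpoints, freqs):
    """Same formula with each midpoint weighted by its frequency."""
    n = sum(freqs)
    xbar = sum(f * x for x, f in zip(midpoints, freqs)) / n
    sum_sq = sum(f * x * x for x, f in zip(midpoints, freqs))
    return (sum_sq - n * xbar ** 2) / (n - 1)

def coefficient_of_variation(s, xbar):
    """C.V. = (s / xbar) * 100, a unit-free percentage."""
    return s / xbar * 100

raw = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]
s_raw = math.sqrt(sample_variance(raw))
s_grouped = math.sqrt(grouped_sample_variance([-1, 2, 5, 8, 11], [1, 2, 3, 2, 1]))
print(round(s_raw, 4), round(s_grouped, 5))
print(coefficient_of_variation(10, 145), coefficient_of_variation(10, 80))
```

Both samples in the weight example have the same standard deviation, so the difference in C.V. comes entirely from the difference in means.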
Percentiles and Quartiles
Percentiles and quartiles are used to indicate certain positions (or locations) of the observations (or data). The pth percentile is denoted by $P_p$; it is the number $P$ such that (almost) p% of the observations are less than or equal to $P$. The 25th percentile is also denoted by $Q_1$ and is also called the 1st quartile. The second quartile $Q_2$ is the 50th percentile (the median), while the 3rd quartile $Q_3$ is the 75th percentile.

Computing percentiles:
1) For ungrouped data, the pth percentile is taken to be the $\frac{p}{100}(n+1)$th ordered observation. Thus
$Q_1$ is the $0.25 \times (n+1)$th ordered observation,
$Q_2$ is the $0.5 \times (n+1)$th ordered observation,
$Q_3$ is the $0.75 \times (n+1)$th ordered observation.
The pth percentile for ungrouped data is computed using the formula
$P_p = x_k + (r - k)(x_{k+1} - x_k)$, where $r = \frac{p}{100}(n+1)$, $k$ is the floor of $r$, $x_k$ is the kth ordered observation, and n is the number of observations (or total frequency). Before you apply the formula, make sure that the observations are written in ascending order.

2) For grouped data, think of the pth percentile as the observation that has cumulative frequency $\frac{p}{100} \times n$, where $n$ is the total frequency. Find the first class that has cumulative frequency greater than or equal to $\frac{p}{100} \times n$, say $a - b$. Use the values and cumulative frequencies of the actual limits of this class to approximate the required percentile linearly, as shown in the example.

Interquartile Range:
The interquartile range is denoted by IQR. It is given by $IQR = Q_3 - Q_1$.

Example:
Find $Q_1$, the median, $Q_3$, $P_{60}$, and IQR for the following observations
23, 12, 54, 43, 51, 17, 32, 19, 14, 22, 25, 28, 33, 42, 26, 38, 50
Solution:
We start with putting the above data in ascending order:
12, 14, 17, 19, 22, 23, 25, 26, 28, 32, 33, 38, 42, 43, 50, 51, 54
The number of observations is $n = 17$.
The first quartile $Q_1$ is the $0.25 \times (17 + 1)$th ordered observation, i.e. $Q_1$ is the 4.5th observation. Now, the 4th observation is 19 and the 5th observation is 22, hence the 4.5th observation is $x_4 + 0.5(x_5 - x_4) = 19 + (0.5 \times (22 - 19)) = 20.5$.
The median is the $0.5 \times (17 + 1)$th ordered observation, i.e. it is the 9th observation, namely 28.
$Q_3$ is the $0.75 \times (17 + 1)$th observation, i.e., it is the 13.5th observation, namely $42 + (0.5 \times (43 - 42)) = 42.5$.
$P_{60}$ is the $0.6 \times (17 + 1)$th observation, i.e., it is the 10.8th observation, namely $32 + (0.8 \times (33 - 32)) = 32.8$.
$IQR = Q_3 - Q_1 = 42.5 - 20.5 = 22$.
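The position rule used in this solution (the pth percentile is the $\frac{p}{100}(n+1)$th ordered observation, with linear interpolation between neighbours) can be sketched in Python. The function name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside 1..n is an assumption added here; the handout does not discuss that case.

```python
import math

def percentile(data, p):
    """pth percentile as the (p/100)(n+1)th ordered observation."""
    xs = sorted(data)                 # the rule requires ascending order
    n = len(xs)
    r = p / 100 * (n + 1)             # fractional position, 1-based
    k = math.floor(r)
    if k < 1:                         # assumption: clamp below the first value
        return xs[0]
    if k >= n:                        # assumption: clamp above the last value
        return xs[-1]
    # interpolate between the kth and (k+1)th ordered observations
    return xs[k - 1] + (r - k) * (xs[k] - xs[k - 1])

obs = [23, 12, 54, 43, 51, 17, 32, 19, 14, 22, 25, 28, 33, 42, 26, 38, 50]
q1, q2, q3 = percentile(obs, 25), percentile(obs, 50), percentile(obs, 75)
print(q1, q2, q3, q3 - q1)
print(percentile(obs, 60))
```

Library routines such as `statistics.quantiles` or `numpy.percentile` support several position rules, so their defaults may give slightly different values than the rule used in these notes.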
Example:
Find the median and the 80th percentile of the following data.

x       Frequency
3       2
5       4
9       3
12      5
17      4
Total   18

Solution: The total frequency is $n = 18$.
To find the median: $\frac{50}{100} \times 19 = 9.5$. Thus, the median is the 9.5th ordered observation, which is $x_9 + 0.5(x_{10} - x_9) = 9 + 0.5 \times (12 - 9) = 10.5$.
To find the 80th percentile: $\frac{80}{100} \times 19 = 15.2$. Thus, the 80th percentile is the 15.2nd ordered observation, which is $x_{15} + 0.2(x_{16} - x_{15}) = 17 + 0.2 \times (17 - 17) = 17$.
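Example:
Find the median for the following grouped data.

Class       Frequency
2 – 6       3
7 – 11      5
12 – 16     7
17 – 21     2
Total       17

Solution:
$n = 17$ and $\frac{50}{100} \times 17 = 8.5$. The first class that has cumulative frequency greater than or equal to 8.5 is 12 – 16, and the actual limits of this class are 11.5 and 16.5, respectively. The cumulative frequency is 8 at 11.5 and 15 at 16.5, so interpolating linearly gives
median $= 11.5 + 0.5 \times \frac{16.5 - 11.5}{15 - 8} = 11.8571$.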
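The grouped-data rule can be written as a short Python sketch: locate the first class whose cumulative frequency reaches $\frac{p}{100} \times n$, then interpolate linearly between its actual limits. The function name `grouped_percentile`, the `(lower, upper, frequency)` tuple layout and the `unit` parameter (the accuracy unit, 1 for integer data) are illustrative assumptions.

```python
def grouped_percentile(classes, p, unit=1):
    """classes: list of (lower, upper, frequency) in ascending order."""
    n = sum(f for _, _, f in classes)
    target = p / 100 * n                      # required cumulative frequency
    below = 0                                 # cumulative frequency before this class
    for lower, upper, f in classes:
        if f > 0 and below + f >= target:
            lo = lower - unit / 2             # actual lower limit
            hi = upper + unit / 2             # actual upper limit
            return lo + (target - below) / f * (hi - lo)
        below += f
    raise ValueError("p must be between 0 and 100")

table = [(2, 6, 3), (7, 11, 5), (12, 16, 7), (17, 21, 2)]
print(round(grouped_percentile(table, 50), 4))
```

For the table in the example, the target 8.5 falls in the class 12 – 16, whose actual limits are 11.5 and 16.5.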
Example: Consider the following table of grouped data

Class       Frequency
10 – 15     3
16 – 21     2
22 – 27     5
28 – 33     2
34 – 39     4
Total       16

Estimate the proportion of observations that are less than 24.
Solution:
The observation 24 belongs to the class 22 – 27.
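The remainder of this solution is not reproduced in the text here. One way to finish it, assuming the same linear interpolation between actual limits that is used for grouped percentiles, is sketched below in Python; the function name `proportion_below` and the tuple layout are illustrative assumptions, not the author's stated method.

```python
def proportion_below(classes, value, unit=1):
    """Estimated proportion of observations less than `value`.

    classes: list of (lower, upper, frequency); observations are assumed
    to be spread evenly between the actual limits of each class.
    """
    n = sum(f for _, _, f in classes)
    below = 0.0
    for lower, upper, f in classes:
        lo, hi = lower - unit / 2, upper + unit / 2   # actual limits
        if value >= hi:
            below += f                                # whole class lies below
        elif value > lo:
            below += f * (value - lo) / (hi - lo)     # part of the class lies below
    return below / n

table = [(10, 15, 3), (16, 21, 2), (22, 27, 5), (28, 33, 2), (34, 39, 4)]
print(round(proportion_below(table, 24), 4))
```

Under this assumption the first two classes contribute 5 observations, and the class 22 – 27 (actual limits 21.5 and 27.5) contributes $5 \times \frac{24 - 21.5}{6} \approx 2.08$, so the estimate is about $7.08/16 \approx 0.44$.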
Box-and-Whisker Plots (Box plots) (Optional):
A box-and-whisker plot (or simply a box plot) is a useful visual device for demonstrating the information contained in a data set. It reveals information regarding the amount of spread, location of concentration, and symmetry of the data. The construction of such a plot makes use of the quartiles of a data set and may be accomplished by the following steps:
1. Represent the data on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile $Q_1$ and the right end of the box aligns with the third quartile $Q_3$.
3. Divide the box into two parts by a vertical line that aligns with the median $Q_2$.
4. Draw a horizontal line, called a whisker, from the left end of the box to a point that aligns with the smallest measurement in the data set.
5. Draw another horizontal line, or whisker, from the right end of the box to a point that aligns with the largest measurement in the data set.

Example: Construct a box-and-whisker plot for the data in the previous example.
Solution:
Reading Assignment: Chapter 2 (2.5) in W.W. Daniel.

Chapter 3
Some Basic Probability Concepts
Elementary Properties of Probability:
A random experiment is an experiment whose outcome is a random variable, i.e., it cannot be predicted with certainty.
The sample space of a random experiment is the collection of all possible values of its outcome.
An event is a subcollection of the sample space. The empty event is denoted by $\emptyset$; it is the event of having no outcomes.
The probability of an event E is denoted by P(E). It is a nonnegative number, less than or equal to 1, that measures the likelihood of the occurrence of the event E.

Example: The following is the sample space of the experiment of tossing a coin:
$S = \{H, T\}$
where H stands for head and T stands for tail.
The following is the collection of all possible events of the experiment of tossing a coin:
$\emptyset, \{H\}, \{T\}, \{H, T\}$

Example: Find the sample space and five different events of the experiment of tossing a coin 2 times.
Solution:
$S = \{(H,H), (H,T), (T,H), (T,T)\}$
The following are events of the experiment:
$E_1 = \{(H,T), (H,H)\}$, $E_2 = \{(H,T), (H,H), (T,T)\}$, $E_3 = \{(H,H)\}$,
$E_4 = S = \{(H,H), (H,T), (T,H), (T,T)\}$, $E_5 = \emptyset$

Definition: If every possible value of the outcome of a random experiment has the same chance to occur, then the experiment is said to be equally likely. If an experiment is equally likely and has a finite sample space S, then the probability of an event E of this experiment is given by
$P(E) = \frac{|E|}{|S|}$, where $|\cdot|$ stands for the number of elements and S is the sample space of the experiment.

Example:
Find the probability of having a total number of dots greater than 4 if a pair of fair dice are rolled.
Solution:
The sample space of the experiment of rolling a pair of dice is
$S = \{(1,1), (1,2), (1,3), \ldots, (1,6), (2,1), (2,2), \ldots, (2,6), \ldots, (6,6)\}$
The mentioned event is the following
$E = \{(1,4), (1,5), (1,6), (2,3), (2,4), (2,5), (2,6), (3,2), \ldots, (3,6), (4,1), \ldots, (4,6), (5,1), \ldots, (5,6), (6,1), \ldots, (6,6)\}$
Notice that $|S| = 36$ and $|E| = 30$. Thus $P(E) = \frac{30}{36} = \frac{5}{6}$.
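For an equally likely experiment, the ratio $|E| / |S|$ can be verified by listing the sample space. The following Python sketch enumerates the 36 outcomes of rolling a pair of dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die)
sample_space = list(product(range(1, 7), repeat=2))

# Event: the total number of dots is greater than 4
event = [outcome for outcome in sample_space if sum(outcome) > 4]

p = Fraction(len(event), len(sample_space))
print(len(sample_space), len(event), p)
```

Counting the complement is often quicker by hand: only six outcomes have a total of 4 or less, which leaves 36 – 6 = 30 outcomes in the event.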
Conditional Probability: If A and B are events, then by $P(B \mid A)$ we denote the probability of occurrence of the event B given that the event A has occurred. It is called a conditional probability and it is read "probability of B given A".

Elementary Properties:
1. For any event E, $0 \le P(E) \le 1$.
2. $P(\emptyset) = 0$ and $P(S) = 1$.
3. If $S = \{s_1, s_2, \ldots, s_n\}$ then $P(\{s_1\}) + P(\{s_2\}) + \cdots + P(\{s_n\}) = 1$.
4. If $A \subseteq B$ then $P(A) \le P(B)$.
5. $P(\text{not } A) = P(\bar{A}) = 1 - P(A)$.
6. $P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \text{ and } B) = P(A) + P(B) - P(A \cap B)$.
7. $P(\overline{A \cup B}) = P(\bar{A} \cap \bar{B})$ and $P(\overline{A \cap B}) = P(\bar{A} \cup \bar{B})$.
Example: The following table represents the frequency of cocaine use by gender among 111 adult cocaine users (in the US)
Life time frequency of cocaine use Male (M) Female (F) Total
1 – 19 times (A) 32 7 39
20 – 99 times (B) 18 20 38
100 + times (C) 25 9 34
Total 75 36 111
1. What is the probability that a randomly selected user will be a male?2. If we pick a person at random from the 111 group and found out that he is a male (M),
what is the probability that he used cocaine 100 + times (C) ?3. What is the probability that a randomly selected person from the 111 group is a
male (M) and a person who used cocaine 100 + times (C) ?4. What is the probability that a randomly selected person from the 111 group is a
female (F) or a person who used cocaine 20 - 99 times (B) ?5. What is the probability that a randomly selected person from the 111 group is not a
a person who used cocaine 100 + times (C) ?Solution:
1. P(M) = |M| / 111 = 75/111
2. We use the notation P(C | M) to denote the probability of the event C given that the event M has occurred. It is read "probability of C given M". Knowing that the selected person is a male reduces our sample space to the group of males only, thus
P(C | M) = |"C" for males| / |M| = 25/75
3. P(M and C) = |M and C| / 111 = 25/111
4. P(F or B) = P(F) + P(B) − P(F and B) = 36/111 + 38/111 − 20/111 = 54/111
5. P(not C) = 1 − P(C) = 1 − 34/111 = 77/111
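The five answers above can be reproduced directly from the table counts. The sketch below is one possible way to do it in Python; the dictionary layout and the helper name count are assumptions made for illustration only:

```python
from fractions import Fraction

# Counts from the table: rows A, B, C (frequency of use); columns M, F (gender).
counts = {
    ("A", "M"): 32, ("A", "F"): 7,
    ("B", "M"): 18, ("B", "F"): 20,
    ("C", "M"): 25, ("C", "F"): 9,
}
total = sum(counts.values())  # 111 users

def count(row=None, col=None):
    """Number of users matching the given row and/or column label."""
    return sum(v for (r, c), v in counts.items()
               if (row is None or r == row) and (col is None or c == col))

p_m = Fraction(count(col="M"), total)                     # question 1
p_c_given_m = Fraction(count("C", "M"), count(col="M"))   # question 2
p_m_and_c = Fraction(count("C", "M"), total)              # question 3
p_f_or_b = Fraction(count(col="F") + count(row="B") - count("B", "F"), total)  # question 4
p_not_c = 1 - Fraction(count(row="C"), total)             # question 5
```

Note that the conditional probability in question 2 divides by the number of males, not by 111, which mirrors the "reduced sample space" argument in the solution.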
Example: In a group of people, 25% have both diabetes and hypertension, 42% have hypertension, and 35% have diabetes. A person is selected at random from this group. What is the probability that this person
a. is diabetic or hypertensive?
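b. does not have hypertension?
c. is not diabetic and does not have hypertension?
Solution: Let D denote "diabetic" and H denote "hypertensive".
a. P(D or H) = P(D) + P(H) − P(D and H) = 0.35 + 0.42 − 0.25 = 0.52
b. P(not H) = 1 − P(H) = 1 − 0.42 = 0.58
c. P(not D and not H) = P(not (D or H)) = 1 − P(D ∪ H) = 1 − 0.52 = 0.48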
Calculating the Probability of an Event; Conditional Probability:
Recall that by P(B | A) we denote the probability of occurrence of the event B given that the event A has occurred. The conditional probability P(B | A) can be computed, when P(A) > 0, using the formula
P(B | A) = P(A and B) / P(A) = P(A ∩ B) / P(A)
Thus P(A and B) = P(A ∩ B) = P(A) P(B | A)
Example: Let A, B be two events such that P(A) = 0.4, P(B) = 0.8 and P(A ∩ B) = 0.3. Find P(A^c | B), the probability of "not A" given B.
Solution: P(A^c | B) = P(A^c ∩ B) / P(B). Use the following table to find the value of each of these quantities.

                     A      A^c    Total Probability
B                    0.3    0.5    0.8
B^c                  0.1    0.1    0.2
Total Probability    0.4    0.6    1

Thus P(A^c | B) = P(A^c ∩ B) / P(B) = 0.5 / 0.8 = 0.625
Definition: The events A and B are independent if P(A and B) = P(A ∩ B) = P(A) P(B).
Equivalently, if P(A) > 0 and P(B) > 0 then the events A and B are independent if P(B | A) = P(B) (and P(A | B) = P(A)).
Example: In a group of people, 25% have both diabetes and hypertension, 42% have hypertension, and 35% have diabetes.
a. What percent of the people who have hypertension also have diabetes?
b. For that group of people, are the events "Diabetic" and "Hypertensive" independent?
Solution:
a. P(diabetic | hypertensive) = P(diabetic and hypertensive) / P(hypertensive) = 0.25 / 0.42 = 0.595
Thus 59.5% of the people who have hypertension also have diabetes.
b. The events "Diabetic" and "Hypertensive" are not independent because
P(diabetic | hypertensive) = 0.595 ≠ P(diabetic) = 0.35
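Both parts of this solution reduce to two lines of arithmetic. A minimal Python sketch, with variable names chosen here for illustration, computes the conditional probability and tests independence through the product rule P(A and B) = P(A) P(B):

```python
import math

p_diabetic = 0.35
p_hypertensive = 0.42
p_both = 0.25  # P(diabetic and hypertensive)

# Conditional probability: P(diabetic | hypertensive) = P(both) / P(hypertensive).
p_diabetic_given_hypertensive = p_both / p_hypertensive

# Independence holds only if P(both) equals the product of the two marginals.
independent = math.isclose(p_both, p_diabetic * p_hypertensive)

print(round(p_diabetic_given_hypertensive, 3), independent)
```

The comparison uses math.isclose rather than == because the probabilities are floating-point numbers; here the product of the marginals is 0.147, far from 0.25.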
Fact: If A and B are independent then the following events are also independent: A^c and B, A and B^c, A^c and B^c.
Example: Let A, B be two independent events such that P(A) = 0.4 and P(B) = 0.2. Find
i) P(A^c ∩ B) ii) P(B^c | A).
Solution: Since A and B are independent, A^c and B, A and B^c are independent. Thus:
i) P(A^c ∩ B) = P(A^c) P(B) = (1 − 0.4)(0.2) = 0.12
ii) P(B^c | A) = P(B^c) = 1 − 0.2 = 0.8
Definition: The events A and B are mutually exclusive if P(A ∪ B) = P(A) + P(B).
Equivalently, the events A and B are mutually exclusive if P(A ∩ B) = 0.
Example: If a person (in the above example) is selected at random,
a. what is the probability that this person is diabetic or hypertensive?
b. are the events "Diabetic" and "Hypertensive" mutually exclusive?
Solution:
a. P(diabetic or hypertensive) = P(diabetic) + P(hypertensive) − P(diabetic and hypertensive) = 0.35 + 0.42 − 0.25 = 0.52
b. The events "Diabetic" and "Hypertensive" are not mutually exclusive because P(diabetic and hypertensive) = 0.25 ≠ 0.
Example: If a person (in the above example) is selected at random, what is the probability that this person:
a. does not have hypertension
b. is not diabetic and does not have hypertension
Solution:
a. P(does not have hypertension) = 1 − P(has hypertension) = 1 − 0.42 = 0.58
b. P(not diabetic and not hypertensive) = P(not (diabetic or hypertensive)) = 1 − P(diabetic or hypertensive) = 1 − 0.52 = 0.48
Bayes’s Theorem. Screening Tests, Sensitivity,…
HANDOUT IS NOT AVAILABLE. READ FROM YOUR MAIN REFERENCE.
Reading Assignment:
Chapter 3 (3.1, 3.2, 3.3, 3.4, 3.5) in W.W. Daniel.
University of Jordan Fall 2008 / 2009 Department of Mathematics
Chapter 4
Probability Distributions
The Distribution of a Discrete Random Variable:
The distribution of a discrete random variable X is a table, a graph or a formula that is used to specify all possible values of X along with the probability of each one of these possible values.
Example: Consider the following distribution of a discrete random variable X.
k P(X = k)
0 0.2
1 0.3
2 0.1
3 0.4
Total 1
Find: 1) P(X is odd) 2) P(X is even | X > 0)
Solution:
1) P(X is odd) = P(X = 1 or X = 3) = P(X = 1) + P(X = 3) = 0.3 + 0.4 = 0.7
2) P(X is even | X > 0) = P(X is even and X > 0) / P(X > 0) = P(X = 2) / (1 – P(X = 0)) = 0.1 / 0.8 = 0.125
The Expected Value (Mean) and Variance of a Discrete Random Variable:
The expected value (or the mean) of a discrete random variable X is denoted by E(X) (or μ) and is given by E(X) = Σ k P(X = k), where the sum runs over all possible values k of the random variable X. The variance of X is given by Var(X) = E(X²) − (E(X))², where E(X²) = Σ k² P(X = k).
Example: Find E(X) and Var(X) for the random variable X given in the above example.
Solution:

k        P(X = k)    k P(X = k)    k² P(X = k)
0        0.2         0             0
1        0.3         0.3           0.3
2        0.1         0.2           0.4
3        0.4         1.2           3.6
Total    1           1.7           4.3

E(X) = 1.7 and Var(X) = 4.3 − (1.7)² = 1.41
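The column sums in the table translate directly into code. As a minimal sketch (the dictionary pmf and the other names are illustrative assumptions), the mean and variance of the distribution above can be computed as follows:

```python
# Distribution of X from the example: value k -> P(X = k).
pmf = {0: 0.2, 1: 0.3, 2: 0.1, 3: 0.4}

# E(X) = sum of k * P(X = k); E(X^2) = sum of k^2 * P(X = k).
mean = sum(k * p for k, p in pmf.items())
second_moment = sum(k**2 * p for k, p in pmf.items())

# Var(X) = E(X^2) - (E(X))^2.
variance = second_moment - mean**2

print(round(mean, 2), round(second_moment, 2), round(variance, 2))
```

The results are rounded only for display, since sums of decimal fractions are not exact in floating point.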
331331 Biostatistics Lecture #9 Dr. Osama Alkam
The Binomial Experiment and Distribution:
Before we introduce the binomial (or Bernoulli) experiments we introduce some notation for some relevant mathematical quantities.
1. The Factorial of a Nonnegative Integer: if n is a nonnegative integer then by n! we denote what is referred to as "n factorial", defined by
n! = 1 if n = 0
n! = n × (n − 1) × (n − 2) × ... × 2 × 1 if n > 0
Remark: for any n ≥ 1, n! = n × (n − 1)!
Example: 0! = 1, 1! = 1, 2! = 2, 3! = 3 × 2 × 1 = 6, 4! = 4 × 3! = 24, ...
2. Combinations: if n is a positive integer and k is an integer such that 0 ≤ k ≤ n then the combination C(n, k) (read "n choose k") is defined by
C(n, k) = n! / (k! × (n − k)!)
Example:
C(10, 10) = 10! / (10! × 0!) = 1
C(10, 0) = 10! / (0! × 10!) = 1
C(10, 1) = 10! / (1! × 9!) = (10 × 9!) / (1 × 9!) = 10
C(10, 4) = 10! / (4! × 6!) = (10 × 9 × 8 × 7 × 6!) / (4 × 3 × 2 × 6!) = 10 × 3 × 7 = 210
Fact: The number of ways of selecting k objects from n objects is given by C(n, k).
Example: How many teams of 6 players can we choose out of a group of 8 people?
Answer: C(8, 6) = 8! / (6! × 2!) = (8 × 7 × 6!) / (6! × 2) = 28 teams.
Example: In how many ways can we choose 3 balls from an urn that contains 5 balls?
Answer: C(5, 3) = 5! / (3! × 2!) = 10 ways.
Example: How many events with size 4 are there if the size of the sample space is 6?
Answer: C(6, 4) = 6! / (4! × 2!) = (6 × 5 × 4!) / (4! × 2) = 15 events.
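The factorial and combination examples above can be verified with Python's standard library. The sketch below assumes Python 3.8 or later (for math.comb); the helper combinations_from_definition is a hypothetical name introduced here to mirror the defining formula:

```python
from math import comb, factorial

def combinations_from_definition(n, k):
    """n! / (k! * (n - k)!), following the definition in the text."""
    return factorial(n) // (factorial(k) * factorial(n - k))

# Worked examples from the text: (n, k) pairs.
examples = [(10, 10), (10, 0), (10, 1), (10, 4), (8, 6), (5, 3), (6, 4)]
for n, k in examples:
    # math.comb should agree with the definition for every pair.
    assert comb(n, k) == combinations_from_definition(n, k)
    print(f"C({n}, {k}) = {comb(n, k)}")
```

Integer division (//) is safe here because k! × (n − k)! always divides n! exactly.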
The binomial (or Bernoulli) experiment:
A binomial (or Bernoulli) experiment is a random experiment that has the following properties:
1) each trial has exactly one of two possible outcomes; one is referred to as success and the other is referred to as failure.
2) the probability of success in each trial of the experiment is constant, usually denoted by p.
3) all trials of the experiment are independent.
Examples:
1. Tossing a coin. The outcome is either a head or a tail.
2. Checking whether a newborn is a boy or a girl.
3. Checking whether a person is diabetic or not.
The Binomial Random Variable:
The binomial random variable X is the number of successes when a binomial experiment, with probability of success p in each trial, is performed n times. We denote it by X ~ Bin(n, p). The possible values of X are 0, 1, …, n.
Examples:
1. Select a random sample of 10 people. Let X be the number of diabetics within this sample. Then X ~ Bin(10, p), where p is the proportion of diabetics in the population from which the sample is selected. The possible values of X are 0, 1, 2, …, 10.
2. Toss a fair coin 20 times. Let X be the number of times a head comes out. Then X ~ Bin(20, 0.5). The possible values of X are 0, 1, 2, …, 20.
Fact: If X ~ Bin(n, p) then
1) for each k = 0, 1, ..., n, P(X = k) = C(n, k) p^k (1 − p)^(n − k)
2) E(X) = np
3) Var(X) = np(1 − p)
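Example:
Let X ~ Bin(5, 0.3). Find: 1) P(X = 2) 2) E(X) 3) E(X²)
Solution:
1) P(X = 2) = C(5, 2) (0.3)² (0.7)³ = (5! / (2! × 3!)) × 0.09 × 0.343 = 10 × 0.09 × 0.343 = 0.3087
2) E(X) = 5 × 0.3 = 1.5
3) Var(X) = E(X²) − (E(X))². Thus E(X²) = Var(X) + (E(X))² = 5 × 0.3 × 0.7 + (1.5)² = 3.3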
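The three answers for X ~ Bin(5, 0.3) can be reproduced from the probability formula alone. The following is a minimal sketch; the function name binomial_pmf is an assumption made for this illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.3
p_two = binomial_pmf(2, n, p)

# E(X) and E(X^2) computed by summing over all possible values 0, 1, ..., n.
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
second_moment = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1))

print(round(p_two, 4), round(mean, 2), round(second_moment, 2))
```

Summing over all values is slower than using E(X) = np, but it confirms that the shortcut formulas agree with the definition of expected value.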
Example: Let X ~ Bin(2, 0.5). Exhibit the distribution of X as a table.
Solution:

k        P(X = k)
0        0.25
1        0.5
2        0.25
Total    1

Example: Suppose that the probability that a patient suffering from migraine headache pain will obtain relief with a particular drug is 0.9. Three randomly selected sufferers from migraine headache are given this drug. Find the probability that the number of sufferers in the selected sample obtaining relief will be:
1) Exactly zero
2) At least one
3) Two or three
4) At most two
Solution: Let X be the number of sufferers in the selected sample obtaining relief. Then X ~ Bin(3, 0.9).
1) P(X = 0) = C(3, 0) (0.9)^0 (0.1)^3 = (0.1)^3 = 0.001
2) P(X ≥ 1) = 1 − P(X = 0) = 1 − 0.001 = 0.999
3) P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.243 + 0.729 = 0.972
4) P(X ≤ 2) = 1 − P(X = 3) = 1 − 0.729 = 0.271
Note: The binomial distribution is completely determined by n and p. They are called "the parameters of the binomial distribution".
Binomial Tables:
When n is large, the calculation of binomial probabilities using the equation can be tedious. We may bypass these tedious calculations by using a binomial table. Binomial tables enable us to read the value of P(X ≤ k) for any k = 0, 1, …, n.
The following is a part of the binomial table for n = 10.
Example: Let X ~ Bin(10, 0.3). Use the above table to find:
1) P(X ≤ 4)
2) P(X < 4)
3) P(X = 4)
4) P(X > 4)
5) P(X ≥ 4)
6) P(2 < X < 6)
7) P(2 ≤ X < 6)
8) P(2 ≤ X ≤ 6)
9) P(2 < X ≤ 6)
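Solution:
1) P(X ≤ 4) = 0.850
2) P(X < 4) = P(X ≤ 3) = 0.650
3) P(X = 4) = P(X ≤ 4) − P(X ≤ 3) = 0.850 − 0.650 = 0.200
4) P(X > 4) = 1 − P(X ≤ 4) = 1 − 0.850 = 0.150
5) P(X ≥ 4) = 1 − P(X < 4) = 1 − P(X ≤ 3) = 1 − 0.650 = 0.350
6) P(2 < X < 6) = P(3 ≤ X ≤ 5) = P(X ≤ 5) − P(X ≤ 2) = 0.953 − 0.383 = 0.570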
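The table entries used above are cumulative probabilities P(X ≤ k), so they can be regenerated when a printed table is not at hand. The sketch below is a hedged illustration (binomial_cdf is a name chosen here, not a library function) for n = 10 and p = 0.3:

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summing the probability formula from 0 to k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, p = 10, 0.3

# Rebuild the cumulative column of the table, rounded to three decimals.
table = {k: round(binomial_cdf(k, n, p), 3) for k in range(n + 1)}

# Same steps as parts 3) and 6) of the solution.
p_equal_4 = binomial_cdf(4, n, p) - binomial_cdf(3, n, p)
p_between = binomial_cdf(5, n, p) - binomial_cdf(2, n, p)

print(table[4], table[3], round(p_equal_4, 3), round(p_between, 3))
```

Every question of this type reduces to differences of cumulative values, which is why the table only needs to list P(X ≤ k).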
The rest are left as an exercise.
Reading Assignment: Chapter 4 (4.1, 4.2, 4.3) in W.W. Daniel, 7th edition.
The Poisson Random Variable:
The Poisson random variable X is the number of occurrences of a rare event in an interval of time or a space unit. If λ is the average (or expected) number of occurrences of this event in the time (or space) unit then we write X ~ Poisson(λ). The possible values of X are 0, 1, 2, …
Fact: If X ~ Poisson(λ) then
1) for each k = 0, 1, 2, ..., P(X = k) = e^(−λ) λ^k / k!, where e ≈ 2.718
2) E(X) = λ
3) Var(X) = λ
Example: Let X ~ Poisson(3). Find: 1) P(X > 0) 2) E(X²)
Solution:
1) P(X > 0) = 1 − P(X ≤ 0) = 1 − P(X = 0) = 1 − e^(−3) 3^0 / 0! = 1 − e^(−3) ≈ 0.950
2) Var(X) = E(X²) − (E(X))². Thus E(X²) = Var(X) + (E(X))² = 3 + 3² = 12
Example: The number of cases admitted to the CCU in a certain hospital is distributed according to a Poisson distribution with average 3 cases per day. Find the probability of admitting 25 cases to the CCU in this hospital in a random week.
Solution: Let X be the number of cases admitted to the CCU in this hospital in a week. Then X ~ Poisson(3 × 7) = Poisson(21). Thus, P(X = 25) = e^(−21) 21^25 / 25! = 0.055546
Note: The Poisson distribution is completely determined by λ. It is called "the parameter of the Poisson distribution".
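The CCU calculation is awkward by hand because of 21^25 and 25!, but it is a one-line function in code. A minimal sketch, where poisson_pmf is a name assumed for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): e^(-lam) * lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

# 3 cases per day on average, so 3 * 7 = 21 cases per week on average.
weekly_rate = 3 * 7
p_25_cases = poisson_pmf(25, weekly_rate)

# The earlier example: P(X > 0) for X ~ Poisson(3).
p_at_least_one = 1 - poisson_pmf(0, 3)

print(round(p_25_cases, 6), round(p_at_least_one, 3))
```

The key modelling step is rescaling the parameter to the time unit of the question (a week) before applying the formula.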
Poisson Tables: Poisson tables enable us to read the value of P(X ≤ k) for any k = 0, 1, … when X ~ Poisson(λ) for several values of λ.
The following is a part of a Poisson table for P(X ≤ k).
Exercise: Let X ~ Poisson(1.5). Use the above table to find:
1) P(X ≤ 3)
2) P(X < 3)
3) P(X = 3)
4) P(X > 2)
5) P(X ≥ 2)
6) P(2 < X < 5)
7) P(2 ≤ X < 5)
8) P(2 ≤ X ≤ 5)
9) P(2 < X ≤ 5)
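If a printed Poisson table is not available, its cumulative column can be rebuilt from the probability formula. The sketch below is an illustration under that assumption (poisson_cdf is a hypothetical helper name) for λ = 1.5; it produces only the table values and leaves the exercise itself to the reader:

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing e^(-lam) * lam^j / j! for j = 0..k."""
    return sum(exp(-lam) * lam**j / factorial(j) for j in range(k + 1))

lam = 1.5

# Cumulative probabilities P(X <= k) for k = 0, ..., 5, rounded as in a table.
table = {k: round(poisson_cdf(k, lam), 4) for k in range(6)}

for k, value in table.items():
    print(f"P(X <= {k}) = {value}")
```

Each part of the exercise is then a difference or a complement of these cumulative values, exactly as in the binomial table example.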
Reading Assignment: Chapter 4 (4.4) in W.W. Daniel.
The Normal Distribution:
The normal distribution is probably one of the most important and widely used continuous distributions. A normally distributed random variable is known as a normal random variable. The following are the properties of the normal distribution:
Properties of the Normal Distribution:
1. It is bell shaped and is symmetrical about its mean.
2. Its mean equals its median equals its mode.
3. It is a continuous distribution.
4. It is completely determined by its mean and its variance. A normal random variable X with mean μ and variance σ² is expressed as X ~ N(μ, σ²).
5. The total area under the curve equals 1. Thus, the area of the distribution on each side of the mean is 0.5.
6. The probability that the normal random variable will have a value between any two points is equal to the area under the curve between those points.
The curve on the right is skewed to the right. Its mode < its median < its mean. The one on the left is skewed to the left. Its mode > its median > its mean.
To find the probability that a normal random variable X will have a value smaller than a given number, we transform the normal random variable X to the standard normal random variable Z that has mean 0 and variance 1. This transformation is done using the formula Z = (X − μ) / σ.
A standard Z table can be used to find probabilities for any normal curve problem that has been converted to Z scores.
The following steps are helpful when working with the normal curve problems:
1. Graph the normal distribution, and shade the area related to the probability you want to find.
2. Convert the boundaries of the shaded area from X values to the standard normal random variable Z values using the Z formula above.
3. Use the standard Z table to find the probabilities or the areas related to the Z values in step 2.
Example:
The weights of 1000 children are normally distributed with mean 25 kg and standard deviation 5 kg.
1) Find the proportion of children that have weights between 22 kg and 28 kg.
2) About how many children have weights smaller than 30 kg?
3) If a child is randomly selected, find the probability that her/his weight is smaller than 28 kg.
4) Find the third quartile of the weights of these children.
5) Find a positive number C such that 68% of the children have weights between 25 – C and 25 + C.
Solution:
Let X represent the children’s weights. Then X ~ N(25, 5²).
1) To find P(22 < X < 28):
P(22 < X < 28) = P((22 − 25)/5 < Z < (28 − 25)/5) = P(−0.6 < Z < 0.6) = P(Z < 0.6) − P(Z < −0.6) = 0.7257 − 0.2743 = 0.4514
2) P(X < 30) = P(Z < (30 − 25)/5) = P(Z < 1) = 0.841
Thus, about 0.841 × 1000 = 841 children have weights less than 30 kg.
3) Find P(X < 28) (Exercise)
4) The third quartile is nothing but Q3, which is characterized by the property P(X < Q3) = 0.75. Thus, P(Z < (Q3 − 25)/5) = 0.75. From the standard normal table we find that (Q3 − 25)/5 ≈ 0.67. Hence, Q3 = 5 × 0.67 + 25 = 28.35 kg.
5) P(25 − C < X < 25 + C) = 0.68 → P((25 − C − 25)/5 < Z < (25 + C − 25)/5) = 0.68 → P(−C/5 < Z < C/5) = 0.68 → P(Z < C/5) = 0.84 → C/5 = 1 → C = 5.
Reading Assignment:
Chapter 4 (4.6, 4.7) in W.W. Daniel.
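The children's weights example above can be cross-checked without a Z table. The following sketch uses statistics.NormalDist from the Python standard library (available from Python 3.8); the variable names are illustrative. Small differences from the handout's answers come from the table rounding of Z values such as 0.67:

```python
from statistics import NormalDist

# X ~ N(25, 5^2): NormalDist takes the standard deviation, not the variance.
weights = NormalDist(mu=25, sigma=5)

# 1) P(22 < X < 28) as a difference of two cumulative probabilities.
p_between = weights.cdf(28) - weights.cdf(22)

# 2) Expected number of children (out of 1000) lighter than 30 kg.
expected_below_30 = 1000 * weights.cdf(30)

# 4) Third quartile: the value with cumulative probability 0.75.
third_quartile = weights.inv_cdf(0.75)

print(round(p_between, 4), round(expected_below_30), round(third_quartile, 2))
```

The method inv_cdf plays the role of reading the Z table "backwards", from a probability to a value.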
Chapter 5
Some Important Sampling Distributions
Introduction:
A statistical measure for a sample is called a statistic and a statistical measure for a population is called a parameter. Examples of statistics are x̄, s, … . The following are parameters: μ, σ, … . A statistic is a random variable but a parameter is not. Sample statistics like x̄ and s are used to estimate population parameters like μ and σ, respectively. There is some difference (or error) between statistics and parameters. Different samples from the same population may have different amounts of sampling error. Studying sampling distributions of sample statistics helps us understand statistical inference and allows us to answer questions about sample statistics.
Sampling Distributions:
The sampling distribution of a statistic is the distribution of the values taken by that statistic in all possible samples of the same size that are drawn from the same population.
Note: The number of all possible samples of size n, drawn without replacement from a population of size N, equals C(N, n) = N! / (n! (N − n)!). If we allow replacement then the number of all possible samples is N^n.
Example: The following table gives all possible samples of size 2 drawn with replacement from a population that comprises the weights (in pounds) of 5 children, together with the mean of each sample.
Population data: 65 54 67 65 88

Population    65               54               67               65               88
65            (65,65), 65      (54,65), 59.5    (67,65), 66      (65,65), 65      (88,65), 76.5
54            (65,54), 59.5    (54,54), 54      (67,54), 60.5    (65,54), 59.5    (88,54), 71
67            (65,67), 66      (54,67), 60.5    (67,67), 67      (65,67), 66      (88,67), 77.5
65            (65,65), 65      (54,65), 59.5    (67,65), 66      (65,65), 65      (88,65), 76.5
88            (65,88), 76.5    (54,88), 71      (67,88), 77.5    (65,88), 76.5    (88,88), 88

The following chart represents the above samples' means
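The 25 samples and their means can be regenerated, and tallied as the chart does, with a few lines of code. The sketch below is an illustration only (names such as mean_counts are assumptions); it also shows that the average of all sample means equals the population mean:

```python
from collections import Counter
from itertools import product

population = [65, 54, 67, 65, 88]  # weights in pounds

# All ordered samples of size 2 drawn with replacement: N^n = 5^2 = 25.
samples = list(product(population, repeat=2))
sample_means = [sum(sample) / len(sample) for sample in samples]

# Frequency of each distinct sample mean (the information a chart would display).
mean_counts = Counter(sample_means)

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

for value in sorted(mean_counts):
    print(value, mean_counts[value])
print(population_mean, mean_of_sample_means)
```

The two children who both weigh 65 pounds are treated as distinct members of the population, which is why the mean 65 appears four times.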
Sampling Distribution of the Mean:
Theorem:
The sampling distribution of x̄ in a normally distributed population with mean μ and standard deviation σ is also normally distributed with mean μ and standard deviation σ/√n, where n is the sample size, provided that sampling is performed with replacement. If sampling is performed without replacement then the sampling distribution is also normally distributed with mean μ and standard deviation (σ/√n) × √((N − n)/(N − 1)), where N is the size of the population.
The factor √((N − n)/(N − 1)) is called the correction factor. It is negligible if n ≤ 0.05N or N is very large (infinite or practically infinite).
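The two standard deviation formulas in the theorem differ only by the correction factor. As a minimal sketch, the hypothetical helper standard_error below applies the factor only when a population size is supplied; the numbers σ = 10, n = 25 and N = 100 or 100000 are made up for illustration and do not come from the handout:

```python
from math import sqrt

def standard_error(sigma, n, population_size=None):
    """Standard deviation of the sample mean.

    With replacement (or an effectively infinite population): sigma / sqrt(n).
    Without replacement from a population of size N: multiply by
    the correction factor sqrt((N - n) / (N - 1)).
    """
    se = sigma / sqrt(n)
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se

# Illustrative values only (assumed, not from the text).
with_replacement = standard_error(10, 25)
small_population = standard_error(10, 25, population_size=100)     # n/N = 0.25
large_population = standard_error(10, 25, population_size=100000)  # n/N well below 0.05

print(with_replacement, round(small_population, 4), round(large_population, 4))
```

With the large population the corrected value is almost identical to σ/√n, which is the sense in which the correction factor is negligible.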
The Central Limit Theorem (CLT): When the sample size is large (n ≥ 30), the above Theorem is also valid even if the population is not normally distributed. In fact, the sampling distribution of the mean is almost normal when n is large. The larger the sample size, the closer the sampling distribution of the mean is to being normally distributed.
Example: Suppose that the ages of Jordan University students follow a normal distribution with mean 20.5 years and standard deviation 1.4 years. If we repeatedly collect samples of size n = 49:
a) what is the sampling distribution of x̄?
Answer: x̄ ~ N(20.5, (1.4)²/49) = N(20.5, 0.04) = N(20.5, (0.2)²)
b) what is the probability that the mean age of a randomly selected sample of size 49 of Jordan University students is smaller than 21 years?
Answer: P(x̄ < 21) = P(Z < (21 − 20.5)/0.2) = P(Z < 2.5) = 0.9938
c) what is the probability that an individual student is younger than 21 years old?
Answer: X ~ N(20.5, (1.4)²), thus P(X < 21) = P(Z < (21 − 20.5)/1.4) = P(Z < 0.36) = 0.6406
d) what is the distribution of x̄ if the ages of Jordan University students do not follow a normal distribution?
Answer: The distribution of x̄ will be approximately normal with mean 20.5 and standard deviation 0.2, since the sample size is greater than 30.
Reading Assignment:
Chapter 5 (5.1, 5.2, 5.3) in W.W. Daniel.
Distribution of the Difference Between Two Sample Means:
Suppose that we want to know whether or not the mean serum cholesterol level is higher in a population of sedentary office workers than in a population of laborers. If we know that those means are different then we may wish to know by how much they differ. One way is to take a random sample from each population then look at the sampling distribution of x̄₁ − x̄₂ to answer probability questions and draw statistical inference.
Sampling Distribution of x̄₁ − x̄₂:
Theorem:
If we draw two independent random samples of sizes n₁ and n₂ from two distinct normally distributed populations, having means μ₁, μ₂ and standard deviations σ₁ and σ₂, respectively, then x̄₁ − x̄₂ is normally distributed with mean μ₁ − μ₂ and standard deviation √(σ₁²/n₁ + σ₂²/n₂).
Note: The above theorem is also valid if the populations are not (both) normally distributed provided that both n₁ and n₂ are greater than or equal to 30.
Example: One group on a diet lost an average of 7.2 kg with standard deviation 3.7 kg; another group on sportive exercises lost an average of 4.0 kg with a standard deviation of 3.9 kg. Suppose we collect samples of sizes n₁ = 42 from the diet group and n₂ = 47 from the exercises group:
(a) what is the sampling distribution of x̄₁ − x̄₂?
Answer: the sampling distribution of x̄₁ − x̄₂ is approximately normal (since n₁ ≥ 30 and n₂ ≥ 30) with mean 7.2 − 4.0 = 3.2 kg and standard deviation √((3.7)²/42 + (3.9)²/47) = 0.806 kg
(b) what is the probability that the difference between the mean weight losses of the two groups is larger than 4.0 kg?
Answer: P(x̄₁ − x̄₂ > 4.0) = P(Z > (4.0 − 3.2)/0.806) = P(Z > 0.993) ≈ P(Z > 0.99) = 1 − P(Z ≤ 0.99) = 1 − 0.8389 = 0.1611
(c) what is the probability that the mean weight loss of the exercises group is larger than 4.0 kg?
Answer: x̄₂ ~ N(4.0, (3.9)²/47) = N(4.0, (0.569)²), thus P(x̄₂ > 4.0) = P(Z > (4.0 − 4.0)/0.569) = P(Z > 0) = 0.5
(d) Find the IQR (interquartile range) of x̄₁ − x̄₂.
Solution: IQR = Q3 − Q1 = P75 − P25.
P(x̄₁ − x̄₂ < Q3) = 0.75 → P(Z < (Q3 − 3.2)/0.806) = 0.75 → (Q3 − 3.2)/0.806 = 0.675 → Q3 = 0.675 × 0.806 + 3.2 = 3.744
P(x̄₁ − x̄₂ < Q1) = 0.25 → P(Z < (Q1 − 3.2)/0.806) = 0.25 → (Q1 − 3.2)/0.806 = −0.675 → Q1 = 3.2 − 0.675 × 0.806 = 2.656
Thus, IQR = 3.744 − 2.656 = 1.088
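Parts (a), (b) and (d) of the weight-loss example can be reproduced with the standard library. This is a hedged sketch with illustrative variable names; the results differ slightly from the handout because no Z values are rounded to two decimals:

```python
from math import sqrt
from statistics import NormalDist

# Means, standard deviations and sample sizes of the two groups.
mean_diet, sd_diet, n_diet = 7.2, 3.7, 42
mean_exercise, sd_exercise, n_exercise = 4.0, 3.9, 47

# (a) Mean and standard deviation of the difference of the sample means.
mean_diff = mean_diet - mean_exercise
sd_diff = sqrt(sd_diet**2 / n_diet + sd_exercise**2 / n_exercise)
difference = NormalDist(mean_diff, sd_diff)

# (b) P(difference > 4.0) as the complement of the cumulative probability.
p_more_than_4 = 1 - difference.cdf(4.0)

# (d) Interquartile range: 75th percentile minus 25th percentile.
iqr = difference.inv_cdf(0.75) - difference.inv_cdf(0.25)

print(round(sd_diff, 3), round(p_more_than_4, 4), round(iqr, 3))
```

Note that the variances, not the standard deviations, are added before taking the square root.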
Distribution of the Sample Proportion:
In this section we study the distribution of the sample proportion. Such a distribution helps us answer probability questions about proportions when it is tedious, difficult or practically impossible to use binomial tables. For example, suppose that in a certain population 8 percent are color blind (p = 0.08). If we randomly select 1500 individuals from this population, what is the probability that the proportion of color blinds in that sample is at least 0.10? To answer such a question using binomial tables we need to find the probability that the variable x is greater than or equal to 0.10 × 1500 = 150 given that x is binomially distributed with p = 0.08 and n = 1500. How would we answer that question if we don't have binomial tables for n = 1500 (or even for any n > 25)?
Distribution of Sample Proportion; An Empirical Rule:
When the sample size is "large" (we will see shortly what large means), the distribution of sample proportions is approximately normally distributed with mean equal to the true population proportion p and standard deviation equal to √(p(1 − p)/n). The sample is considered "large enough" if np > 5 and n(1 − p) > 5.
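The empirical rule can be applied to the color blindness question posed above (p = 0.08, n = 1500). The sketch below is illustrative, with names chosen here rather than taken from the handout: it checks the "large enough" condition, then uses the normal approximation for the sample proportion:

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.08, 1500

# Rule of thumb from the text: np > 5 and n(1 - p) > 5.
large_enough = n * p > 5 and n * (1 - p) > 5

# Approximate distribution of the sample proportion: N(p, p(1 - p)/n).
sd_proportion = sqrt(p * (1 - p) / n)
sample_proportion = NormalDist(p, sd_proportion)

# P(sample proportion >= 0.10), using the normal approximation.
p_at_least_10_percent = 1 - sample_proportion.cdf(0.10)

print(large_enough, round(sd_proportion, 4), round(p_at_least_10_percent, 4))
```

The probability is small because 0.10 lies almost three standard deviations above the population proportion.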
Example: Suppose that in a certain population 8 percent are color blind (p = 0.08), and that we randomly select 1500 individuals from this population. Let p̂ denote the sample proportion of color blinds. Find:
a) the probability that the proportion of color blinds in that sample is at least 0.10.
b) the 95th percentile of p̂.
Solution:
a) p = 0.08 and n = 1500. Since np = 1500 × 0.08 = 120 > 5 and n(1 − p) = 1500 × 0.92 = 1380 > 5, the proportion of color blinds is approximately normally distributed with mean p = 0.08 and standard deviation √(p(1 − p)/n) = √(0.08 × 0.92 / 1500) = 0.007
Thus P(p̂ ≥ 0.10) = P(Z ≥ (0.10 − 0.08)/0.007) = P(Z ≥ 2.86) = 1 − 0.9979 = 0.0021
b) P(p̂ < P95) = 0.95 → P(Z < (P95 − 0.08)/0.007) = 0.95 → (P95 − 0.08)/0.007 = 1.65 → P95 = 1.65 × 0.007 + 0.08 = 0.09155
Distribution of the difference between two sample proportions
HANDOUT IS NOT AVAILABLE. READ DIRECTLY FROM YOUR MAIN REFERENCE.
Reading Assignment:
Chapter 5 (5.1, 5.2, 5.3, 5.4, 5.5, 5.6) in W.W. Daniel.