chapter 1 introduction to biostatisticseacademic.ju.edu.jo/oalkam/material/biostat lectures...

Biostatistics handouts Part 1 (Chapter 1 - Chapter 5), Dr. Osama Alkam


Page 1: Chapter 1 Introduction to Biostatisticseacademic.ju.edu.jo/oalkam/Material/Biostat lectures part1_ch1-ch5 updated.pdf · 1 University of Jordan Fall 2009/2010 Department of Mathematics


University of Jordan Fall 2009/2010 Department of Mathematics

Chapter 1
Introduction to Biostatistics

Introduction; Some Basic Concepts
Statistics is the science of making decisions in the face of uncertainty. It comprises the following:

1) Descriptive statistics: concerned with the collection, organization, summarization and analysis of a body of data.

2) Inferential statistics: concerned with drawing inferences about a large body of data (called a population) by examining a part of that body (called a sample).

Statistical activities are motivated by the need to answer a question about a certain population. The usual setup starts with selecting a sample that resembles the population, in the sense that it has the characteristics and properties of the population (such a sample is said to be an unbiased sample), then collecting information from the sample and using it to answer the question about the population. If the question (hence the data) is related to a medical, biological, or nutritional problem, then we use the term biostatistics to distinguish this particular kind of statistical tools.

Now we introduce some of the vocabulary and concepts that are widely used in any statistics course.

Random Variable: the information or data collected from the subjects cannot be exactly predicted in advance; such data are referred to as random variables.

Random variables are of two kinds:
Qualitative Variables: They divide the subjects into groups or categories. The value of a qualitative variable cannot be measured or counted, for example the birthplace, gender, or marital status of an individual.
Qualitative random variables are either “nominal” or “ordinal”.
The possible values of a nominal random variable do not have a natural order. For example: “gender”, “marital status”, “nationality”. The possible values of an ordinal random variable can be ordered naturally. For example: “rank”, “letter grade”, “degree of improvement such as low, weak, good, very good and excellent”.
Quantitative Variables: The value of a quantitative variable can be measured or counted. We distinguish between two kinds of quantitative variables:

1. Discrete Variables: if the value of the variable can be counted, then it is called a discrete random variable. Examples of a discrete random variable are the number of admissions to a general hospital and the number of family members of an individual. Discrete random variables are characterized by gaps or interruptions in the values they assume.

2. Continuous Variables: if the value of the variable can be measured, then it is called a continuous random variable. An example of a continuous random variable is the period of treatment of a tuberculosis patient. A continuous random variable can assume any value within a specified relevant interval of values.

Sources of Data: The information about the subjects is usually collected from one or more of the following sources

1. Routinely kept records or archives: for example the medical history of a patient.

2. Surveys: if the data needed are not available in the kept records, then it is logical to think of a survey. For example, information about whether the patient received good treatment is not usually kept in the hospital records but can be surveyed.

3. Experiments: Frequently the data needed to answer a question are available only as the result of an experiment. For example, a pediatrician or a dentist may try different strategies of motivation with different children to find the best strategy for maximizing children's compliance.

4. External Sources: the data needed to answer a question may already exist in the form of a published report. International organizations like WHO, or health ministries, usually publish reports that make a good source of data.


The Simple Random Sample (SRS)
If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, then the sample is called a simple random sample.

One method of selecting a simple random sample uses random number generators or random number tables. The procedure is the following:

1. Get a list of all subjects in the population.
2. Obtain random numbers from a random number generator or a table.
3. Select the subjects whose numbers in the list match the obtained random numbers.

Note: The above method is ideal, but it is practically inapplicable to some data; in particular, it is difficult to implement when we need to draw a sample from a very large population.
Reading Assignment: Chapter 1 (1.1, 1.2, 1.4) in W.W. Daniel.
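The simple random sampling procedure described above can be sketched in Python. This is only an illustration: the function name `simple_random_sample` and the population of 100 subject IDs are assumptions, not part of the handout. The standard library call `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(subjects, n, seed=None):
    """Draw a simple random sample of size n without replacement."""
    rng = random.Random(seed)          # seeded generator for reproducibility
    # Every subset of size n has the same chance of being selected
    return rng.sample(subjects, n)

# Hypothetical population: subject IDs 1..100 (step 1: the list of subjects)
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

For a very large population the list of all subjects may not be available, which is the practical difficulty mentioned in the note.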

Chapter 2
Descriptive Statistics

Introduction
In this chapter we learn several techniques for organizing and presenting data so that we may easily determine what information they contain.

The Ordered Array
An ordered array is a listing of the values of a collection of data in order of magnitude from the smallest value to the largest value. An ordered array enables one to determine quickly the value of the smallest measurement and the value of the largest measurement.

Example: The following data are the ages, rounded to the nearest year, of 30 people who were discharged from a general hospital last Friday

51 70 79 75 55 25 38 74 54 72
37 15 56 17 77 43 16 15 72 92
25 30 24 46 47 46 38 81 49 45

In order to put the above data in an ordered array we just list the measurements from the smallest to the largest

15 15 16 17 24 25 25 30 37 38
38 43 45 46 46 47 49 51 54 55
56 70 72 72 74 75 77 79 81 92
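As a quick check, the ordered array above can be reproduced with Python's built-in `sorted`. The variable names are chosen here for illustration only.

```python
ages = [51, 70, 79, 75, 55, 25, 38, 74, 54, 72,
        37, 15, 56, 17, 77, 43, 16, 15, 72, 92,
        25, 30, 24, 46, 47, 46, 38, 81, 49, 45]

ordered = sorted(ages)                      # smallest to largest
smallest, largest = ordered[0], ordered[-1]
print(ordered)
print(smallest, largest)
```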

Frequency tables without classes:
Such tables can be used to organize all types of data.
Example: the following table shows the “letter grades” of 150 students.

X (letter grade)    Frequency (number of students)
F                   12
D                   15
D+                  20
C                   35
C+                  30
B                   18
B+                  12
A                   8

Grouped Data – Frequency tables with classes
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each observation belongs to exactly one interval. These intervals are called class intervals. Class intervals need not have the same width. All class intervals are listed in a table which is referred to as a frequency table. A typical frequency table consists of the following

Class intervals: a column in which all class intervals are listed.

Midpoints: a column in which the midpoints of the class intervals are computed. The midpoint of a class interval equals (left side + right side)/2.

Frequency: the frequency of a class interval is the number of observations that belong to the class interval.

Cumulative Frequency: the cumulative frequency of a class interval is the number of observations that are less than or equal to the right-hand side of that class interval.

Relative Frequency: the relative frequency of a class interval equals (the frequency of the class interval / total frequency).

Cumulative Relative Frequency: it equals (cumulative frequency / total frequency).

A natural question is how many class intervals should be included in a frequency table.

A rule of thumb states that the number of class intervals k should be between 5 and 15. We may use the following rule given by Sturges as a guide for computing k:

The number of class intervals is the closest integer k to $1 + 3.322 \log_{10}(n)$, where n is the total number of observations. The number of class intervals specified by the rule can be increased or decreased for more convenience or better presentation. After having decided on the number of class intervals, we decide on the class widths. If we decide to give all classes the same width, then we compute the class width using the formula (largest value – smallest value) / k, rounded up to the nearest number with the same accuracy unit.

Example: Put the data mentioned in the previous example in a frequency table.
We start with computing the number of classes.

$n = 30$, $1 + 3.322 \log_{10} 30 \approx 5.907$. Thus we should have 6 class intervals.

To obtain the class width we compute $\frac{92 - 15}{6} = 12.833$. Since the observations are integers, we round 12.833 up to 13. Thus the class width is 13.
Now we are ready to construct the first class interval, which has the least observation, namely 15, as its left-hand side and (left-hand side + the class width – one accuracy unit) as its right-hand side. The second class's left-hand side is the first class's right-hand side + one accuracy unit. The right-hand side of each class is the left-hand side of the class + the class width – one accuracy unit. We construct the other class intervals similarly.

Class intervals   Midpoint   Frequency   Cumulative   Relative    Cumulative Relative
                                         Frequency    Frequency   Frequency
15 - 27           21         7           7            0.233       0.233
28 - 40           34         4           11           0.133       0.367
41 - 53           47         7           18           0.233       0.600
54 - 66           60         3           21           0.100       0.700
67 - 79           73         7           28           0.233       0.933
80 - 92           86         2           30           0.067       1.000
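The construction of this frequency table can be sketched in Python as follows. This is a minimal sketch under the assumptions of the example (integer data, so one accuracy unit is 1, and equal class widths); the variable names are illustrative.

```python
import math

ages = [51, 70, 79, 75, 55, 25, 38, 74, 54, 72,
        37, 15, 56, 17, 77, 43, 16, 15, 72, 92,
        25, 30, 24, 46, 47, 46, 38, 81, 49, 45]

n = len(ages)
k = round(1 + 3.322 * math.log10(n))             # Sturges' rule: closest integer
width = math.ceil((max(ages) - min(ages)) / k)   # rounded up to a whole unit

rows = []
cumulative = 0
left = min(ages)
for _ in range(k):
    right = left + width - 1                     # minus one accuracy unit
    freq = sum(left <= x <= right for x in ages)
    cumulative += freq
    rows.append((left, right, (left + right) / 2, freq, cumulative,
                 round(freq / n, 3), round(cumulative / n, 3)))
    left = right + 1                             # plus one accuracy unit

for row in rows:
    print(row)
```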

Example: Consider the following cumulative frequency distribution.

Class      Cumulative Frequency
10 - 15    6
16 - 21    13
22 - 27    38
28 - 33    42
34 - 39    50

a) What is the width (or length) of each class?
b) Find the relative frequency of the second class.
c) Find the proportion of observations that are greater than or equal to 16 and less than or equal to 33.
Solution:
a) The class width equals 16 – 10 = 6 (or you may say 15 – 10 + one accuracy unit = 5 + 1 = 6).
b) The frequency of the second class equals 13 – 6 = 7 and the total frequency is 50. Thus the relative frequency of the second class equals 7/50 = 0.14.
c) The observations that are greater than or equal to 16 and less than or equal to 33 are those in the second, third and fourth classes, and their frequencies are 13 – 6 = 7, 38 – 13 = 25 and 42 – 38 = 4, respectively. Thus their proportion is (7+25+4)/50 = 0.72.

The Histogram; The Frequency Polygon:
The histogram is a graphical representation of the frequency distribution (or the relative frequency distribution). It reveals the shape of the data, for example the presence or absence of symmetry. When we construct the histogram, the boundaries of the class intervals are represented on the horizontal axis, while the vertical axis has as its scale the frequency (or the relative frequency). Above each class interval on the horizontal axis, a rectangle is constructed whose height equals the frequency (or relative frequency) of the relevant class interval. All rectangles must be contiguous.
The frequency (or relative frequency) polygon is another graphical representation of the frequency (or relative frequency) distribution. To draw a frequency polygon, we place a dot above the midpoint of each class interval represented on the horizontal axis, in addition to two extra dots on the horizontal axis at the midpoints of two additional class intervals, one located to the left of the first class and the other to the right of the last class. The height of each dot equals the frequency of the relevant class interval, and the heights of the extra dots are zero. Connecting the dots with line segments produces a frequency polygon.
Example: Construct the frequency histogram and the frequency polygon of the following part of a frequency table.

Class Intervals   Frequency   Midpoint   Actual limits
15 - 27           7           21         14.5 – 27.5
28 - 40           4           34         27.5 – 40.5
41 - 53           7           47         40.5 – 53.5
54 - 66           3           60         53.5 – 66.5
67 - 79           7           73         66.5 – 79.5
80 - 92           2           86         79.5 – 92.5

The following is the histogram

[Figure: frequency histogram. Horizontal axis: Actual Limits (14.5, 27.5, 40.5, 53.5, 66.5, 79.5, 92.5); vertical axis: Frequency (0 to 8).]


The following is the frequency polygon

[Figure: frequency polygon. Horizontal axis: Midpoint (8, 21, 34, 47, 60, 73, 86, 99); vertical axis: Frequency (0 to 8).]

Stem-and-Leaf Display (optional):
The stem-and-leaf display is similar to the histogram and has the same purpose. Its main advantage over the histogram is that it preserves the information contained in the individual data items. It is effective with relatively small data sets.
To construct a stem-and-leaf plot we:

1. partition each datum into two parts: the leaf, which consists of the units digit, and the stem, which consists of the remaining digits of the datum
2. on the left-hand side of the page write down the stems
3. draw a line to the right of these stems
4. on the other side of the line, write down the leaves of all data with the same stem on the left.

The stems of the data should form an ordered column with the smallest stem at the top and the largest at the bottom. All the stems within the range are included in the stem column even if no datum in our data has that stem. Decimals, when present in the original data, are omitted in the stem-and-leaf display. If all data items are fractions less than one, then we can magnify the data by multiplying each data item by a number (10, 100, 1000, etc.) before we display the data in a stem-and-leaf plot.
Example: Display the following data in a stem-and-leaf plot
2, 3, 6, 7, 12, 15, 15, 15, 17, 20, 20, 21, 29, 29, 34, 51, 56, 60, 65, 69, 80, 89


Solution:
Stems | Leaves
0 | 2 3 6 7
1 | 2 5 5 5 7
2 | 0 0 1 9 9
3 | 4
4 |
5 | 1 6
6 | 0 5 9
7 |
8 | 0 9
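The display above can be reproduced with a short Python sketch. It assumes non-negative integers with the units digit as the leaf; the function name `stem_and_leaf` is chosen here for illustration.

```python
def stem_and_leaf(data):
    """Map each stem to its sorted leaves, keeping stems that have no data."""
    first, last = min(data) // 10, max(data) // 10
    stems = {s: [] for s in range(first, last + 1)}   # all stems in the range
    for x in sorted(data):
        stems[x // 10].append(x % 10)                 # units digit is the leaf
    return stems

data = [2, 3, 6, 7, 12, 15, 15, 15, 17, 20, 20, 21, 29, 29,
        34, 51, 56, 60, 65, 69, 80, 89]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```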

Reading Assignment: Chapter 2 (2.1, 2.2, 2.3) in W.W. Daniel

Descriptive Statistics – Measures of Central Tendency
A descriptive measure is a single number that is used to summarize the data. Descriptive measures may be computed from the data of a sample or the data of a population.
Definition:

1. A descriptive measure computed from the data of a sample is called a statistic.
2. A descriptive measure computed from the data of a population is called a parameter.

Arithmetic Mean: The arithmetic mean of a sample is denoted by $\bar{x}$ and that of a population is denoted by $\mu$. From now on we will just say the mean for the arithmetic mean.

1) For raw (unorganized) data:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$ where $x_1, x_2, \ldots, x_n$ are the observations in the sample and n is their number (the sample size).

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},$$ where $x_1, x_2, \ldots, x_N$ are the observations in the population and N is their number (the population size).


2) For frequency tables:

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i},$$ where $x_1, x_2, \ldots, x_n$ are the observations (or midpoints) and $f_1, f_2, \ldots, f_n$ are their corresponding frequencies.
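The frequency-table formula can be sketched in Python as below. The function name `grouped_mean` and the small table of midpoints and frequencies are illustrative assumptions.

```python
def grouped_mean(values, freqs):
    """Mean of a frequency table: sum(f_i * x_i) / sum(f_i)."""
    total = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / total

# Illustrative table: midpoints and their frequencies
midpoints = [1, 4, 7, 10, 13]
freqs = [3, 2, 1, 2, 2]
mean = grouped_mean(midpoints, freqs)   # (3 + 8 + 7 + 20 + 26) / 10
print(mean)
```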

Properties of the Mean:
1. Uniqueness: for a given set of data there is one and only one mean.
2. Simplicity: it is easy to compute the mean of any sample.
3. The value of each data item has an influence on the mean. Thus the mean is affected by extreme values, which makes the mean, in some cases, a poor representative of the tendency of the values of the majority of the data.

Example: The mean of the data 50, 49, 53, 48, 54, 420 equals (50+49+53+48+54+420)/6 = 112.333, a number which does not represent the tendency of the data. However, if we trim out the observation 420, then the mean becomes (50+49+53+48+54)/5 = 50.8. Notice the influence of the observation 420 on the value of the mean.
Example (part 1 is optional): Compute the mean for the following two data sets.

1) Stem | Leaf
0 | 1, 2, 5
1 | 0, 1
2 | 1, 1, 1, 2
3 | 0, 1, 2, 2

2)

Class      Frequency
0 – 2      3
3 – 5      2
6 – 8      1
9 – 11     2
12 – 14    2

Solution:
1) mean = sum of observations / number of observations

= (1+2+5+10+11+21+21+21+22+30+31+32+32)/13 = 239/13 = 18.3846


2)

Midpoint ($x_i$)   Frequency ($f_i$)   $f_i x_i$
1                  3                   3
4                  2                   8
7                  1                   7
10                 2                   20
13                 2                   26
Total              10                  64

mean = 64/10 = 6.4
The Median:
The median of a finite set of observations is the value that divides the set into two equal parts, such that the number of values equal to or greater than the median equals the number of values equal to or less than the median. The median is the middle value (or the average of the two middle values) when all values have been arranged in order of magnitude.
Example: Find the median of the following observations

45, 78, 23, 54, 61, 12, 90, 46, 68, 45, 11
Solution:
The first step is to arrange the data in order of magnitude

11, 12, 23, 45, 45, 46, 54, 61, 68, 78, 90
Notice that 46 is located exactly in the middle of all ordered values, thus the median is 46.
Example: Find the median of 78, 94, 25, 23, 56, 66, 38, 78, 23, 80
Solution:
We order the data as a first step

23, 23, 25, 38, 56, 66, 78, 78, 80, 94
Notice that no single datum is located in the middle of the ordered data because the number of data items is even. However, the two values 56 and 66 are located in the middle, thus the median equals (56+66)/2 = 61.
Properties of the Median:

1. Uniqueness
2. Simplicity
3. Unlike the mean, it is not drastically affected by extreme values.
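The definition of the median, and its third property, can be illustrated with a small Python sketch. The function name `median` and the choice of data (the values 50, 49, 53, 48, 54 with and without the extreme value 420) are for illustration only.

```python
def median(data):
    """Middle value of the ordered data, or the average of the two middle values."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                        # odd count: single middle value
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2    # even count: average of the two

with_extreme = [50, 49, 53, 48, 54, 420]
without_extreme = [50, 49, 53, 48, 54]
print(median(with_extreme), median(without_extreme))
```

The extreme value 420 moves the median only from 50 to 51.5, whereas it moves the mean from 50.8 to about 112.3.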


The Mode:
A mode of a set of observations is an observation that has the largest frequency. If all observations have the same frequency, the data set has no mode. A data set may have more than one mode. The mode may be used to describe qualitative data. A mode of grouped data is estimated by the midpoint of a class with the highest frequency.

Example: The following list gives the nationalities of a sample of 10 patients who had psychotherapy last year in a private clinic
British, French, American, American, Dutch, British, Spanish, South African, French, American
To find the mode of the above nationalities we make the following table

Nationality      Frequency
American         3
British          2
Dutch            1
French           2
Spanish          1
South African    1

Notice that the most frequently occurring nationality is American, thus the mode is American.

Example: Find the mode of the data 28, 28, 28, 28, 28, 29, 30, 31, 32, 32, 32, 32, 32, 36, 39, 42, 44, 44, 45

Solution:
There are two modes for the above data, namely 28 and 32, because they have the same highest frequency.
Reading Assignment: Chapter 2 (2.4) in W.W. Daniel.
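The two mode examples above can be checked with a short Python sketch that uses `collections.Counter` from the standard library. The function name `modes` is an assumption made here; it returns every value that attains the highest frequency.

```python
from collections import Counter

def modes(data):
    """Return all observations that have the largest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, c in counts.items() if c == top)

nationalities = ["British", "French", "American", "American", "Dutch",
                 "British", "Spanish", "South African", "French", "American"]
numbers = [28, 28, 28, 28, 28, 29, 30, 31, 32, 32, 32, 32, 32,
           36, 39, 42, 44, 44, 45]
print(modes(nationalities))
print(modes(numbers))
```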


Descriptive Statistics – Measures of Dispersion

The dispersion of a set of data (or observations) refers to the variety that they exhibit. A measure of dispersion provides information about the amount of variability present in a set of data. When the dispersion is "small", the values of the data items are "close" together. The following graph represents two frequency polygons for population A and population B with the same mean

[Figure: frequency polygons for population A and population B]

Notice that population B exhibits more dispersion because the values of its observations are more spread out. Dispersion can be measured using one of the following measures:
The Range:
The range of a set of values is given by $R = \text{largest value} - \text{smallest value}$.

The range is simple to compute, but it is not usually used as a reliable measure of dispersion because it is drastically affected by extreme values.
The Variance:

1) For raw data: the variance of the sample $x_1, x_2, \ldots, x_n$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$

where $\bar{x}$ is the mean of the sample.

One can easily show that the above formula for the variance also has the following form

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right),$$

which is easier for computations with calculators.

The population variance is given by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{1}{N}\left(\sum_{i=1}^{N} x_i^2 - N\mu^2\right),$$

where N is the population size and $\mu$ is the mean of the population.


2) For frequency tables:

$$s^2 = \frac{1}{\left(\sum_{i=1}^{n} f_i\right) - 1}\left(\sum_{i=1}^{n} f_i x_i^2 - \left(\sum_{i=1}^{n} f_i\right)\bar{x}^2\right),$$

where $x_1, x_2, \ldots, x_n$ are the observations (or the midpoints) and $f_1, f_2, \ldots, f_n$ are their corresponding frequencies.
The Standard Deviation:
The variance represents squared units and, therefore, is not an appropriate measure of dispersion when we want to express it in terms of the original units. To obtain a measure of dispersion in the original units, we take the square root of the variance, which we refer to as the standard deviation.
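The raw-data variance formulas and the standard deviation can be sketched in Python. The two function names and the illustrative sample are assumptions; the sketch simply checks that the definitional form and the calculator-friendly form agree.

```python
import math

def sample_variance(xs):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

sample = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]   # illustrative sample
s2 = sample_variance(sample)
s = math.sqrt(s2)                                    # standard deviation
print(round(s2, 4), round(s, 4))
```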

The standard deviations of a sample and of a population are denoted by s and $\sigma$, respectively: $s = \sqrt{s^2}$ and $\sigma = \sqrt{\sigma^2}$.
Example: Find the mean and standard deviation of each of the following samples

i) 42, 28, 28, 61, 31, 23, 50, 34, 32, 37
ii)

Class       Frequency
–2 – 0      1
1 – 3       2
4 – 6       3
7 – 9       2
10 – 12     1

Solution: i)

$x$      $x^2$
42       1764
28       784
28       784
61       3721
31       961
23       529
50       2500
34       1156
32       1024
37       1369
366      14592   ← Total

Thus, $\bar{x} = \frac{366}{10} = 36.6$ and $s = \sqrt{\frac{1}{9}\left(14592 - 10 \times 36.6^2\right)} = 11.5297$.
ii)

Midpoint ($x$)   Frequency ($f$)   $f x$   $x^2$        $f x^2$
–1               1                 –1      1            1
2                2                 4       4            8
5                3                 15      25           75
8                2                 16      64           128
11               1                 11      121          121
Total            9                 45      not needed   333

$\bar{x} = \frac{45}{9} = 5$ and $s = \sqrt{\frac{1}{8}\left(333 - 9 \times 5^2\right)} = 3.67423$.
The Coefficient of Variation
The coefficient of variation, denoted by C.V., is a unit-free measure that is used to compare the amount of dispersion between two different sets of data with (possibly) different means and different units. The coefficient of variation is given by $C.V. = \frac{s}{\bar{x}} \times 100$.

Example: The following table summarizes the data collected about the weights of two samples of human males

                      Sample 1      Sample 2
Age                   25 years      11 years
Mean Weight           145 pounds    80 pounds
Standard Deviation    10 pounds     10 pounds

Which of the samples is more dispersed?
Solution:
To compare dispersion we compute the C.V. for each sample.

C.V. for sample 1 = $\frac{10}{145} \times 100 = 6.9$

C.V. for sample 2 = $\frac{10}{80} \times 100 = 12.5$

Since the C.V. of sample 2 is greater than the C.V. of sample 1, sample 2 is more dispersed.
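The comparison above can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is chosen here for illustration.

```python
def coefficient_of_variation(s, mean):
    """C.V. = (standard deviation / mean) * 100, a unit-free percentage."""
    return s / mean * 100

cv1 = coefficient_of_variation(10, 145)   # sample 1, about 6.9
cv2 = coefficient_of_variation(10, 80)    # sample 2
print(round(cv1, 1), round(cv2, 1))
```

Both samples have the same standard deviation, so the difference in C.V. comes entirely from the difference in mean weight.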


Percentiles and Quartiles
Percentiles and quartiles are used to indicate certain positions (or locations) of the observations (or data). The pth percentile is denoted by $P_p$; it is the number P such that (almost) p% of the observations are less than or equal to P. The 25th percentile is also denoted by $Q_1$ and is also called the 1st quartile. The second quartile $Q_2$ is the 50th percentile (the median), while the 3rd quartile $Q_3$ is the 75th percentile.

Computing percentiles:

1) For ungrouped data, the pth percentile is taken to be the $\frac{p}{100}(n+1)$th ordered observation. Thus

$Q_1$ is the $0.25\,(n+1)$th ordered observation

$Q_2$ is the $0.5\,(n+1)$th ordered observation

$Q_3$ is the $0.75\,(n+1)$th ordered observation

The pth percentile for ungrouped data is computed using the formula
$$P_p = x_{(k)} + (r - k)\left(x_{(k+1)} - x_{(k)}\right),$$
where $r = \frac{p}{100}(n+1)$, $k$ is the floor of $r$, $x_{(k)}$ is the kth ordered observation, and n is the number of observations (or total frequency). Before you apply the formula, make sure that the observations are written in ascending order.

2) For grouped data, think of the pth percentile as the observation that has cumulative frequency $\frac{p}{100} \times n$, where n is the total frequency. Find the first class that has cumulative frequency greater than or equal to $\frac{p}{100} \times n$, say $a - b$. Use the values and cumulative frequencies of the actual limits of this class to approximate the required percentile linearly as shown in the example.
Interquartile Range:

The interquartile range is denoted by IQR. It is given by $IQR = Q_3 - Q_1$.

Example: Find $Q_1$, the median, $Q_3$, $P_{60}$, and IQR for the following observations

23, 12, 54, 43, 51, 17, 32, 19, 14, 22, 25, 28, 33, 42, 26, 38, 50
Solution:
We start with putting the above data in ascending order:
12, 14, 17, 19, 22, 23, 25, 26, 28, 32, 33, 38, 42, 43, 50, 51, 54

The number of observations is $n = 17$.

The first quartile $Q_1$ is the $0.25\,(17+1)$th ordered observation, i.e. $Q_1$ is the 4.5th observation.

Now, the 4th observation is 19 and the 5th observation is 22, hence the 4.5th observation is $x_{(4)} + 0.5\left(x_{(5)} - x_{(4)}\right) = 19 + 0.5 \times (22 - 19) = 20.5$.

The median is the $0.5\,(17+1)$th ordered observation, i.e. it is the 9th observation, namely 28.

$Q_3$ is the $0.75\,(17+1)$th observation, i.e., it is the 13.5th observation, namely $42 + 0.5 \times (43 - 42) = 42.5$.

$P_{60}$ is the $0.6\,(17+1)$th observation, i.e., it is the 10.8th observation, namely $32 + 0.8 \times (33 - 32) = 32.8$.

$IQR = Q_3 - Q_1$ = 42.5 – 20.5 = 22.
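The ungrouped percentile rule used in this example can be sketched in Python. The function name `percentile` is an assumption, and so is the handling of positions that fall before the first or after the last observation (the sketch simply returns the nearest end value, a case the handout does not discuss).

```python
import math

def percentile(data, p):
    """p-th percentile as the (p/100)(n+1)-th ordered observation."""
    xs = sorted(data)                 # ascending order is required
    n = len(xs)
    r = p * (n + 1) / 100             # 1-based position of the percentile
    k = math.floor(r)
    if k < 1:                         # assumption: clamp to the smallest value
        return xs[0]
    if k >= n:                        # assumption: clamp to the largest value
        return xs[-1]
    # Linear interpolation between the k-th and (k+1)-th ordered observations
    return xs[k - 1] + (r - k) * (xs[k] - xs[k - 1])

data = [23, 12, 54, 43, 51, 17, 32, 19, 14, 22, 25, 28, 33, 42, 26, 38, 50]
q1, q2, q3 = percentile(data, 25), percentile(data, 50), percentile(data, 75)
p60 = percentile(data, 60)
iqr = q3 - q1
print(q1, q2, q3, round(p60, 1), iqr)
```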

Example: Find the median and the 80th percentile of the following data.

x        Frequency
3        2
5        4
9        3
12       5
17       4
Total    18

Solution: The total frequency is $n = 18$.

To find the median: $\frac{50}{100} \times 19 = 9.5$. Thus, the median is the 9.5th ordered observation, which is $x_{(9)} + 0.5\left(x_{(10)} - x_{(9)}\right) = 9 + 0.5 \times (12 - 9) = 10.5$.
To find the 80th percentile: $\frac{80}{100} \times 19 = 15.2$. Thus, the 80th percentile is the 15.2nd ordered observation, which is $x_{(15)} + 0.2\left(x_{(16)} - x_{(15)}\right) = 17 + 0.2 \times (17 - 17) = 17$.


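Example: Find the median for the following grouped data.

| Class | Frequency |
|---|---|
| 2 – 6 | 3 |
| 7 – 11 | 5 |
| 12 – 16 | 7 |
| 17 – 21 | 2 |
| Total | 17 |

Solution: $n = 17$ and $\frac{50}{100} \times 17 = 8.5$. The first class that has cumulative frequency $\geq 8.5$ is 12 – 16, and the actual limits of this class are 11.5 and 16.5, respectively. The cumulative frequency is 8 at 11.5 and 15 at 16.5, so by linear interpolation

$$\text{median} = 11.5 + 0.5 \times \frac{5}{7} = 11.8571.$$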
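The linear interpolation used for grouped data can be sketched in Python as follows. The function name `grouped_percentile` and the `(lower, upper, frequency)` tuple layout are illustrative choices, and the sketch assumes integer class limits, so that the actual limits lie 0.5 below and above the stated ones.

```python
def grouped_percentile(classes, p):
    """Approximate the p-th percentile of grouped data by linear interpolation.

    classes: list of (lower_limit, upper_limit, frequency), integer limits assumed.
    """
    n = sum(f for _, _, f in classes)
    target = p / 100 * n          # cumulative frequency of the percentile
    cum = 0                       # cumulative frequency before the current class
    for lo, hi, f in classes:
        if f > 0 and cum + f >= target:
            lower, upper = lo - 0.5, hi + 0.5   # actual class limits
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("p must be between 0 and 100")

classes = [(2, 6, 3), (7, 11, 5), (12, 16, 7), (17, 21, 2)]
median = grouped_percentile(classes, 50)
print(round(median, 4))   # 11.8571
```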

Example: Consider the following table of grouped data.

| Class | Frequency |
|---|---|
| 10 – 15 | 3 |
| 16 – 21 | 2 |
| 22 – 27 | 5 |
| 28 – 33 | 2 |
| 34 – 39 | 4 |
| Total | 16 |

Estimate the proportion of observations that are less than 24.


Solution:

The observation 24 belongs to the class 22 – 27.
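One way to carry out this estimate, assuming the same linear interpolation within a class that is used for grouped percentiles, is sketched below. The function name `proportion_below` is hypothetical and the printed value is the result of this sketch, not a figure taken from the handout.

```python
def proportion_below(classes, x):
    """Estimate the proportion of observations less than x (linear interpolation)."""
    n = sum(f for _, _, f in classes)
    count = 0.0
    for lo, hi, f in classes:
        lower, upper = lo - 0.5, hi + 0.5      # actual class limits
        if x >= upper:
            count += f                          # the whole class lies below x
        elif x > lower:
            count += f * (x - lower) / (upper - lower)   # part of the class
    return count / n

classes = [(10, 15, 3), (16, 21, 2), (22, 27, 5), (28, 33, 2), (34, 39, 4)]
p = proportion_below(classes, 24)
print(round(p, 4))
```

Under this assumption, the classes 10 – 15 and 16 – 21 contribute all 5 of their observations, and the class 22 – 27 contributes the fraction $(24 - 21.5)/6$ of its 5 observations.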

Box-and-Whisker Plots (Box plots) (Optional):

A box-and-whisker plot (or simply a box plot) is a useful visual device for demonstrating the information contained in a data set. It reveals information regarding the amount of spread, location of concentration, and symmetry of the data. The construction of such a plot makes use of the quartiles of a data set and may be accomplished by the following steps:

1. Represent the data on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile $Q_1$ and the right end of the box aligns with the third quartile $Q_3$.
3. Divide the box into two parts by a vertical line that aligns with the median $Q_2$.
4. Draw a horizontal line called a whisker from the left end of the box to a point that aligns with the smallest measurement in the data set.
5. Draw another horizontal line, or whisker, from the right end of the box to a point that aligns with the largest measurement in the data set.

Example: Construct a box-and-whisker plot for the data in the previous example.

Let p be the required proportion.

Solution:

Reading Assignment: Chapter 2 (2.5) in W.W. Daniel.

Chapter 3 Some Basic Probability Concepts

Elementary Properties of Probability:

A random experiment is an experiment whose outcome is a random variable, i.e., cannot be predicted with certainty.

The sample space of a random experiment is the collection of all possible values of its outcome.

An event is a subcollection of the sample space. The empty event is denoted by $\emptyset$; it is the event of having no outcomes.

The probability of an event E is denoted by P(E). It is a nonnegative number, less than or equal to 1, that measures the likelihood of the occurrence of the event E.

Example: The following is the sample space of the experiment of tossing a coin:

$$S = \{H, T\}$$

where H stands for head and T stands for tail.

The following is the collection of all possible events of the experiment of tossing a coin:

$$\emptyset,\ \{H\},\ \{T\},\ \{H, T\}$$

Example: Find the sample space and five different events of the experiment of tossing a coin 2 times.

Solution:

$$S = \{(H,H), (H,T), (T,H), (T,T)\}$$


The following are events of the experiment:

$E_1 = \{(H,T), (H,H)\}$, $E_2 = \{(H,T), (H,H), (T,T)\}$, $E_3 = \{(H,H)\}$,

$E_4 = S = \{(H,H), (H,T), (T,H), (T,T)\}$, $E_5 = \emptyset$

Definition: If every possible value of the outcome of a random experiment has the same chance to occur then the experiment is said to be equally likely. If an experiment is equally likely and has a finite sample space $S$, then the probability of an event $E$ of this experiment is given by

$$P(E) = \frac{|E|}{|S|},$$

where $|\cdot|$ stands for the number of elements and $S$ is the sample space of the experiment.

Example: Find the probability of having a total number of dots greater than 4 if a pair of fair dice are rolled.

Solution: The sample space of the experiment of rolling a pair of dice is

$$S = \{(1,1), (1,2), (1,3), \ldots, (1,6), (2,1), (2,2), \ldots, (2,6), \ldots, (6,6)\}$$

The mentioned event is the following:

$$E = \{(1,4), (1,5), (1,6), (2,3), (2,4), (2,5), (2,6), (3,2), \ldots, (3,6), (4,1), \ldots, (4,6), (5,1), \ldots, (5,6), (6,1), \ldots, (6,6)\}$$

Notice that $|S| = 36$ and $|E| = 30$. Thus $P(E) = \frac{30}{36} = \frac{5}{6}$.
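The counting in the dice example can be checked by brute-force enumeration. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling a pair of dice
S = list(product(range(1, 7), repeat=2))

# Event: the total number of dots is greater than 4
E = [(a, b) for a, b in S if a + b > 4]

prob = Fraction(len(E), len(S))   # P(E) = |E| / |S|
print(len(S), len(E), prob)       # 36 30 5/6
```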

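Conditional Probability: If $A$ and $B$ are events then by $P(B \mid A)$ we denote the probability of occurrence of the event $B$ given that the event $A$ has occurred. It is called a conditional probability and it is read "probability of B given A".

Elementary Properties:

1. For any event $E$, $0 \leq P(E) \leq 1$.
2. $P(\emptyset) = 0$ and $P(S) = 1$.
3. If $S = \{s_1, s_2, \ldots, s_n\}$ then $P(\{s_1\}) + P(\{s_2\}) + \cdots + P(\{s_n\}) = 1$.
4. If $A \subseteq B$ then $P(A) \leq P(B)$.
5. $P(\text{not } A) = P(\overline{A}) = 1 - P(A)$.
6. $P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \text{ and } B) = P(A) + P(B) - P(A \cap B)$.
7. $\overline{A \cup B} = \overline{A} \cap \overline{B}$ and $\overline{A \cap B} = \overline{A} \cup \overline{B}$.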


Example: The following table represents the frequency of cocaine use by gender among 111 adult cocaine users (in the US).

| Lifetime frequency of cocaine use | Male (M) | Female (F) | Total |
|---|---|---|---|
| 1 – 19 times (A) | 32 | 7 | 39 |
| 20 – 99 times (B) | 18 | 20 | 38 |
| 100 + times (C) | 25 | 9 | 34 |
| Total | 75 | 36 | 111 |

1. What is the probability that a randomly selected user will be a male?
2. If we pick a person at random from the 111 group and find out that he is a male (M), what is the probability that he used cocaine 100 + times (C)?
3. What is the probability that a randomly selected person from the 111 group is a male (M) and a person who used cocaine 100 + times (C)?
4. What is the probability that a randomly selected person from the 111 group is a female (F) or a person who used cocaine 20 – 99 times (B)?
5. What is the probability that a randomly selected person from the 111 group is not a person who used cocaine 100 + times (C)?

Solution:

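1. $P(M) = \frac{|M|}{111} = \frac{75}{111}$

2. We use the notation $P(C \mid M)$ to denote the probability of the event $C$ given that the event $M$ has occurred. It is read "probability of C given M". Knowing that the selected person is a male reduces our sample space to the group of males only, thus

$$P(C \mid M) = \frac{|\text{“}C\text{” for males}|}{|M|} = \frac{25}{75}$$

3. $P(M \text{ and } C) = \frac{|M \text{ and } C|}{111} = \frac{25}{111}$

4. $P(F \text{ or } B) = P(F) + P(B) - P(F \text{ and } B) = \frac{36}{111} + \frac{38}{111} - \frac{20}{111} = \frac{54}{111}$

5. $P(\text{not } C) = P(\overline{C}) = 1 - P(C) = 1 - \frac{34}{111} = \frac{77}{111}$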
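The five answers can be recomputed directly from the cell counts of the table. The sketch below uses exact fractions; the dictionary layout and the helper `prob` are illustrative choices.

```python
from fractions import Fraction

# Cell counts from the table: (usage category, gender) -> frequency
table = {
    ("A", "M"): 32, ("A", "F"): 7,
    ("B", "M"): 18, ("B", "F"): 20,
    ("C", "M"): 25, ("C", "F"): 9,
}
total = sum(table.values())   # 111

def prob(pred):
    """Probability of the event described by pred(use, sex)."""
    return Fraction(sum(c for key, c in table.items() if pred(*key)), total)

p_m = prob(lambda use, sex: sex == "M")                       # 75/111
p_m_and_c = prob(lambda use, sex: sex == "M" and use == "C")  # 25/111
p_c_given_m = p_m_and_c / p_m                                 # 25/75
p_f_or_b = prob(lambda use, sex: sex == "F" or use == "B")    # 54/111
p_not_c = 1 - prob(lambda use, sex: use == "C")               # 77/111
print(p_m, p_c_given_m, p_m_and_c, p_f_or_b, p_not_c)
```

`Fraction` prints values in lowest terms, so 75/111 appears as 25/37 and 54/111 as 18/37.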

Example: In a group of people, 25% have both diabetes and hypertension, 42% have hypertension, and 35% have diabetes. A person is selected at random from this group. What is the probability that this person

a. is diabetic or hypertensive?


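b. does not have hypertension?
c. is not diabetic and does not have hypertension?

Solution: Let $D$ denote the event "diabetic" and $H$ the event "hypertensive".

a. $P(D \text{ or } H) = P(D) + P(H) - P(D \text{ and } H) = 0.35 + 0.42 - 0.25 = 0.52$
b. $P(\overline{H}) = 1 - P(H) = 1 - 0.42 = 0.58$
c. $P(\overline{D} \cap \overline{H}) = P(\overline{D \cup H}) = 1 - P(D \cup H) = 1 - 0.52 = 0.48$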

Calculating the Probability of an Event; Conditional Probability:

Recall that by $P(B \mid A)$ we denote the probability of occurrence of the event $B$ given that the event $A$ has occurred. The conditional probability $P(B \mid A)$ can be computed using the formula

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} = \frac{P(A \cap B)}{P(A)}$$

Thus $P(A \text{ and } B) = P(A \cap B) = P(A)\,P(B \mid A)$.

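Example: Let $A$, $B$ be two events such that $P(A) = 0.4$, $P(B) = 0.8$ and $P(A \cap B) = 0.3$. Find $P(\overline{A} \mid B)$.

Solution: $P(\overline{A} \mid B) = \frac{P(\overline{A} \cap B)}{P(B)}$. Use the following table to find the value of each of these quantities.

| | $A$ | $\overline{A}$ | Total Probability |
|---|---|---|---|
| $B$ | 0.3 | 0.5 | 0.8 |
| $\overline{B}$ | 0.1 | 0.1 | 0.2 |
| Total Probability | 0.4 | 0.6 | 1 |

Thus $P(\overline{A} \mid B) = \frac{P(\overline{A} \cap B)}{P(B)} = \frac{0.5}{0.8} = 0.625$.

Definition: The events $A$ and $B$ are independent if $P(A \text{ and } B) = P(A \cap B) = P(A)\,P(B)$.

Equivalently, if $P(A) > 0$ and $P(B) > 0$ then the events $A$ and $B$ are independent if $P(B \mid A) = P(B)$ (and $P(A \mid B) = P(A)$).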

Example: In a group of people, 25% have both diabetes and hypertension, 42% have hypertension, and 35% have diabetes.

a. What percent of the people who have hypertension also have diabetes?
b. For that group of people, are the events "Diabetic" and "Hypertensive" independent?


Solution:

a. $P(\text{diabetic} \mid \text{hypertensive}) = \frac{P(\text{diabetic and hypertensive})}{P(\text{hypertensive})} = \frac{0.25}{0.42} = 0.595$

Thus the percent of the people with hypertension who also have diabetes is 59.5%.

b. The events "Diabetic" and "Hypertensive" are not independent because

$P(\text{diabetic} \mid \text{hypertensive}) = 0.595 \neq P(\text{diabetic}) = 0.35$
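The conditional probability and the independence check for the diabetes and hypertension figures can be written as a few lines of Python. The variable names are illustrative, and `math.isclose` is used only to compare floating-point values safely.

```python
import math

p_d, p_h, p_d_and_h = 0.35, 0.42, 0.25   # figures from the example

# Conditional probability: P(D | H) = P(D and H) / P(H)
p_d_given_h = p_d_and_h / p_h

# Independence holds only if P(D and H) equals P(D) * P(H)
independent = math.isclose(p_d_and_h, p_d * p_h)

# Addition rule and its complement
p_d_or_h = p_d + p_h - p_d_and_h
p_neither = 1 - p_d_or_h

print(round(p_d_given_h, 3), independent, round(p_d_or_h, 2), round(p_neither, 2))
```

Here $P(D)\,P(H) = 0.147$, which differs from $P(D \text{ and } H) = 0.25$, so the two events are dependent.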

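Fact: If $A$ and $B$ are independent then the following events are also independent: $A$ and $\overline{B}$, $\overline{A}$ and $B$, $\overline{A}$ and $\overline{B}$.

Example: Let $A$, $B$ be two independent events such that $P(A) = 0.4$ and $P(B) = 0.2$. Find

i) $P(\overline{A} \cap B)$ ii) $P(\overline{B} \mid A)$.

Solution: Since $A$ and $B$ are independent, $\overline{A}$ and $B$, and $A$ and $\overline{B}$, are independent. Thus:

i) $P(\overline{A} \cap B) = P(\overline{A})\,P(B) = (1 - 0.4) \times 0.2 = 0.12$
ii) $P(\overline{B} \mid A) = P(\overline{B}) = 1 - 0.2 = 0.8$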

Definition: The events $A$ and $B$ are mutually exclusive if $P(A \cup B) = P(A) + P(B)$.

Equivalently, the events $A$ and $B$ are mutually exclusive if $P(A \cap B) = 0$.

Example: If a person (in the above example) is selected at random,

a. what is the probability that this person is diabetic or hypertensive?
b. are the events "Diabetic" and "Hypertensive" mutually exclusive?

Solution:

a. $P(\text{diabetic or hypertensive}) = P(\text{diabetic}) + P(\text{hypertensive}) - P(\text{diabetic and hypertensive}) = 0.35 + 0.42 - 0.25 = 0.52$

b. The events "Diabetic" and "Hypertensive" are not mutually exclusive because $P(\text{diabetic and hypertensive}) \neq 0$.

Example: If a person (in the above example) is selected at random, what is the probability that this person:

a. does not have hypertension
b. is not diabetic and does not have hypertension

Solution:

a. $P(\text{does not have hypertension}) = 1 - P(\text{has hypertension}) = 1 - 0.42 = 0.58$


b. $P(\text{not diabetic and not hypertensive}) = P(\text{not (diabetic or hypertensive)}) = 1 - P(\text{diabetic or hypertensive}) = 1 - 0.52 = 0.48$

Bayes’s Theorem. Screening Tests, Sensitivity, …

HANDOUT IS NOT AVAILABLE. READ FROM YOUR MAIN REFERENCE.

Reading Assignment: Chapter 3 (3.1, 3.2, 3.3, 3.4, 3.5) in W.W. Daniel.


Chapter 4 Probability Distributions

University of Jordan Fall 2008 / 2009 Department of Mathematics Chapter 4

Probability Distributions

The Distribution of a Discrete Random Variable:

The distribution of a discrete random variable X is a table, a graph or a formula that is used to specify all possible values of X along with the probability of each one of these possible values.

Example: Consider the following distribution of a discrete random variable X.

| k | P(X = k) |
|---|---|
| 0 | 0.2 |
| 1 | 0.3 |
| 2 | 0.1 |
| 3 | 0.4 |
| Total | 1 |

Find: 1) P(X is odd) 2) P(X is even | X > 0)

Solution:

1) P(X is odd) = P(X = 1 or X = 3) = P(X = 1) + P(X = 3) = 0.3 + 0.4 = 0.7

2) P(X is even | X > 0) = P(X is even and X > 0) / P(X > 0) = P(X = 2) / (1 – P(X = 0)) = 0.1 / 0.8 = 0.125

The Expected Value (Mean) and Variance of a Discrete Random Variable:

The expected value (or the mean) of a discrete random variable $X$ is denoted by $E(X)$ (or $\mu$) and is given by $\mu = \sum_k k\,P(X = k)$, where the sum runs over all possible values $k$ of the random variable $X$. The variance of $X$ is given by $\sigma^2 = E(X^2) - \mu^2$, where $E(X^2) = \sum_k k^2\,P(X = k)$.

Example: Find $\mu$ and $\sigma^2$ for the random variable given in the above example.

Solution:

| k | P(X = k) | k P(X = k) | k² P(X = k) |
|---|---|---|---|
| 0 | 0.2 | 0 | 0 |
| 1 | 0.3 | 0.3 | 0.3 |
| 2 | 0.1 | 0.2 | 0.4 |
| 3 | 0.4 | 1.2 | 3.6 |
| Total | 1 | 1.7 | 4.3 |

$\mu = 1.7$ and $\sigma^2 = 4.3 - (1.7)^2 = 1.41$
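The column sums in the solution table correspond to two weighted sums, which the following sketch computes for the same distribution. Storing the distribution as a dictionary is an illustrative choice.

```python
# Distribution of X: value k -> P(X = k)
dist = {0: 0.2, 1: 0.3, 2: 0.1, 3: 0.4}

mean = sum(k * p for k, p in dist.items())              # E(X)
second_moment = sum(k**2 * p for k, p in dist.items())  # E(X^2)
variance = second_moment - mean**2                      # E(X^2) - (E(X))^2

print(round(mean, 2), round(second_moment, 2), round(variance, 2))
```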


331331 Biostatistics Lecture #9 Dr. Osama Alkam


The Binomial Experiment and Distribution:

Before we introduce the binomial (or Bernoulli) experiments we introduce some notation for some relevant mathematical quantities.

1. The Factorial of a Nonnegative Integer: if $n$ is a nonnegative integer then by $n!$ we denote "$n$ factorial", defined by

$$n! = \begin{cases} 1 & \text{if } n = 0 \\ n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 & \text{if } n > 0 \end{cases}$$

Remark: for any $n \geq 1$, $n! = n \times (n-1)!$

Example: $0! = 1$, $1! = 1$, $2! = 2$, $3! = 3 \times 2 \times 1 = 6$, $4! = 4 \times 3! = 24$, …

2. Combinations: if $n$ is a positive integer and $k$ is an integer such that $0 \leq k \leq n$ then the combination $\binom{n}{k}$ is defined by

$$\binom{n}{k} = \frac{n!}{k! \times (n-k)!}$$

Example:

$$\binom{10}{10} = \frac{10!}{10! \times 0!} = 1$$

$$\binom{10}{0} = \frac{10!}{0! \times 10!} = 1$$

$$\binom{10}{1} = \frac{10!}{1! \times 9!} = \frac{10 \times 9!}{1 \times 9!} = 10$$

$$\binom{10}{4} = \frac{10!}{4! \times 6!} = \frac{10 \times 9 \times 8 \times 7 \times 6!}{4 \times 3 \times 2 \times 6!} = 10 \times 3 \times 7 = 210$$

Fact: The number of ways of selecting $k$ objects from $n$ objects is given by $\binom{n}{k}$.

Example: How many teams of 6 players can we choose out of a group of 8 people?

Answer: $\binom{8}{6} = \frac{8!}{6! \times 2!} = \frac{8 \times 7 \times 6!}{6! \times 2} = 28$ teams.

Example: In how many ways can we choose 3 balls from an urn that contains 5 balls?

Answer: $\binom{5}{3} = \frac{5!}{3! \times 2!} = 10$ ways.

Example: How many events with size 4 are there if the size of the sample space is 6?

Answer: $\binom{6}{4} = \frac{6!}{4! \times 2!} = \frac{6 \times 5 \times 4!}{4! \times 2} = 15$ events.
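The combination counts in these examples can be verified with Python's standard library. The sketch compares the factorial formula with `math.comb` (available in Python 3.8 and later); the helper name `choose` is illustrative.

```python
from math import comb, factorial

def choose(n, k):
    """Number of ways of selecting k objects from n, by the factorial formula."""
    return factorial(n) // (factorial(k) * factorial(n - k))

results = {(n, k): choose(n, k) for n, k in [(10, 4), (8, 6), (5, 3), (6, 4)]}
print(results)   # {(10, 4): 210, (8, 6): 28, (5, 3): 10, (6, 4): 15}

# The built-in function gives the same values
assert all(comb(n, k) == v for (n, k), v in results.items())
```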


The binomial (or Bernoulli) experiment: A binomial (or Bernoulli) experiment is a random experiment that has the following properties:

1) Each trial has exactly one of two possible outcomes; one is referred to as success and the other is referred to as failure.

2) The probability of success in each trial of the experiment is constant, usually denoted by $p$.

3) All trials of the experiment are independent.

Examples:

1. Tossing a coin. The outcome is either a head or a tail.
2. Checking whether a newborn is a boy or a girl.
3. Checking whether a person is diabetic or not.

The Binomial Random Variable:

The binomial random variable $X$ is the number of successes when a binomial experiment, with probability of success $p$ in each trial, is performed $n$ times. We denote it by $X \sim \text{Bin}(n, p)$. The possible values of $X$ are $0, 1, \ldots, n$.

Examples:

1. Select a random sample of 10 people. Let $X$ be the number of diabetics within this sample. Then $X \sim \text{Bin}(10, p)$, where $p$ is the proportion of diabetics in the population from which the sample is selected. The possible values of $X$ are $0, 1, 2, \ldots, 10$.

2. Toss a fair coin 20 times. Let $X$ be the number of times a head comes out. Then $X \sim \text{Bin}(20, 0.5)$. The possible values of $X$ are $0, 1, 2, \ldots, 20$.

Fact: If $X \sim \text{Bin}(n, p)$ then

1) for each $k = 0, 1, \ldots, n$, $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$

2) $E(X) = np$

3) $Var(X) = np(1 - p)$

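Example:

Let $X \sim \text{Bin}(5, 0.3)$. Find: 1) $P(X = 2)$ 2) $E(X)$ 3) $E(X^2)$

Solution:

1) $P(X = 2) = \binom{5}{2} (0.3)^2 (0.7)^3 = \frac{5!}{2!\,3!} \times 0.09 \times 0.343 = 10 \times 0.09 \times 0.343 = 0.3087$

2) $E(X) = 5 \times 0.3 = 1.5$

3) $Var(X) = E(X^2) - (E(X))^2$. Thus $E(X^2) = Var(X) + (E(X))^2 = 5 \times 0.3 \times 0.7 + (1.5)^2 = 3.3$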
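A small Python sketch of the binomial probability formula reproduces the three answers for $n = 5$ and $p = 0.3$. The function name `binom_pmf` is illustrative, and the direct sum for $E(X^2)$ is included only as a cross-check of the shortcut through the variance.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.3
p2 = binom_pmf(2, n, p)        # P(X = 2)
mean = n * p                   # E(X) = np
var = n * p * (1 - p)          # Var(X) = np(1 - p)
ex2 = var + mean**2            # E(X^2) = Var(X) + (E(X))^2

# Cross-check: E(X^2) computed directly from the distribution
ex2_direct = sum(k**2 * binom_pmf(k, n, p) for k in range(n + 1))

print(round(p2, 4), round(mean, 2), round(ex2, 2), round(ex2_direct, 2))
```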


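Solution: The distribution of $X \sim \text{Bin}(2, 0.5)$ as a table:

| k | P(X = k) |
|---|---|
| 0 | 0.25 |
| 1 | 0.5 |
| 2 | 0.25 |
| Total | 1 |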

Example: Suppose that the probability that a patient suffering from migraine headache pain will obtain relief with a particular drug is 0.9. Three randomly selected sufferers from migraine headache are given this drug. Find the probability that the number of sufferers in the selected sample obtaining relief will be:

1) Exactly zero
2) At least one
3) Two or three
4) At most two

Solution: Let $X$ be the number of sufferers in the selected sample obtaining relief. Then $X \sim \text{Bin}(3, 0.9)$.

1) $P(X = 0) = \binom{3}{0} (0.9)^0 (0.1)^3 = (0.1)^3 = 0.001$

2) $P(X \geq 1) = 1 - P(X = 0) = 1 - 0.001 = 0.999$

3) $P(X = 2 \text{ or } X = 3) = P(X = 2) + P(X = 3) = 0.243 + 0.729 = 0.972$

4) $P(X \leq 2) = 1 - P(X = 3) = 1 - 0.729 = 0.271$

Note: The binomial distribution is completely determined by $n$ and $p$. They are called "the parameters of the binomial distribution".

Binomial Tables: When $n$ is large, the calculation of binomial probabilities using the equation can be tedious. We may bypass these tedious calculations by using a binomial table. Binomial tables enable us to read the value of $P(X \leq k)$ for any $k = 0, 1, \ldots, n$.

Example: Let $X \sim \text{Bin}(2, 0.5)$. Exhibit the distribution of $X$ as a table.


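The following is a part of the binomial table for $n = 10$.

Example: Let $X \sim \text{Bin}(10, 0.3)$. Use the above table to find:

1) $P(X \leq 4)$
2) $P(X < 4)$
3) $P(X = 4)$
4) $P(X > 4)$
5) $P(X \geq 4)$
6) $P(2 < X < 6)$
7) $P(2 \leq X < 6)$
8) $P(2 \leq X \leq 6)$
9) $P(2 < X \leq 6)$

Solution:

1) $P(X \leq 4) = 0.850$

2) $P(X < 4) = P(X \leq 3) = 0.650$

3) $P(X = 4) = P(X \leq 4) - P(X \leq 3) = 0.850 - 0.650 = 0.200$

4) $P(X > 4) = 1 - P(X \leq 4) = 1 - 0.850 = 0.150$

5) $P(X \geq 4) = 1 - P(X < 4) = 1 - P(X \leq 3) = 1 - 0.650 = 0.350$

6) $P(2 < X < 6) = P(3 \leq X \leq 5) = P(X \leq 5) - P(X \leq 2) = 0.953 - 0.383 = 0.570$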
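When a printed table is not at hand, the cumulative probabilities it lists can be computed directly. The sketch below builds the values of $P(X \leq k)$ for $n = 10$, $p = 0.3$ and repeats parts 1, 2, 3 and 6; the names `binom_cdf` and `F` are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """Cumulative probability P(X <= k) for a binomial random variable."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 10, 0.3
F = {k: binom_cdf(k, n, p) for k in range(n + 1)}   # the "table" column for p = 0.3

p_le_4 = F[4]               # P(X <= 4)
p_lt_4 = F[3]               # P(X < 4) = P(X <= 3)
p_eq_4 = F[4] - F[3]        # P(X = 4)
p_2_lt_x_lt_6 = F[5] - F[2] # P(2 < X < 6) = P(X <= 5) - P(X <= 2)

print(round(p_le_4, 3), round(p_lt_4, 3), round(p_eq_4, 3), round(p_2_lt_x_lt_6, 3))
```

The strict and non-strict inequalities differ only in which table entries are subtracted, which is the point of the nine variants in the example.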

The rest are left as an exercise.

Reading Assignment: Chapter 4 (4.1, 4.2, 4.3) in W.W. Daniel, 7th edition.


The Poisson Random Variable:

The Poisson random variable $X$ is the number of occurrences of a rare event in an interval of time or a space unit. If $\lambda$ is the average (or expected) number of occurrences of this event in the time (or space) unit then we write $X \sim \text{Poisson}(\lambda)$.

The possible values of $X$ are $0, 1, 2, \ldots$

Fact: If $X \sim \text{Poisson}(\lambda)$ then

1) for each $k = 0, 1, 2, \ldots$, $P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$, where $e \approx 2.718$
2) $E(X) = \lambda$
3) $Var(X) = \lambda$

Example: Let $X \sim \text{Poisson}(3)$. Find: 1) $P(X > 0)$ 2) $E(X^2)$

Solution:

1) $P(X > 0) = 1 - P(X \leq 0) = 1 - P(X = 0) = 1 - \frac{e^{-3}\, 3^0}{0!} = 1 - e^{-3}$

2) $Var(X) = E(X^2) - (E(X))^2$. Thus $E(X^2) = Var(X) + (E(X))^2 = 3 + 3^2 = 12$
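The Poisson formula and the two answers for $\lambda = 3$ can be evaluated numerically as in this sketch. The function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3
p_positive = 1 - poisson_pmf(0, lam)   # P(X > 0) = 1 - e^(-3)
ex2 = lam + lam**2                     # E(X^2) = Var(X) + (E(X))^2

print(round(p_positive, 4), ex2)
```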


Example: The number of cases admitted to the CCU in a certain hospital is distributed according to a Poisson distribution with an average of 3 cases per day. Find the probability of admitting 25 cases to the CCU in this hospital in a random week.

Solution: Let $X$ be the number of cases admitted to the CCU in this hospital in a week. Then $X \sim \text{Poisson}(3 \times 7) = \text{Poisson}(21)$. Thus, $P(X = 25) = \frac{e^{-21}\, 21^{25}}{25!} = 0.055546$.

Note: The Poisson distribution is completely determined by $\lambda$. It is called "the parameter of the Poisson distribution".
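The key step in the CCU example is rescaling the daily rate to the weekly interval before applying the Poisson formula. A minimal sketch, with illustrative variable names:

```python
from math import exp, factorial

daily_rate = 3                 # average admissions per day
lam_week = daily_rate * 7      # the rate scales with the length of the interval

# P(X = 25) for X ~ Poisson(21)
p25 = exp(-lam_week) * lam_week**25 / factorial(25)
print(round(p25, 4))
```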

Poisson Tables: Poisson tables enable us to read the value of $P(X \leq k)$ for any $k = 0, 1, \ldots$ when $X \sim \text{Poisson}(\lambda)$ for several values of $\lambda$.

The following is a part of a Poisson table for $P(X \leq k)$.

Exercise: Let $X \sim \text{Poisson}(1.5)$. Use the above table to find:

1) $P(X \leq 3)$
2) $P(X < 3)$
3) $P(X = 3)$
4) $P(X > 2)$
5) $P(X \geq 2)$
6) $P(2 < X < 5)$
7) $P(2 \leq X < 5)$
8) $P(2 \leq X \leq 5)$
9) $P(2 < X \leq 5)$

Reading Assignment: Chapter 4 (4.4) in W.W. Daniel.


The Normal Distribution:

The normal distribution is probably one of the most important and widely used continuous distributions. A normally distributed random variable is known as a normal random variable. The following are the properties of the normal distribution:

Properties of the Normal Distribution:

1. It is bell shaped and is symmetrical about its mean.
2. Its mean equals its median equals its mode.
3. It is a continuous distribution.
4. It is completely determined by its mean and its variance. A normal random variable $X$ with mean $\mu$ and variance $\sigma^2$ is expressed as $X \sim N(\mu, \sigma^2)$.
5. The total area under the curve equals 1. Thus, the area of the distribution on each side of the mean is 0.5.
6. The probability that the normal random variable will have a value between any two points is equal to the area under the curve between those points.


The curve on the right is skewed to the right. Its mode < its median < its mean. The one on the left is skewed to the left. Its mode > its median > its mean.

To find the probability that a normal random variable $X$ will have a value smaller than a given number, we transform the normal random variable $X$ to the standard normal random variable $Z$ that has mean 0 and variance 1. This transformation is done using the formula

$$Z = \frac{X - \mu}{\sigma}.$$

A standard Z table can be used to find probabilities for any normal curve problem that has been converted to Z scores.

The following steps are helpful when working with normal curve problems:

1. Graph the normal distribution, and shade the area related to the probability you want to find.
2. Convert the boundaries of the shaded area from X values to the standard normal random variable Z values using the Z formula above.
3. Use the standard Z table to find the probabilities or the areas related to the Z values in step 2.

Example:

The weights of 1000 children are normally distributed with mean 25 kg and standard deviation 5 kg.

1) Find the proportion of children that have weights between 22 kg and 28 kg.

2) About how many children have weights smaller than 30 kg?

3) If a child is randomly selected, find the probability that her/his weight is smaller than 28 kg.

4) Find the third quartile of the weights of these children.

5) Find a positive number C such that 68% of the children have weights between 25 – C and 25 + C.


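Solution:

Let $X$ represent the children's weights. Then $X \sim N(25, 5^2)$.

1) To find $P(22 < X < 28)$:

$$P(22 < X < 28) = P\left(\frac{22 - 25}{5} < Z < \frac{28 - 25}{5}\right) = P(-0.6 < Z < 0.6) = P(Z < 0.6) - P(Z < -0.6) = 0.7257 - 0.2743 = 0.4514$$

2) $P(X < 30) = P\left(Z < \frac{30 - 25}{5}\right) = P(Z < 1) = 0.841$

Thus, about $0.841 \times 1000 = 841$ children have weights less than 30 kg.

3) Find $P(X < 28)$ (Exercise)

4) The third quartile is nothing but $Q_3$, which is characterized by the property $P(X < Q_3) = 0.75$. Thus, $P\left(Z < \frac{Q_3 - 25}{5}\right) = 0.75$. From the standard normal table we find that $\frac{Q_3 - 25}{5} \approx 0.67$. Hence, $Q_3 = 5 \times 0.67 + 25 = 28.35$ kg.

5) $P(25 - C < X < 25 + C) = 0.68 \rightarrow P\left(\frac{25 - C - 25}{5} < Z < \frac{25 + C - 25}{5}\right) = 0.68 \rightarrow P\left(-\frac{C}{5} < Z < \frac{C}{5}\right) = 0.68 \rightarrow P\left(Z < \frac{C}{5}\right) = 0.84 \rightarrow \frac{C}{5} = 1 \rightarrow C = 5$.

Reading Assignment: Chapter 4 (4.6, 4.7) in W.W. Daniel.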
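The normal-curve answers for the children's weights example can be checked without a Z table using `statistics.NormalDist` from the Python standard library (Python 3.8 and later). The variable names are illustrative, and small differences from the table-based answers come from rounding Z values to two decimals.

```python
from statistics import NormalDist

weights = NormalDist(mu=25, sigma=5)   # X ~ N(25, 5^2)

p_between = weights.cdf(28) - weights.cdf(22)   # P(22 < X < 28)
n_below_30 = 1000 * weights.cdf(30)             # expected count below 30 kg
q3 = weights.inv_cdf(0.75)                      # third quartile
c = weights.inv_cdf(0.84) - 25                  # C with P(25 - C < X < 25 + C) = 0.68

print(round(p_between, 4), round(n_below_30), round(q3, 2), round(c, 2))
```

The exact quartile is about 28.37 kg rather than 28.35 kg because the table value 0.67 is a rounded version of 0.6745, and the exact $C$ is slightly below 5 for the same reason.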

Chapter 5 Some Important Sampling Distributions

Introduction:

A statistical measure for a sample is called a statistic and a statistical measure for a population is called a parameter. Examples of statistics are $\bar{x}$, $s$, …. The following are parameters: $\mu$, $\sigma$, …. A statistic is a random variable but a parameter is not. Sample statistics like $\bar{x}$ and $s$ are used to estimate population parameters like $\mu$ and $\sigma$, respectively. There is some difference (or error) between statistics and parameters. Different samples from the same population may have different amounts of sampling error. Studying sampling distributions of sample statistics helps us understand statistical inference and allows us to answer questions about sample statistics.

Sampling Distributions:

The sampling distribution of a statistic is the distribution of the values taken by that statistic in all possible samples of the same size that are drawn from the same population.


Note: The number of all possible samples of size $n$, drawn without replacement from a population of size $N$, equals

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}.$$

If we allow replacement then the number of all possible samples is $N^n$.

Example: The following table gives all possible samples of size 2 drawn with replacement from a population that comprises the weights (in pounds) of 5 children, together with the mean of each sample.

Population data: 65 54 67 65 88

| Population | 65 | 54 | 67 | 65 | 88 |
|---|---|---|---|---|---|
| 65 | (65,65), 65 | (54,65), 59.5 | (67,65), 66 | (65,65), 65 | (88,65), 76.5 |
| 54 | (65,54), 59.5 | (54,54), 54 | (67,54), 60.5 | (65,54), 59.5 | (88,54), 71 |
| 67 | (65,67), 66 | (54,67), 60.5 | (67,67), 67 | (65,67), 66 | (88,67), 77.5 |
| 65 | (65,65), 65 | (54,65), 59.5 | (67,65), 66 | (65,65), 65 | (88,65), 76.5 |
| 88 | (65,88), 76.5 | (54,88), 71 | (67,88), 77.5 | (65,88), 76.5 | (88,88), 88 |

The following chart represents the above samples' means
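The table of samples can be regenerated by enumeration, which also makes it easy to inspect the distribution of the sample means. The sketch below is illustrative; the comparison of the mean and variance of the sample means with the population values is a standard fact about sampling with replacement rather than something computed in the handout.

```python
from itertools import product
from statistics import mean, pvariance

population = [65, 54, 67, 65, 88]
n = 2

# All N^n = 25 ordered samples of size 2 drawn with replacement
samples = list(product(population, repeat=n))
means = [mean(s) for s in samples]

print(len(samples))                             # 25
print(mean(population), mean(means))            # both equal the population mean
print(pvariance(population) / n, pvariance(means))   # variance of the means is sigma^2 / n
```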


Sampling Distribution of the Mean:

Theorem:

The sampling distribution of $\bar{x}$ in a normally distributed population with mean $\mu$ and standard deviation $\sigma$ is also normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, where $n$ is the sample size, provided that sampling is performed with replacement. If sampling is performed without replacement then the sampling distribution is also normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$, where $N$ is the size of the population.

The factor $\sqrt{\frac{N - n}{N - 1}}$ is called the correction factor. It is negligible if $n \leq 0.05N$ or $N$ is very large (infinite or practically infinite).

The Central Limit Theorem (CLT): When the sample size is large ($n \geq 30$), the above Theorem is also valid even if the population is not normally distributed. In fact the sampling distribution of the mean is almost normal when $n$ is large. The larger the sample size, the closer the sampling distribution of the mean is to being normally distributed.

Example: Suppose that the ages of Jordan University students follow a normal distribution with mean 20.5 years and standard deviation 1.4 years. If we repeatedly collect samples of size $n = 49$:

a) what is the sampling distribution of $\bar{x}$?

Answer: $\bar{x} \sim N\left(20.5, \left(\frac{1.4}{\sqrt{49}}\right)^2\right) = N(20.5, 0.04) = N(20.5, (0.2)^2)$

b) what is the probability that the mean age of a randomly selected sample of size 49 of Jordan University students is smaller than 21 years?

Answer: $P(\bar{x} < 21) = P\left(Z < \frac{21 - 20.5}{0.2}\right) = P(Z < 2.5) = 0.9938$

c) what is the probability that an individual student is younger than 21 years old?

Answer: $x \sim N(20.5, (1.4)^2)$, thus

$P(x < 21) = P\left(Z < \frac{21 - 20.5}{1.4}\right) = P(Z < 0.36) = 0.6406$

d) what is the distribution of $\bar{x}$ if the ages of Jordan University students do not follow a normal distribution?


Answer: The distribution of $\bar{x}$ will be approximately normal with mean 20.5 and standard deviation 0.2, since the sample size is greater than 30.

Reading Assignment:

Chapter 5 (5.1, 5.2, 5.3) in W.W. Daniel.

Distribution of the Difference Between Two Sample Means:

Suppose that we want to know whether or not the mean serum cholesterol level is higher in a population of sedentary office workers than in a population of laborers. If we know that those means are different, then we may wish to know by how much they differ. One way is to take a random sample from each population and then look at the sampling distribution of $\bar{x}_1 - \bar{x}_2$ to answer probability questions and draw statistical inferences.

Sampling Distribution of $\bar{x}_1 - \bar{x}_2$:

Theorem:

If we draw two independent random samples of sizes $n_1$ and $n_2$ from two distinct normally distributed populations, having means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, respectively, then $\bar{x}_1 - \bar{x}_2$ is normally distributed with mean $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and standard deviation $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.

Note: The above theorem is also valid if the populations are not (both) normally distributed, provided that both $n_1$ and $n_2$ are greater than or equal to 30.
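The note about non-normal populations can be explored with a small simulation. The following Python sketch is an illustration under stated assumptions and is not taken from the handout: the two populations are hypothetical exponential (strongly skewed) populations with means 2.0 and 3.0, the sample sizes are both 30, and the helper `simulate_differences` is a name introduced here. It compares the simulated mean and standard deviation of $\bar{x}_1 - \bar{x}_2$ with the values given by the theorem.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_differences(mean1, n1, mean2, n2, reps=5000, seed=0):
    """Draw `reps` pairs of samples from two exponential populations
    and return the differences between the two sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        xbar1 = sum(rng.expovariate(1 / mean1) for _ in range(n1)) / n1
        xbar2 = sum(rng.expovariate(1 / mean2) for _ in range(n2)) / n2
        diffs.append(xbar1 - xbar2)
    return diffs


# Hypothetical populations: an exponential distribution has sd equal to its mean.
mean1, mean2, n1, n2 = 2.0, 3.0, 30, 30
diffs = simulate_differences(mean1, n1, mean2, n2)

theory_mean = mean1 - mean2
theory_sd = sqrt(mean1**2 / n1 + mean2**2 / n2)

# For a normal distribution, roughly 68% of values lie within one sd of the mean.
within_one_sd = sum(abs(d - theory_mean) <= theory_sd for d in diffs) / len(diffs)

print("simulated mean:", mean(diffs), " theory:", theory_mean)
print("simulated sd:  ", pstdev(diffs), " theory:", theory_sd)
print("share within one sd:", within_one_sd)
```

The simulated values only approximate the theoretical ones, and the agreement depends on the seed and the number of repetitions.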

Example: One group on a diet lost an average of 7.2 kg with a standard deviation of 3.7 kg; another group doing sports exercises lost an average of 4.0 kg with a standard deviation of 3.9 kg. Suppose we collect samples of sizes $n_1 = 42$ from the diet group and $n_2 = 47$ from the exercise group:

(a) what is the sampling distribution of $\bar{x}_1 - \bar{x}_2$?

Answer: the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is approximately normal (since $n_1 \ge 30$ and $n_2 \ge 30$) with mean $7.2 - 4.0 = 3.2$ kg and standard deviation $\sqrt{\frac{(3.7)^2}{42} + \frac{(3.9)^2}{47}} = 0.806$ kg.

(b) what is the probability that the difference between the mean weight losses of the two groups is larger than 4.0 kg?


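Answer: $P(\bar{x}_1 - \bar{x}_2 > 4.0) = P\left(Z > \frac{4.0 - 3.2}{0.806}\right) = P(Z > 0.99) = 1 - P(Z \le 0.99) = 1 - 0.8389 = 0.1611$

(c) what is the probability that the mean weight loss of the exercise group is larger than 4.0 kg?

Answer: $\bar{x}_2 \sim N\left(4.0, \frac{(3.9)^2}{47}\right) = N(4.0, (0.569)^2)$, thus

$P(\bar{x}_2 > 4.0) = P\left(Z > \frac{4.0 - 4.0}{0.569}\right) = P(Z > 0) = 0.5$

(d) Find the IQR (interquartile range) of $\bar{x}_1 - \bar{x}_2$.

Solution: $IQR = Q_3 - Q_1 = P_{75} - P_{25}$.

$P(\bar{x}_1 - \bar{x}_2 < Q_3) = 0.75 \rightarrow P\left(Z < \frac{Q_3 - 3.2}{0.806}\right) = 0.75 \rightarrow \frac{Q_3 - 3.2}{0.806} = 0.675 \rightarrow Q_3 = 0.675 \times 0.806 + 3.2 = 3.744$

$P(\bar{x}_1 - \bar{x}_2 < Q_1) = 0.25 \rightarrow P\left(Z < \frac{Q_1 - 3.2}{0.806}\right) = 0.25 \rightarrow \frac{Q_1 - 3.2}{0.806} = -0.675 \rightarrow Q_1 = 3.2 - 0.675 \times 0.806 = 2.656$

Thus, $IQR = 3.744 - 2.656 = 1.088$.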
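The numbers in this example can be reproduced without a Z table. The sketch below is an added illustration, not part of the handout; it assumes Python 3.8 or later for `statistics.NormalDist`, and the variable names are chosen here. Because it does not round $z$ to two decimals, its results differ slightly from the table-based answers.

```python
from math import sqrt
from statistics import NormalDist

# Parameters from the example: diet group (1) and exercise group (2).
mu1, sigma1, n1 = 7.2, 3.7, 42
mu2, sigma2, n2 = 4.0, 3.9, 47

# (a) sampling distribution of the difference between the sample means
mu_diff = mu1 - mu2
sd_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
diff = NormalDist(mu_diff, sd_diff)

# (b) probability that the difference exceeds 4.0 kg
p_b = 1 - diff.cdf(4.0)

# (c) probability that the exercise group's sample mean exceeds 4.0 kg
exercise_mean = NormalDist(mu2, sigma2 / sqrt(n2))
p_c = 1 - exercise_mean.cdf(4.0)

# (d) interquartile range of the difference
q1, q3 = diff.inv_cdf(0.25), diff.inv_cdf(0.75)
iqr = q3 - q1

print(f"(a) mean = {mu_diff:.1f}, sd = {sd_diff:.3f}")
print(f"(b) P(diff > 4.0) = {p_b:.4f}")
print(f"(c) P(xbar2 > 4.0) = {p_c:.4f}")
print(f"(d) Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
```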

Distribution of the Sample Proportion:

In this section we study the distribution of the sample proportion. This distribution helps us answer probability questions about proportions when it is tedious, difficult, or practically impossible to use binomial tables. For example, suppose that 8 percent of a certain population are color blind. If we randomly select 1500 individuals from this population, what is the probability that the proportion of color-blind individuals in that sample is at least 0.10? To answer such a question using binomial tables, we need to find the probability that the variable $x$ is greater than or equal to $0.10 \times 1500 = 150$, given that $x$ is binomially distributed with $p = 0.08$ and $n = 1500$. How would we answer that question if we don't have binomial tables for $n = 1500$ (or even for any $n > 25$)?

Distribution of Sample Proportion; An Empirical Rule:

When the sample size is "large" (we will see shortly what large means), the sample proportion $\hat{p}$ is approximately normally distributed with mean equal to the true population proportion $p$ and standard deviation equal to $\sqrt{\frac{p(1-p)}{n}}$. The sample is considered "large enough" if $np > 5$ and $n(1-p) > 5$.
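With a computer, the color-blindness question posed above can be answered both exactly and with the normal approximation. The sketch below is an added illustration, not part of the handout. The helper `binom_pmf` is a name introduced here; it works in log space through `math.lgamma` because the binomial coefficients for $n = 1500$ are too large for ordinary floating-point arithmetic.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


n, p = 1500, 0.08
threshold = 150  # 0.10 * 1500 color-blind individuals

# The rule's sample-size condition
large_enough = n * p > 5 and n * (1 - p) > 5

# Exact binomial tail: P(X >= 150)
exact_tail = sum(binom_pmf(k, n, p) for k in range(threshold, n + 1))

# Normal approximation for the sample proportion: P(p_hat >= 0.10)
se = sqrt(p * (1 - p) / n)
approx_tail = 1 - NormalDist(p, se).cdf(0.10)

print("large enough:", large_enough)
print("standard deviation of p_hat:", round(se, 4))
print("exact binomial tail:", exact_tail)
print("normal approximation:", approx_tail)
```

Both tail probabilities are small, but they are not identical: the normal approximation ignores the discreteness and slight skewness of the binomial distribution.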


Example: Suppose that 8 percent of a certain population are color blind, and that we randomly select 1500 individuals from this population. Find:

a) the probability that the proportion of color-blind individuals in that sample is at least 0.10.

b) the 95th percentile of $\hat{p}$.

Solution:

a) $p = 0.08$ and $n = 1500$. Since $np = 1500 \times 0.08 = 120 > 5$ and $n(1-p) = 1500 \times 0.92 = 1380 > 5$, the proportion of color-blind individuals is approximately normally distributed with mean $p = 0.08$ and standard deviation $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.08 \times 0.92}{1500}} = 0.007$.

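Thus $P(\hat{p} \ge 0.10) = P\left(Z \ge \frac{0.10 - 0.08}{0.007}\right) = P(Z \ge 2.86) = 1 - 0.9979 = 0.0021$.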

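b) $P(\hat{p} < P_{95}) = 0.95 \rightarrow P\left(Z < \frac{P_{95} - 0.08}{0.007}\right) = 0.95 \rightarrow \frac{P_{95} - 0.08}{0.007} = 1.65 \rightarrow P_{95} = 1.65 \times 0.007 + 0.08 = 0.09155$

Distribution of the difference between two sample proportions:

HANDOUT IS NOT AVAILABLE. READ DIRECTLY FROM YOUR MAIN REFERENCE.

Reading Assignment:

Chapter 5 (5.1, 5.2, 5.3, 5.4, 5.5, 5.6) in W.W. Daniel.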
