sample-based epidemiology concepts infant mortality in the usa (1991) infant mortality in the usa...

Sample-Based Epidemiology Concepts

Infant Mortality in the USA (1991)

          Unmarried    Married      Total
Deaths    16,712       18,784       35,496
Alive     1,197,142    2,878,421    4,075,563
Total     1,213,854    2,897,205    4,111,059

We rarely have the luxury of having the entire population at our disposal so we usually take a small (or large, if you have the money and time and even larger if you also have lots of post-docs to collate data) random sample from our selected population and estimate the population incidence (probabilities) based on the sample. This means that we will have errors in estimation; with big errors if we use small numbers of people in our samples and smaller errors if we use bigger numbers of people in our samples.

Because of the error in estimating the population parameter, we have to calculate confidence limits for our estimate; our sample predicts a parameter but the parameter could be smaller or larger than the predicted value, so we need to know the range of possible values for the predicted parameter. To see how this works we have to delve into the incredibly cool Universe of Statistical Analysis.

The terms confidence limits and estimate of population parameters are highly relevant to research in the health sciences because they are statistical concepts.

Statistics and statistical analysis are nothing more than calculating measures of probability, association, central tendency and variance of sample data (statistics) and the probabilities that the calculated statistics relate to the target population (statistical analysis).

Of course statistical probabilities are not exactly the same as the actual population probabilities of infant mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991; two separate population parameters.
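As a quick check, the two population parameters quoted above can be recomputed directly from the 1991 table. This is only a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
# Counts taken from the "Infant Mortality in the USA (1991)" table.
deaths_total = 35_496
alive_total = 4_075_563
births_total = deaths_total + alive_total  # 4,111,059 births in all

# These are population parameters: the whole population was counted,
# so there is no sampling error and no confidence interval to calculate.
p_mortality = deaths_total / births_total
p_non_mortality = alive_total / births_total

print(f"infant mortality:     {p_mortality:.4f}")
print(f"infant non-mortality: {p_non_mortality:.4f}")
```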

A parameter is any measure from a population while a statistic is any measure from a sample.

If we test entire populations then we do not need statistical analysis. For example:

If another population (let's say, another country) was measured in its entirety and the other country’s infant mortality and infant non-mortality were calculated as 0.0085 and 0.9915, respectively [compared to infant mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991] we could conclude with absolute certainty (100% confidence) that the two populations were completely different with regard to these two parameters because we would be absolutely certain that the calculated numbers are exactly descriptive of the respective populations (even though there is just a tiny difference between the two populations). Different numbers means different!

However, because samples are not necessarily exactly representative of the population from which they came, differing numbers from two (or more) different samples do not necessarily guarantee that the samples came from two (or more) different populations.

As previously mentioned, we simply NEVER (well, not very often anyway) have the luxury of being able to measure the entire population so we have to suffer with a (usually) small sample that was selected from the population.

We then measure whatever it is we are interested in, let's say “Infant Mortality” or “Height”, and then assume that our sample represents our population and that whatever the sample statistic is, that same number is an estimate of the parameter of the population from which the sample was selected.

Because such an assumption may not be absolutely true, i.e., the sample doesn’t perfectly represent the population, we need to have some idea of where the actual population parameter might be …

To do this, we simply perform a particular type of statistical analysis to estimate a range of possible values that would include the population parameter ... we use the sample data to do so: the sample statistic is used to estimate the exact middle of the range and the variability of the numbers in the sample is used to estimate the highest and lowest value of the range …

To understand how these statistical calculations are made we need to start with a frequency distribution of the data:


Once we have a frequency distribution of the data then the mathematical properties of the frequency distribution can be used to estimate the range of values within which the population parameter might lie, within certain confidence limits or confidence intervals . . .

The predicted range of values within which the population parameter might exist is calculated on the basis of Confidence Intervals and these are defined by percentages: 95% confidence interval, 90% confidence interval, 99% confidence interval . . .

These percentages relate to statistical probabilities . . .

95% CI: There is a probability of 0.95 that the population parameter exists within the calculated range of values – or a probability of 0.05 that it does not . . .



An extremely accurate, but rather cumbersome way to describe data; especially if there were hundreds or thousands of people in the population . . . . .

A little less accurate of a description but a whole lot easier to describe because only the shape of the line is being described; not each of the individual data points. Note that the shape of the line still accurately describes how the data is distributed on the number line, we just need a more accurate way to describe the line …

And there even is a way to calculate those two parts of the curve. (If you look at the right and left halves of the curve separately, you may recognize them as sigmoid curves.)

The measure of central tendency most often used to describe the peak of the data curve is called mu (µ, the population parameter) or the mean (x̄, the sample statistic), and the measure of variability most often used to describe the dispersion of the data along the number line is called the standard deviation (σ for the parameter; sd for the statistic), which is equal to the square root of the variance (σ² or V).

$$\mu = \frac{\sum x}{n}$$

(commonly called the average: add up all the scores and divide by the total number of scores)

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$

(subtract the mean from each score, square each result, add up all the squares, and then divide by n; then take the square root to get σ)
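The recipe in words above translates almost line for line into code. The sketch below follows the population formulas exactly as written (dividing by n); the list of scores is made up purely for illustration. Note that when working from a sample, the sd is usually calculated by dividing by n - 1 instead, which is not shown here.

```python
import math

def population_mean(scores):
    # Add up all the scores and divide by the total number of scores.
    return sum(scores) / len(scores)

def population_variance(scores):
    # Subtract the mean from each score, square each result,
    # add up all the squares, and then divide by n.
    mu = population_mean(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)

scores = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data, not from the slides

mu = population_mean(scores)         # 5.0
variance = population_variance(scores)  # 4.0
sigma = math.sqrt(variance)          # 2.0

print(mu, variance, sigma)
```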

The µ corresponds to the exact point on the number line where the central peak of the frequency distribution curve sits, and σ is the distance along the number line from µ to the point on either side where the curve stops getting steeper and starts to flatten out (its inflection point).

An advantage of describing your population in terms of how the data is distributed on a number line using µ and σ is that any population can be represented by this exact same kind of a curved line; a line often called a normal curve.

An important property of these curves is that they are very easy to describe in terms of mathematical probabilities. For example, we know that 50% of all the heights (data points) in the population are greater than the center point (µ = 5’ 6.25”) which means there is a 0.50 probability that a randomly selected individual is taller than 5’ 6.25”. We also know that 68.26% of all the data points are between the two ±1 σ limits (4’ 1.75” to 6’ 10.75”) which means there is a 0.6826 probability that a randomly selected individual will be between 4’ 1.75” tall and 6’ 10.75” tall.

This graph simply illustrates more “percentages of the data distributed along the number line” in different sections of the curve; based on how far along the number line you go in σ units. Again, using percent as probabilities, there is a 0.3413 probability that a randomly selected individual would be between the mean and one standard deviation above the mean, or to put it a different way, we would be 34.13% confident that a randomly selected individual would be somewhere between the mean and +1 sd, or 2.28% confident that a randomly selected individual would be more than 2 sd above the mean . . .

Note that the z-score number corresponds to the sd unit.

Now . . . from this curve you notice that standard deviation units and z-score units are the same thing.

In between the +1 and -1 units are found 68.26% of all the scores in the frequency distribution.

In between the +2 and -2 units are found 95.44% of all the scores in the frequency distribution.

To make things easier, tables of z-scores and the % of scores in between the z-score limits are available in most statistics textbooks . . . A few of those values are reproduced here:

Z-Score    %         Z-Score    %
1.00       68.26     2.5        98.76
1.5        86.64     2.57       99.00
1.65       90.00     3.0        99.74
1.96       95.00*    3.27       99.90
2.00       95.44     3.3+       ~100

* Traditional level for “statistical significance”
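If no textbook table is handy, the percentages can be calculated from the normal curve itself. The sketch below uses the standard relationship between the normal distribution and the error function (`math.erf` in Python's standard library); the function name `percent_within` is just an illustrative choice. Printed tables are rounded, so the computed values may differ from the table in the second decimal place.

```python
import math

def percent_within(z):
    """Percent of a normal distribution lying between -z and +z sd units."""
    return 100 * math.erf(z / math.sqrt(2))

for z in (1.00, 1.5, 1.65, 1.96, 2.00, 2.5, 2.57, 3.0, 3.27):
    print(f"z = {z:<5} {percent_within(z):6.2f}%")
```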

Now . . . to figure out where the confidence limits actually come from in all those epidemiology papers . . .

The “baby” data illustrates this fairly well . . .

                    Unmarried    Married    Total
Sample 1  Births    35           65         100
Sample 2  Births    29           71         100
Sample 3  Births    33           67         100
Sample 4  Births    41           59         100

If we randomly sampled 100 live births from all of the 4,111,059 live births in the USA in 1991 we might find that 35 births were associated with unmarried mothers. This would give a sample probability (statistic) of 35 unwed mothers / 100 live births = 0.35 - an estimate of the population probability (parameter) that a birth is associated with an unmarried mother.

The sample probability (statistic) is not the correct probability for the entire population, just the correct probability for the sample.

If we took 3 more (different) random samples from the same population, each of 100 live births, we would probably find a different probability that the birth is associated with unwed mothers for each sample that was randomly selected; we might get 29 / 100 = 0.29; 33 / 100 = 0.33; 41 / 100 = 0.41; and so on . . . and we would never be 100% certain (confident) that any one sample probability would exactly represent the population parameter.

We need some way to deal with this uncertainty so we construct confidence limits or a confidence interval.

Marital status of samples of new mothers in the USA (1991)

                    Unmarried    Married    Total
Sample 1  Births    35           65         100
Sample 2  Births    41           59         100
Sample 3  Births    33           67         100
Sample 4  Births    29           71         100
…

If we could keep sampling samples (of n = 100) and calculating probabilities forever we would end up with an infinite number of sample probabilities. Sample probabilities close to the true population probability would appear numerous times while those far away would appear less frequently; the most frequently occurring sample probability (from the infinite number of samples) would correspond to the population probability while the least frequent probabilities would correspond to the extreme values (again, from the infinite number of samples).
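We can't sample forever, but a computer can get close enough to show the idea. The sketch below is a simulation under stated assumptions, not part of the original slides: it draws 10,000 samples of n = 100 (a stand-in for "infinite"), treats each birth as an independent draw with the true population probability, and uses an arbitrary random seed so the run is repeatable.

```python
import random
import statistics

random.seed(1991)  # arbitrary seed, only so the run is repeatable

P_TRUE = 1_213_854 / 4_111_059  # population probability of an unmarried mother
N = 100                         # births per sample
N_SAMPLES = 10_000              # finite stand-in for "sampling forever"

def sample_proportion(p, n):
    # Each simulated birth is "unmarried" with probability p.
    return sum(random.random() < p for _ in range(n)) / n

props = [sample_proportion(P_TRUE, N) for _ in range(N_SAMPLES)]

mean_of_props = statistics.mean(props)   # centre of the sampling distribution
sd_of_props = statistics.pstdev(props)   # spread of the sampling distribution

print(f"true population probability:  {P_TRUE:.3f}")
print(f"mean of sample probabilities: {mean_of_props:.3f}")
print(f"sd of sample probabilities:   {sd_of_props:.3f}")
```

The sample probabilities pile up around the population value, and their spread is what the confidence interval formula estimates from a single sample.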

This infinite number of theoretical sample probabilities would obviously fit into some kind of frequency distribution curve that is normally distributed. From this theoretical Normal Distribution we can construct a confidence interval using standard percentile scores (actually the same sd units called z-scores illustrated in previous slides) which will then be related to just how confident we want to be; 95% confident? 90% confident? 99% confident? Just plug in the sample values you are interested in, and the appropriate z-score value that corresponds to your chosen %-confidence level, into the formula and voila: Confidence Intervals

This is another figure of that same normal curve with z-scores and percentages; the actual z-scores that correspond to 95% and 90% of the data have been added … Just imagine that this curve illustrates the distribution of an infinite number of probabilities calculated from the infinite number of samples (n = 100) that were randomly selected from the same population.

We already have some idea where the middle of this “population curve” fits on a number line because we have the (ONE) sample estimate of that point; we are just not 100% confident that the sample statistic is exactly the same as the population parameter. What we need to know is the range of possible values that the actual population center-point might be within – so we calculate that range using the above theoretical curve …

Marital status of a sample of new mothers in the USA (1991)

          Unmarried    Married    Total    Probability
Births    35           65         100      0.35

Confidence Interval - 95% (use z-score of 1.96)

$$0.35 \pm \left(1.96 \sqrt{\frac{0.35 \times 0.65}{100}}\right) = 0.35 \pm \left(1.96 \sqrt{0.002275}\right) = 0.35\ (0.257,\ 0.443)$$

Confidence Interval - 90% (use z-score of 1.644)

$$0.35 \pm \left(1.644 \sqrt{0.002275}\right) = 0.35\ (0.272,\ 0.428)$$

*True population probability = 0.295 (1,213,854 / 4,111,059)
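The same arithmetic can be written as a small function. This is a minimal sketch of the normal-approximation interval used above (often called the Wald interval); the function name and arguments are illustrative choices, not from the slides.

```python
import math

def wald_interval(successes, n, z):
    """Normal-approximation confidence interval for a sample proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # sqrt(0.35 x 0.65 / 100) in the example
    return p - z * se, p + z * se

lo95, hi95 = wald_interval(35, 100, 1.96)
lo90, hi90 = wald_interval(35, 100, 1.644)

print(f"95% CI: ({lo95:.3f}, {hi95:.3f})")   # (0.257, 0.443)
print(f"90% CI: ({lo90:.3f}, {hi90:.3f})")   # (0.272, 0.428)
```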

The confidence interval is simply the range of values in a frequency distribution of values from all possible samples of the same size between which you might expect to find the true population value (parameter), i.e., the sample statistic predicts that the parameter is 0.35 but it is 90% probable the true parameter is somewhere between 0.272 and 0.428; and 95% probable the parameter is between 0.257 & 0.443.

These two graphs illustrate the previous calculations as well as the effect of sample size on the “accuracy” of using the sample statistics to predict the population parameter.

From the previous formula, the z-score values (1.96 or 1.644) describe the confidence limits between which we will look for our predicted population “value”.

The term $\sqrt{(0.35 \times 0.65) / 100}$ is the standard error of the sample probability (the square root of its estimated variance); note that the sample n is part of the equation.

The larger the n, the narrower the range (n=1000 = .285 - .305 vs. n=100 = .3 - .4) within which we predict the population value.
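To see the effect of n on its own, the sketch below holds the sample probability at the 0.35 of the worked example and only changes the sample size. The sample sizes are arbitrary illustrative choices, and the printed widths are for this example only; they are not the values read off the graphs.

```python
import math

def half_width(p, n, z=1.96):
    """Distance from the sample probability to either confidence limit."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    hw = half_width(0.35, n)
    print(f"n = {n:>6}: 0.35 ± {hw:.3f}  ({0.35 - hw:.3f}, {0.35 + hw:.3f})")
```

Because n sits under the square root, a sample 100 times bigger only narrows the interval by a factor of 10.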

With smaller sample sizes, or with highly variable data, or with p ~ 0 or 1, it is problematic to accurately predict population variance using the sample variance, so this next formula is actually used a lot more:

$$\frac{(2 \times 100 \times 0.35) + 1.96^2 \pm 1.96 \sqrt{1.96^2 + (4 \times 100 \times 0.35 \times 0.65)}}{2\,(100 + 1.96^2)} = 0.35\ (0.264,\ 0.447)$$

[ previous calculation = 0.35 (0.257, 0.443) ]

True population probability = 0.295
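This second formula is the one usually known as the Wilson score interval. The sketch below writes it out term by term as it appears above; the function and variable names are illustrative choices, not from the slides.

```python
import math

def wilson_interval(successes, n, z):
    """Wilson score interval: (2np + z^2 ± z*sqrt(z^2 + 4np(1-p))) / (2(n + z^2))."""
    p = successes / n
    centre = 2 * n * p + z ** 2                           # (2 x 100 x 0.35) + 1.96^2
    spread = z * math.sqrt(z ** 2 + 4 * n * p * (1 - p))  # 1.96 x sqrt(...)
    denom = 2 * (n + z ** 2)                              # 2 (100 + 1.96^2)
    return (centre - spread) / denom, (centre + spread) / denom

lo, hi = wilson_interval(35, 100, 1.96)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")   # (0.264, 0.447)
```

Notice that this interval is not centred exactly on 0.35; it is pulled slightly toward 0.5, which is part of why it behaves better for small n or for p close to 0 or 1.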

*You will notice that all epidemiology publications will give the confidence intervals associated with each variable measured.

**and since computers do all the work nowadays and they can calculate exact intervals based on the sampling distribution of P, based on the binomial distribution, we don’t have to bother with knowing any of these formulas, just have an idea about what the formulas are actually calculating …
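For the curious, here is roughly what such an "exact" calculation looks like. One common exact method is the Clopper-Pearson interval; whether any particular software package uses this method is an assumption here, and the code is only a sketch using Python's standard library (it searches for the limits by simple bisection and is meant for modest n, not production use). All function names are illustrative.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) when X follows a binomial distribution with n trials and probability p."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect(f, lo, hi, tol=1e-10):
    """Root of a monotone function f on [lo, hi], where f(lo) and f(hi) differ in sign."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid   # mid is on the same side of the root as lo
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(k, n, conf=0.95):
    """Clopper-Pearson interval for k successes in n trials."""
    alpha = 1 - conf
    # Lower limit: the p at which seeing k or more successes has probability alpha/2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which seeing k or fewer successes has probability alpha/2.
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

lower, upper = exact_interval(35, 100)
print(f"exact 95% CI: ({lower:.3f}, {upper:.3f})")
```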
