
Proportion Estimation: 1

Point and Confidence Interval Estimation of a Population Proportion, π


We are frequently interested in estimating the proportion of a population with a characteristic of interest, such as:

• the proportion of smokers

• the proportion of cancer patients who survive at least 5 years

• the proportion of new HIV patients who are female


• If we take a random sample from a population

• observe the number of subjects with the characteristic of interest (# of “successes”)

• we are observing a binomial random variable.

Now, however, we will focus on

• estimating the true proportion, π, in the population

• rather than focusing on the count.


Again, one way to deal with this type of data is to define a random variable X that can take two values:

X = 1, if characteristic is present – a “success”

X = 0, if characteristic is absent – a “failure”

Then

• if we sum all values in a population,

• we are summing zeros and ones –

• this will give a count of the number of individuals in the population with the characteristic:

$$\sum_{i=1}^{n} X_i = \#\text{ of successes}$$

The population mean is the proportion of individuals in the population with the characteristic:

$$\mu = \frac{1}{N}\sum_{s=1}^{N} x_s = \frac{\#\text{ successes}}{N} = \pi = \text{proportion of successes}$$

The sample proportion is then:

$$p = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{\#\text{ successes}}{n}$$

Therefore, p is the estimator of π, the proportion with a characteristic of interest.
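A minimal Python sketch of the 0/1 coding, using a small made-up sample (the ten values are hypothetical, not data from these notes). It shows that summing the coded values counts the successes and that their mean is the sample proportion.

```python
# Hypothetical sample: 1 = characteristic present ("success"), 0 = absent ("failure").
sample = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

n = len(sample)
successes = sum(sample)   # summing zeros and ones counts the successes
p = successes / n         # sample proportion = sample mean of the 0/1 values

print(successes, p)
```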


By the Central Limit Theorem, we know, for n large,

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

even when X is not normally distributed.

When X is a 0,1 variable, for n large we know from the central limit theorem:

$$\bar{X} = P \sim N(\pi, \sigma^2/n) \quad \text{(approximately)}$$


What is the variance, σ², for a 0,1 variable?

We know

$$\sigma^2 = \frac{1}{N}\sum_{s=1}^{N}(x_s-\mu)^2 = \frac{1}{N}\sum_{s=1}^{N}(x_s-\pi)^2$$

By use of algebra, and the fact that $x_s^2 = x_s$ for a 0,1 variable, we can show that

$$\sigma^2 = \pi(1-\pi)$$


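For those who want the algebra:

$$\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{s=1}^{N}(x_s-\pi)^2 = \frac{1}{N}\sum_{s=1}^{N}\left(x_s^2 - 2\pi x_s + \pi^2\right) && \text{(expand)} \\
&= \frac{1}{N}\left[\sum_{s=1}^{N} x_s - 2\pi\sum_{s=1}^{N} x_s + \sum_{s=1}^{N}\pi^2\right] && (x^2 = x \text{ for 0,1}) \\
&= \frac{1}{N}\sum_{s=1}^{N} x_s - 2\pi\,\frac{1}{N}\sum_{s=1}^{N} x_s + \frac{N\pi^2}{N} && \text{(sum over constant)} \\
&= \pi - 2\pi^2 + \pi^2 = \pi(1-\pi)
\end{aligned}$$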
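A quick numerical check of the identity σ² = π(1 − π), sketched in Python on a made-up population of 20 zeros and ones (the population itself is hypothetical). The variance computed from the definition should agree with the shortcut.

```python
# Hypothetical population of N = 20 zeros and ones (7 successes).
population = [1] * 7 + [0] * 13

N = len(population)
pi = sum(population) / N                              # population mean = proportion
sigma2 = sum((x - pi) ** 2 for x in population) / N   # variance from the definition
shortcut = pi * (1 - pi)                              # the 0/1 shortcut

print(pi, sigma2, shortcut)
```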


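Hence,

$$\sigma^2 = \pi(1-\pi)$$

$$\sigma_P^2 = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} = \frac{\pi(1-\pi)}{n}$$

The standard error of the sample proportion, P, is

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$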
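One way to see the standard error at work is a small simulation. The sketch below assumes hypothetical values π = .3 and n = 200, draws many samples, and compares the spread of the resulting sample proportions with the formula sqrt(π(1 − π)/n). The seed and the number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)                    # fixed seed so the run is repeatable
pi, n, reps = 0.3, 200, 5000      # hypothetical values chosen for illustration

# Draw `reps` samples of size n and record the sample proportion of each.
props = [sum(random.random() < pi for _ in range(n)) / n for _ in range(reps)]

theoretical_se = math.sqrt(pi * (1 - pi) / n)
empirical_sd = statistics.pstdev(props)   # spread of the simulated proportions

print(round(theoretical_se, 4), round(empirical_sd, 4))
```

The two numbers should be close but not identical, since the second is itself an estimate.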


We also know, by the central limit theorem, that for large n, P is approximately normally distributed:

$$P \sim N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$

For estimation of the population proportion, π:

Point Estimate:

$$P = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Confidence Interval Estimate:

$$P \pm z_{1-\alpha/2}\,(\sigma_P)$$


Example: Suppose that a sample of 1000 voters is taken to determine presidential preference.

In this sample, 585 persons indicated that they would vote for candidate A.

Construct a 95% confidence interval estimate for the true proportion, π, in the population planning to vote for candidate A.

The confidence interval for π takes the form:

$$\bar{X} \pm z_{1-\alpha/2}\,(se) \;\Rightarrow\; P \pm z_{1-\alpha/2}\,(\sigma_P)$$


1. The point estimate of the proportion is: p = (585/1000) = .585

2. The 95% confidence interval estimator of π is

$$P \pm z_{1-\alpha/2}\,(\sigma_P) = P \pm z_{.975}\sqrt{\frac{\pi(1-\pi)}{n}}$$

However, we don’t know π, so we will use p in its place to estimate the standard error:

$$p \pm z_{.975}\sqrt{\frac{p(1-p)}{n}} = .585 \pm (1.96)\sqrt{\frac{.585(.415)}{1000}}$$

$$= .585 \pm (1.96)(.0156)$$

$$= .585 \pm (.0305)$$


The 95% CI on the proportion preferring Candidate A is (.554, .616).

This does not include the value .50:

Either π really is .5 and we obtained an unusually large sample proportion (so that the interval estimate did not cover .5), or the population proportion is not .5, which suggests that candidate A will win the election.
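The voter calculation can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above, with z fixed at 1.96 as in the example. The variable names are illustrative.

```python
import math

x, n = 585, 1000      # votes for candidate A and sample size, from the example
z = 1.96              # z_{.975}, as used above

p = x / n
se = math.sqrt(p * (1 - p) / n)        # estimated standard error, p in place of pi
lower, upper = p - z * se, p + z * se

print(round(se, 4), round(lower, 3), round(upper, 3))
print(lower <= 0.5 <= upper)           # is .50 inside the interval?
```

The last line prints False, matching the observation that the interval does not include .50.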


When is the sample large enough to use the normal approximation to the binomial?

When (n)(π) ≥ 5 and (n)(1-π) ≥ 5.

That is,

when both the expected number of successes and the expected number of failures are at least 5.
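A small helper makes the rule of thumb easy to apply. The function name and the default cut-off of "at least 5" are assumptions of this sketch. Some texts use a stricter threshold.

```python
def normal_approx_ok(n, pi, threshold=5):
    """Return True when both expected counts reach the threshold."""
    expected_successes = n * pi
    expected_failures = n * (1 - pi)
    return expected_successes >= threshold and expected_failures >= threshold

print(normal_approx_ok(1000, 0.585))   # large poll
print(normal_approx_ok(20, 0.10))      # only 2 expected successes
```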


Aside: Improving the Normal Approximation for a Binomial

• The Binomial distribution is discrete, while the normal distribution is continuous. When the true proportion, π, is known, we can match the binomial distribution better to a normal distribution by including a correction. The correction is called the ‘continuity correction’.

• For example, when π = .5 and n = 10, to approximate

$$P\left(\frac{x_1}{n} \le P \le \frac{x_2}{n}\right),$$

we use instead the normal approximation for the probability

$$P\left(\frac{x_1 - 1/2}{n} \le P \le \frac{x_2 + 1/2}{n}\right)$$


Example of ‘Continuity Correction’ to the Normal Approximation to the Binomial.

Suppose π = .5 and n = 16. Compare the exact, normal approximation, and continuity corrected values of P(.4375 ≤ P ≤ .5).

• From Binomial Table:

$$P(.4375 \le P \le .5) = P\left(\tfrac{7}{16} \le P \le \tfrac{8}{16}\right) = .1746 + .1964 = 0.371$$

• Using Normal Approximation, no correction:

$$P\left(\frac{.4375 - .5}{0.125} \le z \le \frac{.5 - .5}{0.125}\right) = P\left(\frac{-.0625}{.125} \le z \le 0\right) = P(-.5 \le z \le 0) = .1915$$

• Using Correction:

$$P\left(\tfrac{6.5}{16} \le P \le \tfrac{8.5}{16}\right) = P(.40625 \le P \le .53125) = P(-0.75 \le z \le 0.25) = .5987 - .2266 = .3721$$
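The three values in this example can be checked with the Python standard library. The sketch below builds the normal CDF from math.erf (the helper name phi is an assumption of this sketch) and computes the exact binomial probability, the uncorrected approximation, and the continuity-corrected approximation.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, pi = 16, 0.5
x1, x2 = 7, 8                            # P(.4375 <= P <= .5) is P(7 <= X <= 8)
se = math.sqrt(pi * (1 - pi) / n)        # = 0.125

exact = sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(x1, x2 + 1))
plain = phi((x2 / n - pi) / se) - phi((x1 / n - pi) / se)
corrected = phi(((x2 + 0.5) / n - pi) / se) - phi(((x1 - 0.5) / n - pi) / se)

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The corrected value lands much closer to the exact probability than the uncorrected one.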


Using P in place of π to estimate the standard error σ_P:

1. If (n)(π) ≥ 5 and (n)(1-π) ≥ 5, use P:

$$se(P) = \sqrt{\frac{P(1-P)}{n}}$$

2. Otherwise, a) assume π = .5:

$$se(P) = \sqrt{\frac{.5(1-.5)}{n}}$$

or b) use an ‘exact’ method for the CI.

We do this to avoid underestimating the variance, π(1–π), which is at a maximum when π = .5.

Don’t use Student’s t with proportions, since the assumption of normality of the underlying population elements is not satisfied by a 0,1 variable.
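A sketch of this two-case rule as a Python function. Since π is unknown in practice, the sketch applies the size check to p instead, which is an assumption of this example, as are the function name and the "at least 5" threshold.

```python
import math

def se_for_ci(x, n, threshold=5):
    """Estimated standard error of P; falls back to pi = .5 for small counts."""
    p = x / n
    # The rule is stated for pi, which is unknown, so p stands in for it (assumption).
    if n * p >= threshold and n * (1 - p) >= threshold:
        return math.sqrt(p * (1 - p) / n)
    return math.sqrt(0.5 * (1 - 0.5) / n)   # conservative: variance is largest at .5

print(round(se_for_ci(585, 1000), 4))   # large sample: uses p
print(round(se_for_ci(2, 30), 4))       # only 2 successes: uses .5
```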


What do we use when the normal approximation is not appropriate?

Exact Binomial Confidence Intervals for π can be computed:

Solve for x in the following and then substitute into p = x/n:

Lower Limit:

$$\Pr(X \le x \mid p) = \sum_{k=0}^{x} {}_{n}C_{k}\; p^{k}(1-p)^{n-k}$$

Upper Limit:

$$\Pr(X \ge x \mid p) = \sum_{k=x}^{n} {}_{n}C_{k}\; p^{k}(1-p)^{n-k}$$

Clearly, the exact binomial CI is not simple to compute.
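For readers without statistical software, one common "exact" construction is the Clopper-Pearson interval. It is swapped in here as an assumption about what "exact" means and may differ in detail from the recipe above. It searches for the value of p at which each binomial tail probability equals α/2. The sketch below uses only the standard library, and all function names are illustrative.

```python
import math

def binom_cdf(x, n, p):
    """Pr(X <= x | p), with each term built on the log scale to avoid overflow."""
    if x < 0:
        return 0.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if x >= n else 0.0
    total = 0.0
    for k in range(0, x + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return min(total, 1.0)

def bisect(f, lo, hi, iters=60):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson limits: each tail probability is set equal to alpha/2."""
    # Lower limit: the p for which Pr(X >= x | p) = alpha/2.
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p for which Pr(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

print(exact_ci(585, 1000))
```

For the voter example (x = 585, n = 1000) the limits come out slightly wider than the normal-approximation interval of (.554, .616).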


Go to Minitab or other software

Stat > Basic Statistics > 1 Proportion

Leave blank for Binomial CI;

Check for Normal approx.

n

x


EXACT Binomial: Test and CI for One Proportion

Test of p = 0.5 vs p not = 0.5

Sample    X      N    Sample p    95.0% CI                Exact P-Value
1         585    1000    0.585    (0.553748, 0.615750)    0.000

Normal Approximation: Test and CI for One Proportion

Test of p = 0.5 vs p not = 0.5

Sample    X      N    Sample p    95.0% CI                Z-Value    P-Value
1         585    1000    0.585    (0.554461, 0.615539)    5.38       0.000
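The Z-Value in the normal-approximation output can be reproduced by hand. The sketch below assumes the software standardizes with the standard error computed under the tested value π = .5, which matches the reported 5.38. The two-sided P-value is taken from the normal tail area.

```python
import math

x, n, pi0 = 585, 1000, 0.5
p = x / n

se0 = math.sqrt(pi0 * (1 - pi0) / n)         # standard error under the tested value
z = (p - pi0) / se0
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area

print(round(z, 2), p_value)
```

The P-value is far below .0005, which is why the output shows it as 0.000.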


Sample Size Estimation when the goal is Estimating a Population Proportion, π

The pattern is the same as when goal is estimation of a mean:

If we know

• the desired precision (width of interval)

• confidence level

• “guess” of the proportion to get std error

we can estimate the sample size, n.


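The width of a confidence interval for π is:

$$w = 2\left[z_{1-\alpha/2}\,(\sigma_P)\right],$$

where σ_P is the standard error of P.

The interval runs from P – z_{1-α/2}(σ_P) to P + z_{1-α/2}(σ_P), centered at P, and w is the distance between the two endpoints.

Using

$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$

we have

$$w = 2\,(z_{1-\alpha/2})\sqrt{\frac{\pi(1-\pi)}{n}}$$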


Solving for n gives us

$$n = \frac{4\,(z_{1-\alpha/2})^2\,\pi(1-\pi)}{w^2}$$

Note:

• this requires information about π, which is our goal!

• However, π(1–π) is at a maximum when π = .5

• To be conservative

• (over- rather than under-estimate sample size)

• use (.5) in place of π
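For those who want the intermediate steps, here is a short worked derivation. It only rearranges the width formula and completes the square, so nothing beyond the symbols already defined is assumed.

$$\begin{aligned}
w = 2\,z_{1-\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}
\;&\Rightarrow\; \frac{w^2}{4\,(z_{1-\alpha/2})^2} = \frac{\pi(1-\pi)}{n}
\;\Rightarrow\; n = \frac{4\,(z_{1-\alpha/2})^2\,\pi(1-\pi)}{w^2} \\
\pi(1-\pi) \;&=\; \frac{1}{4} - \left(\pi - \frac{1}{2}\right)^2 \;\le\; \frac{1}{4},
\quad \text{with equality only at } \pi = .5
\end{aligned}$$

The second line is why .5 gives the largest, and therefore most conservative, sample size.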


Substituting in .5 for π gives a conservative sample size estimator of:

$$n = \frac{4\,(z_{1-\alpha/2})^2\,\pi(1-\pi)}{w^2} = \frac{4\,(z_{1-\alpha/2})^2\,(.5)(.5)}{w^2}$$

$$n = \frac{(z_{1-\alpha/2})^2}{w^2}$$


Example:

For an election poll, how many voters should be surveyed to estimate the proportion, to within 5%, in favor of re-electing the current mayor, with 95% confidence?

1. We have a confidence level, 1 – α = .95, so z.975 = 1.96

2. We have a desired precision of ±5% = ±.05, so w = .10

Conservative: n = (z_{1-α/2})²/w² = (1.96)²/(.10)² = 384.16

We should poll 385 voters to achieve a 95% CI of ±5%


What if we have some information on π?

A previous poll tells us that the current office-holder had ~ 75% of the voter support.

Assuming π = .75:

n = 4π(1–π)(z_{1-α/2})²/w²

= 4(.75)(.25)(1.96)²/(.10)² = 288.12

Using available information

• we get a sample size estimate of 289 voters

• which can save us considerable time and expense, compared to the more conservative estimate.
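Both sample size calculations can be wrapped in one small Python function. This is a sketch of the formula n = 4z²π(1 − π)/w². The function name and the choice to return both the raw value and the rounded-up count are assumptions of this example.

```python
import math

def sample_size(w, z=1.96, pi=0.5):
    """Sample size for a CI of total width w; pi = .5 gives the conservative value."""
    n = 4 * z**2 * pi * (1 - pi) / w**2
    return n, math.ceil(n)      # raw value and the rounded-up number to sample

print(sample_size(0.10))             # conservative, pi = .5
print(sample_size(0.10, pi=0.75))    # using the earlier poll's 75%
```

Rounding up rather than to the nearest integer keeps the interval at or below the target width.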


Confidence Interval Calculation for the Difference between Two Proportions, π1 – π2

Two independent groups

We are often interested in comparing proportions from 2 populations:

• Is the incidence of disease A the same in two populations?

• Patients are treated with either drug D, or with placebo. Is the proportion “improved” the same in both groups?


Suppose we take independent, random samples from two groups, and estimate a proportion in each.

For large enough sample size, we know:

$$P_1 \sim N\left(\pi_1, \frac{\pi_1(1-\pi_1)}{n_1}\right)$$

$$P_2 \sim N\left(\pi_2, \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

Then the standard error of the difference between the sample proportions is the square root of the sum of the variances:

$$\sigma_{P_1-P_2} = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$$
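The reason the variances add, even though the proportions are subtracted, is the independence of the two samples. As a brief worked step, using only the standard rule for the variance of a difference of independent random variables:

$$\operatorname{Var}(P_1 - P_2) = \operatorname{Var}(P_1) + (-1)^2\operatorname{Var}(P_2) = \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}$$

If the samples were not independent, a covariance term would also appear, and this formula would not apply.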


Or, since we don’t know the true proportions, the sample estimate of the standard error:

$$se(P_1-P_2) = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$

Thus, for n large, the (1–α) confidence interval estimator is:

$$(P_1-P_2) \pm z_{1-\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$


Example:

In a clinical trial for a new drug to treat hypertension, 50 patients were randomly assigned to receive the new drug, and 50 patients to receive a placebo.

34 of the patients receiving the drug showed improvement, while 15 of those receiving placebo showed improvement.

Compute a 95% confidence interval estimate for the difference between proportions improved.


1. Point Estimate of (π1 – π2):

p1 = 34/50 = .68

p2 = 15/50 = .30

(p1 – p2) = .68 – .30 = .38

2. Since we have n1 = n2 = 50, our sample size is large enough to use the sample estimate of standard error:

$$se(p_1-p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} = \sqrt{\frac{.68(.32)}{50} + \frac{.30(.70)}{50}} = .0925$$

3. Confidence coefficient:

For 1 – α = .95, z_{1-α/2} = z.975 = 1.96

4. Confidence Interval Estimate:

$$(P_1-P_2) \pm z_{1-\alpha/2}\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}} = (.38) \pm (1.96)(.0925) = .38 \pm .181$$

The 95% CI estimate is:

(.199 , .561) or (19.9% , 56.1%)

The difference between proportions improved is bounded away from zero – it seems that the proportion improved by the drug is clearly greater than the proportion improved by placebo.


Using Minitab: Stat > Basic Statistics > 2 Proportions

Enter sample sizes n1 and n2

Enter # of successes x1 and x2


Test and Confidence Interval for Two Proportions

Sample    X     N     Sample p
1         34    50    0.680000
2         15    50    0.300000

Estimate for p(1) - p(2): 0.38

95% CI for p(1) - p(2): (0.198748, 0.561252)
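The hypertension example can be reproduced without Minitab. The sketch below follows the unpooled standard error formula and uses z = 1.96. Variable names are illustrative.

```python
import math

x1, n1 = 34, 50     # improved / total, new drug
x2, n2 = 15, 50     # improved / total, placebo
z = 1.96            # z_{.975}

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # unpooled standard error
lower, upper = diff - z * se, diff + z * se

print(round(diff, 2), round(se, 4), round(lower, 3), round(upper, 3))
```

The limits agree with the software output to three decimals. Tiny differences further out come from using 1.96 rather than a more precise z value.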


The same cautions apply here, as for estimates for a single proportion

• the sample size should be large enough in each group, so that the normal approximation will hold:

• nπ ≥ 5 and n(1-π) ≥ 5 for each sample

• Otherwise: a) use .5 in place of π when estimating the variance for the confidence interval, or b) use some other method.

• Minitab offers the option to compute a pooled estimate of the standard error
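For reference, the usual textbook form of a pooled standard error combines both samples into one proportion. The sketch below implements that form. Whether it matches Minitab's option exactly is an assumption, and the function name is illustrative.

```python
import math

def pooled_se(x1, n1, x2, n2):
    """Standard error of P1 - P2 using one pooled proportion for both groups."""
    p_pool = (x1 + x2) / (n1 + n2)     # overall proportion of successes
    return math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(round(pooled_se(34, 50, 15, 50), 4))
```

Pooling treats the two groups as sharing a single proportion, so it is normally reserved for testing equality rather than for the confidence interval on the difference.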


And in summary:

Confidence interval estimates provide

• a range of likely values

• an associated probability, or confidence level.

The width of the confidence interval depends upon:

• The underlying variability in the population

• The sample size

• The confidence level


It is important to keep track of assumptions that we must make about the data:

• Samples should be selected randomly

• selection of any element is independent of selection of any others

• For many cases, we must assume that the underlying population follows a normal distribution

• without this assumption, probabilities computed using the t-distribution, χ²-distribution, or F-distribution may not be correct.


When we speak of “knowing” the population variance, σ²,

• we really mean that we have an outside source of information

• previous research, census data, etc.

• the key is that we are not using the sample estimate, s², based upon the current sample.


The key to confidence interval estimation is to know

• what parameter you are estimating

• the point estimate of the parameter

• the confidence level

• what distributional assumptions are required

• the associated distribution for computing probabilities.

I have started a summary table for you below – completing this table will be a good review exercise.


Distribution of data              Parameter to Estimate    Point Estimate    (1 – α) Confidence Interval Estimate

N(μ, σ²)                          μ                        X̄                 σ² known:    X̄ ± z_{1-α/2} σ/√n
                                                                             σ² unknown:  X̄ ± t_{n-1; 1-α/2} S/√n

Any, n large                      μ                        X̄                 For n large:
                                                                             σ² known:    X̄ ± z_{1-α/2} σ/√n
                                                                                          X̄ ± z_{1-α/2} S/√n

Bin(n, π)                         π                        P

N(μ, σ²)                          σ²                       S²

N(μ1, σ1²), N(μ2, σ2²)            μ1 – μ2

Bin(n1, π1), Bin(n2, π2)          π1 – π2

N(μ1, σ1²), N(μ2, σ2²)
