STAT171 Statistical Data Analysis (2015) 1 Topic 9 Inference regarding proportions (one and two populations)

Upload: nigerianhacks

Post on 09-Nov-2015


  • STAT171 Statistical Data Analysis (2015)

    Topic 9

    Inference regarding proportions (one and two populations)

  • 1. Testing a hypothesis about pi. (J & B 8.5)

    2. Confidence interval for pi. (J & B 8.5)

    3. Testing a hypothesis about two proportions, pi1 and pi2. (J & B 10.7)

    4. Confidence interval for pi1 - pi2. (J & B 10.7)

    Reading: J & B Chapter 8 section 5 (one proportion); Chapter 10 section 7 (two proportions)

  • Notation

    Text & Lecture notes:

    n = sample size (number of independent Bernoulli trials)

    X = count of the number of successes

    Lecture notes:

    pi = population proportion (a constant)

    P = the sample proportion, P = X / n

    Text book:

    $p$ = population proportion (a constant)

    $\hat{p}$ = the sample proportion, $\hat{p} = X / n$

    Care is needed due to the different notation!

  • Testing a single proportion

    Example: In past years, 15% of people who insured their car made a claim each year. This year, of a random sample of 400 policies, 76 made a claim. Is there any evidence that the proportion has changed?

    Setting up the problem:

    X = the number in the sample who made a claim this year

    $X \sim B(n, \pi)$, with $x = 0, 1, \ldots, n$

    P = the sample proportion who made a claim this year

    $P = X / n$, with $p = \frac{0}{n}, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$

    We have to assume the policyholders are independent.

  • Distribution results for X

    We have: X = number of claims made in the sample this year

    $X \sim B(400, \pi)$, with n = 400 and pi = 0.15 IF the claim rate is unchanged

    For $X \sim \text{Bin}(n, \pi)$:

    $$E(X) = n\pi, \qquad \text{Var}(X) = n\pi(1-\pi)$$

    For n "sufficiently large" (CLT applies: both $n\pi > 15$ and $n(1-\pi) > 15$):

    $$X \overset{\text{approx}}{\sim} N\big(n\pi,\ n\pi(1-\pi)\big) \quad \text{or} \quad \frac{X - n\pi}{\sqrt{n\pi(1-\pi)}} \overset{\text{approx}}{\sim} N(0, 1)$$

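  • Distribution results for P

    The sample proportion of claims, P, has a scaled binomial distribution:

    $P \sim (1/n) \cdot B(400, \pi)$, with pi = 0.15 IF the claim rate is unchanged

    For $P = X/n$:

    $$E(P) = E\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{n\pi}{n} = \pi$$

    $$\text{Var}(P) = \text{Var}\left(\frac{X}{n}\right) = \frac{\text{Var}(X)}{n^2} = \frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n}$$

    For n "sufficiently large" (CLT applies: both $n\pi > 15$ and $n(1-\pi) > 15$):

    $$P \overset{\text{approx}}{\sim} N\left(\pi,\ \frac{\pi(1-\pi)}{n}\right) \quad \text{or} \quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$$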
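    The mean and variance results for P can be checked numerically. The sketch below is not part of the original slides: it sums over the exact binomial distribution for the insurance example (n = 400, pi = 0.15), and the helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 400, 0.15  # values from the insurance example

# Exact mean and variance of P = X/n, summed over the support 0/n, 1/n, ..., n/n
mean_p = sum((x / n) * binomial_pmf(x, n, pi) for x in range(n + 1))
var_p = sum((x / n - mean_p) ** 2 * binomial_pmf(x, n, pi) for x in range(n + 1))

print(mean_p, pi)                # the two values should agree (E(P) = pi)
print(var_p, pi * (1 - pi) / n)  # the two values should agree (Var(P) = pi(1-pi)/n)
```

    Only the mean and variance are verified this way; the normality of P remains an approximation that relies on the CLT check.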

  • Example (cont)

    Here, we have observed the sample result p = 76/400 = 0.19.

    We want to test: given that p is 0.19, do we have evidence that pi is no longer 0.15? (i.e. has the claim rate changed from 0.15?)

    Under the assumption that pi is 0.15, we want the probability of getting a sample proportion at least as far away from 0.15 as 0.19 is (that is, p ≤ 0.11 or p ≥ 0.19).

    This is the same as getting a sample count X ≥ 76 or X ≤ 44, since 0.15*400 = 60:

    76 - 60 = 16 (we observed 16 more than we expected)

    and 60 - 16 = 44 (the same distance away in the other direction)

  • We can obtain this probability in two ways:

    (1) Using the exact binomial:

    $$\text{Prob} = \sum_{x=0}^{44} \binom{400}{x} 0.15^{x}\, 0.85^{400-x} + \sum_{x=76}^{400} \binom{400}{x} 0.15^{x}\, 0.85^{400-x}$$

    (2) Using the normal approximation to the binomial:

    $$\text{Prob}(P \le 0.11) + \text{Prob}(P \ge 0.19), \quad \text{where } \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$$
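    Both routes can be evaluated directly. The following is a minimal sketch, not taken from the slides, that computes the two-tailed probability for the insurance example by (1) summing exact binomial probabilities and (2) using the normal approximation for P without a continuity correction. The variable and function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi0 = 400, 0.15  # sample size and the claim rate assumed unchanged

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# (1) Exact binomial: P(X <= 44) + P(X >= 76) when pi = 0.15
exact = (sum(binomial_pmf(x, n, pi0) for x in range(0, 45))
         + sum(binomial_pmf(x, n, pi0) for x in range(76, n + 1)))

# (2) Normal approximation: P(P <= 0.11) + P(P >= 0.19), no continuity correction
se0 = sqrt(pi0 * (1 - pi0) / n)
approx_p = NormalDist(mu=pi0, sigma=se0)
normal = approx_p.cdf(0.11) + (1 - approx_p.cdf(0.19))

print(f"exact binomial: {exact:.4f}")
print(f"normal approx:  {normal:.4f}")
```

    Note that statistical packages may define an "exact" two-sided p-value differently from this symmetric-distance sum, so the exact figure here need not match a package's output.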

  • For the general case

    Following the usual hypothesis-testing steps:

    H0: pi = pi0 (e.g. H0: pi = 0.15)

    H1: pi ≠ pi0 (e.g. H1: pi ≠ 0.15), α = 0.05

    CLT assumption check

    The text book states that to use the z-test for proportions, we need both n*pi0 ≥ 15 and n(1-pi0) ≥ 15.

    We will use this check (as it is the one in the quizzes).

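  • If H0 is true, set up the test statistic:

    if $P \overset{\text{approx}}{\sim} N\left(\pi_0,\ \frac{\pi_0(1-\pi_0)}{n}\right)$

    then $\frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{\text{approx}}{\sim} N(0, 1)$

    Note: the mean and variance are exact, the normality is approximate here.

    with observed value

    $$z_{obs} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$$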

  • Obtain the p-value

    This enables us to determine whether zobs is a believable or not-believable value from the Z distribution.

    For HA: pi ≠ pi0, p-value ≈ P( |Z| ≥ |zobs| )

    Make the decision:

    p-value ≤ α: Reject H0

    p-value > α: Retain H0

    Write a meaningful conclusion.
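    The steps above (CLT check, test statistic, p-value, decision) can be sketched in a few lines. This is an illustrative sketch rather than the course's own code: the function name `one_proportion_z_test` is an assumption, and no continuity correction is applied, which matches the text book and quiz convention.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(x, n, pi0):
    """Two-sided z-test of H0: pi = pi0 (no continuity correction)."""
    # CLT check under H0, as stated in the notes
    if n * pi0 < 15 or n * (1 - pi0) < 15:
        raise ValueError("CLT check failed: need n*pi0 >= 15 and n*(1-pi0) >= 15")
    p = x / n
    z_obs = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    # Area in both tails of the standard normal beyond |z_obs|
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return z_obs, p_value

z_obs, p_value = one_proportion_z_test(76, 400, 0.15)
alpha = 0.05
print(f"z_obs = {z_obs:.2f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value <= alpha else "Retain H0")
```

    For the insurance example this gives the uncorrected values quoted in these notes (z of about 2.24 and a p-value of about 0.025).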

  • Continuity Correction

    We are approximating a discrete (binomial) distribution with a continuous (normal) distribution. Therefore, the continuity correction should be used.

    For a two-sided alternative, the corrected test statistic is:

    $$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

    See J&B p.254.

    The correction allows finding the area in both tails including the observed sample p.

    The larger n is, the less important it is to use the continuity correction.

    Note: The text book (J&B) does not use the c.c. in hypothesis testing for proportions (which leads to less accurate approximations to the p-value), and this is also the case for the quizzes.

  • One tailed tests

    The hypothesis test can be one or two-tailed.

    If one-tailed where H1: pi > pi0, the test statistic is:

    $$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

    p-value ≈ P(Z ≥ zobs)

    If one-tailed where H1: pi < pi0, the test statistic is:

    $$z_{obs} = \frac{p - \pi_0 + \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

    p-value ≈ P(Z ≤ zobs)

  • For the example

    H0: pi = 0.15

    H1: pi ≠ 0.15, α = 0.05

    Checking the assumption of approximate normality:

    n*pi0 = 400*0.15 = 60 ≥ 15

    and n*(1-pi0) = 400*0.85 = 340 ≥ 15, so it is reasonable to assume normality.

    The test statistic is

    $$Z = \frac{|P - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

  • With observed value

    $$z_{obs} = \frac{|0.19 - 0.15| - \frac{1}{800}}{\sqrt{\frac{0.15(1-0.15)}{400}}} = \frac{0.03875}{0.0178536} = 2.17$$

    p-val = P( |Z| ≥ 2.17 ) ≈ 0.030

    Reject H0 at the 5% level of significance. There is sufficient evidence to conclude the proportion of claimants is different from previous years. The sample proportion of insured claiming this year is significantly greater than 15%.

    In the above example, not using the continuity correction gives zobs = 2.24 with an associated p-value of 0.025. That is, omitting the c.c. gives a smaller p-value than when the c.c. is used, so the actual Type I error rate will be higher than specified by the significance level.
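    The effect of the continuity correction in this example can be reproduced with a short sketch. The function name `one_proportion_z_test_cc` is hypothetical, and the correction term is floored at zero (an added safeguard, not from the slides) so the statistic cannot go negative when p is within 1/(2n) of pi0.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test_cc(x, n, pi0):
    """Two-sided z-test of H0: pi = pi0 with the continuity correction 1/(2n)."""
    p = x / n
    se0 = sqrt(pi0 * (1 - pi0) / n)  # standard error evaluated under H0
    # Shrink the observed distance by 1/(2n); never let it drop below zero
    corrected = max(abs(p - pi0) - 1 / (2 * n), 0.0)
    z_obs = corrected / se0
    p_value = 2 * (1 - NormalDist().cdf(z_obs))
    return z_obs, p_value

z_cc, p_cc = one_proportion_z_test_cc(76, 400, 0.15)
print(f"with c.c.: z_obs = {z_cc:.2f}, p-value = {p_cc:.3f}")
```

    With the correction the statistic drops from about 2.24 to about 2.17, and the p-value rises from about 0.025 to about 0.030, as described above.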

  • Confidence interval for pi

    [Usually a confidence interval is of the form: statistic ± z_{α/2} * std. error(statistic)]

    Here, it should be:

    $$p \pm z_{\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}}$$

    But pi is unknown. We don't even have a hypothesised value!

    Ideally, to get the CI for pi, we have to solve a quadratic.

    So, we use p as our best estimate of pi. An approx confidence interval for pi is:

    $$p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$

    ... we use an approximation for the standard error of P instead of the exact standard error, but we still refer to the z-tables, not the t ... theory to be done next year.
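    For the curious, the "solve a quadratic" route can be sketched as follows. This goes beyond what the slides use: it solves $(p - \pi)^2 = z_{\alpha/2}^2\, \pi(1-\pi)/n$ for pi, an interval commonly known as the Wilson score interval. The function name `quadratic_ci` is an illustrative assumption.

```python
from math import sqrt
from statistics import NormalDist

def quadratic_ci(x, n, conf=0.95):
    """CI for pi from solving (p - pi)^2 = z^2 * pi * (1 - pi) / n for pi."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # Rearranged: (1 + z^2/n) pi^2 - (2p + z^2/n) pi + p^2 = 0
    a = 1 + z**2 / n
    b = -(2 * p + z**2 / n)
    c = p**2
    disc = sqrt(b**2 - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

lo_q, hi_q = quadratic_ci(76, 400)
print(f"quadratic-based 95% CI: ({lo_q:.4f}, {hi_q:.4f})")
```

    Because each endpoint uses the exact standard error at that value of pi, this interval contains exactly those pi0 values that the two-sided z-test without continuity correction would retain.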

  • Confidence interval CLT check:

    In the hypothesis test for pi, we used the normal approximation to the binomial, and had to check the validity of this under H0.

    We also need to check the validity of using the CLT for the confidence interval, but here we do not have a pi0.

    Instead, we check using the sample p:

    CLT check: we need np ≥ 15 and n(1-p) ≥ 15, where np = the sample number of successes and n(1-p) = the sample number of failures.

    Continuity correction:

    When doing confidence intervals for pi, we don't worry about the continuity correction. It is pointless trying to improve accuracy when the standard error is only approximated.

  • For the example

    CLT check: np = 76 ≥ 15 and n(1-p) = 324 ≥ 15

    We can validly use the normal approximation to the binomial here.

    95% confidence interval for pi:

    $$0.19 \pm 1.96 \sqrt{\frac{0.19(0.81)}{400}} = 0.19 \pm 1.96(0.01962) = 0.19 \pm 0.0385 = (0.1515,\ 0.2285)$$

    We are 95% confident that the interval (0.1515, 0.2285) includes the true population proportion pi of claimants this year.
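    The interval calculation can be sketched as below. The function name `proportion_ci` is hypothetical, and the code uses the unrounded normal quantile rather than the tabulated 1.96, so the last digit can differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, conf=0.95):
    """Approximate two-sided CI for pi: p +/- z * sqrt(p(1-p)/n)."""
    # CLT check uses the observed numbers of successes and failures
    if x < 15 or n - x < 15:
        raise ValueError("CLT check failed: need np >= 15 and n(1-p) >= 15")
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

ci_lo, ci_hi = proportion_ci(76, 400)
print(f"95% CI for pi: ({ci_lo:.4f}, {ci_hi:.4f})")
```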

  • Using the CI for pi for testing H0

    Even though this interval for pi does not contain 0.15 (the hypothesised proportion for this year), we cannot accurately use it to test the hypothesis H0: pi = 0.15 vs H1: pi ≠ 0.15.

    Why is this so?

    In evaluating the standard error of P: the hypothesis test uses pi0, but the C.I. uses the sample p.

    Under H0: pi = 0.15 we used:

    $$se(p) = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} = \sqrt{\frac{0.15(0.85)}{400}} = 0.01785$$

    For the C.I. calculations we used:

    $$se(p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.19(0.81)}{400}} = 0.019615$$
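    A short sketch makes the difference between the two standard errors concrete. The variable names are illustrative, and the second z value is shown only to illustrate what would happen if the C.I. standard error were (incorrectly) used in the test.

```python
from math import sqrt

n, pi0 = 400, 0.15
p = 76 / n

se_test = sqrt(pi0 * (1 - pi0) / n)  # standard error under H0 (used in the test)
se_ci = sqrt(p * (1 - p) / n)        # estimated standard error (used in the C.I.)

print(f"se under H0: {se_test:.5f}")
print(f"se for C.I.: {se_ci:.6f}")

# The same observed difference gives different z values under each se
print(f"z with H0 se:   {(p - pi0) / se_test:.3f}")
print(f"z with C.I. se: {(p - pi0) / se_ci:.3f}")
```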

  • However, in most cases, the difference between the two standard errors will be very small.

    Only if pi0 is close to the CI boundaries is there a problem with using the CI to perform the hypothesis test.

    Here, the 95% CI for pi was (0.1515, 0.2285)

    and we were testing H0: pi = 0.15, so it is a bit too close to call in this case (so we would have to do the hypothesis test).

  • Limits on CIs for pi

    A two-sided approx confidence interval for pi is:

    $$p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$

    However, pi must be in the interval (0,1) as it is a proportion.

    Ideally, to get the CI for pi, we have to solve a quadratic.

    The confidence interval CLT check, np ≥ 15 and n(1-p) ≥ 15, should guarantee the CI will not be outside the interval (0,1).

    The 3σ CLT check will guarantee the CI for pi is in the interval (0,1), as long as z_{α/2} < 3.

  • One sided CIs for pi

    For a one-sided CI for pi using the normal approximation, we cannot have a boundary of ±∞; we have boundaries of 0 or 1 for a proportion.

    The 100(1-α)% CI for pi:

    For a < alternative:

    $$\left(0,\ \ p + z_{\alpha} \sqrt{\frac{p(1-p)}{n}}\right)$$

    For a > alternative:

    $$\left(p - z_{\alpha} \sqrt{\frac{p(1-p)}{n}},\ \ 1\right)$$
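    The two one-sided intervals can be sketched as a single helper. The function name `one_sided_ci` and the `alternative` labels ("less", "greater") are illustrative assumptions, and the computed bound is clamped to [0, 1] as an added safeguard.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_ci(x, n, alternative, conf=0.95):
    """One-sided CI for pi; the open side is bounded by 0 or 1, never infinity."""
    p = x / n
    z = NormalDist().inv_cdf(conf)  # z_alpha, about 1.645 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    if alternative == "less":       # for a < alternative: (0, upper bound)
        return 0.0, min(1.0, p + margin)
    if alternative == "greater":    # for a > alternative: (lower bound, 1)
        return max(0.0, p - margin), 1.0
    raise ValueError("alternative must be 'less' or 'greater'")

lower_bound, upper_limit = one_sided_ci(76, 400, "greater")
print(f"95% one-sided CI for pi: ({lower_bound:.4f}, {upper_limit:.0f})")
```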

  • Using Minitab (16):

    Under Stat > Basic Stats > 1 Proportion

    Under Options, click on "use test based on normal distribution" to carry out the z test. (In MTB 17, there is a drop-down panel for this.)

    Otherwise, the p-value is calculated using exact binomial probabilities.

    Large n: normal approx quite accurate and quicker than many binomial calculations.

    Small n: normal approx not necessarily accurate, and a small number of binomial calculations is quite quick.

  • Resulting Minitab output

    Minitab does not use pi notation in the output; it uses p.

    Using the normal approximation:

    MTB > POne 400 76;
    SUBC> Test 0.15;
    SUBC> UseZ.

    Test and CI for One Proportion

    Test of p = 0.15 vs p not = 0.15

    X    N    Sample p   95% CI                 Z-Value   P-Value
    76   400  0.190      (0.151555, 0.228445)   2.24      0.025

    Using the normal approximation.

    Using exact binomial probabilities:

    MTB > POne 400 76;
    SUBC> Test 0.15.

    Test and CI for One Proportion

    Test of p = 0.15 vs p not = 0.15

    X    N    Sample p   95% CI                 Exact P-Value
    76   400  0.190      (0.152721, 0.231938)   0.036

  • Two sample test of proportions

    Used if we have two independent samples, where we measure the proportion of something in each.

    Example

    Children randomly selected from two different schools take the same test.

    The number who pass at each school is recorded.

    At School1, 40 out of 70 pass the test.

    At School2, 45 out of 100 pass the test.

    We want to know: is there any difference between the two schools in their overall pass rates?

    The (hypothetical) populations of interest are all students who may ever be in either of the schools.

  • Here we have two independent samples:

    School1: p1 = 40/70 ≈ 0.57, n1 = 70

    School2: p2 = 45/100 = 0.45, n2 = 100

    Based on these samples, we need to decide which scenario we believe:

    The proportions estimate the same pi (and the difference between p1 and p2 can be explained by random variation)

    or

    The proportions estimate two different population proportions pi1 and pi2 (and the difference between p1 and p2 is due to this systematic difference)

  • General case for two proportions

    Sample1: observe X1 successes from n1 observations, P1 = X1 / n1

    Sample2: observe X2 successes from n2 observations, P2 = X2 / n2

    We want to test: H0: pi1 = pi2 (= pi) vs H1: pi1 ≠ pi2 at sig level α

    If n1 and n2 are large enough to apply the CLT:

    $$P_1 \overset{\text{approx}}{\sim} N\left(\pi_1,\ \frac{\pi_1(1-\pi_1)}{n_1}\right), \qquad P_2 \overset{\text{approx}}{\sim} N\left(\pi_2,\ \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

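  • If the two samples are independent:

    $$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi_1 - \pi_2,\ \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

    If H0 is true, i.e. if pi1 = pi2 = pi:

    $$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi - \pi,\ \frac{\pi(1-\pi)}{n_1} + \frac{\pi(1-\pi)}{n_2}\right) = N\left(0,\ \pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$$

    Therefore the test statistic is:

    $$z_{obs} = \frac{p_1 - p_2}{\sqrt{\pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

    But pi is unknown!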

  • We cannot get the exact standard error of P1 - P2, as we need the (unknown) value of pi to substitute in. So we use the pooled sample proportion to estimate pi.

    Use $\hat{\pi} = \hat{p}$ = weighted average of p1 and p2

    = (number of successes in sample1 + number of successes in sample2) / total n

    So

    $$\hat{\pi} = \hat{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2}$$

    The sample proportions are weighted by the sample sizes.

  • This then gives an estimated standard error of P1 - P2, and we get the observed test statistic:

    $$z_{obs} = \frac{p_1 - p_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

    Obtain the p-value and then reject or retain H0 like any other z-test. As for any test, this can be one or two tailed.

    This IS a z-test (even though we have estimated the standard error of (P1 - P2) using the pooled p-hat) ... as we have used binomial distributional properties in this estimation.

  • CLT check for two proportions

    We need approximate normality for both sample proportions under H0, but we don't know the value of pi, so we use its estimate, the pooled sample proportion $\hat{p}$:

    Need: $n_1\hat{p} \ge 15$ and $n_1(1-\hat{p}) \ge 15$; $n_2\hat{p} \ge 15$ and $n_2(1-\hat{p}) \ge 15$

    These are the expected numbers of successes and failures in the two samples under H0, based on the pooled estimate.

    Continuity correction?

    There is no need for a continuity correction in two sample proportions tests, as you need to add and subtract a correction term (one for p1 and one for p2) and they will approximately cancel.

  • For the school example

    H0: pi1 = pi2

    H1: pi1 ≠ pi2, α = 0.05

    p1 = 40/70 ≈ 0.57 based on n1 = 70

    p2 = 45/100 = 0.45 based on n2 = 100

    Under H0, the pooled proportion is:

    $$\hat{p} = \hat{\pi} = \frac{40 + 45}{70 + 100} = \frac{85}{170} = \frac{1}{2} = 0.50$$

    Checking for approximate normality:

    $n_1\hat{p} = 35 \ge 15$ and $n_1(1-\hat{p}) = 35 \ge 15$; $n_2\hat{p} = 50 \ge 15$ and $n_2(1-\hat{p}) = 50 \ge 15$

    CLT applies.

  • The observed test statistic is:

    $$z_{obs} = \frac{\frac{40}{70} - \frac{45}{100}}{\sqrt{0.5(0.5)\left(\frac{1}{70} + \frac{1}{100}\right)}} = \frac{17/140}{0.07792} = 1.558$$

    p-value ≈ P( |Z| ≥ 1.55 ) ≈ 2*0.061 ≈ 0.121

    p-value > 0.05, so retain H0.

    There is insufficient evidence, at the 5% significance level, to be able to conclude there is a difference in the pass rate between the two schools.
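    The pooled two-sample test can be sketched as follows. The function name `two_proportion_z_test` is an illustrative assumption, and the p-value is computed from the unrounded z rather than from tables, so it is closer to 0.119 than to the table-based 0.121.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: pi1 = pi2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # estimate of the common pi under H0
    # CLT check using the pooled estimate
    for n in (n1, n2):
        if n * pooled < 15 or n * (1 - pooled) < 15:
            raise ValueError("CLT check failed for the pooled proportion")
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z_obs = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return z_obs, p_value

z_two, p_two = two_proportion_z_test(40, 70, 45, 100)
print(f"z_obs = {z_two:.3f}, p-value = {p_two:.3f}")
print("Reject H0" if p_two <= 0.05 else "Retain H0")
```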

  • Confidence interval for pi1 - pi2

    Here, we have no null hypothesis, so we are not assuming that pi1 = pi2.

    When doing the hypothesis test, we averaged p1 and p2 to get a pooled p, and used that in our estimate of s.e.(p1 - p2). However, to evaluate the confidence interval, we still need an estimate of

    $$se(p_1 - p_2) = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$$

    so we use p1 as an estimate of pi1 and p2 as an estimate of pi2.

    So, an approx 100(1-α)% C.I. for pi1 - pi2 is:

    $$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

  • Warning: We can't use the confidence interval to accurately carry out the hypothesis test H0: pi1 = pi2, as the standard error of the difference in the two sample proportions is evaluated in different ways:

    Under H0, there was only one value of pi to estimate, and we then used the pooled p to estimate the relevant standard error:

    $$se(p_1 - p_2) = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

    In the CI calculations, we are not assuming pi1 = pi2, so the relevant standard error is estimated by:

    $$se(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

  • For the example

    95% C.I. for pi1 - pi2 is:

    $$\left(\frac{40}{70} - \frac{45}{100}\right) \pm 1.96 \sqrt{\frac{\frac{40}{70} \cdot \frac{30}{70}}{70} + \frac{\frac{45}{100} \cdot \frac{55}{100}}{100}}$$

    $$= (0.5714 - 0.45) \pm 1.96 \times 0.077289 = 0.1214 \pm 0.1515 = (-0.030,\ 0.273)$$

    We are 95% confident that the above interval includes the true difference between the population proportions.

    Note: the standard error used in the hypothesis test calculations was 0.07792;

    the standard error used in the CI calculations was 0.07729.
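    The interval for the difference can be sketched in the same style. The function name `two_proportion_ci` is hypothetical, and the unpooled standard error is used because no null hypothesis is assumed.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, conf=0.95):
    """Approximate two-sided CI for pi1 - pi2 using the unpooled standard error."""
    # CLT check uses the observed successes and failures in each sample
    if min(x1, n1 - x1, x2, n2 - x2) < 15:
        raise ValueError("CLT check failed: need at least 15 successes and failures")
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se, se

diff_lo, diff_hi, se_unpooled = two_proportion_ci(40, 70, 45, 100)
print(f"unpooled se = {se_unpooled:.6f}")
print(f"95% CI for pi1 - pi2: ({diff_lo:.4f}, {diff_hi:.4f})")
```

    The interval includes 0, which is consistent with retaining H0 in the test, although the two procedures use different standard errors.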

  • Using Minitab

    Under Stat > Basic Stats > 2 Proportions

    Note the only option is to use the normal approximation. There is no exact binomial test.

    "Use pooled estimate of p for test" must be ticked, or the CI (unpooled p) estimate of the standard error is used in the hypothesis test.

  • Output (pooled option)

    MTB > PTwo 70 40 100 45;
    SUBC> Pooled.

    Test and CI for Two Proportions

    Sample   X    N     Sample p
    1        40   70    0.571429
    2        45   100   0.450000

    Difference = p(1) - p(2)

    Estimate for difference: 0.121429

    95% CI for diff: (-0.0300545, 0.272912)

    Test for difference = 0 (vs not = 0): Z = 1.56 P-Value = 0.119

    Fisher's exact test: P-Value = 0.161 (ignore this until next year)

  • Output (unpooled option)

    MTB > PTwo 70 40 100 45.

    Test and CI for Two Proportions

    Sample   X    N     Sample p
    1        40   70    0.571429
    2        45   100   0.450000

    Difference = p(1) - p(2)

    Estimate for difference: 0.121429

    95% CI for diff: (-0.0300545, 0.272912)

    Test for difference = 0 (vs not = 0): Z = 1.57 P-Value = 0.116

    Results: same CI; only a small difference in z (1.56 vs 1.57) and p-values (0.119 vs 0.116)

  • Topic 9. Appendix A

    Insurance claims example: In past years, 15% of the policy holders have made an insurance claim per year.

    This year, of a random sample of 400 policies, 76 have made a claim.

    Is there any evidence that the proportion has increased by a factor of more than 1.2?

    Sample estimate of pi is p = 76/400 = 0.19

    Ratio of proportions is

    $$\frac{\text{sample proportion}}{\text{past proportion}} = \frac{0.19}{0.15} = 1.267 \quad \text{(bigger than 1.2)}$$

  • The results of previous inference were:

    H0: pi = 0.15 vs H1: pi ≠ 0.15 was rejected with p-val = 0.030

    The 95% CI for pi is (0.1515, 0.2285)

    We found that there is evidence that the true proportion this year is higher than 15%

    BUT we have not yet answered the question about whether the proportion has increased by a factor of more than 1.2!

  • We can approach this a couple of ways:

    (1) Hypothesis test

    H0: pi = 0.15 * 1.2 = 0.18

    H1: pi > 0.18, α = 0.05

    CLT check: n*pi0 = 400 * 0.18 = 72 ≥ 15

    n(1-pi0) = 400 * 0.82 = 328 ≥ 15

    $$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{76}{400} - 0.18 - \frac{1}{800}}{\sqrt{\frac{0.18(0.82)}{400}}} = \frac{0.00875}{0.0192094} = 0.46$$

    p-value = P(Z ≥ 0.46) ≈ 0.3228

    Insufficient evidence (at α = 5%) to conclude the proportion has increased by a factor of more than 1.2.
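    The one-tailed test with continuity correction can be sketched as below. The function name `upper_tail_z_test_cc` is an illustrative assumption. The code uses the unrounded z, so its p-value (about 0.324) differs slightly from the table value 0.3228 obtained after rounding z to 0.46.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test_cc(x, n, pi0):
    """One-tailed z-test of H0: pi = pi0 vs H1: pi > pi0, with continuity correction."""
    if n * pi0 < 15 or n * (1 - pi0) < 15:
        raise ValueError("CLT check failed: need n*pi0 >= 15 and n*(1-pi0) >= 15")
    p = x / n
    # Subtract 1/(2n) so the upper tail includes the observed count
    z_obs = (p - pi0 - 1 / (2 * n)) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 1 - NormalDist().cdf(z_obs)
    return z_obs, p_value

pi0 = 0.15 * 1.2  # hypothesised proportion after a 1.2-fold increase
z_up, p_up = upper_tail_z_test_cc(76, 400, pi0)
print(f"z_obs = {z_up:.2f}, p-value = {p_up:.4f}")
```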

  • (2) Confidence interval (one sided) (not strictly equivalent)

    95% CI lower bound for pi:

    $$p - z_{0.05} \sqrt{\frac{p(1-p)}{n}} = \frac{76}{400} - 1.645 \sqrt{\frac{\frac{76}{400} \cdot \frac{324}{400}}{400}} = 0.19 - 0.032267 = 0.1577$$

    CLT check (sample numbers): np = 76 ≥ 15 and n(1-p) = 324 ≥ 15

    The max value for a proportion is 1, so the upper limit cannot be ∞.

    We are 95% confident the interval (0.1577, 1) contains the true proportion. Because 0.18 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the proportion has increased by a factor of more than 1.2.

    To claim the proportion has increased by a factor of more than 1.2, the CI for the new pi would have to be completely ABOVE 0.18.

  • (3) Confidence interval for the ratio

    The claim is that the ratio of the proportions is more than 1.2,

    i.e. that pi / 0.15 > 1.2

    The appropriate 95% one-sided CI for the new pi was found to be (0.1577, 1). So, the appropriate 95% one-sided CI for the ratio pi / 0.15 is

    $$\left(\frac{0.1577}{0.15},\ \frac{1}{0.15}\right) = (1.051,\ 6.667)$$

    Because 1.2 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the ratio of the proportions is more than 1.2.

    To claim the ratio of proportions is more than 1.2, the CI would have to be completely ABOVE 1.2.
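    The rescaling from a CI for pi to a CI for the ratio pi / 0.15 can be sketched as follows. Variable names are illustrative. Because the past proportion 0.15 is treated as a known constant, both endpoints are simply divided by it.

```python
from math import sqrt
from statistics import NormalDist

x, n = 76, 400
past_proportion = 0.15  # treated as a known constant, as in the notes
p = x / n

# One-sided 95% CI for pi (lower bound, upper limit fixed at 1)
z = NormalDist().inv_cdf(0.95)  # about 1.645
lower = p - z * sqrt(p * (1 - p) / n)
ci_pi = (lower, 1.0)

# Dividing both endpoints by the constant gives the CI for the ratio
ci_ratio = (ci_pi[0] / past_proportion, ci_pi[1] / past_proportion)

print(f"CI for pi:    ({ci_pi[0]:.4f}, {ci_pi[1]:.0f})")
print(f"CI for ratio: ({ci_ratio[0]:.3f}, {ci_ratio[1]:.3f})")
print("1.2 inside ratio CI:", ci_ratio[0] <= 1.2 <= ci_ratio[1])
```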

  • Summary: One Sample Proportions Test & C.I.

    Hypothesis test:

    H0: pi = pi0 versus H1: pi ≠ pi0

    Sample: P = X/n where X ~ Bin(n, pi)

    CLT check: n*pi0 ≥ 15 and n(1-pi0) ≥ 15

    $$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}$$

    Confidence Interval:

    CLT check: np ≥ 15 and n(1-p) ≥ 15

    An approximate 100(1-α)% CI for pi is

    $$p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$

    One-sided CIs for pi must have as their limit 0 or 1 (not ±∞)

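  • Summary: Two Sample Proportions Test

    Hypothesis test:

    H0: pi1 = pi2 versus H1: pi1 ≠ pi2

    Sample: P1 = X1/n1 and P2 = X2/n2

    Pooled estimate of pi is

    $$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$$

    CLT check:

    $n_1\hat{p} \ge 15$ and $n_2\hat{p} \ge 15$

    $n_1(1-\hat{p}) \ge 15$ and $n_2(1-\hat{p}) \ge 15$

    $$z_{obs} = \frac{p_1 - p_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$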

  • Summary: Two Sample Proportions C.I.

    Confidence Interval:

    Sample: P1 = X1/n1 and P2 = X2/n2

    CLT check (simply uses observed counts):

    n1*p1 ≥ 15 and n2*p2 ≥ 15

    n1*(1-p1) ≥ 15 and n2*(1-p2) ≥ 15

    An approximate 100(1-α)% CI for (pi1 - pi2) is

    $$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

    One-sided CIs for (pi1 - pi2) must have as their limit -1 or +1 (not ±∞)