Cervical Cancer Vaccine - Econometrics Project

Upload: oana-caz

Post on 07-Apr-2018


  • 8/4/2019 Cervical Cancer Vaccine - Eco No Metrics Project

    1/11

    Cervical Cancer Vaccine
    Cazacu Oana
    Group 132


    Introduction

    In this project I tried to show how several aspects related to the cervical cancer vaccine campaign influenced people, and how many of them eventually found out about the campaign. The data used in this project was taken from the database that we built last year for our statistics project. Last year, my team and I made a questionnaire which was filled in by 135 people. Their answers formed our database.

    Hypothesis testing

    A hypothesis is a statement about a population parameter from one or more populations. A hypothesis test is a procedure that states the hypothesis to be tested, uses sample information to formulate a decision rule, and reaches a decision based on the outcome of that rule. Depending on the outcome of the decision rule, the hypothesis can be statistically validated or rejected.

    In order to do a hypothesis test, the following steps must be followed:

    1. State the hypothesis
    2. State the significance level
    3. State the cut-off values
    4. Calculate the test statistic
    5. Compare z calculated to the critical values
    6. Decide into which region z falls: rejection or acceptance
    7. Decide upon the acceptance or rejection of the null hypothesis
    8. Comment on the decision

    A hypothesis test always includes two hypotheses:

    a) The null hypothesis (H0): The null hypothesis is the hypothesis to be tested.

    b) The alternative hypothesis (H1): The alternative hypothesis is the one accepted if the null hypothesis is rejected.

    H0 and H1 can be almost anything, and as complicated or as simple as we wish.

    The significance level is one of the most confusing terms for students. Each student wonders whether a 5 percent significance level means that there is only a 5% chance that the results are significant. The significance level is actually alpha, the Type I risk: if the null hypothesis is true, there is a 5 percent chance of rejecting it because of random variation.
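    The link between the significance level and the cut-off values can be sketched in a few lines. This is a minimal illustration that assumes a z test on the standard normal distribution; the function names are hypothetical and only Python's standard library is used.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def right_tail_cutoff(alpha):
    """z value that leaves probability alpha in the right tail."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha)

def two_tail_cutoff(alpha):
    """Positive z value that leaves alpha / 2 in each tail."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)

alpha = 0.05
print(round(right_tail_cutoff(alpha), 3))  # 1.645
print(round(two_tail_cutoff(alpha), 2))    # 1.96
```

    A smaller alpha pushes the cut-off values further from zero, so a true null hypothesis is rejected less often.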

    First Problem

    A survey was made to test whether the percentage of people between the ages of 16 and 20 who found out about the cervical cancer vaccine campaign from watching TV is significantly higher than the percentage who found out about it through the internet. The results of the survey said that the share of persons who had the TV as a source of information was smaller than the share of persons who had the Internet as a source of information.

    I want to test at a 5% level of significance whether the proportion of people aged 16 to 20 among those who found out about the vaccine from the Internet is significantly higher than the proportion among those who found out from the TV.


    Table 1 presents the classification of people into 6 age groups. I chose to do the hypothesis test on the 16 to 20 group because it was the largest group. All the data was collected from the database.

    Table 1: Persons grouped by age

    Also, from the database I collected the information about who found out about the vaccine from the television and from the internet, for all age groups. I present this information in Table 2.

    Age group   Source of information: Television   Source of information: Internet
    16-20       44                                   21
    21-25       18                                   7
    26-30       10                                   3
    31-35       3                                    2
    36-40       5                                    2
    41-45       5                                    1
    Total       85                                   36

    We are in the case of testing the difference between two proportions.

    In order to do the hypothesis test, I follow these steps:

    Step 1: In this step I define the null hypothesis and the alternative hypothesis. I denote by n1 the total number of persons, regardless of age, who found out about the vaccine from the television, and by n2 the total number of persons who found out through the internet.

    [Figure: Persons grouped by age categories (16-20, 21-25, 26-30, 31-35, 36-40, 41-45): 53%, 22%, 10%, 4%, 6%, 5%]


    n1 = 85

    n2 = 36

    I compute the percentages in the following way:

    • p1 = 44/85 = 51.76%. This shows what share of the people who found out about the vaccine from the Television are between the ages of 16 and 20.

    • p2 = 21/36 = 58.33%. This shows what share of the people who found out about the vaccine from the Internet are between the ages of 16 and 20.

    The null hypothesis is: H0: p2 - p1 = 0
    The alternative hypothesis is: H1: p2 - p1 > 0

    Step 2: The significance level is set at 5%.

    Step 3: Because we are conducting a right-tailed test, the cut-off value is +1.645.

    Step 4: Establish the Acceptance Region: AR = (-∞, +1.645]

    Step 5: We compute z calculated by substituting all the values into the test-statistic formula:

    z calculated = 0.66

    Step 6: z calculated ∈ AR


    Step 7: Because z calculated falls into the acceptance region, at the 5% significance level we don't have enough sample evidence to reject H0 and accept H1. The 5% is the risk we accepted of rejecting a true H0. To be 100% sure we would have to test the whole population.

    Step 8: At the 5% significance level, the proportion of persons aged 16 to 20 among those who found out about the vaccine through the Internet is not significantly higher than the proportion among those who found out about the vaccine through the Television.
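    The eight steps of the First Problem can be retraced with a short sketch. It assumes the pooled two-proportion statistic, z = (p2 - p1) / sqrt(p(1 - p)(1/n1 + 1/n2)) with p = (x1 + x2) / (n1 + n2); the project's own formula is not shown here, and this form is an assumption chosen because it reproduces the reported value of 0.66. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p2 - p1 = 0 (assumed formula)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / std_error

# Table 2: 44 of the 85 television respondents and 21 of the 36
# internet respondents are aged 16-20.
z = two_proportion_z(44, 85, 21, 36)
cutoff = NormalDist().inv_cdf(0.95)  # right-tailed test, alpha = 0.05
reject_h0 = z > cutoff

print(round(z, 2), round(cutoff, 3), reject_h0)  # 0.66 1.645 False
```

    Since z stays below the cut-off, the sketch reaches the same decision as Step 7: H0 is not rejected.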

    Second Problem

    In 25% of cases when women caught the disease (cervical cancer), the doctors claimed that it wasn't due to STD (Sexually Transmitted Disease) factors. A random survey of 135 people found that in 55% of cases when the woman caught the disease, it was due to STD factors.

    I want to test at a 5% level of significance whether there is a significant difference between the survey's results and the doctors' claims.

    Step 1: We formulate the null and alternative hypotheses:
    H0: π = 25%
    H1: π ≠ 25%

    Step 2: The significance level is set at 5%.

    Step 3: Because it is a two-tailed test, we have two cut-off values: -1.96 and +1.96.

    Step 4: We establish the Acceptance Region: AR = [-1.96, +1.96]

    Step 5: We compute z calculated by substituting the values into the test-statistic formula:

    z calculated = 7.01

    Step 6: z calculated belongs to the Rejection Region: RR = (-∞, -1.96) ∪ (+1.96, +∞)


    Step 7: At the 5% significance level we have enough sample evidence to reject the null hypothesis and accept the alternative hypothesis. There is a 5% probability of being wrong. To be 100% sure we would have to test the whole population.

    Step 8: At the 5% significance level we have enough sample evidence to reject the doctors' claim and to accept the survey's results.
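    The Second Problem can be checked the same way. The sketch below is an illustration under stated assumptions: the reported 7.01 is reproduced when the standard error is built from the sample proportion (55%), so that variant is taken as the default, and the variant that uses the hypothesized 25% is shown next to it for comparison. The function name and its `use_sample_se` parameter are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n, use_sample_se=True):
    """z statistic for H0: proportion = p0.

    use_sample_se=True builds the standard error from the sample
    proportion; False builds it from the hypothesized proportion.
    """
    p_for_se = p_hat if use_sample_se else p0
    return (p_hat - p0) / sqrt(p_for_se * (1 - p_for_se) / n)

z_sample = one_proportion_z(0.55, 0.25, 135)
z_null = one_proportion_z(0.55, 0.25, 135, use_sample_se=False)
cutoff = NormalDist().inv_cdf(0.975)  # two-tailed test, alpha = 0.05

print(round(z_sample, 2))  # 7.01
print(round(z_null, 2))    # 8.05
print(abs(z_sample) > cutoff, abs(z_null) > cutoff)  # True True
```

    Both variants fall far inside the rejection region, so the decision in Step 7 does not depend on which standard error is used.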

    Simple Linear Regression Model

    Links between variables can be explained by two techniques: regression and correlation. Correlation shows how strong the relationship between variables is, while regression helps in explaining and predicting a value based on one or more other factors, which reduces the uncertainty about important but random phenomena. There are three main goals when we analyze the links between statistical variables: to describe and understand the relationships of dependence, to predict a new value of the effect variable, and to adjust and control the effect variable by intervening on the variable concerned.

    The model is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y.

    The simple regression model is a model where the dependent variable is a linear function of a single independent variable, plus an error term.

    The variables y and x have several different names used interchangeably, as follows. y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand. x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor.

    The dependent variable is the variable we wish to explain, and the independent variable is the variable used to explain the dependent variable.

    For calculating the simple linear regression, we will consider that the dependent variable is represented by the number of persons who didn't take the vaccine, and the independent variable is represented by the number of persons who had a medical reason for not taking the vaccine. All the data was taken from our database.

    In our database we have 135 entries. Table 3 shows how many persons took the vaccine and how many did not.

    [Table 3: Persons who took the vaccine / Persons who didn't take the vaccine]

    The simple linear regression model has the form:

    Y_i = β0 + β1·x_i + ε_i


    For calculating the simple linear regression, I will consider as the dependent variable the number of persons who didn't take the vaccine, and as the independent variable the number of persons who didn't take the vaccine because they had a medical reason not to take it.

    Income         Persons who didn't take the vaccine   People who had a medical reason for not taking the vaccine
    < 600          45                                     10
    600 - 1000     22                                     3
    1000 - 2000    15                                     2
    2000 - 3000    12                                     3
    3000 - 5000    3                                      0
    > 5000         1                                      1
    Confidential   28                                     10
    Total          126                                    29

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R          0.897332912
    R Square            0.805206355
    Adjusted R Square   0.766247626
    Standard Error      7.39581338
    Observations        7

    • The variables are the persons who didn't take the vaccine (grouped by the level of income) and the persons who didn't take the vaccine for medical reasons.

    • The persons who didn't take the vaccine are the dependent variable, and the persons who didn't take the vaccine for a medical reason are the independent variable.

    • Multiple R shows how strong the correlation is between the persons who didn't take the vaccine and the persons who didn't take the vaccine for medical reasons. In our case Multiple R = 0.897. Taking into consideration that it is larger than 0.75, we can say that there is a strong positive correlation between the variables.

    • R Square = 0.805. That means that 80.5% of the variation of the dependent variable is explained by the independent variable: 80.5% of the variation across income groups in the number of people who didn't take the vaccine is accounted for by the number of people who had a medical reason.

    • Adjusted R Square = 0.766. This is R Square corrected for the number of observations and independent variables: after the correction, 76.6% of the variation of the dependent variable is explained by the independent variable.
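    The regression statistics can be recomputed from the income table as a check. This is a minimal sketch using only the standard library; it assumes that Multiple R is the Pearson correlation between the two columns (true for a single predictor with a positive slope) and that the Total row is excluded from the seven observations. Variable names are illustrative.

```python
from math import sqrt

# Counts per income group, taken from the income table (Total row excluded)
y = [45, 22, 15, 12, 3, 1, 28]  # persons who didn't take the vaccine
x = [10, 3, 2, 3, 0, 1, 10]     # persons who had a medical reason

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_yy = sum((yi - y_mean) ** 2 for yi in y)  # total sum of squares

multiple_r = s_xy / sqrt(s_xx * s_yy)  # Pearson correlation
r_square = multiple_r ** 2
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
standard_error = sqrt(s_yy * (1 - r_square) / (n - 2))

print(round(multiple_r, 3), round(r_square, 3),
      round(adjusted_r_square, 3), round(standard_error, 3))
# 0.897 0.805 0.766 7.396
```

    The adjustment factor (n - 1) / (n - 2) is what lowers R Square from 0.805 to 0.766 with only seven observations.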


    ANOVA

                 df   SS            MS            F            Significance F
    Regression   1    1130.509722   1130.509722   20.6681885   0.006133631
    Residual     5    273.4902778   54.69805556
    Total        6    1404

    • Significance F shows whether the model is valid or not. Significance F = 0.0061, which is smaller than α = 0.05 and also than α = 0.01. Because the computed probability is much smaller than our accepted probability, the model is valid at the 95% confidence level, at the 99% level, and up to the 99.3% level.
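    The ANOVA decomposition behind the F statistic can be sketched as follows. The code assumes the seven income-group observations from the income table; the critical value 6.61 is a standard F-table entry for 1 and 5 degrees of freedom at α = 0.05, hard-coded because the standard library has no F distribution. Names are illustrative.

```python
# Counts per income group (Total row excluded)
y = [45, 22, 15, 12, 3, 1, 28]  # persons who didn't take the vaccine
x = [10, 3, 2, 3, 0, 1, 10]     # persons who had a medical reason

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_regression = s_xy ** 2 / s_xx         # variation explained by the line
ss_residual = ss_total - ss_regression   # variation left unexplained

ms_regression = ss_regression / 1        # 1 regression degree of freedom
ms_residual = ss_residual / (n - 2)      # n - 2 residual degrees of freedom
f_stat = ms_regression / ms_residual

F_CRITICAL = 6.61  # F table, df = (1, 5), alpha = 0.05 (assumed table value)

print(round(ss_regression, 2), round(ss_residual, 2), round(f_stat, 2))
# 1130.51 273.49 20.67
print(f_stat > F_CRITICAL)  # True
```

    Comparing F with a table value is equivalent to comparing Significance F with α: both say the model is significant at the 5% level.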

                                       Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
    Intercept                          4.2653         4.1160           1.0363   0.3476    -6.3152     14.8457     -6.3152       14.8457
    People that had a medical reason
    for not taking the vaccine         3.3153         0.7292           4.5462   0.0061    1.4407      5.1898      1.4407        5.1898

    The equation of the simple linear regression is:
    Y = 4.26 + 3.31 X
    Y = Number of persons who didn't take the vaccine
    X = Number of persons who didn't take the vaccine for medical reasons

    Interpretation:

    • If no person invoked a medical reason for not taking the vaccine, 4.26 persons would still not take the vaccine.

    • The slope = 3.31 is positive, meaning we have a positive correlation between the variables.

    • If one additional person invoked a medical reason for not taking the vaccine, the number of persons who didn't take the vaccine would be higher by 3.31 persons.

    The inference upon the slope:

    The null hypothesis is that the variables are independent, and the alternative hypothesis is that the variables are dependent.

    In our case P-value = 0.0061. The value is smaller than the level of significance alpha = 0.05, meaning that at the 5% significance level we can reject H0 and accept H1. There is sufficient evidence that medical reasons have influenced people in not taking the vaccine.

    At the 95% level of confidence, the confidence interval for the slope is (1.4407, 5.1898). We can be 95% confident that for each additional person who had a medical reason, the number of persons who didn't take the vaccine increases by at least 1.44 persons and at most 5.19 persons.
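    The slope, its t statistic and its confidence interval can be reproduced with a short sketch. It assumes ordinary least squares on the seven income groups; the critical value 2.5706 is a standard t-table entry for 5 degrees of freedom (two-sided, 5%), hard-coded because the standard library has no Student's t distribution. Names are illustrative.

```python
from math import sqrt

# Counts per income group (Total row excluded)
y = [45, 22, 15, 12, 3, 1, 28]  # persons who didn't take the vaccine
x = [10, 3, 2, 3, 0, 1, 10]     # persons who had a medical reason

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual variance with n - 2 degrees of freedom
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - 2)

se_slope = sqrt(mse / s_xx)
t_stat = slope / se_slope

T_CRITICAL = 2.5706  # t table, 5 df, two-sided 5% (assumed table value)
lower = slope - T_CRITICAL * se_slope
upper = slope + T_CRITICAL * se_slope

print(round(slope, 4), round(intercept, 4))  # 3.3153 4.2653
print(round(se_slope, 4), round(t_stat, 4))  # 0.7292 4.5462
print(round(lower, 2), round(upper, 2))      # 1.44 5.19
```

    The interval excludes zero, which is the same conclusion as the P-value of 0.0061 being below 0.05.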


    Multiple Regression Model

    Multiple regression is a statistical method used to examine the relationship between one dependent variable Y and one or more independent variables X_i. The regression parameters or coefficients b_i appear in the regression equation:

    Y = b0 + b1·X1 + b2·X2 + ... + bk·Xk + ε

    For the multiple regression model I used the data from the simple linear regression and, in addition, the number of persons who knew about the side effects that the vaccine caused.

    Income         Persons who didn't take the vaccine   People who had a medical reason for not taking the vaccine   Persons who knew about the side effects
    < 600          45                                     10                                                            17
    600 - 1000     22                                     3                                                             6
    1000 - 2000    15                                     2                                                             2
    2000 - 3000    12                                     3                                                             3
    3000 - 5000    3                                      0                                                             2
    > 5000         1                                      1                                                             1
    Confidential   28                                     10                                                            14
    Total          126                                    29                                                            45

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R          0.932792139
    R Square            0.870101174
    Adjusted R Square   0.805151761
    Standard Error      6.752369052
    Observations        7

    • The variables of the model are:
      - the number of persons per income group who didn't take the vaccine: dependent variable
      - the number of persons who had a medical reason for not taking it: independent variable
      - the number of persons who knew about the side effects of the vaccine: independent variable


    • The level of correlation between the variables is given by Multiple R. In our case Multiple R = 0.93. It is larger than 0.75, which means that there is a strong positive correlation between the variables.

    • R Square gives the proportion of the variation in the number of persons who didn't take the vaccine that is explained by the number of persons who had a medical reason for not taking the vaccine and the number of persons who knew about the side effects. In our case R Square = 0.87, meaning 87%.

    • Adjusted R Square gives the same proportion after correcting for the number of independent variables and observations. In our case Adjusted R Square = 0.805, meaning 80.5%.

    ANOVA

                 df   SS          MS         F         Significance F
    Regression   2    1221.622    610.811    13.3966   0.016873705
    Residual     4    182.37795   45.59449
    Total        6    1404

    • Significance F shows whether the model is valid or not. Significance F = 0.0169, which is smaller than α = 0.05. Because the computed probability is smaller than our accepted probability, the model is valid at the 95% confidence level, and up to the 98.3% level.

                                       Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
    Intercept                          3.8567         3.7690           1.0232    0.3640    -6.6076     14.3210     -6.6077       14.3210
    People that had a medical reason
    for not taking the vaccine         -0.4876        2.7713           -0.1760   0.8689    -8.1821     7.2069      -8.1821       7.2069
    Persons who knew about the side
    effects of the vaccine             2.5143         1.7786           1.4136    0.2304    -2.4240     7.4526      -2.4240       7.4526

    • The form of the regression model: Y = 3.86 - 0.49 X1 + 2.51 X2
      3.86: the intercept
      -0.49: the slope of X1
      2.51: the slope of X2
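    The coefficients of the two-predictor model can be recomputed by hand. This is a minimal sketch that assumes ordinary least squares and solves the two normal equations on mean-centered data with Cramer's rule, so it needs no external library; the helper name `centered_cross_product` is hypothetical.

```python
# Counts per income group (Total row excluded)
y = [45, 22, 15, 12, 3, 1, 28]   # persons who didn't take the vaccine
x1 = [10, 3, 2, 3, 0, 1, 10]     # persons who had a medical reason
x2 = [17, 6, 2, 3, 2, 1, 14]     # persons who knew about the side effects

def mean(values):
    return sum(values) / len(values)

def centered_cross_product(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    a_mean, b_mean = mean(a), mean(b)
    return sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))

s11 = centered_cross_product(x1, x1)
s22 = centered_cross_product(x2, x2)
s12 = centered_cross_product(x1, x2)
s1y = centered_cross_product(x1, y)
s2y = centered_cross_product(x2, y)
syy = centered_cross_product(y, y)

# Normal equations: s11*b1 + s12*b2 = s1y and s12*b1 + s22*b2 = s2y
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

r_square = (b1 * s1y + b2 * s2y) / syy

print(round(b0, 2), round(b1, 2), round(b2, 2), round(r_square, 2))
# 3.86 -0.49 2.51 0.87
```

    The two predictors are strongly correlated with each other (s12 is large relative to s11 and s22), which is one plausible reason the sign of the first slope flips compared with the simple regression.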


    Interpretations:

    • If no person invoked a medical reason for not taking the vaccine, and if no person knew about the side effects, there would still be 3.86 persons who wouldn't take the vaccine.

    • The first slope is negative, so between the dependent variable (the number of persons who didn't take the vaccine) and the independent variable (the number of persons who had a medical reason for not taking it) there is a negative relationship, meaning that the variables move in opposite directions.

    • The second slope is positive, so between the dependent variable (the number of persons who didn't take the vaccine) and the independent variable (the number of persons who knew about the side effects of the vaccine) there is a positive relationship, meaning that the variables move in the same direction.

    • The first P-value = 0.87, which is higher than alpha = 0.05. That is why we fail to reject the null hypothesis for this slope and do not accept the alternative one: the coefficient is not statistically significant.

    • The second P-value = 0.23, which is higher than alpha = 0.05. That is why we fail to reject the null hypothesis for this slope and do not accept the alternative one: the coefficient is not statistically significant.
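    The two slope tests can be retraced with t statistics instead of P-values. The sketch assumes ordinary least squares with two predictors and the closed-form standard errors for that case; the critical value 2.7764 is a standard t-table entry for 4 degrees of freedom (two-sided, 5%), hard-coded because the standard library has no Student's t distribution. Names are illustrative.

```python
from math import sqrt

# Counts per income group (Total row excluded)
y = [45, 22, 15, 12, 3, 1, 28]   # persons who didn't take the vaccine
x1 = [10, 3, 2, 3, 0, 1, 10]     # persons who had a medical reason
x2 = [17, 6, 2, 3, 2, 1, 14]     # persons who knew about the side effects
n = len(y)

def cross(a, b):
    """Centered cross product of two equally long lists."""
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y, syy = cross(x1, y), cross(x2, y), cross(y, y)

det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Residual variance with n - 3 degrees of freedom (intercept + 2 slopes)
mse = (syy - b1 * s1y - b2 * s2y) / (n - 3)

se_b1 = sqrt(mse * s22 / det)
se_b2 = sqrt(mse * s11 / det)
t1, t2 = b1 / se_b1, b2 / se_b2

T_CRITICAL = 2.7764  # t table, 4 df, two-sided 5% (assumed table value)

print(round(t1, 2), round(t2, 2))                    # -0.18 1.41
print(abs(t1) > T_CRITICAL, abs(t2) > T_CRITICAL)    # False False
```

    Neither t statistic exceeds the critical value, matching the P-values of 0.87 and 0.23: taken individually, neither slope is significant even though the model as a whole is.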