statrep1

Upload: jose-stevens

Post on 04-Apr-2018


  • 7/30/2019 statrep1

    1/11

    Introduction

    This report presents a short statistical analysis of the passenger data to uncover hidden dependencies among the variables. This paves the way to look for reasons behind the observed patterns.

    I. Is there a significant difference in Age distribution between those who survived and those who did not?

    Fig. 1 Histogram for survivors and non survivors

    From the histograms above, the age distributions of the two samples appear to differ. In particular, infants (less than 5 years of age) seem to have survived at a higher rate. The box plots of the same data are presented below:

    Fig. 2 Box plot for non survivors and survivors


    The random variable of interest is age. Let X1 denote the age of survivors and X2 denote the age of non-survivors.

    To answer the first question, we start with the following assumptions:

    1. X1 and X2 are Continuous.

    2. The samples of X1 and X2 are independent and identically distributed.

    Let us check the assumption of normality of the populations of survivors and non-survivors.

    Fig. 2 Q-Q norm for survivors and non survivors
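As a rough illustration of what a normal Q-Q plot compares, the sketch below pairs each sorted observation with the standard normal quantile at the same plotting position. The helper name `normal_qq_points`, the plotting positions (i - 0.5)/n, and the sample ages are assumptions made for illustration; they are not the passenger data or the software used for the figure.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # so the inverse CDF is always defined.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))


# Hypothetical ages, for illustration only (not the passenger data).
ages = [2, 4, 18, 21, 22, 24, 27, 29, 30, 35, 36, 40, 48, 54, 63]
points = normal_qq_points(ages)
```

Points that fall close to a straight line are consistent with normality; systematic curvature suggests a departure from it.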

    The p-values of all the tests for the two samples are tabulated below:

    Test p-value (Survived) p-value (Non Survived)

    Shapiro-Wilk normality test 1.461442e-09

    Anderson-Darling normality test 5.014279e-16

    Cramer-von Mises normality test 4.824687e-10

    Lilliefors (Kolmogorov-Smirnov) normality test 2.646866e-16

    Shapiro-Francia normality test 0.0020591830 1.719147e-08

    The p-values of all the tests for the two samples suggest that the populations of survivors and non-survivors are not normal (at an assumed α-value of 0.05).

    In addition to the above assumptions, we assume that the CDFs of X1 and X2 have the same shape. This allows us to apply the Wilcoxon rank-sum test.

    The Wilcoxon rank-sum test gives a p-value of 0.19, so there is not enough evidence against Ho (Ho: X1 is stochastically equal to X2).
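The rank-sum idea can be sketched in a few lines. The code below is a minimal version using the normal approximation, with midranks for ties but no tie or continuity correction, so its p-values may differ slightly from those of a statistics package. The function names and the sample ages are hypothetical and chosen only for illustration.

```python
from math import erfc, sqrt


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (U statistic of x, z score, p-value). Assumption of this sketch:
    no tie correction and no continuity correction are applied.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                 # rank sum of the first sample
    u = w - n1 * (n1 + 1) / 2           # equivalent Mann-Whitney U statistic
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2))    # two-sided tail area of the normal
    return u, z, p_value


# Hypothetical ages, for illustration only (not the passenger data).
survivor_ages = [1, 3, 19, 22, 26, 31, 35, 48]
non_survivor_ages = [18, 21, 24, 28, 30, 40, 52, 61, 65]
u, z, p_value = rank_sum_test(survivor_ages, non_survivor_ages)
```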

    Fig 3: ECDF for non survivors and survivors


    However, the histograms and ECDFs of survivors and non-survivors suggest that survival probability differs noticeably for passengers in the 0-5 years age group. We therefore perform a two-sample Kolmogorov-Smirnov test and get a p-value of 0.03428, which indicates a significant difference in the age distributions (at an α-value of 0.05).
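For intuition, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two ECDFs. The sketch below computes that gap and an asymptotic p-value from the Kolmogorov series. It is an approximation under stated assumptions (large samples, no exact small-sample correction), and the function names and sample ages are hypothetical.

```python
from bisect import bisect_right
from math import exp, sqrt


def kolmogorov_sf(lam):
    """Asymptotic survival function of the Kolmogorov distribution."""
    if lam < 0.3:
        # The alternating series converges slowly here, and the true value
        # is within about 1e-5 of 1, so return 1 directly.
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def ks_two_sample(x, y):
    """Return (D, asymptotic two-sided p-value) for two samples."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    # ECDF value at v is the fraction of observations <= v. The largest gap
    # between the two step functions occurs at one of the observed values.
    d = max(abs(bisect_right(xs, v) / n1 - bisect_right(ys, v) / n2)
            for v in xs + ys)
    lam = sqrt(n1 * n2 / (n1 + n2)) * d
    return d, kolmogorov_sf(lam)


# Hypothetical ages, for illustration only (not the passenger data).
survivor_ages = [1, 2, 4, 19, 22, 26, 31, 35, 48]
non_survivor_ages = [18, 21, 24, 28, 30, 40, 52, 61, 65]
d, p_value = ks_two_sample(survivor_ages, non_survivor_ages)
```

Unlike the rank-sum test, this statistic responds to a difference anywhere in the distribution, which is why it can flag a gap concentrated in one age range.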

    Based on the observed data and our intuition from the histograms, we group survivors and non-survivors into the age categories 0 to 5, 5+ to 15, 15+ to 30, 30+ to 45, 45+ to 60, and 60+. We then do a chi-square test to see if survivors and non-survivors have a homogeneous distribution across these age categories. We get a p-value of 5.477 × 10^-6, which supports our belief that the age distributions of survivors and non-survivors differ. Since the sample sizes are large (313 survivors and 443 non-survivors), we can do a z-test for the difference of proportions in each age category separately, the null hypotheses being:

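    p(0-5; Survivors) = p(0-5; Non-Survivors)

    p(5-15; Survivors) = p(5-15; Non-Survivors)

    p(15-30; Survivors) = p(15-30; Non-Survivors)

    p(30-45; Survivors) = p(30-45; Non-Survivors)

    p(45-60; Survivors) = p(45-60; Non-Survivors)

    p(60+; Survivors) = p(60+; Non-Survivors)

    Here p(c; G) denotes the proportion of group G that falls in age category c.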
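The age categories can be assigned mechanically. The sketch below assumes that "5+ to 15" means ages strictly above 5 and up to and including 15, and likewise for the other bins; this reading of the boundaries, the helper names, and the sample ages are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of the first five categories (assumed reading of
# "0 to 5, 5+ to 15, ..."); anything above 60 falls in "60+".
UPPER_BOUNDS = [5, 15, 30, 45, 60]
LABELS = ["0 to 5", "5+ to 15", "15+ to 30", "30+ to 45", "45+ to 60", "60+"]


def age_category(age):
    """Map an age in years to its category label."""
    # bisect_left returns the index of the first bound >= age,
    # so an age of exactly 5 stays in "0 to 5".
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


def category_counts(ages):
    """Count how many ages fall in each category, in LABELS order."""
    counts = Counter(age_category(a) for a in ages)
    return [counts[label] for label in LABELS]


# Hypothetical ages, for illustration only (not the passenger data).
example_counts = category_counts([1, 5, 16, 61, 70])
```

Counts produced this way for survivors and non-survivors form the rows of the contingency table on which the chi-square and z-tests operate.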

    We began with two-tailed tests and followed up with one-tailed tests wherever the null hypothesis was rejected. The Z tests give the following p-values and the adjoining conclusions:

    Age Category   p-value   Conclusion
    0 to 5         2.8e-6    p(0-5; Survivors) > p(0-5; Non-Survivors)
    5+ to 15
    15+ to 30
    30+ to 45
    45+ to 60      0.1578    p(45-60; Survivors) = p(45-60; Non-Survivors)
    60+            0.0355    p(60+; Survivors) < p(60+; Non-Survivors)


    The above analysis suggests that there is a significant difference in age distribution between

    those who survived and those who did not.

    II. (a) Is there a significant difference in Age distribution between male survivors

    and male non survivors?

    Histograms and box-plots suggest that young males survived more and old males died more.

    Fig 4 Box plot for male non survivors and survivors

    Fig 5 Histogram for male non survivors and survivors

    Normality tests on data

    The p-values of all the tests for the two samples of survivors and non-survivors are tabulated below:

    Test p-value (Survived) p-value (Non Survived)

    6.366845e-10

    Anderson-Darling normality test 0.008390771 2.227363e-16

    Cramer-von Mises normality test 0.045051620 5.182246e-10

    The p-values suggest that the populations of male survivors and male non-survivors are not normal (at an assumed α-value of 0.05).

    A two-sample Kolmogorov-Smirnov test indicates that the two samples come from different distributions (p-value = 0.002), implying a significant difference between the age distributions of male survivors and male non-survivors.


    We use the same approach of dividing the population into age categories to find out whether survival probability depends on age category, as done in part I, the only difference being that both samples are now restricted to males. The chi-square p-value of 1.47e-11 implies that male survivors and male non-survivors are not homogeneous with respect to age categories. Thus, we go ahead with 6 separate Z tests, one for each age category, with the null hypotheses being as follows:

    p(0-5; Male_Survivors) = p(0-5; Male_Non-Survivors)

    p(5-15; Male_Survivors) = p(5-15; Male_Non-Survivors)

    p(15-30; Male_Survivors) = p(15-30; Male_Non-Survivors)

    p(30-45; Male_Survivors) = p(30-45; Male_Non-Survivors)

    p(45-60; Male_Survivors) = p(45-60; Male_Non-Survivors)

    p(60+; Male_Survivors) = p(60+; Male_Non-Survivors)

    We began with two-tailed tests and followed up with one-tailed tests wherever the null hypothesis was rejected. The Z tests give the following p-values and the adjoining conclusions:

    Age Category

    0 to 5

    5+ to 15

    15+ to 30

    The above analysis suggests that there is a significant difference in age distribution between male survivors and male non-survivors.


    II. (b) Is there a significant difference in Age distribution between females who survived

    and those who did not?

    Histograms and box-plots for female non-survivors and survivors are compared. The distributions do not seem to be normal, as supported by the normality tests.

    Fig 6 Boxplot for female non survivors and survivors

    Fig 7 Histograms for female survivors and non survivors

    Normality tests data

    The p-values of all the tests for the two samples of female survivors and female non-survivors are tabulated below:

    Test p-value (Survived) p-value (Non Survived)

    0.11296655

    0.11530637

    0.08483381

    Lilliefors (Kolmogorov-Smirnov) normality test 0.0001661670 0.12109238

    Shapiro-Francia normality test 0.0077707718 0.11744076

    The p-values suggest that the ages of female survivors are not normally distributed, whereas normality cannot be rejected for female non-survivors (at an assumed α-value of 0.05). This suggests that the two distributions are not the same. To reinforce this, we do a two-sample Kolmogorov-Smirnov test, which also suggests that the two samples come from different distributions (p-value = 0.01326), implying a significant difference between the age distributions of female survivors and female non-survivors.


    We use the same approach of dividing the population into age categories to find out if there is

    a dependence of

    p(30-45; Female_Survivors) = p(30-45; Female_Non-Survivors)

    p(45-60; Female_Survivors) = p(45-60; Female_Non-Survivors)

    p(60+; Female_Survivors) = p(60+; Female_Non-Survivors)

    We began with two-tailed tests and followed up with one-tailed tests wherever the null hypothesis was rejected. The Z tests give the following p-values and the adjoining conclusions:

    Age Category   p-value   Conclusion
    0 to 5
    5+ to 15
    15+ to 30
    30+ to 45
    45+ to 60
    60+            0.666     p(60+; Female_Survivors) = p(60+; Female_Non-Survivors)

    The above analysis suggests that there is a significant difference in age distribution between

    female survivors and female non-survivors.


    III. Remark on how Age affected the Survival Probability of a passenger on board

    the Titanic, based on consolidations of your findings in 1 and 2 above.

    The findings in 1 and 2 above suggest that females had a higher survival probability than males. Among male passengers, infants and teenagers had a higher survival probability, whereas the age groups of 15 to 30 and above 60 years had a lower survival probability. Among female passengers, the age group of 45 to 60 had a higher survival probability. Possible reasons are that females and children were given preference in boarding the lifeboats, and that the old may have chosen to sacrifice their lives for the young.

    IV. Is there a significant difference in Survival Probability between the two genders?


    Ho: There is no difference in the survival probability of the two genders (male and female).

    Ha: There is a significant difference in the survival probability of the two genders (two-sided).

    Data:

    The table below displays the data for this problem:

               Survivor   Non-Survivor   Total
    Males      142        709            851
    Females    308        154            462
    Total      450        863            1313

    Test adopted for testing the hypothesis:

    Since this is a problem of proportions and we would like to compare the survival probabilities of males and females, we can use the following tests:

    1. Fisher's exact test

    2. Z-test

    Fisher's exact test is the more powerful test in this case, but we can also do a Z-test as the sample size is large.

    Conclusion: On the basis of Z-test we conclude that there is a significant difference in the

    survival probability of the two genders.
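The Z-test on the gender table can be reproduced approximately as follows. This is a sketch that assumes the pooled-variance form of the two-proportion z-test; the report does not state which variant was used, and the function name is hypothetical. The counts are the survivor and total figures from the table above.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance).

    x1, x2 are success counts and n1, n2 the group sizes.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)      # common proportion under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))    # two-sided tail area of the normal
    return z, p_value


# Survivors and totals from the gender table: males 142 of 851, females 308 of 462.
z, p_value = two_proportion_z_test(142, 851, 308, 462)
```

With these counts the z statistic is strongly negative (roughly -18), so the p-value is far below 0.05, in line with the stated conclusion.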


    We have the following data:

                          Survivors   Non-Survivors
    Passenger Class I     193         129
    Passenger Class II    119         161
    Passenger Class III   138         573

    The p-value of 2.2 × 10^-16 suggests that there is enough evidence to reject the null hypothesis (at an α-value of 0.05). It can be said that there is a significant difference between the population distributions across passenger classes.
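The report does not name the test behind this p-value. Assuming it was a chi-square test of homogeneity, as used for the age categories earlier, the statistic for the class table can be sketched as below. The function name is hypothetical, and the closed-form p-value is valid only because this table has 2 degrees of freedom.

```python
from math import exp

# Rows: Passenger Class I, II, III. Columns: survivors, non-survivors.
observed = [[193, 129], [119, 161], [138, 573]]


def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if every class had the same survival proportion.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square_statistic(observed)
# With exactly 2 degrees of freedom, the chi-square upper tail is exp(-x / 2).
p_value = exp(-stat / 2)
```

The statistic comes out at roughly 172 on 2 degrees of freedom, which gives a p-value well below the 2.2 × 10^-16 reported above.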

    We further break down the data to compare the classes with each other. We did one-tailed Fisher's tests, taking two classes at a time. This helped us find which passenger class had the better chance of survival. It was observed that the survival probability is highest for Class I, followed by Class II, with Class III having the lowest survival probability.
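A one-sided Fisher's exact test for a pair of classes can be sketched directly from the hypergeometric distribution. The function name `fisher_greater` is hypothetical, and the direction of each alternative (the first-listed class has the higher survival probability) is an assumption about how the pairwise tests were set up. The counts come from the class table above.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is the top-left cell under the hypergeometric
    distribution with all margins fixed. A small value supports the
    alternative that row 1 has the higher proportion in column 1.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denominator = comb(n, col1)
    upper = min(row1, col1)
    # comb returns 0 for impossible cell values, so no lower-bound check is needed.
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / denominator


# (survivors, non-survivors) per class, taken from the class table.
classes = {"I": (193, 129), "II": (119, 161), "III": (138, 573)}

pairwise_p = {}
for first, second in [("I", "II"), ("I", "III"), ("II", "III")]:
    a, b = classes[first]
    c, d = classes[second]
    pairwise_p[(first, second)] = fisher_greater(a, b, c, d)
```

All three p-values fall below 0.05 with these counts, which is consistent with the ordering Class I > Class II > Class III.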

    The above conclusion agrees with the common knowledge that passengers in first class had the first option to board the lifeboats, while passengers in third class were the last to board them.

    VI. Is there a significant difference in Survival Probability between the two genders

    even after taking the effect of Passenger Class into Account?


    We make three 2×2 contingency tables, one for each class, and do Fisher's test as follows:

    Class I    Survivors   Non Survivors
    Male       59          120
    Female     134         9

    We did a two-sided Fisher's test which yielded a p-value of less than 2.2e-16, i.e., there is a significant difference in Survival Probability between the two genders for Class I. So, we did a one-sided Fisher's test

    We did a two-sided Fisher's test which yielded a p-value of less than 2.2e-16, i.e., there is a significant difference in Survival Probability between the two genders for Class II. So, we did a one-sided

    Class III  Survivors   Non Survivors
    Male                   441
    Female     80          132

    We did a two-sided Fisher's test which yielded a p-value of less than 2.2e-16, i.e., there is a significant difference in Survival Probability between the two genders for Class III. So, we did a one-sided Fisher's test with the alternate hypothesis being that males' survival probability is less than that of