Statistics and Research Methods Final Exam
Copenhagen Business School 2016, Group 19
STU count: 34,017. Number of pages: 15. Written from the 2nd of May to the 31st of May.

Student Name, CPR:
Agnes Tingstrøm Klinken, 100393-2164
Karoline Hauerbach, 100795-0914
Luna Stæhr Andersen, 201293-2440
Nina Meiniche, 040594-2186
Nina Möger Bengtsson, 090196-0534



Table of Contents

Question 1: 95% confidence interval
Question 2: Comparing proportions
Question 3: Chi squared
Question 4: Comparing two means
Question 5: Analysis of variance (ANOVA)
Question 6: Simple linear regression and polynomial regression
Question 7: Multiple linear regression
Appendices


Question 1: Construct a 95% confidence interval for the probability that an individual in the

treatment group would purchase the EPP (treat = 1).

The variables investigated in this question are categorical; the observations do not take a numerical value but instead belong to one of two categories: those who would purchase the EPP (1) and those who would not (0).

We start by finding the proportion of clients in the treatment group who would purchase the EPP. The sample size of the treatment group is 1481 clients, of which 548 purchased the EPP (app. 1A). The sample proportion is thus p̂ = 548/1481 = 0.37. The value of p̂ works as a point estimate of the population proportion, stating that 37% of the clients in the treatment group are willing to purchase the EPP.

The 95% confidence interval is constructed from the sample proportion and the margin of error. The margin of error is an estimate of the precision of the parameter one wishes to investigate. In this question we investigate the probability parameter, which can be used to make inferences from the sample to the relevant population - the clients. To find the margin of error we must first find the standard error. The standard error portrays the variability that would exist between different samples taken from the population. This is unlike the standard deviation, which measures the variability of observations within a single sample.

The standard error is calculated as follows:

se = √( p̂(1 − p̂)/n ) = √( 0.37(1 − 0.37)/1481 ) ≈ 0.01

Seeing that the sample size (n = 1481) is quite large, we assume that the distribution of the sample proportion is approximately normal due to the Central Limit Theorem (CLT), and therefore we expect 95% of the observations to fall within 1.96 standard errors on both sides of the point estimate (0.37). The margin of error is a multiple of the standard error and is calculated as 1.96 ∗ 0.01 = 0.02.

The confidence interval is constructed by the formula:

p̂ ± 1.96 ∗ se = 0.37 ± 1.96 ∗ 0.01 = (0.3504, 0.3896)

We are thus 95% confident that the proportion of clients in the treatment group purchasing the EPP is between 0.3504 and 0.3896, i.e. 35.04% and 38.96% (app. 1B).
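As a check on the arithmetic, the interval can be recomputed from the counts quoted above (548 purchasers of n = 1481); this is a minimal sketch in Python. Note that the unrounded standard error is about 0.0125, so the exact interval is slightly wider than the one obtained above with se rounded to 0.01.

```python
import math

# 95% CI for the proportion purchasing the EPP in the treatment group.
# Counts are taken from the text (548 purchasers out of n = 1481).
n, successes = 1481, 548
p_hat = successes / n                      # point estimate: ~0.37

se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
margin = 1.96 * se                         # 95% margin of error (normal approx., CLT)
ci = (p_hat - margin, p_hat + margin)

print(f"p_hat = {p_hat:.4f}, se = {se:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```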


Question 2: Is there a statistically significant difference between males and females in the treatment group with respect to the probability of purchasing EPP? Include a z test and a confidence interval for the difference.

To test the difference between males and females (two groups) we look at proportions, conduct a

significance test and make a confidence interval.

Assumptions: We assume that (1) the variables are categorical, (2) the data is randomised and (3) that the sample is sufficiently large for the sampling distribution to be approximately normal - i.e. n1 and n2 are large enough for there to be at least 5 successes and 5 failures in each group.

Hypotheses: In order to compare the difference between males and females in the treatment group, we formulate a null (H0) and an alternative hypothesis (Ha). p1 represents the population proportion for the group of males in the treatment group and p2 for the group of females in the treatment group. Similarly, n1 and n2 represent the sample sizes for the two groups.

• Null hypothesis (H0): p1 – p2 = 0. There is no difference between the proportion of males and females buying the EPP.

• Alternative hypothesis (Ha): p1 ≠ p2. There is a difference between the proportion of males and females buying the EPP.

For this analysis the significance level is 0.05.

We start by finding the proportion of men and women in the treatment group who bought the EPP by constructing a contingency table (app. 2a). The proportion of females in the treatment group purchasing the EPP is 0.3869, and for males it is 0.3605.

Test statistic (Z-test): The significance test will test the null hypothesis by showing the probability of getting the sample data that we have if H0 is true. When comparing two independent groups we find the difference between the two proportions for males and females respectively. We thus calculate the difference between the sample proportions (p̂1 – p̂2):

𝑝1 – 𝑝2 = 0.3605 – 0.3869 = – 0.0264

In order to make inferences about the difference of population proportions (p1 – p2) we need to

calculate how the difference of the sample proportions (𝑝1 – 𝑝2) could vary from sample to sample.

This is estimated by the standard error where we look at the variability around the mean of the

sampling distribution (𝑝1 – 𝑝2) of the estimate.

The standard error is calculated by the formula:

se = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) = √( 0.3605(1 − 0.3605)/n1 + 0.3869(1 − 0.3869)/n2 ) = 0.026122

where n1 and n2 are the numbers of males and females in the treatment group (app. 2a).

A Z-score measures how many standard errors the sample estimate of a given parameter falls from the null hypothesis value (p1 – p2 = 0). The Z-test statistic is calculated by dividing the point estimate by the standard error:

Z = (p̂1 − p̂2)/se → Z = (0.3605 − 0.3869)/0.026122 = −1.0107

We thus find a Z-score of −1.0107.

The conclusion drawn from a Z-score depends on the significance level required. We have set our significance level to 0.05 and thus require our sample results to fall within 1.96 standard errors of the mean in order not to reject the null hypothesis. This translates into a Z-score of ±1.96.

P-value: The Z-value can be translated into a P-value. The P-value gives the probability of observing a test statistic value equal to or more extreme than the observed value if H0 is true. It describes how unusual the data would be if H0 were true, as it captures the sum of tail probabilities. We find the P-value that corresponds to a Z-value of −1.0107 by looking in the table of standard normal cumulative probabilities (app. 8a). The one-tail probability equals 0.1562, but since it is a two-sided test, the total P-value is 0.1562 ∗ 2 = 0.3124, which is above our threshold of 0.05. See app. 2b for illustration.

Confidence interval: We construct a 95% confidence interval for the difference between the two

population proportions p1-p2 (app. 2c). The formula is similar to the one from Q1 but portrays the

fact that we are looking at the difference between two proportions.

(p̂1 − p̂2) ± 1.96 ∗ se → −0.0264 ± 1.96 ∗ 0.026122 = (−0.0776, 0.0248)

We are thus 95% confident that the true difference in proportions of males and females purchasing the EPP is between −0.0776 and 0.0248, i.e. the proportion of females buying the EPP can be anywhere from 2.48 percentage points lower to 7.76 percentage points higher than the males' proportion. Since 0 occurs in the interval, there is a chance that the difference between the two proportions is 0.

Conclusion: Based on our p-value, which is above the threshold of 0.05, and our confidence interval, which includes 0, we cannot reject our null hypothesis: we do not have evidence of a statistically significant difference between men and women in the treatment group with regard to the probability of purchasing the EPP. Note, however, that we cannot fully accept the null hypothesis either.
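The z test above can be reproduced from the reported proportions. The group sizes below (946 males, 535 females) are not stated explicitly in the text; they are inferred from the treatment-group totals (1481 clients, 548 purchasers) and the two proportions, so treat them as an assumption and check against app. 2a.

```python
import math
from statistics import NormalDist

# Two-proportion z-test, males vs. females in the treatment group.
# Group sizes are inferred (assumption), not quoted in the text.
n1, p1 = 946, 0.3605   # males
n2, p2 = 535, 0.3869   # females

diff = p1 - p2
# Unpooled standard error, as used for the confidence interval in the text
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = diff / se
p_value = 2 * NormalDist().cdf(-abs(z))      # two-sided p-value
ci = (diff - 1.96 * se, diff + 1.96 * se)    # 95% CI for p1 - p2

print(f"z = {z:.4f}, p = {p_value:.4f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The results agree with the text's Z ≈ −1.01, P ≈ 0.31 and CI ≈ (−0.078, 0.025) up to rounding of the inputs.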

Question 3: Is there a significant association between being female and one’s concern about a

political crisis?

Assumptions: We assume that (1) our two variables are categorical, and thus the significance of a potential association must be looked at through proportions, and (2) that the data has been collected using randomization and that the expected cell count is ≥ 5 in all cells.

Hypotheses: We start by formulating the following hypotheses:

• Null hypothesis (H0): The two variables are independent

• Alternative hypothesis (Ha): The two variables are dependent (associated)

For this analysis the significance level is 0.05.

Test Statistic (Chi Squared): To test our null hypothesis we construct a contingency table and illustrate the proportions with a pie chart (app. 3a and 3b). The contingency table informs us that the proportion of men and women combined who are Not Concerned is 0.30, Somewhat Concerned is 0.269 and Very Concerned is 0.431. At first glance, the proportions for men and women respectively do not seem to differ greatly (men: 29.29%, 26.54%, 44.17%; women: 31.25%, 27.43%, 41.32%), and the pie chart illustrates this, but in order to determine whether there is an association between the two variables - gender and level of concern - we test for independence.

Assuming independence, P(A and B) = P(A) ∗ P(B), we compute the expected cell counts.

If the two variables are independent, the conditional distributions should be identical for men and

women and we would thus expect 30% of men and women respectively to be not concerned, 26.9%

to be somewhat concerned and 43.1% to be very concerned. We expect this distribution of level of

concern among both men and women as that is how it is distributed in the full sample.

As an example of an expected cell count, we look at the cell of female (A) and Not Concerned (B). The calculation is as follows:

n ∗ P(A) ∗ P(B) = n ∗ P(female) ∗ P(Not Concerned) = 2960 ∗ (1072/2960) ∗ (888/2960) = 321.6

The rest of the expected cell counts can be seen in app. 3a.


To test for independence, we conduct a Chi-squared test in which each cell shows how close the actual cell count falls to the expected cell count. For every cell we take the difference between the observed count and the expected count, square it, and divide it by the expected count. Following this we sum over all cells and find χ².

χ² = Σ (O − E)²/E

In this case χ² = 0.317 + 0.0729 + 0.4659 + 0.5583 + 0.1284 + 0.8206 = 2.363

P-value: In order to interpret the magnitude of this Chi squared value, we will find the P-value that tests the strength of evidence against the null hypothesis. We determine the degrees of freedom based on the number of rows (r) and columns (c) and look at the Chi-Squared Distribution with respect to our required significance level of 0.05.

df = (r − 1) ∗ (c − 1) → df = (2 − 1) ∗ (3 − 1) = 1 ∗ 2 = 2

JMP estimates a P-value of 0.31 with a Chi Square value of 2.33.

Conclusion: As the P-value of 0.31 is considerably higher than the significance level of 0.05, we cannot reject the null hypothesis of independence. We can thus not reject the hypothesis that gender and level of concern are independent, but we cannot fully accept it either.
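The test can be reproduced from a contingency table. The counts below are reconstructed from the totals and percentages quoted above (2960 clients: 1888 men, 1072 women; 888 Not Concerned in total), so they are an assumption to be checked against app. 3a; the per-cell contributions they yield match the six terms summed above.

```python
from scipy.stats import chi2_contingency

# Observed counts reconstructed from the totals and proportions in the text
# (rows: men, women; columns: Not / Somewhat / Very Concerned).
observed = [
    [553, 501, 834],   # men (n = 1888)
    [335, 294, 443],   # women (n = 1072)
]

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p:.3f}")
print(f"expected count, women & Not Concerned: {expected[1][0]:.1f}")
```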

Question 4: Is there a significant difference between the post-treatment profits of the treatment and the control group?

Assumptions: In order to answer this question, we assume (1) that the observations are independent, (2) that the data is normally distributed with mean μ and standard deviation σ, and (3) that the data is randomized.

We must check our assumption of equal standard deviations in the two groups before proceeding.

We thus construct distributions in JMP for the treatment and control group respectively (app. 4a

and 4b). We see that the standard deviations (SD) are approximately equal with a difference of

|0.49975–0.50393| = 0.00418. We test the assumption in JMP with a 2-sided F-test and receive a

P-value of 0.7595 (app. 4c). Consequently, we can confidently assume that the variances (and thus

the standard deviations) are approximately equal.

Hypotheses: To test whether there is a significant difference, we formulate two hypotheses:

• Null hypothesis H0: μ1 = μ2, there is no difference between the two population means.

• Alternative hypothesis Ha: μ1 ≠ μ2, there is a difference between the population means.


For this analysis the significance level is 0.05.

Test statistic (T-test): We compare the sample means of the two groups and find the mean difference:

Treatment group: x̄1 = 3.401

Control group: x̄2 = 3.426

Sample mean difference (MDS): |x̄1 − x̄2| = |3.401 − 3.426| = 0.025

We can use the MDS to estimate the difference between the two population means (µ1 – µ2) by calculating the standard error.

se = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ) ∗ √( 1/n1 + 1/n2 ), where n1 + n2 − 2 is the degrees of freedom (df = 2710).

se = √( ((1354 − 1) ∗ 0.4997534² + (1358 − 1) ∗ 0.5039278²) / (1354 + 1358 − 2) ) ∗ √( 1/1354 + 1/1358 ) = 0.0193

The t-test measures the number of standard errors that the parameter estimate falls from the null hypothesis value of the parameter. It is given by:

T = (estimate of parameter − null hypothesis value of parameter) / (standard error of parameter) = ((x̄1 − x̄2) − 0) / se

T = (3.4007792 − 3.4256487 − 0) / 0.019288 → T = −1.289 (in JMP: T = −1.290)

P-value: Looking in a t distribution table (app. 8d) with 2710 degrees of freedom and a t-value of 1.290, we find an approximation of the P-value; JMP gives the exact value: P = 0.1970 (see app. 4d for illustration).

Conclusion: As P = 0.1970 > 0.05, we cannot reject the null hypothesis that there is no difference in post-treatment profits between the treatment group and the control group. We cannot fully accept the null hypothesis either - we can only say that we cannot rule out that the treatment had no effect.
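The same pooled t-test can be sketched from the summary statistics reported above, using scipy's summary-statistics interface:

```python
from scipy.stats import ttest_ind_from_stats

# Pooled two-sample t-test from the group summaries quoted in the text.
t_stat, p_value = ttest_ind_from_stats(
    mean1=3.4007792, std1=0.4997534, nobs1=1354,   # treatment group
    mean2=3.4256487, std2=0.5039278, nobs2=1358,   # control group
    equal_var=True,   # the F-test above supported equal variances
)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

This reproduces JMP's T = −1.290 and P ≈ 0.197.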

Question 5: Are there any significant differences in pre-treatment profits depending on the level

of education (0-5)?

We use the analysis of variance method (ANOVA) to compare the pre-treatment profit means between the groups defined by the level of school attended (groups 0-5). Our response variable is “pre-treatment profits” and the explanatory variable is “attended_school”. The number of groups is 6 and the population means in the groups are denoted µ0, µ1, µ2, µ3, µ4, µ5.

Assumptions: We assume (1) independent random sampling, (2) normal distribution within each group and (3) equal standard deviations. As a rule of thumb, the test works well when the groups' SDs are within a factor of two of each other, and since this is the case here, we assume that the test works.

Hypotheses: We set up two hypotheses:

• Null hypothesis H0: µ1 = µ2 = ... = µg, equal population means for all groups

• Alternative hypothesis Ha: at least two µ differ

We plot the different group means in JMP for an overview of how the means tend to vary between and within the 6 groups (app. 5a). However, we need to compare the variances formally with an F-test.

Test statistic (F-test): The F-test compares the variability between the different groups with the variability within each group. The F statistic is the ratio between these two variabilities and indicates whether the group means are equal or unequal. The F statistic will be around 1 if H0 is true and larger under Ha. The F-test is one-sided only, because squaring the values means the statistic cannot be negative.

F = Between-groups variability / Within-groups variability

Mean square between the groups:

Between-groups variance estimate = ( n1(ȳ1 − ȳ)² + n2(ȳ2 − ȳ)² + ... + ng(ȳg − ȳ)² ) / (g − 1)

= ( 379(3.2134433 − 3.303015)² + 826(3.2610133 − 3.303015)² + 511(3.2691607 − 3.303015)² + 775(3.3548929 − 3.303015)² + 101(3.4982772 − 3.303015)² + 165(3.4616585 − 3.303015)² ) / (6 − 1)

= 3.03015

Mean square within the groups:

Within-group variance estimate = (s1² + s2² + ... + sg²) / g

= (0.4789731² + 0.5115549² + 0.5041083² + 0.4913135² + 0.4895688² + 0.49998²) / 6 = 0.246

F ratio = 3.03015 / 0.246 = 12.317 (in JMP: F ratio = 12.1847)


P-value: By use of df1 = g − 1 = 6 − 1 = 5 and df2 = N − g = 2756 − 6 = 2750, and the F-ratio, JMP derives a P-value < 0.0001 (app. 5b). This means that if H0 were true, there would be less than a 0.01% chance of getting an F test statistic value larger than or equal to the observed F value of 12.1847.

Since the p-value < 0.05, there is sufficient evidence to reject our null hypothesis (H0: µ1 = µ2 = ... = µg). We conclude that there is a significant difference among the pre-treatment profit means between the 6 groups defined by the level of school attended. However, the result of the F test does not tell us which groups are different or how different they are. We can address this issue by using multiple comparisons.

Multiple comparisons: We can estimate differences between population means with multiple comparison confidence intervals (CIs). When a CI does not contain 0, we can infer that the population means are different. The intervals show how different the means may be. In our case we have 6 groups and therefore g(g − 1)/2 = 6 ∗ 5/2 = 15 pairs of means to compare.

If using separate CIs, an error probability of 0.05 applies to each comparison, which would lead to 15 ∗ 0.05 = 0.75 expected errors across the 15 comparisons, i.e. intervals not containing the true difference of the means. We need to construct the intervals so that they hold within an overall confidence level of 95%. Consequently, we use the Tukey method, which compares pairs of means with a confidence level that applies to the entire set of comparisons rather than to each separate comparison. This gives us wider intervals.

Looking at the Tukey report (app. 5c), we see that the P-value of the comparison of means between group 2 and group 0 is 0.56 (and the CI contains 0), which suggests that there is no significant difference between the means at the two levels of education. In contrast, there is a significant difference between group 3 and group 2, where the p-value is 0.03, which implies that reaching the third level of education has an impact on the mean. However, the levels after level 3 do not seem to have a significant impact on the means of lprofit_pre. Thus, we see two sub-groups within the means (A and B, as seen in the table).

Note that the Tukey report produces a two-sided test and that we therefore must refer to the table

of means for Oneway Anova (app. 5d) to make inferences. Looking at the table, we see that pre-

treatment profit means are higher for groups having a higher level of education.

Our Tukey report suggests that within the first three levels of education (sub-group B) there is no significant impact on the lprofit_pre means, as their p-values are above 0.05, whereas going from sub-group B to the last three levels of education (sub-group A) has a significant impact on the lprofit_pre means.

Conclusion: We conclude that the pre-treatment profit means differ significantly between the 6 groups defined by levels of education. We reject H0 (µ1 = µ2 = ... = µg). We compared all 15 different pairs of means to see how they differ. The test suggests that the education level influences the pre-treatment profits to a certain extent, i.e. there are significant differences in pre-treatment profit means depending on the levels of education, but only between the two sub-groups A and B.
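The between- and within-group calculation above can be reproduced directly from the group summaries quoted in the text. Note that the simple average of the group variances used here (as in the hand calculation) ignores the unequal group sizes, which is why JMP's df-weighted F ratio (12.1847) differs slightly.

```python
import numpy as np

# Per-group summaries from the text (app. 5a/5d): sizes, means, SDs.
n = np.array([379, 826, 511, 775, 101, 165])
means = np.array([3.2134433, 3.2610133, 3.2691607,
                  3.3548929, 3.4982772, 3.4616585])
sds = np.array([0.4789731, 0.5115549, 0.5041083,
                0.4913135, 0.4895688, 0.49998])

grand_mean = 3.303015      # overall mean, from the text
g = len(n)

between = np.sum(n * (means - grand_mean) ** 2) / (g - 1)   # MS between groups
within = np.sum(sds ** 2) / g                               # simple average, as in the text
f_ratio = between / within
print(f"between = {between:.3f}, within = {within:.3f}, F = {f_ratio:.2f}")
```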

Question 6: Fit a simple linear regression relating the post-treatment profits to the post-treatment revenues. State a confidence interval for the slope of the regression line. Discuss whether the assumptions of the model hold, and extend the model to a polynomial regression.

Simple linear regression: We wish to find out if there is a relationship between the quantitative response variable (y), post-treatment profits, and the quantitative explanatory variable (x), post-treatment revenues, by conducting a linear regression analysis. The linear regression model µy = α + βx, with y-intercept α and slope β, uses a straight line to approximate the relationship between x and the mean µy. We want to find out whether the population mean of the post-treatment profits depends on the post-treatment revenues.

Assumptions: (1) The basic assumption for using a regression line for description is that the population means of y at different values of x have a straight-line relation with x (µy = α + βx), i.e. the mean of the residuals at each x value should be 0. In order to make statistical inferences in regression analysis, two additional assumptions must be made: (2) the data must be collected using randomization and (3) the population values of y at each value of x have a normal distribution, with the same standard deviation at each x value (i.e. the residuals have constant variance). Whether or not these assumptions hold will be discussed later on.

Our regression line uses the following sample prediction equation: ŷ = a + bx. From the given data we have constructed a scatter plot (app. 6a), and the regression line derived from JMP is ŷ = 0.17672880 + 0.8253541x. In the linear regression model, the slope indicates whether the association is positive or negative and describes the trend. Here we see a linear trend with a positive association.

However, the slope does not describe the strength of the association, and in order to measure this we need to look at the correlation. We look at the r² measure, which describes the relative improvement from predicting y using the prediction equation ŷ rather than predicting y by using the sample mean ȳ. The closer r² is to 1, the stronger the linear association. JMP gives us an r² value of 0.749 (app. 6b), which indicates a moderately strong linear association.

To find out whether the population mean of the post-treatment profits depends on the post-treatment revenues, we test our null hypothesis H0: β = 0 (x and y are statistically independent) against the alternative hypothesis Ha: β ≠ 0. We use a 5% significance level and the test statistic t = (b − 0)/se. From the t-score we can get a P-value: the smaller the P-value, the greater the evidence of linear association. From JMP we derive a t test statistic of 89.86 and a P-value of 0.0001, which is < 0.05. Consequently, we can reject H0; x and y are not statistically independent.

The small P-value (0.0001) suggests that the population regression line doesn't have a slope of 0. In order to see how far the population slope β actually falls from 0, we construct a 95% confidence interval for β using the formula b ± t.025(se). The critical t-score is 1.96 (rounded to 2 below), which we can derive from the table of critical values of the t distribution (app. 8d) when knowing the degrees of freedom (df = n − 2 = 2710 − 2 = 2708). Furthermore, we find the following standard error in JMP: se = 0.009185. We now construct the interval:

b ± t.025 ∗ se = 0.8253541 ± 2 ∗ 0.009185 = (0.806984, 0.8437241)

We are 95% confident that the population slope β yields a value between 0.806984 and 0.8437241. Thus, on average, the profit will increase by between 0.806984 and 0.8437241 for each one-unit increase in x.
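The slope test and interval can be recomputed from the JMP estimates (b = 0.8253541, se = 0.009185). The text multiplies by a critical value rounded to 2; with 1.96, as below, the interval is marginally narrower.

```python
# 95% CI for the regression slope, from the estimates reported by JMP above.
b, se, n = 0.8253541, 0.009185, 2710
df = n - 2                      # 2708 degrees of freedom

# For df this large the t critical value is essentially the normal 1.96.
t_crit = 1.96
ci = (b - t_crit * se, b + t_crit * se)
t_stat = b / se                 # test of H0: beta = 0
print(f"t = {t_stat:.2f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```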

Discussion of assumptions: First of all, the assumption regarding randomization holds, since the data provided in the exam case was collected randomly. However, we must evaluate the last two assumptions further: that the data fit a linear regression model, µy = α + βx, and that the population values of y at each value of x follow a normal distribution with a constant standard deviation at each x value.

We do this by plotting the residuals of the model against the regression line. The figure we receive (app. 6c) shows the variation of the residuals for the linear regression and indicates that the residuals are more or less constant when x increases, which means that they have a constant standard deviation. It is only around the horizontal value 4.6 that we observe some outliers.

Using a Q-Q plot we check whether the residuals are normally distributed (app. 6d). The graph illustrates that the data is not exactly normally distributed, indicating that the linear regression model, µy = α + βx, may not necessarily be the best fit. The histogram is slightly skewed to the right and the Normal Quantile Plot shows an inverse 'hammock' shape. Therefore we will extend our analysis to check whether a polynomial model of degree 2 will fit our data better.

Polynomial regression: In order to analyse whether a polynomial model will be a better fit, we conduct a polynomial fit of degree 2 in JMP. It has the following equation: µy = α + βx + Cx², where the highest-order term is denoted by the letter C (app. 6e).

First of all, we can compare the r² for the linear regression and for the polynomial model to see whether the polynomial fit is more appropriate. The r² for the linear regression is 0.74887, whereas it is 0.776989 for the polynomial regression. This indicates that there is a slightly stronger association between the variables for the polynomial regression.

In the following, we check the assumptions about normality and constant standard deviation applied to the polynomial regression. We use a Q-Q plot to look for normality (app. 6g). The Normal Quantile Plot indicates that the distribution of the data is more normal than for the linear regression, since the data follow the red line better.

Now we look at the deviation of the residuals by plotting them around a straight line at 0 (app. 6h). The residuals remain reasonably constant from 2 to 4 on the horizontal axis, but around 4.1 the residuals deviate a bit. However, we still observe a more or less constant standard deviation.

Furthermore, we draw attention to the second-order term estimate (C), which is −0.191702 (app. 6i). We conduct a significance test to investigate whether the quadratic term contributes to the model, i.e. whether the polynomial fit is better than the linear regression. The null hypothesis is H0: C = 0 (since that would give us the linear regression) and the alternative hypothesis is Ha: C ≠ 0. From JMP we get a t-score of −18.47. This gives a P-value of 0.0001, which means that we can reject the null hypothesis and conclude that the quadratic term is significant. Moreover, we can look at the CI for the C estimate, which ranges between (−0.212049, −0.171356), and since 0 does not occur in the interval, it supports our rejection of the null hypothesis (H0: C = 0).

Conclusion: We can conclude that the polynomial regression is better than the linear regression at describing the relation between the post-treatment revenues and the post-treatment profits, since the r² was higher, the data was more normally distributed and we could reject H0: C = 0.
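The linear-versus-quadratic comparison can be sketched as follows. The exam data set is not reproduced here, so the data below are synthetic with mild curvature built in; only the mechanics (fit both models, compare r²) carry over.

```python
import numpy as np

# Compare a linear and a degree-2 polynomial fit, as done above in JMP.
# Synthetic stand-in data: roughly linear with a mild quadratic bend.
rng = np.random.default_rng(0)
x = rng.uniform(2, 5, 500)
y = 0.18 + 0.83 * x - 0.19 * (x - 3.5) ** 2 + rng.normal(0, 0.3, 500)

def r_squared(y, y_hat):
    """Proportional reduction in squared error versus the sample mean."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

lin = np.polyfit(x, y, 1)        # least-squares line
quad = np.polyfit(x, y, 2)       # least-squares quadratic
r2_lin = r_squared(y, np.polyval(lin, x))
r2_quad = r_squared(y, np.polyval(quad, x))
print(f"r2 linear = {r2_lin:.3f}, r2 quadratic = {r2_quad:.3f}")
```

Because the linear model is nested in the quadratic one, r² for the quadratic fit can never be lower; the question, as in the text, is whether the quadratic term is significant.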


Question 7: Fit a multiple linear regression with response variable lprofit_post and predictors lprofit_pre, treat, female, and attended_school. Discuss the model assumptions. Report the significance of each term (P-value in effect summary) and reduce the model if possible.

A multiple linear regression describes the relation between a quantitative response variable y and multiple categorical or quantitative explanatory variables xn. The model allows us to investigate the combined effect of all explanatory variables at once as well as the individual effect of each xn while controlling for all other variables. We generalise the bivariate regression equation to one of multiple regression and find a sample prediction equation from there.

µy = α + β1x1 + β2x2 + β3x3 + β4x4 → ŷ = a + b1x1 + b2x2 + b3x3 + b4x4,

where x1 = pre-treatment profit, x2 = group status, x3 = gender and x4 = level of education.

We fit the model to see how post-treatment profits respond to the variables listed above. Afterwards, we seek to optimise this model. We use software to estimate the multiple regression equation with a prediction equation generated by the method of least squares, which plots the predicted values of the response variable against its actual values (app. 7a). JMP reports an adjusted R²-value of 0.391. This means that using ŷ, i.e. the four explanatory variables, to predict post-treatment profits (y) reduces the prediction error by roughly 39.1% compared to using the sample mean ȳ alone to predict y.

Testing model assumptions: We use models to make inferences, but in order to do so we test the model by scrutinising its assumptions: (a) the data set is characterised by randomization, (b) y is normally distributed with a constant standard deviation at all combinations of predictors and (c) the model approximates the true relationship between ŷ and each explanatory variable. Furthermore, it is crucial to note the number of explanatory variables: R² only increases as explanatory variables are added, which can skew the result. A rule of thumb is that the sample size should be at least 10 times the number of explanatory variables. This rule is not violated in our case.

1. The data is characterised by randomisation and the sample is large: As we have not obtained the data ourselves, we refer to our exam set, where we are led to understand that the data has been obtained randomly. Thus this assumption for the MLR model holds.

2. y is normally distributed with a constant standard deviation at all combinations of predictors: For y to be normally distributed, regardless of the fixed values of the explanatory variables, the residuals of the model must be normally distributed. We test this by plotting a distribution of the residuals and testing for normality with a Normal Quantile Plot (app. 7b). Firstly, the mean is approx. zero (0.00003), the median is 0.075 and the standard deviation is very close to 1. Moreover, nearly all standardised residuals fall between −2 and +2, with next to no observations beyond ±3. Conducting a Normal Quantile Plot, we can further see that the residuals approximately follow the straight line, meaning that the distribution of residuals tends to a normal distribution with a slight skew to the left.

We must also test that the standard deviation does not vary around ŷ, such that the model exhibits homoscedasticity. From plotting the residuals by the regression line, we can infer that the distribution of residuals around the regression line exhibits homoscedasticity (app. 7c). There is no clear pattern signifying heteroscedasticity. This assumption can thus be said to hold true.

3. The model approximates the true relationship between ŷ and each explanatory variable: This assumption tests if the model is actually additive, such that the effect of one explanatory variable stays constant regardless of the value of the other variables. We test this by plotting the residuals of the model against each explanatory variable (app. 7d). We find that for lprofit_pre the residuals are scattered randomly around 0 with no observable pattern. For the categorical variables we create box plots and note that the interquartile range is similar for all 3 variables and their categories. Similarly, their whiskers do not differ significantly.

However, box plots cannot account for the cross effects of the explanatory variables. We therefore create a multiple linear regression model in JMP that includes all interaction terms between the explanatory variables (e.g. female*treat) to be able to see potential non-parallel relationships. In the prediction profiler we see that the explanatory variables are not independent, as their slopes change when one of the other variables changes. In order to determine the significance of these links, we turn to the effect summary (app. 7e) to determine which links would improve our model: we see only one significant cross-link, attended_school*lprofit_pre, whose P-value is below 0.05. It is crucial that we take this into account when forming our final model, as omitting a significant interaction can cause large errors in the prediction model.

From our tests, we can infer that the greater part of the assumptions hold true. The model is therefore appropriate to use as a predictor of post-treatment profits, keeping in mind the cross-link between school level and pre-treatment profit.

Test statistic (F-test): In order to report the significance of each term, we check whether the four explanatory variables in our multiple linear regression model have a significant individual effect on the response variable, post-treatment profits. We start by setting up a null and an alternative hypothesis:

o Null hypothesis (H0): β1 = β2 = β3 = β4 = 0; x and y are statistically independent

o Alternative hypothesis (HA): at least one βn ≠ 0; x and y are statistically dependent

We apply a significance level of 0.05.

Test statistic and p-values: We conduct an F-test and find the P-values for the variables. A large F-value, and thus a small P-value, provides evidence against the null hypothesis of independence. To test H0 we use the test statistic:

F = Mean square (regression) / Mean square (error) = (SSR / df1) / (SSE / df2)

Effect tests are more appropriate than t-tests, which test the significance of each individual category within a variable, as we are dealing with categorical variables with more than two categories (e.g. attended school has 6 categories).

We thus conduct an effect test in JMP and find the following results (app. 7f):

Pre-treatment profit: F-value = 712.29, P-value < 0.0001
Female: F-value = 244.49, P-value < 0.0001
Treat: F-value = 1.22, P-value = 0.2695
Attended school: F-value = 0.9244, P-value = 0.4639

When the F-ratio yields a value around or below 1, we fail to reject the null hypothesis; the larger F is, the more evidence there is against the null hypothesis. Likewise, if a P-value is below 0.05, we can reject the null hypothesis that x and y are statistically independent. For example, as F = 712.2882 for pre-treatment profit, there is strong evidence against the null hypothesis stating that x and y are statistically independent; this F-value yields a P-value of less than 0.0001, further undermining the null hypothesis.

The P-values are the right-tail probability of obtaining a result at least as large as the observed F-value from the F-distribution, in which df1 denotes the number of explanatory parameters being tested and df2 the sample size minus the number of model parameters (with n = 2538 and four explanatory variables, df2 = 2538 − 5 = 2533 for the overall test). According to these P-values, we can reject the null hypothesis that all parameters equal 0, as both 'female' and 'lprofit_pre' have large F-values and P-values below 0.05 and are thus not independent of the response variable y. We must now determine whether all variables are significant for our prediction model.
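The right-tail probability described here can be computed directly from the F-distribution. The sketch below uses the reported F-value for pre-treatment profit with illustrative degrees of freedom (the JMP effect test uses its own df per term):

```python
from scipy import stats

# Reported F-value for lprofit_pre; df1 and df2 are illustrative here
# (df2 = n - k - 1 = 2538 - 4 - 1 under the textbook overall F-test).
f_value, df1, df2 = 712.29, 4, 2533
p_value = stats.f.sf(f_value, df1, df2)  # right-tail probability
print(p_value < 0.0001)  # prints True, matching the reported P < 0.0001
```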

Midway conclusion: The F-values for pre-treatment profits and gender are far above 1 and the P-values are below our threshold of 0.05, and we can thus conclude that both variables have a significant impact on post-treatment profits. Contrastingly, group status and educational level have F-values of approximately 1, which yield P-values of 0.2695 and 0.4639 respectively, far above our threshold of 0.05. Therefore, we can reject that pre-treatment profit and gender are independent of post-treatment profit, whereas we cannot reject that level of education and group status are independent of post-treatment profit.

Aside from the F-test, we analyse the value of R2: Using all four variables we get R2 = 0.3926, whereas using only gender and pre-treatment profit as explanatory variables, R2 = 0.3912 (app. 7g). This signifies that even after removing the two variables we still explain approximately 39% of the variability; group status and educational level have no significant impact on predicting post-treatment profit. Had we instead removed either gender or pre-treatment profit, R2 would decrease and the model would only explain 33.38% (removing gender) or 21.68% (removing pre-treatment profit) of the variability.
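The pattern described here, where dropping an irrelevant predictor barely moves R2, can be reproduced on simulated data. The variable names below are stand-ins for lprofit_pre, female and treat; none of the numbers come from the exam data set:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2538
x1 = rng.normal(size=n)            # stand-in for lprofit_pre (relevant)
x2 = rng.integers(0, 2, size=n)    # stand-in for female (relevant)
x3 = rng.integers(0, 2, size=n)    # stand-in for treat (irrelevant here)
y = 0.6 * x1 + 0.4 * x2 + rng.normal(scale=1.0, size=n)

def r_squared(columns, y):
    """R2 of an OLS fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_full = r_squared([x1, x2, x3], y)
r2_reduced = r_squared([x1, x2], y)
print(round(r2_full - r2_reduced, 4))  # near zero: x3 adds almost nothing
```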

Reducing the model and concluding: From the above calculations and model fitting it is clear that neither the level of education nor the group status is significant for predicting post-treatment profit. We therefore exclude these two variables from the multiple regression, such that:

ŷ = a + b1x1 + b2x2 + b3x3 + b4x4 → ŷ = a + b1x1 + b2x2

We are thus left with two explanatory variables: lprofit_pre and female. However, we must remember that when testing the model assumptions, we found that the cross between attended_school and lprofit_pre had a significant impact on the predictive power of our model, and we should thus add this term to the equation. We end up with a model with three explanatory terms: one quantitative (pre-treatment profits), one categorical (gender) and one interaction (education crossed with pre-treatment profits).
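As a sketch, the final prediction equation could be coded as below. All coefficient values are hypothetical placeholders (the real estimates come from the JMP fit), and the education × pre-profit interaction is simplified to a single numeric product, whereas JMP would estimate a separate slope per school level:

```python
def predict_lprofit_post(lprofit_pre, female, school_x_pre,
                         a=0.5, b1=0.6, b2=-0.3, b3=0.05):
    """ŷ = a + b1·lprofit_pre + b2·female + b3·(attended_school × lprofit_pre).

    All coefficient values are hypothetical placeholders, not estimates
    from the exam data set.
    """
    return a + b1 * lprofit_pre + b2 * female + b3 * school_x_pre

# Example: a woman with lprofit_pre = 2.0 and school level 3 (interaction 3 * 2.0).
print(predict_lprofit_post(2.0, 1, 6.0))
```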

APPENDICES

Appendix 1a: On the right: Contingency table over members of the treatment group purchasing (1) and not purchasing (0) the EPP. On the left: the contingency table illustrated.

Appendix 1b: Confidence interval of proportion of clients in the treatment group purchasing the EPP

Appendix 2a: Contingency table over men (0) and women (1) in the treatment group purchasing (1) and not purchasing (0) the EPP.

Appendix 2b: Illustration of p-value: the tail probabilities in a Z distribution


Appendix 2c: 95% confidence interval for the difference between the two population proportions p1-p2:

Appendix 3a: Contingency table of males' (0) and females' (1) level of concern about the crisis.

Appendix 3b: Pie chart illustration of males' (0) and females' (1) level of concern about the crisis

Appendix 3c: Illustration of the probability table for Chi-squared values. For table of values see 8a.

http://statwiki.ucdavis.edu/@api/deki/files/147/5a0c7bbacb4242555e8a85c9767c03ee.jpg?revision=1


Appendix 4a: Distribution of post-treatment profits for the treatment group

Appendix 4b: Distribution of post-treatment profits for the control group

Appendix 4c: Two-sided F-test confirming that variances are equal.


Appendix 4d: T-distribution of the significance of the difference between the two population means

Appendix 5a: Distribution of various levels of attended school by pre-treatment profit.


Appendix 5b: Table representing a difference report using the Tukey method

Appendix 5c: Table of means for One-Way ANOVA for inferences about the influence of level of education on pre-treatment profits

Appendix 6a: Scatterplot and regression line for post-treatment profits and post-treatment revenues.

Appendix 6b: Summary of fit of the straight line regression model

Appendix 6c: Standard deviation of the residuals for the linear regression


Appendix 6d: Q-Q test of normality of residuals for linear regression

Appendix 6e: Illustration of polynomial model with a fit of 2df.

Appendix 6f: Summary of fit for the polynomial regression model


Appendix 6g: Q-Q test of normality of residuals for polynomial regression model

Appendix 6h: Deviation of the residuals for the polynomial regression

Appendix 6i: Parameter estimates for the polynomial regression model.


Appendix 7a: Multiple linear regression model with response variable lprofit_post and predictors lprofit_pre, treat, female, and attended_school

Appendix 7b: Plot of the distribution of the residuals and Q-Q test for normality

Appendix 7c: Deviation of the residuals around the regression line

Appendix 7d: Residuals plotted against the explanatory variables, testing for potential issues with the regression model.

Appendix 7e: Effect of each explanatory variable and their cross-links on the multiple linear regression model.

Appendix 7f: F-tests to test for variable independence

Appendix 7g: Summary of fit for the multiple regression line for 4 and 2 variables respectively


Appendix 8a: Provided by CBS lecturer.

Appendix 8b: Chi-squared distribution for values of right-tail probabilities.

Appendix 8c: F-distribution for values of right-tail probability = 0.05. df1 represents the horizontal (column) values and df2 the vertical (row) values.

Appendix 8d: t-distribution critical values