Hypothesis Testing

Testing your beliefs

Ultimately, in most research, the aim is to investigate whether the data support a particular hypothesis, or whether there is evidence to reject it.

For example: How do we know that male students tend to weigh more than female students?

How do we know that smoking is associated with increased risk of lung cancer?

How do we know that possessing a particular genetic profile means someone is more likely to be obese than others with a different genetic profile?

1. Data checking, identifying problems and characteristics

2. Understanding chance and uncertainty

3. How will the data for one attribute behave, in a theoretical framework?

4. Theoretical framework assumes complete information, need to address uncertainties in real data

5. Testing your beliefs, do the data support what you think is true?

Data exploration and Statistical analysis

• Data

• Data exploration: categorical / numerical outcomes

• Estimation of parameters, quantifying uncertainty

• Hypothesis testing: parametric tests (t-tests, ANOVA, test of proportions)

• Model each outcome with a theoretical distribution

Hypothesis Testing

• Null hypothesis: a statement of the status quo, or of no change.

• Alternative hypothesis: the hypothesis which the researcher wishes to investigate.

• Commonly, the alternative hypothesis is formulated first, and the null hypothesis is the negation of the alternative hypothesis.

A woman buys a pregnancy test kit, and is interested to find out whether she is pregnant.

The null hypothesis in this case (status quo), is that she is not pregnant.

The alternative hypothesis (hypothesis of interest), is that she is pregnant.

Test kit may show:

+ve: indicating there is evidence to suggest pregnancy

–ve: indicating lack of evidence to suggest pregnancy

Pregnancy Test Kit

The test kit may either be accurate, or inaccurate.

                       Actually pregnant          Actually not pregnant
Test kit shows +ve     Correct +ve diagnosis      Incorrect +ve diagnosis
Test kit shows –ve     Incorrect –ve diagnosis    Correct –ve diagnosis

Types of Errors

• Type 1 error: false +ve conclusion (+ve when the woman is in fact not pregnant). The probability of making this error is the significance level of the test.

• Type 2 error: false –ve conclusion (–ve when the woman is in fact pregnant).

• Power (sensitivity): true +ve conclusion (+ve when the woman is in fact pregnant).

• Specificity: true –ve conclusion (–ve when the woman is in fact not pregnant).

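P-values

• The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. The threshold below which a p-value is declared significant (commonly 0.05) is the significance level of the test, which is the probability of a false positive result when the null hypothesis is true.

• If the p-value is small, we are more confident that the null hypothesis can be rejected.

• With a threshold of 0.05, on average we expect 1 false positive result out of every 20 tests for which the null hypothesis is true.

• So if we perform a large study with 1 million variables, none of which has a genuine effect, on average we expect about 50,000 variables to display p-values of < 0.05!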
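The arithmetic behind the "1 in 20" and "50,000 in 1 million" figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the simulation assumes that p-values are uniformly distributed between 0 and 1 when the null hypothesis is true (which holds for continuous test statistics).

```python
import random


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of p-values below alpha when every null hypothesis is true."""
    return n_tests * alpha


def simulated_false_positive_fraction(n_tests, alpha=0.05, seed=1):
    """Draw null p-values (assumed uniform on [0, 1]) and return the fraction below alpha."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_tests) if rng.random() < alpha)
    return hits / n_tests


print(expected_false_positives(20))                # about 1 false positive in 20 tests
print(expected_false_positives(1_000_000))         # about 50,000 in 1 million tests
print(simulated_false_positive_fraction(100_000))  # close to 0.05
```

The simulated fraction should settle near the threshold itself, which is why lowering the threshold is the natural way to control false positives.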

Statistical tests for comparing means

One of the most common statistical tests in biomedical sciences is the comparison of averages.

Let’s revisit Example 1 from the previous lecture.

Example 1: The Science Faculty is interested in comparing the weights of male and female students in NUS.

Recall we:
- Randomly sample 200 male students and 200 female students and measure their weights.
- Calculate the mean weight of these 200 male students, and use this quantity to estimate the mean weight of all the male students in NUS.
- Similarly calculate the mean weight of these 200 female students and use this to estimate the mean weight of all the female students in NUS.

So here we have the mean weights for male and female students. Can we compare these values, after accounting for the uncertainties in the estimation, and quantify the statistical evidence for observing a difference?

Test statistic

Test statistic: A numerical quantification of the amount of evidence against the null hypothesis.

Usually takes the form of:

Test statistic = (Observed summary value – Hypothesized summary value) / (Standard error of observed summary value)

Or, in symbols for a mean: t = (x̄ – μ0) / SE(x̄)
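As a rough illustration of the form above, the following Python sketch computes the test statistic for a sample mean. The function name and the weights are made up for illustration; the standard error is taken as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """(observed mean - hypothesized mean) / standard error of the observed mean."""
    n = len(values)
    observed_mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (observed_mean - hypothesized_mean) / standard_error


# Hypothetical weights (kg), not data from the lecture
weights = [61.0, 66.5, 70.2, 64.8, 68.1, 72.4, 59.9, 67.3]
t = one_sample_t(weights, 63.0)
print(t)  # positive, because the sample mean is above 63
```

A larger absolute value of the statistic means the data sit further from the null hypothesis relative to the uncertainty in the estimate.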

Degrees of freedom

Number of independent observations that are allowed to take any values.

For example: In an assessment of whether the mean weight of 120 male students in the Science faculty exceeds 70kg, we actually have 120 independent observations.

However, we need to estimate the mean weight of these 120 students, thus the degrees of freedom remaining = 120 – 1 = 119.

Another way to think about it: Upon collecting the weight of 120 students, IF I know the average weight, then only the weights of 119 students are allowed to ‘vary’ as the weight of the last person must be the specific value to yield the average weight that I know of.

Degrees of freedom (df) = Number of independent observations (usually the sample size) – Number of estimated parameters
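The "last person's weight is fixed" argument can be checked numerically. This is a small sketch with hypothetical function names and made-up numbers: once the mean and all but one of the values are known, the remaining value is determined.

```python
def degrees_of_freedom(n_observations, n_estimated_parameters):
    """Independent observations minus the number of parameters estimated from them."""
    return n_observations - n_estimated_parameters


def last_value_given_mean(known_values, mean, n):
    """The n-th value is forced once the mean and the other n - 1 values are known."""
    return mean * n - sum(known_values)


print(degrees_of_freedom(120, 1))                   # the 120-student example
print(last_value_given_mean([60.0, 70.0], 65.0, 3))  # third value must bring the mean to 65
```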

t-tests

1-Sample t-test: Useful if we are keen to compare a collection of values against a hypothesized mean value.

For example: Previous surveys from the 1980s found that the mean weight of male students in the Science Faculty was 63kg. The lecturer of ST1232 believes that this figure is likely to be an under-estimate for male students in 2010. Design an experiment to test this hypothesis.

- Randomly sample 200 male students from the Science Faculty, and measure their weights.
- Calculate the mean weight of these 200 students and the corresponding standard error of the mean, and see whether this is significantly higher than 63kg.

Identifying the hypotheses

In hypothesis testing, it is extremely important to identify the hypotheses that are being tested, since this affects the calculation of the statistical evidence.

Null hypothesis, H0: Mean weight, μ = 63
Alternative hypothesis, H1: Mean weight, μ > 63

But there are two other possible alternative hypotheses, each giving a different outcome in the estimation of statistical evidence against the null hypothesis.

To see whether the weight of current students differs from 63kg: H1: Mean weight, μ ≠ 63

To see whether the weight of current students is lighter than 63kg: H1: Mean weight, μ < 63

One-sided versus two-sided tests

Remember the test statistic:

Test statistic = (Value estimated from the data – Value under the null hypothesis) / (Standard error of the estimate)

- The numerator measures how different the data are from the null hypothesis.
- The denominator measures the uncertainty of the estimation.


Two-sided alternative hypothesis: Unbiased test, without assuming prior knowledge or expectation of the direction of the effect. So the difference could be greater or smaller than the test value. Example of an alternative hypothesis: there is a difference between the height of men and women.

One-sided alternative hypothesis: Biased test, where the direction of the effect, if genuine, is known. Examples: (i) men are taller than women; (ii) the weight loss pill successfully reduces weight.


Interpreting statistical evidence: P-values are given by the probability in the shaded areas, which represent the probability of obtaining a test statistic at least as extreme as that from the observed data, under the null hypothesis.

SPSS calculates the two-tailed p-value by default, so a little more work is needed to obtain the one-tail p-value.

[Figure: shaded tail areas for two-sided tests and one-sided tests]


Converting a two-tailed p-value to a one-tail p-value:

- If the observed effect (or test statistic) is in the same direction as the alternative hypothesis, halve the two-tailed p-value.

- If the observed effect (or test statistic) is in the opposite direction to the alternative hypothesis, one-tail p-value = 1 – 0.5 × (two-tailed p-value).

Example 1: A pharmaceutical company is interested in testing a new weight-loss treatment, and recruits 120 volunteers to take part in a research trial. The weights of the participants are measured before and after taking the weight-loss treatment for the prescribed duration.

- Null hypothesis here is that the weight-loss treatment has no effect (or average difference in weight = 0)
- Alternative hypothesis here is that the weight-loss treatment is effective (or average difference in weight < 0)

Suppose the evidence obtained yields a two-sided p-value of 0.03, and the average difference (after – before) is -3.5kg. What is the correct one-tail p-value for the trial?

One-tail p-value = 0.03 / 2 = 0.015 (since the observed effect is in the same direction as the alternative hypothesis)

If, however, the average difference (after – before) is 3.5kg, the one-tail p-value will instead be 1 – 0.03 / 2 = 0.985. This intuitively makes sense: an average weight gain provides no evidence that the treatment reduces weight.
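The conversion rule can be written as a small helper. This is a sketch with a hypothetical function name; it assumes the test statistic has a symmetric distribution under the null hypothesis (as the t distribution does), so that the two tails carry equal probability.

```python
def one_tailed_p(two_tailed_p, effect_matches_alternative):
    """Convert a two-tailed p-value to a one-tail p-value.

    effect_matches_alternative: True if the observed effect is in the
    direction stated by the alternative hypothesis.
    """
    half = two_tailed_p / 2
    return half if effect_matches_alternative else 1 - half


# Weight-loss trial: H1 is "average difference < 0"
print(one_tailed_p(0.03, effect_matches_alternative=True))   # observed difference of -3.5kg
print(one_tailed_p(0.03, effect_matches_alternative=False))  # observed difference of +3.5kg
```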

Back to t-tests

1-Sample t-test: Useful if we are keen to compare a collection of values against a hypothesized mean value.

Null hypothesis: Mean = hypothesized value

ASSUMPTIONS
- Data is normally distributed (theoretical assumption)
- Data is symmetrically distributed (practical application)
- Observations made are all independent

Two independent samples t-test

For comparing the means between two groups.

Null hypothesis: Mean of group 1 = Mean of group 2
Or, effectively: Difference in means = 0

Example: Comparing the weights between male and female students in Science faculty.

ASSUMPTIONS
- Data is normally (symmetrically) distributed within each group
- Observations made within each group are all independent
- Observations are also independent across the groups

Paired-sample t-test

For comparing the difference within each pair of observations

Null hypothesis: No difference between the observations in each pairing
Or, effectively: Difference within each pairing = 0

Example: Comparing the efficacy of a diet treatment, thus comparing the weight of an individual before and after the treatment.

ASSUMPTIONS
- Difference within each pairing follows a Normal (symmetric) distribution
- Independence between pairs of observations

Two independent samples vs paired-sample t-test

Is there any difference between these two? Mathematically they seem similar.

The main difference is in the calculation of the denominators:

- Two independent samples t-test: the denominator takes into account the uncertainty of two sets of outcomes.

- Paired-sample t-test: first calculate the difference within each pairing, then calculate the standard error of the string of differences.
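To make the contrast between the two denominators concrete, here is a minimal Python sketch. The function names and the before/after weights are invented for illustration, and the two-sample standard error is written in the unpooled form (not assuming equal variances), which is an assumption since the original formulas are not shown.

```python
import math
import statistics


def se_two_independent(group1, group2):
    """Combines the uncertainty of two separate sets of outcomes (unpooled form)."""
    return math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )


def se_paired(before, after):
    """Standard error of the string of within-pair differences."""
    differences = [a - b for a, b in zip(after, before)]
    return statistics.stdev(differences) / math.sqrt(len(differences))


# Hypothetical weights (kg) of the same five people before and after a treatment
before = [82.0, 75.5, 91.2, 68.4, 79.9]
after = [80.1, 73.0, 88.9, 66.0, 78.2]

print(se_two_independent(before, after))  # large: people differ a lot from each other
print(se_paired(before, after))           # small: each person changes by a similar amount
```

When the same individuals are measured twice, the paired denominator removes the person-to-person variation, which is why treating paired data as two independent samples can hide a genuine effect.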

More than 2 groups

Suppose we are interested in comparing the means between three groups, what can we do?

(A) Perform 3 sets of two-independent samples t-tests (between groups 1 and 2; groups 2 and 3; groups 1 and 3)

(B) Perform a ‘global’ test, checking whether the means are all the same.

Analysis of Variance (ANOVA)

- A statistical test for comparing the means of multiple independent groups
- Tests the null hypothesis that the means of ALL the groups are identical

Assumptions
- Data within each group is normally (symmetrically) distributed
- Independent observations within each group
- Independent observations between the groups

ANOVA

Analysis of variance (ANOVA) is an interesting name: we are effectively analysing the variance to decide whether there are any differences in the means!

Null hypothesis: Means of all the groups are identical
Alternative hypothesis: At least one of the groups has a different mean

So, observing a significant p-value in this instance means that at least one group is different, but we don’t actually know which group that is!
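One way to see how "analysing the variance" tests the means is to compute the standard one-way ANOVA F statistic by hand: the variation between the group means divided by the variation within the groups. The sketch below uses a hypothetical function name and toy numbers; obtaining a p-value would additionally require the F distribution, which is not included here.

```python
import statistics


def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [statistics.fmean(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy data for three groups, not from the lecture
print(anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]))
```

If all group means are identical, the between-group variation is zero and F is zero; the further apart the means are relative to the spread within groups, the larger F becomes.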

Practical Example

• Previous studies suggest that restricting caloric intake can increase life expectancy.

• Perform an experiment with mice, each randomly assigned to one of six diet treatments.

• Measure the lifetime of each mouse (in months).

Experimental Design

Visualising the Data

Practical Example

Research Questions

• Is there any difference in life expectancy across the different diet treatments?

• If there is, which diet treatments contribute to this difference?

• Which diet treatment significantly increases life expectancy?

Practical Example

ANOVA

Significant differences! But which treatment?

Q: Is there any difference in life expectancy across the different diet treatments?

Consider the following null hypothesis: The mean life expectancies of all six groups are identical.

Alternative hypothesis: At least one group has a different mean life expectancy.

Multiple Comparisons

• Can compare every possible pair of treatments.

DANGER!

• The more tests we perform, the more chances there are of making a false judgement.

• Remember: with a p-value threshold of 0.05, 1 out of 20 judgements may be false.

• There are 15 possible pairings for the 6 treatment groups, so we are very likely to make a false judgement!
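The count of 15 pairings and the size of the danger can be checked with a short calculation. This is a rough sketch with hypothetical function names: the family-wise error formula assumes the tests are independent, which pairwise comparisons on shared groups are not, so the result is only an approximation.

```python
import math


def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs that can be formed from n_groups groups."""
    return math.comb(n_groups, 2)


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


pairs = n_pairwise_comparisons(6)
print(pairs)                         # 15 pairings for 6 treatment groups
print(familywise_error_rate(pairs))  # roughly 0.54 under the independence assumption
```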

Bonferroni Correction

• Make it harder to define a result as significant.

• By lowering the p-value threshold. But to lower by how much?

Solution

Divide the threshold by the total number of tests performed.

Thus instead of a critical threshold of 0.05, we now use a critical threshold of

0.05 / 15 = 0.0033

Bonferroni Correction

• A much preferred approach is to calculate the “Bonferroni-corrected p-value” instead.

Bonferroni-corrected p-value: Multiply the obtained p-values by the number of tests performed (capping the result at 1).

- Different p-value thresholds may be used, and it is then difficult to decide what the corresponding Bonferroni-corrected thresholds are.

- By calculating the Bonferroni-corrected p-value, it can be up to the researcher / author / editor / reviewer to decide what the global p-value threshold should be.
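Both forms of the Bonferroni correction, the lowered threshold and the corrected p-value, can be sketched as follows. The function names and the example p-values are hypothetical; capping at 1 reflects that a probability cannot exceed 1.

```python
def bonferroni_threshold(alpha, n_tests):
    """Lowered significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Bonferroni-corrected p-values: each p-value times the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


print(round(bonferroni_threshold(0.05, 15), 4))  # threshold for 15 pairwise tests

raw = [0.001, 0.01, 0.04, 0.5]  # made-up p-values from 4 tests
print(bonferroni_adjust(raw))   # compare these against the usual global threshold
```

The two forms lead to the same decision: a raw p-value is below alpha divided by the number of tests exactly when its corrected value is below alpha.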

Post-Hoc Analysis

Multiple Comparisons

Dependent Variable: LIFETIME

Bonferroni

(I) GROUP  (J) GROUP  Mean Difference (I-J)  Std. Error  Sig.    95% CI Lower Bound  95% CI Upper Bound
lopro      NN85         6.9945*              1.25652     .000      3.2803             10.7086
lopro      NP          12.2837*              1.30637     .000      8.4222             16.1452
lopro      NR40        -5.4310*              1.24086     .000     -9.0988             -1.7631
lopro      NR50        -2.6115               1.19355     .440     -6.1395               .9166
lopro      RR50        -3.2000               1.26207     .175     -6.9306               .5306
NN85       lopro       -6.9945*              1.25652     .000    -10.7086             -3.2803
NN85       NP           5.2892*              1.30101     .001      1.4435              9.1348
NN85       NR40       -12.4254*              1.23521     .000    -16.0766             -8.7743
NN85       NR50        -9.6060*              1.18768     .000    -13.1166             -6.0953
NN85       RR50       -10.1945*              1.25652     .000    -13.9086             -6.4803
NP         lopro      -12.2837*              1.30637     .000    -16.1452             -8.4222
NP         NN85        -5.2892*              1.30101     .001     -9.1348             -1.4435
NP         NR40       -17.7146*              1.28588     .000    -21.5156            -13.9137
NP         NR50       -14.8951*              1.24030     .000    -18.5613            -11.2289
NP         RR50       -15.4837*              1.30637     .000    -19.3452            -11.6222
NR40       lopro        5.4310*              1.24086     .000      1.7631              9.0988
NR40       NN85        12.4254*              1.23521     .000      8.7743             16.0766
NR40       NP          17.7146*              1.28588     .000     13.9137             21.5156
NR40       NR50         2.8195               1.17110     .249      -.6422              6.2811
NR40       RR50         2.2310               1.24086    1.000     -1.4369              5.8988
NR50       lopro        2.6115               1.19355     .440      -.9166              6.1395
NR50       NN85         9.6060*              1.18768     .000      6.0953             13.1166
NR50       NP          14.8951*              1.24030     .000     11.2289             18.5613
NR50       NR40        -2.8195               1.17110     .249     -6.2811               .6422
NR50       RR50         -.5885               1.19355    1.000     -4.1166              2.9395
RR50       lopro        3.2000               1.26207     .175      -.5306              6.9306
RR50       NN85        10.1945*              1.25652     .000      6.4803             13.9086
RR50       NP          15.4837*              1.30637     .000     11.6222             19.3452
RR50       NR40        -2.2310               1.24086    1.000     -5.8988              1.4369
RR50       NR50          .5885               1.19355    1.000     -2.9395              4.1166

*. The mean difference is significant at the .05 level.

Questions

• So why don’t we perform the post-hoc analyses all the time then?

• p-values, is there a difference between 0.049 and 0.051?

• p-values and effect sizes, which is better? P-values or confidence intervals?

• Power (sensitivity) and specificity, can we attempt to maximise both?

t-tests in SPSS

Consider the mathematics.xls dataset again.

1. The average mark of the mathematics exam before starting the omega 3 trial is 70. Is there any evidence that the marks after the omega 3 trial are higher than 70?

2. It is traditionally believed that male students tend to outperform female students in mathematics. Based on the marks before the start of the trial, is there any evidence in support of this hypothesis?

3. Is there any evidence that consuming omega 3 improves the performance in the mathematics exam?

4. Is there any difference in the marks before the trial between the three schools? If there is, which school exhibited the best performance?

5. Is there any difference in the omega 3 consumption between male and female students?

Average marks after = 70?

It should be relatively straightforward to see that we need to perform a one-sample test of the mean value.

However, before we can decide on the use of a one-sample t-test, we need to check whether the data are indeed symmetrically distributed (and go through the usual data exploration procedures).

Recall there are at least two ways of doing this in SPSS:
1. Qualitative assessment using a histogram.
2. Quantitative assessment with a Shapiro-Wilk test.

The Shapiro-Wilk test assesses whether the data are normally distributed.

H0: Data are normally distributed
H1: Data are not normally distributed

For the one-sample t-test:

H0: Mean marks = 70
H1: Mean marks > 70 (1-tailed test)

Actual 2-tailed p-value = 3.28 × 10^-5

1-tailed p-value = 1.64 × 10^-5

Conclusion: There is significant evidence that the mean mark after consuming omega 3 is greater than 70 (p-value = 1.64 × 10^-5).

Males better than females?

It should be immediately clear that there are 2 groups (or “populations”) here: one for male students, and one for female students.

Thus we can use the 2-independent samples t-test to compare the mean marks for the two groups. However, as before, we need to assess whether there is any violation of the normality or symmetrical distribution assumption.

Important: As there are two groups, we need to assess whether both groups satisfy this assumption!

Assessing assumptions for 2-independent samples t-test

H0: Mean marks for males = Mean marks for females
H1: Mean marks for males > Mean marks for females (1-tailed test)

Which row to interpret?

Levene’s test for equality of variances:
H0: Variance for group 1 = Variance for group 2
H1: Variance for group 1 ≠ Variance for group 2

However, regardless of what you see here, always go ahead and interpret the second row, which does not assume equal variances.

The output reports a 2-tailed p-value, which needs to be converted to a 1-tailed p-value. As the mean difference = -0.935, which is in the opposite direction to H1:
1-tailed p-value = 1 – 0.5 × 0.460 = 0.77

Conclusion: There is no evidence (p-value = 0.77) that males perform better than females in the mathematics exam before the start of the omega 3 trial.

Evidence that omega3 improves performance?

Most appropriate analysis is to compare the marks after against the marks before for the same individual.

Here the appropriate test is the paired-sample t-test.

H0: There is no difference between the marks before and after within each individual (or Difference = 0)

H1: The mark after taking omega3 is higher than the mark before (or Difference > 0)

Actual 2-tailed p-value = 4.06 × 10^-14

1-tailed p-value = 2.03 × 10^-14, since the mean difference is in the same direction as the alternative hypothesis.

Conclusion: There is overwhelming evidence (p-value = 2.03 × 10^-14) that the exam marks after the omega 3 trial are higher than the marks before the trial.

Is there any difference in the performance across the 3 schools?

There are three groups whose average marks we want to compare. The use of an ANOVA should immediately come to mind for this purpose.

H0: There is no difference in the mean marks between the three schools.
H1: At least one school has a different mean mark when compared to the rest of the schools.

Conclusion: There is no significant evidence (p-value = 0.063) of a difference in marks between the three schools (or at best, marginal evidence of a difference between schools 1 and 2).

Is there any difference in omega 3 consumption between males and females?

Feeling confident, we decide to jump straight to performing the definitive analysis without exploring the data.

So the appropriate test here is a 2-independent sample t-test.

Is this correct?

With histograms like these, there really isn’t a need to perform the Shapiro-Wilk tests!

Parametric tests

It’s clear that any tests that assume the data to be symmetrically distributed will not be applicable in comparing omega 3 consumption, since this outcome is extremely right skewed.

There is thus a need for statistical tests that do not explicitly require such assumptions on the distribution of the variable – also known as non-parametric tests.

So far, the tests we have seen are considered parametric tests – which require assumptions on the distribution of the data to be satisfied before they can be correctly used.

REMEMBER! The computer does not recognise when a test is inappropriate. You have to know!

Relationship between p-values and confidence intervals

Let’s have a quick recap of all the valid analyses:

Look at the p-values and the confidence intervals. Notice a trend?

Significant at 0.05 threshold and 0 not in 95% CI

Not Significant at 0.05 threshold and 0 is in 95% CI

Significant at 0.05 threshold and 0 is not in 95% CI

Actually, even at the numerical EDA stage, the 95% confidence intervals about the means are already very informative for indicating whether there will be any differences between the two means in a formal comparison.

Relationship between p-values and confidence intervals

Thus, if the test value under the null hypothesis falls within the 95% CI of the estimated quantity, then the (two-sided) p-value from the hypothesis test will be > 0.05.

If the test value under the null hypothesis does not fall within the 95% CI, then the p-value from the hypothesis test will be < 0.05.

Conversely,

If the p-value from the hypothesis test is < 0.05, then the 95% CI will not contain the test value under H0.

If the p-value from the hypothesis test is > 0.05, then the 95% CI is certain to contain the test value under H0.
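This correspondence can be illustrated numerically. The sketch below is only an approximation of what SPSS does: it uses the normal distribution from Python's standard library in place of the t distribution, and the function name, estimates and standard errors are made up for illustration.

```python
from statistics import NormalDist


def z_test_and_ci(estimate, std_error, null_value, level=0.95):
    """Two-sided p-value and confidence interval from the same estimate (normal approximation)."""
    z = (estimate - null_value) / std_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return p_two_sided, ci


# Hypothetical mean differences, each with standard error 1, tested against 0
for estimate in (3.0, 1.0):
    p, (lower, upper) = z_test_and_ci(estimate, 1.0, 0.0)
    contains_null = lower <= 0.0 <= upper
    print(estimate, round(p, 4), (round(lower, 2), round(upper, 2)), contains_null)
```

Because the p-value and the interval are built from the same estimate, standard error and critical value, the 95% CI excludes the null value exactly when the two-sided p-value is below 0.05.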

Students should be able to

• understand the concept of the null and alternative hypotheses

• know what a test statistic is, and understand that it is always calculated assuming the null hypothesis is true

• understand what is meant by power, sensitivity, specificity and type I and type II errors

• understand and interpret a p-value from a hypothesis test

• know which statistical tests should be used

• know the assumptions for these statistical tests

• understand the relationship between a p-value and CI

• perform the appropriate analyses in SPSS and RExcel
