TRANSCRIPT
Understanding the Fundamentals of Statistics
Presentation Outline:
Hypothesis testing & significance (p-value)
Sample size determination
Power analysis
Confidence intervals
Validity
Hypothesis Testing
NULL: the hypothesis that there is no difference between the two groups being compared, with respect to the measured variable.
ALTERNATIVE: the hypothesis that there is a difference between the two groups being compared, with respect to the measured variable.
The size of the difference is defined prior to data collection.
The difference chosen is usually the minimum clinically significant difference.
Hypothesis Testing: Alpha, α
The probability of making a Type I error is ALPHA.
If p < α, then the NULL hypothesis is rejected.
Alpha levels are not calculated; they are chosen by the researcher prior to data collection (usually set at 0.05 by convention, or at 0.01 or 0.001).
The alpha level is the p-value threshold that we as researchers decide on before we will be confident enough to report a finding. This is our predetermined acceptance level.
Hypothesis Testing: Power (1 – β)
The chance of obtaining a statistically significant
p value, if a true difference exists between
groups
Allows the scientist to set calculated odds that
s/he will detect a difference between groups if
one truly exists
Type I and II error: Definitions
Type I error (false positive): the chance of concluding
that there is a significant difference between the groups
when in fact there is not.
Incorrectly rejecting a true null hypothesis.
Type II error (false negative): the likelihood of not
detecting a true significant difference between the groups.
Failing to reject a false null hypothesis.
Illustration of Type I and Type II Errors (α and β):

                          Truth
Study Result              Truly Ineffective                Truly Effective
Apparently Ineffective    Correct                          Type II Error
                          (Probability = 1 – α)            (Probability = β, e.g., 0.2)
                                                           False Negative
Apparently Effective      Type I Error                     Correct
                          (Probability = α, e.g., 0.05)    (Probability = 1 – β)
                          False Positive
Hypothesis Testing
A statistical plan or method for deciding which of two
hypotheses is best supported by the data
Uses a “p” value as a measure of the strength of
evidence against one of the hypotheses
Hypothesis Testing: the p-value
The NULL hypothesis is “tested” to determine which hypothesis (NULL or ALTERNATIVE) is accepted as true, based on the resulting p-value
P-values are calculated from statistical tests
P-values range from 0 to 1. The smaller the p-value, the more confidence we have (and the stronger the evidence) that the NULL is false and the research (ALTERNATIVE) hypothesis is true
An example…
For treating patients with leukemia, you want to select either:
Drug A
Drug B
So you create an experiment, asking different physicians to choose their favorite.
An example: the first results
You start with a convenience sample of 10
physicians.
6 physicians choose Drug A
4 physicians choose Drug B
Does this mean that Drug A is the best drug for
treating patients with leukemia?
Is this convincing evidence?
An example: results from the experiment – making a decision
You decide to increase your sample size. Which of the following convince you?
We have an intuitive sense of how much information it takes to be convinced.
How can we move beyond intuition?
It would be great to create a measure of “evidence strength” that could help us decide.
6-4, Drug A; p ≈ 0.75
30-20, Drug A; p ≈ 0.20
40-20, Drug A; p ≈ 0.014
50-10, Drug A; p < 0.001
70-30, Drug A; p < 0.001
100-10, Drug A; p < 0.001 (very small values are usually presented as p < .001)
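These p-values can be computed directly. A minimal sketch, assuming scipy is available and that each vote split is modeled as an exact two-sided binomial (sign) test of “each physician is equally likely to choose either drug”:

    from scipy.stats import binomtest

    # Under the NULL, each physician picks Drug A with probability 0.5
    for a, b in [(6, 4), (30, 20), (40, 20), (50, 10), (70, 30), (100, 10)]:
        result = binomtest(a, n=a + b, p=0.5, alternative='two-sided')
        print(f"{a}-{b} for Drug A: p = {result.pvalue:.4f}")

The more lopsided the split and the larger the sample, the smaller the p-value becomes.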
P-values
You can use p-values to quantify how convincing the data are.
A p-value is found by asking:
“How rare is it to get an outcome this extreme or more extreme if the treatment groups (or Drug A & B) are the same?”
What does a p-value of 0.05 mean?
If the groups truly are the same, there is a 5% (5 in 100) chance of seeing an outcome at least this extreme by chance alone.
It does NOT mean there is 95% confidence that our treatment (independent variable) caused the outcome (dependent variable); a p-value only describes how surprising the data would be under the NULL.
P-values : One-tailed and Two-tailed tests
One-tailed: looking for a change/relationship in one direction only
May require fewer patients
Two-tailed: looking for a change in either direction
Statistical significance: meaning
By declaring that a p-value smaller than 0.05 is significant, we are willing to allow a chance of error.
Namely, 5% of the time we would be declaring that the groups were different when, in truth, they were not. (NOTE: this is the Type I error, or alpha.)
If we make 100 comparisons of groups that are truly the same, we would expect roughly 5 of them to have a p-value smaller than 0.05 by chance alone.
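This claim is easy to check by simulation. A minimal sketch, assuming numpy and scipy are available, repeatedly t-tests two samples drawn from the same population:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    comparisons, false_positives = 1000, 0
    for _ in range(comparisons):
        a = rng.normal(0, 1, 30)  # both samples come from the SAME population
        b = rng.normal(0, 1, 30)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1  # "significant" purely by chance (Type I error)
    print(f"{false_positives / comparisons:.1%} flagged as significant")  # about 5%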
Statistical significance
If a p-value is smaller than 0.05, then by convention the outcome is deemed “statistically significant”.
P-values larger than 0.05 are considered “not statistically significant”.
Interpreting P-values
p > 0.10 – Little to no evidence of a difference
0.05 ≤ p ≤ 0.10 – Moderate evidence
0.01 ≤ p < 0.05 – Strong evidence
p < 0.01 – Very strong evidence
Please note these are general guidelines and there are exceptions.
P-value assessment
Smaller p-values mean it is less likely that the groups are the same.
The smaller the p-value, the more likely it is that something other than variability/chance is causing the difference.
If the study is designed well, that “something else” is the difference between the treatment groups (or drugs).
No significant findings
If a p-value is not statistically significant (p > .05), does that mean that the groups are the same?
NO – it only means that we did not have enough evidence to say that they were different.
Non-statistically significant
When the aim of the study is to show that the two groups are similar in performance to one another, other study designs can be employed, called Equivalence or Non-Inferiority studies.
Factors influencing p-values
Small sample size
Multiple analyses (variables & subgroups)
Wide variability of the data
Sample Size Determination
Sample size is driven by the following factors:
α level
Power desired
Type of data (continuous, ordinal, categorical)
Magnitude of difference sought
Variance in the data
Examples of how the effect size (ES) drives the required number of subjects (Ss) per group (c = control, t = treatment):

Mean (c)   Mean (t)   SD (c)   ES     Ss per group
145        135        5        2.00   5
145        135        15       0.66   40
145        142        5        0.60   45
145        142        15       0.20   400
Sample Size Calculation
Effect Size (ES), or magnitude of difference = (Mean (c) – Mean (t)) / (Standard Deviation (c))
Alpha set at 0.05 and beta set at 0.2 (1 – 0.2 = 80% power)
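A minimal sketch of such a calculation, assuming the statsmodels package is available; it gives per-group sample sizes close to those in the table above (small differences reflect rounding in the slide):

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()  # power analysis for a two-sample t-test
    for es in (2.00, 0.66, 0.60, 0.20):
        n = analysis.solve_power(effect_size=es, alpha=0.05, power=0.80,
                                 alternative='two-sided')
        print(f"ES = {es:.2f} -> about {n:.0f} subjects per group")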
Samples and Populations
In studies, we are trying to generalize our results to a wider population:
All women who need to get a mammogram
All asthma patients
All Type II diabetics, etc.
However…
Samples versus Populations
We only collect information from a sample of the population:
My female patients who need to get a mammogram
Study participants
Sample variability
Samples do not always accurately reflect a population – there is variation from sample to sample.
For example, if in truth 50% of your patients prefer to get a 2-D mammogram, repeated samples will not all come back exactly 50-50.
We cannot run a study and just see which mammogram is most preferred by our patients OR which therapy has the best results.
A 6 versus 4 result in favor of the 2-D mammogram could be due to variability or chance alone.
Overcoming variability: Confidence Intervals (CI)
Research is used to make statements about a population (a large group
of subjects) using information from a sample (a small, but carefully
selected subset of this population).
No matter how carefully this sample is selected to be a fair and
unbiased representation of the population, relying on information from
a sample will always lead to some level of uncertainty.
A confidence interval is a range of values that tries to quantify this
uncertainty. Consider it as a range of plausible values.
A narrow confidence interval implies high precision; we can pin the plausible values down to a narrow range.
A wide interval implies poor precision; we can only place the plausible values within a broad and uninformative range.
Confidence intervals (CI) can help…
A CI is a range of values that tries to quantify the level of uncertainty.
When you see a confidence interval in a published paper, the first thing to check is whether the interval contains a value that implies no change or no effect (the NULL value).
For example, with a confidence interval for a:
Difference – look to see whether the interval includes zero (0)
Ratio – look to see whether the interval contains one (1)
90, 95, or 99% CIs provide information about:
the range in which the true value lies with a certain degree of probability
the direction and strength of demonstrated effect
statistical plausibility AND…
clinical relevance
Confidence Intervals: Example
Consider a recent study of PPI treatment of ulcers and their size. After the PPI, ulcers had decreased in size by 4 mm on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. This interval implies that neither a large improvement due to the PPI nor a large decrement could be ruled out.
Generally, when a confidence interval is very wide like this one, it is an indication of an inadequate sample size.
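Here is how such an interval is computed. A minimal sketch, assuming numpy and scipy are available; the data are hypothetical illustrative numbers, not data from the study above:

    import numpy as np
    from scipy import stats

    # Hypothetical ulcer-size changes in mm (illustrative only)
    change_mm = np.array([4.0, -2.5, 7.0, 1.0, 6.5, -5.0, 3.0, 9.0, -1.5, 8.0])

    mean = change_mm.mean()
    sem = stats.sem(change_mm)  # standard error of the mean
    low, high = stats.t.interval(0.95, df=len(change_mm) - 1, loc=mean, scale=sem)
    print(f"mean change = {mean:.1f} mm, 95% CI = ({low:.1f}, {high:.1f})")
    # If the interval contains 0 (the NULL value for a difference),
    # the change is not statistically significant at the 0.05 level.

A small sample with a wide spread yields a wide interval, mirroring the PPI example above.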
Confidence Intervals cont’d.
[Figure: three example confidence intervals – one crossing the NULL value (no statistically significant change), one lying entirely on the improvement side (a statistically significant improvement), and one lying entirely on the decline side (a statistically significant decline).]
Statistical Power: what is it?
The power of any test of statistical significance is defined as the
probability that it will reject a false null hypothesis.
In other words, statistical power is the likelihood that a study will
detect an effect when there is an effect to be detected.
If statistical power is high, the probability of concluding there is no effect
when, in fact, there is one, goes down.
Statistical power is affected chiefly by the size of the effect and the
size of the sample used to detect it.
Bigger effects are easier to detect than smaller effects, while large
samples offer greater test sensitivity than small samples.
Power analysis
Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size.
Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size.
In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.
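The first two uses are easy to demonstrate. A minimal sketch, assuming statsmodels is available; solve_power solves for whichever parameter is left unspecified:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Use 1: minimum sample size to detect a medium effect (ES = 0.5)
    # with 80% power at alpha = 0.05
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(f"about {n:.0f} subjects per group needed")

    # Use 2: minimum effect size detectable with 80% power,
    # given 64 subjects per group
    es = analysis.solve_power(nobs1=64, alpha=0.05, power=0.80)
    print(f"detectable standardized effect size: about {es:.2f}")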
Statistical power is dependent on:
The statistical significance criterion used in the test – the test size (significance level, alpha)
The magnitude of the effect of interest in the population, together with its variation (variability) – the standardized effect size
The sample size used to detect the effect
The statistical model (test) employed
The power of the test itself is 1 – ß.
Factors affecting power
Alpha level: the smaller the alpha level, the lower the power
Type of statistical technique:
One-tailed tests are more "powerful"
Repeated-measures designs are more powerful
Sensitivity of the data:
Reliability of the measures
Statistical controls
Sample size: the larger the sample size, the higher the power
Size of the treatment effect (effect size): more extreme treatments make for more powerful tests
Effect size (ES)

Treatment                        Effect
Psychotherapy                    .70–.80
AZT for AIDS                     .47
Physician aspirin study          .10
Drug treatment for arthritis     .45–.77
Cyclosporine (organ rejection)   .39
The problem parameter
ES = (μE – μC) / σ
where μE and μC are the population means of the experimental and control groups, and σ is their shared standard deviation.
A positive ES means better for the experimental group; a negative ES means better for the control group.
For example, group means of 145 and 135 with a shared standard deviation of 15 give an effect size of 10 / 15 ≈ 0.67.
Statistical vs. Clinical Significance
You must have statistical significance to have clinical
significance
A p-value helps us to assess whether a result is likely due to a real difference between treatments; it does not say whether that difference is clinically meaningful
Just because a p-value is statistically significant does not mean that the result is clinically significant
Statistical significance “says nothing about the actual
magnitude or the importance of the difference”
The importance of the difference is called “clinical
significance”
“Clinical Significance” can only be decided by judgment,
not by mathematics
Statistical vs. Clinical Significance cont’d.
P-values should be used in concert with summary measures (e.g., means, percentages, hazard ratios, 95% Confidence Intervals (CIs)) to evaluate the magnitude of any given result.
P-values do not answer 2 critical questions:
1) Is the result correct?
• Solution: Seek other data and evaluate the possibility of systematic error (bias)
2) Is the observed effect important?
• Solution: You must rely on your own clinical judgment
Clinical Significance
Study population vs. yours
Statistically significant outcomes may not be clinically meaningful (e.g., a 0.25-point change on a quality of life scale)
Treatment delivery/cost/resources may not be reasonable or possible in your setting
Statistical vs. Clinical Significance: Example
You receive a brochure offering to make your offspring smarter so
they can go to Ivy League colleges, become wealthy, and support you
in your retirement days.
The brochure contains legitimate research data to support its claim that the product was demonstrated to raise IQs by an amount significant “at the 0.05 significance level”.
RESULT: For an N of 100, a difference of only 3 IQ points would produce a statistically significant difference.
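A minimal sketch of the arithmetic behind that result, assuming the claim rests on a one-sample z-test of the 100 children against the population IQ mean of 100 with a known SD of 15 (the test and numbers here are illustrative assumptions):

    import math
    from scipy import stats

    n, sd, observed_gain = 100, 15.0, 3.0
    se = sd / math.sqrt(n)     # standard error of the sample mean = 1.5
    z = observed_gain / se     # z = 3 / 1.5 = 2.0
    p = 2 * stats.norm.sf(z)   # two-sided p-value, about 0.046
    print(f"z = {z:.2f}, p = {p:.3f}")
    # Statistically significant at the 0.05 level, yet a 3-point IQ gain
    # is of doubtful practical (clinical) importance.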
A Negative Result
Type II errors are not rare: Freiman et al. found that in 71 negative clinical trials, 50% had a beta of > 0.74, meaning that these studies had only 26% power to detect a true difference.
In a negative trial, the question to ask yourself is:
FOR A GIVEN BETA LEVEL AND A DIFFERENCE I CONSIDER CLINICALLY RELEVANT, DID THE RESEARCHER USE A LARGE ENOUGH SAMPLE SIZE?
You can calculate the sample size on your own – it's easy using standard statistical software (see the power analysis sketch earlier).
Validity Typology
Internal
the extent to which causal conclusions can be made
Statistical conclusion
the level of confidence in the results given the design and statistical methods employed
External
the extent to which the results can be applied to the broader population
Construct
the match between the study's operationalizations and the constructs they are intended to measure
Confounders:
An independent variable with a strong association with both the outcome and the exposure of interest.
Is an independent risk factor for the outcome/disease.
Distorts the association between the exposure and the outcome.
Must be controlled if causal inferences are to be made.
Examples:
Confounding by concomitant medication: drug-drug interaction
Confounding by effect modification:
Impact of varying age groups (older patients responding differently to an intervention)
Impact of varying dosing (different outcomes as a result of differences in strength and frequency of dose)
Controlling for Confounders
Sample restriction: specifying inclusion criteria; excluding subjects exposed to the confounding variable.
Matching: grouping subjects based on the presence of like confounders. Age and sex are the usual background confounders.
For large sample sizes, techniques like propensity score matching are used to control for confounders.
Stratification: individuals are grouped into separate strata defined by the values of the confounder.
Statistical modeling: techniques such as regression are used to control for multiple confounders (see the sketch below).
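A minimal sketch of that last approach, assuming statsmodels is available; the data and variable names are simulated and hypothetical. Including the confounder (age) in the regression separates its effect from the exposure's:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate data in which age confounds the exposure-outcome association
    rng = np.random.default_rng(0)
    n = 500
    age = rng.normal(60, 10, n)                               # confounder
    exposure = (age + rng.normal(0, 10, n) > 65).astype(int)  # older -> more exposed
    outcome = 0.05 * age + 0.5 * exposure + rng.normal(0, 1, n)
    df = pd.DataFrame({"age": age, "exposure": exposure, "outcome": outcome})

    crude = smf.ols("outcome ~ exposure", data=df).fit()
    adjusted = smf.ols("outcome ~ exposure + age", data=df).fit()
    print(f"crude exposure effect:    {crude.params['exposure']:.2f}")  # inflated by age
    print(f"adjusted exposure effect: {adjusted.params['exposure']:.2f}")  # near the true 0.5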
References
University of California, Berkeley, Department of Statistics: http://www.stat.berkeley.edu/users/hhuang/STAT141/Lecture-FDR.pdf
Yale University Department of Statistics: http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
Stanford University: http://www.stanford.edu/~kcobb/hrp259/lecture11.ppt
MEERA: My Environmental Education Evaluation Resource Assistant: http://meera.snre.umich.edu/plan-an-evaluation/related-topics/power-analysis-statistical-significance-effect-size
University of Ottawa: http://www.med.uottawa.ca/sim/data/Statistical_significance_importance_e.htm
Previous internal Advocate research department presentations
Thank You!