TRANSCRIPT
Understanding the Fundamentals of Statistics
Presentation Outline:
Hypothesis testing & significance (p-value)
Sample size determination
Power analysis
Confidence intervals
Validity
Hypothesis Testing
NULL: the hypothesis that there is no difference between the two groups being compared, with respect to the measured variable.
ALTERNATIVE: the hypothesis that there is a difference between the two groups being compared, with respect to the measured variable.
The size of the difference is defined prior to data collection.
The difference chosen is usually the minimum clinically significant difference.
Hypothesis Testing: Alpha, α
The probability of making a Type I error is ALPHA.
If p < α, then the NULL hypothesis is rejected.
Alpha levels are not calculated; they are chosen by the researcher prior to data collection (usually set at 0.05 by convention, or at 0.01 or 0.001).
The alpha level is the p-value threshold that we as researchers decide on before we will be confident enough to report a finding. This is our predetermined acceptance level.
Hypothesis Testing: Power (1 – β)
The chance of obtaining a statistically significant
p value, if a true difference exists between
groups
Allows the scientist to set calculated odds that
s/he will detect a difference between groups if
one truly exists
Type I and II error: Definitions
Type I error (false positive): the chance of concluding
that there is a significant difference between the groups
when in fact there is not.
Incorrectly rejecting a true null hypothesis.
Type II error (false negative): the likelihood of not
detecting a true significant difference between the groups.
Failing to reject a false null hypothesis.
Illustration of Type I and Type II Errors (α and β):

                          Truth
Study Result              Truly Ineffective                Truly Effective
Apparently Ineffective    Correct                          Type II Error
                          (Probability = 1 – α)            (Probability = β, e.g., 0.2)
                                                           False Negative
Apparently Effective      Type I Error                     Correct
                          (Probability = α, e.g., 0.05)    (Probability = 1 – β)
                          False Positive
Hypothesis Testing
A statistical plan or method for deciding which of two
hypotheses is best supported by the data
Uses a “p” value as a measure of the strength of
evidence against one of the hypotheses
Hypothesis Testing: the p-value
The NULL hypothesis is “tested” to determine which hypothesis (NULL or ALTERNATIVE) is accepted as true, based on the resulting p-value
P-values are calculated from statistical tests
P-values range from 0 to 1. The smaller the p-value, the more confidence we have (and the stronger the evidence) that the NULL is false and the research (ALTERNATIVE) hypothesis is true
An example…
For treating patients with leukemia, you want to select either:
Drug A
Drug B
So you create an experiment, asking different physicians to choose their favorite.
An example: the first results
You start with a convenience sample of 10
physicians.
6 physicians choose Drug A
4 physicians choose Drug B
Does this mean that Drug A is the best drug for
treating patients with leukemia?
Is this convincing evidence?
An example: results from the experiment – making a decision
You decide to increase your sample size. Which of the following convince you?
We have an intuitive sense of how much information it takes to be convinced.
How can we move beyond intuition?
It would be great to create a measure of “evidence strength” that could help us decide.
6-4, Drug A; p ≈ 0.75
30-20, Drug A; p ≈ 0.20
40-20, Drug A; p ≈ 0.014
50-10, Drug A; p < 0.001
70-30, Drug A; p < 0.001
100-10, Drug A; p < 0.001 (very small values are usually presented as p < .001)
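These p-values can be computed directly. A minimal sketch, assuming scipy is available and that each vote split is modeled as an exact two-sided binomial (sign) test of “each physician is equally likely to choose either drug”:

    from scipy.stats import binomtest

    # Under the NULL, each physician picks Drug A with probability 0.5
    for a, b in [(6, 4), (30, 20), (40, 20), (50, 10), (70, 30), (100, 10)]:
        result = binomtest(a, n=a + b, p=0.5, alternative='two-sided')
        print(f"{a}-{b} for Drug A: p = {result.pvalue:.4f}")

The more lopsided the split and the larger the sample, the smaller the p-value becomes.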
P-values
You can use p-values to quantify how convincing the data are.
A p-value is found by asking:
“How rare is it to get an outcome this extreme or more extreme if the treatment groups (or Drug A & B) are the same?”
What does a p-value of 0.05 mean?
If the groups truly are the same, there is a 5% (5 in 100) chance of seeing an outcome at least this extreme by chance alone.
It does NOT mean there is 95% confidence that our treatment (independent variable) caused the outcome (dependent variable); a p-value only describes how surprising the data would be under the NULL.
P-values : One-tailed and Two-tailed tests
One-tailed: looking for a change/relationship in one direction only
May require fewer patients
Two-tailed: looking for a change in either direction
Statistical significance: meaning
By declaring that a p-value smaller than 0.05 is significant, we are willing to allow a chance of error.
Namely, 5% of the time we would be declaring that the groups were different when, in truth, they were not. (NOTE: this is the Type I error, or alpha.)
If we make 100 comparisons of groups that are truly the same, we would expect roughly 5 of them to have a p-value smaller than 0.05 by chance alone.
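This claim is easy to check by simulation. A minimal sketch, assuming numpy and scipy are available, repeatedly t-tests two samples drawn from the same population:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    comparisons, false_positives = 1000, 0
    for _ in range(comparisons):
        a = rng.normal(0, 1, 30)  # both samples come from the SAME population
        b = rng.normal(0, 1, 30)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1  # "significant" purely by chance (Type I error)
    print(f"{false_positives / comparisons:.1%} flagged as significant")  # about 5%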
Statistical significance
If a p-value is smaller than 0.05, then by convention the outcome is deemed “statistically significant”.
P-values larger than 0.05 are considered “not statistically significant”.
Interpreting P-values
p > 0.10 – Little to no evidence of a difference
0.05 ≤ p ≤ 0.10 – Moderate evidence
0.01 ≤ p < 0.05 – Strong evidence
p < 0.01 – Very strong evidence
Please note these are general guidelines and there are exceptions.
P-value assessment
Smaller p-values mean it is less likely that the groups are the same.
The smaller the p-value, the more likely it is that something other than variability/chance is causing the difference.
If the study is designed well, that “something else” is the difference between the treatment groups (or drugs).
No significant findings
If a p-value is not statistically significant (p > .05), does that mean that the groups are the same?
NO – it only means that we did not have enough evidence to say that they were different.
Non-statistically significant
When the aim of the study is to show that the two groups are similar in performance to one another, other study designs can be employed, called Equivalence or Non-Inferiority studies.
Factors influencing p-values
Small sample size
Multiple analyses (variables & subgroups)
Wide variability of the data
Sample Size Determination
Sample size is driven by the following factors:
α level
Power desired
Type of data (continuous, ordinal, categorical)
Magnitude of difference sought
Variance in the data
Examples of how the effect size (ES) drives the required number of subjects (Ss) per group (c = control, t = treatment):

Mean (c)   Mean (t)   SD (c)   ES     Ss per group
145        135        5        2.00   5
145        135        15       0.66   40
145        142        5        0.60   45
145        142        15       0.20   400
Sample Size Calculation
Effect Size (ES), or magnitude of difference = (Mean (c) – Mean (t)) / (Standard Deviation (c))
Alpha set at 0.05 and beta set at 0.2 (1 – 0.2 = 80% power)
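A minimal sketch of such a calculation, assuming the statsmodels package is available; it gives per-group sample sizes close to those in the table above (small differences reflect rounding in the slide):

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()  # power analysis for a two-sample t-test
    for es in (2.00, 0.66, 0.60, 0.20):
        n = analysis.solve_power(effect_size=es, alpha=0.05, power=0.80,
                                 alternative='two-sided')
        print(f"ES = {es:.2f} -> about {n:.0f} subjects per group")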
Samples and Populations
In studies, we are trying to generalize our results to a wider population:
All women who need to get a mammogram
All asthma patients
All Type II diabetics, etc.
However…
Samples versus Populations
We only collect information from a sample of the population:
My female patients who need to get a mammogram
Study participants
Sample variability
Samples do not always accurately reflect a population – there is variation from sample to sample.
For example, if in truth 50% of your patients prefer to get a 2-D mammogram, repeated samples will not all come back exactly 50-50.
We cannot run a study and just see which mammogram is most preferred by our patients OR which therapy has the best results.
A 6 versus 4 result in favor of the 2-D mammogram could be due to variability or chance alone.
Overcoming variability: Confidence Intervals (CI)
Research is used to make statements about a population (a large group
of subjects) using information from a sample (a small, but carefully
selected subset of this population).
No matter how carefully this sample is selected to be a fair and
unbiased representation of the population, relying on information from
a sample will always lead to some level of uncertainty.
A confidence interval is a range of values that tries to quantify this
uncertainty. Consider it as a range of plausible values.
A narrow confidence interval implies high precision; we can pin the plausible values down to a narrow range.
A wide interval implies poor precision; we can only place the plausible values within a broad and uninformative range.
Confidence intervals (CI) can help…
A CI is a range of values that tries to quantify the level of uncertainty.
When you see a confidence interval in a published paper, the first thing to check is whether the interval contains a value that implies no change or no effect (the NULL value).
For example, with a confidence interval for a:
Difference – look to see whether the interval includes zero (0)
Ratio – look to see whether the interval contains one (1)
90, 95, or 99% CIs provide information about:
the range in which the true value lies with a certain degree of probability
the direction and strength of demonstrated effect
statistical plausibility AND…
clinical relevance
Confidence Intervals: Example
Consider a recent study of PPI treatment of ulcers and their size. After the PPI, ulcers had decreased in size by 4 mm on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. This interval implies that neither a large improvement due to the PPI nor a large decrement could be ruled out.
Generally, when a confidence interval is very wide like this one, it is an indication of an inadequate sample size.
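Here is how such an interval is computed. A minimal sketch, assuming numpy and scipy are available; the data are hypothetical illustrative numbers, not data from the study above:

    import numpy as np
    from scipy import stats

    # Hypothetical ulcer-size changes in mm (illustrative only)
    change_mm = np.array([4.0, -2.5, 7.0, 1.0, 6.5, -5.0, 3.0, 9.0, -1.5, 8.0])

    mean = change_mm.mean()
    sem = stats.sem(change_mm)  # standard error of the mean
    low, high = stats.t.interval(0.95, df=len(change_mm) - 1, loc=mean, scale=sem)
    print(f"mean change = {mean:.1f} mm, 95% CI = ({low:.1f}, {high:.1f})")
    # If the interval contains 0 (the NULL value for a difference),
    # the change is not statistically significant at the 0.05 level.

A small sample with a wide spread yields a wide interval, mirroring the PPI example above.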
Confidence Intervals cont’d.
[Figure: three example confidence intervals – one crossing the NULL value (no statistically significant change), one lying entirely on the improvement side (a statistically significant improvement), and one lying entirely on the decline side (a statistically significant decline).]
Statistical Power: what is it?
The power of any test of statistical significance is defined as the
probability that it will reject a false null hypothesis.
In other words, statistical power is the likelihood that a study will
detect an effect when there is an effect to be detected.
If statistical power is high, the probability of concluding there is no effect
when, in fact, there is one, goes down.
Statistical power is affected chiefly by the size of the effect and the
size of the sample used to detect it.
Bigger effects are easier to detect than smaller effects, while large
samples offer greater test sensitivity than small samples.
Power analysis
Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size.
Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size.
In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.
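The first two uses are easy to demonstrate. A minimal sketch, assuming statsmodels is available; solve_power solves for whichever parameter is left unspecified:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Use 1: minimum sample size to detect a medium effect (ES = 0.5)
    # with 80% power at alpha = 0.05
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(f"about {n:.0f} subjects per group needed")

    # Use 2: minimum effect size detectable with 80% power,
    # given 64 subjects per group
    es = analysis.solve_power(nobs1=64, alpha=0.05, power=0.80)
    print(f"detectable standardized effect size: about {es:.2f}")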
Statistical power is dependent on:
The statistical significance criterion used in the test – the test size (significance level, alpha)
The magnitude of the effect of interest in the population, together with its variation (variability) – the standardized effect size
The sample size used to detect the effect
The statistical model (test) employed
The power of the test itself is 1 – ß.
Factors affecting power
Alpha level: the smaller the alpha level, the lower the power
Type of statistical technique:
One-tailed tests are more "powerful"
Repeated-measures designs are more powerful
Sensitivity of the data:
Reliability of the measures
Statistical controls
Sample size: the larger the sample size, the higher the power
Size of the treatment effect (effect size): more extreme treatments make for more powerful tests
Effect size (ES)

Treatment                        Effect
Psychotherapy                    .70–.80
AZT for AIDS                     .47
Physician aspirin study          .10
Drug treatment for arthritis     .45–.77
Cyclosporine (organ rejection)   .39
The problem parameter
ES = (μE – μC) / σ
where μE and μC are the population means of the experimental and control groups, and σ is their shared standard deviation.
A positive ES means better for the experimental group; a negative ES means better for the control group.
For example, group means of 145 and 135 with a shared standard deviation of 15 give an effect size of 10 / 15 ≈ 0.67.
Statistical vs. Clinical Significance
You must have statistical significance to have clinical
significance
A p-value helps us to assess whether a result is likely due to a real difference between treatments; it does not say whether that difference is clinically meaningful
Just because a p-value is statistically significant does not mean that the result is clinically significant
Statistical significance “says nothing about the actual
magnitude or the importance of the difference”
The importance of the difference is called “clinical
significance”
“Clinical Significance” can only be decided by judgment,
not by mathematics
Statistical vs. Clinical Significance cont’d.
P-values should be used in concert with summary measures (e.g., means, percentages, hazard ratios, 95% Confidence Intervals (CIs)) to evaluate the magnitude of any given result.
P-values do not answer 2 critical questions:
1) Is the result correct?
• Solution: Seek other data and evaluate the possibility of systematic error (bias)
2) Is the observed effect important?
• Solution: You must rely on your own clinical judgment
Clinical Significance
Study population vs. yours
Statistically significant outcomes may not be clinically meaningful (e.g., a 0.25-point change on a quality of life scale)
Treatment delivery/cost/resources may not be reasonable or possible in your setting
Statistical vs. Clinical Significance: Example
You receive a brochure offering to make your offspring smarter so
they can go to Ivy League colleges, become wealthy, and support you
in your retirement days.
The brochure contains legitimate research data to support its claim that the product was demonstrated to raise IQs by an amount significant “at the 0.05 significance level”.
RESULT: For an N of 100, a difference of only 3 IQ points would produce a statistically significant difference.
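A minimal sketch of the arithmetic behind that result, assuming the claim rests on a one-sample z-test of the 100 children against the population IQ mean of 100 with a known SD of 15 (the test and numbers here are illustrative assumptions):

    import math
    from scipy import stats

    n, sd, observed_gain = 100, 15.0, 3.0
    se = sd / math.sqrt(n)     # standard error of the sample mean = 1.5
    z = observed_gain / se     # z = 3 / 1.5 = 2.0
    p = 2 * stats.norm.sf(z)   # two-sided p-value, about 0.046
    print(f"z = {z:.2f}, p = {p:.3f}")
    # Statistically significant at the 0.05 level, yet a 3-point IQ gain
    # is of doubtful practical (clinical) importance.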
A Negative Result
Type II errors are not rare: Freiman et al. found that in 71 negative clinical trials, 50% had a beta of > 0.74, meaning that these studies had only 26% power to detect a true difference.
In a negative trial, the question to ask yourself is:
FOR A GIVEN BETA LEVEL AND A DIFFERENCE I CONSIDER CLINICALLY RELEVANT, DID THE RESEARCHER USE A LARGE ENOUGH SAMPLE SIZE?
You can calculate the sample size on your own – it's easy using standard statistical software (see the power analysis sketch earlier).
Validity Typology
Internal
the extent to which causal conclusions can be made
Statistical conclusion
the level of confidence in the results given the design and statistical methods employed
External
the extent to which the results can be applied to the broader population
Construct
the match between the study's operationalizations and the constructs they are intended to measure
Confounders:
An independent variable with a strong association with both the outcome and the exposure of interest.
Is an independent risk factor for the outcome/disease.
Distorts the association between the exposure and the outcome.
Must be controlled if causal inferences are to be made.
Examples:
Confounding by concomitant medication: drug-drug interaction
Confounding by effect modification:
Impact of varying age groups (older patients responding differently to an intervention)
Impact of varying dosing (different outcomes as a result of differences in strength and frequency of dose)
Controlling for Confounders
Sample restriction: specifying inclusion criteria; excluding subjects exposed to the confounding variable.
Matching: grouping subjects based on the presence of like confounders. Age and sex are the usual background confounders.
For large sample sizes, techniques like propensity score matching are used to control for confounders.
Stratification: individuals are grouped into separate strata defined by the values of the confounder.
Statistical modeling: techniques such as regression are used to control for multiple confounders (see the sketch below).
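A minimal sketch of that last approach, assuming statsmodels is available; the data and variable names are simulated and hypothetical. Including the confounder (age) in the regression separates its effect from the exposure's:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate data in which age confounds the exposure-outcome association
    rng = np.random.default_rng(0)
    n = 500
    age = rng.normal(60, 10, n)                               # confounder
    exposure = (age + rng.normal(0, 10, n) > 65).astype(int)  # older -> more exposed
    outcome = 0.05 * age + 0.5 * exposure + rng.normal(0, 1, n)
    df = pd.DataFrame({"age": age, "exposure": exposure, "outcome": outcome})

    crude = smf.ols("outcome ~ exposure", data=df).fit()
    adjusted = smf.ols("outcome ~ exposure + age", data=df).fit()
    print(f"crude exposure effect:    {crude.params['exposure']:.2f}")  # inflated by age
    print(f"adjusted exposure effect: {adjusted.params['exposure']:.2f}")  # near the true 0.5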
References
University of California, Berkeley, Department of Statistics: http://www.stat.berkeley.edu/users/hhuang/STAT141/Lecture-FDR.pdf
Yale University Department of Statistics: http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
Stanford University: http://www.stanford.edu/~kcobb/hrp259/lecture11.ppt
MEERA: My Environmental Education Evaluation Resource Assistant: http://meera.snre.umich.edu/plan-an-evaluation/related-topics/power-analysis-statistical-significance-effect-size
University of Ottawa: http://www.med.uottawa.ca/sim/data/Statistical_significance_importance_e.htm
Previous internal Advocate research department presentations
Thank You!