brief outline of basic statistics - university of cape town · 2017. 8. 30. · brief outline of...
Brief outline of basic statistics
S/Lecturer Maia Lesosky Division of Epidemiology &
Biostatistics, University of Cape Town [email protected]
Thanks to Landon Myer for slides
“These are the topics that need to be covered”
1. Definitions of mean, median and standard deviation
2. Interpretation of confidence intervals and p-values
3. Sensitivity and specificity
4. Positive and negative predictive values
5. Interpretation of commonly used parametric and non-parametric tests: Student's t-test, chi-square analysis, Fisher's exact test, ANOVA
6. Correlation
7. Risk reduction and numbers needed to treat
8. Relative risk and hazard ratio
9. Regression analysis
10. Interpretation of kappa values
11. Survival analysis
Principles of this talk
• More detail here than you need
→ Will go quickly, you must stop to ask questions
→ These are the basics, you will be asked to apply them
• Many things here are gross simplifications
→ You want this
• Terminology varies
→ Have kept as general as possible, noted synonyms
Outline
I. Measurements & distributions (20%) – describing distributions
II. Making comparisons (70%) – Statistical tests, p-values & CI – Regression, survival analysis
III. Evaluating measurements (10%) – Validity, reliability
I. Measurements & distributions
Measurements
• Broadly 3 kinds of measurements in health sciences (“variables”)
– Numeric measures
  • Continuous
  • Discrete
– Categorical measures
  • Polytomous (ordinal vs nominal)
  • Binary
– Time-based measures
  • Time-to-event, survival
Examples
• Systolic blood pressure
• Mortality
• ALT
• TNM staging
• Gender/sex
• Time to remission
Note: can make categories of continuous measures
[Figure: haemoglobin shown three ways: as a continuous measure (g/dL), as a polytomous measure (low, medium, high) and as a binary measure (low, high)]
NB: the continuous version is a numeric measure; the polytomous and binary versions are categorical measures
Distributions
• When we take measurements on many patients, we can describe measures as distributions
• How we describe a distribution will depend on the kind of measure
Categorical distributions
• Describe in frequency distributions
Can describe distribution of categorical variable in terms of counts, percentages
Different units, same conclusions
Continuous distributions
• We also describe in terms of frequency distributions
• But we have many “categories” (“bins”)
We draw summaries to describe the shape of the distribution
There are other ways to show shapes of distributions
‘Box-and-whisker’ plots
Many different possible shapes for continuous distributions
Skewed distributions
• Negative (left) skew
– Long “tail” of distribution is to the left
– Bulk of observations shifted right
• Positive (right) skew
– Long “tail” of distribution is to the right
– Bulk of observations shifted left
There are some “classic” distributions
Ways to describe a distribution
• Measure of central tendency – Where the distribution clusters
• Measure of dispersion – How spread out the distribution is
Measures of central tendency
• Mean – Arithmetic mean, average value
• Median – 50th percentile, middle value
• Mode – Most commonly occurring value
Quantiles (regular intervals of a distribution)
• Percentiles 1, 2, 3, 4, 5, 6, 7, 8 …… 94, 95, 96, 97, 98, 99, 100
• Deciles 1-10, 11-20, 21-30 ….. 81-90, 91-100
• Quintiles 1-20, 21-40, 41-60, 61-80, 81-100
• Quartiles 1-25, 26-50, 51-75, 76-100
• Tertiles 1-33, 34-66, 67-100
Describing distributions
Note: if the data are Normally distributed, the mean is a good measure of central tendency. If the data are non-Normal (eg, skewed), the median is a better measure of central tendency.
Measures of dispersion (‘spread’)
• Range – Minimum value to Maximum value
• Variance – Average squared distance between each point and the mean
• Standard deviation – Square root of variance
• Interquartile range – 25th percentile to 75th percentile
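The measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the haemoglobin values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical haemoglobin values in g/dL (illustrative only, not from the slides)
haemoglobin = [10.2, 11.5, 12.1, 12.8, 13.0, 13.4, 13.9, 14.6, 15.2, 17.3]

# Measures of central tendency
mean_hb = statistics.mean(haemoglobin)
median_hb = statistics.median(haemoglobin)

# Measures of dispersion
value_range = (min(haemoglobin), max(haemoglobin))
variance_hb = statistics.variance(haemoglobin)  # sample variance (divides by n - 1)
sd_hb = statistics.stdev(haemoglobin)           # square root of the variance

# Quartile cut points: the interquartile range runs from q1 to q3
q1, q2, q3 = statistics.quantiles(haemoglobin, n=4)

print(f"mean={mean_hb:.2f} median={median_hb:.2f}")
print(f"range={value_range} variance={variance_hb:.2f} SD={sd_hb:.2f}")
print(f"IQR={q1:.2f} to {q3:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) form; `pvariance` and `pstdev` give the population form.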
Variance
3 distributions: same mean value, but different variances
We have a favourite distribution
Remember: standard deviation is just the square root of variance
Normal distribution ~ “Gaussian distribution” (the “standard normal distribution” is the special case with mean 0 and standard deviation 1)
We like the Normal distribution because it has some well-defined features
In a Normal distribution, 95% of the data fall within 1.96 standard deviations of the mean value
(here, 95.45% within 2 standard deviations)
• This is where a 95% confidence interval comes from
– The range mean ± 1.96 standard deviations covers about 95% of individual values
– A 95% confidence interval around a mean value uses the same multiplier, applied to the standard error of the mean (SD / √n): mean ± 1.96 standard errors
– Sometimes we are lazy and round 1.96 up to 2
• 95% CI is a generic statistic
– It will come up elsewhere (same concept, different application)
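The 1.96 rule can be checked numerically with `statistics.NormalDist`. The sketch below also computes a 95% confidence interval for a mean using the standard error (SD / √n), which is the usual large-sample form; the blood pressure values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

# Proportion of a Normal distribution within 1.96 and within 2 SD of the mean
standard_normal = NormalDist(mu=0, sigma=1)
within_1_96 = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"within 1.96 SD: {within_1_96:.4f}")
print(f"within 2 SD:    {within_2:.4f}")

# Hypothetical systolic blood pressures in mmHg (illustrative only)
systolic_bp = [118, 122, 125, 130, 131, 135, 138, 140, 144, 150]
n = len(systolic_bp)
mean_bp = statistics.mean(systolic_bp)
sd_bp = statistics.stdev(systolic_bp)

# Standard error of the mean, then the large-sample 95% CI for the mean
se_bp = sd_bp / math.sqrt(n)
ci_low = mean_bp - 1.96 * se_bp
ci_high = mean_bp + 1.96 * se_bp
print(f"mean={mean_bp:.1f}, 95% CI {ci_low:.1f} to {ci_high:.1f}")
```

With a sample this small a t-distribution multiplier would normally replace 1.96; the Normal multiplier is kept here only to mirror the slide.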
• We often manipulate (“transform”) variables so that we can make them “Normal”
– Common manipulations include logarithms, square roots, or squares
– Eg, log HIV viral loads
There are many different standard distributions with well-defined features
• Gaussian (Normal) is most common
• Others
– Z-distribution, T-distribution, F-distribution
– Chi-squared (χ²) distribution
– Binomial distribution (for categorical data)
– Poisson distribution (for counts of things)
Parametric statistics
• If the distribution of our measures follows a known distribution, we can make assumptions about our data based on rules of the known distribution
– Eg, if our data are Normally distributed, we know that 95% of data fall within 2 standard deviations of the mean value
• These kinds of statistics are parametric statistics
Non-parametric statistics
• If our measures really don’t look like any known distribution, we can’t make assumptions about them based on any standard distribution
– We have to work with the actual values of our measurements
• These are non-parametric statistics
Example
There are parametric and non-parametric approaches to describing distributions
• If data are Normally distributed
– Mean and standard deviation (or variances) used to describe distributions
• If data are not Normally distributed
– Median and interquartile ranges (or just ranges) used to describe distribution
II. Making comparisons
• Sometimes we want to compare 2 distributions to each other
– Are the distributions different from each other?
– Is there an association between the two measures?
• We can ask this question about different combinations of
– Continuous measures (Normal or non-normal distributions)
– Categorical measures (polytomous or binary)
Example: comparison of 2 distributions
Serum cholesterol among women
Serum cholesterol among men
Question: Is cholesterol associated with gender?
Example: comparison of 2 binary measures
Patients without TB Patients with TB
Question: Is TB disease associated with death?
Statistical hypothesis testing
• There are different statistical tests that are applied in different situations to answer the question
– Are the distributions of one variable different according to another variable?
Which is the same thing as
– Is there an association between one measure and another measure?
• Different tests all give rise to p-values
Statistical test for every situation
• Comparing 2 continuous variables to each other
– Correlation coefficient
• Comparing 2 categorical variables to each other
– Chi-square test, Fisher’s exact test
• Comparing a binary categorical variable to a continuous variable
– Student’s t-test (parametric ~ if continuous variable is Normally distributed)
– Wilcoxon rank-sum test (= Mann-Whitney U-test) (non-parametric ~ if continuous variable not Normally distributed)
• Comparing a polytomous categorical variable to a continuous variable
– ANOVA (parametric ~ if continuous variable is Normally distributed)
– Kruskal-Wallis test (non-parametric ~ if continuous variable not Normally distributed)
Correlation coefficient
• Correlation coefficients (usually “r”) used to examine association between 2 continuous variables
This graph is sometimes called a “scatterplot”
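As an illustration of what “r” measures, here is a minimal sketch of Pearson's correlation coefficient written out by hand. Pearson's r is assumed here (the slides do not say which coefficient is meant), and the age and blood pressure values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(ss_x * ss_y)

# Hypothetical paired measurements (illustrative only)
age = [25, 32, 41, 48, 55, 63, 70]
systolic_bp = [112, 118, 121, 130, 134, 141, 150]

r = pearson_r(age, systolic_bp)
print(f"r = {r:.3f}")
```

r ranges from -1 (perfect negative linear association) through 0 (no linear association) to 1 (perfect positive linear association).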
Chi-squared tests
• Used to examine the association between 2 categorical variables
          Dead   Alive
no TB       26     128
TB+         67      91
Note: Chi-square tests rely on a large-sample approximation, so they are used for larger sample sizes
Fisher’s exact tests
          Dead   Alive
no TB        2       6
TB+          7       5
For smaller sample sizes we replace chi-squared with Fisher’s exact tests (non-parametric)
They do the same thing but with different formulae and many more calculations
Small sample size ~ table contains <60 total, or any cell <5
• Chi-squared tests and Fisher’s exact tests can be used to compare
– 2 binary variables to each other (2x2)
– Binary versus polytomous (eg, 2x3)
– Polytomous versus polytomous (eg, 4x5)
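Both tests can be sketched for a 2x2 table using only the standard library. This is an illustration under stated assumptions: the chi-square version uses no continuity correction, and the Fisher version uses the common two-sided rule of summing all tables that are no more probable than the observed one. The function names are hypothetical; the counts are the two TB tables above.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Larger table: no TB 26 dead / 128 alive, TB+ 67 dead / 91 alive
chi2_stat, chi2_p = chi_square_2x2(26, 128, 67, 91)
print(f"chi-square = {chi2_stat:.2f}, p = {chi2_p:.2g}")

# Smaller table: no TB 2 dead / 6 alive, TB+ 7 dead / 5 alive
fisher_p = fisher_exact_2x2(2, 6, 7, 5)
print(f"Fisher's exact p = {fisher_p:.3f}")
```

The exact test enumerates every possible table with the same margins, which is why it needs many more calculations than the chi-square formula.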
(Student’s) T-test
• Used to compare 2 normal distributions (parametric test)
• Whether 2 distributions are different depends on the size of the difference in means AND how much variability is present
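The point that the result depends on both the difference in means and the variability can be seen in the t statistic itself. The sketch below computes the pooled (equal-variance) two-sample t statistic by hand; the cholesterol values are hypothetical, and the p-value step (which needs the t-distribution) is left out.

```python
import math
import statistics

def pooled_t_statistic(group1, group2):
    """Two-sample t statistic assuming equal variances in the two groups."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Pool the two sample variances, weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return mean_diff / standard_error

# Hypothetical cholesterol values (mmol/L); both pairs have means 5.0 and 5.5
women_tight = [5.0, 5.2, 4.8, 5.1, 4.9]
men_tight = [5.5, 5.7, 5.3, 5.6, 5.4]
women_wide = [3.0, 7.0, 4.0, 6.0, 5.0]
men_wide = [3.5, 7.5, 4.5, 6.5, 5.5]

t_tight = pooled_t_statistic(women_tight, men_tight)
t_wide = pooled_t_statistic(women_wide, men_wide)
print(f"little variability: t = {t_tight:.2f}")
print(f"lots of variability: t = {t_wide:.2f}")
```

The difference in means is identical in both comparisons, but the t statistic is much further from 0 when the values are tightly clustered.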
Wilcoxon rank-sum test (= Mann-Whitney U-test)
• Non-parametric test
• Compares 2 non-normal distributions
– The non-parametric version of t-test
• “Comparing means”: t-test
• “Comparing medians”: rank-sum test
ANOVA
• ANOVA = analysis of variance
• Used to test for any difference in mean values for >2 distributions
Parametric – requires Normally distributed data
Is there any difference between these 3 distributions?
Kruskal-Wallis test
• Extension of Wilcoxon rank-sum test for comparing >2 groups at once
• With exactly 2 groups it is equivalent to the Wilcoxon rank-sum (Mann-Whitney U) test
• Non-parametric version of the ANOVA
• Comparing 2 continuous variables to each other
– Correlation coefficient
• Comparing 2 categorical variables to each other
– Chi-square test, Fisher’s exact test
• Comparing a binary categorical variable to a continuous variable
– Student’s t-test (parametric)
– Wilcoxon rank-sum test (non-parametric)
• Comparing a polytomous categorical variable to a continuous variable
– ANOVA (parametric)
– Kruskal-Wallis test (non-parametric)
Relative risks
• Often clinical research seeks to understand whether patients with some pre-existing status (‘exposure’) may be more/less likely to develop some subsequent health outcome
– Cohort studies
– Randomised controlled trials

          Dead   Alive
Drug A      12      88
Drug B      37      63

• Imagine a trial randomising 100 patients to receive drug A and 100 patients to receive drug B, then following them over time to observe survival
• We could calculate a chi-square test here, but it is not very useful clinically (it only tells us “statistical significance”)
• Often we prefer to calculate the relative risk (risk ratio or rate ratio)
Relative risk
Proportion of all the exposed (here, drug A) patients developing the outcome
divided by
Proportion of all unexposed (here, drug B) patients developing the outcome

          Dead   Alive
Drug A      12      88
Drug B      37      63

Drug A: 12 / (12 + 88) = 0.12
Drug B: 37 / (37 + 63) = 0.37
Relative risk: 0.12 / 0.37 = 0.32
Interpreting the relative risk
• Relative risk is how much more (or less) likely the health outcome is in one group relative to the other
– Here, death is 0.32 times as likely (ie, less likely) in patients receiving drug A relative to patients receiving drug B
• Note: if the risk of the outcome is the same in both arms, the relative risk is 1
• If ‘exposure’ is protective, RR < 1
• If ‘exposure’ is detrimental, RR > 1
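The relative risk calculation for the drug A versus drug B table can be written as a short sketch. The function and variable names are hypothetical; the counts come from the table above.

```python
def risk(events, non_events):
    """Proportion of a group developing the outcome."""
    return events / (events + non_events)

# Counts from the trial table: (dead, alive) in each arm
risk_drug_a = risk(12, 88)   # exposed group
risk_drug_b = risk(37, 63)   # unexposed (comparison) group

relative_risk = risk_drug_a / risk_drug_b
print(f"risk A = {risk_drug_a:.2f}, risk B = {risk_drug_b:.2f}")
print(f"relative risk = {relative_risk:.2f}")

if relative_risk < 1:
    print("Outcome is less likely with drug A (protective)")
elif relative_risk > 1:
    print("Outcome is more likely with drug A (detrimental)")
else:
    print("Same risk in both arms")
```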
Confidence intervals (again)
• We can calculate confidence intervals (CI) around this relative risk
– Here, the interval is (0.19 – 0.61)
• The CI gives a range of estimates for the RR that the observed data (from the table) are consistent with
– Narrow CI ~ precise estimate of RR (good)
– Wide CI ~ imprecise estimate of RR (bad)
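One common way to get a CI for a relative risk is to work on the log scale. The sketch below uses that log-based large-sample method as an assumption; the slides do not say which method produced the interval quoted above, and this method gives a somewhat different interval (roughly 0.18 to 0.58) for the same table.

```python
import math

def relative_risk_ci(a, b, c, d, z=1.96):
    """RR and approximate 95% CI for the table [[a, b], [c, d]].

    a, b = events and non-events in the exposed group
    c, d = events and non-events in the unexposed group
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(RR) under the large-sample approximation
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Drug A: 12 dead, 88 alive; drug B: 37 dead, 63 alive
rr, ci_lower, ci_upper = relative_risk_ci(12, 88, 37, 63)
print(f"RR = {rr:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
```

Either way the interval lies entirely below 1, so the conclusion (drug A is associated with lower risk of death) is the same.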
Absolute risk reduction (risk difference)
• Like relative risk, but subtract instead of divide
Proportion of all the exposed (here, drug A) patients developing the outcome
minus
Proportion of all unexposed (here, drug B) patients developing the outcome

          Dead   Alive
Drug A      12      88
Drug B      37      63

Drug A: 12 / (12 + 88) = 0.12
Drug B: 37 / (37 + 63) = 0.37
Risk difference: 0.12 - 0.37 = -0.25
• Absolute risk reduction (risk difference) tells us how much the risk of the health outcome differs between the exposed and unexposed groups
– Here, risk of death is 0.25 (25 percentage points) lower when patients receive drug A compared to drug B
• Note: if the risk of the outcome is the same in both arms, the risk reduction (risk difference) is 0
• If ‘exposure’ is protective, risk difference < 0
• If ‘exposure’ is detrimental, risk difference > 0
Numbers needed to treat
• The average number of patients who need to receive an intervention (here, Drug A) to prevent 1 outcome from happening
• Calculated as 1 / (risk reduction)
• Here, 1 / 0.25 = 4
– On average, 4 patients need to receive drug A instead of drug B to prevent 1 death
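The risk difference and number needed to treat follow directly from the two risks. A minimal sketch using the trial table (variable names are hypothetical):

```python
# Risks from the trial table: dead / (dead + alive) in each arm
risk_drug_a = 12 / (12 + 88)
risk_drug_b = 37 / (37 + 63)

# Risk difference: exposed minus unexposed (negative means protective)
risk_difference = risk_drug_a - risk_drug_b

# Number needed to treat uses the size of the risk reduction
absolute_risk_reduction = abs(risk_difference)
number_needed_to_treat = 1 / absolute_risk_reduction

print(f"risk difference = {risk_difference:.2f}")
print(f"NNT = {number_needed_to_treat:.0f}")
```

In practice the NNT is usually rounded up to the next whole patient when it is not an integer.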
P-values
• P-values provide a measure of “statistical significance” from any statistical test that compares 2 things
– “universal currency” of statistical comparison
– Helps us understand the role of chance in explaining an association
Interpreting p-values
• P-values ~ probabilities ~ range from 0 - 1
• P-value’s formal definition is based on hypothesis testing
– Evaluates how compatible the data are with the null hypothesis (it is not the probability that the null hypothesis is true)
• Null hypothesis ~ usually that there is no association between variables
– P-value: the probability of observing data at least as extreme as those in your study if the null hypothesis is true
Practical interpretation of p-values
• Large p-value: the association observed between the 2 variables in your data is consistent with the hypothesis of no association between variables
– “Association is not statistically significant”
– “No statistically significant difference”
• Small p-value: the association observed between the 2 variables in your data is NOT consistent with the hypothesis of no association
– “Association is statistically significant”
– “Statistically significant difference”
• Smaller p-value → association less consistent with chance finding
– “Statistically significant” = not consistent with chance
Small vs Large
• We traditionally use 0.05 as a cut off for a “statistically significant” p-value
• This is an arbitrary rule-of-thumb
– 0.048 is not very different from 0.053
• Another guide
– >0.1 = not statistically significant
– 0.05-0.1 = approaching statistical significance
– 0.001-0.05 = statistically significant
– <0.001 = highly statistically significant
Sample sizes
• “Statistical significance” (the size of a p-value) is determined by a few things, most importantly
– The size of the difference in the measure you are looking at AND
– The number of patients (sample sizes) involved
For example

Table 1
          Dead   Alive
no TB        2       8
TB+          6       4

Table 2
          Dead   Alive
no TB       20      80
TB+         60      40

The proportions in the 2 tables are the same (calculate the risk ratios to see this)
But the p-value for the first (smaller) table is 0.17
The p-value for the second (larger) table is <0.001
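The effect of sample size can be reproduced with a short sketch. It assumes a two-sided Fisher's exact test for both tables (the slides do not name the test used); with that assumption the smaller table gives a p-value of about 0.17. The function names are hypothetical.

```python
import math

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

def risk_ratio(a, b, c, d):
    """Risk in the first row divided by risk in the second row."""
    return (a / (a + b)) / (c / (c + d))

# Rows are no TB and TB+, columns are dead and alive
small_table = (2, 8, 6, 4)
large_table = (20, 80, 60, 40)

rr_small = risk_ratio(*small_table)
rr_large = risk_ratio(*large_table)
p_small = fisher_exact_2x2(*small_table)
p_large = fisher_exact_2x2(*large_table)

print(f"small table: RR = {rr_small:.2f}, p = {p_small:.2f}")
print(f"large table: RR = {rr_large:.2f}, p = {p_large:.2g}")
```

The risk ratio is identical in the two tables; only the number of patients changes, and with it the p-value.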
P-values and CI
• P-values and CI are closely related
– Calculated from the same place
– A small p-value suggests narrow CI (more precise, good)
– A large p-value suggests wide CI (less precise, bad)
• A 95% CI for an RR that does not include 1 means the corresponding p-value is <0.05 (‘statistically significant’)
Example: interpret the following
RR = 1.9; 95% CI = 1.4 – 2.8; p=0.008
– Outcome about 2x more common in exposed vs unexposed; narrow CI, statistically significant
RR = 0.8; 95% CI = 0.2 – 4.8; p=0.37
– Outcome slightly less common in exposed vs unexposed; wide CI, not statistically significant
RR = 1.02; 95% CI = 0.5 – 2.0; p=0.98
– Not much difference in frequency of outcome between exposed and unexposed, not statistically significant
Regression models
• All the statistical tests we have looked at so far only look at the association between two variables at a time
• But sometimes we want to look at the associations involved between >2 variables at once – Regression models commonly used for this
Concept of regression
• Equation used to predict an outcome variable (y) according to one or more predictor variables (x’s)
• Basic equation for a line
Y = intercept + slope * X
Here we’re interested in the slope → relationship between X and Y
Linear regression
Note: There are many different kinds of regression
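For the simplest case (one predictor), the intercept and slope of the line can be estimated by ordinary least squares. This is a minimal hand-written sketch; the age and blood pressure values are hypothetical, and real analyses would normally use a statistics package.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for Y = intercept + slope * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by variation of x
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: age in years and systolic blood pressure in mmHg
age = [25, 32, 41, 48, 55, 63, 70]
systolic_bp = [112, 118, 121, 130, 134, 141, 150]

intercept, slope = fit_line(age, systolic_bp)
print(f"systolic BP = {intercept:.1f} + {slope:.2f} * age")

# Predicted outcome for a new value of the predictor
predicted_bp_at_50 = intercept + slope * 50
print(f"predicted systolic BP at age 50: {predicted_bp_at_50:.1f}")
```

The slope is read as the average change in Y for each 1-unit increase in X.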
Application of regression models in medical research
• Regression models are used to look at how multiple factors combine to predict a health outcome
– Especially adjustment for confounding variables
• Equations like
Y = intercept + (slope*X) + (slope*R) + (slope*Z)
would be used to understand how a certain health outcome (Y) is predicted by 3 different factors (X, R, Z), each with its own slope
Survival analysis
• Survival analysis uses time-to-event measures
• “Survival” can mean time until death
– Or any other specific outcome
  • Remission, Cure, Relapse, Need for admission
– Any binary outcome studied over time
• Note: survival analysis uses data from cohort studies or RCTs
– Need to follow patients over time
Kaplan-Meier plots
Kaplan-Meier survival analyses to compare survival in 2 groups over time
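To show what a Kaplan-Meier curve is made of, here is a minimal sketch of the product-limit estimate for a single group. The follow-up times and outcomes are hypothetical, and the function name is an assumption for illustration; comparing 2 groups means computing one such curve per group.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time an event occurs.

    times  = follow-up time for each patient
    events = 1 if the event (eg, death) occurred, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at this time
        while i < len(data) and data[i][0] == current_time:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the proportion of those at risk who survive this time
            survival *= 1 - n_events / n_at_risk
            curve.append((current_time, survival))
        # Events and censored patients both leave the risk set
        n_at_risk -= n_leaving
    return curve

# Hypothetical follow-up in months (illustrative only)
follow_up_months = [2, 3, 3, 5, 8, 8, 9, 12]
died = [1, 1, 0, 1, 1, 0, 0, 1]

km_curve = kaplan_meier(follow_up_months, died)
for month, probability in km_curve:
    print(f"month {month}: estimated survival {probability:.3f}")
```

Censored patients (lost to follow-up or still alive at the end of the study) contribute to the number at risk up to the time they leave, which is what distinguishes this from a simple proportion surviving.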
Hazard ratios
• There is a particular kind of regression model for survival analysis: Cox’s proportional hazards model
• Model gives us hazard ratios
– Like the distance between 2 survival curves
– Interpreted exactly like relative risks
• So how would you interpret:
– HR > 1
– HR < 1
– HR = 1
III. Evaluating measurements
Evaluating a new test
• We often want to know how well a certain test performs in detecting a condition of interest
– We may be interested in screening for a condition or diagnosing it
• Test of interest may be
– Laboratory assay, radiological investigation
• We want to know how test performs
– Identifying those with disease, those without
• We study this by comparing the new test to an established gold-standard (representing the truth)
             True pos   True neg
Test pos        A          B        A+B
Test neg        C          D        C+D
               A+C        B+D

B = False Positives
C = False Negatives
Ways of evaluating a new test
• Sensitivity
– The proportion of people who truly have disease who are detected correctly by the test
• Specificity
– The proportion of people who truly do not have disease who are detected correctly by the test
These are features of tests, but not actually what is needed by a clinician at bedside
• Positive predictive value
– What proportion of people who have a positive test truly have disease
• Negative predictive value
– What proportion of people who have a negative test truly do not have disease
These are of greater clinical interest, but are problematic in reality (they depend on how common the disease is in the population tested)
             True pos   True neg
Test pos        A          B        A+B
Test neg        C          D        C+D
               A+C        B+D

• Sensitivity = A / (A+C)
– High sensitivity means few false negatives
• Specificity = D / (B+D)
– High specificity means few false positives
             True pos   True neg
Test pos        A          B        A+B
Test neg        C          D        C+D
               A+C        B+D

• Positive Predictive Value = A / (A+B)
– High PPV means few false positives
• Negative Predictive Value = D / (C+D)
– High NPV means few false negatives
Example: raised IFN-γ in detecting culture-confirmed TB in smear-negative patients

                 TB cult pos   TB cult neg
Raised IFN-γ          75           125        200
Normal IFN-γ          25           175        200
                     100           300

Calculate
– Sens
– Spec
– PPV
– NPV
Interpretations: Sens, Spec
• Sensitivity and specificity of raised IFN-γ are 75% and 58%, respectively in detecting culture-confirmed TB
– This means that 75% of patients with culture-confirmed TB will have raised IFN-γ (test pos)
– And that 58% of patients without disease will not have raised IFN-γ (test neg)
Interpretations: PPV, NPV
• PPV and NPV of raised IFN-γ are 38% and 88%, respectively in detecting culture-confirmed TB
– This means that 38% of patients with a raised IFN-γ will truly have culture-confirmed TB
– And that 88% of patients with a low IFN-γ will truly not have TB
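The four measures for the IFN-γ example can be checked with a short sketch that follows the A, B, C, D layout of the 2x2 table. The function name is hypothetical.

```python
def diagnostic_measures(a, b, c, d):
    """Test performance from a 2x2 table against a gold standard.

    a = test pos, truly pos    b = test pos, truly neg (false positives)
    c = test neg, truly pos (false negatives)    d = test neg, truly neg
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

# Raised IFN-γ: 75 culture pos, 125 culture neg
# Normal IFN-γ: 25 culture pos, 175 culture neg
ifn_gamma = diagnostic_measures(a=75, b=125, c=25, d=175)

for name, value in ifn_gamma.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are calculated down the columns (by true disease status), while PPV and NPV are calculated across the rows (by test result).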
General rules
• Higher sensitivities usually mean lower specificities (good tests are high on both)
• High sens → few false negatives → high NPV
– If a test has a high sens, negative result helps rule out disease (SnOUT)
• High spec → few false positives → high PPV
– If a test has a high spec, positive result helps rule in disease (SpIN)
Reliability
• There are some situations in which there is no “gold-standard” to compare with
• We then compare the reliability (repeatability) of measures
• Example: – radiologists identifying lesions on scan – pathologists identifying malignancy on biopsy – psychiatrists making any diagnosis
• In these situations we compare the repeatability of measures
• Here it’s not clear who the gold-standard is → can’t really calculate sens, spec, etc.
                          Radiologist A: positive   Radiologist A: negative
Radiologist B: positive             31                        19                50
Radiologist B: negative             21                        79               100
                                    52                        98               150
Kappa
• Instead we want to see the amount of agreement between the raters
• Kappa = the degree of agreement of 2 raters above chance
– Measure of inter-rater reliability (or test-retest reliability when the same rater repeats the measurement)
– Range, -1 (perfect disagreement) to 1 (perfect agreement)
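Kappa for the two radiologists can be worked through as a sketch. Cohen's kappa is assumed here, since the slides do not give the formula: kappa = (observed agreement - agreement expected by chance) / (1 - agreement expected by chance). The function and argument names are hypothetical.

```python
def cohen_kappa(both_pos, b_pos_a_neg, b_neg_a_pos, both_neg):
    """Cohen's kappa for two raters (A and B) making a positive/negative call."""
    n = both_pos + b_pos_a_neg + b_neg_a_pos + both_neg

    # Observed agreement: both say positive or both say negative
    observed = (both_pos + both_neg) / n

    # Marginal totals for each rater
    a_pos = both_pos + b_neg_a_pos
    a_neg = b_pos_a_neg + both_neg
    b_pos = both_pos + b_pos_a_neg
    b_neg = b_neg_a_pos + both_neg

    # Agreement expected by chance alone, given each rater's positive rate
    expected = (a_pos * b_pos + a_neg * b_neg) / n ** 2

    return (observed - expected) / (1 - expected)

# Counts from the radiologist table
kappa = cohen_kappa(both_pos=31, b_pos_a_neg=19, b_neg_a_pos=21, both_neg=79)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 110 of 150 scans (73%), but about 55% agreement would be expected by chance, so kappa is only about 0.41.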