brief outline of basic statistics - university of cape town · 2017. 8. 30. · brief outline of...

Brief outline of basic statistics

S/Lecturer Maia Lesosky, Division of Epidemiology & Biostatistics, University of Cape Town [email protected]

Thanks to Landon Myer for slides

“These are the topics that need to be covered”

1. definitions of mean, median and standard deviation
2. interpretation of confidence intervals and p values
3. sensitivity and specificity
4. positive and negative predictive values
5. interpretation of parametric and non parametric data tests commonly used: Student's T test, Chi square analysis, Fisher's exact test, ANOVA
6. correlation
7. risk reduction and numbers needed to treat
8. relative risk and hazard ratio
9. regression analysis
10. interpretation of kappa values
11. survival analysis

Principles of this talk

More detail here than you need

→ Will go quickly, you must stop to ask questions
→ These are the basics, you will be asked to apply them

Many things here are gross simplifications

You want this

Terminology varies → have kept as general as possible, noted synonyms

Outline

I. Measurements & distributions (20%) – describing distributions

II. Making comparisons (70%) – Statistical tests, p-values & CI – Regression, survival analysis

III. Evaluating measurements (10%) – Validity, reliability

I. Measurements & distributions

Measurements

•  Broadly 3 kinds of measurements in health sciences (“variables”)

–  Numeric measures: continuous, discrete

–  Categorical measures: polytomous (ordinal vs nominal), binary

–  Time-based measures: time-to-event, survival

Examples
•  Systolic blood pressure
•  Mortality
•  ALT
•  TNM staging
•  Gender/sex
•  Time to remission

Note: can make categories of continuous measures

[Figure: haemoglobin expressed three ways: as a continuous measure in g/dL (numeric measure), as a polytomous measure (low, medium, high; categorical measure), and as a binary measure (low, high; categorical measure)]

Distributions

•  When we take measurements on many patients, we can describe measures as distributions

•  How we describe a distribution will depend on the kind of measure

Categorical distributions

•  Describe in frequency distributions

Can describe distribution of categorical variable in terms of counts, percentages

Different units, same conclusions

Continuous distributions

•  We also describe in terms of frequency distributions

•  But we have many “categories” (“bins”)

We draw summaries to describe the shape of the distribution

There are other ways to show shapes of distributions

‘Box-and-whisker’ plots

Many different possible shapes for continuous distributions

Skewed distributions

Negative (left) skew
–  Long “tail” of distribution is to the left
–  Bulk of observations shifted right

Positive (right) skew
–  Long “tail” of distribution is to the right
–  Bulk of observations shifted left

There are some “classic” distributions

Ways to describe a distribution

•  Measure of central tendency – Where the distribution clusters

•  Measure of dispersion – How spread out the distribution is

Measures of central tendency

•  Mean – Arithmetic mean, average value

•  Median – 50th percentile, middle value

•  Mode – Most commonly occurring value

Quantiles (regular intervals of a distribution)

•  Percentiles 1, 2, 3, 4, 5, 6, 7, 8 …… 94, 95, 96, 97, 98, 99, 100

•  Deciles 1-10, 11-20, 21-30 ….. 81-90, 91-100

•  Quintiles 1-20, 21-40, 41-60, 61-80, 81-100

•  Quartiles 1-25, 26-50, 51-75, 76-100

•  Tertiles 1-33, 34-67, 68-100

Describing distributions

Note: if the data are normally distributed, the mean is a good measure of central tendency

If the data are non-Normal (eg, skewed), the median is a better measure of central tendency

Measures of dispersion (‘spread’)

•  Range –  Minimum value to Maximum value

•  Variance –  Average squared distance between each point and the mean

•  Standard deviation –  Square root of variance

•  Interquartile range –  25th percentile to 75th percentile
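The measures of central tendency and dispersion listed above can be computed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the haemoglobin values are made up for illustration and do not come from the slides.

```python
import statistics

# Hypothetical haemoglobin values in g/dL (illustrative only)
haemoglobin = [9.8, 11.2, 12.1, 12.5, 12.5, 13.0, 13.4, 14.1, 15.6]

# Measures of central tendency
mean = statistics.mean(haemoglobin)
median = statistics.median(haemoglobin)   # 50th percentile
mode = statistics.mode(haemoglobin)       # most commonly occurring value

# Measures of dispersion
value_range = (min(haemoglobin), max(haemoglobin))
variance = statistics.variance(haemoglobin)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(haemoglobin)      # square root of the sample variance
q1, q2, q3 = statistics.quantiles(haemoglobin, n=4)  # quartile cut points
iqr = (q1, q3)                               # 25th to 75th percentile

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} variance={variance:.2f} SD={std_dev:.2f} IQR={iqr}")
```

Note that `statistics.variance` divides by n - 1 rather than n, which is the usual choice when the data are a sample from a larger population.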

Variance

3 distributions: same mean value, but different variances

We have a favourite distribution

Remember: standard deviation is just the square root of variance

Normal distribution ~ “Gaussian distribution” (the “standard normal distribution” is the special case with mean 0 and standard deviation 1)

We like the Normal distribution because it has some well-defined features

In a Normal distribution, 95% of the data fall within 1.96 standard deviations of the mean value

(here, 95.45% within 2 standard deviations)
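The 1.96 rule can be checked directly, and the same multiplier can be applied to the standard error to get a confidence interval for a mean. The sketch below uses `statistics.NormalDist` from the standard library; the blood pressure values are hypothetical, and with a sample this small a t-distribution multiplier would normally replace 1.96.

```python
import math
import statistics
from statistics import NormalDist

# Proportion of a Normal distribution within k standard deviations of the mean
standard_normal = NormalDist(mu=0, sigma=1)
within_1_96 = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"within 1.96 SD: {within_1_96:.4f}, within 2 SD: {within_2:.4f}")

# Hypothetical systolic blood pressures in mmHg (illustrative only)
sbp = [118, 122, 125, 130, 131, 135, 138, 140, 144, 150]
mean = statistics.mean(sbp)
sd = statistics.stdev(sbp)
se = sd / math.sqrt(len(sbp))  # standard error of the mean

# Range expected to hold about 95% of individual values (uses the SD)
reference_range = (mean - 1.96 * sd, mean + 1.96 * sd)
# Approximate 95% confidence interval for the mean itself (uses the SE)
ci_mean = (mean - 1.96 * se, mean + 1.96 * se)
print(f"reference range: {reference_range}, 95% CI for mean: {ci_mean}")
```

The interval for the mean is narrower than the range for individual values because the standard error shrinks as the sample size grows.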

•  This is where a 95% confidence interval comes from
– 95% of individual values lie within ± 1.96 standard deviations of the mean value
– A 95% confidence interval around a mean value uses the same multiplier on the standard error: mean ± 1.96 × SD / √n

– Sometimes we are lazy and use ± 2 instead of ± 1.96

•  95% CI is a generic statistic
–  It will come up elsewhere (same concept, different application)

•  We often manipulate (“transform”) variables so that we can make them “Normal”
–  Common manipulations include logarithms, square roots, or squares
–  Eg, log HIV viral loads

There are many different standard distributions with well-defined features

•  Gaussian (Normal) is most common

•  Others
– Z-distribution, T-distribution, F-distribution
– Chi-squared (χ2) distribution
– Binomial distribution (for categorical data)
– Poisson distribution (for counts of things)

Parametric statistics

•  If the distribution of our measures follows a known distribution, we can make assumptions about our data based on rules of the known distribution
– Eg, if our data are normally distributed, we know that 95% of data fall within 2 standard deviations of the mean value

•  These kinds of statistics are parametric statistics

Non-parametric statistics

•  If our measures really don’t look like any known distribution, we can’t make assumptions about them based on any standard distribution
– We have to work with the actual values of our measurements

•  These are non-parametric statistics

Example

There are parametric and non-parametric approaches to describing distributions

•  If data are normally distributed
–  Mean and standard deviation (or variances) used to describe distributions

•  If data are not normally distributed
–  Median and interquartile ranges (or just ranges) used to describe distribution

II. Making comparisons

•  Sometimes we want to compare 2 distributions to each other
– Are the distributions different from each other?
–  Is there an association between the two measures?

•  We can ask this question about different combinations of
– Continuous measures (Normal or non-normal distributions)
– Categorical measures (polytomous or binary)

Example: comparison of 2 distributions

Serum cholesterol among women

Serum cholesterol among men

Question: Is cholesterol associated with gender?

Example: comparison of 2 binary measures

Patients without TB Patients with TB

Question: Is TB disease associated with death?

Statistical hypothesis testing

•  There are different statistical tests that are applied in different situations to answer the question
– Are the distributions of one variable different according to another variable?

Which is the same thing as

–  Is there an association between one measure and another measure?

•  Different tests all give rise to p-values

Statistical test for every situation

•  Comparing 2 continuous variables to each other
–  Correlation coefficient

•  Comparing 2 categorical variables to each other
–  Chi-square test, Fisher’s exact test

•  Comparing a binary categorical variable to a continuous variable
–  Student’s T-test (parametric ~ if continuous variable is normally distributed)
–  Wilcoxon rank-sum test (=Mann-Whitney U-test) (nonparametric - if continuous variable not normally distributed)

•  Comparing a polytomous categorical variable to a continuous variable
–  ANOVA (parametric ~ if continuous variable is normally distributed)
–  Kruskal-Wallis test (nonparametric - if continuous variable not normally distributed)

Correlation coefficient

•  Correlation coefficients (usually “r”) used to examine association between 2 continuous variables

This graph is sometimes called a “scatterplot”
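As an illustration of what "r" measures, here is a minimal sketch of the Pearson correlation coefficient written out by hand. The age and blood pressure values are hypothetical, and `pearson_r` is a helper name introduced here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable on its own
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data: age (years) and systolic blood pressure (mmHg)
age = [25, 32, 41, 48, 55, 63, 70]
sbp = [112, 118, 121, 130, 134, 141, 150]

r = pearson_r(age, sbp)
print(f"r = {r:.3f}")
```

r ranges from -1 (perfect negative linear association) through 0 (no linear association) to 1 (perfect positive linear association).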

Chi-squared tests

•  Used to examine the association between 2 categorical variables

          Dead   Alive
no TB       26     128
TB+         67      91

Note: Chi-square tests rely on a large-sample approximation to the chi-squared distribution, so they are used for larger sample sizes
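The chi-squared statistic for the TB table above can be worked out by hand. This is a sketch assuming the plain Pearson chi-squared test without continuity correction; for a 2x2 table (1 degree of freedom) the p-value can be obtained from the complementary error function in the standard library. `chi_square_2x2` is a helper name introduced here.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Assumption: no continuity correction. Returns (statistic, p_value).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: no TB, TB+; columns: dead, alive (counts from the table above)
chi2, p = chi_square_2x2(26, 128, 67, 91)
print(f"chi-squared = {chi2:.2f}, p = {p:.2g}")
```

A large statistic and a very small p-value indicate that death is associated with TB status in these data.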

Fisher’s exact tests


TB+ 7 5

For smaller sample sizes we replace chi-squared with Fisher’s exact tests (non-parametric)

They answer the same question but use a different formula, which needs many more calculations

Small sample size ~ table contains <60 total, or any cell <5

•  Chi-squared tests and Fisher’s exact tests can be used to compare
– 2 binary variables to each other (2x2)
– Binary versus polytomous (eg, 2x3)
– Polytomous versus polytomous (eg, 4x5)

(Student’s) T-test

•  Used to compare 2 normal distributions (parametric test)

•  Whether 2 distributions are different depends on the size of the difference in means AND how much variability is present
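To see why both the difference in means and the variability matter, the sketch below computes the pooled-variance (Student's) t statistic for two pairs of hypothetical groups with the same 0.5 difference in means but different spread. Only the statistic is computed; turning it into a p-value needs the t-distribution, which is left out here.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Student's t statistic for two independent groups (equal variances assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

# Hypothetical data: both pairs differ in mean by 0.5
tight_a = [4.8, 4.9, 5.0, 5.1, 5.2]
tight_b = [5.3, 5.4, 5.5, 5.6, 5.7]
noisy_a = [3.0, 4.0, 5.0, 6.0, 7.0]
noisy_b = [3.5, 4.5, 5.5, 6.5, 7.5]

t_tight = pooled_t_statistic(tight_a, tight_b)
t_noisy = pooled_t_statistic(noisy_a, noisy_b)
print(f"low variability: t = {t_tight:.2f}; high variability: t = {t_noisy:.2f}")
```

The same difference in means gives a t statistic ten times larger in magnitude when the groups are tightly clustered.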

Wilcoxon rank-sum test (= Mann-Whitney U-test)

•  Non-parametric test

•  Compares 2 non-normal distributions
– The non-parametric version of t-test

•  “Comparing means”: t-test

•  “Comparing medians”: rank-sum test

ANOVA

•  ANOVA = analysis of variance

•  Used to test for any difference in mean values for >2 distributions

Parametric – requires Normally distributed data

Is there any difference between these 3 distributions?

Kruskal-Wallis test

•  Extension of Wilcoxon rank-sum test (= Mann-Whitney U-test) for comparing >2 groups at once

•  Non-parametric version of the ANOVA

•  Comparing 2 continuous variables to each other
–  Correlation coefficient

•  Comparing 2 categorical variables to each other
–  Chi-square test, Fisher’s exact test

•  Comparing a binary categorical variable to a continuous variable
–  Student’s T-test (parametric)
–  Wilcoxon rank-sum test (nonparametric)

•  Comparing a polytomous categorical variable to a continuous variable
–  ANOVA (parametric)
–  Kruskal-Wallis test (nonparametric)

Relative risks

•  Often clinical research seeks to understand whether patients with some pre-existing status (‘exposure’) may be more/less likely to develop some subsequent health outcome
– Cohort studies
– Randomised controlled trials

           Dead   Alive
Drug A       12      88
Drug B       37      63

•  Imagine a trial randomising 100 patients to receive drug A and 100 patients to receive drug B, then following them over time to observe survival

•  We could calculate a chi-square test here, but it is not very useful clinically (it only tells us “statistical significance”)

•  Often we prefer to calculate the relative risk (risk ratio or rate ratio)

Relative risk

Proportion of all the exposed (here, drug A) patients developing the outcome
divided by
Proportion of all unexposed (here, drug B) patients developing the outcome

Drug A: 12 / (12 + 88) = 0.12

Drug B: 37 / (37 + 63) = 0.37

Relative risk: 0.12 / 0.37 = 0.32

Interpreting the relative risk

•  Relative risk is how much more (or less) likely the health outcome is in one group relative to the other
– Here, death is 0.32 times as likely (ie, less likely) in patients receiving drug A relative to patients receiving drug B

•  Note: if the risk of the outcome is the same in both arms, the relative risk is 1

•  If ‘exposure’ is protective, RR < 1

•  If ‘exposure’ is detrimental, RR > 1

Confidence intervals (again)

•  We can calculate confidence intervals (CI) around this relative risk
– Here, the interval is (0.19 – 0.61)

•  The CI gives a range of estimates for the RR that the observed data (from the table) are consistent with
– Narrow CI ~ precise estimate of RR (good)
– Wide CI ~ imprecise estimate of RR (bad)
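The relative risk and an approximate 95% CI for the drug A versus drug B trial can be sketched as follows. The slides do not say how their interval was calculated; this example assumes the common log-transformation (Katz) method, which gives roughly 0.18 to 0.58, close to but not identical to the interval quoted above. `relative_risk` is a helper name introduced here.

```python
import math

def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Relative risk with an approximate 95% CI (log method, an assumption here)."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR)
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed
        + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, lower, upper

# Drug A: 12 deaths out of 100; drug B: 37 deaths out of 100
rr, lower, upper = relative_risk(12, 100, 37, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the whole interval lies below 1, the data are consistent with drug A being protective relative to drug B.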

Absolute risk reduction (risk difference)

•  Like relative risk, but subtract instead of divide

Proportion of all the exposed (here, drug A) patients developing the outcome
minus
Proportion of all unexposed (here, drug B) patients developing the outcome

Drug A: 12 / (12 + 88) = 0.12

Drug B: 37 / (37 + 63) = 0.37

Risk difference: 0.12 - 0.37 = -0.25

•  Absolute risk reduction (risk difference) tells us how the risk of the health outcome changes when the exposure is taken away
– Here, risk of death drops by 0.25 (25 percentage points) when patients receive drug A compared to drug B

•  Note: if the risk of the outcome is the same in both arms, the risk reduction (risk difference) is 0

•  If ‘exposure’ is protective, risk difference < 0

•  If ‘exposure’ is detrimental, risk difference > 0

Numbers needed to treat

•  The average number of patients who need to receive an intervention (here, Drug A) to prevent 1 outcome from happening

•  Calculated as 1 / (risk reduction)

•  Here, 1 / 0.25 = 4
– On average, 4 patients need to receive drug A instead of drug B to prevent 1 death
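The risk difference and number needed to treat for the same trial can be reproduced with a few lines of arithmetic. This is a minimal sketch; rounding the NNT up to a whole patient is a common convention assumed here, not something stated in the slides.

```python
import math

# Risk of death in each arm (counts from the drug A / drug B table)
risk_drug_a = 12 / (12 + 88)
risk_drug_b = 37 / (37 + 63)

# Risk difference: exposed (drug A) minus unexposed (drug B)
risk_difference = risk_drug_a - risk_drug_b

# Absolute risk reduction is the size of that difference
absolute_risk_reduction = abs(risk_difference)

# Number needed to treat = 1 / absolute risk reduction
nnt = 1 / absolute_risk_reduction
# Round to remove floating-point noise, then round up to whole patients
nnt_whole_patients = math.ceil(round(nnt, 9))

print(f"risk difference = {risk_difference:.2f}, NNT = {nnt_whole_patients}")
```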

P-values

•  P-values provide a measure of “statistical significance” from any statistical test that compares 2 things

– “universal currency” of statistical comparison

– Helps us understand the role of chance in explaining an association

Interpreting p-values

•  P-values ~ probabilities ~ range from 0 - 1

•  P-value’s formal definition based on hypothesis testing
– Evaluates the data against the null hypothesis (it is not the probability that the null hypothesis is true)

•  Null hypothesis ~ usually that there is no association between variables

– P-value: the probability of observing data at least as extreme as those in your study if the null hypothesis is true

Practical interpretation of p-values

•  Large p-value:

The association observed between the 2 variables in your data is consistent with the hypothesis of no association between variables

– “Association is not statistically significant”

– “No statistically significant difference”

•  Small p-value: the association observed between the 2 variables in your data is NOT consistent with the hypothesis of no association
– “Association is statistically significant”
– “Statistically significant difference”

•  Smaller p-value → association less consistent with chance finding
– “Statistically significant” = not consistent with chance

Small vs Large

•  We traditionally use 0.05 as a cut off for a “statistically significant” p-value

•  This is an arbitrary rule-of-thumb
– 0.048 is not very different from 0.053

•  Another guide
– >0.1 = not statistically significant
– 0.05-0.1 = approaching statistical significance
– 0.001-0.05 = statistically significant
– <0.001 = highly statistically significant

Sample sizes

•  “Statistical significance” (the size of a p-value) is determined by a few things, most importantly

– The size of the difference in the measure you are looking at AND

– The number of patients (sample sizes) involved

For example

First (smaller) table:

          Dead   Alive
no TB        2       8
TB+          6       4

Second (larger) table:

          Dead   Alive
no TB       20      80
TB+         60      40

The proportions in the 2 tables are the same (calculate the risk ratios to see this)

But the p-value for the smaller table is 0.17

The p-value for the larger table is <0.001
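The effect of sample size on the p-value can be reproduced with a two-sided Fisher's exact test written from first principles. This is a sketch under two assumptions: the larger table is taken to be exactly ten times the smaller one (no TB: 20 dead, 80 alive; TB+: 60 dead, 40 alive), and "two-sided" means summing the probabilities of all tables no more likely than the observed one. `fisher_exact_two_sided` is a helper name introduced here.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = math.comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k events in row 1 with fixed margins
        return math.comb(row1, k) * math.comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum the probabilities of tables at least as unlikely as the observed one
    return sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )

# Rows: no TB, TB+; columns: dead, alive
p_small = fisher_exact_two_sided(2, 8, 6, 4)
p_large = fisher_exact_two_sided(20, 80, 60, 40)  # assumed 10x table

print(f"small table p = {p_small:.2f}")
print(f"large table p = {p_large:.2g}")
```

The risk ratio is identical in both tables; only the number of patients changes, and with it the p-value.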

P-values and CI

•  P-values and CI are closely related
– Calculated from the same place
– A small p-value suggests narrow CI (more precise, good)
– A large p-value suggests wide CI (less precise, bad)

•  A 95% CI for an RR that does not include 1 means the corresponding p-value is <0.05 (‘statistically significant’)

Example: interpret the following

RR = 1.9; 95% CI = 1.4 – 2.8; p=0.008
– Outcome about 2x more common in exposed vs unexposed; narrow CI, statistically significant

RR = 0.8; 95% CI = 0.2 – 4.8; p=0.37
– Outcome slightly less common in exposed vs unexposed; wide CI, not statistically significant

RR = 1.02; 95% CI = 0.5 – 2.0; p=0.98
– Not much difference in frequency of outcome between exposed and unexposed, not statistically significant

Regression models

•  All the statistical tests we have looked at so far only look at the association between two variables at a time

•  But sometimes we want to look at the associations involved between >2 variables at once
– Regression models commonly used for this

Concept of regression

•  Equation used to predict an outcome variable (y) according to one or more predictor variables (x’s)

•  Basic equation for a line: Y = intercept + slope * X

Here we’re interested in the slope → relationship between X and Y

Linear regression

Note: There are many different kinds of regression
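For the simplest case, one predictor and a straight line, the intercept and slope can be estimated by ordinary least squares. The sketch below does this by hand on made-up data; `fit_line` is a helper name introduced here, and real analyses with several predictors would use a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by variation of x
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical predictor (x) and outcome (y) values
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print(f"Y = {intercept:.2f} + {slope:.2f} * X")

# Predicted outcome for a new value of X
predicted = intercept + slope * 6
```

The slope is the estimated change in Y for each one-unit increase in X.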

Application of regression models in medical research

•  Regression models used to look at how multiple factors combine to predict a health outcome
–  Especially adjustment for confounding variables

•  Equations like Y = intercept + (slope*X) + (slope*R) + (slope*Z) would be used to understand how a certain health outcome (Y) is predicted by 3 different factors (X, R, Z)

Survival analysis

•  Survival analysis uses time-to-event measures

•  “Survival” can mean time until death
– Or any other specific outcome: remission, cure, relapse, need for admission
– Any binary outcome studied over time

•  Note: survival analysis from cohort studies or RCTs
– Need to follow patients over time

Kaplan-Meier plots

Kaplan-Meier survival analyses to compare survival in 2 groups over time
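The curve in a Kaplan-Meier plot is built by multiplying, at each event time, the proportion of at-risk patients who survive that time. This is a minimal sketch of the estimator for one group, using made-up follow-up data; `kaplan_meier` is a helper name introduced here, and patients who are censored (event flag 0) leave the at-risk set without counting as events.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time for each patient
    events: 1 if the outcome occurred at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve

# Hypothetical follow-up in months; 1 = died, 0 = censored
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: estimated survival {s:.2f}")
```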

Hazard ratios

•  There is a particular kind of regression model for survival analysis: Cox’s proportional hazards model

•  Model gives us hazard ratios
– Like the distance between 2 survival curves
–  Interpreted exactly like relative risks

•  So how would you interpret:
– HR > 1
– HR < 1
– HR = 1

III. Evaluating measurements

Evaluating a new test

•  We often want to know how well a certain test performs in detecting a condition of interest
– We may be interested in screening for a condition or diagnosing it

•  Test of interest may be
– Laboratory assay, radiological investigation

•  We want to know how test performs
–  Identifying those with disease, those without

•  We study this by comparing the new test to an established gold-standard (representing the truth)

            True pos   True neg
Test pos       A          B        A+B
Test neg       C          D        C+D
              A+C        B+D

B = False Positives

C = False Negatives

Ways of evaluating a new test

•  Sensitivity
– The proportion of people who truly have disease who are detected correctly by the test

•  Specificity
– The proportion of people who truly do not have disease who are detected correctly by the test

These are features of tests, but not actually what is needed by a clinician at bedside

•  Positive predictive value
– What proportion of people who have a positive test truly have disease

•  Negative predictive value
– What proportion of people who have a negative test truly do not have disease

These are of greater clinical interest, but are problematic in reality

•  Sensitivity = A / (A+C)
– High sensitivity means few false negatives

•  Specificity = D / (B+D)
– High specificity means few false positives

•  Positive Predictive Value = A / (A+B)
– High PPV means few false positives

•  Negative Predictive Value = D / (C+D)
– High NPV means few false negatives

Example: raised IFN-γ in detecting culture-confirmed TB in smear-negative patients

                TB cult pos   TB cult neg
Raised IFN-γ         75           125        200
Normal IFN-γ         25           175        200
                    100           300

Calculate
– Sens
– Spec
– PPV
– NPV

Interpretations: Sens, Spec

•  Sensitivity and specificity of raised IFN-γ are 75% and 58%, respectively in detecting culture-confirmed TB

– This means that 75% of patients with culture-confirmed TB will have raised IFN-γ (test pos)

– And that 58% of patients without disease will not have raised IFN-γ (test neg)

Interpretations: PPV, NPV

•  PPV and NPV of raised IFN-γ are 38% and 88%, respectively in detecting culture-confirmed TB

– This means that 38% of patients with a raised IFN-γ will truly have culture-confirmed TB

– And that 88% of patients with a low IFN-γ will truly not have TB
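The four figures quoted above follow directly from the A, B, C, D formulas. This sketch applies them to the IFN-γ table; `diagnostic_accuracy` is a helper name introduced here for illustration.

```python
def diagnostic_accuracy(a, b, c, d):
    """Test performance from a 2x2 table against a gold standard.

    a = test positive, truly positive (true positives)
    b = test positive, truly negative (false positives)
    c = test negative, truly positive (false negatives)
    d = test negative, truly negative (true negatives)
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

# Raised IFN-γ: 75 culture positive, 125 culture negative
# Normal IFN-γ: 25 culture positive, 175 culture negative
results = diagnostic_accuracy(a=75, b=125, c=25, d=175)
for name, value in results.items():
    print(f"{name}: {value:.0%}")
```

PPV and NPV depend on how common the disease is in the group tested, so the same test would give different predictive values in a population with a different TB prevalence.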

General rules

•  Higher sensitivities usually mean lower specificities (good tests are high on both)

•  High sens → few false negatives → high NPV
–  If a test has a high sens, negative result helps rule out disease (SnOUT)

•  High spec → few false positives → high PPV
–  If a test has a high spec, positive result helps rule in disease (SpIN)

Reliability

•  There are some situations in which there is no “gold-standard” to compare with

•  We then compare the reliability (repeatability) of measures

•  Example:
–  radiologists identifying lesions on scan
– pathologists identifying malignancy on biopsy
– psychiatrists making any diagnosis

•  In these situations we compare the repeatability of measures

•  Here it’s not clear who the gold-standard is → can’t really calculate sens, spec, etc.

                          Radiologist A: positive   Radiologist A: negative
Radiologist B: positive             31                        19                50
Radiologist B: negative             21                        79               100
                                    52                        98               150

Kappa

•  Instead we want to see the amount of agreement between the raters

•  Kappa = the degree of agreement of 2 raters above chance
– Measure of test-retest reliability
– Range, -1 (perfect disagreement) to 1 (perfect agreement)
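Kappa for the two radiologists can be worked out from the table above. This is a sketch assuming Cohen's kappa for two raters and a binary rating, which compares the observed agreement with the agreement expected by chance from each rater's overall positive rate; `cohen_kappa` is a helper name introduced here.

```python
def cohen_kappa(both_pos, b_pos_a_neg, b_neg_a_pos, both_neg):
    """Cohen's kappa for two raters (A and B) giving positive/negative ratings."""
    n = both_pos + b_pos_a_neg + b_neg_a_pos + both_neg
    # Observed agreement: both positive or both negative
    observed = (both_pos + both_neg) / n
    # Marginal totals for each rater
    a_pos = both_pos + b_neg_a_pos
    b_pos = both_pos + b_pos_a_neg
    a_neg = n - a_pos
    b_neg = n - b_pos
    # Agreement expected by chance alone
    expected = (a_pos * b_pos + a_neg * b_neg) / n ** 2
    return (observed - expected) / (1 - expected)

# Counts from the radiologist table: 31 both positive, 19 B positive only,
# 21 A positive only, 79 both negative
kappa = cohen_kappa(31, 19, 21, 79)
print(f"kappa = {kappa:.2f}")
```

Here the radiologists agree on 73% of scans, but about 55% agreement would be expected by chance, so kappa is only about 0.41.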
