Application of Medical statistics in clinical research
LI Jibin, MD, PhD
Department of Clinical Research, Sun Yat-sen University Cancer Center
Email: [email protected]
A figure related to statistics
Outline
Introduction
How to select proper statistical method
Statistical description
Compare group differences
Quantify associations between variables
Analysis of observational study
Introduction
• Medical statistics is an important tool for clinical research.
• Medical statistics is the science that studies the collection, sorting and analysis of medical data.
• Proper statistical methods are a precondition for the validity and reliability of the results.
• The statistical plan/methods must be described in the study protocol.
Medical statistics
• Clinical study design
o Observational study
o Experimental study
• Data analysis
o Descriptive analysis
o Statistical inference: hypothesis testing, parameter estimation
Observational study
• Exposure is not assigned by the investigator
• Types: descriptive study, analytical study
• Designs:
o Case report
o Cross-sectional study: exposure and outcome at the same time/period
o Case-control study: Outcome → Exposure
o Cohort study: Exposure → Outcome
o ……
Applications of different observational study design
Objective                                       Cross-sectional   Case-control   Cohort
Investigation of rare disease                   -                 +++++          -
Investigation of rare cause                     -                 -              +++++
Testing multiple effects of cause               ++                -              +++++
Study of multiple exposures and determinants    ++                ++++           +++
Measurements of time relationship               -                 + (a)          +++++
Direct measurement of incidence                 -                 + (b)          +++++
Investigation of long latent periods            -                 +++            -
+…+++++ indicates the general degree of suitability (there are exceptions); - not suitable
(a) if prospective
(b) if population-based
Ref: R Bonita, R Beaglehole & T Kjellstrom. Basic epidemiology, 2nd edition. World Health Organization.
Experimental study
• Exposure/intervention is assigned by the investigator
• Types: randomized controlled study, non-randomized controlled study, ……
• Intervention
• Participants
• Effects
• Control group
• Replication
• Randomization and blinding
Randomized controlled trial (RCT)
Study population → randomly allocate exposure/intervention → Treatment group / Control group
Cohort study vs. RCT
Assess for eligibility (ineligible: excluded) → Enroll → Experimental group / Control group → Follow-up for ascertainment of efficacy/safety outcomes
Allocated randomly?
• No: Cohort study
• Yes: RCT
Strength of study design: the evidence pyramid
[Figure: the evidence pyramid, with strength of evidence ranging from low to high]
Use the highest-quality study design available to answer your clinical question
Outline
Introduction
How to select proper statistical method
Statistical description
Compare group differences
Quantify associations between variables
Analysis of observational study
How to select proper statistical method?
• The four key elements: 1. study design, 2. study objectives, 3. data types, 4. data characteristics
(1) Study design
• Completely randomized design
• Paired design
• Randomized block design
• Factorial design
• Repeated-measures design
• Group sequential design
• ……
(2) Study objectives
• Descriptive analysis
Mean, SD, median, prevalence, etc.
• Parameter estimation
95% confidence interval
• Compare differences between groups
Hypothesis test
e.g. χ2 test, t test, ANOVA, Wilcoxon test, log-rank test
(2) Study objectives
• Explore associations between exposures and outcome
Multivariate analysis
e.g. correlation, regression analysis
• Select risk factors
Regression analysis
• Prediction
Regression analysis
• Causal inference
Path analysis/structural equation model (SEM)
(3) Data types
• Quantitative vs. qualitative?
• Ordinal vs. nominal?
• Binomial vs. multiple?
• Time-to-event?
• ……
The choice of statistical method depends on the data type
Data types
Numerical
• also called quantitative or continuous variable
o Number of drinks
o Height
o Weight
o BMI
o Fasting blood glucose
o ……
Categorical
• also called qualitative variable
• Nominal, results of classifying
o Gender: Male, Female
o Blood group: A, B, AB, O
• Ordinal, express ranks (type of categorical data)
o Treatment effect: cured, better, worse, dead
o Personal health status: excellent, very good, good, fair, poor
Data transformation
Numerical data → Ordinal data → Nominal data
It is not recommended
Example:
• WBC count (/mm3) for five persons:
Numerical variable:    3000    6000     5000     8000     12000
Qualitative variable:  Lower   Normal   Normal   Normal   Higher
o Multi-category data: Lower, Normal, Higher
o Binary data: Normal (3 persons); Abnormal (2 persons)
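The WBC example can be sketched in a few lines of Python. The cut-offs of 4000 and 10000 are assumptions (the slide does not state them); they are chosen only so that the resulting labels match the example.

```python
from collections import Counter

# Assumed cut-offs (not given on the slide), chosen to reproduce the example labels.
LOWER_LIMIT = 4000
UPPER_LIMIT = 10000


def three_level(count):
    """Map a numerical WBC count to an ordinal category."""
    if count < LOWER_LIMIT:
        return "Lower"
    if count > UPPER_LIMIT:
        return "Higher"
    return "Normal"


def two_level(count):
    """Collapse the ordinal category further into a binary one."""
    return "Normal" if three_level(count) == "Normal" else "Abnormal"


wbc = [3000, 6000, 5000, 8000, 12000]
ordinal = [three_level(x) for x in wbc]
binary = Counter(two_level(x) for x in wbc)

print(ordinal)
print(dict(binary))
```

Each step discards information: the binary variable can no longer tell a low count from a high one.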
(4) Data characteristics
• Assumptions of statistical methods
Normal or skewed distribution?
Homogeneity of variance?
Sample size?
Missing values?
……
Outline
Introduction
How to select proper statistical method
Descriptive statistics
Compare group differences
Quantify associations between variables
Analysis of observational study
Descriptive statistics
• Describe the basic features of data of the study
• Provide summaries about the sample and measures
Descriptive statistics
The data of treatment effect of 108 hypertensive patients
ID    Age (X1)   Gender (X2)   Group (X3)   DBP (X4)   ECG (X5)   Treatment effect (X6)
1     37         Male          A            11.27      Normal     Effective
2     48         Female        B            12.53      Normal     Effective
3     45         Male          A            10.93      Abnormal   Ineffective
…     …          …             …            …          …          …
108   58         Male          B            16.80      Abnormal   Effective
ECG: electrocardiogram; DBP: diastolic blood pressure
Rows: cases; columns: variables
Coding: Gender 1=Male, 2=Female; Group 1=A, 2=B; ECG 1=Normal, 0=Abnormal; Treatment effect 1=Effective, 0=Ineffective
Descriptive statistics
The data of treatment effect of 108 hypertensive patients (coded)
ID    Age (X1)   Gender (X2)   Group (X3)   DBP (X4)   ECG (X5)   Treatment effect (X6)
1     37         1             1            11.27      1          1
2     48         2             2            12.53      1          1
3     45         1             1            10.93      0          0
…     …          …             …            …          …          …
108   58         1             2            16.80      0          1
ECG: electrocardiogram; DBP: diastolic blood pressure
Rows: cases; columns: variables
Descriptive statistics
• Use histogram to describe the distribution of data
[Figure: histograms of a skewed distribution and an approximately normal distribution, annotated with average level and variation]
Descriptive statistics
• Quantitative variable
Indicators                      Application
Central tendency
Mean (X̄)                        Symmetric distribution: normal or approximately normal distribution
Median (M)                      Asymmetric distribution: skewed distribution
Geometric mean (G)              Lognormal distribution
Dispersion
Standard deviation (SD)         Symmetric distribution: normal or approximately normal distribution
Inter-quartile range (IQR)      Asymmetric distribution: skewed distribution
Coefficient of variation (CV)   To compare the variation of several variables
[Figure: a skewed distribution summarized by median and P25, P75; an approximately normal distribution summarized by mean and standard deviation]
• Normal or approximately normal distribution: mean and standard deviation
• Skewed distribution: median and P25, P75 (IQR)
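A minimal sketch of this rule in Python, using only the standard library. Both samples are hypothetical, and the `skewed` flag is set by hand here; in practice it would come from inspecting a histogram or a normality check.

```python
import statistics


def describe(values, skewed):
    """Summarize a quantitative variable according to its distribution shape."""
    if skewed:
        # Median with P25 and P75 for skewed data.
        p25, _, p75 = statistics.quantiles(values, n=4)
        return {"median": statistics.median(values), "p25": p25, "p75": p75}
    # Mean with standard deviation for (approximately) normal data.
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


right_skewed = [2, 3, 3, 4, 5, 6, 40]       # hypothetical, one extreme value
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]     # hypothetical, roughly symmetric

print(describe(right_skewed, skewed=True))
print(describe(symmetric, skewed=False))
print("mean of skewed sample:", statistics.mean(right_skewed))
```

In the skewed sample the single extreme value pulls the mean well above the median, which is why the median is the better summary there.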
Descriptive statistics
• Qualitative variable
Frequency
Distribution of frequency (%)
……
Treatment effect of 108 patients with hypertension
Treatment effect   Frequency   Percentage (%)
Cured              46          42.6
Better             29          26.9
Good               18          16.7
No effect          15          13.9
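The percentage column follows directly from the frequencies. A short sketch that recomputes it from the counts in the table:

```python
# Frequencies taken from the table above.
counts = {"Cured": 46, "Better": 29, "Good": 18, "No effect": 15}

total = sum(counts.values())
percentages = {effect: round(100 * n / total, 1) for effect, n in counts.items()}

for effect, n in counts.items():
    print(f"{effect:<10} {n:>3} {percentages[effect]:>5.1f}")
print(f"{'Total':<10} {total:>3}")
```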
Outline
Introduction
How to select proper statistical method
Descriptive statistics
Compare group differences
Quantify associations between variables
Analysis of observational study
Comparison between groups—Hypothesis test
• Based on data types
Numerical data
• t test, ANOVA/Kruskal-Wallis H test, Z test, Wilcoxon test, etc.
Nominal data
• Chi-square test, Z test
Ordinal data
• Wilcoxon test, etc.
Time-to-event data
• Kaplan-Meier curves
• Log-rank test
Example 1:
BMI among three groups of 30 patients
Groups   n    Mean (SD)
A        10   24.6 (5.61)
B        10   36.2 (4.39)
C        10   29.4 (6.11)
• Three independent groups
• Numerical data
• Proper method: one-way ANOVA
• Incorrect method: t test
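Because the table gives n, mean and SD for each group, the one-way ANOVA F statistic can be computed from these summaries alone. The sketch below does that; the function name is illustrative. It stops at F and its degrees of freedom, since the p-value needs the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_from_summary(groups):
    """groups: list of (n, mean, sd). Returns (F, df_between, df_within)."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-group sum of squares, recovered from each group's SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (n, mean, SD) for groups A, B and C from the table.
bmi = [(10, 24.6, 5.61), (10, 36.2, 4.39), (10, 29.4, 6.11)]
f_stat, df1, df2 = one_way_anova_from_summary(bmi)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A single F test compares all three groups at once. Running three pairwise t tests instead would inflate the overall type I error rate.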
Example 2:
Diastolic blood pressure (mmHg) before and after treatment of hypertensive patients
      Treatment group              Control group
ID    Before   After         ID    Before   After
1     130      114           11    118      124
2     124      110           12    132      122
3     136      126           13    134      132
4     128      116           14    114      96
5     122      102           15    118      124
6     118      100           16    128      118
7     116      98            17    118      116
8     138      122           18    132      122
9     126      108           19    120      124
10    124      106           20    134      128
• Paired design
• Numerical data
• Proper method: two-independent-samples t test on the before-after differences in DBP
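A sketch of this approach with the data from the table: compute each patient's before-after difference, then compare the two sets of differences with a pooled-variance two-sample t statistic. Only the statistic and its degrees of freedom are computed; the p-value requires the t distribution, which is left to a statistical package.

```python
import math
import statistics

# (before, after) pairs from the table.
treatment = [(130, 114), (124, 110), (136, 126), (128, 116), (122, 102),
             (118, 100), (116, 98), (138, 122), (126, 108), (124, 106)]
control = [(118, 124), (132, 122), (134, 132), (114, 96), (118, 124),
           (128, 118), (118, 116), (132, 122), (120, 124), (134, 128)]


def differences(pairs):
    """Reduction in blood pressure (before minus after) for each patient."""
    return [before - after for before, after in pairs]


def pooled_t(x, y):
    """Two-independent-samples t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se, nx + ny - 2


d_treatment = differences(treatment)
d_control = differences(control)
t_stat, df = pooled_t(d_treatment, d_control)

print("mean reduction, treatment:", statistics.mean(d_treatment))
print("mean reduction, control:", statistics.mean(d_control))
print(f"t = {t_stat:.2f}, df = {df}")
```

The two sets of differences have visibly different variances, so an unequal-variance (Welch) test may be worth considering; with equal group sizes it yields the same t statistic and differs only in the degrees of freedom.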
Example 3:
• Two groups of patients received two different drug treatments. Is the effect in the treatment group better than in the control group?
Treatment effects by group
Group       n    Effective   Markedly effective   Ineffective
Treatment   42   28          10                   4
Control     40   19          9                    12
× χ2 test: χ2=5.73, p=0.057 (incorrect method)
• Two independent groups
• Ordinal data
• The proper method: Wilcoxon test
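The contrast between the two methods can be sketched with the counts from the table. The code first reproduces the χ2 result quoted on the slide, then computes a Wilcoxon rank-sum statistic that uses the ordering of the categories. Two assumptions are mine, not the slide's: the categories rank as ineffective < effective < markedly effective, and the rank-sum test uses the tie-corrected normal approximation without continuity correction.

```python
import math

# Counts reordered from worst to best outcome (assumed ordering).
labels = ["Ineffective", "Effective", "Markedly effective"]
treatment = [4, 28, 10]
control = [12, 19, 9]


def chi_square(rows):
    """Pearson chi-square statistic for a contingency table."""
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def rank_sum(group1, group2):
    """Wilcoxon rank-sum W for group1 and its z score, from ordered category counts."""
    n1, n2 = sum(group1), sum(group2)
    n = n1 + n2
    w, tie_term, ranked_so_far = 0.0, 0.0, 0
    for c1, c2 in zip(group1, group2):
        t = c1 + c2                                # size of this tied block
        midrank = ranked_so_far + (t + 1) / 2      # average rank within the block
        w += c1 * midrank
        tie_term += t ** 3 - t
        ranked_so_far += t
    expected = n1 * (n + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    return w, (w - expected) / math.sqrt(variance)


chi2 = chi_square([treatment, control])
p_chi2 = math.exp(-chi2 / 2)                   # exact upper tail for 2 degrees of freedom
w, z = rank_sum(treatment, control)
p_rank = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail

print(f"chi-square = {chi2:.2f}, p = {p_chi2:.3f}")
print(f"rank sum W = {w:.0f}, z = {z:.2f}, p = {p_rank:.3f}")
```

The χ2 statistic is unchanged if the columns are shuffled, so it ignores the ordering that the rank-based test uses.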
Example 4:
ALT (U/L) at different time points after drug treatment among HCV patients
ID    Before   12 weeks   24 weeks   36 weeks
1     160      105        147        135
2     415      371        258        182
3     327      94         36         51
4     174      113        63         50
5     201      26         55         20
6     289      20         17         21
7     85       44         56         62
8     176      165        136        83
9     76       215        34         81
10    75       94         51         59
• Repeated measurement design
• Numerical data
• The proper method: repeated-measures ANOVA
Outline
Introduction
How to select proper statistical method
Statistical description
Compare group differences
Quantify associations between variables
Analysis of observational study
Quantify associations between variables
• Clinical research commonly needs to estimate the relationship between exposures and the target disease and to explore its risk factors, which involves multivariate analysis.
Objectives                                                              Multivariate methods
Association between two quantitative variables                          Simple linear regression
Correlation between two quantitative variables                          Pearson or Spearman correlation
Correlation between two ordinal variables                               Spearman correlation
Association between multiple variables and one quantitative variable    Multiple linear regression
Association between multiple variables and one categorical variable     Logistic regression
Association between multiple variables and one time-to-event outcome    Survival analysis: Cox proportional hazard regression, K-M curves
Linear regression
• Models the relationship between one numerical outcome Y and one or more explanatory variables, denoted x
• Y is numerical and normally distributed
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$
Assumptions (LINE):
• Linearity
• Independence of errors
• Normal distribution
• Equal variance
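For the one-predictor case, the least-squares estimates have a closed form. A minimal sketch follows; the age and blood pressure values are hypothetical and serve only to show the mechanics. Residuals are returned because the LINE assumptions are usually checked on them.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values, used to check the model assumptions."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: age (years) and systolic blood pressure (mmHg).
age = [35, 42, 50, 58, 63, 70]
sbp = [118, 124, 130, 139, 141, 150]

b0, b1 = fit_line(age, sbp)
res = residuals(age, sbp, b0, b1)
print(f"SBP = {b0:.1f} + {b1:.2f} * age")
```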
Logistic regression
• An important model for categorical response (Y) data
Binary: 0 and 1, e.g. death vs. survival, normal vs. abnormal
Nominal or ordinal with ≥ 3 levels, e.g. cured, better, bad
• Predictor variables (xi) can take any form: binary, categorical or continuous
$\mathrm{Logit}(P) = \ln\frac{P}{1-P} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
Main output: odds ratio (OR)
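The odds ratio for a predictor is the exponential of its logistic regression coefficient, and a 95% confidence interval follows by exponentiating the interval for the coefficient. A small sketch, in which the coefficient and standard error are hypothetical values standing in for fitted model output:

```python
import math


def odds_ratio(beta, se, z=1.96):
    """Odds ratio and 95% confidence interval from a logistic coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical fitted coefficient and standard error for one predictor.
beta, se = 0.693, 0.25
or_value, lower, upper = odds_ratio(beta, se)
print(f"OR = {or_value:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```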
Cox proportional hazard regression
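• An important model for survival data analysis
• Y is a time-to-event variable with censored data
$\ln\frac{h(t, X)}{h_0(t)} = \beta_1 x_1 + \cdots + \beta_p x_p$
• Proportional hazards (PH) assumption
Main output: hazard ratio (HR), interpreted as a relative risk (RR)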
Commonly used regression models based on different types of outcomes
Types of outcome        Applicable regression model
Continuous outcome      Linear regression model
Categorical outcome     Logistic regression model
Time-to-event outcome   Cox proportional hazard model, Kaplan-Meier curves
Time series             Time series analysis
ID Age Grade Size Relapse Start End Time Status
1 62 1 0 0 02/10/1996 12/30/2000 59 0
2 64 1 0 0 03/05/1996 08/12/2000 54 1
3 52 2 0 1 04/09/1996 12/03/1999 44 0
4 60 1 0 0 06/06/1996 10/27/2000 53 0
… … … … … … … … …
30 54 3 1 1 03/10/2000 09/20/2000 6 1
Survival data of 30 patients with bladder carcinoma
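To show how censoring enters a Kaplan-Meier estimate, the sketch below applies the product-limit formula to the five rows visible in the table, not to the full data of 30 patients. It assumes that Status 1 means the event occurred and Status 0 means censored, and that Time is in months; the slide states neither.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_time = [e for tt, e in data if tt == t]
        n_events = sum(same_time)
        if n_events:
            # Survival drops only at times where an event occurs.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Events and censored cases both leave the risk set.
        at_risk -= len(same_time)
        i += len(same_time)
    return curve


# Time and Status for the five displayed patients (IDs 1, 2, 3, 4 and 30).
time = [59, 54, 44, 53, 6]
status = [0, 1, 0, 0, 1]   # assumed coding: 1 = event, 0 = censored

curve = kaplan_meier(time, status)
for t, s in curve:
    print(f"t = {t:>2}: S(t) = {s:.2f}")
```

Censored patients never lower the curve themselves, but they shrink the risk set, so later events produce larger drops.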
Outline
Introduction
How to select proper statistical method
Descriptive statistics
Compare group differences
Quantify associations between variables
Analysis of observational study
Analysis of observational studies
Cross-sectional study
Case-control study
Cohort study
Cross-sectional study
• Sample: a sample of everyone in a population, regardless of
exposure or outcome status
• Design: In each individual, determine the exposure and disease
status at the same time (or period)
• “Snap shot”; No follow-up data
• Examples:
Prevalence surveys (How common is kidney disease in a population?)
Etiology (Is hypertension associated with prevalent kidney disease?)
Cross-sectional study
• Study design:
Begin with: Defined population
Then: Collect data on exposure and disease
o Exposed, with disease
o Exposed, no disease
o Not exposed, with disease
o Not exposed, no disease
Statistical analysis
• Descriptive analysis: describe characteristics of the population
Numerical: Mean (SD), Median (range, IQR)
Categorical: Prevalence
• Comparison between groups
Numerical data: t test, ANOVA, Wilcoxon test
Categorical data: chi-square test, Fisher’s exact test, Wilcoxon test
Statistical analysis
• Associations/risk factors
Binary outcome: Logistic regression
Numerical outcome: Linear regression?
Strengths and limitations
• Strengths:
Useful for public health surveys
Useful for public policy (allocation of resources)
Good initial step in evaluating associations
Cost-effective use of resources
• Limitations:
Temporal relationship not defined
• Causation cannot be determined
• Survivor bias
Cannot evaluate prognosis
Cannot evaluate treatment effects
For example
Statistical method
Descriptive analysis
Prevalence of diabetes
Risk factors of diabetes and prediabetes
Case-control study
Case-control study
• Essentially a comparison of cases with a control group
• Design:
o Clearly define cases (patients with the disease)
o Clearly define controls (patients without the disease)
o Collect data regarding exposure (risk factors or predictors)
• Examples:
Lung cancer and smoking
Heart attack and mercury exposure
Case-control study
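[Diagram: from the population, cases (people with disease) and controls (people without disease) are each classified as exposed or unexposed: cases exposed (a), cases unexposed (b), controls exposed (c), controls unexposed (d). Arrows show TIME and the direction of inquiry.]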
Odds Ratio (OR)
• OR is the odds of exposure given disease divided by the odds of exposure given no disease.
• Remember that the odds ratio of exposure (cases vs. controls) is the same as the odds ratio of disease (exposed vs. unexposed).
           Exposed   Unexposed   Total
Cases      a         b           a+b
Controls   c         d           c+d
Total      a+c       b+d         a+b+c+d
$\mathrm{OR} = \frac{a/b}{c/d} = \frac{a/c}{b/d} = \frac{ad}{bc}$
Odds ratio
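A sketch of the calculation for a 2×2 table laid out as above (cases a, b; controls c, d). The counts are hypothetical, and the confidence interval uses the common log-based (Woolf) approximation, which the slides do not mention.

```python
import math


def odds_ratio_2x2(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with an approximate 95% CI on the log scale."""
    or_value = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log)
    upper = math.exp(math.log(or_value) + z * se_log)
    return or_value, lower, upper


# Hypothetical counts: a, b = exposed/unexposed cases; c, d = exposed/unexposed controls.
a, b, c, d = 60, 40, 30, 70
or_value, lower, upper = odds_ratio_2x2(a, b, c, d)
print(f"OR = {or_value:.2f} (95% CI {lower:.2f} to {upper:.2f})")

# The exposure odds ratio and the disease odds ratio are the same quantity.
exposure_or = (a / b) / (c / d)
disease_or = (a / c) / (b / d)
```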
Analytic Strategy
• Descriptive analysis
Numerical: Mean (SD), Median (range, IQR)
Categorical: Frequency/prevalence
• Stratified analysis
Calculate stratum-specific ORs for exposure-outcome relationship
Determine presence of confounding and interaction
Analytic Strategy
• Logistic regression analysis
Adjusted OR, by adjusting for confounding and interaction
Special logistic model applied in matched studies
Strengths and limitations
• Strengths:
Lower cost than cohort studies.
Useful for studying uncommon diseases.
• Limitations:
Very susceptible to bias.
Cannot evaluate prevalence, incidence or prognosis.
Can only provide odds ratios, not relative risk (although OR is a good measure of association).
Statistical methods
Several important features
• The study provides an efficient means to study rare diseases. Case-control studies tend to be more feasible than other studies.
• Case-control studies allow researchers to investigate several risk factors.
• A single case-control investigation does not “prove” causality, but it can provide suggestive evidence of a causal relationship that warrants intervention by public health officials to reduce exposure to the implicated risk factor.
Cohort study
Cohort study
• Cohort=Prospective=Longitudinal
• Clearly defined cohort (group, sample) of persons at risk followed through time
• Data regarding exposures (risk factors, predictors) collected prior to data on outcomes (endpoints)
• Protocol developed prior to data collection; research-grade data are collected for the purpose of testing hypotheses
Cohort study
[Diagram: Population → People without the disease → Exposed / Not exposed (not randomized) → each group followed over TIME to Disease / No disease. The direction of inquiry runs from exposure to outcome.]
Relative risk (RR)
RR = incidence in exposed / incidence in non-exposed
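A short sketch of this ratio from cohort counts. The numbers and the function name are hypothetical, chosen for illustration only.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Cumulative incidence in each group and their ratio (RR)."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed, incidence_unexposed, incidence_exposed / incidence_unexposed


# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the disease.
inc_exposed, inc_unexposed, rr = relative_risk(30, 100, 10, 100)
print(f"incidence exposed = {inc_exposed:.2f}, unexposed = {inc_unexposed:.2f}, RR = {rr:.1f}")
```

A cohort study can estimate RR because it follows people from exposure to outcome and therefore measures incidence directly.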
Statistical analysis
• Descriptive analysis
Numerical: Mean (SD), Median (range, IQR)
Categorical: Frequency/prevalence
• Comparison between groups
• Regression analysis
Adjust confounders
Select risk factors
Statistical analysis
• Incidence rate/cumulative incidence
• Incidence rate ratio
Poisson regression
• Survival analysis (Time-to-event data)
KM
Log-rank test
Cox regression
Strengths and limitations
• Strengths
To estimate temporal relationships between exposures and outcomes
To estimate incidence of outcome after exposure
Stronger external validity than RCTs (i.e. more representative of the general population)
• Limitations
Long and costly
Confounding
• Residual confounding
• Confounding by indication (i.e. very limited in studying treatment effects)
Bias: loss to follow-up
For example
Important points to remember
• Association ≠ Causation
• Statistical significance ≠ Clinical/practical significance
• Multiple factors contribute to whether your results are
significant