application of medical statistics in clinical research chp... · li jibin, md, phd department of...

Application of Medical statistics in clinical research

LI Jibin, MD, PhD
Department of Clinical Research, Sun Yat-sen University Cancer Center
Email: [email protected]

A figure related to statistics


Outline


Introduction

How to select proper statistical method

Statistical description

Compare group differences

Quantify associations between variables

Analysis of observational study

Introduction

• Medical statistics is an important tool for clinical research.

• Medical statistics is the science of collecting, organizing, and analyzing medical data.

• Proper statistical methods are a precondition for valid and reliable results.

• The statistical plan and methods should be described in the study protocol.

Medical statistics

• Clinical study design
o Observational study
o Experimental study

• Data analysis
o Descriptive analysis
o Statistical inference: hypothesis testing and parameter estimates

Observational study

• Exposure is not assigned by investigator


Observational study (descriptive study or analytical study)

• Case report
• Cross-sectional study: exposure and outcome at the same time/period
• Case-control study: outcome → exposure
• Cohort study: exposure → outcome
• ……

Applications of different observational study design

Objective                                      Cross-sectional   Case-control   Cohort
Investigation of rare disease                  -                 +++++          -
Investigation of rare cause                    -                 -              +++++
Testing multiple effects of cause              ++                -              +++++
Study of multiple exposures and determinants   ++                ++++           +++
Measurements of time relationship              -                 + (a)          +++++
Direct measurement of incidence                -                 + (b)          +++++
Investigation of long latent periods           -                 +++            -

+…+++++ indicates the general degree of suitability (there are exceptions); - not suitable
(a) if prospective
(b) if population-based

Ref: R Bonita, R Beaglehole & T Kjellström. Basic Epidemiology, 2nd edition. World Health Organization.

Experimental study


• Exposure/Intervention method is assigned by investigator

Experimental study

Randomized controlled Study

Non‐randomized controlled study

……

• Intervention
• Participants
• Effects

• Control group
• Replication
• Randomization and blinding

Randomized controlled trial (RCT)


Study population

Treatment group Control group

Randomly allocate exposure/intervention
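Random allocation can be illustrated with a short sketch. This is only one possible scheme (simple 1:1 randomization by shuffling); the function name `randomize`, the seed, and the participant IDs are hypothetical, and a real trial would normally use a pre-specified, concealed allocation procedure.

```python
import random


def randomize(participant_ids, seed=2024):
    """Simple 1:1 randomization: shuffle the IDs, then split them into two arms."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical study population of 20 enrolled participants
allocation = randomize(range(1, 21))
print("Treatment group:", allocation["treatment"])
print("Control group:  ", allocation["control"])
```

Because allocation depends only on the random shuffle, neither the investigator nor the participant chooses the exposure, which is the feature that separates an RCT from a cohort study.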

Cohort study vs. RCT

Assess for eligibility (ineligible participants are excluded)
→ Enroll
→ Allocated randomly?
   No: cohort study
   Yes: RCT
→ Experimental group vs. control group
→ Follow-up for ascertainment of efficacy/safety outcomes

Strength of study design: the evidence pyramid

Low

High

Use the highest-quality study design available to answer your clinical question







How to select proper statistical method?

• The four key elements: (1) study design, (2) study objectives, (3) data types, (4) data characteristics

(1) Study design

• Completely randomized design

• Paired design

• Randomized block design

• Factorial design

• Repeated measures design

• Group sequential design

• ……

(2) Study objectives

• Descriptive analysis

Mean, SD, median, prevalence, etc.

• Parameter estimation

95% confidence interval

• Compare differences between groups

Hypothesis test, e.g. χ² test, t test, ANOVA, Wilcoxon test, log-rank test
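As a small illustration of parameter estimation, the sketch below computes a proportion and its 95% confidence interval using the normal (Wald) approximation. The counts (46 cured out of 108 patients) are taken from the treatment-effect table in this deck; the function name `proportion_ci` is hypothetical, and the Wald interval is an assumption that works best for large samples and proportions away from 0 and 1.

```python
import math
from statistics import NormalDist


def proportion_ci(events, n, level=0.95):
    """Point estimate and Wald confidence interval for a proportion."""
    p = events / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# 46 of 108 hypertensive patients were classified as cured
p, lower, upper = proportion_ci(46, 108)
print(f"Proportion cured: {p:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```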

(2) Study objectives (continued)

• Explore associations between exposures and outcome

Multivariate analysis, e.g. correlation, regression analysis

• Select risk factors

Regression analysis

• Prediction

Regression analysis

• Causal inference

Path analysis/structural equation model (SEM)

(3) Data types

• Quantitative vs. qualitative?

• Ordinal vs. nominal?

• Binomial vs. multiple?

• Time-to-event?

• ……

The choice of statistical method depends on the data type

Data types

Numerical

• Also called quantitative or continuous variable
o Number of drinks
o Height
o Weight
o BMI
o Fasting blood glucose
o ……

Categorical

• Also called qualitative variable

• Nominal: results of classifying
o Gender: male, female
o Blood group: A, B, AB, O

• Ordinal: expresses ranks (a type of categorical data)
o Treatment effect: cured, better, worse, dead
o Personal health status: excellent, very good, good, fair, poor

Data transformation

Numerical data

Ordinal data

Nominal data

It is not recommended

Example:

• WBC count (/mm³) for five persons:

Numerical variable:     3000    6000     5000     8000     12000
Qualitative variable:   Lower   Normal   Normal   Normal   Higher

o Multiple categorical data: lower, normal, higher

o Binary data: normal (3 persons); abnormal (2 persons)
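The transformation in this example can be sketched in a few lines. The cut-offs of 4000 and 10000 per mm³ are an assumed reference range (the slide does not state one), and the function names are hypothetical.

```python
from collections import Counter

# Assumed reference range for WBC count (per mm^3); not stated on the slide
LOWER_LIMIT, UPPER_LIMIT = 4000, 10000


def to_ordinal(wbc):
    """Numerical value -> ordinal category (Lower / Normal / Higher)."""
    if wbc < LOWER_LIMIT:
        return "Lower"
    if wbc > UPPER_LIMIT:
        return "Higher"
    return "Normal"


def to_binary(wbc):
    """Numerical value -> binary category (Normal / Abnormal)."""
    return "Normal" if to_ordinal(wbc) == "Normal" else "Abnormal"


wbc_counts = [3000, 6000, 5000, 8000, 12000]
ordinal = [to_ordinal(v) for v in wbc_counts]
binary = Counter(to_binary(v) for v in wbc_counts)
print(ordinal)
print(dict(binary))
```

Each step discards information: the binary variable can no longer distinguish a low count from a high one, so the transformation cannot be reversed.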

(4) Data characteristics

• Assumptions of statistical methods

Normal or skewed distribution?

Homogeneity of variance?

Sample size?

Missing values?

……




Descriptive statistics





• Describe the basic features of data of the study

• Provide summaries about the sample and measures



The data of treatment effect of 108 hypertensive patients

ID    Age (X1)   Gender (X2)   Group (X3)   DBP (X4)   ECG (X5)   Treatment effect (X6)
1     37         Male          A            11.27      Normal     Effective
2     48         Female        B            12.53      Normal     Effective
3     45         Male          A            10.93      Abnormal   Ineffective
…     …          …             …            …          …          …
108   58         Male          B            16.80      Abnormal   Effective

ECG: electrocardiogram; DBP: diastolic blood pressure

Rows: cases; columns: variables

Coding: Gender 1=Male, 2=Female; Group 1=A, 2=B; ECG 1=Normal, 0=Abnormal; Treatment effect 1=Effective, 0=Ineffective


The data of treatment effect of 108 hypertensive patients

ID    Age (X1)   Gender (X2)   Group (X3)   DBP (X4)   ECG (X5)   Treatment effect (X6)
1     37         1             1            11.27      1          1
2     48         2             2            12.53      1          1
3     45         1             1            10.93      0          0
…     …          …             …            …          …          …
108   58         1             2            16.80      0          1

ECG: electrocardiogram; DBP: diastolic blood pressure

Rows: cases; columns: variables


• Use histogram to describe the distribution of data

[Figure: histograms of a skewed distribution and an approximately normal distribution, each annotated with its average level and variation]


• Quantitative variable

Indicators                        Application

Central tendency
Mean (X̄)                          Symmetric distribution: normal or approximately normal distribution
Median (M)                        Asymmetric distribution: skewed distribution
Geometric mean (G)                Lognormal distribution

Dispersion
Standard deviation (SD)           Symmetric distribution: normal or approximately normal distribution
Interquartile range (IQR)         Asymmetric distribution: skewed distribution
Coefficient of variation (CV)     To compare the variation of several variables

[Figure: a skewed distribution summarized by the median and P25, P75; an approximately normal distribution summarized by the mean and standard deviation]

• Normal or approximately normal distribution: mean and standard deviation

• Skewed distribution: median and P25, P75 (IQR)
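The difference between the two sets of summaries shows up clearly on a small right-skewed sample. The values below are hypothetical, and `summarize` is an illustrative helper built on Python's `statistics` module (its default quartile method is one of several conventions).

```python
import statistics


def summarize(values):
    """Return mean/SD and median/P25/P75 for a list of numbers."""
    p25, _, p75 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "p25": p25,
        "p75": p75,
    }


# Hypothetical right-skewed measurements: a few large values pull the mean upward
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
summary = summarize(skewed)
print(f"Mean (SD): {summary['mean']:.1f} ({summary['sd']:.1f})")
print(f"Median (P25, P75): {summary['median']} ({summary['p25']}, {summary['p75']})")
```

In this sample the mean lies well above the median, which is why the median and P25, P75 describe a skewed variable better than the mean and SD.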


• Qualitative variable

Frequency

Distribution of frequency (%)

……

Treatment effect of 108 patients with hypertension

Treatment effect   Frequency   Percentage (%)
Cured              46          42.6
Better             29          26.9
Good               18          16.7
No effect          15          13.9
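The percentage column follows directly from the frequencies, as this minimal sketch shows (variable names are illustrative):

```python
# Frequencies from the treatment-effect table of 108 hypertensive patients
counts = {"Cured": 46, "Better": 29, "Good": 18, "No effect": 15}

total = sum(counts.values())
percentages = {effect: round(100 * n / total, 1) for effect, n in counts.items()}

for effect, n in counts.items():
    print(f"{effect:<10} {n:>4} {percentages[effect]:>6.1f}")
```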







Comparison between groups—Hypothesis test

• Based on data types

Numerical data

• t test, ANOVA/Kruskal-Wallis H test, Z test, Wilcoxon test, etc.

Nominal data

• Chi-square test, Z test

Ordinal data

• Wilcoxon test, etc.

Time-to-event data

• Kaplan-Meier curves

• Log-rank test

Example 1:

BMI among three groups of 30 patients

Groups   n    Mean (SD)
A        10   24.6 (5.61)
B        10   36.2 (4.39)
C        10   29.4 (6.11)

• Three independent groups

• Numerical data

• Proper method: one-way ANOVA

• Incorrect method: t test (repeated pairwise t tests inflate the type I error rate)
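For Example 1, the one-way ANOVA F statistic can be recovered from the summary table alone. The sketch below is illustrative (the function name is hypothetical) and assumes the SDs are sample standard deviations; it returns the F statistic and its degrees of freedom but not the p-value, which needs an F distribution table or a statistics package.

```python
# (n, mean, SD) for each group, taken from the BMI table
groups = {"A": (10, 24.6, 5.61), "B": (10, 36.2, 4.39), "C": (10, 29.4, 6.11)}


def anova_from_summary(groups):
    """One-way ANOVA F statistic computed from group sizes, means, and SDs."""
    n_total = sum(n for n, _, _ in groups.values())
    k = len(groups)
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = anova_from_summary(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The resulting F can be compared with the tabulated critical value for (2, 27) degrees of freedom, which is about 3.35 at the 0.05 level.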

Example 2:

Diastolic blood pressure (mmHg) before and after treatment of hypertensive patients

Treatment group            Control group
ID   Before   After        ID   Before   After
1    130      114          11   118      124
2    124      110          12   132      122
3    136      126          13   134      132
4    128      116          14   114      96
5    122      102          15   118      124
6    118      100          16   128      118
7    116      98           17   118      116
8    138      122          18   132      122
9    126      108          19   120      124
10   124      106          20   134      128

• Paired design (before and after measurements on each patient)

• Numerical data

• Proper method: two independent-sample t test on the before-after differences in DBP
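The approach in Example 2 can be sketched as follows: compute each patient's before-after difference, then compare the differences between the two groups with a pooled-variance t statistic. The data are those of the table; the helper names are hypothetical, and the equal-variance form is an assumption (with equal group sizes the Welch statistic takes the same value, only its degrees of freedom differ).

```python
import math
import statistics

treatment_before = [130, 124, 136, 128, 122, 118, 116, 138, 126, 124]
treatment_after = [114, 110, 126, 116, 102, 100, 98, 122, 108, 106]
control_before = [118, 132, 134, 114, 118, 128, 118, 132, 120, 134]
control_after = [124, 122, 132, 96, 124, 118, 116, 122, 124, 128]


def differences(before, after):
    """Reduction in blood pressure for each patient (before minus after)."""
    return [b - a for b, a in zip(before, after)]


def pooled_t(x, y):
    """Two independent-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se, nx + ny - 2


d_treatment = differences(treatment_before, treatment_after)
d_control = differences(control_before, control_after)
t_stat, df = pooled_t(d_treatment, d_control)
print(f"Mean reduction: treatment {statistics.mean(d_treatment)}, "
      f"control {statistics.mean(d_control)}")
print(f"t = {t_stat:.2f}, df = {df}")
```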

Example 3:

• Two groups of patients each received a different drug treatment. Is the treatment effect better in the treatment group than in the control group?

Group       n    Effective   Markedly effective   Ineffective
Treatment   42   28          10                   4
Control     40   19          9                    12

× χ² test: χ² = 5.73, p = 0.057 (not appropriate here: it ignores the ordering of the categories)

• Two independent groups

• Ordinal data

• The proper method: Wilcoxon test
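The contrast in Example 3 can be made concrete with a sketch that first reproduces the χ² statistic quoted on the slide and then computes a Wilcoxon rank-sum z statistic (midranks, tie correction, normal approximation). The ordering ineffective < effective < markedly effective is an assumption about the outcome scale, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

# Counts ordered from worst to best outcome (assumed ordering):
# ineffective, effective, markedly effective
treatment = [4, 28, 10]
control = [12, 19, 9]


def chi_square(row1, row2):
    """Pearson chi-square statistic for a 2 x k table (ignores category order)."""
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    stat = 0.0
    for o1, o2 in zip(row1, row2):
        column = o1 + o2
        e1, e2 = column * n1 / total, column * n2 / total
        stat += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return stat, len(row1) - 1


def rank_sum_z(row1, row2):
    """Wilcoxon rank-sum z statistic for grouped ordinal data."""
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    rank_sum, ranks_used, tie_term = 0.0, 0, 0
    for o1, o2 in zip(row1, row2):
        ties = o1 + o2
        midrank = ranks_used + (ties + 1) / 2  # average rank within the category
        rank_sum += o1 * midrank
        tie_term += ties ** 3 - ties
        ranks_used += ties
    expected = n1 * (total + 1) / 2
    variance = n1 * n2 / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    return (rank_sum - expected) / math.sqrt(variance)


chi2, df = chi_square(treatment, control)
p_chi2 = math.exp(-chi2 / 2)  # closed-form chi-square tail, valid for df = 2 only
z = rank_sum_z(treatment, control)
p_rank = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"Chi-square = {chi2:.2f}, df = {df}, p = {p_chi2:.3f}")
print(f"Rank-sum z = {z:.2f}, two-sided p = {p_rank:.3f}")
```

The two tests answer different questions: the χ² test asks whether the distributions differ in any way, while the rank-sum test asks whether one group tends to have better outcomes than the other.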

Example 4:

ALT (U/L) at different time points after drug treatment among HCV patients

ID   Before   12 weeks   24 weeks   36 weeks
1    160      105        147        135
2    415      371        258        182
3    327      94         36         51
4    174      113        63         50
5    201      26         55         20
6    289      20         17         21
7    85       44         56         62
8    176      165        136        83
9    76       215        34         81
10   75       94         51         59

• Repeated measurement design

• Numerical data

• The proper method: repeated measures ANOVA








• Clinical research commonly needs to estimate the relationship between exposures and a target disease and to explore its risk factors, which requires multivariate analysis.

Objectives                                                             Multivariate methods
Association between two quantitative variables                         Simple linear regression
Correlation between two quantitative variables                         Pearson or Spearman correlation
Correlation between two ordinal variables                              Spearman correlation
Association between multiple variables and one quantitative variable   Multiple linear regression
Association between multiple variables and one categorical variable    Logistic regression
Association between multiple variables and one time-to-event outcome   Survival analysis: Cox proportional hazards regression, K-M curves

Linear regression

• Models the relationship between one numerical outcome Y and one or more explanatory variables, denoted as x variables

• Y is numerical and normally distributed

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \varepsilon$

Assumptions (LINE):

• Linear relationship

• Independence of errors

• Normal distribution

• Equal variance
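For a single explanatory variable, the least-squares estimates have a closed form, sketched below. The age and blood pressure values are hypothetical, and `fit_line` is an illustrative helper rather than any particular package's API.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, residuals


# Hypothetical data: age (years) and systolic blood pressure (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 135, 142]

intercept, slope, residuals = fit_line(age, sbp)
print(f"SBP = {intercept:.1f} + {slope:.2f} * age")
print("Residuals:", [round(r, 1) for r in residuals])
```

The residuals are what the LINE assumptions refer to: they should show no trend, be independent, be roughly normal, and have similar spread across the range of x.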

Logistic regression

• An important model for categorical response (Y) data

o Binary: 0 and 1, e.g. death vs. survival, normal vs. abnormal

o Nominal or ordinal with ≥ 3 levels, e.g. cured, better, bad

• Predictor variables ($x_i$) can take any form: binary, categorical, or continuous

$\operatorname{logit}(P) = \ln\frac{P}{1-P} = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$

Main output: odds ratio (OR)
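The link between a logistic regression coefficient and the odds ratio can be checked numerically. In this sketch the coefficients are hypothetical and x is a single binary exposure; the OR equals the exponential of the coefficient.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def inverse_logit(value):
    """Probability corresponding to a log-odds value."""
    return 1 / (1 + math.exp(-value))


def odds(p):
    return p / (1 - p)


# Hypothetical fitted model: logit(P) = beta0 + beta1 * x, with x = 1 if exposed
beta0, beta1 = -2.0, 0.7

p_unexposed = inverse_logit(beta0)
p_exposed = inverse_logit(beta0 + beta1)
odds_ratio = math.exp(beta1)

print(f"P(outcome | unexposed) = {p_unexposed:.3f}")
print(f"P(outcome | exposed)   = {p_exposed:.3f}")
print(f"OR = exp(beta1) = {odds_ratio:.2f}")
```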

Cox proportional hazard regression

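• An important model for survival data analysis.

• Y is a time-to-event variable with censored data.

$\ln\frac{h(t, X)}{h_0(t)} = \beta_1 x_1 + \cdots + \beta_m x_m$

Assumption: proportional hazards (PH).

Main output: hazard ratio (HR), interpreted as a relative risk (RR).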

Commonly used regression models based on different types of outcomes

Types of outcome        Applicable regression model
Continuous outcome      Linear regression model
Categorical outcome     Logistic regression model
Time-to-event outcome   Cox proportional hazards model; Kaplan-Meier curves
Time series             Time series analysis


ID Age Grade Size Relapse Start End Time Status

1 62 1 0 0 02/10/1996 12/30/2000 59 0

2 64 1 0 0 03/05/1996 08/12/2000 54 1

3 52 2 0 1 04/09/1996 12/03/1999 44 0

4 60 1 0 0 06/06/1996 10/27/2000 53 0

… … … … … … … … …

30 54 3 1 1 03/10/2000 09/20/2000 6 1

Survival data of 30 bladder carcinoma patients
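A Kaplan-Meier estimate for time-to-event data of this shape can be sketched as below, using only the five rows visible in the table. The coding Status 1 = event and 0 = censored is an assumption (the slide does not define it), and `kaplan_meier` is an illustrative helper.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events: 1 = event observed, 0 = censored (assumed coding).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for time, event in data if time == t and event == 1)
        leaving = sum(1 for time, _ in data if time == t)
        if deaths:
            survival *= 1 - deaths / at_risk  # conditional survival at time t
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
        i += leaving
    return curve


# Time and Status for patients 1, 2, 3, 4 and 30 from the table
times = [59, 54, 44, 53, 6]
status = [0, 1, 0, 0, 1]

curve = kaplan_meier(times, status)
for t, s in curve:
    print(f"t = {t:>2}: S(t) = {s:.2f}")
```

Censored patients contribute to the number at risk until their last follow-up time and then drop out without changing the survival estimate.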







Analysis of observational studies


Cross-sectional study

Case-control study

Cohort study


• Sample: a sample of everyone in a population, regardless of exposure or outcome status

• Design: in each individual, determine the exposure and disease status at the same time (or period)

• "Snapshot"; no follow-up data

• Examples:

o Prevalence surveys (How common is kidney disease in a population?)

o Etiology (Is hypertension associated with prevalent kidney disease?)


• Study design:

Begin with: defined population

Then: collect data on exposure and disease, giving four groups

o Exposed, with disease
o Exposed, no disease
o Not exposed, with disease
o Not exposed, no disease

Statistical analysis

• Descriptive analysis: describe characteristics of the population

Numerical: Mean (SD), Median (range, IQR)

Categorical: Prevalence

• Comparison between groups

Numerical data: t test, ANOVA, Wilcoxon test

Categorical data: chi-square test, Fisher's exact test, Wilcoxon test


• Associations/risk factors

Binary outcome: Logistic regression

Numerical outcome: Linear regression?


Strengths and limitations

• Strengths:

o Useful for public health surveys
o Useful for public policy (allocation of resources)
o Good initial step in evaluating associations
o Cost-effective use of resources

• Limitations:

o Temporal relationship not defined: causation cannot be determined; survivor bias
o Cannot evaluate prognosis
o Cannot evaluate treatment effects

For example

• Statistical method
• Descriptive analysis
• Prevalence of diabetes
• Risk factors of diabetes and prediabetes

Case-control study

• Essentially, cases are compared with a control group

• Design: clearly define cases (patients with the disease)

• Clearly define controls (patients without the disease)

• Collect data regarding exposure (risk factors or predictors)

• Examples:

o Lung cancer and smoking

o Heart attack and mercury exposure

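[Diagram: within a population, cases (people with disease) and controls (people without disease) are identified; the direction of inquiry runs backward in time to classify them as exposed cases (a), unexposed cases (b), exposed controls (c), and unexposed controls (d)]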

Odds Ratio (OR)

• OR is the odds of exposure given disease divided by the odds of exposure given no disease.

• Remember that the odds ratio of exposure (cases vs. controls) is the same as the odds ratio of disease (exposed vs. unexposed).

           Exposed   Unexposed   Total
Cases      a         b           a+b
Controls   c         d           c+d
Total      a+c       b+d         a+b+c+d

$\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}$
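The odds ratio and an approximate confidence interval can be computed directly from the four cells. The counts below are hypothetical, and the interval uses Woolf's log-based method, which is one common choice among several.

```python
import math
from statistics import NormalDist


def odds_ratio(a, b, c, d, level=0.95):
    """OR = (a/b)/(c/d) = ad/bc with a Woolf (log-based) confidence interval."""
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical case-control data:
# cases: 40 exposed (a), 60 unexposed (b); controls: 20 exposed (c), 80 unexposed (d)
or_estimate, or_lower, or_upper = odds_ratio(40, 60, 20, 80)
print(f"OR = {or_estimate:.2f} (95% CI {or_lower:.2f} to {or_upper:.2f})")
```

An interval that excludes 1 suggests an association between exposure and disease, subject to the usual caveats about bias and confounding.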

Odds ratio


Analytic Strategy



Categorical: Frequency/prevalence

• Stratified analysis

o Calculate stratum-specific ORs for the exposure-outcome relationship

o Determine the presence of confounding and interaction

• Logistic regression analysis

o Adjusted OR, obtained by adjusting for confounding and interaction

o A special logistic model (conditional logistic regression) is applied in matched studies


• Strengths:

o Lower cost than cohort studies

o Useful for studying uncommon diseases

• Limitations:

o Very susceptible to bias

o Cannot evaluate prevalence, incidence, or prognosis

o Can only provide odds ratios, not relative risk (although OR is a good measure of association)

Statistical methods

Several important features

• Case-control studies provide an efficient means to study rare diseases and tend to be more feasible than other studies.

• Case-control studies allow researchers to investigate several risk factors.

• A single case-control investigation does not "prove" causality, but it can provide suggestive evidence of a causal relationship that warrants intervention by public health officials to reduce exposure to the implicated risk factor.

Cohort study

• Cohort = prospective = longitudinal

• A clearly defined cohort (group, sample) of persons at risk is followed through time

• Data regarding exposures (risk factors, predictors) are collected prior to data on outcomes (endpoints)

• The protocol is developed prior to data collection; research-grade data are collected for the purpose of testing hypotheses

[Diagram: from a population, people without the disease are classified as exposed or not exposed (not randomized) and followed forward in time (direction of inquiry) to see who develops disease and who does not in each group]

Relative risk (RR)

RR = incidence in exposed / incidence in non-exposed
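A minimal numerical sketch of this ratio, using hypothetical cohort counts and an illustrative helper name:

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Cumulative incidence in each group and their ratio (RR)."""
    incidence_exposed = cases_exposed / n_exposed
    incidence_unexposed = cases_unexposed / n_unexposed
    return incidence_exposed, incidence_unexposed, incidence_exposed / incidence_unexposed


# Hypothetical cohort: 30 of 200 exposed and 10 of 200 unexposed develop the disease
inc_exposed, inc_unexposed, rr = relative_risk(30, 200, 10, 200)
print(f"Incidence (exposed):   {inc_exposed:.3f}")
print(f"Incidence (unexposed): {inc_unexposed:.3f}")
print(f"RR = {rr:.2f}")
```

Unlike a case-control study, a cohort study observes incidence directly in each exposure group, which is why RR can be estimated here and not only OR.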




Categorical: Frequency/prevalence

• Comparison between groups

• Regression analysis

Adjust confounders

Select risk factors



• Incidence rate/cumulative incidence

• Incidence rate ratio

Poisson regression

• Survival analysis (Time-to-event data)

KM

Log-rank test

Cox regression



• Strengths

o To estimate temporal relationships between exposures and outcomes
o To estimate incidence of outcome after exposure
o Stronger external validity than RCTs (i.e. more representative of the general population)

• Limitations

o Long and costly
o Confounding: residual confounding; confounding by indication (i.e. very limited in studying treatment effects)
o Bias: loss to follow-up


Important points to remember

• Association ≠ Causation

• Statistical significance ≠ Clinical/practical significance

• Multiple factors contribute to whether your results are significant
