cph exam review biostatistics

CPH Exam Review: Biostatistics

Lisa Sullivan, PhD
Associate Dean for Education
Professor and Chair, Department of Biostatistics
Boston University School of Public Health

Outline and Goals
- Overview of Biostatistics (Core Area)
- Terminology and Definitions
- Practice Questions

An archived version of this review, along with the PPT file, will be available on the NBPHE website (www.nbphe.org) under Study Resources.

http://www.nbphe.org/

Biostatistics

Two Areas of Applied Biostatistics:

Descriptive Statistics
- Summarize a sample selected from a population

Inferential Statistics
- Make inferences about population parameters based on sample statistics

Variable Types
- Dichotomous variables have 2 possible responses (e.g., Yes/No)
- Ordinal and categorical variables have more than two responses, and the responses are ordered and unordered, respectively
- Continuous (or measurement) variables assume, in theory, any value between a theoretical minimum and maximum

We want to study whether individuals over 45 years are at greater risk of diabetes than those younger than 45. What kind of variable is age?

1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous

We are interested in assessing disparities in infant morbidity by race/ethnicity. What kind of variable is race/ethnicity?

1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous

Numerical Summaries of Dichotomous, Categorical and Ordinal Variables

Frequency Distribution Table

Health Status   Freq.   Rel. Freq.   Cumulative Freq.   Cumulative Rel. Freq.
Excellent       19      38%          19                 38%
Very Good       12      24%          31                 62%
Good             9      18%          40                 80%
Fair             6      12%          46                 92%
Poor             4       8%          50                 100%
Total           n=50    100%

The cumulative frequency and cumulative relative frequency columns apply to ordinal variables only.

Frequency Bar Chart

[Figure: frequency bar chart of Marital Status; vertical axis: Frequency, 0 to 30]

Relative Frequency Histogram

[Figure: relative frequency histogram of Health Status (Poor, Fair, Good, Very Good, Excellent); vertical axis: %, 0 to 40]

Continuous Variables
- Assume, in theory, any value between a theoretical minimum and maximum
- Quantitative, measurement variables
- Example – systolic blood pressure

Standard Summary: n = 75, X̄ = 123.6, s = 19.4

Second sample: n = 75, X̄ = 128.1, s = 6.4

Summarizing Location and Variability
- When there are no outliers, the sample mean and standard deviation summarize location and variability
- When there are outliers, the median and interquartile range (IQR) summarize location and variability, where IQR = Q3 - Q1
- Outliers: values < Q1 - 1.5 IQR or > Q3 + 1.5 IQR
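The outlier rule above can be sketched in a few lines of Python. The quartiles and data values below are hypothetical, chosen only to illustrate the arithmetic; the function name is also an assumption of this sketch.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) limits from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical quartiles of systolic blood pressure (illustration only).
q1, q3 = 110.0, 134.0
lower, upper = outlier_fences(q1, q3)

# Hypothetical observations; any value outside the limits is flagged.
values = [98, 105, 121, 140, 175]
outliers = [v for v in values if v < lower or v > upper]
```

With these assumed quartiles the IQR is 24, so the limits are 74 and 170, and only the value 175 is flagged.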

Mean Vs. Median

Box and Whisker Plot

[Figure: box and whisker plot marking Min, Q1, Median, Q3 and Max]

Comparing Samples with Box and Whisker Plots

[Figure: side-by-side box and whisker plots of Systolic Blood Pressure for samples 1 and 2; axis from 100 to 160]

What type of display is shown below?

1. Frequency bar chart
2. Relative frequency bar chart
3. Frequency histogram
4. Relative frequency histogram

[Figure: Percent Patients by Disease Stage (I, II, III, IV); vertical axis: %, 0 to 35]

The distribution of SBP in men, 20-29 years is shown below. What is the best summary of a typical value?

1. Mean
2. Median
3. Interquartile range
4. Standard Deviation

When data are skewed, the mean is higher than the median.

1. True
2. False

The best summary of variability for the following continuous variable is

1. Mean
2. Median
3. Interquartile range
4. Standard Deviation

Numerical and Graphical Summaries
- Dichotomous and categorical
  - Frequencies and relative frequencies
  - Bar charts (freq. or relative freq.)
- Ordinal
  - Frequencies, relative frequencies, cumulative frequencies and cumulative relative frequencies
  - Histograms (freq. or relative freq.)
- Continuous
  - n, X̄ and s, or median and IQR (if outliers)
  - Box and whisker plot

What is the probability of selecting a male with optimal blood pressure?

1. 20/25
2. 20/80
3. 20/150

          Blood Pressure Category
          Optimal   Normal   Pre-Htn   Htn   Total
Male      20        15       15        30    80
Female     5        15       25        25    70
Total     25        30       40        55    150

What is the probability of selecting a patient with Pre-Htn or Htn?

1. 95/150
2. 45/80
3. 55/150

          Blood Pressure Category
          Optimal   Normal   Pre-Htn   Htn   Total
Male      20        15       15        30    80
Female     5        15       25        25    70
Total     25        30       40        55    150

What proportion of men have prevalent CVD?

          CVD   Free of CVD
Men       35    265
Women     45    355

1. 35/80
2. 35/265
3. 35/300

What proportion of patients with CVD are men?

          CVD   Free of CVD
Men       35    265
Women     45    355

1. 35/700
2. 35/80
3. 80/300

Are Family History and Current Status Independent?

Example. Consider the following table, which cross-classifies subjects by their family history of CVD and current (prevalent) CVD status.

                  Current CVD
Family History    No     Yes
No                215    25
Yes                90    15

P(Current CVD | Family Hx) = 15/105 = 0.143
P(Current CVD | No Family Hx) = 25/240 = 0.104
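As a minimal sketch, the two conditional probabilities above can be recomputed from the table counts and compared with the overall probability of current CVD. The dictionary layout and names are assumptions of this sketch. Under independence, the conditional probabilities would equal the overall probability.

```python
# Counts from the family history by current CVD table above.
table = {
    "no_family_hx": {"no_cvd": 215, "cvd": 25},
    "family_hx": {"no_cvd": 90, "cvd": 15},
}

def p_cvd(row):
    """Proportion with current CVD within one row of the table."""
    return row["cvd"] / (row["cvd"] + row["no_cvd"])

p_given_hx = p_cvd(table["family_hx"])        # 15/105
p_given_no_hx = p_cvd(table["no_family_hx"])  # 25/240

total_cvd = sum(r["cvd"] for r in table.values())
total_n = sum(r["cvd"] + r["no_cvd"] for r in table.values())
p_overall = total_cvd / total_n               # 40/345

# In the sample, independence would require both conditional
# probabilities to equal the overall probability.
independent_in_sample = (
    abs(p_given_hx - p_overall) < 1e-9
    and abs(p_given_no_hx - p_overall) < 1e-9
)
```

Here the sample proportions differ (0.143 versus 0.104), so the two variables are not independent in this sample. Whether the difference is statistically significant is a separate question for a formal test.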

Are symptoms independent of disease?

               Disease   No Disease   Total
Symptoms       25        225          250
No Symptoms    50        450          500

1. No
2. Yes

Probability Models – Binomial Distribution
- Two possible outcomes: success and failure
- Replications of process are independent
- P(success) is constant for each replication
- Mean = np, variance = np(1-p)

$$P(x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}$$
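The binomial formula above can be checked numerically. This is a small sketch with hypothetical values of n and p; it also confirms that the mean and variance computed from the probabilities agree with np and np(1-p).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent replications with P(success) = p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: n = 10 replications, P(success) = 0.2.
n, p = 10, 0.2
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
```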

Probability Models – Poisson Distribution
- Two possible outcomes: success and failure
- Replications of process are independent
- Often used to model counts (often used to model rare events)
- Mean = μ, variance = μ

$$P(x) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
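A short sketch of the Poisson formula, using a hypothetical mean of 2 events, shows how individual and cumulative probabilities follow from it.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(x events) when the mean number of events is mu."""
    return exp(-mu) * mu**x / factorial(x)

# Hypothetical mean of 2 events per unit of observation.
mu = 2.0
p_zero = poisson_pmf(0, mu)                                  # P(X = 0)
p_at_most_two = sum(poisson_pmf(x, mu) for x in range(3))    # P(X <= 2)
```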

Probability Models – Normal Distribution
- Model for continuous outcome
- Mean = median = mode

Normal Distribution
Properties of Normal Distribution
i) The normal distribution is symmetric about the mean (i.e., P(X > μ) = P(X < μ) = 0.5).
ii) The mean and variance (μ and σ²) completely characterize the normal distribution.
iii) The mean = the median = the mode.
iv) Approximately 68% of observations fall between mean ± 1 sd, 95% between mean ± 2 sd, and >99% between mean ± 3 sd.

Normal Distribution
Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6.

What is the probability that a male has BMI < 29?

[Figure: normal curve for BMI; axis marked at 11, 17, 23, 29, 35, 41, 47]

P(X < 29) = 0.5

Normal Distribution

What is the probability that a male has BMI less than 30?

[Figure: normal curve for BMI; axis marked at 11, 17, 23, 29, 35, 41, 47]

P(X < 30) = ?

Standard Normal Distribution Z
Normal distribution with μ = 0 and σ = 1

[Figure: standard normal curve; axis from -3 to 3]

Normal Distribution

$$Z = \frac{x - \mu}{\sigma} = \frac{30 - 29}{6} = 0.17$$

P(X < 30) = P(Z < 0.17) = 0.5675

From a table of standard normal probabilities or statistical computing package.
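As an illustration of the "statistical computing package" route, the probability above can be reproduced with Python's standard library. This sketch rounds Z to two decimals to match the table lookup, and also shows the unrounded value, which differs slightly.

```python
from statistics import NormalDist

mu, sigma = 29, 6                 # BMI for men age 60, from the example above

# Table approach: standardize, round Z to two decimals, look up P(Z < z).
z = round((30 - mu) / sigma, 2)   # 0.17
p_table = NormalDist().cdf(z)

# Direct computation without rounding Z.
p_exact = NormalDist(mu, sigma).cdf(30)
```

Rounding Z to 0.17 reproduces the table value 0.5675; the unrounded computation gives a probability of about 0.566.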

Comparing Systolic Blood Pressure (SBP)
- Suppose for Males Age 50, SBP is approximately normally distributed with a mean of 108 and a standard deviation of 14
- Suppose for Females Age 50, SBP is approximately normally distributed with a mean of 100 and a standard deviation of 8
- If a Male Age 50 has a SBP = 140 and a Female Age 50 has a SBP = 120, who has the “relatively” higher SBP?

Normal Distribution

ZM = (140 - 108) / 14 = 2.29

ZF = (120 - 100) / 8 = 2.50

Which is more extreme?

Percentiles of the Normal Distribution

The kth percentile is defined as the score that holds k percent of the scores below it.

E.g., the 90th percentile is the score that holds 90% of the scores below it.

Q1 = 25th percentile, median = 50th percentile, Q3 = 75th percentile

Percentiles
For the normal distribution, the following is used to compute percentiles:

X = μ + Z σ

where μ = mean of the random variable X,
σ = standard deviation, and
Z = value from the standard normal distribution for the desired percentile (e.g., 95th, Z = 1.645).

95th percentile of BMI for Men: 29 + 1.645(6) = 38.9
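The percentile formula can be sketched as follows, with the Z value taken from the standard library rather than a table. The function name is an assumption of this sketch.

```python
from statistics import NormalDist

def normal_percentile(k, mu, sigma):
    """k-th percentile of a normal distribution: X = mu + Z * sigma."""
    z = NormalDist().inv_cdf(k / 100)   # Z for the desired percentile
    return mu + z * sigma

# 95th percentile of BMI for men (mean 29, standard deviation 6).
bmi_95th = normal_percentile(95, mu=29, sigma=6)
bmi_median = normal_percentile(50, mu=29, sigma=6)
```

This reproduces the 38.9 computed above, and the 50th percentile equals the mean, as expected for a symmetric distribution.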

Central Limit Theorem
- (Non-normal) population with μ, σ
- Take samples of size n – as long as n is sufficiently large (usually n > 30 suffices)
- The distribution of the sample mean is approximately normal, therefore can use Z to compute probabilities

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Standard error: $\sigma / \sqrt{n}$
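A minimal sketch of using Z for a sample mean: the population values are borrowed from the BMI example (mean 29, standard deviation 6), while the sample size of 36 is a hypothetical choice for illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(x_bar, mu, sigma, n):
    """P(sample mean < x_bar), using the normal approximation from the CLT."""
    standard_error = sigma / sqrt(n)
    z = (x_bar - mu) / standard_error
    return NormalDist().cdf(z)

# Hypothetical question: probability that the mean BMI of 36 men is below 30.
p_mean_below_30 = prob_sample_mean_below(30, mu=29, sigma=6, n=36)
```

With n = 36 the standard error is 1, so Z = 1 and the probability is about 0.84. That is considerably larger than the probability that a single man has BMI below 30, because sample means vary less than individual values.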

Statistical Inference
- There are two broad areas of statistical inference, estimation and hypothesis testing.
- Estimation. Population parameter is unknown, sample statistics are used to generate estimates.
- Hypothesis Testing. A statement is made about parameter, sample statistics support or refute statement.

What Analysis To Do When
- Nature of primary outcome variable
  - Continuous, dichotomous, categorical, time to event
- Number of comparison groups
  - One, 2 independent, 2 matched or paired, > 2
- Associations between variables
  - Regression analysis

Estimation
- Process of determining likely values for unknown population parameter
- Point estimate is best single-valued estimate for parameter
- Confidence interval is range of values for parameter:
  point estimate ± margin of error
  point estimate ± t SE(point estimate)

Hypothesis Testing Procedures
1. Set up null and research hypotheses, select α
2. Select test statistic
3. Set up decision rule
4. Compute test statistic
5. Draw conclusion & summarize significance (p-value)

P-values
- P-values represent the exact significance of the data
- Estimate p-values when rejecting H0 to summarize significance of the data (approximate with statistical tables, exact value with computing package)
- If p < α then reject H0

Errors in Hypothesis Tests

             Conclusion of Statistical Test
             Do Not Reject H0    Reject H0
H0 true      Correct             Type I error
H0 false     Type II error       Correct

Continuous Outcome
Confidence Interval for μ
- Continuous outcome
- 1 Sample

n > 30: $\bar{X} \pm Z \dfrac{s}{\sqrt{n}}$

n < 30: $\bar{X} \pm t \dfrac{s}{\sqrt{n}}$

Example. 95% CI for mean waiting time at ED
Data: n = 100, X̄ = 37.85 and s = 9.5 mins

$$37.85 \pm 1.96 \frac{9.5}{\sqrt{100}}$$

37.85 ± 1.86 (35.99 to 39.71)

Statistical computing packages use t throughout.
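The waiting-time interval can be reproduced with a short sketch. It uses the Z-based formula from the slide; the function name and the `confidence` parameter are assumptions of this sketch, and a t-based interval (as computing packages typically report) would be very slightly wider.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_mean(x_bar, s, n, confidence=0.95):
    """Large-sample confidence interval for a mean: x_bar +/- Z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# ED waiting time example: n = 100, mean 37.85, s = 9.5 minutes.
lower, upper = z_interval_mean(37.85, 9.5, 100)
```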

New Scenario
- Outcome is dichotomous
  - Result of surgery (success, failure)
  - Cancer remission (yes/no)
- One study sample
- Data
  - On each participant, measure outcome (yes/no)
  - n, x = # positive responses, $\hat{p} = x/n$

Dichotomous Outcome
Confidence Interval for p
- Dichotomous outcome
- 1 Sample

$$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$\min[n\hat{p},\; n(1-\hat{p})] \ge 5$; otherwise, exact procedures

Example. In the Framingham Offspring Study (n=3532), 1219 patients were on antihypertensive medications. Generate 95% CI.

$$0.345 \pm 1.96 \sqrt{\frac{0.345(1-0.345)}{3532}}$$

0.345 ± 0.016 (0.329, 0.361)
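A sketch of the same calculation in Python, including the sample-size check that the smaller of the expected counts is at least 5. The function name is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_proportion(x, n, confidence=0.95):
    """Large-sample confidence interval for a proportion."""
    p_hat = x / n
    # The normal approximation is only appropriate with enough events and non-events.
    if min(n * p_hat, n * (1 - p_hat)) < 5:
        raise ValueError("sample too small for the Z interval; use exact procedures")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Framingham Offspring example: 1219 of 3532 on antihypertensive medication.
lower, upper = z_interval_proportion(1219, 3532)
```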

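One Sample Procedures – Comparisons with Historical/External Control

Continuous
H0: μ = μ0
H1: μ > μ0, μ < μ0, μ ≠ μ0

n > 30: $Z = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$

n < 30: $t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$

Dichotomous
H0: p = p0
H1: p > p0, p < p0, p ≠ p0

$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$

$\min[np_0,\; n(1-p_0)] \ge 5$; otherwise, exact procedures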

One Sample Procedures – Comparisons with Historical/External Control

Categorical or Ordinal outcome
χ² Goodness of fit test

H0: p1 = p10, p2 = p20, . . . , pk = pk0

H1: H0 is false

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
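The goodness-of-fit statistic can be sketched as below. The observed counts and null proportions are hypothetical, and the function name is an assumption of this sketch; expected counts are the sample size times the proportions specified under H0.

```python
def chi_square_gof(observed, null_proportions):
    """Chi-square goodness-of-fit statistic and its degrees of freedom (k - 1)."""
    n = sum(observed)
    expected = [n * p for p in null_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

# Hypothetical data: 100 subjects in 3 categories, H0 proportions 0.25, 0.50, 0.25.
observed = [30, 50, 20]
null_proportions = [0.25, 0.50, 0.25]
statistic, df = chi_square_gof(observed, null_proportions)
```

Here the statistic is 2.0 with 2 degrees of freedom, which is below the 5% critical value of 5.99, so H0 would not be rejected for these assumed data.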

New Scenario
- Outcome is continuous
  - SBP, Weight, cholesterol
- Two independent study samples
- Data
  - On each participant, identify group and measure outcome
  - $n_1, \bar{X}_1, s_1 \ (\text{or } s_1^2), \quad n_2, \bar{X}_2, s_2 \ (\text{or } s_2^2)$

Two Independent Samples
Cohort Study: Set of Subjects Who Meet Study Inclusion Criteria
- Group 1: Mean Group 1
- Group 2: Mean Group 2

Two Independent Samples
RCT: Set of Subjects Who Meet Study Eligibility Criteria
- Randomize
  - Treatment 1: Mean Trt 1
  - Treatment 2: Mean Trt 2

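Continuous Outcome
Confidence Interval for (μ1 - μ2)
- Continuous outcome
- 2 Independent Samples

n1 > 30 and n2 > 30: $(\bar{X}_1 - \bar{X}_2) \pm Z\, S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

n1 < 30 or n2 < 30: $(\bar{X}_1 - \bar{X}_2) \pm t\, S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

$$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$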
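A minimal sketch of the pooled standard deviation and the large-sample interval for a difference in means. The summary statistics are hypothetical, and the function names are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate of the common standard deviation, Sp."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def z_interval_mean_difference(n1, x1, s1, n2, x2, s2, confidence=0.95):
    """Large-sample CI for (mu1 - mu2) using the pooled standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * pooled_sd(n1, s1, n2, s2) * sqrt(1 / n1 + 1 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin

# Hypothetical summary statistics for two independent samples of SBP.
sp = pooled_sd(50, 10.0, 50, 10.0)
lower, upper = z_interval_mean_difference(50, 130.0, 10.0, 50, 125.0, 10.0)
```

For these assumed values the interval is roughly 1.08 to 8.92. It excludes 0, the null value for a difference in means.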

Hypothesis Testing for (μ1 - μ2)
- Continuous outcome
- 2 Independent Samples

H0: μ1 = μ2 (μ1 - μ2 = 0)

H1: μ1 > μ2, μ1 < μ2, μ1 ≠ μ2

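Hypothesis Testing for (μ1 - μ2)

Test Statistic

n1 > 30 and n2 > 30: $Z = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

n1 < 30 or n2 < 30: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$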

An RCT is planned to show the efficacy of a new drug vs. placebo to lower total cholesterol.

What are the hypotheses?

1. H0: μP = μN   H1: μP > μN
2. H0: μP = μN   H1: μP < μN
3. H0: μP = μN   H1: μP ≠ μN

New Scenario
- Outcome is dichotomous
  - Result of surgery (success, failure)
  - Cancer remission (yes/no)
- Two independent study samples
- Data
  - On each participant, identify group and measure outcome (yes/no)
  - $n_1, \hat{p}_1, n_2, \hat{p}_2$

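Dichotomous Outcome
Confidence Interval for (p1 - p2)
- Dichotomous outcome
- 2 Independent Samples

$$(\hat{p}_1 - \hat{p}_2) \pm Z \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

$\min[n_1\hat{p}_1,\; n_1(1-\hat{p}_1),\; n_2\hat{p}_2,\; n_2(1-\hat{p}_2)] \ge 5$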

Measures of Effect for Dichotomous Outcomes

Outcome = dichotomous (Y/N or 0/1)

Risk=proportion of successes = x/n

Odds=ratio of successes to failures=x/(n-x)

Measures of Effect for Dichotomous Outcomes

Risk Difference = $\hat{p}_1 - \hat{p}_2$

Relative Risk = $\hat{p}_1 / \hat{p}_2$

Odds Ratio = $\dfrac{\hat{p}_1 / (1 - \hat{p}_1)}{\hat{p}_2 / (1 - \hat{p}_2)}$

Confidence Intervals for Relative Risk (RR)
- Dichotomous outcome
- 2 Independent Samples

$$\ln(\widehat{RR}) \pm Z \sqrt{\frac{(n_1 - x_1)/x_1}{n_1} + \frac{(n_2 - x_2)/x_2}{n_2}}$$

The interval for RR is then: exp(lower limit), exp(upper limit)

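Confidence Intervals for Odds Ratio (OR)
- Dichotomous outcome
- 2 Independent Samples

$$\ln(\widehat{OR}) \pm Z \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$$

The interval for OR is then: exp(lower limit), exp(upper limit)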
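The two interval formulas can be sketched together. For illustration only, this reuses the prevalent CVD counts from the earlier practice question (35 of 300 men, 45 of 400 women), treating men as group 1; the function names are assumptions of this sketch.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk_ci(x1, n1, x2, n2, confidence=0.95):
    """Point estimate and CI for the relative risk, built on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    rr = (x1 / n1) / (x2 / n2)
    se = sqrt((n1 - x1) / x1 / n1 + (n2 - x2) / x2 / n2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

def odds_ratio_ci(x1, n1, x2, n2, confidence=0.95):
    """Point estimate and CI for the odds ratio, built on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
    se = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)

# Men: 35 with CVD out of 300; women: 45 with CVD out of 400.
rr, rr_lower, rr_upper = relative_risk_ci(35, 300, 45, 400)
or_, or_lower, or_upper = odds_ratio_ci(35, 300, 45, 400)
```

Both estimates are close to 1 (about 1.04) and both intervals include 1, the null value for a ratio, so these data do not show a significant difference in prevalence between men and women.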

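Hypothesis Testing for (p1 - p2)
- Dichotomous outcome
- 2 Independent Samples

H0: p1 = p2

H1: p1 > p2, p1 < p2, p1 ≠ p2

Test Statistic

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\hat{p} = (x_1 + x_2)/(n_1 + n_2)$ is the overall (pooled) proportion.

$\min[n_1\hat{p}_1,\; n_1(1-\hat{p}_1),\; n_2\hat{p}_2,\; n_2(1-\hat{p}_2)] \ge 5$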
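A sketch of the test statistic for two proportions. For illustration it compares the proportion with hypertension in men (30 of 80) and women (25 of 70) from the blood pressure table earlier in the review; the pooled proportion and the two-sided p-value follow the usual large-sample approach, and the function name is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)          # overall proportion under H0
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypertension: 30 of 80 men versus 25 of 70 women.
z, p_value = two_proportion_z_test(30, 80, 25, 70)
```

Z is about 0.23 and the p-value is well above 0.05, so H0 would not be rejected for these counts.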

Two (Independent) Group Comparisons

Difference in birth weight is -106 g.

95% CI for difference in mean birth weight: (-175.3 to -36.7)

New Scenario
- Outcome is continuous
  - SBP, Weight, cholesterol
- Two matched study samples
- Data
  - On each participant, measure outcome under each experimental condition
  - Compute differences (D = X1 - X2)
  - $n, \bar{X}_d, s_d$

Two Dependent/Matched Samples

Subject ID   Measure 1   Measure 2
1            55          70
2            42          60
.
.

Measures taken serially in time or under different experimental conditions

Crossover Trial

[Diagram: Eligible Participants are randomized (R) to Treatment or Placebo in the first period, then cross over to the other condition in the second period]

Each participant measured on Treatment and placebo

Confidence Intervals for μd
- Continuous outcome
- 2 Matched/Paired Samples

n > 30: $\bar{X}_d \pm Z \dfrac{s_d}{\sqrt{n}}$

n < 30: $\bar{X}_d \pm t \dfrac{s_d}{\sqrt{n}}$

Hypothesis Testing for μd
- Continuous outcome
- 2 Matched/Paired Samples

H0: μd = 0

H1: μd > 0, μd < 0, μd ≠ 0

Test Statistic

n > 30: $Z = \dfrac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}}$

n < 30: $t = \dfrac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}}$

Independent Vs Matched Design

Statistical Significance versus Effect Size
- P-value summarizes significance
- Confidence intervals give magnitude of effect (if null value is included in CI, then no statistical significance)

The null value of a difference in means is…

1. 0
2. 0.5
3. 1
4. 2

The null value of a mean difference is…

1. 0
2. 0.5
3. 1
4. 2

The null value of a relative risk is…

1. 0
2. 0.5
3. 1
4. 2

The null value of a difference in proportions is…

1. 0
2. 0.5
3. 1
4. 2

The null value of an odds ratio is…

1. 0
2. 0.5
3. 1
4. 2

A two sided test for the equality of means produces p=0.20. Reject H0?

1. Yes
2. No
3. Maybe

Hypothesis Testing for More than 2 Means - Analysis of Variance
- Continuous outcome
- k Independent Samples, k > 2

H0: μ1 = μ2 = μ3 = … = μk

H1: Means are not all equal

Test Statistic

$$F = \frac{\sum n_j (\bar{X}_j - \bar{X})^2 / (k-1)}{\sum\sum (X - \bar{X}_j)^2 / (N-k)}$$

F is ratio of between group variation to within group variation (error)

ANOVA Table

Source of Variation    Sums of Squares   df     Mean Squares       F
Between Treatments     SSB               k-1    MSB = SSB/(k-1)    MSB/MSE
Error                  SSE               N-k    MSE = SSE/(N-k)
Total                  SST               N-1

$SSB = \sum n_j (\bar{X}_j - \bar{X})^2$

$SSE = \sum\sum (X - \bar{X}_j)^2$

$SST = \sum\sum (X - \bar{X})^2$
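The entries of the ANOVA table can be computed directly from raw data. The following is a sketch on three small hypothetical groups; the function name and the returned dictionary keys are assumptions of this sketch.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares and F for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group (error) sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)
    mse = sse / (n_total - k)
    return {"SSB": ssb, "SSE": sse, "SST": sst,
            "df_between": k - 1, "df_error": n_total - k,
            "MSB": msb, "MSE": mse, "F": msb / mse}

# Hypothetical outcome values in three independent groups.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For these assumed data SSB = 26, SSE = 6 and F = 13, and the between and error sums of squares add up to SST = 32.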

ANOVA
- When the sample sizes are equal, the design is said to be balanced
- Balanced designs give greatest power and are more robust to violations of the normality assumption

Extensions
- Multiple Comparison Procedures – Used to test for specific differences in means after rejecting equality of all means (e.g., Tukey, Scheffe)
- Higher-Order ANOVA - Tests for differences in means as a function of several factors

Extensions
- Repeated Measures ANOVA - Tests for differences in means when there are multiple measurements in the same participants (e.g., measures taken serially in time)

χ² Test of Independence
- Dichotomous, ordinal or categorical outcome
- 2 or More Samples

H0: The distribution of the outcome is independent of the groups

H1: H0 is false

Test Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

χ² Test of Independence
Data organization (r by c table)

Is the distribution of the outcome different across (associated with) groups?

         Outcome
Group    1      2      3
A        20%    40%    40%
B        50%    25%    25%
C        90%    5%     5%
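As a sketch of the test of independence, the statistic can be computed from a table of counts, with each expected count equal to (row total × column total) / N. For illustration this uses the family history by current CVD counts from earlier in the review; the function name is an assumption of this sketch.

```python
def chi_square_independence(observed):
    """Chi-square statistic and df for an r by c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under H0
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Rows: no family history, family history. Columns: no current CVD, current CVD.
statistic, df = chi_square_independence([[215, 25], [90, 15]])
```

The statistic is about 1.07 with df = 1, below the 5% critical value of 3.84, so these data do not provide significant evidence of an association between family history and current CVD.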

What Tests Were Used?

In Framingham Heart Study, we want to assess risk factors for Impaired Glucose
- Outcome = Glucose Category
  - Diabetes (glucose > 126), Impaired Fasting Glucose (glucose 100-125), Normal Glucose
- Risk Factors
  - Sex
  - Age
  - BMI (normal weight, overweight, obese)
  - Genetics

What test would be used to assess whether sex is associated with Glucose Category?

1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess whether age is associated with Glucose Category?


What test would be used to assess whether BMI is associated with Glucose Category?


In Framingham Heart Study, we want to assess risk factors for Glucose Level
- Consider a Secondary Outcome = Fasting Glucose Level
- Risk Factors
  - Sex
  - Age
  - BMI (normal weight, overweight, obese)
  - Genetics

What test would be used to assess whether sex is associated with Glucose Level?


What test would be used to assess whether BMI is associated with Glucose Level?


What test would be used to assess whether age is associated with Glucose Level?


In Framingham Heart Study, we want to assess risk factors for Diabetes
- Consider a Tertiary Outcome = Diabetes Vs No Diabetes
- Risk Factors
  - Sex
  - Age
  - BMI (normal weight, overweight, obese)
  - Genetics

What test would be used to assess whether sex is associated with Diabetes?


What test would be used to assess whether BMI is associated with Diabetes?


What test would be used to assess whether age is associated with Diabetes?


Correlation
- Correlation (r) – measures the nature and strength of linear association between two variables at a time
- Regression – equation that best describes relationship between variables

What is the most likely value of r for the data shown below?

[Figure: scatter plot of Y versus X]

1. r=-0.5
2. r=0
3. r=0.5
4. r=1

What is the most likely value of r for the data shown below?

1. r=-0.5
2. r=0
3. r=0.5
4. r=1

[Figure: scatter plot of Y versus X]

Simple Linear Regression
- Y = Dependent, Outcome variable
- X = Independent, Predictor variable
- $\hat{y} = b_0 + b_1 x$
- b0 is the Y-intercept, b1 is the slope
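A minimal sketch of estimating b0 and b1 by least squares, together with the correlation r, on a small hypothetical data set. The least-squares formulas used here (slope = Sxy/Sxx, intercept = mean of y minus slope times mean of x) are the standard ones and are not spelled out in the slides.

```python
from math import sqrt

def simple_linear_regression(xs, ys):
    """Least-squares intercept b0, slope b1 and correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b1 = sxy / sxx                  # slope
    b0 = mean_y - b1 * mean_x       # Y-intercept
    r = sxy / sqrt(sxx * syy)       # correlation coefficient
    return b0, b1, r

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r = simple_linear_regression(xs, ys)
```

For these assumed values the fitted line is ŷ = 2.2 + 0.6x and r is about 0.77. The slope and r always share the same sign.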

Simple Linear Regression
Assumptions
- Linear relationship between X and Y
- Independence of errors
- Homoscedasticity (constant variance) of the errors
- Normality of errors

Multiple Linear Regression
- Useful when we want to jointly examine the effect of several X variables on the outcome Y variable.
- Y = continuous outcome variable
- X1, X2, …, Xp = set of independent or predictor variables
- $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$

Multiple Regression Analysis
- Model is conditional, parameter estimates are conditioned on other variables in model
- Perform overall test of regression
- If significant, examine individual predictors
- Relative importance of predictors by p-values (or standardized coefficients)

Multiple Regression Analysis
- Predictors can be continuous, indicator variables (0/1) or a set of dummy variables
- Dummy variables (for categorical predictors)
  - Race: white, black, Hispanic
  - Black (1 if black, 0 otherwise)
  - Hispanic (1 if Hispanic, 0 otherwise)

Definitions
- Confounding – the distortion of the effect of a risk factor on an outcome
- Effect Modification – a different relationship between the risk factor and an outcome depending on the level of another variable

Multiple Regression for SBP: Comparison of Parameter Estimates

            Simple Models        Multiple Regression
            b        p           b        p
Age          1.03    <.0001       0.86    <.0001
Male        -2.26    .0009       -2.22    .0002
BMI          1.80    <.0001       1.48    <.0001
BP Meds     33.38    <.0001      24.12    <.0001

Focus on the association between BP meds and SBP…

RCT of New Drug to Raise HDL
Example of Effect Modification

Women       N     Mean     Std Dev
New drug    40    38.88    3.97
Placebo     41    39.24    4.21

Men         N     Mean     Std Dev
New drug    10    45.25    1.89
Placebo      9    39.06    2.22

Simple Logistic Regression
- Outcome is dichotomous (binary)
- We model the probability p of having the disease.

$$p = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$$

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$$

Multiple Logistic Regression
- Outcome is dichotomous (1=event, 0=non-event) and p=P(event)
- Outcome is modeled as log odds

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

Multiple Logistic Regression for Birth Defect (Y/N)

Predictor    b         p         OR (95% CI for OR)
Intercept    -1.099    0.0994
Smoke         1.062    0.2973    2.89 (0.34, 22.51)
Age           0.298    0.0420    1.35 (1.02, 1.78)

Interpretation of OR for age: For two mothers differing in age by one year, the odds of having a birth defect for the older mother are estimated to be 1.35 times the odds for the younger mother, after adjusting for smoking.
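The odds ratios in the table are the exponentiated regression coefficients, which can be checked in a few lines. The sketch below also converts a log odds into a probability with the logistic formula; evaluating it at all predictors equal to zero is only an arithmetic illustration, since age zero is not a meaningful value for a mother.

```python
from math import exp

# Coefficients from the birth defect model above.
intercept = -1.099
coefficients = {"Smoke": 1.062, "Age": 0.298}

# Odds ratio for each predictor: OR = exp(b).
odds_ratios = {name: exp(b) for name, b in coefficients.items()}

def probability_from_log_odds(log_odds):
    """Invert the logit: p = e^L / (1 + e^L)."""
    return exp(log_odds) / (1 + exp(log_odds))

# Arithmetic illustration only: log odds equal to the intercept.
p_at_zero = probability_from_log_odds(intercept)
```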

Survival Analysis
- Outcome is the time to an event.
- An event could be heart attack, cancer remission or death.
- Measure whether person has event or not (Yes/No) and if so, their time to event.
- Determine factors associated with longer survival.

Survival Analysis
- Incomplete follow-up information
- Censoring
  - Measure follow-up time and not time to event
  - We know survival time > follow-up time
- Log rank test to compare survival in two or more independent groups

Survival Curve – Survival Function

Comparing Survival Curves

H0: Two survival curves are equal

χ² Test with df=1. Reject H0 if χ² > 3.84

χ² = 6.151. Reject H0.

Cox Proportional Hazards Model
- Model: ln(h(t)/h0(t)) = b1X1 + b2X2 + … + bpXp
- exp(bi) = hazard ratio
- Model used to jointly assess effects of independent variables on outcome (time to an event).

Outcome = all-cause mortality
Age and Sex as predictors

            bi         p         HR
Age         0.11149    0.0001    1.118
Male Sex    0.67958    0.0001    1.973

Sample Size Determination
- Need sample to ensure precision in analysis
- Sample size determined based on type of planned analysis
  - CI
  - Test of hypothesis

Determining Sample Size for Confidence Interval Estimates
- Goal is to estimate an unknown parameter using a confidence interval estimate
- Plan a study to sample individuals, collect appropriate data and generate CI estimate
- How many individuals should we sample?

Determining Sample Size for Confidence Interval Estimates
- Confidence intervals: point estimate ± margin of error
- Determine n to ensure small margin of error (precision) – accounting for attrition!
- Must specify desired margin of error, confidence level and variability of parameter
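As a hedged illustration of these inputs, the sketch below uses the standard formula for a confidence interval for a mean, n = (Zσ/E)², which is not shown in the slides, and then inflates the result for attrition. The planning values (σ = 9.5, margin of error E = 2, 10% attrition) and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean_ci(sigma, margin, confidence=0.95):
    """Sample size so a CI for a mean has the desired margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)   # round up to a whole participant

def inflate_for_attrition(n, attrition_rate):
    """Number to enroll so that n participants remain after attrition."""
    return ceil(n / (1 - attrition_rate))

# Hypothetical planning values: sd = 9.5, margin of error = 2, 95% confidence.
n_needed = n_for_mean_ci(sigma=9.5, margin=2)
n_to_enroll = inflate_for_attrition(n_needed, attrition_rate=0.10)
```

With these assumed values, 87 participants are needed for the analysis, and 97 should be enrolled to allow for 10% attrition.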

Determining Sample Size for Hypothesis Testing
- How many participants are needed to ensure that there is a high probability of rejecting H0 when it is really false?
- Determine n to ensure high power (usually 80% or 90%) – accounting for attrition!
- Must specify desired power, α and effect size (difference in parameter under H0 versus H1)

Determining Sample Size for Hypothesis Testing
- β and Power are related to the sample size, level of significance (α) and the effect size (difference in parameter of interest under H0 versus H1)
  - Power is higher with larger α
  - Power is higher with larger effect size
  - Power is higher with larger sample size

Sample Size Determination
- Critical
- Ethical
- Sometimes difficult
