statistical methods for evaluating biomarkerscourses.washington.edu/epi573/janes session...

55
Statistical Methods for Evaluating Biomarkers Holly Janes Fred Hutchinson Cancer Research Center November 10, 2008

Upload: others

Post on 02-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Statistical Methods for Evaluating Biomarkers

Holly JanesFred Hutchinson Cancer Research Center

November 10, 2008

Page 2: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Biomarkers for...

Diagnosis: disease versus non-diseaseScreening: early diagnosisPrognosis: predicting outcome

ExamplesI clinical signs / symptomsI laboratory testsI gene expression technologyI proteomicsI combinations of any of the above

How to evaluate their accuracy?

Page 3: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Outline

1. Measures of biomarker accuracy2. Evaluating incremental value3. Phases of biomarker development4. Study design issues5. Advanced topics6. Software

Page 4: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Measures of Accuracy for Binary Markers

Page 5: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Classification Probabilities

D = outcome (disease)Y = binary marker

D = 0 D = 1Y = 0 True negative False negativeY = 1 False positive True positive

false positive fraction = FPF = P[Y = 1|D = 0] = 1 - specificitytrue positive fraction = TPF = P[Y = 1|D = 1] = sensitivity

Ideal test: TPF = 1 and FPF = 0

Page 6: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Classification Probabilities, cont’d

I condition on disease statusI describe test accuracyI helpful to public health researchers: should the test be

used?I helpful to individual: should I have the test?

Page 7: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Predictive Values

positive predictive value = PPV = P[D = 1|Y = 1]negative predictive value = NPV = P[D = 0|Y = 0]

Ideal test: PPV = 1 and NPV = 1

I condition on test resultI require cohort study to estimateI depend on TPF, FPF, and prevalence (ρ)

PPV = ρTPF/(ρTPF + (1-ρ)FPF)NPV = (1-ρ)(1-FPF)/((1-ρ)(1-FPF) + ρ(1-TPF))

I describe predictive capacity of testI given my test result, how likely is it that I’m diseased?

Page 8: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Example: Diagnosis of CADY : exercise stress testD : coronary artery disease

D = 0 D = 1Y = 0 22.3% 14.2% 36.5%Y = 1 7.8% 55.6% 63.4%

30.1% 69.8% 100%

TPF = 0.797, FPF = 0.259, ρ = 0.698PPV = 0.877, NPV = 0.611, τ = 0.634

I CAD detects 80% of diseased subjects and incorrectlyidentifies 26% of non-diseased as suspicious

I 88% of test positives and 39% of test negatives havedisease

From The Statistical Evaluation of Medical Tests for Classification and Predictionby Margaret S. Pepe, Ph.D., Oxford University Press, 2003

Page 9: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Inappropriate Commonly Used Measures

I misclassification rate (MCR)I odds ratio

Page 10: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

MCR

I = P[Y 6= D]= P[Y = 1|D = 0]P[D = 0] + P[Y = 0|D = 1]P[D = 1]= FPF ∗ (1− ρ) + (1− TPF) ∗ ρ

I ignores differential importance of false negative and falsepositive errors

I depends on the prevalence (ρ)I eg, if P[Y = 1|D = 1] = P[Y = 1|D = 0] = 0 with low ρ,

MCR lowI used a lot in statistics, not in medical settings

Page 11: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Odds Ratio

I = TPF∗(1−FPF)FPF∗(1−TPF)

I measure of association, not classificationI good classification ⇒ huge odds ratiosI e.g., TPF = 0.80, FPF = 0.10 (a ‘good’ test)

I Odds Ratio = 0.80∗(1−0.10)0.10∗(1−0.80) = 36

Page 12: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

FPF

0.0 0.2 0.4 0.6 0.8 1.0

TPF

0.0

0.2

0.4

0.6

0.8

1.0

1.5

2.0

3.0

9.0

16.0

1.0

36.0

(FPF, TPF) corresponding to different odds ratios

I large odds ratio does not imply good classifierI need to report FPF and TPF separately

From Pepe et al. AJE 2004; 159:882–90.

Page 13: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Measures of Accuracy for Continuous Markers

Page 14: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Classification Accuracy for a Continuous Test

Continuous marker, YI most markers

The ROC curve generalizes (FPF, TPF) to continuous markers

I thresholding rule: ‘positive’ if Y ≥ cI TPF(c) = P[Y ≥ c|D = 1]

FPF(c) = P[Y ≥ c|D = 0]

I ROC(·) = {(FPF(c), TPF(c)), c ε (−∞,∞)}

Page 15: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

−5 0 5

Y

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPF(c)

TP

F(c

)

Page 16: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

−5 0 5

Y

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPF(c)

TP

F(c

)

Page 17: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

−5 0 5

Y

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPF(c)

TP

F(c

)

Page 18: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

−5 0 5

Y

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPF(c)

TP

F(c

)

Page 19: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Attributes of the ROCI shows entire range of possible performanceI puts different tests on a common relevant scale

ABR

-10 0

DPOAE 65 at 2kHz

-20 0 20

normal

hearing impaired

Figure 4.3 Probability distributions of test results for the DPOAE and ABR tests among hearing impaired ears and normally hearing ears.

From The Statistical Evaluation of Medical Tests for Classification and Predictionby Margaret S. Pepe, Ph.D., Oxford University Press, 2003

Page 20: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

t

0 1R

OC

(t)

0

1

Figure 4.4 ROC curves for the DPOAE and

ABR tests.

DPOAE

ABR

I two tests have similar ability to distinguish betweenhearing-impaired and normal ears

From The Statistical Evaluation of Medical Tests for Classification and Predictionby Margaret S. Pepe, Ph.D., Oxford University Press, 2003

Page 21: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Choosing a Threshold

Formal decision theory:

Expected cost(c) = ρ(1− TPF (c))CD + (1− ρ)FPF (c)CN

CD is the cost of negatively classifying a diseased subjectCN is the cost of positively classifying a non-diseased subject

=⇒ cost minimized at the threshold c where the slope of theROC curve equals

1− ρ

ρ

CN

CD

I requires specifying costs CD and CN (tricky!)

Page 22: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Choosing a Threshold, cont’d

Common informal practice:I fix maximum tolerated FPFI eg must be very low (< 5%) for cancer screening testI f0 = FPF → threshold = 1− f0 quantile among controlsI or fix minimum tolerated TPFI eg must be very high in most diagnostic settingsI t0 = TPF → threshold = 1− t0 quantile among cases

Page 23: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Summary Measures of Classification Accuracy

I TPF = ROC(f0) at chosen FPF = f0I percent cases detected for fixed FPF

I FPF = ROC−1(t0) at chosen TPF = t0I FPF for fixed percent cases detected

I AUC =∫ 1

0 ROC(f )dfI probability of correctly ordering a randomly chosen case

and control observationI little clinical relevanceI summarizes TPF over entire FPF range

I partial AUC =∫ f0

0 ROC(f0)dfI restricted ROC region, but little clinical relevance

Page 24: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

t = FPF(c)

0 1

RO

C(t

) =

TP

F(c

)

0

1

ROC(t)

Page 25: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Example: Pancreatic Cancer Data

I marker sought for screening for pancreatic cancerI data on two markers: CA 19-9 and CA 125

FPF

0 1

TPF

0

1

0.2

CA 19-9

CA 125

From The Statistical Evaluation of Medical Tests for Classification and Predictionby Margaret S. Pepe, Ph.D., Oxford University Press, 2003

Page 26: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

AUC for CA 125 = 0.71AUC for CA 19-9 = 0.89p-value = 0.007=⇒ the probability of correct ordering is 18% higher with CA19-9

ROC(0.2) for CA 125 = 0.49ROC(0.2) for CA 19-9 = 0.78p-value = 0.04=⇒ CA 19-9 detects 29% more cancers with the same FPR =0.2

I conclusions about ROC(0.2) are more clinically importantthan those about AUC

Page 27: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Generalizing Predictive Values to ContinuousBiomarkers

I a relatively new area of research; not well developed

Page 28: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Evaluating Incremental Value

Page 29: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Incremental Value

I how much classification accuracy does the new markeradd to existing predictors?

I eg how much does CRP add to existing lipidmeasurements and risk factor information in discriminatingbetween those who will and will not develop CVD?

Page 30: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

How Best to Combine Markers?

I Y = (Y1, . . . , YP)

I the “best" combination is the risk score,R(Y ) = P(D = 1|Y1, . . . , YP) McIntosh and Pepe(Biometrics, 2000)

I “best" =⇒ No other combination of (Y1, . . . , YP) has a(FPF, TPF) point above its ROC curve

Page 31: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

To Combine Markers

I EstimateR(Y ) = P(D = 1|Y1, . . . , YP)

I using logistic regression, neural networks, classificationtrees, support vector machines, Bayesian modelling, . . . .

I logistic regression can be used with case-control data

I Calculate the ROC curve for R(Y ) (it’s just anothermarker!)

I avoid overoptimism due to fitting and evaluating model onsame data

I split into training and validation dataI or use cross-validation

Page 32: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Evaluating Incremental Value

I new marker Y ∗, baseline markers Y1, . . . , YP

I compare the ROC curves for

P(D = 1|Y1, . . . , YP)

andP(D = 1|Y1, . . . , YP , Y ∗)

I NOT quantified by β∗ in

g(P(D = 1|Y1, . . . , YP , Y ∗)) = β0+β1Y1+. . .+βPYP +β∗Y ∗

Page 33: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Pancreatic Cancer Example

I Y1 = log CA-19-9 Y2 = log CA-125I combination β1Y1 + β2Y2 from fitting

logitP(D = 1|Y1, Y2) = α + β1Y1 + β2Y2

exp(β2) = 2.54 (p = 0.002)

I Y2 strongly associated with D

Page 34: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

ROC(0.05) = 0.68 for CA 19-9ROC(0.05) = 0.71 for combination of CA 19-9 and CA 125

I extremely common phenomenon

Page 35: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Phases of Biomarker Development

Page 36: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

# Phase Objective Design1 Preclinical

Exploratorypromising directions identified,assess test reproducibility

diverse andconvenient casesand controls

2 ClinicalAssay andValidation

clinical assay detectsestablished disease, comparetest with standard of practice,assess covariate effects

population based,cases withdisease, controlswithout disease

3 RetrospectiveLongitudinal

biomarker detects disease earlybefore it becomes clinical (forscreening markers)

case-control studynested in alongitudinal cohort

4 ProspectiveScreening

extent and characteristics ofdisease detected by the testand the false referral rate areidentified

Cross-sectionalcohort of people

5 DiseaseControl

impact of screening on reducingthe burden of disease on thepopulation is quantified

randomized trial(ideally)

From: Pepe et al. Phases of biomarker development for early detection of cancer. JNCI 93(14):1054–61, 2001.

Page 37: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Study Design Issues

Page 38: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Matching in Case-Control Studies

I randomly sample casesI select controls matched to cases with respect to

confoundersI attempts to eliminate confoundingI eg Physicians’ Health Study

I evaluate PSA as a screening tool for prostate cancerI for each case select 3 controls within 1 years of age of the

caseI cases tend to be older, older subjects tend to have higher

PSA =⇒ age is confounderI matching on age attempts to correct for this

Page 39: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Implications of Matching

I must adjust for matching covariates in analysisI unadjusted analysis is biasedI more complicated analysis

I can’t assess incremental value of marker over matchingcovariates

I tends to increase efficiency

Page 40: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Selected Verification

I in prospective studies, may not be possible to obtain theoutcome (disease status) for all individuals

I too expensive (cost or resources)I not ethical (eg biopsy)

I often biomarker value determines whether disease statusis verified

I eg, in study of PSA and DRE for prostate cancer screening,biopsy recommended if PSA > 2.5 or DRE+

I selective sampling can lead to biased estimates ofaccuracy – “verification bias" or “work-up bias"

Page 41: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Implications of Selected Verification

When comparing two binary biomarkers in paired study:I those who test negative on both tests are not needed to

estimate relative TPF, FPF

When evaluating one binary biomarker:I naive TPF,FPF are biasedI there are methods for correcting for verification biasI all make untestable assumptions about the verification

mechanismI verification may depend on unmeasured factors!

I lead to decreased precision of estimated TPFI difficult to find settings with cost savings: reduction in

number verified offset by increased total sample sizeI avoid selected verification whenever possible

Page 42: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Advanced Topics

Page 43: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Covariate adjustment

I adjust for covariates that impact the marker distribution incontrols

I eg center effects in multicenter studiesI analogous to covariate adjustment in studies of associationI the accuracy of the marker in a population with fixed

covariate value

Page 44: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

ROC regression

I model covariate effects on biomarker accuracyI eg disease severityI fit regression model for ROC curve, as function of

covariates

Page 45: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Time-dependent ROC curves

I model biomarker accuracy as a function of time betweenmarker measurement and disease

I eg the accuracy of PSA may decline with increasing timelag between sample collection and disease

I define time-dependent versions of TPF,FPFI model accuracy as a function of time

Page 46: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Imperfect reference test

I account for lack of gold standard for DI eg questionnaire to diagnose depressionI various statistical approaches ... but is this a statistical

problem?

Page 47: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Software

On DABS Center website: http://www.fhcrc.org/labs/pepe/dabs

I Stata packages for ROC analysis and sample sizecalculations by Pepe et al.

I R programs for time-dependent ROC curves by PatrickHeagerty

Page 48: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

Websites

http://www.fhcrc.org/labs/pepe/dabs

DABS Center website. Contains datasets, software,references...

http://faculty.washington.edu/∼azhou/books/software.docLists some free and commercial computer programs. Alsoavailable through the Wiley website for Statistical Methods inDiagnostic Medicine by Zhou, Obuchowski and McClish, 2002.

http://xray.bsd.uchicago.edu/krl/roc_soft.htm

Charles Metz and colleagues at University of Chicago arepioneers in ROC analysis software. Developed with a focus onapplications in radiologic imaging.

Page 49: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

ReferencesStudy design

I Baker SG, Kramer BS, McIntosh M, Patterson BH, Shyr Y,Skates S. Evaluating markers for the early detection of cancer:overview of study designs and methods. Clinical Trials 3:43-56,2006.

I Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP,Irwig LM, Lijmer JG, Moher D, Rennie D, de Vet, HCW. Towardscomplete and accurate reporting of studies of diagnosticaccuracy: the STARD initiative. Annals of Internal Medicine138:40-44, 2003.

I Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP,Irwig LM, Moher D, Rennie D, de Vet, HCW, Lijmer JG. TheSTARD statement for reporting studies of diagnostic accuracy:explanation and elaboration. Annals of Internal Medicine138(1):W1-12, 2003.

I Janes H, Pepe MS. The optimal ratio of cases to controls forestimating the classification accuracy of a biomarker.Biostatistics 7(3):456-68, 2006.

Page 50: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Pepe MS, Etzioni R, Feng Z Potter JD, Thompson M, ThornquistM, Winget M and Yasui Y. Phases of biomarker development forearly detection of cancer. Journal of the National CancerInstitute 93(14):1054–61, 2001.

I Zhou, SH, McClish, DK, and Obuchowski, NA. StatisticalMethods in Diagnostic Medicine. Wiley Press, 2002.

Combining markers

I McIntosh MS and Pepe MS. Combining several screening tests:Optimality of the risk score. Biometrics 58:657–64, 2002.

I Pepe MS, Cai T, Longton G. Combining predictors forclassification using the area under the ROC curve. Biometrics62:221–229, 2006.

Covariate adjustment

I Janes H, Pepe MS. Matching in studies of classificationaccuracy: Implications for analysis, efficiency, and assessmentof incremental value. Biometrics 2008; 64: 1-9.

Page 51: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Janes H, Pepe MS. Adjusting for Covariate Effects onClassification Accuracy Using the Covariate-Adjusted ROCCurve. Biometrika (in press)

ROC regression

I Alonzo TA and Pepe MS. Distribution-free ROC analysis usingbinary regression techniques. Biostatistics 3:421–32, 2002.

I Cai T and Pepe MS. Semi-parametric ROC analysis to evaluatebiomarkers for disease. Journal of the American StatisticalAssociation 97: 1099–1107, 2002.

I Cai T. Semiparametric ROC regression analysis. Biostatistics5(1):45–60, 2004.

I Dodd L, Pepe MS. Partial AUC estimation and regression.Biometrics 59:614–623, 2003.

I Dodd L, Pepe MS. Semi-parametric regression for the areaunder the Receiver Operating Characteristic Curve. Journal ofthe American Statistical Association 98:409–417, 2003.

Page 52: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Heagerty PJ and Pepe MS. Semiparametric estimation ofregression quantiles with application to standardizing weight forheight and age in US children. Applied Statistics 48:533–51,1999.

Time-dependent ROCs

I Cai T, Pepe MS, Zheng Y, Lumley T, Jenny NS. The sensitivityand specificity of markers for event times Biostatistics7:182–197, 2006.

I Heagerty PJ, Zheng Y. Survival model predictive accuracy andROC curves. Biometrics 61:92–105, 2005.

I Zheng Y, Heagerty PJ. Semi-parametric estimation oftime-dependent ROC curves for longitudinal marker data.Biostatistics 5:615–632, 2004.

Imperfect reference test

I Albert PS, Dodd LE. A cautionary note on robustness of latentclass models for estimating diagnostic error without a goldstandard. Biometrics 60:427–35, 2004.

Page 53: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Albert PS, McShane LM, Shih JH, et al. Latent class modelingapproaches for assessing diagnostic error without a goldstandard. Biometrics 57:610–19, 2001.

I Alonzo TA and Pepe MS. Using a combination of reference teststo assess the accuracy of a new diagnostic test. Statistics inMedicine 18:2897-3003, 1999.

I Pepe MS, Janes H. Insights into latent class analysis.Biostatistics 8:474-84, 2007.

I Vacek PM. The effect of conditional dependence on theevaluation of diagnostic tests. Biometrics 41:959-68, 1985.

Verification bias

I Alonzo TA. Verification bias-corrected estimators of the relativetrue and false positive rates of two binary screening tests.Statistics in Medicine 24:403–417, 2005.

I Alonzo TA and Pepe MS. Assessing accuracy of a continuousscreening test in the presence of verification bias. Journal of theRoyal Statistical Society: Applied Statistics 54:173–190, 2005.

Page 54: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Alonzo TA, Braun TB, Moskowitz CS. Small sample estimation ofrelative accuracy for binary screening tests. Statistics inMedicine 23:21–34, 2004.

I Alonzo TA, Kittelson JC. A novel design for estimating relativeaccuracy of screening tests when complete disease verificationis not feasible. Biometrics DOI:10.1111/j.1541-0420.2005.00445.x (early online access).

I Begg CB and Greenes RA. Assessment of diagnostic testswhen disease verification is subject to selection bias. Biometrics39:207-15, 1983.

I Kosinski AS and Barnhart HX. Accounting for nonignorableverification bias in assessment of diagnostic tests. Biometrics59:163-71, 2003.

I Obuchowski NA, Zhou X. Prospective studies of diagnostic testaccuracy when disease prevalence is low. Biostatistics 3:477-92,2002.

I Pepe MS and Alonzo TA. Comparing disease screening testswhen true disease status is ascertained only for screenpositives. Biostatistics 2:1–12, 2001.

Page 55: Statistical Methods for Evaluating Biomarkerscourses.washington.edu/epi573/janes session 11-10-08.pdf · 10/8/2011  · From The Statistical Evaluation of Medical Tests for Classification

I Pepe MS and Alonzo TA. Reply to Letter to Editor regardingAlonzo TA and Pepe MS, Assessing the accuracy of a newdiagnostic test when a gold standard does not exist. Statistics inMedicine 20:656–660, 2001.

I Punglia et al. Effect of verification bias on screening for prostatecancer by measurement of PSA. NEJM 349:335-42, 2003.