receiver operating characteristic (roc) curve

Upload: nss

Post on 10-Oct-2015

4 views

Category:

Documents


1 download

TRANSCRIPT

  • INDIAN PEDIATRICS 277 VOLUME 48__APRIL 17, 2011

    Diagnostic tests play a vital role inmodern medicine not only forconfirming the presence of disease butalso to rule out the disease in individualpatient. Diagnostic tests with two outcomecategories such as a positive test (+) and negativetest () are known as dichotomous, whereas thosewith more than two categories such as positive,indeterminate and negative are called polytomoustests. The validity of a dichotomous test comparedwith the gold standard is determined by sensitivityand specificity. These two are components thatmeasure the inherent validity of a test.

    A test is called continuous when it yieldsnumeric values such as bilirubin level and nominalwhen it yields categories such as Mantoux test.Sensitivity and specificity can be calculated in bothcases but ROC curve is applicable only forcontinuous or ordinal test.

    When the response of a diagnostic test iscontinuous or on ordinal scale (minimum 5

    categories), sensitivity and specificity can becomputed across all possible threshold values.Sensitivity is inversely related with specificity in thesense that sensitivity increases as specificitydecreases across various threshold. The receiveroperating characteristic (ROC) curve is the plot thatdisplays the full picture of trade-off between thesensitivity and (1- specificity) across a series of cut-off points. Area under the ROC curve is considered asan effective measure of inherent validity of adiagnostic test. This curve is useful in (i) findingoptimal cut-off point to least misclassify diseased ornon-diseased subjects, (ii) evaluating the discri-minatory ability of a test to correctly pick diseasedand non-diseased subjects; (iii) comparing theefficacy of two or more tests for assessing the samedisease; and (iv) comparing two or more observersmeasuring the same test (inter-observer variability).

    INTRODUCTION

    This article provides simple introduction to measuresof validity: sensitivity, specificity, and area under

    Receiver Operating Characteristic (ROC) Curvefor Medical ResearchersRAJEEV KUMAR AND ABHAYA INDRAYAN

    From the Department of Biostatistics and Medical Informatics, University College of Medical Sciences, Delhi, India.Correspondence to:Mr Rajeev Kumar, Department of Biostatistics and Medical Informatics, University College of MedicalSciences, Delhi 110 095. [email protected]

    P E R S P E C T I V EP E R S P E C T I V EP E R S P E C T I V EP E R S P E C T I V EP E R S P E C T I V E

    Sensitivity and specificity are two components that measure the inherent validity of a diagnostic test for dichotomousoutcomes against a gold standard. Receiver operating characteristic (ROC) curve is the plot that depicts the trade-offbetween the sensitivity and (1-specificity) across a series of cut-off points when the diagnostic test is continuous or onordinal scale (minimum 5 categories). This is an effective method for assessing the performance of a diagnostic test. Theaim of this article is to provide basic conceptual framework and interpretation of ROC analysis to help medical researchersto use it effectively. ROC curve and its important components like area under the curve, sensitivity at specified specificityand vice versa, and partial area under the curve are discussed. Various other issues such as choice between parametricand non-parametric methods, biases that affect the performance of a diagnostic test, sample size for estimating thesensitivity, specificity, and area under ROC curve, and details of commonly used softwares in ROC analysis are alsopresented.

    Key words: Sensitivity, Specificity, Receiver operating characteristic curve, Sample size, Optimal cut-off point,Partial area under the curve.

  • INDIAN PEDIATRICS 278 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    ROC curve, with their meaning and interpretation.Some popular procedures to find optimal thresholdpoint, possible bias that can affect the ROC analysis,sample size required for estimating sensitivity,specificity and area under ROC curve, and finallycommonly used statistical softwares for ROCanalysis and their specifications are also discussed.

    PubMed search of pediatric journals reveals thatROC curve is extensively used for clinical decisions.For example, it was used for determining the validityof biomarkers such as serum creatine kinase muscle-brain fraction and lactate dehydrogenase (LDH) fordiagnosis of the perinatal asphyxia in symptomaticneonates delivered non-institutionally where areaunder the ROC curve for serum creatine kinasemuscle-brain fraction recorded at 8 hours was 0.82(95% CI 0.69-0.94) and cut-off point above 92.6 U/Lwas found best to classify the subjects. The areaunder ROC curve for LDH at 72 hours was 0.99 (95%CI 0.99-1.00) and cut-off point above 580 U/L wasfound optimal for classifying the perinatal asphyxiain symptomatic neonates [1]. It has also beensimilarly used for parameters such as mid-armcircumference at birth for detection of low birthweight [2], and first day total serum bilirubin value topredict the subsequent hyperbilirubinemia [3]. It isalso used for evaluating model accuracy andvalidation such as death and survival in children orneonates admitted in the PICU based on the childcharacteristics [4], and for comparing predictabilityof mortality in extreme preterm neonates by birth-weight with predictability by gestational age and withclinical risk index of babies score [5].

    SENSITIVITY AND SPECIFICITY

    Two popular indicators of inherent statistical validityof a medical test are the probabilities of detecting

    correct diagnosis by test among the true diseasedsubjects (D+) and true non-diseased subjects (D-).For dichotomous response, the results in terms of testpositive (T+) or test negative (T-) can be summarizedin a 22 contingency table (Table I). The columnsrepresent the dichotomous categories of truediseased status and rows represent the test results.True status is assessed by gold standard. Thisstandard may be another but more expensivediagnostic method or a combination of tests or maybe available from the clinical follow-up, surgicalverification, biopsy, autopsy, or by panel of experts.Sensitivity or true positive rate (TPR) is conditionalprobability of correctly identifying the diseased

    TPsubjects by test: SN = P(T+/D+) = ; andTP + FNspecificity or true negative rate (TNR) is conditionalprobability of correctly identifying the non-disease

    TNsubjects by test: SP= P(T-/D-) = . FalseTN +FPpositive rate (FPR) and false negative rate (FNR) arethe two other common terms, which are conditionalprobability of positive test in non-diseased subjects:

    FPP(T+/D-)= ; and conditional probability ofFP + TNFNnegative test in diseased subjects: P(T-/D+)= ,TP + FN

    respectively.

    Calculation of sensitivity and specificity ofvarious values of mid-arm circumference (cm) fordetecting low birth weight on the basis of ahypothetical data are given in Table II as anillustration. The same data have been used later todraw a ROC curve.

    ROC CURVE

    ROC curve is graphical display of sensitivity (TPR)on y-axis and (1 specificity) (FPR) on x-axis forvarying cut-off points of test values. This is

    TABLE I DIAGNOSTIC TEST RESULTS IN RELATION TO TRUE DISEASE STATUS IN A 22 TABLE

    Diagnostic test result Disease status Total

    Present Absent

    Present True positive (TP) False positive (FP) All test positive (T+)Absent False negative (FN) True negative (TN) All test negative (T-)Total Total with disease (D+) Total without disease (D-) Total sample size

  • INDIAN PEDIATRICS 279 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    generally depicted in a square box for convenienceand its both axes are from 0 to 1. Figure 1 depicts aROC curve and its important components asexplained later. The area under the curve (AUC) isan effective and combined measure of sensitivityand specificity for assessing inherent validity of adiagnostic test. Maximum AUC = 1 and it meansdiagnostic test is perfect in differentiating diseasedwith non-diseased subjects. This implies bothsensitivity and specificity are one and both errorsfalse positive and false negativeare zero. This canhappen when the distribution of diseased and non-diseased test values do not overlap. This isextremely unlikely to happen in practice. The AUCcloser to 1 indicates better performance of the test.

    The diagonal joining the point (0, 0) to (1,1)divides the square in two equal parts and each hasan area equal to 0.5. When ROC is this line, overallthere is 50-50 chances that test will correctly discri-minate the diseased and non-diseased subjects. Theminimum value of AUC should be considered 0.5instead of 0 because AUC = 0 means test incorrectlyclassified all subjects with disease as negative andall non-disease subjects as positive. If the test re-sults are reversed then area = 0 is transformed to area= 1; thus a perfectly inaccurate test can betransformed into a perfectly accurate test!

    ADVANTAGES OF THE ROC CURVE

    ROC curve has following advantages comparedwith single value of sensitivity and specificity at aparticular cut-off.

    1. The ROC curve displays all possible cut-offpoints, and one can read the optimal cut-off forcorrectly identifying diseased or non-diseasedsubjects as per the procedure given later.

    TABLE II HYPOTHETICAL DATA SHOWING THE SENSITIVITY AND SPECIFICITY AT VARIOUS CUT-OFF POINTS OF MID-ARMCIRCUMFERENCE TO DETECT LOW BIRTH WEIGHT

    Mid-arm cir- Low birthweight Normal birth weight Sensitivity Specificitycumference (cm) (

  • INDIAN PEDIATRICS 280 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    2. The ROC curve is independent of prevalence ofdisease since it is based on sensitivity andspecificity which are known to be independentof prevalence of disease [6-7].

    3. Two or more diagnostic tests can be visuallycompared simultaneously in one figure.

    4. Sometimes sensitivity is more important thanspecificity or vice versa, ROC curve helps infinding the required value of sensitivity at fixedvalue of specificity.

    5. Empirical area under the ROC curve (explainedlater) is invariant with respect to the addition orsubtraction of a constant or transformation likelog or square root [8]. Log or square roottransformation condition is not applicable forbinormal ROC curve. Binormal ROC is alsoshortly explained.

    6. Useful summary of measures can be obtainedfor determining the validity of diagnostic testsuch as AUC and partial area under the curve.

    NON-PARAMETRIC AND PARAMETRIC METHODS TOOBTAIN AREA UNDER THE ROC CURVE

    Statistical softwares provide non-parametric andparametric methods for obtaining the area underROC curve. The user has to make a choice. Thefollowing details may help.

    Non-parametric Approach

    This does not require any distribution pattern of testvalues and the resulting area under the ROC curve iscalled empirical. First such method uses trapezoidalrule. It calculates the area by just joining the points(1-SP,SM) at each interval of the observed values ofcontinuous test and draws a straight line joining thex-axis. This forms several trapezoids (Fig 2) andtheir area can be easily calculated and summed.Figure 2 is drawn for the mid-arm circumferenceand low birth weight data in Table II. Another non-parametric method uses Mann-Whitney statistics,also known as Wilcoxon rank-sum statistic and thec-index for calculating area. Both these methods ofestimating AUC estimate have been foundequivalent [7].

    Standard errors (SE) are needed to construct aconfidence interval. Three methods have beensuggested for estimating the SE of empirical areaunder ROC curve [7,9-10]. These have been foundsimilar when sample size is greater than 30 in eachgroup provided test value is on continuous scale[11]. For small sample size it is difficult torecommend any one method. For discrete ordinaloutcome, Bamber method [9] and Delong method[10] give equally good results and better thanHanley and McNeil method [7].

    Parametric Methods

    These are used when the statistical distribution ofdiagnostic test values in diseased and non-diseasedis known. Binormal distribution is commonly usedfor this purpose. This is applicable when test valuesin both diseased and non-diseased subjects follownormal distribution. If data are actually binormal ora transformation such as log, square or Box-Cox[12] makes the data binormal then the relevantparameters can be easily estimated by means andvariances of test values in diseased and non-diseased subjects. Details are available elsewhere[13].

    Another parametric approach is to transform thetest results into an unknown monotone form whenboth the diseased and non-diseased populations

    Se

    ns

    it

    iv

    it

    y

    FIG. 2 Comparison of empirical and binormal ROC curvesfor hypothetical neonatal data in Table II.

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.91 - Specificity

  • INDIAN PEDIATRICS 281 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    follow binormal distribution [14]. This firstdiscretizes the continuous data into a maximum 20categories, then uses maximum likelihood methodto estimate the parameters of the binormaldistribution and calculates the AUC and standarderror of AUC. ROCKIT package containingROCFIT method uses this approach to draw theROC curve, to estimate the AUC, for comparisonbetween two tests, and to calculate partial area [15].

    The choice of method to calculate AUC forcontinuous test values essentially depends uponavailability of statistical software. Binormal methodand ROCFIT method produce results similar to non-parametric method when distribution is binormal[16]. In unimodal skewed distribution situation,Box-Cox transformation that makes test valuebinormal and ROCFIT method perform give resultssimilar to non-parametric method but former twoapproaches have additional useful property forproviding smooth curve [16,17]. When software forboth parametric and non-parametric methods isavailable, conclusion should be based on the methodwhich yields greater precision to estimate the AUC.However, for bimodal distribution (having twopeaks), which is rarely found in medical practice,Mann-Whitney gives more accurate estimatescompare to parametric methods [16]. Parametricmethod gives small bias for discrete test valuecompared to non-parametric method [13].

    The area under the curve by trapezoidal rule andMann-Whitney U are 0.9142 and 0.9144,respectively, of mid-arm circumference forindicating low birth weight in our data. The SE alsois nearly equal by three methods in these data:Delong SE = 0.0128, Bamber SE = 0.0128, andHanley and McNeil SE = 0.0130. For parametricmethod, smooth ROC curve was obtained assumingbinormal assumption (Fig 2) and the area under thecurve is calculated by using means and standarddeviations of mid-arm circumference in normal andlow birth weight neonates which is 0.9427 and itsSE is 0.0148 in this example. Binormal methodshowed higher area compared to area by non-parametric method which might be due to violationof binormal assumption in this case. Binormal ROCcurve is initially above the empirical curve (Fig 2)suggesting higher sensitivity compared to empirical

    values in this range. When (1specificity) liesbetween 0.1 to 0.2, the binormal curve is below theempirical curve, suggesting comparatively lowsensitivity compared to empirical values. Whenvalues of (1specificity) are greater than 0.2, thecurves are almost overlapping suggesting bothmethods giving the similar sensitivity. The AUC byusing ROCFIT methods is 0.9161 and it standarderror is 0.0100. This AUC is similar to the non-parametric method; however standard error is littleless compared to standard error by non-parametricmethod. The data in our example has unimodalskewed distribution and results agree with previoussimulation study [16, 17] on such data. Allcalculations were done using MS Excel and STATAstatistical software for this example.

    Interpretation of ROC Curve

    Total area under ROC curve is a single index formeasuring the performance a test. The larger theAUC, the better is overall performance ofdiagnostic test to correctly pick up diseased andnon-diseased subjects. Equal AUCs of two testsrepresents similar overall performance of medicaltests but this does not necessarily mean that both thecurves are identical. They may cross each other.Three common interpretations of area under theROC curve are: (i) the average value of sensitivityfor all possible values of specificity, (ii) the averagevalue of specificity for all possible values ofsensitivity [13]; and (iii) the probability that arandomly selected patient with disease has positivetest result that indicates greater suspicion than arandomly selected patient without disease [10]when higher values of the test are associated withdisease and lower values are associated with non-disease. This interpretation is based on non-parametric Mann-Whitney U statistic forcalculating the AUC.

    Figure 3 depicts three different ROC curves.Considering the area under the curve, test A is betterthan both B and C, and the curve is closer to theperfect discrimination. Test B has good validity andtest C has moderate.

    Hypothetical ROC curves of three diagnostictests A, B, and C applied on the same subjects to

  • INDIAN PEDIATRICS 282 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    classify the same disease are shown in Fig 4. Test B(AUC=0.686) and C (AUC=0.679) have nearlyequal area but cross each other whereas test A(AUC=0.805) has higher AUC value than curves Band C. The overall performance of test A is betterthan test B as well as test C at all the thresholdpoints. Test C performed better than test B wherehigh sensitivity is required, and test B performedbetter than C when high specificity is needed.

    Sensitivity at Fixed Point and Partial Area Underthe ROC Curve

    The choice of fixed specificity or range ofspecificity depends upon clinical setting. Forexample, to diagnose serious disease such as cancerin a high risk group, test that has higher sensitivity ispreferred even if the false positive rate is highbecause test giving more false negative subjects ismore dangerous. On the other hand, in screening alow risk group, high specificity is required for thediagnostic test for which subsequent confirmatorytest is invasive and costly so that false positive rateshould be low and patient does not unnecessarilysuffers pain and pays price. The cut-off point shouldbe decided accordingly. Sensitivity at fixed point ofspecificity or vice versa and partial area under the

    curve are more suitable in determining the validityof a diagnostic test in above mentioned clinicalsetting and also for the comparison of twodiagnostic tests when applied to same orindependent patients when ROC curves cross eachother.

    Partial area is defined as area between range offalse positive rate (FPR) or between the twosensitivities. Both parametric (binormalassumption) and non-parametric methods areavailable in the literature [13,18] but most statisticalsoftwares do not have option to calculate the partialarea under ROC curve (Table III). STATAsoftware has all the features for ROC analysis.

    Standardization of the partial area by dividing itwith maximum possible area (equal to the width ofthe interval of selected ranged of FPRs orsensitivities) has been recommended [19]. It can beinterpreted as average sensitivity for the range ofselected specificities. This standardization makespartial area more interpretable and its maximumvalue will be 1. Figure 5 shows that the partial areaunder the curve for FPR from 0.3 to 0.5 for test A is0.132, whereas for test B is 0.140. Afterstandardization, they it would be 0.660 and 0.700,

    FIG. 3 Comparison of three smooth ROC curves withdifferent areas.

    FIG. 4 Three empirical ROC curves. Curves for B and Ccross each other but have nearly equal areas, curveA has bigger area.

  • INDIAN PEDIATRICS 283 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    TA

    BLE II

    I SO

    ME P

    OPU

    LAR

    RO

    C A

    NA

    LYSI

    S SO

    FTW

    AR

    ES A

    ND

    TH

    EIR IM

    PORT

    AN

    T FEA

    TUR

    ES

    Nam

    e of R

    OC

    anal

    ysis

    softw

    are

    Met

    hods

    use

    d to

    Com

    paris

    on o

    f tw

    o or

    mor

    ePa

    rtial

    area

    Impo

    rtant

    by-

    prod

    ucts

    estim

    ate A

    UC

    of

    RO

    C cu

    rves

    RO

    C cu

    rve a

    nd it

    sva

    rianc

    eC

    alcu

    latio

    nC

    ompa

    rison

    Med

    calc

    sof

    twar

    e ver

    sion

    11.3

    ;N

    on-p

    aram

    etric

    Avai

    labl

    eN

    ot av

    aila

    ble

    Not

    avai

    labl

    eSe

    nsiti

    vity

    , spe

    cific

    ity,

    Com

    mer

    cial

    softw

    are t

    rial v

    ersi

    onLR

    +, L

    R- w

    ith 9

    5% C

    Iav

    aila

    ble a

    t: w

    ww.

    med

    calc

    .be

    at ea

    ch p

    ossi

    ble c

    ut-p

    oint

    SPSS

    ver

    sion

    17.

    0;N

    on-p

    aram

    etric

    Not

    avai

    labl

    eN

    ot av

    aila

    ble

    Not

    avai

    labl

    eSe

    nsiti

    vity

    and

    Com

    mer

    cial

    softw

    are

    spec

    ifici

    ty at

    each

    cut-o

    ff po

    int -

    no

    95%

    Cl

    STAT

    A v

    ersi

    on 11

    ;N

    on-p

    aram

    etric

    Avai

    labl

    e for

    pai

    red

    and

    unpa

    i-Av

    aila

    ble

    spec

    ifici

    tyO

    nly

    two

    parti

    alSe

    nsiti

    vity

    , spe

    cific

    ity,

    Com

    mer

    cial

    softw

    are

    Para

    met

    ricre

    d su

    bjec

    ts w

    ith B

    onfe

    rron

    iat

    spec

    ified

    rang

    e and

    AU

    Cs c

    an b

    eLR

    +, L

    R-

    at ea

    ch cu

    t-off

    (Met

    z et a

    l.)ad

    just

    men

    t whe

    n 3

    or m

    ore

    sens

    itivi

    ty ra

    nge

    com

    pare

    dpo

    int b

    ut n

    o 95

    % C

    Icu

    rves

    to b

    e com

    pare

    dR

    OC

    KIT

    (Bet

    a ver

    sion

    )Pa

    ram

    etric

    Avai

    labl

    e for

    bot

    h pa

    ired

    Avai

    labl

    eAv

    aila

    ble

    Sens

    itivi

    ty an

    dFr

    ee so

    ftwar

    e; A

    vaila

    ble a

    t(M

    etz e

    t al.)

    and

    unpa

    ired

    subj

    ects

    spec

    ifici

    ty at

    each

    cut-

    ww

    w.ra

    diol

    ogy.u

    chic

    ago.

    edu

    poin

    tA

    naly

    se -

    itCom

    mer

    cial

    softw

    are;

    Non

    -par

    amet

    ricAv

    aila

    ble

    no

    optio

    n fo

    rN

    ot av

    aila

    ble

    Not

    avai

    labl

    eSe

    nsiti

    vity

    , spe

    cific

    ity,

    add

    on in

    the M

    S Ex

    cel;

    trial

    ver

    sion

    paire

    d an

    d un

    paire

    dLR

    +, L

    R- w

    ith 9

    5% C

    I of

    avai

    labl

    e at w

    ww.

    anal

    yse-

    it.co

    msu

    bjec

    tsea

    ch p

    ossi

    ble c

    ut-p

    oint

    ;Va

    rious

    dec

    isio

    n pl

    ots

    XLs

    tat2

    010

    Com

    mer

    cial

    softw

    are;

    Non

    -par

    amet

    ricAv

    aila

    ble f

    or b

    oth

    paire

    dN

    ot av

    aila

    ble

    Not

    avai

    labl

    eSe

    nsiti

    vity

    , spe

    cific

    ity,

    add

    on in

    the M

    S Ex

    cel;

    trial

    ver

    -an

    d un

    paire

    d su

    bjec

    tsLR

    +, L

    R- w

    ith 9

    5% C

    I of

    sion

    avai

    labl

    e at:

    ww

    w.xl

    stat

    .com

    each

    pos

    sibl

    e cut

    -Var

    ious

    - de

    cisi

    on p

    lots

    Sigm

    a plo

    t Com

    mer

    cial

    softw

    are;

    Non

    -par

    amet

    ricAv

    aila

    ble f

    or b

    oth

    paire

    dad

    d on

    in th

    e MS

    Exce

    l; tri

    al v

    er-

    and

    unpa

    ired

    subj

    ects

    Not

    avai

    labl

    eN

    ot av

    aila

    ble

    Sens

    itivi

    ty, s

    peci

    ficity

    ,av

    aila

    ble a

    t:LR

    +, L

    R- w

    ith 9

    5% C

    I of

    ww

    w.si

    gmap

    lot.c

    omea

    ch p

    ossi

    ble c

    ut-p

    oint

    :Va

    rious

    - dec

    isio

    n pl

    ots

    *LR+

    =Po

    sitiv

    e lik

    elih

    ood

    ratio

    , LR-

    = N

    egat

    ive l

    ikel

    ihoo

    d ra

    tio, P

    aire

    d su

    bjec

    ts m

    eans

    bot

    h di

    agno

    stic t

    ests

    (test

    and

    gold

    ) ap

    plie

    d to

    sam

    e sub

    ject

    s and

    unp

    aire

    d su

    bjec

    ts m

    eans

    dia

    gnos

    ticte

    sts a

    pplie

    d to

    diff

    eren

    t sub

    ject

    s, C

    I = C

    onfid

    ence

    inte

    rval

    , AU

    C=

    Area

    und

    er cu

    rve,

    RO

    C=

    Rece

    iver

    ope

    ratin

    g ch

    arac

    teris

    tic.

  • INDIAN PEDIATRICS 284 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    respectively. The portion of partial area will dependon the range of interest of FPRs selected byresearcher. It may lie on one side of intersectingpoint or may be on both sides of intersecting pointof ROC curves. In Figure 5, SN(A) and SN(B) aresensitivities at specific value of FPR. For example,sensitivity at FPR=0.3 is 0.545 for test A and 0.659for test B. Similarly sensitivity at fixed FPR=0.5 fortest A is 0.76 and 0.72 for test B. All thesecalculations were done by STATA (version 11)statistical software using comproc command withoption pcvmeth(empirical).

    METHOD TO FIND THE OPTIMAL THRESHOLDPOINT

    Optimal threshold is the point that gives maximumcorrect classification. Three criteria are used to findoptimal threshold point from ROC curve. First twomethods give equal weight to sensitivity andspecificity and impose no ethical, cost, and noprevalence constraints. The third criterionconsiders cost which mainly includes financial costfor correct and false diagnosis, cost of discomfort toperson caused by treatment, and cost of furtherinvestigation when needed. This method is rarelyused in medical literature because it is difficult toimplement. These three criteria are known as pointson curve closest to the (0, 1), Youden index, andminimize cost criterion, respectively.

    The distance between the point (0, 1) and anypoint on the ROC curve is d2 =[(1SN)

    2 + (1 Sp)2].

    To obtain the optimal cut-off point to discriminatethe disease with non-disease subject, calculate thisdistance for each observed cut-off point, and locatethe point where the distance is minimum. Most ofthe ROC analysis softwares (Table III) calculatethe sensitivity and specificity at all the observedcut-off points allowing you to do this exercise.

    The second is Youden index [20] that maximizesthe vertical distance from line of equality to thepoint [x, y] as shown in Fig 1. The x-axis represents(1- specificity) and y-axis represents sensitivity. Inother words, the Youden index J is the point on theROC curve which is farthest from line of equality(diagonal line). The main aim of Youden index is tomaximize the difference between TPR (SN) and

    FPR (1 SP) and little algebra yields J =max[SN+SP]. The value of J for continuous test canbe located by doing a search of plausible valueswhere sum of sensitivity and specificity can bemaximum. Youden index is more commonly usedcriterion because this index reflects the intension tomaximize the correct classification rate and is easyto calculate. Many authors advocate this criterion[21]. Third method that considers cost is rarely usedin medical literature and is described in [13].

    BIASES THAT CAN AFFECT ROC CURVE RESULTS

    We describe more prevalent biases in this sectionthat affect the sensitivity, specificity and conse-quently may affect the area under the ROC curve.Interested researcher can find detailed descriptionof these and other biases such as withdrawal bias,lost to follow-up bias, spectrum bias, andpopulation bias, elsewhere [22,23].

    1. Gold standard: Validity of gold standard isimportantideally it should be error free and thediagnostic test under review should beindependent of the gold standard as this canincrease the area under the curve spuriously. Thegold standard can be clinical follow-up, surgicalverification, biopsy or autopsy or in some cases

    FIG. 5 The partial area under the curve and sensitivity atfixed point of specificity (see text).

  • INDIAN PEDIATRICS 285 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    opinion of panel of experts. When gold standard isimperfect, such as peripheral smear for malariaparasites [24], sensitivity and specificity of thetest are under estimated [22].

    2. Verification bias: This occurs when all diseasesubjects do not receive the same gold standard forsome reason such as economic constraints andclinical considerations. For example, inevaluating the breast bone density as screeningtest for diagnosis of breast cancer and only thosewomen who have higher value of breast bonedensity are referred for biopsy, and those withlower value but suspected are followed clinically.In this case, verification bias would overestimatethe sensitivity of breast bone density test.

    3. Selection bias: Selection of right patients with andwithout diseased is important because some testsproduce prefect results in severely diseased groupbut fail to detect mild disease.

    4. Test review bias: The clinician should be blind tothe actual diagnosis while evaluating a test. Aknown positive disease subject or known non-disease subject may influence the test result.

    5. Inter-observer bias: In the studies where observerabilities are important in diagnosis, such as forbone density assessment through MRI,experienced radiologist and junior radiologistmay differ. If both are used in the same study, theobserver bias is apparent.

    6. Co-morbidity bias: Sometimes patients haveother types of known or unknown diseases whichmay affect the positivity or negativity of test. Forexample, NESTROFT (Naked eye single tube redcell osmotic fragility test), used for screening ofthalassaemia in children, shows good sensitivityin patients without any other hemoglobindisorders but also produces positive results whenother hemoglobin disorders are present [25].

    7. Uninterpretable test results: This bias occurswhen test provides results which can not beinterpreted and clinician excludes these subjectsfrom the analysis. This results in over estimationof validity of the test.

    It is difficult to rule out all the biases butresearcher should be aware and try to minimizethem.

    TABLE IV SAMPLE SIZE FORMULA FOR ESTIMATING SENSITIVITY AND SPECIFICITY AND AREA UNDER THE ROC CURVE

    Problem Formula Description of symbol used

    Estimating the sensitivity of test SN = Anticipated sensitivityPrev = Prevalence of disease in population canbe obtained from previous literature or pilotstudy = required absolute precision on either side ofthe sensitivity

    Estimating the specificity of test SN = Anticipated specificityPrev = Prevalence of disease in population canbe obtained from previous literature or pilotstudy = required absolute precision on either side ofthe specificity.

    Estimating the area under the V(AUC) = Anticipated variance anticipatedROC curve area under ROC curve

    nD = number of diseased subjects = required absolute precision on either side ofn = nD(1+k), k is ratio of prevalence the area under the curve.of non-disease to disease subjects

    Z1/2 is a standard normal value and is the confidence level. Z1/2= 1.645 for =0.10 and Z1/2= 1.96 for =0.05.

    Z21/2SN (1SN)

    2 Prev

    Z21/2Sp (1Sp)

    2 (1Prev)

    Z2/2V (AUC)nD=

    2

  • INDIAN PEDIATRICS 286 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    SAMPLE SIZE

    Adequate power of the study depends upon thesample size. Power is probability that a statisticaltest will indicate significant difference wherecertain pre-specified difference is actually present.In a survey of eight leading journals, only two out of43 studies reported a prior calculation of samplesize in diagnostic studies [26]. In estimation set-up,adequate sample size ensures the study will yieldthe estimate with desired precision. Small samplesize produces imprecise or inaccurate estimate,while large sample size is wastage of resourcesespecially when a test is expensive. The sample sizeformula depends upon whether interest is inestimation or in testing of the hypothesis. Table IVprovides the required formula for estimation ofsensitivity, specificity and AUC. These are based onthe normal distribution or asymptotic assumption(large sample theory) which is generally used forsample size calculation.

    Variance of AUC, required in formula 3 (TableIV), can be obtained by using either parametric ornon-parametric method. This may also be availablein literature on previous studies. If no previousstudy is available, a pilot study is done to get someworkable estimates to calculate sample size. Forpilot study data, appropriate statistical software canprovide estimate of this variance.

    Formulas of sample size to test hypothesis onsensitivity-specificity or the AUC with a pre-specified value and for comparison on the samesubjects or different subjects are complex. Refer[13] for details. A Nomogram was devised to readthe sample size for anticipated sensitivity and

    specificity at 90%, 95%, 99% confidence level [27].

    There are many more topics for interested readerto explore such as combining the multiple ROCcurve for meta-analysis, ROC analysis to predictmore than one alternative, ROC analysis in theclustered environment, and for tests for repeatedover the time. For these see [13,28]. For predictivitybased ROC, see [6].

    Contributors: Both authors contributed to concept, review ofliterature, drafting the paper and approved the final version.Funding: None.Competing interests: None stated.

    REFERENCES

    1. Reddy S, Dutta S, Narang A. Evaluation of lactatedehydrogenase, creatine kinase and hepatic enzymesfor the retrospective diagnosis of perinatal asphyxiaamong sick neonates. Indian Pediatr. 2008;45:144-7.

    2. Sood SL, Saiprasad GS, Wilson CG. Mid-arm circum-ference at birth: A screening method for detection oflow birth weight. Indian Pediatr. 2002;39:838-42.

    3. Randev S, Grover N. Predicting neonatalhyperbilirubinemia using first day serum bilirubinlevels. Indian J Pediatr. 2010; 77:147-50.

    4. Vasudevan A, Malhotra A, Lodha R, Kabra SK. Profileof neonates admitted in pediatric ICU and validation ofscore for neonatal acute physiology (SNAP). IndianPediatr. 2006;43:344-8.

    5. Khanna R, Taneja V, Singh SK, Kumar N, Sreenivas V,Puliyel JM. The clinical risk index of babies (CRIB)score in India. Indian J Pediatr. 2002;69:957-60.

    6. Indrayan A. Medical Biostatistics (Second Edition).Boca Raton: Chapman & Hall/ CRC Press; 2008. p.263-7.

    7. Hanley JA, McNeil BJ. The meaning and use of the areaunder a receiver operating characteristic (ROC) curve.Radiology. 1982;143:29-36.

    8. Campbell G. Advances in statistical methodology forevaluation of diagnostic and laboratory tests. Stat Med.1994;13:499-508.

    KEY MESSAGES For sensitivity and specificity to be valid indicators, the gold standard should be nearly perfect.

    For a continuous test, optimal cut-off point to correctly pick-up a disease and non-diseased cases is the pointwhere sum of specificity and sensitivity is maximum, when equal weight is given to both.

    ROC curve is obtained for a test on continuous scale. Area under the ROC curves is an index of overall inherentvalidity of test as well as used for comparing sensitivity and specificity at particular cut-offs of interest.

    Area under the curve is not an accurate index of comparing two tests when their ROC curves cross each other.

    Adequate sample size and proper designing is required to yield valid and unbiased results.

  • INDIAN PEDIATRICS 287 VOLUME 48__APRIL 17, 2011

    KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE

    9. Bamber D. The area above the ordinal dominance graphand area below the receiver operating characteristicgraph. J Math Psychol. 1975;12:387-415.

    10. DeLong ER, DeLong DM, Clarke-Pearson DL.Comparing the area under two or more correlatedreceiver operating characteristic curves: anonparametric approach. Biometrics. 1988;44:837-45.

    11. Cleves MA. Comparative assessment of three commonalgorithms for estimating the variance of the area underthe nonparametric receiver operating characteristiccurve. Stata J. 2002;3:280-9.

    12. Box GEP, Cox DR. An analysis of transformation. JRoyal Statistical Society, Series B. 1964;26:211-52.

    13. Zhou Xh, Obuchowski NA, McClish DK. StatisticalMethods in Diagnostic Medicine. New York: JohnWiley and Sons, Inc; 2002.

    14. Metz CE, Herman BA, Shen JH. Maximum likelihoodestimation of receiver operating characteristic (ROC)curve from continuously distributed data. Stat Med.1998;17:1033-53.

    15. ROCKIT [Computer program]. Chicago: University ofChicago. Available from: www-radiology.uchicago.edu/krl/KRL_ROC/software_index6.htm. Accessed onFebruary 27, 2007.

    16. Faraggi D, Reiser B. Estimating of area under the ROCcurve. Stat Med. 2002; 21:3093-3106.

    17. Hajian Tilaki KO, Hanley JA, Joseph L, Collet JP. Acomparison of parametric and nonparametricapproaches to ROC analysis of quantitative diagnosistests. Med Decis Making. 1997;17:94-102.

    18. Pepe M, Longton G, Janes H. Estimation andcomparison of receiver operating characteristic curves.Stata J. 2009;9:1.

    19. Jiang Y, Metz CE, Nishikawa RM. A receiver operating

    characteristic partial area index for highly sensitivediagnostic test. Radiology. 1996;201:745-50.

    20. Youden WJ. An index for rating diagnostic test.Cancer. 1950;3:32-5.

    21. Perkins NJ, Schisterman EF. The inconsistency ofoptimal cut points obtained using two criteria basedon the receiver operating characteristic curve. Am JEpidemiol. 2006;163:670-5.

    22. Whiting P, Ruljes AW, Reitsma JB, Glas AS, BossuytPM, Kleijnen J. Sources of variation and bias in studiesof diagnostic accuracy a systematic review. AnnIntern Med. 2004;140:189-202.

    23. Kelly S, Berry E, Proderick P, Harris KM,Cullingworth J, Gathercale L, et al. The identificationof bias in studies of the diagnostic performance ofimaging modalities. Br J Radiol. 1997;70:1028-35.

    24. Malaria Site. Peripheral smear study for malariaparasites Available from: www.malariasite.com/malaria/DiagnosisOfMalaria.htm. Accessed on July05, 2010.

    25. Thomas S, Srivastava A, Jeyaseelan L, Dennison D,Chandy M. NESTROFT as screening test for thedetection of thalassaemia & common haemoglobino-pathies an evaluation against a high performanceliquid chromatographic method. Indian J Med Res.1996;104:194-7.

    26. Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM.Sample sizes of studies on diagnostic accuracy:literature survey. BMJ. 2006;332:1127-9.

    27. Malhotra RK, Indrayan A. A simple nomogram forsample size for estimating the sensitivity and specificityof medical tests. Indion J Ophthalmol.2010;58:519-22.

    28. Kester AD, Buntinx F. Meta analysis of curves. MedDecis Making. 2000;20:430-9.

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /Warning /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 215 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.04651 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages true /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 600 /MonoImageMinResolutionPolicy /Warning /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName () /PDFXTrapped /False

    /CreateJDFFile false /Description >>> setdistillerparams> setpagedevice