INDIAN PEDIATRICS 277 VOLUME 48__APRIL 17, 2011
Diagnostic tests play a vital role in modern medicine, not only for confirming the presence of disease but also for ruling out disease in an individual patient. Diagnostic tests with two outcome categories, such as a positive test (+) and a negative test (−), are known as dichotomous, whereas those with more than two categories, such as positive, indeterminate, and negative, are called polytomous tests. The validity of a dichotomous test compared with the gold standard is determined by sensitivity and specificity. These two components measure the inherent validity of a test.
A test is called continuous when it yields numeric values, such as bilirubin level, and nominal when it yields categories, such as the Mantoux test. Sensitivity and specificity can be calculated in both cases, but the ROC curve is applicable only to continuous or ordinal tests.
When the response of a diagnostic test is continuous or on an ordinal scale (minimum 5 categories), sensitivity and specificity can be computed across all possible threshold values. Sensitivity is inversely related to specificity, in the sense that sensitivity increases as specificity decreases across the various thresholds. The receiver operating characteristic (ROC) curve is the plot that displays the full picture of the trade-off between sensitivity and (1 − specificity) across a series of cut-off points. The area under the ROC curve is considered an effective measure of the inherent validity of a diagnostic test. This curve is useful for (i) finding the optimal cut-off point that least misclassifies diseased or non-diseased subjects; (ii) evaluating the discriminatory ability of a test to correctly pick diseased and non-diseased subjects; (iii) comparing the efficacy of two or more tests for assessing the same disease; and (iv) comparing two or more observers measuring the same test (inter-observer variability).
Receiver Operating Characteristic (ROC) Curve for Medical Researchers

RAJEEV KUMAR AND ABHAYA INDRAYAN

From the Department of Biostatistics and Medical Informatics, University College of Medical Sciences, Delhi, India. Correspondence to: Mr Rajeev Kumar, Department of Biostatistics and Medical Informatics, University College of Medical Sciences, Delhi 110 095. [email protected]

Sensitivity and specificity are two components that measure the inherent validity of a diagnostic test for dichotomous outcomes against a gold standard. The receiver operating characteristic (ROC) curve is the plot that depicts the trade-off between sensitivity and (1 − specificity) across a series of cut-off points when the diagnostic test is continuous or on an ordinal scale (minimum 5 categories). This is an effective method for assessing the performance of a diagnostic test. The aim of this article is to provide a basic conceptual framework and interpretation of ROC analysis to help medical researchers use it effectively. The ROC curve and its important components, such as the area under the curve, sensitivity at specified specificity and vice versa, and the partial area under the curve, are discussed. Various other issues, such as the choice between parametric and non-parametric methods, biases that affect the performance of a diagnostic test, sample size for estimating sensitivity, specificity, and area under the ROC curve, and details of commonly used software for ROC analysis, are also presented.

Key words: Sensitivity, Specificity, Receiver operating characteristic curve, Sample size, Optimal cut-off point, Partial area under the curve.

INTRODUCTION

This article provides a simple introduction to measures of validity: sensitivity, specificity, and area under the ROC curve, with their meaning and interpretation. Some popular procedures to find the optimal threshold point, possible biases that can affect ROC analysis, the sample size required for estimating sensitivity, specificity, and area under the ROC curve, and commonly used statistical software for ROC analysis and their specifications are also discussed.
A PubMed search of pediatric journals reveals that the ROC curve is extensively used for clinical decisions. For example, it was used to determine the validity of biomarkers such as serum creatine kinase muscle-brain fraction and lactate dehydrogenase (LDH) for diagnosis of perinatal asphyxia in symptomatic neonates delivered non-institutionally: the area under the ROC curve for serum creatine kinase muscle-brain fraction recorded at 8 hours was 0.82 (95% CI 0.69-0.94) and a cut-off point above 92.6 U/L was found best for classifying the subjects, while the area under the ROC curve for LDH at 72 hours was 0.99 (95% CI 0.99-1.00) and a cut-off point above 580 U/L was found optimal for classifying perinatal asphyxia in symptomatic neonates [1]. It has similarly been used for parameters such as mid-arm circumference at birth for detection of low birth weight [2], and first-day total serum bilirubin value to predict subsequent hyperbilirubinemia [3]. It is also used for evaluating model accuracy and validation, such as predicting death and survival in children or neonates admitted to the PICU based on child characteristics [4], and for comparing the predictability of mortality in extremely preterm neonates by birth weight with predictability by gestational age and by the clinical risk index for babies score [5].
SENSITIVITY AND SPECIFICITY
Two popular indicators of the inherent statistical validity of a medical test are the probabilities of a correct diagnosis by the test among the true diseased subjects (D+) and the true non-diseased subjects (D−). For a dichotomous response, the results in terms of test positive (T+) or test negative (T−) can be summarized in a 2×2 contingency table (Table I). The columns represent the dichotomous categories of true disease status and the rows represent the test results. True status is assessed by a gold standard. This standard may be another, more expensive diagnostic method, a combination of tests, or may be available from clinical follow-up, surgical verification, biopsy, autopsy, or a panel of experts.

Sensitivity or true positive rate (TPR) is the conditional probability of correctly identifying the diseased subjects by the test: SN = P(T+|D+) = TP/(TP + FN); and specificity or true negative rate (TNR) is the conditional probability of correctly identifying the non-diseased subjects by the test: SP = P(T−|D−) = TN/(TN + FP). False positive rate (FPR) and false negative rate (FNR) are the two other common terms: the conditional probability of a positive test in non-diseased subjects, P(T+|D−) = FP/(FP + TN), and the conditional probability of a negative test in diseased subjects, P(T−|D+) = FN/(TP + FN), respectively.
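As a minimal sketch of these definitions (the counts below are hypothetical, not taken from the article), the four rates can be computed directly from the cells of the 2×2 table:

```python
# Hypothetical 2x2 counts for illustration (not the article's data)
TP, FP, FN, TN = 45, 10, 5, 40

sensitivity = TP / (TP + FN)   # P(T+|D+), true positive rate
specificity = TN / (TN + FP)   # P(T-|D-), true negative rate
fpr = FP / (FP + TN)           # P(T+|D-), false positive rate
fnr = FN / (TP + FN)           # P(T-|D+), false negative rate

print(sensitivity, specificity, fpr, fnr)  # 0.9 0.8 0.2 0.1
```

Note that FPR = 1 − specificity and FNR = 1 − sensitivity, which is why the ROC curve can be drawn from sensitivity and specificity alone.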
Calculation of the sensitivity and specificity of various values of mid-arm circumference (cm) for detecting low birth weight on the basis of hypothetical data is given in Table II as an illustration. The same data are used later to draw a ROC curve.
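In the same spirit as Table II, a threshold sweep over hypothetical mid-arm circumference values (the article's actual data are not reproduced here) can be sketched as:

```python
# Hypothetical measurements (cm); low values indicate low birth weight (disease)
low_bw    = [8.1, 8.4, 8.6, 8.9, 9.0, 9.3]
normal_bw = [9.1, 9.4, 9.6, 9.8, 10.0, 10.2]

def sens_spec(cutoff, diseased, non_diseased):
    """Call the test 'positive' when the value is <= cutoff."""
    sn = sum(x <= cutoff for x in diseased) / len(diseased)
    sp = sum(x > cutoff for x in non_diseased) / len(non_diseased)
    return sn, sp

for c in (8.5, 9.0, 9.5):
    sn, sp = sens_spec(c, low_bw, normal_bw)
    print(f"cutoff {c} cm: sensitivity {sn:.2f}, specificity {sp:.2f}")
```

Raising the cutoff increases sensitivity and decreases specificity, which is exactly the trade-off the ROC curve displays.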
ROC CURVE
The ROC curve is a graphical display of sensitivity (TPR) on the y-axis against (1 − specificity) (FPR) on the x-axis for varying cut-off points of the test values.
TABLE I DIAGNOSTIC TEST RESULTS IN RELATION TO TRUE DISEASE STATUS IN A 2×2 TABLE

Diagnostic test result | Disease present | Disease absent | Total
Present | True positive (TP) | False positive (FP) | All test positive (T+)
Absent | False negative (FN) | True negative (TN) | All test negative (T−)
Total | Total with disease (D+) | Total without disease (D−) | Total sample size
The ROC curve is generally depicted in a square box for convenience, with both axes running from 0 to 1. Figure 1 depicts a ROC curve and its important components, as explained later. The area under the curve (AUC) is an effective combined measure of sensitivity and specificity for assessing the inherent validity of a diagnostic test. The maximum AUC = 1, meaning the diagnostic test is perfect in differentiating diseased from non-diseased subjects. This implies that both sensitivity and specificity are one, and both errors (false positive and false negative) are zero. This can happen only when the distributions of test values in the diseased and non-diseased groups do not overlap, which is extremely unlikely in practice. An AUC closer to 1 indicates better performance of the test.
The diagonal joining the point (0, 0) to (1, 1) divides the square into two equal parts, each with an area equal to 0.5. When the ROC curve coincides with this line, there is overall a 50-50 chance that the test will correctly discriminate the diseased and non-diseased subjects. The minimum value of AUC should be considered 0.5 instead of 0, because AUC = 0 means the test classified all subjects with disease as negative and all non-diseased subjects as positive. If the test results are reversed, then area = 0 is transformed to area = 1; thus a perfectly inaccurate test can be transformed into a perfectly accurate test!
ADVANTAGES OF THE ROC CURVE
The ROC curve has the following advantages compared with a single value of sensitivity and specificity at a particular cut-off.

1. The ROC curve displays all possible cut-off points, and one can read off the optimal cut-off for correctly identifying diseased or non-diseased subjects as per the procedures given later.
TABLE II HYPOTHETICAL DATA SHOWING THE SENSITIVITY AND SPECIFICITY AT VARIOUS CUT-OFF POINTS OF MID-ARM CIRCUMFERENCE TO DETECT LOW BIRTH WEIGHT

Mid-arm circumference (cm) | Low birth weight | Normal birth weight | Sensitivity | Specificity
2. The ROC curve is independent of the prevalence of disease, since it is based on sensitivity and specificity, which are known to be independent of prevalence [6-7].

3. Two or more diagnostic tests can be visually compared simultaneously in one figure.

4. Sometimes sensitivity is more important than specificity, or vice versa; the ROC curve helps in finding the required value of sensitivity at a fixed value of specificity.

5. The empirical area under the ROC curve (explained later) is invariant with respect to the addition or subtraction of a constant, or to transformations such as log or square root [8]. The invariance under log or square-root transformation does not hold for the binormal ROC curve, which is also explained shortly.

6. Useful summary measures can be obtained for determining the validity of a diagnostic test, such as the AUC and the partial area under the curve.
NON-PARAMETRIC AND PARAMETRIC METHODS TO OBTAIN AREA UNDER THE ROC CURVE
Statistical software provides non-parametric and parametric methods for obtaining the area under the ROC curve. The user has to make a choice. The following details may help.
Non-parametric Approach
This does not require any distributional assumption about the test values, and the resulting area under the ROC curve is called empirical. The first such method uses the trapezoidal rule. It calculates the area by joining the points (1 − SP, SN) at each interval of the observed values of the continuous test and dropping straight lines to the x-axis. This forms several trapezoids (Fig. 2) whose areas can be easily calculated and summed. Figure 2 is drawn for the mid-arm circumference and low birth weight data in Table II. Another non-parametric method uses the Mann-Whitney statistic, also known as the Wilcoxon rank-sum statistic or the c-index, to calculate the area. Both these methods of estimating the AUC have been found equivalent [7].
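A sketch of both empirical approaches on hypothetical data (assuming higher values indicate disease) shows that they produce the same AUC:

```python
# Hypothetical test values; higher values indicate disease
diseased     = [3.1, 4.2, 4.8, 5.5, 6.0]
non_diseased = [1.9, 2.5, 3.0, 3.6, 4.5]

def roc_points(d, n):
    """Empirical ROC points (FPR, TPR) over all observed cut-offs."""
    pts = [(0.0, 0.0)]
    for c in sorted(set(d) | set(n), reverse=True):
        tpr = sum(x >= c for x in d) / len(d)
        fpr = sum(x >= c for x in n) / len(n)
        pts.append((fpr, tpr))
    return pts

def trapezoid_auc(pts):
    """Sum the trapezoids formed between successive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def mann_whitney_auc(d, n):
    """P(diseased value > non-diseased value), ties counted half."""
    score = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in d for y in n)
    return score / (len(d) * len(n))

print(trapezoid_auc(roc_points(diseased, non_diseased)),
      mann_whitney_auc(diseased, non_diseased))  # the two estimates agree
```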
Standard errors (SE) are needed to construct a confidence interval. Three methods have been suggested for estimating the SE of the empirical area under the ROC curve [7,9-10]. These have been found similar when the sample size is greater than 30 in each group, provided the test values are on a continuous scale [11]. For small sample sizes it is difficult to recommend any one method. For discrete ordinal outcomes, the Bamber method [9] and the DeLong method [10] give equally good results, and better than the Hanley and McNeil method [7].
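As a reference sketch, the Hanley and McNeil [7] standard error of an empirical AUC can be computed from the AUC and the two group sizes; the values passed below are illustrative assumptions, not the article's data:

```python
import math

def hanley_mcneil_se(auc, n_diseased, n_non_diseased):
    """SE of an empirical AUC by the Hanley-McNeil (1982) formula."""
    q1 = auc / (2 - auc)           # prob. two random diseased both exceed one non-diseased
    q2 = 2 * auc ** 2 / (1 + auc)  # prob. one diseased exceeds two random non-diseased
    var = (auc * (1 - auc)
           + (n_diseased - 1) * (q1 - auc ** 2)
           + (n_non_diseased - 1) * (q2 - auc ** 2)) / (n_diseased * n_non_diseased)
    return math.sqrt(var)

# Assumed AUC 0.90 with 50 subjects per group
print(round(hanley_mcneil_se(0.90, 50, 50), 4))
```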
Parametric Methods
These are used when the statistical distribution of diagnostic test values in the diseased and non-diseased groups is known. The binormal distribution is commonly used for this purpose. It is applicable when test values in both diseased and non-diseased subjects follow a normal distribution. If the data are actually binormal, or a transformation such as log, square root, or Box-Cox [12] makes them binormal, then the relevant parameters can be easily estimated from the means and variances of the test values in diseased and non-diseased subjects. Details are available elsewhere [13].
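Under the binormal model the AUC has a closed form, AUC = Φ((μD − μN)/√(σD² + σN²)), when higher values indicate disease. A minimal sketch with assumed parameters (not the article's data):

```python
from statistics import NormalDist

def binormal_auc(mu_d, sd_d, mu_n, sd_n):
    """AUC for normally distributed test values in the diseased (D) and
    non-diseased (N) groups, assuming higher values indicate disease."""
    return NormalDist().cdf((mu_d - mu_n) / (sd_d ** 2 + sd_n ** 2) ** 0.5)

# Assumed group means and SDs for illustration
print(round(binormal_auc(mu_d=10.0, sd_d=2.0, mu_n=7.0, sd_n=1.5), 3))  # 0.885
```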
Another parametric approach is to transform thetest results into an unknown monotone form whenboth the diseased and non-diseased populations
FIG. 2 Comparison of empirical and binormal ROC curves for the hypothetical neonatal data in Table II (x-axis: 1 − specificity; y-axis: sensitivity).
follow a binormal distribution [14]. This approach first discretizes the continuous data into a maximum of 20 categories, then uses the maximum likelihood method to estimate the parameters of the binormal distribution, and calculates the AUC and its standard error. The ROCKIT package, containing the ROCFIT method, uses this approach to draw the ROC curve, estimate the AUC, compare two tests, and calculate the partial area [15].
The choice of method to calculate the AUC for continuous test values essentially depends upon the availability of statistical software. The binormal and ROCFIT methods produce results similar to the non-parametric method when the distribution is binormal [16]. For unimodal skewed distributions, a Box-Cox transformation that makes the test values binormal, and the ROCFIT method, give results similar to the non-parametric method, but the former two approaches have the additional useful property of providing a smooth curve [16,17]. When software for both parametric and non-parametric methods is available, the conclusion should be based on the method that yields greater precision in estimating the AUC. However, for a bimodal distribution (having two peaks), which is rarely found in medical practice, Mann-Whitney gives more accurate estimates compared to parametric methods [16]. The parametric method gives a small bias for discrete test values compared to the non-parametric method [13].
In our data, the area under the curve of mid-arm circumference for indicating low birth weight is 0.9142 by the trapezoidal rule and 0.9144 by the Mann-Whitney U. The SE is also nearly equal by the three methods in these data: DeLong SE = 0.0128, Bamber SE = 0.0128, and Hanley and McNeil SE = 0.0130. For the parametric method, a smooth ROC curve was obtained under the binormal assumption (Fig. 2), and the area under the curve, calculated using the means and standard deviations of mid-arm circumference in normal and low birth weight neonates, is 0.9427 with SE = 0.0148 in this example. The binormal method showed a higher area than the non-parametric method, which might be due to violation of the binormal assumption in this case. The binormal ROC curve is initially above the empirical curve (Fig. 2), suggesting higher sensitivity compared to the empirical values in this range. When (1 − specificity) lies between 0.1 and 0.2, the binormal curve is below the empirical curve, suggesting comparatively lower sensitivity. When (1 − specificity) is greater than 0.2, the curves almost overlap, suggesting both methods give similar sensitivity. The AUC by the ROCFIT method is 0.9161 and its standard error is 0.0100. This AUC is similar to the non-parametric estimate, though the standard error is a little smaller than that of the non-parametric method. The data in our example have a unimodal skewed distribution, and the results agree with previous simulation studies [16,17] on such data. All calculations for this example were done using MS Excel and STATA statistical software.
Interpretation of ROC Curve
The total area under the ROC curve is a single index for measuring the performance of a test. The larger the AUC, the better is the overall performance of the diagnostic test in correctly picking up diseased and non-diseased subjects. Equal AUCs of two tests represent similar overall performance, but this does not necessarily mean that both curves are identical; they may cross each other. Three common interpretations of the area under the ROC curve are: (i) the average value of sensitivity over all possible values of specificity; (ii) the average value of specificity over all possible values of sensitivity [13]; and (iii) the probability that a randomly selected patient with disease has a test result indicating greater suspicion than a randomly selected patient without disease [10], when higher values of the test are associated with disease and lower values with non-disease. This last interpretation is the basis of the non-parametric Mann-Whitney U statistic for calculating the AUC.
Figure 3 depicts three different ROC curves. Considering the area under the curve, test A is better than both B and C, and its curve is closer to perfect discrimination. Test B has good validity and test C moderate validity.
Hypothetical ROC curves of three diagnostictests A, B, and C applied on the same subjects to
classify the same disease are shown in Fig. 4. Tests B (AUC = 0.686) and C (AUC = 0.679) have nearly equal areas but cross each other, whereas test A (AUC = 0.805) has a higher AUC than curves B and C. The overall performance of test A is better than that of test B as well as test C at all threshold points. Test C performs better than test B where high sensitivity is required, and test B performs better than C when high specificity is needed.
Sensitivity at Fixed Point and Partial Area Under the ROC Curve
The choice of a fixed specificity or a range of specificities depends upon the clinical setting. For example, to diagnose a serious disease such as cancer in a high-risk group, a test with higher sensitivity is preferred even if the false positive rate is high, because a test giving more false negatives is more dangerous. On the other hand, in screening a low-risk group, high specificity is required when the subsequent confirmatory test is invasive and costly, so that the false positive rate is low and patients do not unnecessarily suffer pain and expense. The cut-off point should be decided accordingly. Sensitivity at a fixed point of specificity (or vice versa) and the partial area under the curve are more suitable for determining the validity of a diagnostic test in the above-mentioned clinical settings, and also for comparing two diagnostic tests, applied to the same or independent patients, when their ROC curves cross each other.
The partial area is defined as the area between a range of false positive rates (FPR) or between two sensitivities. Both parametric (binormal assumption) and non-parametric methods are available in the literature [13,18], but most statistical software does not have an option to calculate the partial area under the ROC curve (Table III). STATA software has all the features for ROC analysis.
Standardization of the partial area by dividing it by the maximum possible area (equal to the width of the interval of the selected range of FPRs or sensitivities) has been recommended [19]. It can be interpreted as the average sensitivity for the range of selected specificities. This standardization makes the partial area more interpretable, and its maximum value is 1. Figure 5 shows that the partial area under the curve for FPR from 0.3 to 0.5 is 0.132 for test A and 0.140 for test B. After standardization, these become 0.660 and 0.700,
FIG. 3 Comparison of three smooth ROC curves with different areas.

FIG. 4 Three empirical ROC curves. Curves B and C cross each other but have nearly equal areas; curve A has a bigger area.
TABLE III SOME POPULAR ROC ANALYSIS SOFTWARE AND THEIR IMPORTANT FEATURES*

Name of ROC analysis software | Methods used to estimate AUC and its variance | Comparison of two or more ROC curves | Partial area: calculation | Partial area: comparison | Important by-products

Medcalc version 11.3; commercial software, trial version available at: www.medcalc.be | Non-parametric | Available | Not available | Not available | Sensitivity, specificity, LR+, LR− with 95% CI at each possible cut-point

SPSS version 17.0; commercial software | Non-parametric | Not available | Not available | Not available | Sensitivity and specificity at each cut-off point; no 95% CI

STATA version 11; commercial software | Non-parametric and parametric (Metz et al.) | Available for paired and unpaired subjects, with Bonferroni adjustment when 3 or more curves are to be compared | Available at specified specificity and sensitivity range | Only two partial AUCs can be compared | Sensitivity, specificity, LR+, LR− at each cut-off point, but no 95% CI

ROCKIT (beta version) (Metz et al.); free software, available at www.radiology.uchicago.edu | Parametric | Available for both paired and unpaired subjects | Available | Available | Sensitivity and specificity at each cut-point

Analyse-it; commercial software, add-on in MS Excel; trial version available at www.analyse-it.com | Non-parametric | Available; no option for paired and unpaired subjects | Not available | Not available | Sensitivity, specificity, LR+, LR− with 95% CI at each possible cut-point; various decision plots

XLstat 2010; commercial software, add-on in MS Excel; trial version available at: www.xlstat.com | Non-parametric | Available for both paired and unpaired subjects | Not available | Not available | Sensitivity, specificity, LR+, LR− with 95% CI at each possible cut-point; various decision plots

SigmaPlot; commercial software, add-on in MS Excel; trial version available at: www.sigmaplot.com | Non-parametric | Available for both paired and unpaired subjects | Not available | Not available | Sensitivity, specificity, LR+, LR− with 95% CI at each possible cut-point; various decision plots

*LR+ = positive likelihood ratio; LR− = negative likelihood ratio. Paired subjects means both diagnostic tests (test and gold) applied to the same subjects; unpaired subjects means diagnostic tests applied to different subjects. CI = confidence interval; AUC = area under curve; ROC = receiver operating characteristic.
respectively. The proportion of the partial area will depend on the range of FPRs selected by the researcher. It may lie on one side of the intersection point of the ROC curves or on both sides of it. In Figure 5, SN(A) and SN(B) are sensitivities at a specific value of FPR. For example, sensitivity at FPR = 0.3 is 0.545 for test A and 0.659 for test B. Similarly, sensitivity at fixed FPR = 0.5 is 0.76 for test A and 0.72 for test B. All these calculations were done with STATA (version 11) statistical software using the comproc command with option pcvmeth(empirical).
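The standardization described above is a one-line computation; using the partial areas quoted in the text (0.132 and 0.140 over the FPR range 0.3 to 0.5):

```python
def standardized_partial_auc(partial_area, fpr_low, fpr_high):
    """Divide the partial area by its maximum possible value (the width of
    the FPR interval); the result is the average sensitivity over that range."""
    return partial_area / (fpr_high - fpr_low)

print(round(standardized_partial_auc(0.132, 0.3, 0.5), 3))  # test A
print(round(standardized_partial_auc(0.140, 0.3, 0.5), 3))  # test B
```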
METHOD TO FIND THE OPTIMAL THRESHOLD POINT
The optimal threshold is the point that gives maximum correct classification. Three criteria are used to find the optimal threshold point from a ROC curve. The first two methods give equal weight to sensitivity and specificity and impose no ethical, cost, or prevalence constraints. The third criterion considers cost, which mainly includes the financial cost of correct and false diagnoses, the cost of discomfort caused to the person by treatment, and the cost of further investigation when needed. This method is rarely used in the medical literature because it is difficult to implement. These three criteria are known as the point on the curve closest to (0, 1), the Youden index, and the minimize-cost criterion, respectively.
The distance between the point (0, 1) and any point on the ROC curve is d² = (1 − SN)² + (1 − SP)². To obtain the optimal cut-off point to discriminate the diseased from the non-diseased subjects, calculate this distance for each observed cut-off point and locate the point where the distance is minimum. Most ROC analysis software (Table III) calculates the sensitivity and specificity at all observed cut-off points, allowing you to do this exercise.
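A sketch of this criterion over a list of (cutoff, sensitivity, specificity) triples (the values are hypothetical, not the article's):

```python
# Hypothetical (cutoff, sensitivity, specificity) triples for illustration
points = [(8.5, 0.40, 0.98), (9.0, 0.70, 0.92), (9.5, 0.85, 0.80), (10.0, 0.95, 0.55)]

def closest_to_corner(points):
    """Choose the cutoff minimising d^2 = (1 - SN)^2 + (1 - SP)^2."""
    return min(points, key=lambda p: (1 - p[1]) ** 2 + (1 - p[2]) ** 2)

print(closest_to_corner(points))  # the triple nearest the top-left corner (0, 1)
```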
The second is the Youden index [20], which maximizes the vertical distance from the line of equality to the point [x, y], as shown in Fig. 1. The x-axis represents (1 − specificity) and the y-axis represents sensitivity. In other words, the Youden index J corresponds to the point on the ROC curve which is farthest from the line of equality (the diagonal line). The main aim of the Youden index is to maximize the difference between TPR (SN) and FPR (1 − SP), and a little algebra yields J = max[SN + SP − 1]. The value of J for a continuous test can be located by searching plausible cut-offs for the point where the sum of sensitivity and specificity is maximum. The Youden index is the more commonly used criterion because it reflects the intention to maximize the correct classification rate and is easy to calculate. Many authors advocate this criterion [21]. The third method, which considers cost, is rarely used in the medical literature and is described in [13].
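The Youden criterion is an equally small computation; on the same style of hypothetical (cutoff, sensitivity, specificity) triples:

```python
# Hypothetical (cutoff, sensitivity, specificity) triples for illustration
points = [(8.5, 0.40, 0.98), (9.0, 0.70, 0.92), (9.5, 0.85, 0.80), (10.0, 0.95, 0.55)]

def youden_optimal(points):
    """Choose the cutoff maximising J = SN + SP - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)

cutoff, sn, sp = youden_optimal(points)
print(cutoff, sn + sp - 1)  # optimal cutoff and its Youden index J
```

The Youden and closest-to-(0, 1) criteria can disagree on asymmetric curves, although on this small hypothetical set they select the same cutoff.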
BIASES THAT CAN AFFECT ROC CURVE RESULTS
In this section we describe the more prevalent biases that affect sensitivity and specificity and consequently may affect the area under the ROC curve. The interested researcher can find a detailed description of these and other biases, such as withdrawal bias, lost-to-follow-up bias, spectrum bias, and population bias, elsewhere [22,23].
1. Gold standard: The validity of the gold standard is important; ideally it should be error free, and the diagnostic test under review should be independent of the gold standard, as dependence can spuriously increase the area under the curve. The gold standard can be clinical follow-up, surgical verification, biopsy or autopsy, or in some cases
FIG. 5 The partial area under the curve and sensitivity at a fixed point of specificity (see text).
the opinion of a panel of experts. When the gold standard is imperfect, such as peripheral smear for malaria parasites [24], the sensitivity and specificity of the test are underestimated [22].

2. Verification bias: This occurs when not all subjects receive the same gold standard, for reasons such as economic constraints or clinical considerations. For example, in evaluating breast bone density as a screening test for breast cancer, only women with higher values of breast bone density are referred for biopsy, while those with lower but suspect values are followed clinically. In this case, verification bias would overestimate the sensitivity of the breast bone density test.

3. Selection bias: Selection of the right patients with and without disease is important, because some tests produce perfect results in a severely diseased group but fail to detect mild disease.

4. Test review bias: The clinician should be blind to the actual diagnosis while evaluating a test. Knowledge that a subject is disease-positive or disease-negative may influence the reading of the test result.

5. Inter-observer bias: In studies where observer abilities are important in diagnosis, such as bone density assessment through MRI, an experienced radiologist and a junior radiologist may differ. If both are used in the same study, observer bias is apparent.

6. Co-morbidity bias: Sometimes patients have other known or unknown diseases which may affect the positivity or negativity of the test. For example, NESTROFT (naked eye single tube red cell osmotic fragility test), used for screening for thalassemia in children, shows good sensitivity in patients without any other hemoglobin disorders but also produces positive results when other hemoglobin disorders are present [25].

7. Uninterpretable test results: This bias occurs when the test provides results that cannot be interpreted and the clinician excludes these subjects from the analysis. This results in overestimation of the validity of the test.

It is difficult to rule out all biases, but the researcher should be aware of them and try to minimize them.
TABLE IV SAMPLE SIZE FORMULAE FOR ESTIMATING SENSITIVITY, SPECIFICITY, AND AREA UNDER THE ROC CURVE

Problem | Formula | Description of symbols used

Estimating the sensitivity of a test | n = Z²_(1−α/2) SN(1 − SN) / (d² Prev) | SN = anticipated sensitivity; Prev = prevalence of disease in the population, obtainable from previous literature or a pilot study; d = required absolute precision on either side of the sensitivity

Estimating the specificity of a test | n = Z²_(1−α/2) SP(1 − SP) / (d² (1 − Prev)) | SP = anticipated specificity; Prev = prevalence of disease in the population; d = required absolute precision on either side of the specificity

Estimating the area under the ROC curve | nD = Z²_(1−α/2) V(AUC) / d²; total n = nD(1 + k) | V(AUC) = anticipated variance of the anticipated area under the ROC curve; nD = number of diseased subjects; k = ratio of non-diseased to diseased subjects; d = required absolute precision on either side of the area under the curve

Z_(1−α/2) is a standard normal value and 1 − α is the confidence level: Z_(1−α/2) = 1.645 for α = 0.10 and Z_(1−α/2) = 1.96 for α = 0.05.
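The first two formulae in Table IV translate directly into code; the anticipated values used below (SN = SP = 0.90, Prev = 0.20, d = 0.05, α = 0.05) are assumptions for illustration:

```python
import math

Z = 1.96  # Z_(1-alpha/2) for alpha = 0.05

def n_for_sensitivity(sn, prev, d, z=Z):
    """Total sample size to estimate sensitivity sn with absolute precision d."""
    return math.ceil(z ** 2 * sn * (1 - sn) / (d ** 2 * prev))

def n_for_specificity(sp, prev, d, z=Z):
    """Total sample size to estimate specificity sp with absolute precision d."""
    return math.ceil(z ** 2 * sp * (1 - sp) / (d ** 2 * (1 - prev)))

print(n_for_sensitivity(0.90, 0.20, 0.05))  # 692
print(n_for_specificity(0.90, 0.20, 0.05))  # 173
```

The Prev and (1 − Prev) terms in the denominators inflate the total sample so that enough diseased (or non-diseased) subjects are expected within it.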
SAMPLE SIZE
Adequate power of a study depends upon the sample size. Power is the probability that a statistical test will indicate a significant difference when a certain pre-specified difference is actually present. In a survey of eight leading journals, only two out of 43 diagnostic studies reported an a priori calculation of sample size [26]. In an estimation set-up, an adequate sample size ensures the study will yield the estimate with the desired precision. A small sample size produces an imprecise or inaccurate estimate, while an unnecessarily large sample size wastes resources, especially when the test is expensive. The sample size formula depends upon whether the interest is in estimation or in testing a hypothesis. Table IV provides the required formulae for estimating sensitivity, specificity, and AUC. These are based on the normal distribution or asymptotic assumption (large sample theory), which is generally used for sample size calculation.
The variance of the AUC, required in formula 3 (Table IV), can be obtained using either the parametric or the non-parametric method. It may also be available in the literature from previous studies. If no previous study is available, a pilot study is done to get workable estimates for calculating the sample size. For pilot study data, appropriate statistical software can provide an estimate of this variance.
Formulae of sample size for testing a hypothesis on sensitivity-specificity or on the AUC against a pre-specified value, and for comparisons on the same or different subjects, are complex; refer to [13] for details. A nomogram has been devised to read the sample size for anticipated sensitivity and specificity at the 90%, 95%, and 99% confidence levels [27].
There are many more topics for the interested reader to explore, such as combining multiple ROC curves for meta-analysis, ROC analysis to predict more than one alternative, ROC analysis in a clustered environment, and ROC analysis for tests repeated over time. For these, see [13,28]. For predictivity-based ROC, see [6].
Contributors: Both authors contributed to the concept, review of literature, and drafting of the paper, and approved the final version.
Funding: None.
Competing interests: None stated.
REFERENCES
1. Reddy S, Dutta S, Narang A. Evaluation of lactate dehydrogenase, creatine kinase and hepatic enzymes for the retrospective diagnosis of perinatal asphyxia among sick neonates. Indian Pediatr. 2008;45:144-7.
2. Sood SL, Saiprasad GS, Wilson CG. Mid-arm circumference at birth: A screening method for detection of low birth weight. Indian Pediatr. 2002;39:838-42.
3. Randev S, Grover N. Predicting neonatal hyperbilirubinemia using first day serum bilirubin levels. Indian J Pediatr. 2010;77:147-50.
4. Vasudevan A, Malhotra A, Lodha R, Kabra SK. Profile of neonates admitted in pediatric ICU and validation of score for neonatal acute physiology (SNAP). Indian Pediatr. 2006;43:344-8.
5. Khanna R, Taneja V, Singh SK, Kumar N, Sreenivas V,Puliyel JM. The clinical risk index of babies (CRIB)score in India. Indian J Pediatr. 2002;69:957-60.
6. Indrayan A. Medical Biostatistics (Second Edition).Boca Raton: Chapman & Hall/ CRC Press; 2008. p.263-7.
7. Hanley JA, McNeil BJ. The meaning and use of the areaunder a receiver operating characteristic (ROC) curve.Radiology. 1982;143:29-36.
8. Campbell G. Advances in statistical methodology forevaluation of diagnostic and laboratory tests. Stat Med.1994;13:499-508.
KEY MESSAGES
For sensitivity and specificity to be valid indicators, the gold standard should be nearly perfect.
For a continuous test, the optimal cut-off point to correctly pick up diseased and non-diseased cases is the point where the sum of sensitivity and specificity is maximum, when equal weight is given to both.
An ROC curve is obtained for a test on a continuous scale. The area under the ROC curve is an index of the overall inherent validity of a test; the curve is also used for comparing sensitivity and specificity at particular cut-offs of interest.
The area under the curve is not an accurate index for comparing two tests when their ROC curves cross each other.
Adequate sample size and proper design are required to yield valid and unbiased results.
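The optimal cut-off described above, which maximises sensitivity + specificity (equivalently, Youden's index J = sensitivity + specificity − 1 [20]), can be found by a simple sweep over the observed test values. A minimal sketch, assuming higher scores indicate disease; the function name and toy data are illustrative:

```python
def youden_optimal_cutoff(diseased, nondiseased):
    """Return the cut-off maximising Youden's index
    J = sensitivity + specificity - 1, together with J itself.

    Assumes higher test values indicate disease; a subject is
    classified as positive when the score is >= the cut-off.
    """
    best_cut, best_j = None, -1.0
    for cut in sorted(set(diseased) | set(nondiseased)):
        sens = sum(x >= cut for x in diseased) / len(diseased)
        spec = sum(x < cut for x in nondiseased) / len(nondiseased)
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Illustrative scores for diseased and non-diseased subjects.
cut, j = youden_optimal_cutoff([3, 4, 5, 6], [1, 2, 3, 4])
print(cut, j)
```

Because ties in J are possible (as in this toy example), practical software may report several candidate cut-offs; weighting sensitivity and specificity unequally changes the criterion accordingly.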
KUMAR AND INDRAYAN RECEIVER OPERATING CHARACTERISTIC CURVE
9. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol. 1975;12:387-415.
10. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-45.
11. Cleves MA. Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve. Stata J. 2002;3:280-9.
12. Box GEP, Cox DR. An analysis of transformations. J Royal Statistical Society, Series B. 1964;26:211-52.
13. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: John Wiley and Sons, Inc; 2002.
14. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously distributed data. Stat Med. 1998;17:1033-53.
15. ROCKIT [Computer program]. Chicago: University of Chicago. Available from: www-radiology.uchicago.edu/krl/KRL_ROC/software_index6.htm. Accessed on February 27, 2007.
16. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Stat Med. 2002;21:3093-106.
17. Hajian-Tilaki KO, Hanley JA, Joseph L, Collet JP. A comparison of parametric and nonparametric approaches to ROC analysis of quantitative diagnostic tests. Med Decis Making. 1997;17:94-102.
18. Pepe M, Longton G, Janes H. Estimation and comparison of receiver operating characteristic curves. Stata J. 2009;9:1.
19. Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology. 1996;201:745-50.
20. Youden WJ. An index for rating diagnostic tests. Cancer. 1950;3:32-5.
21. Perkins NJ, Schisterman EF. The inconsistency of optimal cut points obtained using two criteria based on the receiver operating characteristic curve. Am J Epidemiol. 2006;163:670-5.
22. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med. 2004;140:189-202.
23. Kelly S, Berry E, Roderick P, Harris KM, Cullingworth J, Gathercale L, et al. The identification of bias in studies of the diagnostic performance of imaging modalities. Br J Radiol. 1997;70:1028-35.
24. Malaria Site. Peripheral smear study for malaria parasites. Available from: www.malariasite.com/malaria/DiagnosisOfMalaria.htm. Accessed on July 05, 2010.
25. Thomas S, Srivastava A, Jeyaseelan L, Dennison D, Chandy M. NESTROFT as a screening test for the detection of thalassaemia & common haemoglobinopathies: an evaluation against a high performance liquid chromatographic method. Indian J Med Res. 1996;104:194-7.
26. Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ. 2006;332:1127-9.
27. Malhotra RK, Indrayan A. A simple nomogram for sample size for estimating the sensitivity and specificity of medical tests. Indian J Ophthalmol. 2010;58:519-22.
28. Kester AD, Buntinx F. Meta-analysis of ROC curves. Med Decis Making. 2000;20:430-9.