Screening with Tumor Markers: Critical Issues

J. E. Roulston



MOLECULAR BIOTECHNOLOGY Volume 20, 2002


Molecular Biotechnology © 2002 Humana Press Inc. All rights of any nature whatsoever reserved. 1073–6085/2002/20:2/153–162/$12.50

*Author to whom all correspondence should be addressed: Dr. J. E. Roulston, Reproductive and Developmental Sciences; Clinical Biochemistry, Royal Infirmary, Edinburgh EH3 9YW, Scotland, UK. E-mail: [email protected]

REVIEW

Abstract

Reviewing the literature, it would appear that tumor markers have often flattered to deceive. Early promise does not often seem to be borne out in extended trials. Despite apparently high specificity, very few markers are capable of assisting in a screening process.

This brief review attempts to put the roles of tumor markers in perspective and explain how their misapplication has led to misunderstanding of their potential value in a clinical context. It also considers the theoretical basis for their use and highlights how misunderstanding of this basis can lead to flawed studies and application.

Index Entries: Tumor markers; screening.

1. Introduction

Cancer has been known to mankind since ancient times. There is an early Egyptian papyrus describing how one should differentiate between breast cancer and mastitis. The ancient Greeks and Romans also left us writings in which various treatment options are discussed (1). Disease processes and causes were not well understood, however; the humoral pathology established by the ancient Greeks and the school of Galen in the second century A.D. was to survive virtually intact until the mid-nineteenth century. It is perhaps all the more remarkable, then, that the first tumor marker—Bence Jones protein in multiple myeloma—should come to light in what was still by and large the pre-scientific medical culture prevailing in 1845.

Multiple myeloma was fully described and named by von Rustizky (2) in 1873, but it was Kahler (3) who related the disease to Bence Jones proteinuria and thereby brought a specific tumor marker to medical attention, a marker that is still used to this day to assist in diagnosis.

Despite the lesson of Bence Jones protein, in which a marker specific for a particular cancer was discovered, many researchers still sought a general test for early diagnosis of all cancer. Homberger (4) reviewed more than 60 tests which had been suggested in the previous 20 yr (1930–1950). Many of these tests were based upon the physicochemical properties of serum proteins and sought to show a difference between precipitation of serum proteins from normal subjects and cancer patients.

With the benefit of hindsight it is easy to write off such efforts as misplaced; the biochemical techniques available were crude and not always applied with logic. Bodansky (5) points out the problems with many early studies. Technically the tests were deficient because they were based upon a gross and nonspecific measurement—the change in a large fraction of the serum protein pool. Secondly, these investigations were usually carried out in samples from patients with advanced disease, whereas control groups of similarly aged patients with serious nonmalignant diseases were not studied. When these controls were looked at later, the false positive rate was as high as the true positive rate in the neoplastic group.

Apart from technical shortcomings, there is also a major assumption in the presupposition that cancers will produce some unique feature that non-neoplastic diseases will not, and for this there is not a shred of evidence (6).

As the biochemical tools and techniques available have grown ever more sophisticated, more precisely focused studies have become possible. The advent of immunoassay techniques in the 1960s and their refinement during the following decades with nonisotopic labels and, especially, the development of monoclonal ("hybridoma") technology have brought levels of analytical sensitivity and specificity that are orders of magnitude better than those available to previous generations of researchers.

2. Theoretical Considerations

In order to assess and apply tests in an appropriate and discerning manner it is necessary to consider what the aims and objectives are and how one monitors and assesses one's efforts.

At first thought it appears very simple. Firstly, a test is applied to discriminate between the normal and the diseased subject, that is, to assist in diagnosis and possibly to screen populations for occult disease. Secondly, one may wish to apply a test to monitor the course of the disease in a noninvasive way in order to assess the efficiency of therapy, to watch for drug resistance, and to predict outcome. Thirdly, one may wish to monitor patients in remission to ensure that they remain disease-free and to gain valuable lead-time to relapse.

In order to achieve these aims several points must be made clear. Firstly, one must have confidence in the analytical accuracy and precision of the test(s). However, in order to translate the analytical data into clinically meaningful information, it is essential to be aware of what the objectives are. "Is this result normal?" is a question often asked by a requesting clinician, and it is worth considering, at the outset, what the word "normal" may or may not mean.

2.1. What Is "Normal?"

The first problem presenting to workers in clinical medicine is the statistical definition of normal, because it is widely misunderstood and even more widely misapplied. Gauss' law of errors applies to repeated measurements on the same subject or object, not a series of measurements of the same analyte in different subjects. Gauss' law proposes that if the same measurement were repeated over and over again in the same subject, the spread of results would fit a bell-shaped distribution symmetrical about the mean. Abnormal results may then be defined as those outside the 95% confidence limit; in other words, the 2.5% of values at the top and bottom end of the range. There is, however, no a priori reason why this law of distribution should apply to measurements in more than one subject; it was never derived to describe the distribution of a variable (disease-related or otherwise) in a population of subjects. Although it is common practice in laboratories to define a reference range for an analyte as the limits within which 95% of the healthy population's results fall, these limits per se give no indication of morbidity or mortality. Indeed, by definition, 5% of this population will be "abnormal" although disease-free, if we assume a 95% (i.e., mean value plus or minus two standard deviations) reference range. It also follows that the more tests performed on each specimen, the greater the likelihood of at least one of its results being "abnormal": from 5% for 1 test to 40% for 10 tests (the chance of ten tests on a sample all being "normal" is 0.95^10, which is 0.6 or 60%, leaving a 40% chance of an "abnormal"). The figure rises to 99% ([1 – 0.95^90] × 100%) for 90 tests.
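The multiple-testing arithmetic above can be sketched in a few lines; this is a minimal illustration of the 0.95^n calculation, not code from the paper:

```python
# Probability that at least one of n independent tests falls outside a 95%
# reference range in a disease-free subject: 1 - 0.95^n.
def p_at_least_one_abnormal(n_tests: int, in_range: float = 0.95) -> float:
    return 1.0 - in_range ** n_tests

for n in (1, 10, 90):
    print(f"{n:2d} test(s) -> {p_at_least_one_abnormal(n):.0%} chance of an 'abnormal'")
```

With 1, 10, and 90 tests this reproduces the 5%, 40%, and 99% figures quoted in the text.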

It is for this reason that most laboratories today eschew the phrase "normal range" and prefer the alternative "reference range" or "referent value," in order to make clear that the range or cut-off cited is not of necessity one that encompasses the values of the analyte in all disease-free subjects and excludes all diseased subjects. Often 95% reference ranges, based upon the mean value plus or minus two standard deviations, are employed as the reference limits because they have been found empirically to provide cut-offs at clinically useful and discriminant values.

For tumor markers, however, there is less concern whether a reference range based upon a symmetric distribution is ideal; in practice the optimal cut-off value is sought: a point which discriminates "normal" from "elevated"—there is no lower limit to the "reference range."

In order to establish this cut-off value empirically it is necessary to discover the value which discriminates best between disease and nondisease; in other words, produces the fewest misclassifications. To do this, a large number of measurements must be made in the disease group under study and in a suitably matched control population. Inevitably, there is an overlap between the range of results produced by the control group and the range produced by the diseased group. There will, therefore, be some false positive results (elevated analyte in a control subject) and some false negatives (a "normal" result in a diseased subject), and just how many depends upon the test and population in question. In order to proceed further, therefore, it is vital to determine how good the test is in an objective manner.

2.2. How Good Is the Test?

2.2.1. Sensitivity and Specificity

The two criteria most usually applied to assess a test are sensitivity and specificity. Sensitivity is a measure of how good a test is at picking up the disease in question by giving a positive result. It is expressed, in a population with the disease, as the number giving a true positive result divided by the sum of true positives and false negatives; in other words, the percentage of the diseased cohort identified correctly by the test.

It is obvious that a test which is 100% sensitive will score perfectly. A test that is 90% sensitive, however, will generate 10 false negatives for every 100 diseased subjects tested.

As well as correctly identifying the presence of disease, a good test must also correctly classify the disease-free subject by giving a negative result. The measure of a test's ability to so discriminate is called the specificity. This is established by measurements in a disease-free population, and specificity is defined as the number of true negatives divided by the sum of true negatives plus false positives.

It can be seen that sensitivity and specificity are entirely "test-based" parameters; they take no account of the prevalence of the disease in the population: the sensitivity is calculated by study of a group who are all disease positive, and the specificity from a group who are all disease free. This, as will become apparent, is a serious limitation to the application of these parameters, because disease prevalence has serious effects upon the clinical usefulness of tests in certain circumstances.

Furthermore, the choice of cut-off, which effectively determines the sensitivity and specificity, cannot improve both simultaneously; moving the chosen cut-off point to a higher referent value will increase specificity but correspondingly reduce sensitivity. The optimal choice of cut-off, therefore, depends upon whether it is deemed more desirable to optimize sensitivity at the expense of specificity or vice versa, and this consideration in turn is influenced by the disease prevalence in the population under study.
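The cut-off trade-off can be made concrete with a small sketch; the marker values below are invented for illustration and do not come from any study:

```python
# How moving the cut-off trades sensitivity against specificity.
# Marker levels are hypothetical illustrative data.
diseased = [28, 35, 41, 47, 52, 60, 75, 90]           # disease present
healthy = [10, 12, 15, 18, 22, 25, 29, 33, 36, 40]    # disease free

def sens_spec(cutoff):
    tp = sum(v > cutoff for v in diseased)    # true positives at this cut-off
    tn = sum(v <= cutoff for v in healthy)    # true negatives at this cut-off
    return tp / len(diseased), tn / len(healthy)

for cutoff in (25, 35, 45):
    sens, spec = sens_spec(cutoff)
    print(f"cut-off {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Raising the cut-off from 25 to 45 drives specificity from 60% to 100% while sensitivity falls from 100% to 62.5%, exactly the reciprocal movement described above.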

2.2.2. Incidence and Prevalence

By definition, incidence relates to the frequency of occurrence of an event and is therefore a rate per unit time. For a disease, the incidence rate is the number of new cases per 100,000 of the population per year. The prevalence, by contrast, is the number of patients per 100,000 of the population who have the given disease at the time of the study; prevalence, therefore, is a snapshot of the status quo.

The incidence of epithelial carcinoma of the ovary is of the order of 15 per 100,000 per year. If the average duration of the disease is five years, it follows that the prevalence, assuming a steady-state situation in the population, must be 75 per 100,000. As a general rule, therefore:

Prevalence = Incidence × Duration

The clinical usefulness of a test in a given situation will depend upon the prevalence of the disease in the cohort under study; high sensitivity and specificity, although vital, are not enough of themselves to guarantee "usefulness." For example, a test that is 100% sensitive and 99% specific seems to have impressive credentials, but it would fail dismally as a screening test for ovarian cancer. Screening 100,000 women would yield all of the positives, i.e., 74 or 75 women, which is an acceptable "pick-up rate," but it would also generate 1% false positives, i.e., about 1,000 nondiseased women. Therefore a positive test result would correctly identify disease presence in less than 7% (75/[1,000 + 75] = 6.98%) of the test-positive subjects.

It is necessary, therefore, to use assessment procedures which take into account the prevalence of the disease in the population under study.

2.2.3. Bayes' Theorem and the Predictive Value Model

In 1975, Galen and Gambino (7) introduced the Predictive Value Model to clinical laboratories. The theoretical basis was hardly new, coming as it did from a posthumous publication of 1763 (8). What Bayes' theorem allows is the calculation of the a posteriori probability of disease being present in an individual given that the patient has a positive test result. By definition, the a priori probability that a patient will have the disease (i.e., before the test) is equal to the prevalence of the disease. From prevalence, sensitivity, and specificity, Bayes' theorem yields the a posteriori probability—the so-called predictive value of a positive result, or positive predictive value.

Let sensitivity = a, specificity = b, and prevalence = p; then one can describe the positive predictive value (PPV) as follows:

PPV = pa/[pa + (1–b)(1–p)] (2.1)

This simplifies to

PPV = true positives/[true positives + false positives]

since pa is the prevalence of the disease multiplied by the sensitivity of the test for the disease, i.e., the true positive fraction. Similarly, (1 – b)(1 – p) is the prevalence of nondisease multiplied by the probability of a positive result in a disease-free person; (1 – b), which is 1 – specificity, is sometimes, albeit incorrectly, referred to as the false positive rate.

The benefit of the predictive value model is apparent immediately; if a test has a 95% PPV in a given area of use, then the clinician may assume that in a patient with a positive result there is a 95% chance that the patient has the disease. The same conclusion cannot be drawn from sensitivity and specificity values, as they take no account of disease prevalence. Table 1 shows how the predictive value of a positive test varies from virtually zero to 100% as a result of changing disease prevalence, even when sensitivity and specificity are high. This gives an important insight into screening procedures; where disease prevalence is low, it is necessary to have tests with greater than 99% sensitivity and specificity to achieve an acceptable positive predictive value.
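Eq. 2.1 is easily evaluated directly; the short sketch below reproduces the pattern of Table 1 for the prevalences the table lists:

```python
# Positive predictive value from Bayes' theorem (Eq. 2.1):
# PPV = pa / [pa + (1 - b)(1 - p)], where a = sensitivity, b = specificity,
# and p = prevalence.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = prevalence * sensitivity                  # pa
    false_pos = (1 - specificity) * (1 - prevalence)     # (1 - b)(1 - p)
    return true_pos / (true_pos + false_pos)

for p in (0.0002, 0.001, 0.01, 0.02, 0.05, 0.5):
    print(f"prevalence {p:7.2%}: PPV(95/95) = {ppv(0.95, 0.95, p):5.1%}, "
          f"PPV(99/99) = {ppv(0.99, 0.99, p):5.1%}")
```

At 0.1% prevalence even a 99%/99% test yields a PPV of only 9.0%, while at 50% prevalence the same test gives 99%.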

3. Screening for Disease

Screening has been defined as "the presumptive identification of unrecognised disease or defect by the application of tests, examinations, or other procedures that can be applied rapidly" (9). By definition, therefore, a screening test is applied to asymptomatic subjects and is not diagnostic per se; confirmatory tests are required. The idea that early warning leads to a better outcome is not easily translated into a practical program. The economic difficulties of testing large numbers of apparently healthy individuals in order to pick up a small number with the disease are enormous. Secondly, there are difficult ethical considerations when one is investigating healthy subjects without symptoms or any substantive probability of finding disease.

3.1. Population Screening

The oncology literature contains many reports of apparently promising markers which subsequently fail to claim a routine clinical role. Many reasons contribute to this, but the commonest is overextrapolation or illogical application of the results. Consider a study in which an investigator tests a novel tumor marker for a particular cancer which has a prevalence of 100/100,000 in the general population, and finds that in 100 patients with the tumor under investigation 99 have a positive test result; that is, the test has a sensitivity of 99%. Equally, when tested on 100 disease-free subjects, only one is test-positive—a specificity of 99%. Owing to this excellent discrimination it is decided to introduce the test as a screen in the general population in order to detect this tumor at an earlier stage, to improve therapeutic efficacy and patient outcome. The results are disastrous; the test appears to have lost its earlier discrimination and is generating many false positives—why?

In the pilot study the disease prevalence was 50% by design; there were 100 patients and 100 controls, and the positive predictive value was 99%. In the screening exercise the prevalence would be 100/100,000, which is 0.1%. Therefore, as well as correctly identifying 99 out of the 100 true positives, the test will under these circumstances also misclassify about 1,000 subjects as false positives, giving a positive predictive value of 99/[99 + 1,000], that is, 9.0%. In other words, a test that in the pilot investigation yielded 99% correct results gives, in a screening situation, a 91% a posteriori probability that elevated results are not associated with the disease. The marker sensitivity and specificity remain unchanged; the fall in positive predictive value from 99% to 9% was caused entirely by the change in prevalence in the cohort under study from 50% to 0.1%.

If a test is genuinely and completely useless, that is to say it yields positive and negative results in a truly random manner, then the positive predictive value will be the same as the prevalence: the a priori probability of disease in the patient equals the a posteriori probability of disease. Furthermore, for a test to be random it is not necessary for sensitivity and specificity each to equal 50%; a test may have 90% sensitivity and still give random results if the specificity is only 10%.

Randomness requires only that:

sensitivity + specificity = 100%

This can be derived simply from Eq. 2.1: PPV = pa/[pa + (1 – p)(1 – b)].

In a random test the percentage of true positives in the diseased group will, by definition, equal the percentage of false positives in the well group. That is: pa/p = [(1 – p)(1 – b)]/(1 – p). Also by definition, pa/p is the sensitivity of the test, and [(1 – p)(1 – b)]/(1 – p) = (1 – b) = (1 – specificity). Therefore in a random test sensitivity equals (1 – specificity), which is to say the sum of sensitivity and specificity equals unity (or 100%).

This relationship is of value in the graphical representation of marker performance. When sensitivity is plotted as a function of (1 – specificity), an immediate visual impression of the marker's discrimination is obtained. This graph is termed the Receiver Operating Characteristic, or "ROC," plot. A random test will give a straight-line graph at 45° to the axes, whereas a good, highly discriminatory test will give a curve of steep slope from the origin, showing a high sensitivity even at high specificity. Therefore, the greater the area under the curve, the better the test. ROC plots are particularly useful in that they remove the influence of the cut-off point from the marker evaluation.
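An ROC plot is obtained by sweeping the cut-off over the observed marker values; the sketch below computes the curve and its area by the trapezoidal rule, using invented data for illustration:

```python
# ROC curve: for each candidate cut-off, plot sensitivity against
# (1 - specificity); area under the curve (AUC) by the trapezoidal rule.
def roc_points(diseased, healthy):
    cutoffs = sorted(set(diseased) | set(healthy) | {float("-inf")})
    pts = []
    for c in cutoffs:
        sens = sum(v > c for v in diseased) / len(diseased)
        fpr = sum(v > c for v in healthy) / len(healthy)   # 1 - specificity
        pts.append((fpr, sens))
    return sorted(pts)   # from (0, 0) up to (1, 1)

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid between adjacent points
    return area

diseased = [35, 41, 47, 52, 60]   # hypothetical marker levels, disease present
healthy = [10, 15, 22, 29, 36]    # hypothetical marker levels, disease free
print(f"AUC = {auc(roc_points(diseased, healthy)):.2f}")
```

A random test traces the 45° diagonal (AUC = 0.5); these well-separated illustrative groups give an AUC of 0.96.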

Table 1
The Effects of Prevalence on Predictive Value of a Positive Test

Positive Predictive Values (%) as a Function of Prevalence

Disease             PPV, Sensitivity and      PPV, Sensitivity and
Prevalence (%)      Specificity = 95%         Specificity = 99%
 0.02                     0                         2.0
 0.1                      1.9                       9.0
 1.0                     16.1                      50.0
 2.0                     27.9                      66.9
 5.0                     50.0                      83.9
50.0                     95.0                      99.0


3.2. Optimization

If screening is to be considered, it is therefore necessary to know the disease prevalence, and to have tests with high sensitivity and specificity, in order to calculate whether an acceptable positive predictive value can be achieved. But it is impossible to optimize both sensitivity and specificity simultaneously; increasing the one automatically decreases the other. Considerations regarding optimization strategies will vary with the natural history of the disease under study (vide infra).

The simplest case will be considered: the situation where there is a screening procedure to be optimized and a false negative result carries an equivalent penalty to a false positive result. Under these circumstances we may define our "Index of Misclassification," f, as the sum of the false negative and false positive results.

f = FN + FP (2.2)

False negatives, FN, can be calculated as the lack of sensitivity (1 – a) multiplied by the disease prevalence, p. Similarly, false positives, FP, can be calculated by multiplying the lack of specificity (1 – b) by the prevalence of nondisease in the population under study. Therefore

f = p(1–a) + (1–b)(1–p) (2.3)

For most cancers, the prevalence of disease in a general population screen will tend to zero. Therefore:

f = (1–b) (2.4)

It follows, therefore, that under the conditions and assumptions outlined—very low prevalence and equality of penalty for false negatives and false positives—one should increase specificity at the expense of sensitivity to minimize misclassifications.
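Eq. 2.3 can be evaluated directly to show the dominance of the specificity term at screening-level prevalence; the sensitivity/specificity pairs below are illustrative, not from the paper:

```python
# Index of misclassification, f = p(1 - a) + (1 - b)(1 - p)   (Eq. 2.3)
def misclassification_index(sens: float, spec: float, prevalence: float) -> float:
    false_neg = prevalence * (1 - sens)          # p(1 - a)
    false_pos = (1 - spec) * (1 - prevalence)    # (1 - b)(1 - p)
    return false_neg + false_pos

p = 0.00075  # e.g., a prevalence of 75 per 100,000
print(f"sens 90%, spec 99%: f = {misclassification_index(0.90, 0.99, p):.5f}")
print(f"sens 99%, spec 90%: f = {misclassification_index(0.99, 0.90, p):.5f}")
```

Swapping which parameter is high changes f by roughly a factor of ten in favor of the high-specificity test, as Eq. 2.4 predicts when p tends to zero.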

3.3. Targeted Screening

The most frequently cited example of successful screening using a tumor marker is the use of human chorionic gonadotropin (hCG) in choriocarcinoma, and it is instructive to consider briefly why hCG has worked so wonderfully well when no other tumor marker is as competent.

Choriocarcinoma is rare; it accounts for 0.02% of all cancer deaths and is almost exclusively confined to women who have had a hydatidiform mole, of whom about 8% go on to develop choriocarcinoma. The single key fact which makes the screening program workable is the application of the test to a predetermined group in which the disease is present at a high prevalence.

If we assume that hCG has a sensitivity (a) of 99% and a specificity (b) of 99%, and choriocarcinoma has a prevalence (p) of 8% in our screening group, then we can calculate the positive predictive value of hCG in this context:

PPV = pa/[pa + (1–b)(1–p)]

PPV = (0.08 × 0.99)/[(0.08 × 0.99) + (1 – 0.99)(1 – 0.08)] = 89.6%

By contrast, if one attempted to screen for choriocarcinoma all women whose pregnancies had achieved full term (prevalence 0.01%), the positive predictive value would be vanishingly small:

PPV = (0.0001 × 0.99)/[(0.0001 × 0.99) + (1 – 0.99)(1 – 0.0001)] = 0.98%

It is therefore apparent that, for screening to be effective, a high-prevalence group must be identified in order to keep the number of false positives to an acceptable level.
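The two hCG calculations above differ only in the prevalence supplied; a minimal sketch using the figures quoted in the text:

```python
# Same test (99% sensitivity, 99% specificity), two screening populations.
def ppv(sens: float, spec: float, prev: float) -> float:
    return prev * sens / (prev * sens + (1 - spec) * (1 - prev))

post_mole = ppv(0.99, 0.99, 0.08)      # women with a prior hydatidiform mole
full_term = ppv(0.99, 0.99, 0.0001)    # all full-term pregnancies
print(f"targeted screen:   PPV = {post_mole:.1%}")
print(f"population screen: PPV = {full_term:.2%}")
```

The targeted group gives a PPV of 89.6%; the whole-population screen, 0.98%.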

4. Clinical Utility

Clinical effectiveness demands that the early intervention afforded by a successful screen is translated into an increased rate of cure or improved survival time. Objective quantification of improvement in survival time is not quite as simple as it might first appear, as studies are subject to various forms of methodological bias.

4.1. Lead Time Bias

Survival is measured from the date of diagnosis to death, rather than from the date of inception to death. The date of diagnosis may therefore vary considerably, depending on the methods of detection used, without altering the true length of survival from the date of inception. Lead time generated by screening, or the period from detection while the woman is still asymptomatic until the appearance of clinical symptoms which would permit conventional diagnosis, may increase the apparent survival without the individual in fact having benefited from screening. In such circumstances the patient simply has to live with the knowledge of their disease for longer.

4.2. Length Bias

A series of cases diagnosed at screening will be atypical of those arising clinically, since it will contain a disproportionate number of patients with slowly developing tumors and probably a better prognosis. Patients with rapidly progressing tumors are more likely to present with symptoms before the initiation of, or in the interval between, screening tests. This bias is more likely to be manifest at the initiation of screening and is therefore especially important in studies of short duration.

4.3. Selection Bias

Selection bias results from the entry into a screening trial of a cohort who have a different probability of developing and dying from the disease than the population at large. In self-selected populations it is common to find a higher than normal proportion of individuals presenting for screening because of a positive family history. These individuals are more motivated to present for screening because they are more educated in this respect, and are more likely to benefit from it. This has been well demonstrated in breast and cervical screening programs.

5. Optimization Strategies

It was demonstrated earlier that, when prevalence is very low (tending to zero), if false negatives and false positives carry equal penalty, then to minimize misclassifications one should maximize specificity. In addition, one should maximize specificity in situations where the disease is serious but cannot be treated or cured and where, therefore, any false positive result would lead to psychological trauma. Some occult cancers would clearly fall into this group, as would diseases such as multiple sclerosis. Such incurable diseases should not be subject to population screening, as there is usually no benefit to patient or society at large in early diagnosis. In this section the other options available will be considered, along with the circumstances under which it would be appropriate to use them.

5.1. Sensitivity

Sensitivity should be maximized in situations where, although the disease is serious and should not be missed, it is treatable, and false positives are therefore less psychologically damaging. Most treatable infectious diseases fall into this category, as do pheochromocytoma and phenylketonuria. Cervical cancer, where the screening program is effective and confirmatory tests are available prior to an effective therapeutic intervention program, is an example of a malignancy which may fall into this category. Furthermore, the concern caused by the presence of abnormal cells on a cervical smear can in large measure be offset by the patient being aware of the success of early treatment.

5.2. Positive Predictive Value

Positive predictive value (PPV) should be maximized in any situation where treatment of a false positive could be seriously damaging. Where the treatment indicated involves major surgery and radiotherapy, as for certain occult carcinomas, instigating treatment in someone who did not have the disease would be a major catastrophe.

5.3. Accuracy (or "Efficiency")

Accuracy of a very high order is required when a disease is both serious and treatable and false positive and false negative results carry equal penalty. Myocardial infarction is usually cited as the classical example of where tests should be optimized for accuracy [(TP + TN)/(TP + TN + FP + FN)]; however, a case for optimizing accuracy could also be made in testing for certain leukæmias and lymphomas.

6. The Use of Multiple Markers

The idea of using a group of markers in order to complement the sensitivity and specificity of each other seems logical enough and can be extremely beneficial. There are certain rules that can be defined and applied, and certain pitfalls to avoid.


There are two distinct approaches to multiple testing. The first, as described in the example above, is so-called series testing; the various tests are performed one after the other, depending upon the result of the previous test. In series testing, therefore, a "test-positive" patient is one who has scored positive in all the tests. A secondary consideration here is defining the order in which the tests are to be performed to maximize efficacy, although considerations of cost and patient compliance also need to be included in any trial design.

In parallel testing, all tests are performed upon all patients; a "test-positive" patient in these circumstances is one who is positive on any one (or more) of the tests.

In a screening exercise, series testing is usually to be preferred, as it maximizes specificity at the expense of sensitivity, which, as discussed earlier, is a rational approach where disease prevalence is low. Calculation of the positive predictive value for parallel and series regimes bears this out (10).

For series testing, as not all tests are performed on all samples, there is the option of choosing the order in which the tests are performed. There are many considerations: the relative cost of the tests involved, the degree of invasiveness, and the relative sensitivities and specificities of the tests involved. If variables such as cost are set aside, it can be shown that the sensible option is to test in series rather than in parallel, as the positive predictive value is far higher and the total number of tests performed is far smaller. Also, although the positive predictive value is independent of the order of testing, the number of analyses that have to be performed varies considerably, being minimized by applying first the test with the higher (or highest) specificity of those in the panel.
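The series-versus-parallel contrast can be sketched for two tests, assuming conditional independence of the tests; the sensitivity/specificity figures are illustrative, not taken from reference (10):

```python
# Series: positive only if BOTH tests are positive (multiply sensitivities,
# compound specificities). Parallel: positive if EITHER is positive.
def series(test1, test2):
    a1, b1 = test1
    a2, b2 = test2
    return a1 * a2, 1 - (1 - b1) * (1 - b2)

def parallel(test1, test2):
    a1, b1 = test1
    a2, b2 = test2
    return 1 - (1 - a1) * (1 - a2), b1 * b2

def ppv(sens, spec, prev):
    return prev * sens / (prev * sens + (1 - spec) * (1 - prev))

t1, t2, prev = (0.99, 0.97), (0.95, 0.99), 0.001   # (sensitivity, specificity)
for name, (sens, spec) in (("series", series(t1, t2)), ("parallel", parallel(t1, t2))):
    print(f"{name:8s}: sens {sens:.4f}, spec {spec:.4f}, PPV {ppv(sens, spec, prev):.1%}")
```

At 0.1% prevalence the series regime's higher compound specificity lifts the PPV by well over an order of magnitude relative to parallel testing, despite its lower compound sensitivity.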

6.1. Series Testing

In an abstract (11), a research group reported the results of screening 1,010 postmenopausal women for epithelial ovarian cancer using the serum marker CA125 followed up by ultrasonography. They found a level of greater than 30 U per milliliter (their cut-off level) in 31 women. These 31 were then given ultrasonography; three were deemed abnormal and sent for surgery. One had an early-stage ovarian cancer. The authors concluded that CA125 had a high specificity for ovarian cancer, that they could increase the sensitivity by lowering the cut-off from 30 to 23 U per milliliter (the widely accepted cut-off value is, in fact, 35 U/mL), and that CA125 warranted further investigation for early diagnosis.

Their data are shown in Table 2. It is apparent from these data that there is no good reason to lower the cut-off from 30 to 23, as the sensitivity is already 100%. How reliable that figure is, however, is open to question, as there is only one true positive in the study. Furthermore, false negatives—here reported as zero—invariably take longer to emerge from any study and tend to be the most difficult to follow up; for these reasons the reported sensitivity may be an over-estimate. The one true positive patient had a CA125 level of 32 U/mL. If, therefore, these workers had followed the axiom of optimizing specificity at the expense of sensitivity, they would in all probability have missed the one patient who was to benefit directly from the trial. Their reason for opting for a higher sensitivity in this case was that they had a highly efficient second test (ultrasonography) to filter out the majority of the false positives generated by CA125 alone, and they did not wish to miss any cases. It can be seen from Table 2 that, despite a sensitivity of 100%, a specificity of 97%, and an overall accuracy of 97%, the positive predictive value was only 3.1% for CA125: hopelessly inadequate as a single selector for exploratory surgery. It is also true to say that, knowing the sensitivity and specificity of the test and the disease prevalence, one could have calculated this positive predictive value without having to do the trial, saving considerable expense. ("Since Isaac Newton, we no longer have to chart the fall of each apple"—Sir Peter Medawar.)

When, however, ultrasonography is added as a second-line test, the positive predictive value improves by an order of magnitude to 33% (1/3), which is perhaps an acceptable pick-up rate considering the high mortality rate of the disease if it is not diagnosed early. In effect, the use of CA125 in this and other studies generates a sub-group of the population under study who are at higher risk than the population at large; it defines a high-prevalence group, thereby enabling a second-line test of similar sensitivity and specificity to produce a far higher positive predictive value.
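The two-stage effect can be illustrated numerically: the first test shrinks the screened population to a small high-prevalence group, and the second test then operates at that higher prevalence. A minimal Python sketch using the counts reported above (taking 1 true positive and 2 false positives among the 3 abnormal ultrasound results, an assumption consistent with the quoted figures):

```python
def ppv(tp, fp):
    """Positive predictive value from raw counts of positives."""
    return tp / (tp + fp)

# Stage 1: CA125 on 1010 women -> 32 positives (1 TP, 31 FP).
stage1 = ppv(tp=1, fp=31)

# Stage 2: ultrasonography on those 32 -> 3 sent to surgery (1 TP, 2 FP);
# the second test sees a prevalence of 1/32, not 1/1010.
stage2 = ppv(tp=1, fp=2)

print(f"CA125 alone:        PPV = {stage1:.1%}")  # 3.1%
print(f"CA125 + ultrasound: PPV = {stage2:.1%}")  # 33.3%
```

The order-of-magnitude gain comes entirely from the prevalence shift, not from any change in the tests themselves.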

6.2. Panel Testing

Evaluation of a panel of tests is, of course, subject to all the same provisions as the assessment of a single test; in particular, the prevalence of the disease in the study group must be typical of the prevalence in the population to which it is intended to apply the test(s).

In a study of ovarian cancer by Ward et al. (12) in 1987 it was reported that, by using three markers, the sensitivity in samples from pretreatment patients with stage I and II disease had increased from 18% using CA125 alone to 64% using human milk-fat globulin II (HMFG2) as the second assay and placental alkaline phosphatase (PLAP) as a third marker. That is to say, CA125 had picked up 2/11 of the diseased group, and HMFG2 and PLAP had picked up a further 5 of the CA125-negative group, taking the total to 7/11. However, as all the subjects under study were disease-positive, it can be seen that neither CA125 nor HMFG2 nor PLAP performed significantly differently from random chance. They also studied the marker panel in patients with advanced disease. Of the 26 patients with advanced (stage III and stage IV) disease, 25 had elevated CA125 (96%) and the twenty-sixth had an elevated PLAP. Therefore all patients with advanced carcinoma of the ovary were positive for at least one of these three markers. These results are not quite as promising as one might at first believe: using such a group of patients, where prevalence is 100% (whether early-stage or advanced disease), one could achieve apparently excellent sensitivity by four consecutive coin flips at considerably less cost! (Each flip will have a 50% sensitivity; therefore, in series, the cumulative sensitivity will become 50%, 75%, 87.5%, and 93.75%.)
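The coin-flip arithmetic is easily verified: in a group where everyone has the disease, a case is "picked up" by a series of 50% tests unless every one of them misses it. A two-line Python check:

```python
# Cumulative "sensitivity" of n independent tests, each positive
# half the time by chance: 1 - (probability all n miss).
for n in range(1, 5):
    cumulative = 1 - 0.5 ** n
    print(f"{n} flips: {cumulative:.2%}")
# 1 flips: 50.00%
# 2 flips: 75.00%
# 3 flips: 87.50%
# 4 flips: 93.75%
```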

7. Conclusions

Disease prevalence is of fundamental importance in the rational application of tumor marker assays. By and large, cancer prevalence is too low in the population to permit effective screening, even if the financial and ethical constraints could be overcome. In ovarian cancer there is therefore a large amount of current research directed at the identification of possible high-risk groups—the so-called cancer families—in which prevalence is significantly higher than in the population at large owing to genetic predisposition.

The use of tumor markers to monitor disease progress or remission, to track therapeutic efficacy, or to give a lead time to relapse is much more successful. Here the markers are being applied either to a group in order to quantify a disease which is known to be present, or to pick up a relapse in a group where relapse, and therefore disease prevalence, will be high. The routine application of tumor markers in a clinical context has been reviewed elsewhere (13,14).

Table 2
Data from 1010 Postmenopausal Women Screened for Epithelial Ovarian Cancer (EOC) Using CA125
TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative

                  EOC Positive    EOC Negative    Totals
CA125 Positive    1 (TP)          31 (FP)         32 (TP+FP)
CA125 Negative    0 (FN)          978 (TN)        978 (TN+FN)
Totals            1 (TP+FN)       1009 (TN+FP)    1010 (ALL)

Sensitivity = TP/(TP+FN) = 1/1 = 100%
Specificity = TN/(TN+FP) = 978/1009 = 97%
Prevalence = (TP+FN)/(TP+TN+FP+FN) = 1/1010 = 0.1%
Accuracy = (TP+TN)/(TP+TN+FP+FN) = 979/1010 = 97%
Positive Predictive Value = TP/(TP+FP) = 1/32 = 3.1%
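The five statistics beneath Table 2 follow mechanically from the four cells of the 2×2 matrix. A small Python helper (illustrative only) that reproduces them from the raw counts:

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening statistics from a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "positive predictive value": tp / (tp + fp),
    }

# The Table 2 counts.
m = screening_metrics(tp=1, fp=31, tn=978, fn=0)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```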


Acknowledgments

The author is indebted to Dr. Cathie Sturgeon for her stimulating critique of this review. He is also grateful to Churchill Livingstone for permission to use excerpts from his textbook: Serological Tumor Markers: An Introduction (Roulston JE & Leonard RCF, 1993).

References

1. Baum, M. (1988) Breast Cancer: The Facts. OUP, pp. 1–6.
2. von Rustizky, J. (1873) Multiple Myeloma. Zentralblatt für Chirurgie (Leipzig) 3, 102–111.
3. Kahler, O. (1889) Zur Symptomatologie des multiplen Myeloms. Wiener Medizinische Presse 30, 209–253.
4. Homburger, F. (1950) Evaluation of diagnostic tests for cancer. 1: Methodology of evaluation and review of suggested diagnostic procedures. Cancer 3, 143–172.
5. Bodansky, O. (1974) Reflections on biochemical aspects of human cancer. Cancer 33, 364–370.
6. Woodruff, M. (1990) Cellular Variation and Adaptation in Cancer: Biological Basis and Therapeutic Consequences. OUP, pp. 1–7.
7. Galen, R. S. and Gambino, S. R. (1975) Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. John Wiley Medical, New York.
8. Bayes, Rev. T. (1763) An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc. 53, 370–418.
9. Miller, A. B. (1985) Principles of screening and of the evaluation of screening programs, in Screening for Cancer (Miller, A. B., ed.), Academic Press, pp. 3–24.
10. Roulston, J. E. and Leonard, R. C. F. (1993) Serological Tumor Markers: An Introduction. Churchill Livingstone, Edinburgh, UK, pp. 15–34.
11. Jacobs, I. J., Bridges, J., Stabile, I., Kemsley, P., Reynolds, C., and Oram, D. H. (1987) CA-125 and screening for ovarian cancer: serum levels in 1010 apparently healthy postmenopausal women. Br. J. Cancer 55, 515.
12. Ward, B. G., Cruickshank, D. J., Tucker, D. F., and Love, S. (1987) Independent expression in serum of three tumor-associated antigens: CA125, placental alkaline phosphatase and HMFG2 in human ovarian carcinoma. Br. J. Obstet. Gynæcol. 94, 696–698.
13. Bormer, O. P., Paus, E., and Nustad, K. (1998) Sensible use of tumor markers in routine practice. Proceedings UK NEQAS Meeting 3, 140–145.
14. Hayes, D. F., Bast, R. C., Desch, C. E., Fritsche, H. Jr., Kemeny, N. E., and Jessup, J. M. (1996) Tumor marker utility grading system: a framework to evaluate clinical utility of tumor markers. J.N.C.I. 88, 1456–1466.