
Radiology Statistical Concepts Series (November 2002 – March 2004)

Contents (Page numbers in PDF file)

Introduction ................................................................................................................. 2

An Introduction to Biostatistics .................................................................................... 3

Describing Data: Statistical and Graphical Methods .................................................. 8

Probability in Radiology ............................................................................................ 15

Measurement Variability and Confidence Intervals in Medicine: Why Should Radiologists Care? ..................................................................................................... 19

Hypothesis Testing I: Proportions ............................................................................. 24

Hypothesis Testing II: Means .................................................................................... 29

Sample Size Estimation: How Many Individuals Should Be Studied?......................... 33

Correlation and Simple Linear Regression................................................................. 38

Fundamental Measures of Diagnostic Examination Performance: Usefulness for Clinical Decision Making and Research ..................................................................... 44

Measurement of Observer Agreement......................................................................... 51

Hypothesis Testing III: Counts and Medians ............................................................. 57

Receiver Operating Characteristic Curves and Their Use in Radiology...................... 63

Primer on Multiple Regression Models for Diagnostic Imaging Research .................. 69

Special Topics III: Bias ............................................................................................. 75

Proportions, Odds, and Risk....................................................................................... 80

Technology Assessment for Radiologists .................................................................... 88

Sample Size Estimation: A Glimpse beyond Simple Formulas .................................... 94

Statistical Literacy ................................................................................................... 101


Anthony V. Proto, MD, Editor

Radiology 2002—Statistical Concepts Series

As of January 2001, we began review by statisticians of all manuscripts that have statistical content and that are to be published in Radiology (1). Although I believed, from its inception, this form of manuscript evaluation to be an essential component of the peer-review process, my belief has been further confirmed over the past many months. One of the most common comments by our statistical reviewers is that authors have selected inappropriate statistical tests for the analysis of their data. We urge authors to consult with statisticians regarding the analysis of their data. It is particularly important that a study be designed and data be collected in a manner that will allow the study hypothesis to be adequately evaluated. Statistical consultation in the study-planning stages can help ensure success in this regard.

With the November 2002 issue of Radiology, we begin a special series of articles that will appear in the section entitled Statistical Concepts Series. As I announced earlier this year (2), we are indebted to Kimberly E. Applegate, MD, MS, and Philip E. Crewson, PhD, for coordinating this series. Dr Applegate, an RSNA Editorial Fellow in the year 2000, is currently associate professor of Radiology and Health Services Research at Indiana University, Indianapolis. Dr Crewson, formerly director of Clinical Studies, Research Development, at the American College of Radiology, is currently assistant director of Scientific Development, Health Services Research and Development Service, at the Office of Research and Development, Department of Veterans Affairs, Washington, DC. Both Dr Applegate and Dr Crewson have expended a substantial amount of time and effort in selecting topics for this series, identifying the authors for the various topics, and working with the authors to ensure an appropriate level of depth of coverage for each topic without undue overlap with other topics in the series. After review of their manuscripts by Drs Applegate and Crewson, authors submitted the manuscripts to the Radiology Editorial Office for peer review.

We hope that authors, reviewers, and readers will find this series of articles helpful—authors with regard to the design of their studies and the analysis of their data, reviewers with regard to their evaluation and critique of manuscripts during the peer-review process, and readers with regard to improved understanding and interpretation of articles published in Radiology. Since we have established the section title Statistical Concepts Series for these articles, they will be accessible through Radiology Online (radiology.rsnajnls.org) by clicking first on Browse by Subspecialty and Category (Radiology Collections) and second on this section title. As noted by Drs Applegate and Crewson in their article "An Introduction to Biostatistics," the first in this series and published in the current issue of the Journal (3), "These articles are meant to increase understanding of how statistics can and should be applied in radiology research so that radiologists can appropriately interpret the results of a study."

References

1. Proto AV. Radiology 2001—the upcoming year. Radiology 2001; 218:1–2.
2. Proto AV. Radiology 2002—continued progress. Radiology 2002; 222:1–2.
3. Applegate KE, Crewson PE. An introduction to biostatistics. Radiology 2002; 225:318–322.

Index terms: Editorials; Radiology (journal)

Published online before print
10.1148/radiol.2252020996

Radiology 2002; 225:317

See also the article by Applegate and Crewson in this issue.

© RSNA, 2002

From the Editor


Kimberly E. Applegate, MD, MS

Philip E. Crewson, PhD

Index terms: Radiology and radiologists, research; Statistical analysis

Published online before print
10.1148/radiol.2252010933

Radiology 2002; 225:318–322

1 From the Department of Radiology, Riley Hospital for Children, 702 Barnhill Rd, Indianapolis, IN 46202 (K.E.A.); and the Department of Research, American College of Radiology, Reston, Va (P.E.C.). Received May 16, 2001; revision requested July 6; revision received September 5; accepted September 24. Address correspondence to K.E.A. (e-mail: [email protected]).

See also the From the Editor article in this issue.

© RSNA, 2002

An Introduction to Biostatistics1

This introduction to biostatistics and measurement is the first in a series of articles designed to provide Radiology readers with a basic understanding of statistical concepts. Although most readers of the radiology literature know that application of study results to their practice requires an understanding of statistical issues, many may not be fully conversant with how to interpret statistics. The goal of this series is to enhance the ability of radiologists to evaluate the literature competently and critically, not make them into statisticians.

© RSNA, 2002

There are three kinds of lies: lies, damned lies and statistics.
Benjamin Disraeli (1)

The use of statistics in both radiology journals and the broader medical literature has become a common feature of published clinical research (2). Although not often recognized by the casual consumer of research, errors in statistical analysis are common, and many believe that as many as 50% of the articles in the medical literature have statistical flaws (2). Most radiologists, however, are poorly equipped to properly interpret many of the statistics reported in the radiology literature. There are a number of reasons for this problem, but the reality for the radiology profession is that research methods have long been a low priority in the training of radiologists (3–5). Contributing to this deficit are a general indifference toward statistical teaching in medical school and physician training, insufficient numbers of statisticians, and limited collaboration and understanding between radiologists and statisticians (2,6,7).

If it has traditionally been such a low priority in the profession, why then do we need to improve our understanding of statistics? We are consumers of information. Statistics allow us to organize and summarize information and to make decisions by using only a sample of all available data. Nearly all readers of the radiology literature know that understanding a study's results and determining the applicability of the results to their practice require an understanding of statistical issues (5). Even when learned, however, research skills can be quickly forgotten if not applied on a regular basis—something most radiologists are unlikely to do, given their increasing clinical demands.

This introduction to biostatistics and measurement is the first in a series of articles designed to provide Radiology readers with a basic understanding of statistical concepts. These articles are meant to increase understanding of how statistics can and should be applied in radiology research so that radiologists can appropriately interpret the results of a study. Each article will provide a short summary of a statistical topic. The series begins with basic measurement issues and progresses from descriptive statistics to hypothesis testing, multivariate models, and selected technology-assessment topics. Key concepts presented in this series will be directly related to the practice of radiology and radiologic research. In some cases, formulas will be provided for those who wish to develop a deeper understanding; however, the goal of this series is to enhance the ability of radiologists to evaluate the literature competently and critically, not to make them statisticians.

The concepts presented in this introductory article are important for putting into perspective the substantive value of published research. Appendices A and B include two useful resources. One is a list of the common terms and definitions related to measurement. The other is a list of potentially useful Web resources; it contains Web sites that are either primarily educational or have links to other resources such as statistical software. In addition, some suggested additional readings are listed in Appendix C.

ORIGINS OF STATISTICS

The term statistic simply means "numeric data." In contrast, the field of statistics is a human enterprise encompassing a wide range of methods that allow us to learn from experience (8,9). Tied to the emergence of the scientific method in the 16th and 17th centuries, statistical thinking involves deduction of explanations of reality and framing of these explanations into testable hypotheses. The results of these tests are used to reach conclusions (inferences) about future outcomes on the basis of past experiences.

Statistical thinking is probabilistic. For example, most radiologists are familiar with the notion that a P value of less than .05 represents a statistically significant result. Few understand that .05 is an arbitrary threshold. While performing agronomy research, Sir Ronald Fisher helped to establish the P value cutoff level of .05 (commonly referred to as the α level) in the early part of the 20th century. Fisher (10) was testing hypotheses about appropriate levels of fertilizer for potato plants and needed a basis for decision making. Today we often use this same basis for testing hypotheses about appropriate patient care.

The health profession gradually recognized that statistics were as applicable to people as they were to potato plants. By the 1960s, clinicians and health policy leaders were asking for statistical evidence that an intervention was effective (11). Over the past several decades, the use of statistics in medical journals has increased both in quantity and in sophistication (12,13). Advances in computers and statistical software have paralleled this increase. However, the benefits of easy access to the tools of statistical analysis can be overshadowed by the costs associated with misapplication of statistical methods. Statistical software makes it far too easy to conduct multiple tests of data without prior hypotheses (the so-called data-mining phenomenon) or to report overly precise results that portend a false sense of accuracy. There is also the potential for errors in statistical software and the ever-present risk that researchers will fail to take the time to carefully look at the raw data (14). These issues can result in poor science, erroneous or misleading results, and inappropriate patient care. The benefits of statistical software generally far outweigh the costs, but proper measurement, study design, and good judgment should prevail over the ease with which many analyses can be conducted. What follows is an introduction to the basics of measurement.

MEASUREMENTS: BUILDING BLOCKS OF STATISTICS

The interpretation and use of statistics require a basic understanding of the fundamentals of measurement. Although most readers of the radiology literature will recognize common terms such as variables, association, and causation, few are likely to understand how these terms interrelate with one another to frame the structure of a statistical analysis. What follows is a brief introduction to the principles and vocabulary of measurement.

Operationalization

Emmet (15) wrote, "We must beware always of thinking that because a word exists the 'thing' for which that word is supposed to stand necessarily exists too."

Measurement begins with the assignment of numbers to events or things to help us describe reality. Measurements range from the obvious (eg, diameter, length, time) to the more difficult (eg, patient satisfaction, quality of life, pain), but all are something we can quantify or count. This process is called operationalization. Operationalized concepts range from well-established measures such as lesion diameter in millimeters to less well-defined measures such as image quality, contrast agent toxicity, patient comfort, and imaging cost.

If this appears somewhat abstract, consider the following three points: First, researchers can operationalize anything that exists (16), but some measures will be more imprecise (quality of life) than others (diameter). Second, since there is likely to be more than one way to operationalize a concept, the choice of the best way may not be obvious. Third, the radiology profession and the research it generates are saturated with conceptualizations that have been operationalized, some more successfully than others.

Variables

Variables represent measurable indicators of a characteristic that can take on more than one value from one observation to the next. A characteristic may have a different value in different people, in different places, or at different times. Such variables are often referred to as random variables when the value of a particular outcome is determined by chance (ie, by means of random sampling) (17). Since many characteristics are measured imperfectly, we should not expect complete congruence between a measure and truth. Put simply, any measurement has an error component.

If a measure does not take on more than one value, it is referred to as a constant. As an example, patient sex is a variable: It can vary between male and female from one patient to the next. However, a study of breast imaging is likely to be limited to female patients. In this context, sex is no longer a variable in a statistical sense (we cannot analyze it because it does not vary). In contrast, holding the value of one variable constant in order to clarify variations in other variables is sometimes referred to as a statistical control. With mammography as an example, it may be useful to estimate the accuracy of an imaging technique separately for women with and for women without dense breast tissue. As noted previously, however, operationalizing what is and is not dense breast tissue may not be as simple as it first appears.

Measurement Scales

There are four levels of data, commonly referred to as nominal, ordinal, interval, and ratio data. Nominal data classify objects according to type or characteristic but have no logical order. With imaging technology as an example, ultrasonography, magnetic resonance (MR) imaging, computed tomography, and conventional radiography are each exclusive technologic categories, generally without logical order. Other common examples would be sex, race, and a radiologist's primary subspecialty. Ordinal data also classify objects according to characteristic, but the categories can take on some meaningful order. The American College of Radiology Breast Imaging Reporting and Data System, or BI-RADS, classification system for final assessment is a good example of an ordinal scale. The categories are mutually exclusive (eg, a finding cannot be both "benign" and a "suspicious abnormality"), have some logical order (ranked from "negative" to "highly suggestive of malignancy"), and are scaled according to the amount of a particular characteristic they possess (suspicion of malignancy). Nominal and ordinal data are also referred to as qualitative variables, since their underlying meaning is nonnumeric.

Interval data classify objects according to type and logical order, but the differences between levels of a measure are equal (eg, temperature in degrees Celsius, T scores reported for bone mineral density). Ratio data are the same as interval data but have a true zero starting point. As noted in the examples above, the values of degrees Celsius and T score can take on both positive and negative numbers. Examples of ratio data would be heart rate, percentage vessel stenosis, and respirations per minute. Interval and ratio data are also referred to as quantitative variables, since they have a direct numeric interpretation. In most analyses, it does not matter whether the data are interval or ratio data.

Continuous and Discrete Variables

Variables such as weight and diameter are measured on a continuous scale, meaning they can take on any value within a given interval or set of intervals. As a general rule of thumb, if a subdivision between intervals makes sense, the data are continuous. As an example, a time interval of minutes can be further divided into seconds, milliseconds, and an infinite number of additional fractions. In contrast, discrete variables such as sex, the five-point BI-RADS final assessment scale, race, and number of children in a household have basic units of measurement that cannot be divided (one cannot have 1.5 children).

Reliability and Validity

Measurement accuracy is directly related to reliability and validity. Reliability is the extent to which the repeated use of a measure yields the same values when no change has occurred. Therefore, reliability can be evaluated empirically. Poor reliability negatively affects all studies. As an example, reliability can depend on who performs the measurement and when, where, how, and from whom the data are collected.

Validity is the extent to which a measure is an accurate representation of the concept it is intended to operationalize. Validity cannot be confirmed empirically—it will always be in question. Although there are several different conceptualizations of validity, the following provides a brief overview. Predictive validity refers to the ability of an indicator to correctly predict (or correlate with) an outcome (eg, imaged abnormal lesion and subsequent malignancy). Content validity is the extent to which the indicator reflects the full domain of interest (eg, tumor shrinkage may be indicated by tumor width, height, or both). Construct validity is the degree to which one measure correlates with other measures of the same concept (eg, does a positive MR study for multiple sclerosis correlate with physical examination findings, patient symptoms, or laboratory results?). Face validity evaluates whether the indicator appears to measure the concept. As an example, it is unlikely that an MR study of the lumbar spine will facilitate a diagnosis for lost memory and disorientation.

Association

The connection between variables is often referred to as association. Association, also known as covariation, is exhibited by measurable changes in one variable that occur concurrently with changes in another variable. A positive association is represented by changes in the same direction (eg, heart rate increases as physical activity increases). Negative association is represented by concurrent changes in opposite directions (hours per week spent exercising and percentage body fat). Spurious associations are associations between two variables that can be better explained by a third variable. As an example, if after taking medication for a common cold for 10 days the symptoms disappear, one could assume that the medication cured the illness. Most of us, however, would probably agree that the change is better explained in terms of the normal time course of a common cold rather than a pharmacologic effect.

Causation

There is a difference between the determination of association and that of causation. Causation cannot be proved with statistics. With this caveat in mind, statistical techniques are best used to explore (not prove) connections between independent and dependent variables. A dependent variable (sometimes called the response variable) is a variable that contains variations for which we seek an explanation. An independent variable is a variable that is thought to affect (cause) changes in the dependent variable. Causation is implied when statistically significant associations are found between an independent and a dependent variable, but causation can never be truly proved. Proof is always an exercise in logical deduction tempered with a degree of uncertainty (18,19), even in experimental designs (such as randomized controlled trials).

Statistical techniques provide evidence that a relationship exists between independent and dependent variables through the use of significance testing and measures of the strength of association. This evidence must be supported by the theoretical basis and logic of the research. The Table presents a condensed list of elements necessary for a claim of causation. The first attempt to provide an epidemiologic method for evaluating causation was made by A. B. Hill and adapted for the well-known U.S. Surgeon General's report, Smoking and Health (1964) (18,19). The elements described in the Table serve to remind us that causation is neither a simple exercise nor a direct product of statistical significance. This is why many believe the optimal research technique to establish causation is to use a randomized controlled experiment.

Required Elements for Causation

Element          Explanation
Association      Do the variables covary empirically? Strong associations are more likely to be causal than are weak associations.
Precedence       Does the independent variable vary before the effect exhibited in the dependent variable?
Nonspuriousness  Can the empirical correlation between two variables be explained away by the influence of a third variable?
Plausibility     Is the expected outcome biologically plausible and consistent with theory, prior knowledge, and results of other studies?

MAINTAINING PERSPECTIVE

Rothman and Greenland (19) wrote, "The tentativeness of our knowledge does not prevent practical applications, but it should keep us skeptical and critical, not only of everyone else's work but of our own as well."


A basic understanding of measurement will enable radiologists to better understand and put into perspective the substantive importance of published research. Maintaining perspective not only requires an understanding that all analytic studies operate under a cloud of imperfect knowledge, but it also requires sufficient insight to recognize that statistical sophistication and significance testing are tools, not ends in themselves. Statistical techniques, however, are useful in providing summary measures of concepts and helping researchers decide, given certain assumptions, what is meaningful in a statistical sense (more about this in future articles). As new techniques are presented in this series, readers should remind themselves that statistical significance is meaningless without clinical significance.

WHAT COMES NEXT

This introduction to measurement will be followed by a series of articles on basic biostatistics. The series will cover topics on descriptive statistics, probability, statistical estimation and hypothesis testing, sample size, and power. More advanced topics will also be introduced, such as correlation, regression modeling, statistical agreement, measures of risk and accuracy, technology assessment, receiver operating characteristic curves, and bias. Each article will be written by experienced researchers using radiologic examples to present a nontechnical explanation of a statistical topic.

APPENDIX A: KEY TERMS

Below is a list of the common terms and definitions related to measurement.

Abstract concept.—The starting point for measurement, an abstract concept is best understood as a general idea in linguistic form that helps us describe reality.

Association.—An association is a measurable change in one variable that occurs concurrently with changes in another variable. Positive association is represented by change in the same direction. Negative association is represented by concurrent changes in opposite directions.

Constant.—A constant is an attribute of a concept that does not vary.

Construct validity.—Construct validity is the degree to which one measure correlates with other measures of the same abstract concept.

Content validity.—Content validity is the extent to which the indicator reflects the full domain of interest.

Continuous variable.—This type of variable is a measure that can take on any value within a given interval or set of intervals: an infinite number of possible values.

Dependent variable.—The value of the dependent variable depends on variations in another variable.

Discrete variable.—This type of variable is a measure that is represented by a limited number of values.

Face validity.—Face validity evaluates whether the indicator appears to measure the abstract concept.

Independent variable.—The independent variable can be manipulated to affect variations or responses in another variable.

Interval data.—These variables classify objects according to type and logical order but also require that differences between levels of a category are equal.

Nominal data.—These are variables that classify objects according to type or characteristic.

Operationalize.—This is the process of creating a measure of an abstract concept.

Ordinal data.—These are variables that classify objects according to type or kind but also have some logical order.

Predictive validity.—This is the ability of an indicator to correctly predict (or correlate with) an outcome.

Random variable.—This type of variable is a measure where any particular value is based on chance by means of random sampling.

Ratio data.—These variables have a zero starting point and classify objects according to type and logical order but also require that differences between levels of a category be equal.

Reliability.—Reliability is the extent to which the repeated use of a measure yields the same value when no change has occurred.

Spurious association.—This is an association between two variables that can be better explained by or depends greatly on a third variable.

Statistical control.—This refers to holding the value of one variable constant in order to clarify associations among other variables.

Statistical inference.—This is the process whereby one reaches a conclusion about a population on the basis of information obtained from a sample drawn from that population. There are two such methods, statistical estimation and hypothesis testing.

Validity.—Validity is the extent to which a measure accurately represents the abstract concept it is intended to operationalize.

Variable.—A variable is a measure of a concept that can take on more than one value from one observation to the next.

APPENDIX B: WEB RESOURCES

The following is a list of links to general statistics resources available on the Web (accessed May 14, 2001).

www.StatPages.net
www.stats.gla.ac.uk

The following Web links are sources of statistics help (accessed May 14, 2001).

BMJ Statistics at Square One: www.bmj.com/statsbk/
The Little Handbook of Statistical Practice: www.tufts.edu/~gdallal/LHSP.HTM
Rice Virtual Lab in Statistics: www.ruf.rice.edu/~lane/rvls.html
Concepts and Applications of Inferential Statistics: faculty.vassar.edu/~lowry/webtext.html
StatSoft Electronic Textbook: www.statsoft.com/textbook/stathome.html
Hypertext Intro Stat Textbook: www.stat.ucla.edu/textbook/
Introductory Statistics: Concepts, Models, and Applications: www.psychstat.smsu.edu/sbk00.htm
Statnotes: An Online Textbook: www2.chass.ncsu.edu/garson/pa765/statnote.htm
Research Methods Knowledge Base: trochim.human.cornell.edu/kb/

The following is a list of Web links to statistical software (accessed May 14, 2001).

Stata Software (links to software providers): www.stata.com/links/stat_software.html
EpiInfo, free software downloads available from the Centers for Disease Control and Prevention: www.cdc.gov/epiinfo/

APPENDIX C: SUGGESTED GENERAL READINGS

The following is a list of suggested readings.

Pagano M, Gauvreau K. Principles of biostatistics. Belmont, Calif: Duxbury, 1993.

Motulsky H. Intuitive biostatistics. New York, NY: Oxford University Press, 1995.

Rothman KJ, Greenland S, eds. Modern epidemiology. Philadelphia, Pa: Lippincott-Raven, 1998.

Gordis L, ed. Epidemiology. Philadelphia, Pa: Saunders, 1996.

Oxman AD, Sackett DL, Guyatt GH. Users' guides to the medical literature. I. How to get started. The Evidence-Based Medicine Working Group. JAMA 1993; 270:2093–2095. [This is from an ongoing series through year 2000.]

References

1. Disraeli B. Quoted by: Twain M. An autobiography of Mark Twain. Neider C, ed. New York, NY: Columbia University Press, 1995; chapter 29.
2. Altman DG, Bland JM. Improving doctors' understanding of statistics. J R Stat Soc 1991; 154:223–267.
3. Hillman BJ, Putnam CE. Fostering research by radiologists: recommendations of the 1991 summit meeting. Radiology 1992; 182:315–318.
4. Doubilet PM. Statistical techniques for medical decision making: application to diagnostic radiology. AJR Am J Roentgenol 1988; 150:745–750.
5. Black WC. How to evaluate the radiology literature. AJR Am J Roentgenol 1990; 154:17–22.
6. Moses LE, Louis TA. Statistical consultation in clinical research: a two-way street. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, Mass: NEJM Books, 1992; 349–356.
7. Altman DG. Statistics and ethics in medical research. VIII. Improving the quality of statistics in medical journals. Br Med J 1981; 282:44–47.
8. Moses L. Statistical concepts fundamental to investigations. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, Mass: NEJM Books, 1992; 5–44.
9. Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, Mass: NEJM Books, 1992.
10. Fisher RA. The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain 1926; 33:503–513.
11. Oxman AD, Guyatt GH. The science of reviewing research. Ann N Y Acad Sci 1993; 703:125–133; discussion 133–134.
12. Altman DG. Statistics: necessary and important. Br J Obstet Gynaecol 1986; 93:1–5.
13. Altman DG. Statistics in medical journals: some recent trends. Stat Med 2000; 19:3275–3289.
14. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall, 1991; 108–111.
15. Emmet ER. Handbook of logic. Lanham, Md: Littlefield Adams Quality Paperbacks, 1993.
16. Babbie E. The practice of social research. 6th ed. New York, NY: Wadsworth, 1995.
17. Fisher LD, van Belle G. Biostatistics: a methodology for the health sciences. New York, NY: Wiley, 1993.
18. Rothman KJ, Greenland S. The emergence of modern epidemiology. In: Rothman KJ, Greenland S, eds. Modern epidemiology. 2nd ed. Philadelphia, Pa: Lippincott-Raven, 1998; 3–6.
19. Rothman KJ, Greenland S. Causation and causal inference. In: Rothman KJ, Greenland S, eds. Modern epidemiology. 2nd ed. Philadelphia, Pa: Lippincott-Raven, 1998; 7–28.


Seema S. Sonnad, PhD

Index terms: Data analysis; Statistical analysis

Published online before print
10.1148/radiol.2253012154

Radiology 2002; 225:622–628

1 From the Department of Surgery, University of Michigan Medical Center, Ann Arbor. Received January 14, 2002; revision requested March 2; revision received May 20; accepted June 14. Address correspondence to the author, Department of Surgery, University of Pennsylvania Health System, 4 Silverstein, 3400 Spruce St, Philadelphia, PA 19104-4283 (e-mail: [email protected]).

© RSNA, 2002

Describing Data: Statistical and Graphical Methods1

An important step in any analysis is to describe the data by using descriptive and graphic methods. The author provides an approach to the most commonly used numeric and graphic methods for describing data. Methods are presented for summarizing data numerically, including presentation of data in tables and calculation of statistics for central tendency, variability, and distribution. Methods are also presented for displaying data graphically, including line graphs, bar graphs, histograms, and frequency polygons. The description and graphing of study data result in better analysis and presentation of data.

© RSNA, 2002

A primary goal of statistics is to collapse data into easily understandable summaries. These summaries may then be used to compare sets of numbers from different sources or to evaluate relationships among sets of numbers. Later articles in this series will discuss methods for comparing data and evaluating relationships. The focus of this article is on methods for summarizing and describing data both numerically and graphically. Options for constructing measures that describe the data are presented first, followed by methods for graphically examining your data. While these techniques are not methodologically difficult, descriptive statistics are central to the process of organizing and summarizing anything that can be presented as numbers. Without an understanding of the key concepts surrounding calculation of descriptive statistics, it is difficult to understand how to use data to make comparisons or draw inferences, topics that will be discussed extensively in future articles in this series.

In this article, five properties of a set of numbers will be discussed. (a) Location or central tendency: What is the central or most typical value seen in the data? (b) Variability: To what degree are the observations spread or dispersed? (c) Distribution: Given the center and the amount of spread, are there specific gaps or concentrations in how the data cluster? Are the data distributed symmetrically or are they skewed? (d) Range: How extreme are the largest and smallest values of the observations? (e) Outliers: Are there any observations that do not fit into the overall pattern of the data or that change the interpretation of the location or variability of the overall data set?

The following tools are used to assess these properties: (a) summary statistics, including means, medians, modes, variances, ranges, quartiles, and tables; and (b) plotting of the data with histograms, box plots, and others. Use of these tools is an essential first step to understand the data and make decisions about succeeding analytic steps. More specific definitions of these terms can be found in the Appendix.

DESCRIPTIVE STATISTICS

Frequency Tables

One of the steps in organizing a set of numbers is counting how often each value occurs. An example would be to look at diagnosed prostate cancers and count how often in a 2-year period cancer is diagnosed as stage A, B, C, or D. For example, of 236 diagnosed cancers, 186 might be stage A, 42 stage B, six stage C, and two stage D. Because it is easier to understand these numbers if they are presented as percentages, we say 78.8% (186 of 236) are stage A, 17.8% (42 of 236) are stage B, 2.5% (six of 236) are stage C, and 0.8% (two of 236) are stage D. This type of calculation is performed often, and two definitions are important. The frequency of a value is the number of times that value occurs in a given data set. The relative frequency of a value is the proportion of all observations in the data set with that value. Cumulative frequency is obtained by adding relative frequency for one value at a time. The cumulative frequency of the first value would be the same as the relative frequency, that of the first two values would be the sum of their relative frequencies, and so on. Frequency tables appear regularly in Radiology articles and other scientific articles. An example of a frequency table, including both frequencies and percentages, is shown in Table 1. In the section of this article about graphic methods, histograms are presented; they are a graphic method that is analogous to frequency tables.
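To make these definitions concrete, here is a minimal Python sketch (mine, not from the article; the staging counts are the ones used above) that derives relative and cumulative frequencies from raw counts:

```python
# Frequency, relative frequency, and cumulative frequency for the
# prostate cancer staging example (186 A, 42 B, 6 C, 2 D).
stage_counts = {"A": 186, "B": 42, "C": 6, "D": 2}
total = sum(stage_counts.values())  # 236 diagnosed cancers

cumulative = 0.0
print(f"{'Stage':<7}{'Freq':>6}{'Rel freq':>10}{'Cum freq':>10}")
for stage, freq in stage_counts.items():
    rel = freq / total   # proportion of all observations with this value
    cumulative += rel    # running sum of relative frequencies
    print(f"{stage:<7}{freq:>6}{rel:>10.3f}{cumulative:>10.3f}")
```

Rounding each proportion to three decimals reproduces the relative and cumulative frequency columns of Table 1.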

Measurement of the Center of the Data

The three most commonly used measures of the center of the data are the mean, median, and mode. The mean (often referred to as the average) is the most commonly used measure of center. It is most often represented in the literature as x̄. The mean is the sum of the values of the observations divided by the number of observations. The median is the midpoint of the observations, when arranged in order. Half of the observations in a data set lie below the median and half lie above the median. The mode is the most frequent value. It is the value that occurs most commonly in the data set. In a data set like the one in Figure 1, the mean, median, and mode will be different numbers. In a perfectly normal distribution, they will all be the same number. A normal distribution is a commonly occurring symmetric distribution, which is defined by the familiar bell-shaped curve, that includes a set percentage of data between the center and each standard deviation (SD) unit.

When a median is used, the data are arranged in order and the middle observation is selected. If the number of observations is even, there will be a middle pair of values, and the median will be the point halfway between those two values. The numbers must be ordered when the median is selected, and each observation must be included. With a data set of 4, 4, 5, 5, 6, 6, 8, and 9, the median is 5.5. However, if the values represented by the data were listed rather than listing the value of each observation, the median would be erroneously reported as 6. Finally, if there are 571 observations, there is a method to avoid counting in from the ends. Instead, the following formula is used: If there are n observations, calculate (n + 1)/2. Arrange the observations from smallest to largest, and count (n + 1)/2 observations up from the bottom. This gives the median. In real life, many statistical programs (including Excel; Microsoft, Redmond, Wash) will give the mean, median, and mode, as well as many other descriptive statistics for data sets.

To look for the mode, it is helpful to create a histogram or bar chart. The most common value is represented by the highest bar and is the mode. Some distributions may be bimodal and have two values that occur with equal frequency. When no value occurs more than once, they could all be considered modes. However, that does not give us any extra information; therefore, we say that these data do not have a mode. The mode is not used often because it may be far from the center of the data, as in Figure 1, or there may be several modes, or none. The main advantage of the mode is that it is the only measure that makes sense for variables in nominal scales. It does not make sense to speak about the median or mean race or sex of radiologists, but it does make sense to speak about the most frequent (modal) race (white) or sex (male) of radiologists. The median is determined on the basis of order information but not on the basis of the actual values of observations. It does not matter how far above or below the middle a value is, but only that it is above or below.

The mean comprises actual numeric values, which may be why it is used so commonly. A few exceptionally large or small values can significantly affect the mean. For example, if one patient who received contrast material before computed tomography (CT) developed a severe life-threatening reaction and had to be admitted to the intensive care unit, the cost for care of that patient might be several hundred thousand dollars. This would make the mean cost of care associated with contrast material–enhanced CT much higher than the median cost, because without such an episode costs might only be several hundred dollars. For this reason, it may be most appropriate to use the median rather than the mean for describing the center of a data set if the data contain some very large or very small outlier values or if the data are not centered (Fig 1).
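A short sketch of this outlier effect, with hypothetical per-patient costs (the numbers are invented for illustration, not taken from the article):

```python
import statistics

# Hypothetical costs (dollars) of care associated with contrast-enhanced CT;
# the last patient had a severe reaction and required intensive care.
costs = [310, 295, 330, 305, 320, 300, 315, 250_000]

print(statistics.mean(costs))    # 31521.875 -- pulled far upward by one outlier
print(statistics.median(costs))  # 312.5 -- halfway between the middle pair (n is even)
```

With eight observations, (n + 1)/2 = 4.5, so the median falls halfway between the fourth and fifth ordered values, and the single extreme cost leaves it essentially unchanged.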

"Skewness" is another important term. While there is a formula for calculating skew, which refers to the degree to which a distribution is asymmetric, it is not commonly used in data analysis. Data that are skewed right (as seen in Fig 1) are common in biologic studies because many measurements involve variables that have a natural lower boundary but no definitive upper boundary. For example, hospital length of stay can be no shorter than 1 day or 1 hour (depending on the units used by a given hospital), but it could be as long as several hundred days. The latter would result in a distribution with more values below some cutoff and then a few outliers that create a long "tail" on the right.

TABLE 1
Frequencies, Relative Frequencies, and Cumulative Frequencies of Cancer Staging Distribution

Cancer Stage   Frequency   Relative Frequency (proportion)   Percentage   Cumulative Frequency
A                    186                              .788         78.8                   .788
B                     42                              .178         17.8                   .966
C                      6                              .025          2.5                   .992
D                      2                              .008          0.8                  1.000

Figure 1. Line graph shows the mean, median, and mode of a skewed distribution. In a distribution that is not symmetric, such as this one, the mean (arithmetic average), the median (point at which half of the data lie above and half lie below), and the mode (most common value in the data) are not the same.

Measuring Variability in the Data

Measures of center are an excellent starting point in summarizing data, but they usually do not "tell the full story" and can be misleading if there is no information about the variability or spread of the data. An adequate summary of a set of data requires both a measure of center and a measure of variability. Just as with the center, there are several options for measuring variability. Each measure of variability is most often associated with one of the measures of center. When the median is used to describe the center, the variability and general shape of the data distribution are described by using percentiles. The xth percentile is the value at which x percent of the data lie below that percentile and the rest lie above it; therefore, the median is also the 50th percentile. As seen in the box plot (Fig 2), the 25th and 75th percentiles, also known as the lower and upper quartiles, respectively, are often used to describe data (1).

Although the percentiles and quartiles used for creating box plots are useful and simple, they are not the most common measures of spread. The most common measure is the SD. The SD (and the related variance) is used to describe spread around the center when the center is expressed as a mean. The formula for variance would be written as

variance = Σ(obs − mean)² / no. of obs,

where Σ represents a summation over all the observations, and obs means observation. Squaring of the differences results in all positive values. Then the SD is

SD = √[Σ(obs − mean)² / no. of obs],

the square root of the variance. It is helpful to think of the SD as the average distance of observations from the mean. Because it is a distance, it is always positive. Large outliers affect the SD drastically, just as they do the mean. Occasionally, the coefficient of variation—the SD divided by the mean, multiplied by 100 to get a percentage value—is used. This can be useful if the interest is in the percentage variation rather than the absolute value in numeric terms.
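The formulas above translate directly into code. This sketch (an assumed implementation for illustration; note that it divides by the number of observations, as in the text, rather than by n − 1 as many packages do) computes the variance, SD, and coefficient of variation:

```python
import math

def spread(obs):
    """Variance, SD, and coefficient of variation per the formulas in the text."""
    n = len(obs)
    mean = sum(obs) / n
    variance = sum((x - mean) ** 2 for x in obs) / n  # squared deviations / no. of obs
    sd = math.sqrt(variance)                          # square root of the variance
    cv = 100 * sd / mean                              # SD as a percentage of the mean
    return variance, sd, cv

# The delta-QF values from Table 4 (later in this article) as sample input
print(spread([4.7, 3.3, 3.8, 4.0, 4.6, 3.4, 5.5, 4.8, 3.5, 3.6]))
```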

Kaus and colleagues (2) presented an interesting table in their study. Table 2 shows their data on the intra- and interobserver variability in the reading of magnetic resonance (MR) images with automated and manual methods.

TABLE 2
Intra- and Interobserver Variability in MR Reading: Automated versus Manual Method

                                 Manual Method                   Automated Method
Segmented Volume and
Tumor Histologic Type      Intraobserver   Interobserver   Intraobserver   Interobserver

Brain
  Meningioma               0.42 ± 0.03     4.93 ± 1.75     0.36 ± 0.45     1.84 ± 0.65
  Low-grade glioma         1.79 ± 1.53     6.31 ± 2.85     1.44 ± 1.33     2.71 ± 1.68
Tumor
  Meningioma               1.58 ± 0.98     7.08 ± 2.18     0.66 ± 0.72     2.66 ± 0.38
  Low-grade glioma         2.08 ± 0.78     13.61 ± 2.21    2.06 ± 1.73     2.97 ± 1.58

Note.—Data are the mean coefficient of variation percentage plus or minus the SD. (Adapted and reprinted, with permission, from reference 2.)

TABLE 3
Standard Scores and Corresponding Percentiles

Standard Score   Percentile
−3.0             0.13
−2.5             0.62
−2.0             2.27
−1.5             6.68
−1.0             15.87
−0.5             30.85
0.0              50.00
0.5              69.15
1.0              84.13
1.5              93.32
2.0              97.73
2.5              99.38
3.0              99.87

Figure 2. Box plot demonstrates low variation. A high-variation box plot would be much taller. The horizontal line is the median, the ends of the box are the upper and lower quartiles, and the vertical lines are the full range of values in the data. (Reprinted, with permission, from reference 1.)

Figure 3. Line graph shows change in MR signal intensity in the cerebrospinal fluid (CSF) collected during oxygen inhalation over time. Signal intensity in the quadrigeminal plate cistern increases more gradually, and equilibration is reached at 15–20 minutes after the start of oxygen inhalation. (Reprinted, with permission, from reference 3.)

Normal Distribution and Standard Scores

Once a mean and SD are calculated, they can be used to further simplify description of data or to create statistics that can be compared across data sets. One common measure is the standard score. With the standard score, data are assumed to come from a normal distribution, a symmetric bell-shaped distribution. In a normal distribution, 68% of all observations are within 1 SD of the mean (34% above the mean and 34% below). Another 27% are between 1 and 2 SDs; therefore, 95% of all observations are within 2 SDs of the mean. A total of 99.7% are within 3 SDs of the mean; therefore, any normal curve (or histogram that represents a large set of data drawn from a normal distribution) is about 6 SDs wide. Two SDs from the mean is often used as a cutoff for the assignment of values as outliers. This convention is related to the common use of 95% confidence intervals and the selection of confidence for testing of a hypothesis as 95% (these concepts will be defined in a future article). The use of 2 SDs from the mean in normally distributed data ensures that 95% of the data are included. As can be seen, the SD is the common measure of variability for data from normal distributions. These data can be expressed as standard scores, which are a measure of SD units from the mean. The standard score is calculated as (OV − M)/SD, where OV is the observation value and M is the mean. A standard score of 1 corresponds to the 84th percentile of data from a normal distribution. Standard scores are useful because if the data are drawn from a normal distribution, regardless of the original mean and SD, each standard score corresponds to a specific percentile. Table 3 shows percentiles that correspond to some standard scores.
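As a worked illustration (a sketch, not code from the article), the standard score formula and the normal cumulative distribution together reproduce the entries of Table 3; the example value, mean, and SD below are arbitrary:

```python
from math import erf, sqrt

def standard_score(ov, m, sd):
    """Standard score: SD units from the mean, (OV - M)/SD."""
    return (ov - m) / sd

def normal_percentile(z):
    """Percentile of a standard score under the normal distribution, 100 * Phi(z)."""
    return 50.0 * (1.0 + erf(z / sqrt(2.0)))

z = standard_score(115.0, m=100.0, sd=15.0)  # arbitrary illustrative numbers
print(z, round(normal_percentile(z), 2))     # 1.0 84.13, matching Table 3
```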

GRAPHICAL METHODS FOR DATA SUMMARY

Line and Bar Graphs

It is often easier to understand and interpret data when they are presented graphically rather than descriptively or as a table. Line graphs are most often used to show the behavior of one variable over time. Time appears on the horizontal axis and the variable of interest on the vertical axis. Figure 3 is an example of a line graph. The variable is MR signal intensity in the cerebrospinal fluid collected during oxygen inhalation. It is apparent from this graph that signal intensity within the quadrigeminal plate cistern increases more gradually than that within the suprasellar cistern, which was the conclusion drawn by the authors (3).

Figure 4. Bar graphs. (a) Use of a scale with a maximum that is only slightly higher than the highest value in the data shows the differences between the groups more clearly than does (b) use of a scale with a maximum that is much higher than the highest value.

Figure 5. Bar graph shows the underestimation rate for lesion size with vacuum-assisted and large-core needle biopsy. Underestimation rates were lower with the vacuum-assisted device. It is helpful to label the bars with the value. (Reprinted, with permission, from reference 4.)

Figure 6. Bar graph shows the diagnostic adequacy of high-spatial-resolution MR angiography with a small field of view and that with a large field of view for depiction of the segmental renal arteries. (Reprinted, with permission, from reference 5.)

When you look at graphs, it is important to examine both the horizontal and vertical axis scales. Selection of the scale to be used may influence how the reader interprets the graph. This can be seen in Figure 4, which depicts bar graphs.

Bar graphs are another common way to display data. They are used for comparing the values of multiple variables. In many cases, multiple bar graphs will appear in the same figure, such as in Figures 5 and 6. Figure 5 shows the diagnostic underestimation rate for three categories of lesion size with vacuum-assisted and large-core needle biopsies. This bar chart is presented with a percentage rate on the y axis and the categories on the x axis (4). Figure 6 is similar and shows the diagnostic adequacy of high-spatial-resolution MR angiography with a small field of view compared with that with a large field of view for depiction of the segmental renal arteries. In this case, the y axis represents the number of cases, and the percentages appeared in the figure caption (5). Bars may be drawn either vertically, as in Figures 5 and 6, or horizontally. They may be drawn to be touching each other or to be separate. It is important that the width of the bars remains consistent so that one bar does not appear to represent more occurrences because it has a larger area.

Histograms and Frequency Polygons

Histograms look somewhat like bar charts, but they serve a different purpose. Rather than displaying occurrence in some number of categories, a histogram is intended to represent a sampling distribution. A sampling distribution is a representation of how often a variable would have each of a given set of values if many samples were to be drawn from the total population of that variable. The bars in a histogram appear in a graph where the y axis is frequency and the x axis is marked in equal units. Remember that a bar chart does not have x-axis units. When a program automatically creates a histogram, it creates x-axis units of equal size. The histogram is analogous to the frequency table discussed earlier. Figure 7 shows a histogram, although it was called a bar chart in the original figure caption (6). This figure represents the distribution of calcific area among participants and has equal units on the x axis, which makes it a histogram.

Figure 7. Histogram represents distribution of calcific area. Labels on the x axis indicate the calcific area in square millimeters for images with positive findings. (Reprinted, with permission, from reference 6.)
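Since a histogram is just a frequency table over equal-width intervals, the binning it performs is easy to sketch in a few lines of Python (the data and bin edges here are arbitrary, chosen only for illustration):

```python
# Count observations into equal-width bins -- the tabulation a histogram displays.
def bin_counts(obs, low, high, n_bins):
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in obs:
        i = min(int((x - low) / width), n_bins - 1)  # clamp the maximum into the top bin
        counts[i] += 1
    return counts

data = [61, 72, 72, 74, 76, 78, 80, 83, 85, 103]  # arbitrary sample values
for i, c in enumerate(bin_counts(data, 60, 110, 5)):
    print(f"[{60 + i * 10:3d}, {60 + (i + 1) * 10:3d}): {'#' * c}")
```

Printing one row of marks per bin in this way is itself a crude text histogram; note that every bin is 10 units wide, unlike the category axis of a bar chart.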

Frequency polygons are a less used alternative to histograms. Figure 8a shows a histogram that might represent the distribution of the mean partition coefficients in a number of healthy individuals. The line in the figure is created by connecting the middle of each of the histogram bars. The figure represented by that line is called a frequency polygon. The frequency polygon is shown in Figure 8b. In Figure 8a, both the histogram and the frequency polygon are shown on the same graph. Most often, one or the other appears but not both.

Stem and Leaf Plots

Another graphic method for representing distribution is known as the stem and leaf plot, which is useful for relatively small data sets, as seen in many radiologic investigations. With this approach, more of the information from the original data is preserved than is preserved with the histogram or frequency polygon, but a graphic summary is still provided. To make a stem plot, the first digits of the data values represent the stems. Stems appear vertically with a vertical line to their right, and the digits are sorted into ascending order. Then the second digit of each value occurs as a leaf to the right of the proper stem. These leaves should also be sorted into ascending

TABLE 4
Data for Body Weight of 10 Patients Used to Construct Stem and Leaf Plot

Volunteer No./Age (y)/Sex   Body Weight (kg)   Height (cm)   ΔQF (gray tones per pixel)*   c (gray tones per pixel per milliliter)
1/27/F                       76                178           4.7                           0.049
2/32/F                       61                173           3.3                           0.035
3/31/M                       83                181           3.8                           0.040
4/28/M                       85                180           4.0                           0.042
5/44/M                      103                189           4.6                           0.048
6/32/M                       72                179           3.4                           0.036
7/23/M                       78                174           5.5                           0.058
8/26/F                       72                179           4.8                           0.050
9/28/M                       80                180           3.5                           0.037
10/23/F                      74                177           3.6                           0.038

Note.—The mean values ± SEM were as follows: age, 29 years ± 2; body weight, 78 kg ± 3; height, 179 cm ± 1; ΔQF, 4.1 gray tones per pixel ± 0.2; c (ΔQF/V), 0.043 gray tones per pixel per milliliter ± 0.003. (Adapted and reprinted, with permission, from reference 1.)

* Ninety-five milliliters of saline solution was instilled.

Figure 7. Histogram represents distribution of calcific area. Labels on the x axis indicate the calcific area in square millimeters for images with positive findings. (Reprinted, with permission, from reference 6.)


order. In a simple example, Heverhagen and colleagues (1) used the body weight in kilograms of 10 patients. The original data appear in Table 4. Figure 9 shows the making of a stem plot from these data. The stem and leaf plot looks like a histogram with horizontal bars made of numbers. The plot's primary advantage is that all of the actual values of the observations are retained. Stem and leaf plots do not work well with very large data sets because there are too many leaves, which makes it difficult both to read and to fit on a page.
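The construction just described is mechanical enough to automate. The Python sketch below builds the stem and leaf plot from the body weights in Table 4; the formatting choices are ours, not part of the original article, and the empty stem behaves as the place saver described in the caption of Figure 9.

    # Build a stem and leaf plot: tens digits are stems, units digits are leaves.
    from collections import defaultdict

    weights = [76, 61, 83, 85, 103, 72, 78, 72, 80, 74]  # body weights (kg) from Table 4

    stems = defaultdict(list)
    for w in sorted(weights):
        stems[w // 10].append(w % 10)

    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems[stem])
        print(f"{stem:3d} | {leaves}")  # stem 9 prints with no leaves (place saver)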

Box Plots

The final graphic method presented in this article is the box plot. The box plot shows the distribution of the data and is especially useful for comparing distributions graphically. It is created from a set of five numbers: the median, the 25th percentile or lower quartile, the 75th percentile or upper quartile, the minimum data value, and the maximum data value. The horizontal line in the middle of the box is the median of the measured values, the upper and lower sides of the box are the upper and lower quartiles, and the bars at the end of the vertical lines are the data minimum and maximum values.
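As a hedged sketch of that five-number construction (the data values here are hypothetical, and statistics.quantiles requires Python 3.8 or later; the "inclusive" method treats the sample itself as the population), the numbers a box plot is drawn from can be computed with the standard library:

    # Five-number summary underlying a box plot.
    import statistics

    data = [0.8, 1.0, 1.1, 1.2, 1.3, 1.3, 1.5, 1.8, 2.0, 2.4]  # hypothetical measurements

    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    five_numbers = (min(data), q1, q2, q3, max(data))
    print(five_numbers)  # (minimum, lower quartile, median, upper quartile, maximum)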

Tombach and colleagues (7) used box plots in their study of renal tolerance of a gadolinium chelate to show changes over time in the distribution of values of serum creatinine concentration and creatinine clearance across different groups of patients. Some of these box plots are shown in Figure 10. They clearly demonstrate the changing distribution and the difference in change between patient groups.

In conclusion, a statistical analysis typically requires statistics in addition to a measure of location and a measure of variability. However, the plotting of data to see their general distribution and the computing of measures of location and spread are the first steps in being able to determine the interesting relationships that exist among data sets. In this article, methods have been provided for calculating the data center with the mean, median, or mode and for calculating data spread. In addition, several graphic methods were explained that are useful both when exploring and when presenting data. By starting with describing and graphing of study data, better analysis and clear presentation of data will result; therefore, descriptive and graphic methods will improve communication of important research findings.

APPENDIX

The following is a list of the common terms and definitions related to statistical and graphic methods of describing data.

Coefficient of variation.—SD divided by the mean and then multiplied by 100%.

Descriptive statistics.—Statistics used to summarize a body of data. Contrasted with inferential statistics.

Frequency distribution.—A table that shows a body of data grouped according to numeric values.

Frequency polygon.—A graphic method of presenting a frequency distribution.

Histogram.—A bar graph that represents a frequency distribution.

Inferential statistics.—Use of sample statistics to infer characteristics about the population.

Mean.—The arithmetic average for a group of data.

Median.—The middle item in a group of data when the data are ranked in order of magnitude.

Mode.—The most common value in any distribution.

Figure 8. Representative histogram and frequency polygon constructed from hypothetical data. Alternate ways of showing the distribution of a set of data are shown (a) with both the histogram and the frequency polygon depicted or (b) with only the frequency polygon depicted.

Figure 9. Stem and leaf plot constructed from data in Table 4. Step 1 shows construction of the stem, and step 2 shows construction of the leaves. These steps result in a sideways histogram display of the data distribution, which preserves the values of the data used to construct it. The number 9 appears on the stem as a place saver; the lack of digits to the right of this stem number indicates that no values began with this number in the original data.


Nominal data.—Data with items that can only be classified into groups. The groups cannot be ranked.

Normal distribution.—A bell-shaped curve that describes the distribution of many phenomena. A symmetric curve with the highest value in the center and with set amounts of data on each side, with the mathematical property that the logarithm of its probability density is a quadratic function of the standardized error.

Percentage distribution.—A frequency distribution that contains a column listing the percentage of items in each class.

Quartile.—Value below which 25% (lower quartile) or 75% (upper quartile) of data lie.

Sample.—A subset of the population that is usually selected randomly. Measures that summarize a sample are called sample statistics.

Sampling distribution.—The distribution actually seen (often represented with a histogram) when data are drawn from an underlying population.

Standard deviation.—A measure of dispersion, the square root of the average squared deviation from the mean.

Variance.—The average squared deviation from the mean, or the square of the SD.

References
1. Heverhagen JT, Muller D, Battmann A, et al. MR hydrometry to assess exocrine function of the pancreas: initial results of noninvasive quantification of secretion. Radiology 2001; 218:61–67.
2. Kaus MR, Warfield SK, Nabavi A, Black PM, Jolesz FA, Kikinis R. Automated segmentation of MR images of brain tumors. Radiology 2001; 218:586–591.
3. Deliganis AV, Fisher DJ, Lam AM, Maravilla KR. Cerebrospinal fluid signal intensity increase on FLAIR MR images in patients under general anesthesia: the role of supplemental O(2). Radiology 2001; 218:152–156.
4. Jackman RJ, Burbank F, Parker SH, et al. Stereotactic breast biopsy of nonpalpable lesions: determinants of ductal carcinoma in situ underestimation rates. Radiology 2001; 218:497–502.
5. Fain SB, King BF, Breen JF, Kruger DG, Riederer SJ. High-spatial-resolution contrast-enhanced MR angiography of the renal arteries: a prospective comparison with digital subtraction angiography. Radiology 2001; 218:481–490.
6. Bielak LF, Sheedy PF II, Peyser PA. Coronary artery calcification measured at electron-beam CT: agreement in dual scan runs and change over time. Radiology 2001; 218:224–229.
7. Tombach B, Bremer C, Reimer P, et al. Renal tolerance of a neutral gadolinium chelate (gadobutrol) in patients with chronic renal failure: results of a randomized study. Radiology 2001; 218:651–657.

Figure 10. Box plots present follow-up data for serum creatinine concentration. Left: Within 72 hours after injection of a gadolinium chelate (group 1 baseline creatinine clearance, ≥80 mL/min [≥1.33 mL/sec]). Right: Within 120 hours after injection (group 2 baseline creatinine clearance, <30 mL/min [<0.50 mL/sec]). (Reprinted, with permission, from reference 7.)


Elkan F. Halpern, PhD
G. Scott Gazelle, MD, MPH, PhD

Index terms: Statistical analysis

Published online before print
10.1148/radiol.2261011712

Radiology 2003; 226:12–15

Abbreviation: PPV = positive predictive value

1 From the DATA Group, Department of Radiology, Massachusetts General Hospital, Zero Emerson Pl, Suite 2H, Boston, MA 02114. Received October 19, 2001; revision requested December 26; revision received March 27, 2002; accepted April 9. Address correspondence to E.F.H. (e-mail: [email protected]).

© RSNA, 2002

Probability in Radiology1

In this article, a summary of the basic rules of probability using examples of their application in radiology is presented. Those rules describe how probabilities may be combined to obtain the chance of "success" with either of two diagnostic or therapeutic procedures or with both. They define independence and relate it to the conditional probability. They describe the relationship (Bayes rule) between sensitivity, specificity, and prevalence on the one hand and the positive and negative predictive values on the other. Finally, the two distributions most commonly encountered in statistical models of radiologic data are presented: the binomial and normal distributions.
© RSNA, 2002

Radiologists routinely encounter probability in many forms. For instance, the sensitivity of a diagnostic test is really just a probability. It is the chance that disease (eg, a liver tumor) will be detected in a patient who actually has the disease. In the process of determining whether a sequence of successive diagnostic tests (say, both computed tomography [CT] and positron emission tomography) is a significant improvement over CT alone, radiologists must understand how those probabilities are combined to give the sensitivity of the combination.

Similarly, the prevalence of disease such as malignant liver cancer among patients with cirrhosis is a probability. It is the fraction of patients with a history of cirrhosis who have a malignant tumor. It can be determined simply from the number of patients with cirrhosis and the number of patients with both cirrhosis and malignant tumors of the liver or from the two prevalences.

The likelihood that a patient with a cirrhotic liver and positive CT findings has a malignant tumor (the positive predictive value [PPV] of CT in this setting) is another probability. It is determined by combining the prevalence of the disease among patients with cirrhosis with the sensitivity and specificity of CT.

In all their forms, probabilities obey a single set of rules that determine how they may be combined; this article presents an outline of these rules. The rules are also found and explained in greater detail in textbooks of biostatistics, such as that by Rosner (1).

THE EARLIEST DEFINITION OF PROBABILITY

Probability is a numeric expression of the concept of chance, in all its guises. As such, it obeys the rules of arithmetic and the logic of mathematics.

In its earliest manifestation, probability was a tool for gamblers. The earliest expressions of probability dealt with simple gambling games such as flipping a coin, rolling a die, or dealing a single card from the top of a deck. The essential property assumed of such games was that the possible outcomes were all equally likely. The probability of some event was proportional to the number of individual outcomes that comprised the event.

In more complicated scenarios, the fundamental outcomes might not be equally likely or there might be an infinite number of them. In these situations, the definition of a probability of an event became the fraction of times that the event would occur by chance. That is, it is the fraction of times in a sufficiently long sequence of trials where the chance of the event was the same in all trials and was not affected by the results of previous or subsequent trials.

With either definition, counting possible outcomes or measuring the frequency of occurrence, probability was susceptible to the laws and rules of simple arithmetic. The simplest rule of all was the rule of addition. Addition of the fraction of patients who have a single liver tumor that is malignant to the fraction of patients who have a single tumor that is benign must yield the fraction of patients who have a single tumor, either malignant or benign.


Algebraically, we express this “additive” rule for two events, A and B, as the following: If Prob(A AND B) = 0, then Prob(A OR B) = Prob(A) + Prob(B), where we use Prob(A) to denote the probability of A. The condition “Prob(A AND B) = 0” is a way of saying that the simultaneous occurrence of both A and B is impossible.

One consequence of this rule is that the chance that an event would not happen is immediately determined by the chance that it would happen. As “A” and “NOT A” (the event that A would not happen) cannot occur simultaneously, yet one or the other is certain, Prob(A) + Prob(NOT A) = 1, or Prob(NOT A) = 1 − Prob(A). Thus, the fraction of patients who do not have a single tumor is just the complement to the fraction of patients who do.

The limitation “Prob(A AND B) = 0” is necessary, as the rule does not apply without modification when it is possible for both events to be true. For instance, addition of the fraction of patients who have at least one malignant tumor to the fraction of patients who have at least one benign tumor may cause overestimation of the fraction of patients with tumors, either malignant or benign. The extent of the overestimation is given by the fraction of patients who have both benign and malignant tumors. These patients were included in the counts for both tumor-type specific fractions but should be counted only once when patients who have at least one tumor are counted. This may be best seen in Figure 1, where the added shaded areas of A and B yield a sum greater than the total shaded area. The area representing the overlap of A and B must be accounted for.

The required modification to the additive rule of probabilities is: Prob(A OR B or [A AND B]) = Prob(A) + Prob(B) − Prob(A AND B).
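The corrected rule is easy to check numerically. In the small Python fragment below, the tumor fractions are hypothetical values chosen only to show the overlap correction at work:

    # Inclusion-exclusion: Prob(A OR B) = Prob(A) + Prob(B) - Prob(A AND B).
    p_malignant = 0.15  # at least one malignant tumor (hypothetical)
    p_benign = 0.25     # at least one benign tumor (hypothetical)
    p_both = 0.05       # both kinds of tumor (hypothetical)

    p_any_tumor = p_malignant + p_benign - p_both
    print(round(p_any_tumor, 2))  # 0.35, not the naive overestimate 0.40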

SUBJECTIVE VERSUS OBJECTIVE PROBABILITIES

Before a coin is flipped, the probability that it will land heads up is 0.5. After it is flipped and is seen to have landed heads up, the probability that it landed heads up is 1 and that it landed tails up is 0. But what is the probability that it landed heads up before anyone saw which face was up? The face has been determined, and the probability is either 1 or 0 that it is heads, though which is not yet known. Yet, any gambler would be willing to bet at that point on which face had landed up in precisely the same way that he or she would have bet before the coin had been flipped. The gambler has his or her own “subjective” probability that the coin will be face up when observed, which reflects his or her beliefs concerning the probability before the coin was flipped. The subjective probabilities obey the same laws and rules as the “objective” probabilities in describing future events. Subjective probabilities may be revised as one learns more about what else may be true.

In radiology, before an examination, the probability that a patient will have an abnormality detected is a fraction between 0 and 1. After the procedure and after the image has been viewed, the probability is either 1 or 0, depending on whether an abnormality has been detected. In the interim, after the procedure but before the image has been viewed, the subjective probability is the same as the objective probability was before the procedure.

CONDITIONAL PROBABILITIES AND INDEPENDENCE VERSUS DEPENDENCE

While the probability that a randomly selected woman has an undetected malignant breast cancer at least 1 cm in diameter has real meaning, the probability is not the same for all women. It certainly is higher for women who have never undergone mammographic screening than for women who have, all other things being equal. The probability that a woman who has never been screened has such a tumor is called a “conditional” probability, because it is defined as the chance that the woman has a 1 cm or greater tumor, given that the woman has never been screened. The probability that a randomly chosen woman has such a tumor is the number of women with such tumors divided by the number of all women. The conditional probability that a randomly chosen woman who has never been screened has such a tumor is defined analogously. It is the number of women who have never been screened who have such tumors out of the number of women who have never been screened. By using Prob(A|B) to denote the conditional probability of A (a woman with a 1 cm or greater breast tumor) given B (she underwent no prior screening), we have Prob(A|B) = Prob(A AND B)/Prob(B).

This forms the basis of the definition of sensitivity and specificity. If we let A stand for a positive diagnostic examination result and B represent the actual presence of the disease, then Prob(A|B) is the sensitivity, the chance of a positive examination result among individuals with disease or of a true-positive result. The definition can also be used to calculate the chances derived from a succession of diagnostic tests. For instance, if confirmation of any positive test result, D1+, is required by means of a second positive test result, D2+, then the chance that we will obtain two positive test results is given by the “multiplicative” law of probabilities: Prob(D1+ AND D2+) = Prob(D2+|D1+) × Prob(D1+).

That is, the chance that both tests have positive results is the fraction of all second tests with positive results, once the first test had a positive result, times the chance that the first test had a positive result. The rule can be similarly used to calculate

Figure 1. Venn diagram represents the probability of either of two events based on the probabilities of each and of both. The area of the shape made by combining A and B is the total of the areas of A and B less the area where they overlap.


the chance of two successive negative findings or any other combination.

Occasionally, the chances for the second examination, D2, are unaltered by the results of the initial examination, D1. For instance, after the diagnostic image has been obtained, the interpretation by a blinded reader, D2, should be unaffected by the prior interpretation by another blinded reader, D1, of the same image. In these situations, the two interpretations are said to be independent (2).

If A and B are independent, then Prob(A) = Prob(A|B) = Prob(A|NOT B); that is, the (conditional) chance of A given that B occurs is the same as the (conditional) chance of A given that B does not occur and, as a result, also equals the (unconditional) chance of A.

A consequence is the multiplicative law of probabilities for independent events; namely, if A and B are independent, then Prob(A AND B) = Prob(A) × Prob(B).

In practice, the diagnostic tests of radiology are rarely truly independent, as CT, magnetic resonance imaging, and ultrasonography all rely similarly on the lesion size and differences in tissue characteristics. Yet, quite often, the multiplicative law is used as an approximation because the exact conditional probability has never been accurately determined in clinical trials of both diagnostic modalities.

The same rules may be used to calculate the risk of disease in the presence of multiple predictive characteristics (or cofactors). If the effects of the cofactors are synergistic, the more general multiplicative rule must be used. But if the effects of the cofactors are unrelated or independent, the multiplicative rule for independent events may be used. Similarly, the rules may be used to compute the overall chance of a successful treatment of disease with a succession of treatments based on the (conditional) chance of success of each treatment.

BAYES RULE AND POSITIVE AND NEGATIVE PREDICTIVE VALUES

While the sensitivity and specificity of a diagnostic test are important to the clinician when he or she determines which test to use, they do not entirely address the question of concern after the test has been performed. To the patient, the issue is not “How often does the test detect real disease?” but rather, “Now that the test results are known, what is the chance that I have the disease?” The patient wants to know a conditional probability that is the reverse of sensitivity. If we use Dx to denote a positive finding and D to denote the actual presence of disease, the patient is not as concerned with the sensitivity, Prob(Dx|D), as with the PPV, Prob(D|Dx), the chance that there is disease present given that the test result was positive.

Both sensitivity and specificity are considered to be inherent invariant test characteristics. In contrast, the PPV and the negative predictive value depend not only on the sensitivity and specificity but also on the prevalence of the disease, Prob(D). They may be combined by using Bayes rule (3), which relates the PPV, Prob(D|Dx), to the sensitivity, Prob(Dx|D), the specificity, 1 − Prob(Dx|NOT D), and the prevalence, Prob(D).

Prob(D|Dx) = [Prob(Dx|D) × Prob(D)] / [Prob(Dx|D) × Prob(D) + Prob(Dx|NOT D) × Prob(NOT D)].

In this equation, the denominator is the total number of expected positive findings in the population, while the numerator is the number of positive findings that accompany the actual disease.

Suppose that we had a diagnostic test used for screening that had both 95% sensitivity (positive in 95% of all cases where the disease is present) and 95% specificity (negative in 95% of all cases where the disease is absent). When the prevalence of the disease is 10%, out of every 10,000 screening examinations, we can expect to see 1,400 cases that result in a positive finding with the diagnostic test (Table). Rather than go through the laborious exercise of constructing tables, Bayes rule gives us the PPV directly. For 10% prevalence, Prob(D) = 0.10, we have PPV = (0.95 × 0.10)/[(0.95 × 0.10) + (0.05 × 0.90)] = 0.095/(0.095 + 0.045) = 0.679.
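The same arithmetic can be wrapped in a small function. This sketch simply restates the Bayes rule calculation above; the function name and structure are ours, not part of the article.

    # PPV from sensitivity, specificity, and prevalence via Bayes rule.
    def ppv(sensitivity, specificity, prevalence):
        true_pos = sensitivity * prevalence               # numerator
        false_pos = (1 - specificity) * (1 - prevalence)  # extra positives in the denominator
        return true_pos / (true_pos + false_pos)

    print(round(ppv(0.95, 0.95, 0.10), 3))  # 0.679, matching the worked example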

P VALUES, POWER, AND BAYESIAN STATISTICS

Bayes rule applies for any two events, A and B, not just positive findings, Dx, and presence of disease, D. The distinction between the Prob(A|B) and the Prob(B|A) also forms the basis of the difference between conventional and Bayesian statistical analysis of a clinical trial. In conventional statistical analysis of the results of a study, B represents the null hypothesis. After the study has been performed and the results A have been observed, the conventional decision regarding the truth of B is based on the likelihood of A*, any result as extreme or even more extreme than A, given that B is true, Prob(A*|B). This probability is known as the P value. The less likely any result as extreme as A is, given B, the stronger the evidence that B is not true. Conventionally, some cutoff (known as the level of significance) is set in advance, and the study is deemed to have significant findings if the P value is smaller than the cutoff.

In a conventional analysis, the probability of significant study results, S, conditional on B being false, Prob(S|NOT B), is known as the power. It is not a factor in the conclusion drawn from the study. Rather, it is the major factor in the design before the study is conducted. It determines the number of patients in the study. The study sample size is chosen to provide the desired chance of successfully showing that B is not true.

Bayesian statistics differs from conventional statistics insofar as it depends on the probability that the hypothesis holds given the observed results of the study or studies. This probability is calculated by means of the Bayes rule. In order to be able to calculate it, the (subjective) probability reflecting the prior (before the study) chance or belief that the hypothesis was true is required. Much of the dispute regarding the use of Bayesian analysis centers around the possibility of conclusions that might be largely dictated by “opinion” before hard data are obtained.

Example of Calculation of Positive and Negative Predictive Values Based on Expected Number of Cases Derived from Prevalence, Sensitivity, and Specificity

Disease        Positive Finding B   Negative Finding NOT B   Prevalence*
Present A      TP, 950              FN, 50                   1,000 (10)
Absent NOT A   FP, 450              TN, 8,550                9,000 (90)
Total          1,400                8,600                    10,000

Note.—Prevalence of disease is 10% out of every 10,000 screening examinations. A and B are events. FN = false-negative, FP = false-positive, TN = true-negative, TP = true-positive.

* Data in parentheses are percentages.


P values, power, and Bayesian analysis will be presented in later articles in this series. Software for Bayesian probabilities may be found at the University of Sheffield Web site at www.shef.ac.uk/~st1ao/1b.html.

PROBABILITY FOR CONTINUOUS OUTCOMES

The interpretation of the probability that a tumor consists of a specified number of cells differs in one essential regard from the interpretation of the probability that the tumor is a specified diameter or volume. The number of cells is “discrete” in the sense that it can only be an integer value. No fractional number of cells is possible. As such, each possible number of cells has its own probability, and if that number of cells is possible, the probability is greater than 0. But the diameter or volume of a tumor can take on all fractional values, as well as an integer. The probability that the tumor is between 2 and 3 cm in diameter could be treated in the same way as all of the probabilities that we have been discussing. However, the probability that the tumor is exactly π cm to the last decimal place (or any other exact value, even, say, 3 cm to the last decimal place) has to differ in meaning.

For continuous measures such as the diameter or volume of a tumor or time of an occurrence, the probability that it exactly equals a single value is analogous to the distance traveled or the radiation received in an instant. Instead, one can express the rate at each value and calculate the probability of any interval just as one calculates the distance traveled over any time interval from the instantaneous speed or the total radiation exposure from the instantaneous rate of exposure.

Happily, all of the rules discussed for discrete outcomes apply equally well to continuous ones, both for intervals and for the rates at specified values.

DISTRIBUTIONS

These rules may be used to provide formulas for calculating the distribution, the probabilities for all possible outcomes, under a variety of circumstances. Two of the distributions most commonly encountered by radiologists are the binomial and the normal distributions.

The binomial distribution, B(i|n,p) (4), describes the probability of an event occurring i times out of n tries, where the chance, p, of the event is the same for all tries and the occurrence of the event in one try is unrelated (independent) to its occurrence in any other try. The distribution then gives the probabilities of each possible number of occurrences of the event out of n cases. The specific formula for the probability of i events out of n cases, B(i|n,p), is B(i|n,p) = [n!/(i!(n − i)!)]p^i(1 − p)^(n − i), where ! indicates factorial, as in n! = n × (n − 1) × (n − 2) × . . . × 2 × 1.

In practice, this distribution is built into most statistical and spreadsheet software packages. For instance, by using Microsoft Excel software, B(i|n,p) is calculated by the function BINOMDIST.

A binomial distribution that a radiologist might encounter is the probability of detecting i cancers with screening. In order for the binomial distribution to apply, the sensitivity would have to be identical for all of the cancers. Additionally, the detection (or nondetection) of any one cancer could not influence the chance that another was detected. If both conditions applied and there were n actual cases of cancer among those screened, B(i|n,p) would be the chance that i cancers were detected if the sensitivity of the technique was given by p. Thus, if there were actually eight cancers and the sensitivity of screening was 70%, the chance that exactly six of those eight cancers were detected is B(6|8, 0.7) = [8!/(6!(8 − 6)!)]0.7^6(1 − 0.7)^(8 − 6) = 0.296. The full distribution is depicted in Figure 2.
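The worked example can be verified directly from the formula, without a spreadsheet. This sketch uses Python's standard library (math.comb, available in Python 3.8 and later) in place of the BINOMDIST function mentioned above:

    # B(i|n,p) = [n!/(i!(n - i)!)] p^i (1 - p)^(n - i)
    from math import comb

    def binomial_prob(i, n, p):
        return comb(n, i) * p**i * (1 - p)**(n - i)

    print(round(binomial_prob(6, 8, 0.7), 3))  # 0.296, as in the text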

The normal, or Gaussian (2,5,6), distribution describes the probabilities for a continuous outcome that is the result of averaging out a very large number of independent random contributions. The background component to the number of x rays detected in a square millimeter of plain film is normally distributed. It is commonly described as a “bell-shaped” curve.

The distribution depends on two values or parameters: the mean, μ, and the SD, σ. (See the preceding article [7] in this series.) The mean determines the location of the high point of the curve. The SD gives the scale. The height of the curve at any point, x, is determined by the “z score,” the difference of x and μ in units of σ. That is, the height depends only on z = (x − μ)/σ.

Again, the height is found as a function in most spreadsheets and statistical software. With Excel software, it is given by NORMDIST. The distribution is depicted in Figure 3.
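For readers who prefer code to spreadsheet functions, the height of the curve can be written out from the z score directly. In this sketch the mean and SD are arbitrary illustrative values:

    # Normal density at x, expressed through the z score z = (x - mu)/sigma.
    from math import exp, pi, sqrt

    def normal_density(x, mu, sigma):
        z = (x - mu) / sigma
        return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

    print(round(normal_density(220, 200, 10), 5))  # 0.0054, two SDs above the mean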

References
1. Rosner B. Fundamentals of biostatistics. 5th ed. Pacific Grove, Calif: Duxbury, 2000.
2. DeMoivre A. The doctrine of chance. London, England: W. Pearson, 1718.
3. Bayes T. An essay toward solving a problem in the doctrine of chances. Philos Trans R Soc London 1763; 53:370–418. (Reprinted in Biometrika 1958; 45:293–315.)
4. Bernoulli J. Ars conjectandi. Basel, Switzerland: 1713. (Reprinted in: Die werke von Jakob Bernoulli. Vol 3. Basel, Switzerland: Birkhauser Verlag, 1975; 106–286.)
5. Laplace PS. Theorie analytique des probabilites. Paris, France: Ve. Courcier, 1812.
6. Gauss CF. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Hamburg, Germany: F. Perthes et I. H. Besser, 1809.
7. Applegate KE, Crewson PE. An introduction to biostatistics. Radiology 2002; 225:318–322.

Figure 2. Binomial distribution, B(i|n,p), for the number of cancers detected out of a total of eight, if the sensitivity of the detector is p = 0.70.

Figure 3. The normal, or Gaussian, distribution.


L. Santiago Medina, MD, MPH
David Zurakowski, PhD

Index terms: Radiology and radiologists, research; Statistical analysis

Published online before print
10.1148/radiol.2262011537

Radiology 2003; 226:297–301

Abbreviations: ACL = anterior cruciate ligament; OR = odds ratio; PSV = peak systolic velocity; SEM = standard error of the mean

1 From the Department of Radiology and Health Outcomes Policy and Economics (H.O.P.E.) Center, Children’s Hospital, 3100 SW 62nd Ave, Miami, FL 33155; and Departments of Orthopaedic Surgery and Biostatistics, Children’s Hospital, Harvard Medical School, Boston, Mass. Received September 17, 2001; revision requested November 12; revision received December 17; accepted January 21, 2002. Address correspondence to L.S.M. (e-mail: [email protected]).

© RSNA, 2003

Measurement Variability and Confidence Intervals in Medicine: Why Should Radiologists Care?1

In radiology, appropriate diagnoses are often based on quantitative data. However, these data contain inherent variability. Radiologists often see P values in the literature but are less familiar with other ways of reporting statistics. Statistics such as the SD and standard error of the mean (SEM) are commonly used in radiology, whereas the CI is not often used. Because the SEM is smaller than the SD, it is often inappropriately used in order to make the variability of the data look tighter. However, unlike the SD, which quantifies the variability of the actual data for a single sample, the SEM represents the precision for an estimated mean of a general population taken from many sample means. Since readers are usually interested in knowing about the variability of the single sample, the SD often is the preferred statistic. Statistical calculations combine sample size and variability (ie, the SD) to generate a CI for a population proportion or population mean. CIs enable researchers to estimate population values without having data from all members of the population. In most cases, CIs are based on a 95% confidence level. The advantage of CIs over significance tests (P values) is that the CIs shift the interpretation from a qualitative judgment about the role of chance to a quantitative estimation of the biologic measure of effect. Proper understanding and use of these fundamental statistics and their calculations will allow more reliable analysis, interpretation, and communication of clinical information among health care providers and between these providers and their patients.
© RSNA, 2003

Radiologists and physicians rely heavily on quantitative data to make specific diagnoses. Furthermore, the patients and the referring physicians place trust in the radiologist’s assessment of these quantitative data for determination of appropriate treatment. For example, a radiologist can inform a patient that he or she has a significant stenosis of the carotid artery because the peak systolic velocity (PSV) determined by using ultrasonography (US) is 280 cm/sec. The radiologist’s final diagnosis of a significant stenosis of the carotid artery may suggest that treatment with endarterectomy or stent placement is indicated for this patient. But how reliable is the PSV of 280 cm/sec as a true estimate of significant stenosis of the carotid artery? How does this value help in the differentiation between stenosis and nonstenosis? What is the variability of the PSV when stenosis is present? Which population was included in the study and what was the sample size for determining normal and abnormal arterial lumen size? These are very important questions because the well-being of the patients and the credibility of the radiologists depend, in part, on a clear understanding of the answers to these questions.

Precise knowledge of important statistical parameters, such as the SD, the standard error of the mean (SEM), and the CIs, will provide the radiologist with answers to the questions previously posed. Most of these parameters can be quickly and easily obtained with a small calculator. In addition, these parameters are useful while reading the literature. Appropriate understanding and use of these fundamental statistics, namely, the SD, the SEM, and the CI, will allow more reliable analysis, interpretation, and communication of clinical information among health care providers and between these providers and their patients.


WHAT ARE RANDOM SAMPLING AND THE CENTRAL LIMIT THEOREM?

Obtaining a sample that is representative of a larger population is key in any study design. A random sample is a sample chosen to minimize bias (1, pp 248–277). A simple random sample is a sample in which every subject of the population has an equal chance of being included in the sample (1, pp 248–277). The only way to be sure of a representative sample is to select the subjects at random, so that whether or not each subject in the population is chosen for the sample is purely a matter of chance and is not based on the subject’s characteristics (2, pp 33–36).

Random sampling has other advantages. Because the sample is randomly selected, the methods of probability theory can be applied to the data obtained. This enables the clinician to estimate the likely size of the errors that may occur, for example, with the SD or CIs, and to present them as part of the results (2, pp 33–36).

In general, if one has any series of independent identically distributed random variables, then their sum tends to produce a normal distribution as the number of variables increases (2, pp 116–120) (Fig 1). This fundamental theorem in statistics is known as the central limit theorem (1, pp 248–277; 2, pp 116–120). Simply stated, as sample size increases, the means of samples from a population of any distribution will approach the normal (Gaussian) distribution. This is an important property because it allows clinicians to use the normal distribution to formulate inferences from the data about means of populations. In addition, the variability of means of samples obtained from a population decreases as the sample size increases. However, the sample size required to make use of the central limit theorem depends on the underlying distribution of the population, and skewed populations require larger samples.

For example, suppose a group of radiologists want to study PSV in the common carotid artery of children in a small Amazon Indian tribe by using Doppler US spectrum analysis. Figure 1 shows that as the number of children selected in each series of random samples increases, the sum of these numbers tends to produce a normal distribution, as shown by the bell-shaped Gaussian distribution of PSV in the pediatric population sampled. As the sample size increases, the mean and SD come closer to the population’s mean and SD (Fig 2).
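The central limit theorem is easy to watch in a small simulation. The sketch below draws repeated samples from a hypothetical uniform (hence non-Gaussian) PSV population and shows the spread of the sample means shrinking as the sample size grows; all numbers are illustrative.

    # Means of larger samples cluster more tightly around the population mean.
    import random
    import statistics

    random.seed(0)
    population = [random.uniform(60, 140) for _ in range(10000)]  # hypothetical PSV, cm/sec

    for n in (5, 25, 100):
        means = [statistics.mean(random.sample(population, n)) for _ in range(1000)]
        print(n, round(statistics.stdev(means), 2))  # spread of sample means shrinks with n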

WHAT IS THE DIFFERENCE BETWEEN THE SD AND THE SEM?

The SD and the SEM measure two very different entities, but clinicians often confuse them. Some medical researchers summarize their data with the SEM because it is always smaller than the SD (3). Because the SEM is smaller, it often is inappropriately used to make the variability of the data look tighter. This kind of reporting of statistics should be discouraged.

The following example is given to illustrate the difference between the SD and the SEM and why one should summarize data by using the SD. Suppose that, in a study sample of patients with atherosclerotic disease, an investigator reported that the PSV in the carotid artery was 220 cm/sec and the SD was 10. Since the PSV in about 95% of all population members is within roughly 2 SDs of the mean, the results would tell one that, assuming the distribution is approximately normal, it would be unusual to observe a PSV less than 200 cm/sec or greater than 240 cm/sec in moderate atherosclerotic disease of the carotid artery. Therefore, a summary of the population and a range with which to compare

Figure 1. Graph shows PSV of the common carotid artery in an Amazon Indian population. Note that as the sample size increases from 5 to 100 subjects, the SD decreases and the 95% CI becomes narrower.


specific patients who are examined by the clinician are described in the article.

Unfortunately, the investigator is quite likely to say that the PSV of the common carotid artery was 220 cm/sec ± 1.6 (SEM). If one confused the SEM with the SD, one would believe that the range of most of the population was narrow, between 216.8 and 223.2 cm/sec. These values describe the range that about 95% of the time includes the mean PSV of the entire population from which the sample of patients was chosen. The SEM is simply a measure of how far the sample mean is likely to be from the actual population mean. In practice, however, one generally wants to compare an individual patient’s PSV with the spread of the population distribution as a whole and not with the population mean (3). This information is provided by the SD and not by the SEM.
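The relationship behind this example is SEM = SD/√n. As a quick check (the sample size n = 39 is inferred here from the reported SD of 10 and SEM of 1.6; the article does not state it):

    # SEM = SD / sqrt(n); with SD = 10 and n = 39 (inferred), SEM is about 1.6.
    from math import sqrt

    sd, n = 10.0, 39
    sem = sd / sqrt(n)
    print(round(sem, 1))  # 1.6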

WHAT ARE CIs?

Most biomedical research relies on the premise that what is true for a randomly selected sample from a population will be true, more or less, for the population from which the sample was chosen (1, pp 55–63). Therefore, measurements in the sample are used to estimate the characteristics of the population included in the study. The reliability of the results obtained from a sample is addressed by constructing CIs around statistics of the sample. The amount of variation associated with an estimate determined from a sample can be expressed by a CI.

A CI is the range of values that is believed to encompass the actual (“true”) population value (1, pp 55–63). This true population value or parameter of interest usually is not known, but it does exist and can be estimated from an appropriately selected sample. CIs around population estimates provide information about how precise the estimate is. Wider CIs indicate lesser precision, while narrower ones indicate greater precision (Figs 1, 2). CIs provide bounds to estimates.

If one repeatedly obtained samples from the population and constructed CIs for each sample, then one could expect a certain percentage of the CIs to include the value of the true population parameter and a certain percentage of them not to include that value. For example, with a 95% CI, one can expect 95% of such CIs obtained in repeated sampling to include the true parameter value and only 5% of the CIs not to include the true parameter value.

HOW ARE CIs FOR A MEAN CALCULATED?

The mean of a set of measurements is calculated from the sample of patients. Therefore, the mean one calculates is unlikely to be exactly equal to the population mean. The size of the discrepancy depends on the size and variability of the sample (3, pp 163–190; 4). If the sample is small and variable, the sample mean may be far from the population mean. If the sample is large with little scatter, the sample mean will probably be very close to the population mean. Statistical calculations combine sample size and variability (ie, SD) to generate a CI for the population mean.

One can calculate an interval for any desired degree of confidence, although 95% CIs are by far the most commonly used. The following equation is the usual method for calculating a 95% CI that is based on a normally distributed sample with a known SD or one that is based on a sample from a population with an unknown SD but in which the population is known to be normally distributed and the sample itself is large (ie, n > 100):

95% CI = mean ± z × (sample SD/√n), (1)

where z (standardized score) is the value of the standard normal distribution with the specific level of confidence. For a 95% CI, z = 1.96 (approximately 2.0).

The scale of z scores is independent of the units of measurement. Therefore, for any measurement being investigated, one can calculate an individual’s z score and compare it with that of other individuals. The z scores are calculated from the sample data as (X − mean)/SD, where X is the actual individual’s value. For example, if an individual’s value is 1 SD above the mean for the group, that individual’s z score is 1.0; a value 1 SD below the mean corresponds to a z score of −1.0. Approximately 68.0% of the area under the normal curve includes z scores between −1.0 and 1.0, approximately 95.0% of the area includes z scores between −2.0 and 2.0, and 99.7% of the area under the normal curve includes z scores between −3.0 and 3.0.
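Equation (1) translates directly into code. The sketch below applies it to the earlier PSV example with a mean of 220 cm/sec and an SD of 10; the sample size of 100 is hypothetical:

    # z-based 95% CI for a mean: mean ± z × (SD/√n).
    from math import sqrt

    def ci_mean(mean, sd, n, z=1.96):
        half_width = z * (sd / sqrt(n))
        return mean - half_width, mean + half_width

    lo, hi = ci_mean(220, 10, 100)
    print(round(lo, 2), round(hi, 2))  # 218.04 221.96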

Equation (1) can be applied when the data conform to a normal (Gaussian) distribution and when the population SD is known. When the sample is small (n < 100) and information regarding the parametric SD is not known, one must rely on the sample SD, which requires setting CIs by using the t distribution. In this situation, the z value should be replaced with the appropriate critical value of the t distribution with n − 1 degrees of freedom, where n is the sample size.

The t distribution, or the Student t distribution, resembles the normal distribution, although its shape depends on the sample size. It is wider than the normal

Figure 2. Graph summarizes the data from Figure 1. As the sample size increases, the sample mean and SD are a closer representation of the population’s mean and SD. In addition, as the sample size increases, the 95% CI narrows.


distribution to account for variability in estimating the mean and SD from the sample data (5). The t distribution differs from the normal distribution in that it assumes different shapes depending on the number of degrees of freedom. Therefore, when setting a CI around a mean, the appropriate critical value of the t distribution should be used in place of the z value in Equation (1). This t value can be found in a conventional t table included in most statistical textbooks. For example, in a study with a sample size of 25, the critical value for a t distribution that corresponds to a 95% CI, where 1 − α is the confidence level and n − 1 indicates the degrees of freedom, is 2.064.

CIs can be constructed for any desired level of confidence. There is nothing magical about 95%, although it is traditionally used. If greater confidence is needed, then the CIs have to be wider. Consequently, 99% CIs are wider than 95% CIs, and 90% CIs are narrower than 95% CIs. Wider CIs are associated with greater confidence but less precision. This is the trade-off.

If one assumes that a sample was randomly selected from a certain population (that follows a normal distribution), one can be 95% sure that the CI includes the population mean. More precisely, if one generates many 95% CIs from many data sets, one can expect that the CI will include the true population mean in 95% of the cases and that the CI will not include the true mean value in the other 5%. Therefore, the 95% CI is related to statistical significance at the .05 level, which means that the CI itself can be used to determine if an estimated change is statistically significant at the .05 level (1, pp 55–63).

Whereas the P value is often interpreted as an indication of a statistically significant difference, the CI, by providing a range of values, allows the reader to interpret the implications of the results at either end of the range (1, pp 55–63; 6). For example, if one end of the range includes clinically important results but the other does not, the results can be regarded as inconclusive, not simply as an indication of a statistically significant difference or not. In addition, whereas P values are not presented in units, CIs are presented in the units of the variable of interest, and this latter presentation helps readers to interpret the results. CIs are generally preferred to P values because CIs shift the interpretation from a qualitative judgment about the role of chance to a quantitative estimation of the biologic measure of effect (1, pp 55–63; 6). More importantly, the CI quantifies the precision of the mean.

For example, findings in two hypothetical articles about US in the carotid artery in elderly patients indicate that a mean PSV of 200 cm/sec is associated with a 70% stenosis of the vascular diameter. Both articles reported the same SD of 50 cm/sec. However, one article was about a study that included 50 subjects, whereas the other one was about a study that included 500 subjects. At first glance, both articles appear to have the same information. This is delineated with the calculations here.

The calculations in the article with the smaller sample were as follows:

95% CI = 200 ± 1.96 × (50/√50),

95% CI = 200 ± 14.

The calculations in the article with the larger sample were as follows:

95% CI = 200 ± 1.96 × (50/√500),

95% CI = 200 ± 4.

However, in the article with the smaller sample, the 95% CI was 186 to 214 cm/sec, whereas in that with the larger sample, the 95% CI was 196 to 204 cm/sec. Therefore, the article with the larger sample has a narrower 95% CI.

WHY ARE CIs FOR SENSITIVITY AND SPECIFICITY OF A TEST IMPORTANT?

Most radiologists are familiar with the basic concepts of specificity and sensitivity and use them to evaluate the diagnostic accuracy of diagnostic tests in clinical practice. Since sensitivity and specificity are proportions, CIs can be calculated and should be reported in all research articles. CIs are needed to help one to be more certain about the clinical value of any screening or diagnostic test and to decide to what degree one can rely on the results. Omission of the precision of the sensitivity and specificity in a particular study can make a difference in the interpretation of the findings of that study (7).

The simplest diagnostic test is dichotomous, in which the results are used to classify patients into two groups according to the presence or absence of disease. Magnetic resonance (MR) imaging and arthroscopic findings from a hypothetical example are delineated in Table 1. In this hypothetical study, arthroscopy is considered the standard of reference. The question that arises in the clinical setting is, “How good is knee MR imaging at helping to distinguish torn and intact ACLs?” In other words, “To what degree can one rely on the interpretation of MR imaging in making judgments about the status of a patient’s knee?”

One method of measuring the value of MR imaging in the detection of ACL tears is to calculate the proportion of torn ACLs and the proportion of intact ACLs that were correctly classified by using MR imaging. These proportions are known as the sensitivity and specificity of a test, respectively.

Sensitivity is calculated as the proportion of torn ACLs that were correctly classified by using MR imaging. In this example, of the 421 knees with ACL tears, 394 were correctly evaluated with MR imaging (Table 1). The sensitivity of MR imaging in the detection of ACL tears is, therefore, 94% (ie, sensitivity = 394/421 = 0.94). In other words, 94% of ACL tears were correctly classified as torn by using MR imaging. The 95% CI for a proportion can be determined by the equation shown here:

95% CI = p ± z × √[p(1 − p)/n]. (2)

By using Equation (2), the 95% CI for sensitivity is 0.94 ± 0.02, or 0.92 to 0.96. Therefore, one expects MR imaging to have a sensitivity between 92% and 96%.
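Equation (2) can likewise be checked in a few lines. Computing from the unrounded proportion gives a lower limit near 0.91; the text rounds the sensitivity to 0.94 before adding and subtracting 0.02, hence its 0.92 to 0.96.

    # 95% CI for a proportion: p ± z × sqrt(p(1 - p)/n).
    from math import sqrt

    def ci_proportion(successes, n, z=1.96):
        p = successes / n
        half_width = z * sqrt(p * (1 - p) / n)
        return p - half_width, p + half_width

    lo, hi = ci_proportion(394, 421)  # sensitivity example from Table 1
    print(round(lo, 2), round(hi, 2))  # 0.91 0.96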

TABLE 1
Hypothetical Example: MR Imaging Depiction of ACL Tear

                      Arthroscopic Findings
MR Imaging Findings   Torn ACL   Intact ACL   Total
Torn ACL              394        32           426
Intact ACL            27         101          128
Total                 421        133          554

Note.—Data are numbers of knees. Arthroscopic findings were the standard of reference. ACL = anterior cruciate ligament.


Specificity is calculated as the proportion of intact ACLs that were correctly classified by using MR imaging. Of the 133 knees with an intact ACL, 101 were correctly classified. The specificity of MR imaging is, therefore, 76% (ie, specificity = 101/133 = 0.76). This means that 76% of intact ACLs were correctly classified as intact by using MR imaging. By using Equation (2), the 95% CI for specificity is 0.76 ± 0.07, or 0.69 to 0.83. Therefore, one expects MR imaging to have a specificity between 69% and 83%. It is also important to note that the CI was wider for specificity than it was for sensitivity because the sample groups were 133 (smaller) and 421 (larger), respectively.

CAN CIs FOR ODDS RATIOS BE CALCULATED?

CIs can also be calculated around risk measures, such as the relative risk or the odds ratio (OR). Consider an example in which radiographic effusion is examined to ascertain whether it is useful in the differentiation between septic arthritis and transient synovitis among children who have acute hip pain at presentation.

Data from a study by Kocher et al (8) are shown in Table 2 and can be summarized as follows: For those patients with radiographic effusion, the odds of having septic arthritis are 63/33 (effusion/no effusion) = 1.9. The odds of having septic arthritis for those with no radiographic effusion are 19/53 (effusion/no effusion) = 0.36. The OR is the ratio of these two odds: 1.9/0.36 = 5.3. This means that children with a radiographic effusion are approximately five times more likely to have septic arthritis than those without a radiographic effusion. The OR is sometimes referred to as the cross product ratio because it can be calculated by means of multiplication of the counts in the diagonal cells and division of data as follows (Table 2): OR = ad/bc = (63 × 53)/(33 × 19) = 3,339/627 = 5.3.

The OR is only a single number, that is, a “point estimate.” The precision of this estimate can be described with a CI, which describes the statistical significance of the association between two variables within a specific range. The width of the CI reflects the amount of variability inherent in the OR. There is a trade-off between precision and confidence. Wider CIs provide greater certainty but are less precise. Narrower CIs are more precise but less certain that the truth is within the CI. In radiology and medicine, the most commonly reported CI corresponding to the OR is the 95% CI.

Several methods are commonly used to construct CIs around the OR. A simple method for constructing CIs (8) can be expressed as follows:

CI = (OR)exp[±z√(1/a + 1/b + 1/c + 1/d)], (3)

where z is the value of the standard normal distribution with the specific level of confidence, and exp is the base of the natural logarithm (often symbolized as e).

By using Equation (3), the 95% CI in our example is calculated as follows:

95% CI = loge(OR) ± 1.96√(1/63 + 1/33 + 1/19 + 1/53)

95% CI = loge(5.3) ± 1.96(0.348)

95% CI = loge(5.3) ± 0.682

LL = 1.67 − 0.682 = 0.988

UL = 1.67 + 0.682 = 2.352

LL = e^0.988 = 2.69

UL = e^2.352 = 10.50,

where e is 2.718, LL represents the lower limit, and UL represents the upper limit.

Therefore, among children who have acute hip pain at presentation, those with a radiographic effusion are, on average, 5.3 times more likely to have septic arthritis compared with those with no radiographic effusion. The 95% CI lower limit of the OR is 2.7 and the upper limit is 10.5. When the 95% CI does not include 1.0 (as in this example), the results indicate a statistically significant difference at the .05 level (ie, P < .05).
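The full OR calculation, Equation (3) included, fits in a short function. Carrying full precision through gives an upper limit of about 10.4 rather than the hand-rounded 10.5 above; the function itself is our sketch, not part of the article.

    # OR and its 95% CI from the 2 x 2 counts in Table 2 (a = 63, b = 33, c = 19, d = 53).
    from math import exp, log, sqrt

    def odds_ratio_ci(a, b, c, d, z=1.96):
        odds_ratio = (a * d) / (b * c)
        se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
        lower = exp(log(odds_ratio) - z * se_log_or)
        upper = exp(log(odds_ratio) + z * se_log_or)
        return odds_ratio, lower, upper

    odds_ratio, lower, upper = odds_ratio_ci(63, 33, 19, 53)
    print(round(odds_ratio, 1), round(lower, 1), round(upper, 1))  # 5.3 2.7 10.4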

CONCLUSION

The SD and SEM measure different parameters. The two are commonly confused in the medical literature. The SD can be thought of as a descriptive statistic that indicates the variation among measurements taken from a sample (1, pp 55–63). Investigators should not report summary statistics in terms of the SEM. The 95% CI is the preferred statistic for indicating the precision of an estimate of a population characteristic (1, pp 55–63).

CIs can be calculated for means as well as for proportions. Proportions commonly used in medicine include sensitivity, specificity, and the OR. Proportions should always be accompanied by 95% CIs. Proper understanding and use of fundamental statistics, such as the SD, the SEM, and the CI, and their calculations will allow more reliable analysis, interpretation, and communication of clinical data to patients and to referring physicians.

References
1. Lang TA, Secic M. How to report statistics in medicine. Philadelphia, Pa: American College of Physicians, 1997.
2. Bland M. An introduction to medical statistics. Oxford, England: Oxford Medical Publications, 1987.
3. Glantz SA. Primer of biostatistics. 2nd ed. New York, NY: McGraw-Hill, 1987.
4. Rosner B. Fundamentals of biostatistics. 4th ed. Belmont, Calif: Duxbury, 1995; 141–190.
5. Sokal RR, Rohlf FJ. Biometry. 3rd ed. New York, NY: Freeman, 1995; 143–150.
6. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986; 292:746–750.
7. Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 1999; 318:1322–1323.
8. Kocher MS, Zurakowski D, Kasser JR. Differentiating between septic arthritis and transient synovitis of the hip in children: an evidence-based prediction algorithm. J Bone Joint Surg Am 1999; 81:1662–1670.

TABLE 2
Data for Calculating OR and CI

Radiographs     Septic Arthritis    Transient Synovitis    Total
Effusion        63 (a)              33 (b)                 96 (a + b)
No effusion     19 (c)              53 (d)                 72 (c + d)
Total           82 (a + c)          86 (b + d)             168

Note.—Data are numbers of patients examined at radiography. The letters and/or symbols in parentheses are the variables that represent the given value in the equations used to calculate the OR and CI: OR = ad/bc = (63 × 53)/(33 × 19) = 3,339/627 = 5.3. CI = (OR) exp[±z√(1/a + 1/b + 1/c + 1/d)]. Adapted and reprinted, with permission, from Kocher et al (8).


Kelly H. Zou, PhD
Julia R. Fielding, MD2
Stuart G. Silverman, MD
Clare M. C. Tempany, MD

Index term: Statistical analysis

Published online before print
10.1148/radiol.2263011500

Radiology 2003; 226:609–613

Abbreviations:
H0 = null hypothesis
H1 = alternative hypothesis
RDOG = Radiologic Diagnostic Oncology Group

1 From the Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Mass (K.H.Z., J.R.F., S.G.S., C.M.C.T.); and Department of Health Care Policy, Harvard Medical School, 180 Longwood Ave, Boston, MA 02115 (K.H.Z.). Received September 10, 2001; revision requested November 8; revision received December 12; accepted December 19. Supported in part by Public Health Service Grant NIH-U01 CA9398-03 awarded by the National Cancer Institute, Department of Health and Human Services. Address correspondence to K.H.Z. (e-mail: [email protected]).
Current address: 2 Department of Radiology, University of North Carolina at Chapel Hill.
© RSNA, 2003

Hypothesis Testing I: Proportions1

Statistical inference involves two analysis methods: estimation and hypothesis testing, the latter of which is the subject of this article. Specifically, Z tests of proportion are highlighted and illustrated with imaging data from two previously published clinical studies. First, to evaluate the relationship between nonenhanced computed tomographic (CT) findings and clinical outcome, the authors demonstrate the use of the one-sample Z test in a retrospective study performed with patients who had ureteral calculi. Second, the authors use the two-sample Z test to differentiate between primary and metastatic ovarian neoplasms in the diagnosis and staging of ovarian cancer. These data are based on a subset of cases from a multiinstitutional ovarian cancer trial conducted by the Radiologic Diagnostic Oncology Group, in which the roles of CT, magnetic resonance imaging, and ultrasonography (US) were evaluated. The statistical formulas used for these analyses are explained and demonstrated. These methods may enable systematic analysis of proportions and may be applied to many other radiologic investigations.
© RSNA, 2003

Statistics often involve a comparison of two values when one or both values are associated with some uncertainty. The purpose of statistical inference is to aid the clinician, researcher, or administrator in reaching a conclusion concerning a population by examining a sample from that population. Statistical inference consists of two components, estimation and hypothesis testing, and the latter component is the main focus of this article.

Estimation can be carried out on the basis of sample values from a larger population (1). Point estimation involves the use of summary statistics, including the sample mean and SD. These values can be used to estimate intervals, such as the 95% confidence interval. For example, by using summary statistics, one can determine the sensitivity or specificity of the size and location of a ureteral stone for prediction of the clinical management required. In a study performed by Fielding et al (2), it was concluded that stones larger than 5 mm in the upper one-third of the ureter were very unlikely to pass spontaneously.

In contrast, hypothesis testing enables one to quantify the degree of uncertainty in sampling variation, which may account for results that deviate from the hypothesized values in a particular study (3,4). For example, hypothesis testing would be necessary to determine if ovarian cancer is more prevalent in nulliparous women than in multiparous women.

It is important to distinguish between a research hypothesis and a statistical hypothesis. The research hypothesis is a general idea about the nature of the clinical question in the population of interest. The primary purpose of the statistical hypothesis is to establish the basis for tests of significance. Consequently, there is also a difference between a clinical conclusion based on a clinical hypothesis and a statistical conclusion of significance based on a statistical hypothesis. In this article, we will focus on statistical hypothesis testing only.

In this article we review and demonstrate the hypothesis tests for both a single proportion and a comparison of two independent proportions. The topics covered may provide a basic understanding of the quantitative approaches for analyzing radiologic data. Detailed information on these concepts may be found in both introductory (5,6) and advanced textbooks (7–9). Related links on the World Wide Web are listed in Appendix A.

STATISTICAL HYPOTHESIS TESTING BASICS

A general procedure is that of calculating the probability of observing the difference between two values if they really are not different. This probability is called the P value, and this condition is called the null hypothesis (H0). On the basis of the P value and whether it is low enough, one can conclude that H0 is not true and that there really is a difference. This act of conclusion is in some ways a "leap of faith," which is why it is known as statistical significance. In the following text, we elaborate on these key concepts and the definitions needed to understand the process of hypothesis testing.

There are five steps necessary for conducting a statistical hypothesis test: (a) formulate the null (H0) and alternative (H1) hypotheses, (b) compute the test statistic for the given conditions, (c) calculate the resulting P value, (d) either reject or do not reject H0 (reject H0 if the P value is less than or equal to a prespecified significance level [typically .05]; do not reject H0 if the P value is greater than this significance level), and (e) interpret the results according to the clinical hypothesis relevant to H0 and H1. Each of these steps is discussed in the following text.

Null and Alternative Hypotheses

In general, H0 assumes that there is no association between the predictor and outcome variables in the study population. In such a case, a predictor (ie, explanatory or independent) variable is manipulated, and this may have an effect on another outcome or dependent variable. For example, to determine the effect of smoking on blood pressure, one could compare the blood pressure levels in nonsmokers, light smokers, and heavy smokers.

It is mathematically easier to frame hypotheses in null and alternative forms, with H0 being the basis for any statistical significance test. Given the H0 of no association between a predictor variable and an outcome variable, a statistical hypothesis test can be performed to estimate the probability of an association due to chance that is derived from the available data. Thus, one never accepts H0, but rather one rejects it with a certain level of significance.

In contrast, H1 makes a claim that there is an association between the predictor and outcome variables. One does not directly test H1, which is by default accepted when H0 is rejected on the basis of the statistical significance test results.

One- and Two-sided Tests

The investigator must also decide whether a one- or two-sided test is most suitable for the clinical question (4). A one-sided H1 test establishes the direction of the association between the predictor and the outcome—for example, that the prevalence of ovarian cancer is higher in nulliparous women than in parous women. In this example, the predictor is parity and the outcome is ovarian cancer. However, a two-sided H1 test establishes only that an association exists without specifying the direction—for example, the prevalence of ovarian cancer in nulliparous women is different (ie, either higher or lower) from that in parous women. In general, most hypothesis tests involve two-sided analyses.

Test Statistic

The test statistic is a function of summary statistics computed from the data. A general formula for many such test statistics is as follows: test statistic = (relevant statistic − hypothesized parameter value)/(standard error of the relevant statistic), where the relevant statistic and standard error are calculated on the basis of the sample data. The standard error is the indicator of variability, and much of the complexity of the hypothesis test involves estimating the standard error correctly. H0 is rejected if the test statistic exceeds a certain level (ie, critical value).

For example, for continuous data, the Student t test is most often used to determine the statistical significance of an observed difference between mean values with unknown variances. On the basis of large samples with underlying normal distributions and known variances (5), the Z test of two population means is often conducted. Similar to the t test, the Z test involves the use of a numerator to compare the difference between the sample means of the two study groups with the difference that would be expected with H0, that is, zero difference. The denominator includes the sample size, as well as the variances, of each study group (5).

Once the Z value is calculated, it can be converted into a probability statistic by means of locating the P value in a standard reference table. The Figure illustrates a standard normal distribution (mean of 0, variance of 1) of a test statistic, Z, with two rejection regions that are either below −1.96 or above +1.96. Two hypothetical test statistic values, −0.5 and 2.5, which lie outside and inside the rejection regions, respectively, are also included. Consequently, one does not reject H0 when Z equals −0.5, but one does reject H0 when Z equals 2.5.
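As a practical aside (not in the original article), the table lookup can be replaced by a one-line computation; for example, in Python:

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided P value: area under both tails beyond |z| of the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_sided_p(-0.5), 2))  # 0.62: do not reject H0
print(round(two_sided_p(2.5), 3))   # 0.012: reject H0 at the .05 level
```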

P Value

When we conclude that there is statistical significance, the P value tells us what the probability is that our conclusion is wrong when in fact H0 is correct. The lower the P value, the less likely that our rejection of H0 is erroneous. By convention, most analysts will not claim that they have found statistical significance if there is more than a 5% chance of being wrong (P > .05).

Figure. Graph illustrates the normal distribution of the test statistic Z in a two-sided hypothesis test. Under H0, Z has a standard normal distribution, with a mean of 0 and a variance of 1. The critical values are fixed at ±1.96, which corresponds to a 5% significance level (ie, type I error) under H0. The rejection regions are the areas marked with oblique lines under the two tails of the curve, and they correspond to any test statistic lying either below −1.96 or above +1.96. Two hypothetical test statistic values, −0.5 and 2.5, result in not rejecting or rejecting H0, respectively.

Type I and II Errors

Two types of errors can occur in hypothesis testing: A type I error (significance level α) represents the probability that H0 was erroneously rejected when in fact it is true in the underlying population. Note that the P value is not the same as the α value, which represents the significance level in a type I error. The significance level α is prespecified (5% conventionally), whereas the P value is computed on the basis of the data and thus reflects the strength of the rejection of H0 on the test statistic. A type II error (significance level β) represents the probability that H0 was erroneously retained when in fact H1 is true in the underlying population. There is always a trade-off between these two types of errors, and such a relationship is similar to that between sensitivity and specificity in the diagnostic literature (Table) (10). The probability 1 − β is the statistical power and is analogous to the sensitivity of a diagnostic test, whereas the probability 1 − α is analogous to the specificity of a diagnostic test.

STATISTICAL TESTS OF PROPORTIONS: THE Z TEST

We now focus on hypothesis testing for either a proportion or a comparison of two independent proportions. First, we study a one-sample problem. In a set of independent trials, one counts the number of times that a certain interesting event (eg, a successful outcome) occurs. The underlying probability of success (a proportion) is compared against a hypothesized value. This proportion can be the diagnostic accuracy (eg, sensitivity or specificity) or the proportion of patients whose cancers are in remission. We also study a two-sample problem in which trials are conducted independently in two study groups. For example, one may compare the sensitivities or specificities of two imaging modalities. Similarly, patients in one group receive a new treatment, whereas independently patients in the control group receive a conventional treatment, and the proportions of remission in the two patient populations are compared.

When sample sizes are large, the approximate normality assumptions hold for both the sample proportion and the test statistic. In the test of a single proportion (π) based on a sample of n independent trials at a hypothesized success probability of π0 (the hypothesized proportion), both nπ0 and n(1 − π0) need to be at least 5 (Appendix B). In the comparison of two proportions, π1 and π2, based on two independent sample sizes of n1 and n2 independent trials, respectively, both n1 and n2 need to be at least 30 (Appendix C) (5). The test statistic is labeled Z, and, hence, the analysis is referred to as the Z test of a proportion. Other exact hypothesis-testing methods are available if these minimum numbers are not met.

Furthermore, the Z and Student t tests both are parametric hypothesis tests—that is, they are based on data with an underlying normal distribution. There are many situations in radiology research in which the assumptions needed to use a parametric test do not hold. Therefore, nonparametric tests must be considered (9). These statistical tests will be discussed in a future article.

TWO RADIOLOGIC EXAMPLES

One-Sample Z Test of a Single Proportion

Fielding et al (2) evaluated the unenhanced helical CT features of 100 ureteral calculi, 71 of which passed spontaneously and 29 of which required intervention. According to data in the available literature (11–13), approximately 80% of the stones smaller than 6 mm in diameter should have passed spontaneously. Analysis of the data in the Fielding et al study revealed that of 66 stones smaller than 6 mm, 57 (86%) passed spontaneously. To test if the current finding agrees with that in the literature, we conduct a statistical hypothesis test with five steps:

1. H0 is as follows: 80% of the ureteral stones smaller than 6 mm will pass spontaneously (π = 0.80). H1 is as follows: The proportion of the stones smaller than 6 mm that pass spontaneously does not equal 80%—that is, it is either less than or greater than 80% (π ≠ 0.80). This is therefore a two-sided hypothesis test.

2. The test statistic Z is calculated to be 1.29 on the basis of the results of the Z test of a single proportion (5).

3. The P value, .20, is the sum of the two tail probabilities of a standard normal distribution for which the Z values are beyond ±1.29 (Figure).

4. Because the P value, .20, is greater than the significance level α of 5%, H0 is not rejected.

5. Therefore, our data support the belief that 80% of the stones smaller than 6 mm in diameter will pass spontaneously, as reported in the literature. Thus, H0 is not rejected, given the data at hand. Consequently, it is possible that a type II error will occur if the true proportion in the population does not equal 80%.
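The following Python sketch, our addition for readers who want to reproduce steps 2 and 3, implements the one-sample Z test described in Appendix B:

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(x, n, pi0):
    """Z test of a single proportion: z = (p - pi0) / sqrt(pi0*(1 - pi0)/n)."""
    p = x / n
    z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P value
    return z, p_value

# Ureteral stone example: 57 of 66 stones < 6 mm passed; hypothesized proportion 0.80
z, p_value = one_sample_z(57, 66, 0.80)
print(f"z = {z:.2f}, P = {p_value:.2f}")  # z = 1.29, P = 0.20: do not reject H0
```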

Two-Sample Z Test to Compare Two Independent Proportions

Brown et al (14) hypothesized that the imaging appearances (eg, multilocularity) of primary ovarian tumors and metastatic tumors to the ovary might be different. Data were obtained from 280 patients who had an ovarian mass and underwent US in the Radiologic Diagnostic Oncology Group (RDOG) ovarian cancer staging trial (15,16). The study results showed that 30 (37%) of 81 primary ovarian cancers, as compared with three (13%) of 24 metastatic neoplasms, were multilocular at US. To test if the respective underlying proportions are different, we conduct a statistical hypothesis test with five steps:

1. H0 is as follows: There is no difference between the proportions of multilocular primary ovarian tumors (π1) and multilocular metastatic tumors (π2) among the primary and secondary ovarian cancers—that is, π1 − π2 = 0. H1 is as follows: There is a difference in these proportions: One is either less than or greater than the other—that is, π1 − π2 ≠ 0. Thus, a two-sided hypothesis test is conducted.

Cross Tabulation Showing Relationship between the Two Error Types

Test Result          H0 True                   H0 False
Do not reject H0     Correct action (1 − α)    Type II error (β)
Reject H0            Type I error (α)          Correct action (1 − β)

Note.—Column headings give the underlying truth. A statistical power of 1 − β is analogous to the sensitivity of a diagnostic test. The probability 1 − α is analogous to the specificity of a diagnostic test.


2. The test statistic Z is calculated to be 2.27 on the basis of the results of the Z test to compare two independent proportions (5).

3. The P value, .02, is the sum of the two tail probabilities of a standard normal distribution for which the Z values are beyond ±2.27 (Figure).

4. Because the P value, .02, is less than the significance level α of 5%, H0 is rejected.

5. Therefore, there is a statistically significant difference between the proportion of multilocular masses in patients with primary tumors and that in patients with metastatic tumors.
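Steps 2 and 3 of this example can likewise be reproduced with a short script (ours, not the authors'), following the two-sample formula of Appendix C:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, n1, x2, n2):
    """Z test comparing two independent proportions, using the pooled proportion pc."""
    p1, p2 = x1 / n1, x2 / n2
    pc = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
    z = (p1 - p2) / sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P value
    return z, p_value

# RDOG example: 30 of 81 primary vs 3 of 24 metastatic tumors were multilocular at US
z, p_value = two_sample_z(30, 81, 3, 24)
print(f"z = {z:.2f}, P = {p_value:.2f}")  # z = 2.27, P = 0.02: reject H0
```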

SUMMARY AND REMARKS

In this article, we reviewed the hypothesis tests of a single proportion and for comparison of two independent proportions and illustrated the two test methods by using data from two prospective clinical trials. Formulas and program codes are provided in the Appendices. With large samples, the normality of a sample proportion and test statistic can be conveniently assumed when conducting Z tests (5). These methods are the basis for much of the scientific research conducted today; they allow us to make conclusions about the strength of research evidence, as expressed in the form of a probability.

Alternative exact hypothesis-testing methods are available if the sample sizes are not sufficiently large. In the case of a single proportion, the exact binomial test can be conducted. In the case of two independent proportions, the proposed large-sample Z test is equivalent to a test based on contingency table (ie, χ2) analysis. When large samples are not available, however, the Fisher exact test based on contingency table analysis can be adopted (8,17–19). For instance, in the clinical example involving data from the RDOG study, the sample of 24 metastatic neoplasms is slightly smaller than the required sample of 30 neoplasms, and, thus, use of the Fisher exact test may be preferred.

The basic concepts and methods reviewed in this article may be applied to similar inferential and clinical trial design problems related to counts and proportions. More complicated statistical methods and study designs may be considered, but these are beyond the scope of this tutorial article (20–24). A list of available software packages can be found by accessing the Web links given in Appendix A.

APPENDIX A

Statistical Resources Available on the World Wide Web

The following are links to electronic textbooks on statistics: www.davidmlane.com/hyperstat/index.html, www.statsoft.com/textbook/stathome.html, www.ruf.rice.edu/~lane/rvls.html, www.bmj.com/collections/statsbk/index.shtml, and espse.ed.psu.edu/statistics/investigating.htm. In addition, statistical software packages are available at the following address: www.amstat.org/chapters/alaska/resources.htm.

APPENDIX B

Testing a Single Proportion by Using a One-Sample Z Test

Let π be a population proportion to be tested (Table B1). The procedure for deciding whether or not to reject H0 (π = π0) is based on the results of a one-sided, one-sample Z test at the significance level of α with n independent trials (Table B1). The observed number of successes is x, and, thus, the sample proportion of successes is p = x/n. In our first clinical example, that in which the unenhanced helical CT features of 100 ureteral calculi were evaluated (2), π0 = 0.80, n = 66, x = 57, and p = 57/66 (0.86).

APPENDIX C

Comparing Two Independent Proportions by Using a Two-Sample Z Test

Let π1 and π2 be the two independent population proportions to be compared (Table C1). The procedure for deciding whether or not to reject H0 (π1 − π2 = 0) is based on the results of a one-sided, two-sample Z test at the significance level of α with two independent trials: sample sizes of n1 and n2, respectively (Table C1). The observed proportions of successes in these two samples are p1 = x1/n1 and p2 = x2/n2, respectively. To denote the pooled proportion of successes over the two samples, use the following equation: pc = (x1 + x2)/(n1 + n2).

TABLE B1
Testing a Single Proportion by Using a One-Sample Z Test

Procedure                                              H1: π > π0                      H1: π < π0
Do not reject H0 if p is inconsistent
  with H1—that is, if                                  p < π0                          p > π0
Perform the Z test if                                  p ≥ π0                          p ≤ π0
Compute the Z test statistic                           z = (p − π0)/√[π0(1 − π0)/n]    Same
Compute the P value                                    Probability (Z > z)             Probability (Z < z)
With α = .05, reject H0 if P value ≤ α                 Same                            Same

Note.—Under H0, Z has a standard normal distribution, with a mean of 0 and a variance of 1. The P value from a two-sided test is twice that from a one-sided test, as shown above. The large-sample assumption requires that both nπ0 and n(1 − π0) (the expected numbers of successes and failures) be greater than or equal to 5; the sum of these two numbers is the total number of trials (n) in the sample.

TABLE C1
Comparing Two Independent Proportions by Using a Two-Sample Z Test

Procedure                                              H1: π1 − π2 > 0                              H1: π1 − π2 < 0
Do not reject H0 if p1 − p2 is inconsistent
  with H1—that is, if                                  p1 − p2 < 0                                  p1 − p2 > 0
Perform the test if                                    p1 − p2 ≥ 0                                  p1 − p2 ≤ 0
Compute the test statistic                             z = (p1 − p2)/√[pc(1 − pc)(1/n1 + 1/n2)]     Same
Compute the P value                                    Probability (Z > z)                          Probability (Z < z)
With α = .05, reject H0 if P value ≤ α                 Same                                         Same

Note.—Under H0, Z has a standard normal distribution, with a mean of 0 and a variance of 1. The P value from a two-sided test is twice that from a one-sided test, as shown above. The large-sample assumption requires that both numbers of trials, n1 and n2, be greater than or equal to 30.


In our second clinical example, that involving 280 patients with ovarian masses in the RDOG ovarian cancer staging trial (15,16), n1 = 81, x1 = 30, n2 = 24, x2 = 3, p1 = x1/n1 (30/81 [0.37]), p2 = x2/n2 (3/24 [0.13]), and pc = (x1 + x2)/(n1 + n2), or 33/105 (0.31).

Acknowledgments: We thank Kimberly E. Applegate, MD, MS, and Philip E. Crewson, PhD, for their constructive comments on earlier versions of this article.

References
1. Altman DG. Statistics in medical journals: some recent trends. Stat Med 2000; 19:3275–3289.
2. Fielding JR, Silverman SG, Samuel S, Zou KH, Loughlin KR. Unenhanced helical CT of ureteral stones: a replacement for excretory urography in planning treatment. AJR Am J Roentgenol 1998; 171:1051–1053.
3. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 1986; 292:746–750.
4. Bland JM, Altman DG. One and two sided tests of significance. BMJ 1994; 309:248.
5. Goldman RN, Weinberg JS. Statistics: an introduction. Englewood Cliffs, NJ: Prentice Hall, 1985; 334–353.
6. Hulley SB, Cummings SR. Designing clinical research: an epidemiologic approach. Baltimore, Md: Williams & Wilkins, 1988; 128–138, 216–217.
7. Freund JE. Mathematical statistics. 5th ed. Englewood Cliffs, NJ: Prentice Hall, 1992; 425–430.
8. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981; 1–49.
9. Gibbons JD. Sign tests. In: Kotz S, Johnson NL, eds. Encyclopedia of statistical sciences. New York, NY: Wiley, 1982; 471–475.
10. Browner WS, Newman TB. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. JAMA 1987; 257:2459–2463.
11. Drach GW. Urinary lithiasis: etiology, diagnosis and medical management. In: Walsh PC, Stamey TA, Vaughan ED, eds. Campbell's urology. 6th ed. Philadelphia, Pa: Saunders, 1992; 2085–2156.
12. Segura JW, Preminger GM, Assimos DG, et al. Ureteral stones: clinical guidelines panel summary report on the management of ureteral calculi. J Urol 1997; 158:1915–1921.
13. Motola JA, Smith AD. Therapeutic options for the management of upper tract calculi. Urol Clin North Am 1990; 17:191–206.
14. Brown DL, Zou KH, Tempany CMC, et al. Primary versus secondary ovarian malignancy: imaging findings of adnexal masses in the Radiology Diagnostic Oncology Group study. Radiology 2001; 219:213–218.
15. Kurtz AB, Tsimikas JV, Tempany CMC, et al. Diagnosis and staging of ovarian cancer: comparative values of Doppler and conventional US, CT, and MR imaging correlated with surgery and histopathologic analysis—report of the Radiology Diagnostic Oncology Group. Radiology 1999; 212:19–27.
16. Tempany CM, Zou KH, Silverman SG, Brown DL, Kurtz AB, McNeil BJ. Staging of ovarian cancer: comparison of imaging modalities—report from the Radiology Diagnostic Oncology Group. Radiology 2000; 215:761–767.
17. Agresti A. Categorical data analysis. New York, NY: Wiley, 1990; 8–35.
18. Joe H. Extreme probabilities for contingency tables under row and column independence with application to Fisher's exact test. Comm Stat A Theory Methods 1988; 17:3677–3685.
19. MathSoft/Insightful. S-Plus 4 guide to statistics. Seattle, Wash: MathSoft, 1997; 89–96. Available at: www.insightful.com/products/splus.
20. Lehmann EL, Casella G. Theory of point estimation. New York, NY: Springer Verlag, 1998.
21. Lehmann EL. Testing statistical hypotheses. 2nd ed. New York, NY: Springer Verlag, 1986.
22. Hettmansperger TP. Statistical inference based on ranks. Malabar, Fla: Krieger, 1991.
23. Joseph L, Du Berger R, Belisle P. Bayesian and mixed Bayesian/likelihood criteria for sample size determination. Stat Med 1997; 16:769–781.
24. Zou KH, Normand SL. On determination of sample size in hierarchical binomial models. Stat Med 2001; 20:2163–2182.


Richard Tello, MD, MSME, MPH

Philip E. Crewson, PhD

Index term: Statistical analysis

Published online before print
10.1148/radiol.2271020085

Radiology 2003; 227:1–4

Abbreviation: ANOVA = analysis of variance

1 From the Department of Radiology, Boston University School of Medicine, 88 E Newton St, Atrium 2, Boston, MA 02118 (R.T.); and Health Services Research and Development Service, Department of Veterans Affairs, Washington, DC (P.E.C.). Received February 11, 2002; revision requested March 18; revision received April 1; accepted May 1. Address correspondence to R.T. (e-mail: [email protected]).
© RSNA, 2003

Hypothesis Testing II: Means1

Whenever means are reported in the literature, they are likely accompanied by tests to determine statistical significance. The t test is a common method for statistical evaluation of the difference between two sample means. It provides information on whether the means from two samples are likely to be different in the two populations from which the data originated. Similarly, paired t tests are common when comparing means from the same set of patients before and after an intervention. Analysis of variance techniques are used when a comparison involves more than two means. Each method serves a particular purpose, has its own computational formula, and uses a different sampling distribution to determine statistical significance. In this article, the authors discuss the basis behind analysis of continuous data with use of paired and unpaired t tests, the Bonferroni correction, and multivariate analysis of variance for readers of the radiology literature.
© RSNA, 2003

To establish if there is a statistically significant difference in two groups that are measured with a continuous variable, such as patient height versus sex, a test of the hypothesis that there is no difference must be performed. In this article, we discuss the application of three commonly used methods for testing the difference of means. The three methods are the independent samples t test, the paired samples t test, and one-way analysis of variance (ANOVA). Each method is used to compare means obtained from continuous sample data, but each is designed to serve a particular purpose. All three approaches require normally distributed variables and produce an estimate of whether there is a significant difference between means. Independent samples t tests provide information on whether the means from two samples are likely to be different in the two populations from which the data originated. Paired samples t tests compare means from the same set of observations (patients) before and after an intervention is performed. ANOVA is used to test the differences between three or more sample means. The t test is a commonly used statistical test that is easy to calculate but can be misapplied (1,2). If a radiologist wishes to compare two proportions, such as the sensitivity of two tests, the appropriate test is the χ2 test, which is addressed in other articles in this series. What follows in this article is a brief introduction to three techniques for testing the difference of means.

HYPOTHESIS TESTING FOR TWO SAMPLE MEANS

The t test can be used to evaluate the difference between two means from two independent samples or between two samples for which the observations in the second sample are not independent of those in the first sample. The latter is commonly referred to as a paired t test. Both paired and unpaired t tests use the t sampling distribution to determine the P value. As reported in a previous article (3), test statistics are compared with sampling distributions to determine the probability of error. The t distribution is similar to the standard normal distribution, except that it compensates for small sample sizes (especially fewer than 30 observations). As the total sample size of the two groups increases beyond 120 observations, the two sampling distributions are virtually identical. This attribute allows the t distribution to be used for all sample sizes, large or small.

Independent Samples t Tests

If a researcher wants to compare means collected from two patient populations, a t test for independent samples will often be used. The term independent samples indicates that none of the patients in one sample are included in the second sample. The following research question will serve as an example: Are T2s in magnetic resonance (MR) imaging of malignant hepatic masses different from those for benign hepatic masses?


Table 1 presents summary statistics from two patient samples to answer this question (Figure). One sample includes only patients with malignant tumors. The second sample includes only patients with benign tumors (hemangiomas). The average T2 for malignant lesions is about 92 msec, while hemangiomas had an average T2 of 136 msec. The t test is used to test the null hypothesis that the T2 for malignant tumors is not different from that for benign tumors. To conduct this test, the difference between the two means is used in conjunction with the variation found in both samples (SD) and the sample sizes to compute a t test statistic. The t test formula is

t = (X̄1 − X̄2)/S(X̄1 − X̄2),

where X̄1 is the mean for sample 1, X̄2 is the mean for sample 2, and S(X̄1 − X̄2) is the pooled standard error of the difference between the two sample means:

S(X̄1 − X̄2) = √[(n1s1² + n2s2²)/(n1 + n2 − 2)] × √[(n1 + n2)/(n1n2)],

where n1 is the size of sample 1, n2 is the size of sample 2, and s1 and s2 are the SDs of samples 1 and 2, respectively.

As shown in Table 1, the calculations result in a t test statistic of 7.44, which, when compared with the t distribution, produces a P value of less than .001 (4):

S(X̄1 − X̄2) = √[(37(26.3)² + 32(21.9)²)/(37 + 32 − 2)] × √[(37 + 32)/(37 × 32)] = 5.96;

t = (X̄1 − X̄2)/S(X̄1 − X̄2) = (136.1 − 91.7)/5.96 = 7.44.

Assuming a .05 cutoff P value, the null hypothesis is rejected in favor of the conclusion that, on average, there is a statistically significant difference in the T2 for MR imaging of malignant tumors compared with that for benign tumors.
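The arithmetic above is easy to verify. This Python sketch, our illustration rather than the authors' code, applies the printed pooled-standard-error formula to the summary statistics in Table 1:

```python
from math import sqrt

def t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from summary data, using the article's pooled SE."""
    se = sqrt((n1 * sd1**2 + n2 * sd2**2) / (n1 + n2 - 2)) * sqrt((n1 + n2) / (n1 * n2))
    return (mean1 - mean2) / se

# Table 1: hemangiomas 136.1 +/- 26.3 (n = 37); malignant lesions 91.7 +/- 21.9 (n = 32)
print(round(t_from_summary(136.1, 26.3, 37, 91.7, 21.9, 32), 2))  # 7.44 (P < .001)
```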

Homogeneity of Variance

The t test assumes that the variances in each group are equal (termed homogeneous). Alternative methods can be used when this assumption is not valid. Statistical software programs often automatically compute t tests and report results for both equal and unequal variances. Alternate approximations of the t test when the variances are unequal can be found in a publication by Rosner (5). It is in this setting that the need to determine if variances are equal requires another statistical test.

Determination of whether the assumption of equal variances is valid requires the use of an F test. The F test involves conducting a variance ratio test (6). This calculation tests the null hypothesis that the variances are equal. If the results of the F test are statistically significant (P < .05), this suggests that the variance of the two groups is not equal. If this occurs, two recommended solutions are to either modify the t test to compensate for unequal variances or use a nonparametric test, called the Mann-Whitney test (7).

Paired t Tests

Two samples are paired when each data point of the first sample is matched and related to a data point in the second sample. This is common in studies in which measurements are collected from the same patients before and after intervention. Table 2 presents data on the size of a cancerous tumor in the same patient before and after receiving a new treatment. The research question represented in Table 2 is whether the new therapy affects tumor size. There are two groups represented by the same seven patients. One group is represented by the patient sample before therapy. The second group is represented by the same sample of patients after therapy. Before therapy, mean tumor size was 4.86 cm. After therapy, mean tumor size was 4.50 cm, representing a mean decrease of 0.36 cm. The t test is used to test the null hypothesis that the mean difference in tumor size between the groups before and after therapy does not differ significantly from zero, with the assumption that this difference is distributed normally. The test is used to compare the observed difference obtained from the sample of seven patients with the hypothesized value of no difference in the population. The paired t test formula is

t = (d̄ − 0)/Sd̄,

where d̄ is the observed mean difference, and Sd̄ is the standard error of the observed mean difference. On the basis of a t test statistic of 5.21, calculated as follows:

t = (d̄ − 0)/Sd̄ = (0.36 − 0)/(0.18/√7) = 0.36/0.068 = 5.21,

the probability of falsely rejecting the null hypothesis of no change in size (ie, the observed difference is due to random chance) is .002. Hence, the null hypothesis is rejected in favor of the conclusion that tumors shrank in patients who underwent therapy.
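Because the paired t test is simply a one-sample t test applied to the within-patient differences, the Table 2 result can be reproduced in a few lines (our sketch, not from the article):

```python
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: mean within-patient difference divided by its standard error."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / n ** 0.5)

# Tumor sizes (cm) for the seven patients in Table 2
before = [5.3, 4.4, 4.9, 6.4, 3.4, 5.0, 4.6]
after = [5.0, 4.0, 4.5, 6.0, 3.0, 5.0, 4.0]
print(round(paired_t(before, after), 2))  # 5.21 (6 df, P = .002)
```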

ANOVA

In the previous section, we compared the means of two normally distributed variables with the two-sample t test.

TABLE 1
Example of the Independent Samples (Unpaired) t Test of T2s for Benign Hemangiomas and Malignant Lesions

Parameter              Benign Hemangiomas    Malignant Lesions
Mean T2 (msec) ± SD    136.1 ± 26.3          91.7 ± 21.9
No. of patients        37                    32

Note.—t Test results indicate a significant difference between mean T2 values in the two groups (t = 7.44, P < .001) (4).

Figure. Calculations for the independent samples t test statistic reported in Table 1.


When the means of more than two distributions must be compared, one-way ANOVA is used. With ANOVA, the means of two or more independent groups (each of which follow a normal distribution and have similar SDs) can be evaluated to ascertain the relative variability between the groups compared with the variability within the groups.

ANOVA calculations are best performed with statistical software. (Software easily capable of calculating the t test and variances includes Excel version 5.0 [Microsoft, Bothell, Wash]; more sophisticated packages for performance of ANOVA include Stata version 5.0 [Stata, College Station, Tex], SPSS [SPSS, Chicago, Ill], and SAS [SAS Institute, Cary, NC].) The basic approach is to compare the means and variances of independent groups to determine if the groups are significantly different from one another. The null hypothesis proposes that the samples come from populations with the same mean and variance. The alternative hypothesis is that at least two of the means are not the same. If ANOVA is used with two groups, it would produce results comparable to those obtained with the two independent samples t test.

Table 3 presents data on vertebral bone density for three groups of women (8). The research question represented in Table 3 is whether the mean vertebral bone density varies among the three groups; hence, ANOVA is used to determine if there is a significant difference. In this example, age groups of 46–55 years, 56–65 years, and 66–75 years are used to represent the three populations for patient screening. As reported in Table 3, the mean vertebral bone density is 1.12 for women 46–55 years of age, 1.13 for women 56–65 years of age, and 1.03 for women 66–75 years of age. Calculation of the test statistic involves estimation of a ratio of the variance between the groups to the variance within the groups. The ratio, called the F statistic, is compared with the F sampling distribution instead of the t distribution discussed earlier (5). An F statistic of 1.0 occurs when the variance between the groups is the same as the variance within the groups.

The F statistic is the mean square between groups divided by the mean square within groups:

F = [Σnk(X̄k − X̄g)²/(K − 1)] / {[Σ(Xi − X̄1)² + Σ(Xi − X̄2)² + Σ(Xi − X̄3)²]/(N − K)},

where X̄k is the mean of group k, X̄g is the grand mean for all the groups [(sum of all scores)/N], nk is the number of patients in each group, K is the number of groups, Xi is an individual score, X̄1, X̄2, and X̄3 are the three group means, and N is the total number of scores.

The sum of squares between groups is Σnk(X̄k − X̄g)². The sum of squares within groups is Σ(Xi − X̄1)² + Σ(Xi − X̄2)² + Σ(Xi − X̄3)². The example presented in Table 3 can be calculated by using the F statistic formula; however, the descriptive data in Table 3 would require manipulation: the sum of squares within each group is its variance multiplied by its degrees of freedom (one less than the number of scores in the group).

The F statistic in Table 3 is 4.499. This indicates that there is greater variation between the group means than within each group. In this case, there is a statistically significant difference (P < .013) between at least two of the age groups.
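The F statistic can be approximated from the summary data in Table 3 alone. The Python sketch below is our illustration; because the means and SDs in Table 3 are rounded, it yields F of approximately 4.33 rather than the published 4.499, which was computed from the unrounded data:

```python
def f_from_summary(means, sds, ns):
    """One-way ANOVA F statistic computed from group means, SDs, and sample sizes."""
    k, n_total = len(means), sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))  # (n - 1) * variance per group
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Table 3: vertebral bone density in three age groups of women
print(round(f_from_summary([1.12, 1.13, 1.03], [0.18, 0.20, 0.19], [46, 63, 50]), 2))
# approximately 4.33 with 2 and 156 df
```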

THE MULTIPLE COMPARISON PROBLEM

The vertebral bone density example could also be analyzed by using three t tests (46–55-year-old group vs 56–65-year-old group; 46–55-year-old group vs 66–75-year-old group; and 56–65-year-old group vs 66–75-year-old group), which is commonly performed (although often incorrectly) for simplicity of communication. Similarly, it is not uncommon for investigators who evaluate many outcomes to report statistical significance with P values at the .04 and .02 levels (9,10). This approach, however, leads to a multiple comparisons problem (11,12). In this situation, one may falsely conclude a significant effect where there is none. In particular, use of a .05 cutoff value for significance theoretically guarantees that if there were 20 pairwise comparisons, there will by chance alone appear to be one with significance at the .05 level (20 × .05 = 1).

There are corrections for this problem (11,14). Some are useful for unordered groups, such as patient height versus sex, while others are applied to ordered groups (to evaluate a trend), such as patient height versus sex when stratified by age. It is also worth noting that there is a debate about which method to use, and some hold the view that this correction is overused (13). We focus our attention solely on unordered groups and the most commonly used correction, the Bonferroni method.

The Bonferroni correction adjusts the threshold for significance (14); the adjusted threshold is equal to the desired P value (eg, .05, .01) divided by the number of outcome variables being examined.

TABLE 2
Example of the Paired t Test to Evaluate Tumor Size before and after Therapy in Seven Patients

Patient No.       Before Therapy (cm)    After Therapy (cm)    Change (cm)
1                 5.3                    5.0                   0.3
2                 4.4                    4.0                   0.4
3                 4.9                    4.5                   0.4
4                 6.4                    6.0                   0.4
5                 3.4                    3.0                   0.4
6                 5.0                    5.0                   0.0
7                 4.6                    4.0                   0.6
Mean size ± SD    4.86 ± 0.91            4.50 ± 0.96           0.36 ± 0.18

Note.—The calculated statistic suggests a significant difference between the means before and after treatment (t = 5.21, P < .002).

TABLE 3
Example of ANOVA to Compare Differences in Vertebral Density in Three Age Groups of Women

Parameter            46–55 Years    56–65 Years    66–75 Years
Mean density ± SD    1.12 ± 0.18    1.13 ± 0.20    1.03 ± 0.19
Sample size          46             63             50

Note.—F = 4.499, P < .013 (8).


Consequently, when multiple statistical tests are conducted between the same variables, as would occur if multiple t tests were conducted for a comparison of age and bone density, the significance cutoff value is often adjusted to represent a more conservative estimate of statistical significance. One limitation of the Bonferroni correction is that by reducing the level of significance associated with each test, we have reduced the power of the test, thereby increasing the chance of incorrectly keeping the null hypothesis.

Table 4 presents the results of t tests by using the same data presented in Table 3. Use of the common threshold of .05 would result in the conclusion that there is a significant difference in vertebral bone density between those 46–55 years of age and those 66–75 years of age (P = .021) and also between those 56–65 years of age and those 66–75 years of age (P = .006). However, compensating for multiple comparisons would reduce the threshold from .05 to .017 (.05/3). With this threshold, only the comparison between those 56–65 years of age and those 66–75 years of age reaches statistical significance.

The far right column of Table 4 shows the exact probability of error with use of the Bonferroni adjustment provided by SPSS statistical software. The Bonferroni adjustment is a common option in statistical software when using ANOVA to determine if the group means are different from each other. The P value for each comparison is adjusted so it can be compared directly with the P < .05 cutoff. Again, by using P < .05 as the standard for significance, the results of the Bonferroni adjustment listed in Table 4 indicate that the only statistically significant difference is between those 56–65 years of age and those 66–75 years of age (P = .015).
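In code, the Bonferroni adjustment amounts to one line: multiply each P value by the number of comparisons and cap the result at 1.0 (equivalently, divide the significance threshold by the number of comparisons). The sketch below is ours; the small discrepancies from the SPSS values in Table 4 (.99, .072, .015) reflect rounding of the unadjusted P values:

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P value by the number of comparisons, cap at 1.0."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]

# Unadjusted P values for the three pairwise age-group comparisons in Table 4
adjusted = bonferroni([0.824, 0.021, 0.006])
print([round(p, 3) for p in adjusted])  # [1.0, 0.063, 0.018]: only the last is below .05
```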

In summary, it is often necessary to test hypotheses that relate a continuous outcome to an intervention—for example, tumor size versus treatment options. Depending on the number of groups (more than two) being analyzed, the ANOVA technique may be used to test for an effect, or a t test may be used if there are only two groups. A paired t test is used to examine two groups if the control population is linked on an individual basis, such as when a pre- and posttreatment comparison is made in the same patients. For radiologists, the comparisons may involve a new imaging technique, the use of contrast material, or a new MR imaging sequence.

Fundamental limitations in using these tests include the understanding that they generate an estimate of the probability that the differences observed would be due to random chance alone. This estimate is based not only on differences between means but also on sample variability and sample size. In addition, the assumption that the underlying population is distributed normally is not always appropriate—in which case, special nonparametric techniques are available. In a more common misapplication, the t test is used inappropriately to compare two groups of categoric or binary data. Finally, use of the tests presented in this article is limited to comparisons between two variables (such as patient age and bone density), which may often oversimplify much more complex relationships. More complex relationships are best analyzed with other techniques, such as multiple regression or ANOVA.

Acknowledgment: The authors thank Miriam Bowdre for secretarial support in preparation of the manuscript.

References
1. Mullner M, Matthews H, Altman DG. Reporting on statistical methods to adjust for confounding: a cross-sectional survey. Ann Intern Med 2002; 136:122–126.
2. Bland JM, Altman DG. One and two sided tests of significance. BMJ 1994; 309:248.
3. Zou KH, Fielding JR, Silverman SG, Tempany CM. Hypothesis testing I: proportions. Radiology 2003; 226:609–613.
4. Fenlon HM, Tello R, deCarvalho VLS, Yucel EK. Signal characteristics of focal liver lesions on double echo T2-weighted conventional spin echo MRI: observer performance versus quantitative measurements of T2 relaxation times. J Comput Assist Tomogr 2000; 24:204–211.
5. Rosner B. Fundamentals of biostatistics. 4th ed. Boston, Mass: Duxbury, 1995; 270–273.
6. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall/CRC, 1997.
7. Rosner B. Fundamentals of biostatistics. 4th ed. Boston, Mass: Duxbury, 1995; 570–575.
8. Bachman DM, Crewson PE. Comparison of central DXA with heel ultrasound and finger DXA for detection of osteoporosis. J Clin Densitom 2002; 5:131–141.
9. Sheafor DH, Keogan MT, Delong DM, Nelson RC. Dynamic helical CT of the abdomen: prospective comparison of pre- and postprandial contrast enhancement. Radiology 1998; 206:359–363.
10. Tello R, Seltzer SE. Hepatic contrast-enhanced CT: statistical design for prospective analysis. Radiology 1998; 209:879–881.
11. Gonen M, Panageas KS, Larson SM. Statistical issues in analysis of diagnostic imaging experiments with multiple observations per patient. Radiology 2001; 221:763–767.
12. Ware JH, Mosteller F, Delgado F, Donnelly C, Ingelfinger JA. P values. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, Mass: NEJM Books, 1992; 181.
13. Perneger TV. What's wrong with Bonferroni adjustments. BMJ 1998; 316:1236–1238.
14. Armitage P, Berry G. Statistical methods in medical research. 3rd ed. Oxford, England: Blackwell Scientific, 1994; 331.

TABLE 4
Adjustment for Multiple Comparisons of Vertebral Density according to Age Group

Age Group Comparison           Mean Difference    P Value (t Test)    P Value (t Test with Bonferroni Adjustment)
46–55 years and 56–65 years    .014               .824                .99
46–55 years and 66–75 years    .090               .021                .072
56–65 years and 66–75 years    .100               .006                .015

Note.—Values provided by statistical software, indicating a significant difference in density for the three groups.


John Eng, MD

Index terms:
Radiology and radiologists, research
Statistical analysis

Published online
10.1148/radiol.2272012051

Radiology 2003; 227:309–313

1 From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, 600 N Wolfe St, Central Radiology Viewing Area, Rm 117, Baltimore, MD 21287. Received December 17, 2001; revision requested January 29, 2002; revision received March 7; accepted March 13. Address correspondence to the author (e-mail: [email protected]).
© RSNA, 2003

Sample Size Estimation: How Many Individuals Should Be Studied?1

The number of individuals to include in a research study, the sample size of the study, is an important consideration in the design of many clinical studies. This article reviews the basic factors that determine an appropriate sample size and provides methods for its calculation in some simple, yet common, cases. Sample size is closely tied to statistical power, which is the ability of a study to enable detection of a statistically significant difference when there truly is one. A trade-off exists between a feasible sample size and adequate statistical power. Strategies for reducing the necessary sample size while maintaining a reasonable power will also be discussed.
© RSNA, 2003

How many individuals will I need to study? This question is commonly asked by the clinical investigator and exposes one of many issues that are best settled before actually carrying out a study. Consultation with a statistician is worthwhile in addressing many issues of study design, but a statistician is not always readily available. Fortunately, many studies in radiology have simple designs for which determination of an appropriate sample size—the number of individuals that should be included for study—is relatively straightforward.

Superficial discussions of sample size determination are included in typical introductory biostatistics texts (1–3). The goal of this article is to augment these introductory discussions with additional practical material. First, the need for considering sample size will be reviewed. Second, the study design parameters affecting sample size will be identified. Third, formulae for calculating appropriate sample sizes for some common study designs will be defined. Finally, some advice will be offered on what to do if the calculated sample size is impracticably large. To assist the reader in performing the calculations described in this article and to encourage experimentation with them, a World Wide Web page has been developed that closely parallels the equations presented in this article. This page can be found at www.rad.jhmi.edu/jeng/javarad/samplesize/.

Even if a statistician is readily available, the investigator may find that a working knowledge of the factors affecting sample size will result in more fruitful communication with the statistician and in better research design. A working knowledge of these factors is also required to use one of the numerous Web pages (4–6) and computer programs (7–9) that have been developed for calculating appropriate sample sizes. It should be noted that Web pages for calculating sample size are typically limited for use in situations involving the well-known parametric statistics, which are those involving the calculation of summary means, proportions, or other parameters of an assumed underlying statistical distribution such as the normal, Student t, or binomial distributions. The calculation of sample size for nonparametric statistics such as the Wilcoxon rank sum test is performed by some computer programs (7,9).

IMPORTANCE OF SAMPLE SIZE

In a comparative research study, the means or proportions of some characteristic in two or more comparison groups are measured. A statistical test is then applied to determine whether or not there is a significant difference between the means or proportions observed in the comparison groups. We will first consider the comparative type of study.


Sample size is important primarily because of its effect on statistical power. Statistical power is the probability that a statistical test will indicate a significant difference when there truly is one. Statistical power is analogous to the sensitivity of a diagnostic test (10), and one could mentally substitute the word "sensitivity" for the word "power" during statistical discussions.

In a study comparing two groups of individuals, the power (sensitivity) of a statistical test must be sufficient to enable detection of a statistically significant difference between the two groups if a difference is truly present. This issue becomes important if the study results were to demonstrate no statistically significant difference. If such a negative result were to occur, there would be two possible interpretations. The first interpretation is that the results of the statistical test are correct and that there truly is no statistically significant difference (a true-negative result). The second interpretation is that the results of the statistical test are erroneous and that there is actually an underlying difference, but the study was not powerful enough (sensitive enough) to find the difference, yielding a false-negative result. In statistical terminology, a false-negative result is known as a type II error. An adequate sample size gives a statistical test enough power (sensitivity) so that the first interpretation (that the results are true-negative) is much more plausible than the second interpretation (that a type II error occurred) in the event no statistically significant difference is found in the study.

It is well known that many published clinical research studies possess low statistical power owing to inadequate sample size or other design issues (11,12). One could argue that it is as wasteful and inappropriate to conduct a study with inadequate power as it is to obtain a diagnostic test of insufficient sensitivity to rule out a disease.

PARAMETERS THAT DETERMINE APPROPRIATE SAMPLE SIZE

An appropriate sample size generally depends on five study design parameters: minimum expected difference (also known as the effect size), estimated measurement variability, desired statistical power, significance criterion, and whether a one- or two-tailed statistical analysis is planned.

Minimum Expected Difference

This parameter is the smallest measured difference between comparison groups that the investigator would like the study to detect. As the minimum expected difference is made smaller, the sample size needed to detect statistical significance increases. The setting of this parameter is subjective and is based on clinical judgment and experience with the problem being investigated. For example, suppose a study is designed to compare a standard diagnostic procedure of 80% accuracy with a new procedure of unknown but potentially higher accuracy. It would probably be clinically unimportant if the new procedure were only 81% accurate, but suppose the investigator believes that it would be a clinically important improvement if the new procedure were 90% accurate. Therefore, the investigator would choose a minimum expected difference of 10% (0.10). The results of pilot studies or a literature review can also guide the selection of a reasonable minimum difference.

Estimated Measurement Variability

This parameter is represented by the expected SD in the measurements made within each comparison group. As statistical variability increases, the sample size needed to detect the minimum difference increases. Ideally, the estimated measurement variability should be determined on the basis of preliminary data collected from a similar study population. A review of the literature can also provide estimates of this parameter. If preliminary data are not available, this parameter may have to be estimated on the basis of subjective experience, or a range of values may be assumed. A separate estimate of measurement variability is not required when the measurement being compared is a proportion (in contrast to a mean), because the SD is mathematically derived from the proportion.

Statistical Power

This parameter is the power that is desired from the study. As power is increased, sample size increases. While high power is always desirable, there is an obvious trade-off with the number of individuals that can feasibly be studied, given the usually fixed amount of time and resources available to conduct a study. In randomized controlled trials, the statistical power is customarily set to a number greater than or equal to 0.80, with many clinical trial experts now advocating a power of 0.90.

Significance Criterion

This parameter is the maximum P value for which a difference is to be considered statistically significant. As the significance criterion is decreased (made more strict), the sample size needed to detect the minimum difference increases. The significance criterion is customarily set to .05.

One- or Two-tailed Statistical Analysis

In a few cases, it may be known before the study that any difference between comparison groups is possible in only one direction. In such cases, use of a one-tailed statistical analysis, which would require a smaller sample size for detection of the minimum difference than would a two-tailed analysis, may be considered. The sample size of a one-tailed design with a given significance criterion, for example α, is equal to the sample size of a two-tailed design with a significance criterion of 2α, all other parameters being equal. Because of this simple relationship and because truly appropriate one-tailed analyses are rare, a two-tailed analysis is assumed in the remainder of this article.

SAMPLE SIZES FOR COMPARATIVE RESEARCH STUDIES

With knowledge of the design parameters detailed in the previous section, the calculation of an appropriate sample size simply involves selecting an appropriate equation. For a study comparing two means, the equation for sample size (13) is

N = 4σ²(zcrit + zpwr)² / D²,   (1)

where N is the total sample size (the sum of the sizes of both comparison groups), σ is the assumed SD of each group (assumed to be equal for both groups), the zcrit value is that given in Table 1 for the desired significance criterion, the zpwr value is that given in Table 2 for the desired statistical power, and D is the minimum expected difference between the two means. Both zcrit and zpwr are cutoff points along the x axis of a standard normal probability distribution that demarcate probabilities matching the specified significance criterion and statistical power, respectively. The two groups that make up N are assumed to be equal in number, and it is assumed that two-tailed statistical analysis will be used. Note that N depends only on the difference between the two means; it does not depend on the magnitude of either one.

As an example, suppose a study is proposed to compare a renovascular procedure versus medical therapy in lowering the systolic blood pressure of patients with hypertension secondary to renal artery stenosis. On the basis of results of preliminary studies, the investigators estimate that the vascular procedure may help lower blood pressure by 20 mm Hg, while medical therapy may help lower blood pressure by only 10 mm Hg. On the basis of their clinical judgment, the investigators might also argue that the vascular procedure would have to be twice as effective as medical therapy to justify the higher cost and discomfort of the vascular procedure. On the basis of results of preliminary studies, the SD for blood pressure lowering is estimated to be 15 mm Hg. According to the normal distribution, this SD indicates an expectation that 95% of the patients in either group will experience a blood pressure lowering within 30 mm Hg (2 SDs) of the mean. A significance criterion of .05 and a power of 0.80 are chosen. With these assumptions, D = 20 − 10 = 10 mm Hg, σ = 15 mm Hg, zcrit = 1.960 (from Table 1), and zpwr = 0.842 (from Table 2). Equation (1) yields a sample size of N = 70.6. Therefore, a total of 70 patients (rounding N to the nearest even number) should be enrolled in the study: 35 to undergo the vascular procedure and 35 to receive medical therapy.

For a study in which two proportions are compared with a χ² test or a z test, which is based on the normal approximation to the binomial distribution, the equation for sample size (14) is

N = 2[zcrit√(2p̄(1 − p̄)) + zpwr√(p1(1 − p1) + p2(1 − p2))]² / D²,   (2)

where p1 and p2 are pre-study estimates of the two proportions to be compared, D = |p1 − p2| (ie, the minimum expected difference), p̄ = (p1 + p2)/2, and N, zcrit, and zpwr are defined as they are for Equation (1). The two groups comprising N are assumed to be equal in number, and it is assumed that two-tailed statistical analysis will be used. Note that in this case, N depends not only on the difference between the two proportions but also on the magnitude of the proportions themselves. Therefore, Equation (2) requires the investigator to estimate p1 and p2, as well as their difference, before performing the study. However, Equation (2) does not require an independent estimate of SD because it is calculated from p1 and p2 within the equation.

As an example, suppose a standard diagnostic procedure has an accuracy of 80% for the diagnosis of a certain disease. A study is proposed to evaluate a new diagnostic procedure that may have greater accuracy. On the basis of their experience, the investigators decide that the new procedure would have to be at least 90% accurate to be considered significantly better than the standard procedure. A significance criterion of .05 and a power of 0.80 are chosen. With these assumptions, p1 = 0.80, p2 = 0.90, D = 0.10, p̄ = 0.85, zcrit = 1.960, and zpwr = 0.842. Equation (2) yields a sample size of N = 398. Therefore, a total of 398 patients should be enrolled: 199 to undergo the standard diagnostic procedure and 199 to undergo the new one.
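
A matching sketch for Equation (2) (again with illustrative names of our own) reproduces the diagnostic accuracy example:

```python
from math import sqrt
from scipy.stats import norm

def total_n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Total sample size for comparing two proportions (Equation 2),
    two-tailed z test with equal groups."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_pwr = norm.ppf(power)
    d = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    bracket = (z_crit * sqrt(2 * p_bar * (1 - p_bar))
               + z_pwr * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return 2 * bracket / d**2

# Diagnostic accuracy example: 80% versus 90% accuracy
print(total_n_two_proportions(0.80, 0.90))  # approximately 398 (199 per group)
```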

SAMPLE SIZES FOR DESCRIPTIVE STUDIES

Not all research studies involve the comparison of two groups. The purpose of many studies is simply to describe, with means or proportions, one or more characteristics in one particular group. In these types of studies, known as descriptive studies, sample size is important because it affects how precise the observed means or proportions are expected to be. In the case of a descriptive study, the minimum expected difference reflects the difference between the upper and lower limits of an expected confidence interval, which is described with a percentage. For example, a 95% CI indicates the range in which 95% of results would fall if a study were to be repeated an infinite number of times, with each repetition including the number of individuals specified by the sample size.

In studies designed to estimate a mean, the equation for sample size (2,15) is

N = 4σ²(zcrit)² / D²,   (3)

where N is the sample size of the single study group, σ is the assumed SD for the group, the zcrit value is that given in Table 1, and D is the total width of the expected CI. Note that Equation (3) does not depend on statistical power because this concept only applies to statistical comparisons.

As an example, suppose a fetal sonographer wants to determine the mean fetal crown-rump length in a group of pregnancies. The sonographer would like the limits of the 95% CI to be no more than 1 mm above or 1 mm below the mean crown-rump length of the group. From previous studies, it is known that the SD for the measurement is 3 mm. Based on these assumptions, D = 2 mm, σ = 3 mm, and zcrit = 1.960 (from Table 1). Equation (3) yields a sample size of N = 35. Therefore, 35 fetuses should be examined in the study.
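
As a rough check of Equation (3), a minimal sketch (Python with scipy; names are illustrative):

```python
from scipy.stats import norm

def n_for_mean(sd, ci_width, confidence=0.95):
    """Sample size to estimate a mean with a CI of total width D
    (Equation 3): N = 4 * sd^2 * z_crit^2 / D^2."""
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    return 4 * sd**2 * z_crit**2 / ci_width**2

# Crown-rump length example: SD = 3 mm, CI limits of +/-1 mm (total width 2 mm)
print(n_for_mean(sd=3, ci_width=2))  # approximately 34.6, rounded up to 35
```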

In studies designed to measure a characteristic in terms of a proportion, the equation for sample size (2,15) is

N = 4(zcrit)²p(1 − p) / D²,   (4)

where p is a pre-study estimate of the proportion to be measured, and N, zcrit, and D are defined as they are for Equation (3).

TABLE 1
Standard Normal Deviate (zcrit) Corresponding to Selected Significance Criteria and CIs

Significance Criterion*    zcrit Value†
.01 (99)                   2.576
.02 (98)                   2.326
.05 (95)                   1.960
.10 (90)                   1.645

* Numbers in parentheses are the probabilities (expressed as percentages) associated with the corresponding CIs. Confidence probability is the probability associated with the corresponding CI.
† A stricter (smaller) significance criterion is associated with a larger zcrit value. Values not shown in this table may be calculated in Excel version 97 (Microsoft, Redmond, Wash) by using the formula zcrit = NORMSINV(1 − (P/2)), where P is the significance criterion.

TABLE 2
Standard Normal Deviate (zpwr) Corresponding to Selected Statistical Powers

Statistical Power    zpwr Value*
.80                  0.842
.85                  1.036
.90                  1.282
.95                  1.645

* A higher power is associated with a larger value for zpwr. Values not shown in this table may be calculated in Excel version 97 (Microsoft, Redmond, Wash) by using the formula zpwr = NORMSINV(power). For calculating power, the inverse formula is power = NORMSDIST(zpwr), where zpwr is calculated from Equation (1) or Equation (2) by solving for zpwr.

Like Equation (2), Equation (4) depends not only on the width of the expected CI but also on the magnitude of the proportion itself. Also like Equation (2), Equation (4) does not require an independent estimate of SD because it is calculated from p within the equation.

As an example, suppose an investigator would like to determine the accuracy of a diagnostic test with a 95% CI of ±10%. Suppose that, on the basis of results of preliminary studies, the estimated accuracy is 80%. With these assumptions, D = 0.20, p = 0.80, and zcrit = 1.960. Equation (4) yields a sample size of N = 61. Therefore, 61 patients should be examined in the study.
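
And a matching sketch for Equation (4), reproducing the accuracy example (function name ours):

```python
from scipy.stats import norm

def n_for_proportion(p, ci_width, confidence=0.95):
    """Sample size to estimate a proportion with a CI of total width D
    (Equation 4): N = 4 * z_crit^2 * p * (1 - p) / D^2."""
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    return 4 * z_crit**2 * p * (1 - p) / ci_width**2

# Diagnostic accuracy example: estimated accuracy 80%, 95% CI of +/-10%
print(n_for_proportion(p=0.80, ci_width=0.20))  # about 61.5; the example reports N = 61
```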

MINIMIZING THE SAMPLE SIZE

Now that we understand how to calculate sample size, what if the sample size we calculate is too large to be feasibly studied? Browner et al (16) list a number of strategies for minimizing the sample size. These strategies are briefly discussed in the following paragraphs.

Use Continuous Measurements Instead of Categories

Because a radiologic diagnosis is often expressed in terms of a binary result, such as the presence or absence of a disease, it is natural to convert continuous measurements into categories. For example, the size of a lesion might be encoded as “small” or “large.” For a sample of fixed size, the use of the actual measurement rather than the proportion in each category yields more power. This is because statistical tests that incorporate the use of continuous values are mathematically more powerful than those used for proportions, given the same sample size.

Use More Precise Measurements

For studies in which Equation (1) or Equation (2) applies, any way to increase the precision (decrease the variability) of the measurement process should be sought. For some types of research, precision can be increased by simply repeating the measurement. More complex equations are necessary for studies involving repeated measurements in the same individuals (17), but the basic principles are similar.

Use Paired Measurements

Statistical tests like the paired t test are mathematically more powerful for a given sample size than are unpaired tests because in paired tests, each measurement is matched with its own control. For example, instead of comparing the average lesion size in a group of treated patients with that in a control group, measuring the change in lesion size in each patient after treatment allows each patient to serve as his or her own control and yields more statistical power. Equation (1) can still be used in this case. D represents the expected change in the measurement, and σ is the expected SD of this change. The additional power and reduction in sample size are due to the SD being smaller for changes within individuals than for overall differences between groups of individuals.

Use Unequal Group Sizes

Equations (1) and (2) involve the assumption that the comparison groups are equal in size. Although it is statistically most efficient if the two groups are equal in size, benefit is still gained by studying more individuals, even if the additional individuals all belong to one of the groups. For example, it may be feasible to recruit additional individuals into the control group even if it is difficult to recruit more individuals into the noncontrol group. More complex equations are necessary for calculating sample sizes when comparing means (13) and proportions (18) of unequal group sizes.

Expand the Minimum Expected Difference

Perhaps the minimum expected difference that has been specified is unnecessarily small, and a larger expected difference could be justified, especially if the planned study is a preliminary one. The results of a preliminary study could be used to justify a more ambitious follow-up study of a larger number of individuals and a smaller minimum difference.

DISCUSSION

The formulation of Equations (1–4) involves two statistical assumptions which should be kept in mind when these equations are applied to a particular study. First, it is assumed that the selection of individuals is random and unbiased. The decision to include an individual in the study cannot depend on whether or not that individual has the characteristic or outcome being studied. Second, in studies in which a mean is calculated from measurements of individuals, the measurements are assumed to be normally distributed. Both of these assumptions are required not only by the sample size calculation method, but also by the statistical tests themselves (such as the t test). The situations in which Equations (1–4) are appropriate all involve parametric statistics. Different methods for determining sample size are required for nonparametric statistics such as the Wilcoxon rank sum test.

Equations for calculating sample size, such as Equations (1) and (2), also provide a method for determining the statistical power corresponding to a given sample size. To calculate power, solve for zpwr in the equation corresponding to the design of the study. The power can then be determined by referring to Table 2. In this way, an “observed power” can be calculated after a study has been completed, where the observed difference is used in place of the minimum expected difference. This calculation is known as retrospective power analysis and is sometimes used to aid in the interpretation of the statistical results of a study. However, retrospective power analysis is controversial because it can be shown that observed power is completely determined by the P value and therefore cannot add any additional information to its interpretation (19). Power calculations are most appropriate when they incorporate a minimum difference that is stated prospectively.
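
As an illustration of this reversed calculation, the sketch below solves Equation (1) for zpwr and converts it to a power (the Python analogue of the NORMSDIST formula in the note to Table 2); it assumes the two-means design, and the function name is ours:

```python
from math import sqrt
from scipy.stats import norm

def power_two_means(n_total, sd, diff, alpha=0.05):
    """Power for a given total N under the two-means design: Equation (1)
    solved for z_pwr, i.e. z_pwr = D * sqrt(N) / (2 * sd) - z_crit."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_pwr = diff * sqrt(n_total) / (2 * sd) - z_crit
    return norm.cdf(z_pwr)  # the Python analogue of Excel's NORMSDIST

# Blood pressure example: 70 patients in total
print(power_two_means(n_total=70, sd=15, diff=10))  # approximately 0.80
```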

The accuracy of sample size calculations obviously depends on the accuracy of the estimates of the parameters used in the calculations. Therefore, these calculations should always be considered estimates of an absolute minimum. It is usually prudent for the investigator to plan to include more than the minimum number of individuals in a study to compensate for loss during follow-up or other causes of attrition.

Sample size is best considered early in the planning of a study, when modifications in study design can still be made. Attention to sample size will hopefully result in a more meaningful study whose results will eventually receive a high priority for publication.

References
1. Pagano M, Gauvreau K. Principles of biostatistics. 2nd ed. Pacific Grove, Calif: Duxbury, 2000; 246–249, 330–331.
2. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 7th ed. New York, NY: Wiley, 1999; 180–185, 268–270.
3. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall, 1991.
4. Bond J. Power calculator. Available at: http://calculators.stat.ucla.edu/powercalc/. Accessed March 11, 2003.

5. Uitenbroek DG. Sample size: SISA—simple interactive statistical analysis. Available at: http://home.clara.net/sisa/samsize.htm. Accessed March 3, 2003.
6. Lenth R. Java applets for power and sample size. Available at: www.stat.uiowa.edu/~rlenth/Power/index.html. Accessed March 3, 2003.
7. NCSS Statistical Software. PASS 2002. Available at: www.ncss.com/pass.html. Accessed March 3, 2003.
8. SPSS. SamplePower. Available at: www.spss.com/SPSSBI/SamplePower/. Accessed March 3, 2003.
9. Statistical Solutions. nQuery Advisor. Available at: www.statsolusa.com/nquery/nquery.htm. Accessed March 3, 2003.
10. Browner WS, Newman TB. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. JAMA 1987; 257:2459–2463.
11. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994; 272:122–124.
12. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 “negative” trials. N Engl J Med 1978; 299:690–694.
13. Rosner B. Fundamentals of biostatistics. 5th ed. Pacific Grove, Calif: Duxbury, 2000; 308.
14. Feinstein AR. Principles of medical statistics. Boca Raton, Fla: CRC, 2002; 503.
15. Snedecor GW, Cochran WG. Statistical methods. 8th ed. Ames, Iowa: Iowa State University Press, 1989; 52, 439.
16. Browner WS, Newman TB, Cummings SR, Hulley SB. Estimating sample size and power. In: Hulley SB, Cummings SR, Browner WS, Grady D, Hearst N, Newman TB. Designing clinical research: an epidemiologic approach. 2nd ed. Philadelphia, Pa: Lippincott Williams & Wilkins, 2001; 65–84.
17. Frison L, Pocock S. Repeated measurements in clinical trials: analysis using mean summary statistics and its implications for design. Stat Med 1992; 11:1685–1704.
18. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981; 45.
19. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001; 55:19–24.

Kelly H. Zou, PhD
Kemal Tuncali, MD
Stuart G. Silverman, MD

Index terms:
Data analysis
Statistical analysis

Published online
10.1148/radiol.2273011499

Radiology 2003; 227:617–628

1 From the Department of Radiology, Brigham and Women’s Hospital (K.H.Z., K.T., S.G.S.), and Department of Health Care Policy (K.H.Z.), Harvard Medical School, 180 Longwood Ave, Boston, MA 02115. Received September 10, 2001; revision requested October 31; revision received December 26; accepted January 21, 2002. Address correspondence to K.H.Z. (e-mail: [email protected]). © RSNA, 2003

Correlation and Simple Linear Regression1

In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman ρ, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography–guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies.
© RSNA, 2003

Results of clinical studies frequently yield data that are dependent on each other (eg, total procedure time versus the dose in computed tomographic [CT] fluoroscopy, signal-to-noise ratio versus number of signals acquired during magnetic resonance imaging, and cigarette smoking versus lung cancer). The statistical concepts correlation and regression, which are used to evaluate the relationship between two continuous variables, are reviewed and demonstrated in this article.

Analyses between two variables may focus on (a) any association between the variables, (b) the value of one variable in predicting the other, and (c) the amount of agreement. Agreement will be discussed in a future article. Regression analysis focuses on the form of the relationship between variables, while the objective of correlation analysis is to gain insight into the strength of the relationship (1,2). Note that these two techniques are used to investigate relationships between continuous variables, whereas the χ² test is an example of a test for association between categorical variables. Continuous variables, such as procedure time, patient age, and number of lesions, have no gaps on the measurement scale. In contrast, categorical variables, such as patient sex and tissue classification based on segmentation, have gaps in their possible values. These two types of variables and the assumptions about their measurement scales were reviewed and distinguished in an article by Applegate and Crewson (3) published earlier in this Statistical Concepts Series in Radiology.

Specifically, the topics covered herein include two commonly used correlation coefficients, the Pearson correlation coefficient (4,5) and the Spearman ρ (6–10), for measuring linear and nonlinear relationships, respectively, between two continuous variables. Correlation analysis is often conducted in a retrospective or observational study. In a clinical trial, on the other hand, the investigator may also wish to manipulate the values of one variable and assess the changes in values of another variable. To evaluate the relative impact of the predictor variable on the particular outcome, simple regression analysis is preferred. We illustrate these statistical concepts with existing data to assess patient skin dose based on total procedure time by using a quick-check method in CT fluoroscopy–guided abdominal interventions (11).

These statistical methods are useful tools for assessing the relationships between continuous variables collected from a clinical study. However, it is also important to distinguish these statistical methods. While they are similar mathematically, their purposes are different. Correlation analysis is generally overused. It is often interpreted incorrectly (to establish “causation”) and should be reserved for generating hypotheses rather than for testing them. On the other hand, regression modeling is a more useful statistical technique that allows us to assess the strength of the relationships in the data and the uncertainty in the model by using confidence intervals (12,13).

CORRELATION

The purpose of correlation analysis is to measure and interpret the strength of a linear or nonlinear (eg, exponential, polynomial, and logistic) relationship between two continuous variables. When conducting correlation analysis, we use the term association to mean “linear association” (1,2). Herein, we focus on the Pearson and Spearman ρ correlation coefficients. Both correlation coefficients take on values between −1 and +1, ranging from being negatively correlated (−1) to uncorrelated (0) to positively correlated (+1). The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation (Table 1, Fig 1). We elaborate on two correlation coefficients, linear (eg, Pearson) and rank (eg, Spearman), that are commonly used for measuring linear and general relationships between two variables.

Linear Correlation

The Pearson correlation coefficient is also known as the sample correlation coefficient (r), product-moment correlation coefficient, or coefficient of correlation (14). It was introduced by Galton in 1877 (15,16) and developed later by Pearson (17). It measures the linear relationship between two random variables. For example, when the value of the predictor is manipulated (increased or decreased) by a fixed amount, the outcome variable changes proportionally (linearly). A linear correlation coefficient can be computed by means of the data and their sample means (Appendix A). When a scientific study is planned, the required sample size may be computed on the basis of a certain hypothesized value with the desired statistical power at a specified level of significance (Appendix B) (18).

Rank Correlation

The Spearman ρ is the sample correlation coefficient (rs) of the ranks (the relative order) based on continuous data (19,20). It was first introduced by Spearman in 1904 (6). The Spearman ρ is used to measure the monotonic relationship between two variables (ie, whether one variable tends to take either a larger or smaller value, though not necessarily linearly) by increasing the value of the other variable.

Linear versus Rank Correlation Coefficients

The Pearson correlation coefficient necessitates use of interval or continuous measurement scales of the measured outcome in the study population. In contrast, rank correlations also work well with ordinal rating data, and continuous data are reduced to their ranks (Appendix C) (20,21). The rank procedure will also be illustrated briefly with our example data. The smallest value in the sample has rank 1, and the largest has the highest rank. In general, rank correlations are not easily influenced by the presence of skewed data or data that are highly variable.

Statistical Hypothesis Tests for a Correlation Coefficient

The null hypothesis states that the underlying linear correlation has a hypothesized value, ρ0. The one-sided alternative hypothesis is that the underlying value exceeds (or is less than) ρ0. When the sample size (n) of the paired data is large (n ≥ 30 for each variable), the standard error (s) of the linear correlation (r) is approximately s(r) = (1 − r²)/√n. The test statistic value (r − ρ0)/s(r) may be computed by means of the z test (22). If the P value is below .05, the null hypothesis is rejected. The P value based on the Spearman ρ can be found in the literature (20,21).
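
A minimal sketch of this large-sample z test (Python with scipy; the numeric example is hypothetical, not from the article):

```python
from math import sqrt
from scipy.stats import norm

def corr_z_test(r, rho0, n):
    """Large-sample z test of H0: rho = rho0 with the approximation
    s(r) = (1 - r^2) / sqrt(n) described above (n >= 30 per variable)."""
    s_r = (1 - r ** 2) / sqrt(n)
    z = (r - rho0) / s_r
    p_one_sided = norm.sf(z)  # upper-tail P value (alternative: rho > rho0)
    return z, p_one_sided

# Hypothetical numbers: observed r = 0.85 from n = 40 pairs, H0: rho = 0.60
print(corr_z_test(0.85, 0.60, 40))
```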

Limitations and Precautions

It is worth noting that even if two variables (eg, cigarette smoking and lung cancer) are highly correlated, that is not sufficient proof of causation. One variable may cause the other or vice versa, or a third factor is involved, or a rare event may have occurred. To conclude causation, the causal variable must precede the variable it causes, and several conditions must be met (eg, reversibility, strength, and exposure response on the basis of the Bradford-Hill criteria or the Rubin causal model) (23–26).

SIMPLE LINEAR REGRESSION

The purpose of simple regression analysis is to evaluate the relative impact of a predictor variable on a particular outcome. This is different from a correlation analysis, where the purpose is to examine the strength and direction of the relationship between two random variables.

TABLE 1
Interpretation of Correlation Coefficient

Correlation Coefficient Value    Direction and Strength of Correlation
−1.0                             Perfectly negative
−0.8                             Strongly negative
−0.5                             Moderately negative
−0.2                             Weakly negative
 0.0                             No association
+0.2                             Weakly positive
+0.5                             Moderately positive
+0.8                             Strongly positive
+1.0                             Perfectly positive

Note.—The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation.

Figure 1. Scatterplots of four sets of data generated by means of the following Pearson correlation coefficients (from left to right): r = 0 (uncorrelated data), r = 0.8 (strongly positively correlated), r = 1.0 (perfectly positively correlated), and r = −1 (perfectly negatively correlated).

In this article, we deal with only linear regression of one continuous variable on another continuous variable with no gaps on each measurement scale (3). There are other types of regression (eg, multiple linear, logistic, and ordinal) analyses, which will be provided in a future article in this Statistical Concepts Series in Radiology.

A simple regression model contains only one independent (explanatory) variable, Xi, for i = 1, . . ., n subjects, and is linear with respect to both the regression parameters and the dependent variable. The corresponding dependent (outcome) variable is labeled Yi. The model is expressed as

Yi = a + bXi + ei,   (1)

where the regression parameter a is the intercept (on the y axis), and the regression parameter b is the slope of the regression line (Fig 2). The random error term ei is assumed to be uncorrelated, with a mean of 0 and constant variance. For convenience in inference and improved efficiency in estimation (27), analyses often incur an additional assumption that the errors are distributed normally. Transformation of the data to achieve normality may be applied (28,29). Thus, the word line (linear, independent, normal, equal variance) summarizes these requirements.

Typical steps for regression model analysis are the following: (a) determine if the assumptions underlying a normal relationship are met in the data, (b) obtain the equation that best fits the data, (c) evaluate the equation to determine the strength of the relationship for prediction and estimation, and (d) assess whether the data fit these criteria before the equation is applied for prediction and estimation.

Least Squares Method

The main goal of linear regression is to fit a straight line through the data that predicts Y based on X. To estimate the intercept and slope regression parameters that determine this line, the least squares method is commonly used. It is not necessary for the errors to have a normal distribution, although the regression analysis is more efficient with this assumption (27). With this regression method, a set of regression parameters is found such that the sum of squared residuals (ie, the differences between the observed values of the outcome variable and the fitted values) is minimized (14). The fitted y value is then computed as a function of the given x value and the estimated intercept and slope regression parameters (Appendix D). For example, in Equation (1), once the estimates of a and b are obtained from the regression analysis, the predicted y value at any given x value is calculated as a + bx.

Coefficient of Determination, R2

It is meaningful to interpret the value of the Pearson correlation coefficient r by squaring it; hence, the term R-square (R2) or coefficient of determination. This measure (with a range of 0–1) is the fraction of the variability in Y that can be explained by the variability in X through their linear relationship, or vice versa. That is, R2 = SSregression/SStotal, where SS stands for the sum of squares. Note that R2 is calculated only on the basis of the Pearson correlation coefficient in the linear regression analysis. Thus, it is not appropriate to compute R2 on the basis of rank correlation coefficients such as the Spearman ρ.

Statistical Hypothesis Tests

There are several hypotheses in the context of regression analysis, for example, to test if the slope of the regression line is b = 0 (hypothesis: there is no linear association between Y and X). One may also test whether the intercept a takes on a certain value. The significance of the effects of the intercept and slope may also be computed by means of a Student t statistic introduced earlier in this Statistical Concepts Series in Radiology (30).

Limitations and Precautions

The following understandings should be considered when regression analysis is performed. (a) To understand whether the assumptions have been met, determine the magnitude of the gap between the data and the assumptions of the model. (b) No matter how strong a relationship is demonstrated with regression analysis, it should not be interpreted as causation (as in the correlation analysis). (c) The regression should not be used to predict or estimate outside the range of values of the independent variable of the sample (eg, extrapolation of radiation cancer risk from the Hiroshima data to that of diagnostic radiologic tests).

AN EXAMPLE: DOSE VERSUS TOTAL PROCEDURE TIME IN CT FLUOROSCOPY

We applied these statistical methods to help assess the benefit of the use of CT fluoroscopy to guide interventions in the abdomen (11). During CT fluoroscopy–guided interventions, one might postulate that the radiation dose received by a patient is related to (or correlated with) the total procedure time, because the more difficult the procedure is, the more CT fluoroscopic scanning is required, which means a longer procedure time. The rationale was to assess whether radiation dose could be estimated by simply measuring the total CT fluoroscopic procedure time, with the null hypothesis that the slope of the regression line is 0.

Earlier, we discussed two methods to target lesions with CT fluoroscopy. In one method, continuous CT scanning is used during needle placement. In the other method, short CT scanning is used to image the needle after it is placed. The latter method, the so-called quick-check method, has been adopted almost exclusively at our institution. Now, we demonstrate correlation and regression analyses based on a subset of the interventional procedures (n = 19). With the quick-check method, we examine the relationship between total procedure time (in minutes) and dose (in rads) on a natural log scale. We also examine the marginal ranks of the x (log of total time) and y (log of dose) components (Table 2). For convenience, the x data are given in ascending order.

In Table 2, each set of rank data is derived by first placing the 19 observations in each sample in ascending order and then assigning ranks 1–19. Ties are broken by means of averaging the respective adjacent ranks. Finally, the ranks are identified for the observations of each of the paired x and y samples.

Figure 2. Simple linear regression model shows that the expectation of the dependent variable Y is linear in the independent variable X, with an intercept a = 1.0 and a slope b = 2.0.

The natural log (ln) transformation of the total time is used to make the data appear normal, for more efficient analysis (Appendix D), with normality verified statistically (31). However, normality is not necessary in the subsequent regression analysis. We created a scatterplot of the data, with the log of total time (ln[min]) on the x axis and the log of dose (ln[rad]) on the y axis (Fig 3).

For illustration purposes, we will conduct both correlation and regression analyses; however, the choice of analysis depends on the aim of research. For example, if the investigators wish to assess whether there is a relationship between time and dose, then correlation analysis is appropriate. In comparison, if the investigators wish to evaluate the impact of the total time on the resulting dose, then regression analysis is preferred.

Correlations

The Pearson correlation coefficient of the log-transformed data was r = 0.85. To compute the Spearman ρ, the marginal ranks of time and dose were derived separately; consequently, rs = 0.84. Both correlation coefficients confirm that the log of total time and the log of dose are correlated strongly and positively.

Regression

We first conducted a simple linear regression analysis of the data on a log scale (n = 19); results are shown in Table 3. The value calculated for R2 was 0.73, which suggests that 73% of the variability of the data could be explained by the linear regression.

The regression line, expressed in the form given in Equation (1), is Y = −9.28 + 2.83X, where the predictor variable X represents the log of total time, and the outcome variable Y represents the log of dose. The estimated regression parameters are a = −9.28 (intercept) and b = 2.83 (slope) (Fig 4). This regression line can be interpreted as follows: At X = 0, the value of Y is −9.28. For every one-unit increase in X, the value of Y will increase on average by 2.83. Effects of both the intercept and slope are statistically significant (P < .005) (Excel; Microsoft, Redmond, Wash); therefore, the null hypothesis (H0: the dose remains constant as the total procedure time increases) is rejected. Thus, we confirm the alternative hypothesis (H1: the dose increases with the total procedure time).

The regression line may be used to give predicted values of Y. For example, if in a future CT fluoroscopy procedure the log total time is specified at x = 4 (translated to e^4 ≈ 55 minutes), then the log dose that is to be applied is approximately y = −9.28 + 2.83 × 4 = 2.04 (translated to e^2.04 ≈ 7.69 rad). On the other hand, if the log total time is specified at x = 4.5 (translated to e^4.5 ≈ 90 minutes), then the log dose that is to be applied is approximately y = −9.28 + 2.83 × 4.5 = 3.46 (translated to e^3.46 ≈ 31.82 rad). Such prediction can be useful for future clinical practice.

SUMMARY AND REMARKS

Two important statistical concepts, correlation and regression, which are used commonly in radiology research, are reviewed and demonstrated herein.

TABLE 2
Total Procedure Time and Dose of CT Fluoroscopy–guided Procedures, by Means of the Quick-Check Method

Subject No.   x Data: Log Time (ln[min])   Ranks of x Data   y Data: Log Dose (ln[rad])   Ranks of y Data
1             3.61                         1                 1.48                         2
2             3.87                         2                 1.24                         1
3             3.95                         3                 2.08                         5.5
4             4.04                         4                 1.70                         3
5             4.06                         5                 2.08                         5.5
6             4.11                         6                 2.94                         10
7             4.19                         7                 2.24                         7
8             4.20                         8                 1.85                         4
9             4.32                         9.5               2.84                         9
10            4.32                         9.5               3.93                         16
11            4.42                         11.5              3.03                         11
12            4.42                         11.5              3.23                         13
13            4.45                         13                3.87                         15
14            4.50                         14                3.55                         14
15            4.52                         15                2.81                         8
16            4.57                         16                4.07                         17
17            4.58                         17                4.44                         19
18            4.61                         18                3.16                         12
19            4.74                         19                4.19                         18

Source.—Reference 11.
Note.—Paired x and y data are sorted according to the x component; therefore, the log of the total procedure time and the corresponding ranks have an increasing order. When ties are present in the data, the average of their adjacent ranks is used. Pearson correlation coefficient between log time and log dose, r = 0.85; Spearman ρ = 0.84.

TABLE 3
Results Based on Correlation and Regression Analysis for Example Data

Regression Statistic           Numerical Result
Correlation coefficient r     0.85
R-square (R2)                 0.73
Regression parameter
  Intercept                   −9.28
  Slope                       2.83

Source.—Reference 11.

Figure 3. Scatterplot of the log of dose (y axis) versus the log of total time (x axis). Each point in the scatterplot represents the values of two variables for a given observation.

Figure 4. Scatterplot of the log of dose (y axis) versus the log of total time (x axis). The regression line has the intercept a = −9.28 and slope b = 2.83. We conclude that there is a possible association between the radiation dose and the total time of the procedure.

Additional sources of information and electronic textbooks on statistical analysis methods found on the World Wide Web are listed in Appendix E. A glossary of the statistical terms used in this article is presented in Appendix F.

When correlation analysis is conducted to measure the association between two random variables, either the Pearson linear correlation coefficient or the Spearman rank correlation coefficient ρ may be adopted. The former coefficient is used to measure the linear relationship but is not recommended for use with skewed data or data with extremely large or small values (often called outliers). In contrast, the latter coefficient is used to measure a general association, and it is recommended for use with data that are skewed or that have outliers.

When simple regression analysis is conducted to assess the linear relationship of a dependent variable as a function of the independent variable, caution must be used when determining which of the two variables is viewed as the independent variable that makes sense clinically. A useful graphical aid is a scatterplot. Once the regression line is obtained, caution should also be used to avoid prediction of a y value for any value of x that is outside the range of the data. Finally, correlation and regression analyses do not infer causality, and more rigorous analyses are required if causal inference is to be made (23–26).

APPENDIX A

Formula for computing the Pearson correlation coefficient, r: The formula for computing r between bivariate data, Xi and Yi values (i = 1, . . ., n), is

r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² Σ(Yi − Ȳ)²],

where the sums run over i = 1, . . ., n, and X̄ and Ȳ are the sample means of the Xi and Yi values, respectively.

The Pearson correlation coefficient may be computed by means of a computer-based statistics program (Excel; Microsoft) by using the option “Correlation” under the option “Data Analysis Tools”. Alternatively, it may also be computed by means of a built-in software function “Cor” (Insightful; MathSoft, Seattle, Wash [MathSoft S-Plus 4 guide to statistics, 1997; 89–96]. Available at: www.insightful.com) or with a free software program (R Software. Available at: lib.stat.cmu.edu/R).

APPENDIX B

Total sample size based on the Pearson correlation coefficient: Specify r = expected correlation coefficient, C = 0.5 × ln[(1 + r)/(1 − r)], N = total number of subjects required, α = type I error (ie, significance level, typically fixed at 0.05), and β = type II error (ie, 1 minus statistical power, typically fixed at 0.10). Then N = [(Zα + Zβ)/C]² + 3, where Zα is the inverse of the cumulative probability of a standard normal distribution with the tail probability of α. Similarly, Zβ is the inverse of the cumulative probability of a standard normal distribution with the tail probability of β. Consequently, compute the smallest integer, n, such that n ≥ N, as the required sample size.

For example, an investigator wishes to conduct a clinical trial of a paired design based on a one-tailed hypothesis test of the correlation coefficient. The null hypothesis is that the correlation between two variables is r = 0.60 (ie, C = 0.693) in the population of interest. The alternative hypothesis is that the correlation is r > 0.60. Type I error is fixed to be 0.05 (ie, Zα = 1.645), while type II error is fixed to be 0.10 (ie, Zβ = 1.282). Thus, the required sample size is N = 21 subjects. A sample size table may also be found in reference 18.
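
A sketch of this calculation (Python with scipy; the function name is ours), which reproduces the worked example:

```python
from math import ceil, log
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, beta=0.10):
    """Appendix B sample size: N = ((Z_alpha + Z_beta) / C)^2 + 3,
    with C = 0.5 * ln((1 + r) / (1 - r)); one-tailed test as in the example."""
    c = 0.5 * log((1 + r) / (1 - r))
    z_alpha = norm.ppf(1 - alpha)  # 1.645 for a one-tailed alpha of 0.05
    z_beta = norm.ppf(1 - beta)    # 1.282 for 90% power (beta = 0.10)
    return ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(n_for_correlation(0.60))  # 21 subjects, matching the worked example
```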

APPENDIX C

Formula for computing Spearman ρ and Pearson rs: Replace bivariate data, Xi and Yi (i = 1, . . ., n), by their respective ranks Ri = rank(Xi) and Si = rank(Yi). The rank correlation coefficient, rs, is defined as the Pearson correlation coefficient between the Ri and Si values, which can be computed by means of the formula given in Appendix A. An alternative direct formula was given by Hettmansperger (19).

The Spearman ρ may also be computed by first reducing the continuous data to their marginal ranks by using the “rank and percentile” option with Data Analysis Tools (Excel; Microsoft) or the “rank” function (Insightful; MathSoft) or the free software. The latter programs correctly rank the data in ascending order. However, the rank and percentile option in Excel ranks the data in descending order (the largest is 1). Therefore, to compute the correct ranks, one may first multiply all of the data by −1 and then apply the rank function. Excel also gives integer ranks in the presence of ties, in contrast with methods that yield possible noninteger ranks, as described in the standard statistics literature (19).

Subsequently, the sample correlation coefficient is computed on the basis of the ranks of the two marginal data by using the Correlation option in Data Analysis Tools (Excel; Microsoft) or by using the Cor function (Insightful; MathSoft) or the free software.

APPENDIX D

Simple regression analysis: Regression analysis may be performed by using the “Regression” option with Data Analysis Tools (Excel; Microsoft). This regression analysis tool yields the sample correlation R2; estimates of the regression parameters, along with their statistical significance on the basis of the Student t test; residuals; and standardized residuals. Scatter, line fit, and residual plots may also be created. Alternatively, the analyses can be performed by using the function “lsfit” (Insightful; MathSoft) or the free software.

With either program, one may choose to transform the data or exclude outliers before conducting a simple regression analysis. A commonly used variance-stabilizing transformation is the natural log function (ln) applied to one or both variables. Other transformation (eg, Box-Cox transformation) and weighting methods in regression analysis may also be used (28,29).
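
The same analysis can also be sketched in Python with scipy.stats.linregress on the full data of Table 2; within rounding it should reproduce the estimates in Table 3:

```python
from scipy.stats import linregress

x = [3.61, 3.87, 3.95, 4.04, 4.06, 4.11, 4.19, 4.20, 4.32, 4.32,
     4.42, 4.42, 4.45, 4.50, 4.52, 4.57, 4.58, 4.61, 4.74]  # ln(time), Table 2
y = [1.48, 1.24, 2.08, 1.70, 2.08, 2.94, 2.24, 1.85, 2.84, 3.93,
     3.03, 3.23, 3.87, 3.55, 2.81, 4.07, 4.44, 3.16, 4.19]  # ln(dose), Table 2

fit = linregress(x, y)  # least squares intercept/slope plus test statistics
print(fit.intercept, fit.slope, fit.rvalue ** 2, fit.pvalue)
# Expected, within rounding: intercept ~ -9.28, slope ~ 2.83, R^2 ~ 0.73
```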

APPENDIX E

Uniform resource locator, or URL, links to electronic statistics textbooks: www.davidmlane.com/hyperstat/index.html, www.statsoft.com/textbook/stathome.html, www.ruf.rice.edu/~lane/rvls.html, www.bmj.com/collections/statsbk/index.shtml, espse.ed.psu.edu/statistics/investigating.htm.

APPENDIX F

Glossary of statistical terms:

Bivariate data.—Measurements obtained on more than one variable for the same unit or subject.

Correlation coefficient.—A statistic between −1 and 1 that measures the association between two variables.

Intercept.—The constant a in the regression equation, which is the value for y when x = 0.

Least squares method.—The regression line that is the best fit to the data, for which the sum of the squared residuals is minimized.

Outlier.—An extreme observation far away from the bulk of the data, often caused by faulty measuring equipment or recording error.

Pearson correlation coefficient.—Sample correlation coefficient for measuring the linear relationship between two variables.

R2.—The square of the Pearson correlation coefficient r, which is the fraction of the variability in Y that can be explained by the variability in X through their linear relationship, or vice versa.

Rank.—The relative ordering of the measurements in a variable, which can be noninteger numbers in the presence of ties.

Residual.—The difference between the observed values of the outcome variable and the fitted values based on a linear regression analysis.

Scatterplot.—A plot of the observed bivariate outcome variable (y axis) against its predictor variable (x axis), with a dot for each pair of bivariate observations.

Simple linear regression analysis.—A linear regression analysis with one predictor and one outcome variable.

Skewed data.—A distribution is skewed if there are more extreme data on one side of the mean. Otherwise, the distribution is symmetric.

Slope.—The constant b in the regression equation, which is the change in y that corresponds to a one-unit increase (or decrease) in x.

Spearman ρ.—A rank correlation coefficient for measuring the monotone relationship between two variables.

Acknowledgments: We thank Kimberly E. Applegate, MD, MS, and Philip E. Crewson, PhD, co-editors of this Statistical Concepts Series in Radiology, for their constructive comments on earlier versions of this article.

References
1. Krzanowski WJ. Principles of multivariate analysis: a user’s perspective. Oxford, England: Clarendon, 1988; 405–432.
2. Rodriguez RN. Correlation. In: Kotz S, Johnson NL, eds. Encyclopedia of statistical sciences. New York, NY: Wiley, 1982; 193–204.
3. Applegate KE, Crewson PE. An introduction to biostatistics. Radiology 2002; 225:318–322.
4. Goldman RN, Weinberg JS. Statistics: an introduction. Upper Saddle River, NJ: Prentice Hall, 1985; 72–98.
5. Freund JE. Mathematical statistics. 5th ed. Upper Saddle River, NJ: Prentice Hall, 1992; 494–546.
6. Spearman C. The proof and measurement of association between two things. Am J Psychol 1904; 15:72–101.
7. Fieller EC, Hartley HO, Pearson ES. Tests for rank correlation coefficients. I. Biometrika 1957; 44:470–481.
8. Fieller EC, Pearson ES. Tests for rank correlation coefficients. II. Biometrika 1961; 48:29–40.
9. Kruskal WH. Ordinal measurement of association. J Am Stat Assoc 1958; 53:814–861.
10. David FN, Mallows CL. The variance of Spearman’s rho in normal samples. Biometrika 1961; 48:19–28.
11. Silverman SG, Tuncali K, Adams DF, Nawfel RD, Zou KH, Judy PF. CT fluoroscopy-guided abdominal interventions: techniques, results, and radiation exposure. Radiology 1999; 212:673–681.
12. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 7th ed. New York, NY: Wiley, 1999.
13. Altman DG. Practical statistics for medical research. Boca Raton, Fla: CRC, 1990.
14. Neter J, Wasserman W, Kutner MH. Applied linear models: regression, analysis of variance, and experimental designs. 3rd ed. Homewood, Ill: Irwin, 1990; 38–44, 62–104.
15. Galton F. Typical laws of heredity. Proc R Inst Great Britain 1877; 8:282–301.
16. Galton F. Correlations and their measurements, chiefly from anthropometric data. Proc R Soc London 1888; 45:219–247.
17. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Phil Trans R Soc Lond Series A 1896; 187:253–318.
18. Hulley SB, Cummings SR. Designing clinical research: an epidemiological approach. Baltimore, Md: Williams & Wilkins, 1988; appendix 13.C.
19. Hettmansperger TP. Statistical inference based on ranks. Malabar, Fla: Krieger, 1991; 200–205.
20. Kendall M, Gibbons JD. Rank correlation methods. 5th ed. New York, NY: Oxford University Press, 1990; 8–10.
21. Zou KH, Hall WJ. On estimating a transformation correlation coefficient. J Appl Stat 2002; 29:745–760.
22. Fisher RA. Frequency distributions of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 1915; 10:507–521.
23. Duncan OD. Path analysis: sociological examples. In: Blalock HM Jr, ed. Causal models in the social sciences. Chicago, Ill: Aldine-Atherton, 1971; 115–138.
24. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Ed Psych 1974; 66:688–701.
25. Holland P. Statistics and causal inference. J Am Stat Assoc 1986; 81:945–970.
26. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc 1996; 91:444–455.
27. Seber GAF. Linear regression analysis. New York, NY: Wiley, 1997; 48–51.
28. Carroll RJ, Ruppert D. Transformation and weighting in regression. New York, NY: Chapman & Hall, 1988; 2–61.
29. Box GEP, Cox DR. An analysis of transformations. J R Stat Soc Series B 1964; 42:71–78.
30. Tello R, Crewson PE. Hypothesis testing II: means. Radiology 2003; 227:1–4.
31. Mudholkar GS, McDermott M, Srivastava DK. A test of p-variate normality. Biometrika 1992; 79:850–854.

Curtis P. Langlotz, MD, PhD

Index terms:
Diagnostic radiology
Statistical analysis

Published online
10.1148/radiol.2281011106

Radiology 2003; 228:3–9

Abbreviation:
ROC = receiver operating characteristic

1 From the Departments of Radiology and Epidemiology, University of Pennsylvania, 3400 Spruce St, Philadelphia, PA 19104. Received June 25, 2001; revision requested August 2; revision received May 3, 2002; accepted May 15. Address correspondence to the author (e-mail: [email protected]). © RSNA, 2003

Fundamental Measures of Diagnostic Examination Performance: Usefulness for Clinical Decision Making and Research1

Measures of diagnostic accuracy, such as sensitivity, specificity, predictive values, and receiver operating characteristic curves, can often seem like abstract mathematic concepts that have a minimal relationship with clinical decision making or clinical research. The purpose of this article is to provide definitions and examples of these concepts that illustrate their usefulness in specific clinical decision-making tasks. In particular, nine principles are provided to guide the use of these concepts in daily radiology practice, in interpreting clinical literature, and in designing clinical research studies. An understanding of these principles and of the measures of diagnostic accuracy to which they apply is vital to the appropriate evaluation and use of diagnostic imaging examinations.
© RSNA, 2003

The bulk of the radiology literature concerns the assessment of examination performance, which is sometimes referred to as diagnostic accuracy. Despite the proliferation of such research on examination performance, it is still difficult to assess new imaging technologies, in part because such initial assessments are not always performed with an eye for how the results will be used clinically (1). The goal of this article is to describe nine fundamental principles (Appendix) to help answer specific clinical questions by using the radiology literature.

Consider the following clinical scenario: A referring physician calls you about the findings of a diagnostic mammogram that you interpreted yesterday. In the upper outer quadrant of the left breast you identified a cluster of suspicious microcalcifications—not the kind that suggests definite cancer but rather that which indicates the need for a more definitive work-up. The referring physician relays to you the patient’s desire to explore the possibility of breast magnetic resonance (MR) imaging.

In this article, I will use this clinical example to illustrate the basic concepts of examination performance. To supplement previously published introductory material (2–4), I will relate the nine fundamental principles to the specific clinical scenario just described to illustrate the strengths and weaknesses of using them for clinical decision making and clinical research. I plan to answer the following questions in the course of this discussion: Which descriptors of an examination are the best intrinsic measures of performance? Which are the most clinically important? What are the limitations of sensitivity and specificity in the assessment of diagnostic examinations? What are receiver operating characteristic (ROC) curves, and why is their clinical usefulness limited? Why are predictive values more clinically relevant, and what are the pitfalls associated with using them? The ability of radiologists to understand the answers to these questions is critical to improving the application of the radiology literature to clinical practice.

TWO-BY-TWO CONTINGENCY TABLE: A SIMPLE AND UNIVERSAL TOOL

One of the most intuitive methods for the analysis of diagnostic examinations is the two-by-two table. This simple device can be jotted on the back of an envelope yet is quite versatile and powerful, both in the analysis of a diagnostic examination and in increasing our understanding of examination performance.

Simplifying Assumptions

The use of two-by-two tables (a more general term is contingency tables) requires certain simplifying assumptions and prerequisites. The first assumption that I will make is that the examination in question must be compared with a reference-standard examination—that is, one with results that yield the truth with regard to the presence or absence of the disease. In the past, this reference standard has commonly been called a "gold standard"—a term that is falling out of favor, perhaps because of the recognition that even some of the best reference standards are imperfect. For example, even clinical diagnoses supplemented by the results of the most effective histopathologic analyses are fallible (5).

A second major assumption that I will make is that the examination result must be considered either positive or negative. This is perhaps the least appealing assumption with regard to a two-by-two table, because many examinations have continuous result values, such as the degree of stenosis in a vessel or the attenuation of a liver lesion. As we will see later, this is one of the first assumptions that I will discard when advanced concepts such as ROC curves are discussed (6–8). The final assumption is that one assesses examination performance with respect to the presence or absence of a single disease and not several diseases.

Example Use of a Two-by-Two Table

Table 1 is a prototypical form of a two-by-two table. Across the top, we see two center columns, one for all cases (or patients) in which the disease is truly present (D+) and the other for all cases in which the disease is truly absent (D−). In the far left column of the table, we see the two possible examination results: positive, indicating disease presence, and negative, indicating disease absence. This table summarizes the relationship between the examination result and the reference-standard examination and defines four distinct table cells (ie, true-positive, false-positive, true-negative, and false-negative examination results). In the first row (T+), we see that a positive examination result can be either true-positive or false-positive, depending on whether the disease is present or absent, respectively. The second row (T−) shows that a negative examination result can be either false-negative or true-negative, again depending on whether the disease is present or absent, respectively.

Data in the D" column show how theexamination performs (ie, yields resultsthat indicate the true presence or trueabsence of a given disease) in patientswho have the disease in question. Data inthe D# column show how the examina-tion performs in patients who do nothave the disease (ie, who are “healthy”with respect to the disease in question).The total numbers of patients who actu-ally do and do not have the disease ac-cording to the reference-standard exami-nation results are listed at the bottom ofthe D" and D# columns, respectively.

The datum in the first row at the far right (TP + FP) is the total number of patients who have positive examination results; the datum in the second row at the far right (FN + TN) is the total number of patients who have negative examination results. The overall total (N) is the total number of patients who participated in the study of examination performance.

The example data in Table 2 are interim data from an experiment to evaluate the accuracy of breast MR imaging in patients with clinically or mammographically suspicious lesions. Like the patient with suspicious microcalcifications who is considering undergoing MR imaging in the hypothetical scenario described earlier, all patients in this experiment had suspicious lesions and were about to undergo open excisional biopsy. Prior to biopsy, each woman underwent dynamic contrast material–enhanced MR imaging of the breast. The results of histopathologic examination of the specimen obtained at subsequent excisional biopsy were used as the reference standard for disease. (A more detailed description of the experimental methodology and a more recent report of the data are published elsewhere [9].) As shown in Table 2, a total of 182 women were enrolled in the study at the time the table was constructed. Seventy-four of these women had cancer, and 108 did not. There were a total of 99 positive examination results: 71 were true-positive and 28 false-positive. The 83 negative examination results comprised three false-negative and 80 true-negative results.

In the following sections, I describe the important quantitative measures of examination performance that can be computed from a two-by-two table.

SENSITIVITY AND SPECIFICITY: INTRINSIC MEASURES OF EXAMINATION PERFORMANCE

Sensitivity: Examination Performance in Patients with the Disease in Question

Principle 1: Sensitivity is a measure of how a diagnostic examination performs in a population of patients who have the disease in question. The value can be defined as how often the examination will enable detection of the disease when it is present: TP/(TP + FN).

Given the data in two-by-two Table 2, sensitivity is computed by using the numbers in the "Malignant" column. Of the 74 women who actually had cancer, 71 had a positive MR imaging result. Thus, the sensitivity of breast MR imaging in the sample of women undergoing breast biopsy was 96%—that is, 71 of 74 women with cancer were identified by using MR imaging.

TABLE 1
Shorthand Two-by-Two Table Describing Diagnostic Examination Performance

Examination Result   D+        FP        Total
T+                   TP        FP        TP + FP
T−                   FN        TN        FN + TN
Total                TP + FN   FP + TN   N

Note.—D+ = all cases or patients in which disease is truly present (ie, according to reference-standard examination results), D− = all cases or patients in which disease is truly absent, FN = number of cases or patients with false-negative examination results, FP = number of cases or patients with false-positive examination results, N = overall total number of cases or patients, T+ = positive examination result, T− = negative examination result, TN = number of cases or patients with true-negative examination results, TP = number of cases or patients with true-positive examination results.

TABLE 2
Patient Data in Experiment to Study Breast MR Imaging

MR Imaging Result   Malignant   Benign   Total
Positive            71          28       99
Negative            3           80       83
Total               74          108      182

Note.—Data are numbers of women with malignant or benign breast tumors.


Specificity: Examination Performance in Patients without the Disease in Question

Principle 2: Specificity is a measure of how a diagnostic examination performs in a population of patients who do not have the disease (ie, healthy subjects)—in other words, a value of the ability of an examination to yield an appropriately negative result in these patients. Specificity can be defined as how often a healthy patient will have a normal examination result: TN/(FP + TN).

In the example scenario, specificity is calculated by using the numbers in the "Benign" column of two-by-two Table 2. Of the 108 women who had benign lesions, 80 had negative MR imaging results. Thus, the specificity of breast MR imaging in the sample of women undergoing breast biopsy was 74%—that is, 80 of 108 women without cancer were identified by using MR imaging.
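These calculations are simple enough to check by hand, but scripting them makes it easy to rerun them for other tables. The following minimal Python sketch (the variable names are mine, not the article's) reproduces the sensitivity and specificity computed from Table 2:

    # Two-by-two table counts from Table 2 (breast MR imaging study).
    TP, FP = 71, 28  # positive MR results: malignant, benign
    FN, TN = 3, 80   # negative MR results: malignant, benign

    sensitivity = TP / (TP + FN)  # principle 1: performance in women with cancer
    specificity = TN / (FP + TN)  # principle 2: performance in women without cancer

    print(f"sensitivity = {sensitivity:.0%}")  # 96%
    print(f"specificity = {specificity:.0%}")  # 74%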

Relative Importance of Sensitivity and Specificity

How can sensitivity and specificity values be used directly to determine whether an examination might be useful in a specific clinical situation? Which value is more important? A quantitative analysis of these questions (4) is beyond the scope of this article. However, here are two qualitative rules of thumb, which together make up principle 3: A sensitive examination is more valuable in situations where false-negative results are more undesirable than false-positive results. A specific examination is more valuable in situations where false-positive results are more undesirable than false-negative results.

For example, with regard to a woman with a suspicious breast mass, we must consider how we would feel if we were to miss a cancer owing to a false-negative examination. Because we would regret this outcome, we place appropriate emphasis on developing and enhancing the sensitivity of breast MR imaging to avoid missing cancers that may progress during the follow-up interval after a false-negative MR imaging examination. We would also feel uncomfortable about referring a patient for excisional biopsy of a benign lesion, but perhaps less so, since this result would occur even if MR imaging was never performed. Consequently, this principle leads us to the conclusion that sensitivity is more important than specificity with respect to breast MR imaging in this clinical setting. Because the main potentially beneficial role of breast MR imaging in this clinical setting is to allow some women without cancer to avoid excisional biopsy, this principle also highlights the greater importance of sensitivity compared with specificity in this case. Pauker and Kassirer (10) provide a quantitative discussion of how this principle functions.

Limitations

Sensitivity and specificity are important because they are diagnostic examination descriptors that do not vary greatly among patient populations. A detailed analysis of the limitations of these measures is described elsewhere (11). Let us return to the woman with a suspicious lesion on the mammogram. She wants to know whether breast MR imaging might help her. Now that we have computed the sensitivity and specificity of MR imaging by using the data in the two-by-two table, we can convey to her the following: "If you have cancer, the chance that your MR imaging examination will be positive is 96%. If you don't have cancer, the chance that your MR imaging examination will be negative is 74%." Statements of this kind are often difficult for patients and health care providers to incorporate into their clinical reasoning. Thus, a key weakness of sensitivity and specificity values is that they do not yield information about a diagnostic examination in a form that is immediately relevant to a specific clinical decision-making task. Therefore, while the diagnostic imaging literature may contain a great deal of information about the measured sensitivity and specificity of a given examination, it often contains few data that help us assess the optimal clinical role of the examination (12).

Principle 4: The sensitivity and specificity of a diagnostic examination are related to one another. An additional important weakness of sensitivity and specificity is that these two measures cannot always be used to rank the accuracy of two examinations (or two radiologists). This weakness is particularly evident when one examination has a higher sensitivity but a lower specificity than another. The reason that examination comparison difficulties often arise with regard to sensitivity and specificity is that these two values are inherently related: You cannot evaluate one without the other. As sensitivity increases, specificity tends to decrease, and vice versa. We see this phenomenon every day when two colleagues interpret the same images differently.

Consider, for example, how two radiologists decide whether congestive heart failure is depicted on a chest radiograph. One reader may use strict criteria for the presence of congestive heart failure and thus interpret fewer findings as positive, facilitating decreased sensitivity and increased specificity. The other reader may use much more flexible criteria and thus interpret more image findings as positive for congestive heart failure, facilitating increased sensitivity but decreased specificity.

Figure 1. Six-category scale for rating the presence or absence of breast cancer. By varying a cutoff for rating categories, one can create five two-by-two tables of data from which sensitivity and specificity values can be calculated (sensitivity and specificity #1–#5).

ROC CURVES

Comprehensive Comparisons among Diagnostic Examinations

Principle 5: ROC curves provide a method to compare diagnostic examination accuracy independently of the diagnostic criteria (ie, strict or flexible) used. When one examination is more sensitive but another is more specific, how do we decide which examination provides better diagnostic information? Or, are the accuracies of the two examinations really similar, with the exception that one involves the use of stricter criteria for a positive result? ROC curves are important and useful because they can answer these questions by explicitly representing the inherent relationship between the sensitivity and specificity of an examination. As such, ROC curves are designed to illustrate the overall information yielded by an imaging examination, regardless of the criteria a reader uses to interpret the images. Therefore, ROC curves specifically address situations in which examinations cannot be compared on the basis of sensitivity and specificity alone. A detailed discussion of ROC methodology is published elsewhere (7).

For the discussion of ROC curves, I will "relax" one of the assumptions made earlier—that of a two-value (ie, positive or negative) examination result. Instead, the readers of the images generated in the two examinations will be allowed to specify their results on a scale. Figure 1 is an illustration of a six-point rating scale for imaging-based identification of breast cancer. When using this scale, the reader of breast images is asked to specify the interpretation in terms of one of six finding categories: definitely cancer, probably cancer, possibly cancer, possibly benign, probably benign, or definitely benign. One then tabulates the ratings by using the two-by-six table shown in Figure 2.

The rating scale and the table produced by using it provide multiple opportunities to measure sensitivity and specificity. For example, we can assume that the examination is positive only when "definitely cancer" is selected and is negative otherwise. Next, we can assume that the probably cancer and definitely cancer ratings both represent positive examination results and that the remaining ratings represent negative results. When the two-by-six table is collapsed in this manner, a new two-by-two table is formed comprising higher sensitivity and lower specificity values than the first two-by-two table (because less strict imaging criteria were used and more image findings were rated as positive).

We can repeat this process five times, concluding with a two-by-two table such as that shown in Figure 2 (bottom table), in which only the definitely benign ratings are considered to represent negative results and the remaining ratings are considered to represent positive results. This approach would result in low specificity and high sensitivity. Figure 3 shows a plot of all five sensitivity-specificity pairs that can be derived from the two-by-six table data in Figure 2 and the ROC curve defined by these points.
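The collapsing procedure just described is mechanical, which makes it easy to express in code. The following Python sketch uses hypothetical rating counts (the actual counts in Figure 2 are not reproduced in the text, so these numbers are invented for illustration) to show how each cutoff on a six-category scale yields one sensitivity-specificity pair:

    # Hypothetical rating counts, ordered from "definitely cancer" (index 0)
    # to "definitely benign" (index 5).
    diseased = [40, 25, 15, 10, 6, 4]  # 100 patients with cancer (invented)
    healthy = [3, 7, 10, 15, 25, 40]   # 100 patients without cancer (invented)

    n_pos, n_neg = sum(diseased), sum(healthy)

    # Cutoff k treats ratings 0..k as positive; five cutoffs give five ROC points.
    for k in range(5):
        tp = sum(diseased[:k + 1])
        fp = sum(healthy[:k + 1])
        sens = tp / n_pos      # true-positive fraction
        spec = 1 - fp / n_neg  # specificity; (1 - spec) is the false-positive fraction
        print(f"cutoff {k + 1}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")

Plotting the resulting (1 − specificity, sensitivity) pairs, together with the default 0,0 and 1,1 points, traces the ROC curve described in Figure 3.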

Clinical Limitations

Several methods to quantitatively assess an examination on the basis of the ROC curve that it yields have been designed. The most popular method is to measure the area under the ROC curve, or the Az. In general, the larger the area under the ROC curve, the better the diagnostic examination. Despite the advantages of these measurements of area under the ROC curve for research and analysis, they do not provide information that is useful to the clinician or patient in clinical decision making. Thus, the value of the area under the ROC curve has no intrinsic clinical meaning. Also, there is no cutoff above or below which one can be certain of the usefulness of a diagnostic examination.

Should a patient with an abnormal mammogram be satisfied if she is told that the area under the ROC curve for breast MR imaging is 0.83? Although the area under the ROC curve is helpful for comparing two examinations, it has limited usefulness in facilitating the clinical decisions of patients and referring physicians.

Figure 2. Sample two-by-six table showing the results of an ROC study of breast cancer identification in 200 patients. The two-by-two table at the bottom can be created by setting a cutoff between the ratings of definitely benign and probably benign. This cutoff corresponds to sensitivity #1 and specificity #1 in Figure 1. Sensitivity and specificity values are calculated by using the two-by-two table data. As expected, use of the more flexible criteria leads to high sensitivity but low specificity.

Figure 3. Sample ROC curve. This curve is a plot of sensitivity versus (1 − specificity). The 0,0 point and 1,1 point are included by default to represent the situation in which all images are considered to be either negative or positive, respectively. FPF = false-positive fraction, TPF = true-positive fraction.


PREDICTIVE VALUES

Measuring Postexamination Likelihood of Disease

Because sensitivity, specificity, and ROC curves provide an incomplete picture of the clinical usefulness of an imaging examination, I will now shift back to the original two-by-two table for breast MR imaging (Table 2) to examine two additional measurements that have much greater clinical relevance and intuitive appeal: positive and negative predictive values. These quantities emphasize an important principle of diagnostic examinations—principle 6: A diagnostic examination causes a change in our belief about the likelihood that disease is truly present.

For example, the data in two-by-two Table 2 indicate that the probability of cancer in the women with suspicious mammograms was 41% (74 of 182 women). Thus, simply on the basis of her referral for excisional biopsy (without knowing the specific mammographic appearance of the lesion), the chance of cancer in the woman in the hypothetical scenario is about 41%. How can MR imaging help modify this likelihood to benefit the patient? The predictive values can help answer this question.

Principle 7: The positive predictive value indicates the likelihood of disease given a positive examination. Positive predictive value is defined as the probability of disease in a patient whose examination result is abnormal: TP/(TP + FP). Thus, we can compute the positive predictive value by using only the numbers in the first row ("Positive") of two-by-two Table 2. A total of 99 patients had positive MR imaging results. Seventy-one of these patients actually had cancer. Thus, the positive predictive value of breast MR imaging is 72%—in other words, 71 of the 99 women with positive MR imaging results had cancer. These values are sometimes referred to as "posttest likelihoods" or "posttest probabilities of disease," because the predictive value simply reflects the probability of disease after the examination result is known. The positive predictive value tells us, as expected, that a positive breast MR imaging result increases the probability of disease from 41% to 72%.

Principle 8: The negative predictive value indicates the likelihood of no disease given a negative examination. The negative predictive value is the negative analog of the positive predictive value. The negative predictive value can be defined as the probability that disease is absent in a patient whose examination result is negative: TN/(FN + TN). Thus, the negative predictive value can be computed solely by using the values in the second row ("Negative") of two-by-two Table 2. A total of 83 patients in the sample had negative MR imaging results. Eighty of these patients actually had benign lesions; there were three false-negative results. Thus, the negative predictive value of breast MR imaging was 96%—in other words, 80 of the 83 women with negative MR imaging results did not have cancer. (Note: It is coincidence that the sensitivity and negative predictive value are approximately equivalent in this case.) The probability of disease after a negative examination is simply 100% minus the negative predictive value, or 4% in this case. This computation tells us—as expected—that a negative examination decreases the probability of disease in this case from 41% to 4%.
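Again, the arithmetic follows directly from the rows of the two-by-two table. A minimal Python sketch, using the Table 2 counts (variable names are mine):

    # Predictive values from the rows of Table 2.
    TP, FP, FN, TN = 71, 28, 3, 80

    ppv = TP / (TP + FP)  # principle 7: probability of cancer given a positive result
    npv = TN / (FN + TN)  # principle 8: probability of no cancer given a negative result

    print(f"positive predictive value = {ppv:.0%}")  # 72%
    print(f"negative predictive value = {npv:.0%}")  # 96%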

The clinical usefulness of the predictive values is best illustrated by the first question that the patient in our hypothetical scenario might ask after she has undergone MR imaging: "Do I have cancer?," which in the uncertain world of medicine, can be translated as "How likely is it that I have cancer?" The predictive values, in contrast to sensitivity and specificity, answer this question and therefore are helpful in the clinical decision-making process. Knowing that a negative MR imaging result decreases the chance of cancer to 4% raises several additional questions for the clinician and patient to consider: Is a 4% likelihood low enough that excisional biopsy could be deferred during a short period of follow-up? Is it worth it to trade the potential harm of tumor progression during short-interval follow-up for the potential benefit of not undergoing excisional biopsy? These trade-offs can be considered explicitly by using decision analysis (13) but are routinely considered implicitly by referring physicians, patients, and other medical decision makers.

Limitations

Although predictive values have substantial clinical usefulness, a discussion of their weaknesses is warranted. The most important weakness is the dependence of predictive values on the preexamination probability, or the prevalence of disease in the imaged population. As emphasized earlier, a diagnostic examination causes a change in our belief about the likelihood of disease. And, as expected, the higher the preexamination probability of disease, the higher the postexamination probability of disease. Thus, predictive values are directly dependent on the population in which the given examination is performed.

Consider the following example, which illustrates this dependence: Since breast MR imaging may depict some cancers that were missed with mammography, why not use MR imaging as a screening tool to detect cancer in low-risk asymptomatic women? This approach has some intuitive appeal since breast MR imaging is a highly sensitive examination and might depict a greater number of cancers than would be detected with screening mammography. To illustrate the implications of this approach, a simulation of what would occur if a group of low-risk asymptomatic women were screened with MR imaging is provided (Fig 4). To approximate the prevalence of cancer in a low-risk screening population and to simplify the calculations, I will use 0.25% as the prevalence of breast cancer. For simplicity, I will consider a screening population of 20,050 women, 50 of whom have occult cancer.

Figure 4. Flowchart depicts a simulated population of 20,050 low-risk asymptomatic women who might be screened with breast MR imaging. The assumed prevalence of cancer in this screening population of 50 women with and 20,000 women without cancer is 0.25%. MRI+ and MRI− = positive and negative MR imaging results, respectively.

Since we have established that MR imaging is 96% sensitive, 48 of the 50 women who actually have cancer will have positive MR imaging examinations (50 × 0.96 = 48). The other two women will have false-negative MR imaging examinations and undetected cancer, just as if they had never undergone screening MR imaging. With the 74% specificity for breast MR imaging computed earlier, 14,800 women will have normal MR imaging examinations (20,000 × 0.74 = 14,800). The remaining 5,200 women will have false-positive examinations. Table 3 is a two-by-two table containing these data.

Because the true disease status of these women will be unknown at the time of examination, clinical inferences must be drawn from the examination results and predictive values. According to principle 8, the negative predictive value indicates the clinical implications of a negative examination. Since there are 14,802 women with negative examinations, only two of whom have cancer, the posttest likelihood of cancer after a negative examination is 0.01% (two of 14,802 women). This value represents a decrease from 0.25% and has no real clinical importance in a screening population; however, it does have some potential reassurance value.

There are 5,248 women with positive examinations in the simulation, and 48 of them actually have cancer, so the likelihood of cancer is approximately 1.00% (48 of 5,248 women). Thus, a positive MR imaging examination increases the likelihood of cancer from 0.25% to 1.00%. This group of women represents a clinical problem, however: Are we willing to perform 100 biopsies to find one cancer? Probably not. Should these women be followed up with a special more intensive regimen? Perhaps, but there are lingering questions regarding the cost-effectiveness of this program, which would likely cost tens of millions of dollars and lead to a substantial increase in the number of negative excisional biopsies.

This example of screening MR imaging illustrates clearly that the predictive values for breast MR imaging are vastly different in a screening population with a much smaller prevalence of disease. Likewise, the values of clinical usefulness of breast MR imaging as a screening examination are in stark contrast to the analogous measures of MR imaging performed in women with mammographically suspicious lesions. This contrast illustrates a weakness of predictive values: They vary according to the population in which the given examination is performed. Although the predictive values for breast MR imaging performed in women with suspicious mammograms are appealing, the predictive values for this examination performed in a screening population suggest that it has little value for asymptomatic women with low breast cancer risk. Another realistic clinical example of this phenomenon is described elsewhere (14).

Despite these limitations, predictive values can be determined mathematically from sensitivity, specificity, and prevalence data. Because sensitivity and specificity values are often published, a clinician can compute the predictive values for a particular population of interest by using the prevalence of disease in that population and the sensitivity and specificity values provided in the literature.
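One way to carry out that computation is sketched below in Python (the function name is mine): it derives predictive values from sensitivity, specificity, and prevalence by using Bayes' theorem, and it reproduces both the biopsy-population values and the screening simulation of Figure 4:

    def predictive_values(sens, spec, prev):
        """Derive PPV and NPV from sensitivity, specificity, and prevalence."""
        tp = sens * prev              # per-capita true-positive rate
        fp = (1 - spec) * (1 - prev)  # per-capita false-positive rate
        fn = (1 - sens) * prev
        tn = spec * (1 - prev)
        return tp / (tp + fp), tn / (fn + tn)

    # Biopsy population (prevalence 41%) vs screening population (0.25%).
    for prev in (0.41, 0.0025):
        ppv, npv = predictive_values(0.96, 0.74, prev)
        print(f"prevalence {prev:.2%}: PPV = {ppv:.2%}, NPV = {npv:.2%}")

With a prevalence of 41%, the positive predictive value is about 72%; with a prevalence of 0.25%, it falls to about 1%, matching the numbers in Table 3.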

LIKELIHOOD RATIO

Quantifying Changes in Disease Likelihood

Principle 9: Likelihood ratios enable calculation of the postexamination probability of disease from the preexamination probability of disease. A brief description of the likelihood ratio (15,16) is relevant here because this measurement is not affected by disease prevalence and can yield clinically useful information. The likelihood ratio is defined as the probability that a person with a disease will have a particular examination result divided by the probability that a person with no disease will have that same result. The positive likelihood ratio (LR+), sometimes expressed as λ, is defined as the likelihood, or probability, that a person with a disease will have a positive examination divided by the likelihood that a person with no disease will have a positive examination: LR+ = sensitivity/(1 − specificity). The negative likelihood ratio (LR−) is defined as the probability that a person with a disease will have a negative examination divided by the probability that a person without the disease will have a negative examination: LR− = (1 − sensitivity)/specificity.

To illustrate the use of likelihood ratios, consider the breast MR imaging example: The sensitivity of breast MR imaging is 96%, and the specificity is 74%. Therefore, the positive likelihood ratio is calculated by dividing the sensitivity by (1 − specificity). Thus, LR+ = 0.96/(1 − 0.74) = 3.7. The negative likelihood ratio is calculated by dividing (1 − sensitivity) by the specificity. Thus, LR− = (1 − 0.96)/0.74 = 0.055.

Once the likelihood ratios have been calculated, they can be used to calculate the postexamination probability of disease given the preexamination probability of disease, or P(D+|T+). For this calculation, one first must convert probabilities of disease to odds of disease. Odds of disease, Odds(D+), is defined as the probability that disease is present, p(D+), divided by the probability that disease is absent, p(D−): Odds(D+) = p(D+)/p(D−).

To compute the postexamination probability of disease given a positive examination result for any preexamination probability value and examination result, multiply the preexamination odds of disease, Odds(D+), by the positive likelihood ratio for the examination result, LR+, to obtain the postexamination odds, Odds(D+|T+). In other words, the postexamination odds of having a given disease given a positive examination result is equal to the positive likelihood ratio multiplied by the preexamination odds of the disease: Odds(D+|T+) = LR+ × Odds(D+).

Finally, to determine the postexamination probability of disease, P(D+|T+), convert the postexamination odds back to a postexamination probability value. The postexamination probability of disease given the examination result can be computed from the postexamination odds as follows:

P(D+|T+) = Odds(D+|T+)/[1 + Odds(D+|T+)].

TABLE 3
Patient Data in a Hypothetical Group of Women Undergoing Screening Breast MR Imaging

MR Imaging Result   Malignant   Benign   Total
Positive            48          5,200    5,248
Negative            2           14,800   14,802
Total               50          20,000   20,050

Note.—Data are numbers of women with malignant or benign breast tumors.


P(D"!T") !Odds&D"!T")

1"Odds(D"!T")

With these three formulas, the likelihood ratio can be used to determine the postexamination probability of disease (or predictive value) from any preexamination probability of disease.
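A minimal Python sketch of this three-step calculation (function and variable names are mine), applied to the breast MR imaging example:

    def post_probability(pre_prob, lr):
        pre_odds = pre_prob / (1 - pre_prob)  # probability -> odds
        post_odds = lr * pre_odds             # Odds(D+|T) = LR x Odds(D+)
        return post_odds / (1 + post_odds)    # odds -> probability

    lr_pos = 0.96 / (1 - 0.74)  # positive likelihood ratio, about 3.7
    lr_neg = (1 - 0.96) / 0.74  # negative likelihood ratio, about 0.054

    print(f"{post_probability(0.41, lr_pos):.0%}")  # about 72% after a positive result
    print(f"{post_probability(0.41, lr_neg):.0%}")  # about 4% after a negative result

As expected, the results match the predictive values computed directly from Table 2.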

Limitations of the Likelihood Ratio

The likelihood ratio has several properties that limit its usefulness in describing diagnostic examinations. First, it functions only on the basis of the odds of disease rather than the more intuitive probability of disease. Accordingly, the likelihood ratio is best considered on a logarithmic scale: Likelihood ratios of less than 1.0 indicate that the examination result will decrease disease likelihood, and ratios of greater than 1.0 indicate that the examination result will increase disease likelihood. To many, it is not obvious that a likelihood ratio of 4.0 increases the likelihood of disease to the same degree that a likelihood ratio of 0.25 decreases the likelihood. Furthermore, it is sometimes counterintuitive that the same likelihood ratio causes different absolute changes in probability, depending on the preexamination probability. Despite these weaknesses, the likelihood ratio is probably underused in the radiology literature today as a measure of examination performance.

CONCLUSION

In this article, I describe nine principles (Appendix) that guide the evaluation and use of diagnostic imaging examinations in clinical practice. These principles are helpful when choosing measures to describe the capabilities of a diagnostic examination. As I have discussed, sensitivity and specificity are relatively invariant descriptors of examination accuracy. However, they have limited clinical usefulness and often cannot be used directly to compare two diagnostic examinations. ROC curves can be used to directly compare examinations independently of reader temperament or varying image interpretation criteria, but they yield little information that is useful to the clinical decision maker. Positive and negative predictive values yield useful information for clinical decision makers by facilitating explicit consideration of the trade-offs at hand, but they are intrinsically dependent on the preexamination likelihood of disease and therefore on the population of patients in whom the given examination is performed. Finally, the likelihood ratio can be used to calculate the postexamination likelihood of disease from the preexamination likelihood of disease, but the associated use of odds and the logarithmic scale are counterintuitive for some. An understanding of the described fundamental measures of examination performance and how they are clinically useful is vital to the appropriate evaluation and use of diagnostic imaging examinations.

APPENDIX

Nine principles that are helpful when choosing measures to describe the capabilities of a diagnostic examination:

1. Sensitivity is a measure of how a diagnostic examination performs in a population of patients who have the disease in question.

2. Specificity is a measure of how a diagnostic examination performs in a population of patients who do not have the disease in question (ie, healthy subjects).

3. A sensitive examination is more valuable in situations where false-negative results are more undesirable than false-positive results. A specific examination is more valuable in situations where false-positive results are more undesirable than false-negative results.

4. The sensitivity and specificity of a diagnostic examination are related to one another.

5. ROC curves provide a method to compare diagnostic examination accuracy independently of the diagnostic criteria (ie, strict or flexible) used.

6. A diagnostic examination causes a change in our belief about the likelihood that disease is truly present.

7. The positive predictive value indicates the likelihood of disease given a positive examination.

8. The negative predictive value indicates the likelihood of no disease given a negative examination.

9. Likelihood ratios enable calculation of the postexamination probability of disease from the preexamination probability of disease.

Acknowledgments: My special thanks to Harold Kundel, MD, Kimberly Applegate, MD, MS, and Philip Crewson, PhD, for their thoughtful comments on earlier versions of this manuscript.

References
1. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88–94.
2. Brismar J, Jacobsson B. Definition of terms used to judge the efficacy of diagnostic tests: a graphic approach. AJR Am J Roentgenol 1990; 155:621–623.
3. Black WC. How to evaluate the radiology literature. AJR Am J Roentgenol 1990; 154:17–22.
4. McNeil BJ, Keeler E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975; 293:211–215.
5. Burton E, Troxclair D, Newman W. Autopsy diagnoses of malignant neoplasms: how often are clinical diagnoses incorrect? JAMA 1998; 280:1245–1248.
6. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.
7. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720–733.
8. Brismar J. Understanding receiver-operating-characteristic curves: a graphic approach. AJR Am J Roentgenol 1991; 157:1119–1121.
9. Nunes LW, Schnall MD, Siegelman E, et al. Diagnostic performance characteristics of architectural features revealed by high spatial-resolution MR imaging of the breast. AJR Am J Roentgenol 1997; 169:409–415.
10. Pauker S, Kassirer J. The threshold approach to clinical decision making. N Engl J Med 1980; 302:1109–1117.
11. Ransohoff D, Feinstein A. Problems of spectrum bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299:926–930.
12. Hillman BJ. Outcomes research and cost-effectiveness analysis for diagnostic imaging. Radiology 1994; 193:307–310.
13. Hrung J, Langlotz C, Orel S, Fox K, Schnall M, Schwartz J. Cost-effectiveness of magnetic resonance imaging and needle core biopsy in the preoperative workup of suspicious breast lesions. Radiology 1999; 213:39–49.
14. Filly RA. The "lemon" sign: a clinical perspective. Radiology 1988; 167:573–575.
15. Thornbury JR, Fryback DG, Edwards W. Likelihood ratios as a measure of the diagnostic usefulness of excretory urogram information. Radiology 1975; 114:561–565.
16. Black WC, Armstrong P. Communicating the significance of radiologic test results: the likelihood ratio. AJR Am J Roentgenol 1986; 147:1313–1318.


Harold L. Kundel, MD
Marcia Polansky, ScD

Index terms: Diagnostic radiology, observer performance; Statistical analysis

Published online before print 10.1148/radiol.2282011860

Radiology 2003; 228:303–308

1 From the Department of Radiology (H.L.K.) and MCP Hahnemann School of Public Health (M.P.), University of Pennsylvania Medical Center, 3600 Market St, Suite 370, Philadelphia, PA 19104. Received November 21, 2001; revision requested January 29, 2002; revision received March 4; accepted March 11. Supported by grant P01 CA 53141 from the National Cancer Institute, National Institutes of Health, U.S. Public Health Service, Department of Health and Human Services. Address correspondence to H.L.K. (e-mail: [email protected]). © RSNA, 2003

Measurement of Observer Agreement1

Statistical measures are described that are used in diagnostic imaging for expressing observer agreement in regard to categorical data. The measures are used to characterize the reliability of imaging methods and the reproducibility of disease classifications and, occasionally with great care, as the surrogate for accuracy. The review concentrates on the chance-corrected indices, κ and weighted κ. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Other measures of agreement that are used less frequently, including multiple-rater κ, are referenced and described briefly.
© RSNA, 2003

The statistical analysis of observer agreement in imaging is generally performed for three reasons. First, observer agreement provides information about the reliability of imaging diagnosis. A reliable method should produce good agreement when used by knowledgeable observers. Second, observer agreement can be used to check the consistency of a method for classification of an abnormality that indicates the extent or severity of disease (1) and to determine the reliability of various signs of disease (2). It can also be used to compare the performance of humans and computers (3). Third, observer agreement can provide a general estimate of the value of an imaging technique when the lack of an independent method of proving the diagnosis precludes the measurement of sensitivity and specificity or the more general receiver operating characteristic curve. In many clinical situations, imaging provides the best evidence of abnormality. Furthermore, even if an independent method for obtaining proof exists, it may be difficult to use. For every suspected lesion, a biopsy cannot be performed to obtain a specific tissue diagnosis. As we will demonstrate, currently popular measures of agreement do not necessarily reflect accuracy. However, there are statistical techniques for use of the agreement of multiple expert readers (4) or the agreement of multiple tests (5) to estimate the underlying accuracy of the test.

We illustrate the standard methods for description of agreement in regard to categorical data and point out the advantages and disadvantages of the use of these methods. We refer to some of the less common, although not less important, methods but do not describe them. Then we describe some current developments in methods for use of agreement to estimate accuracy. The discussion is limited to data that can be assigned to categories, such as positive or negative; high, medium, or low; or class I–V. Data, such as lesion volume or heart size, that are collected on a continuous scale are more appropriately analyzed with methods of correlation.

MEASUREMENT OF AGREEMENT OF TWO READERS

Consider readings of the same 150 images that are reported as either positive or negative by two readers. The results are shown in Table 1 as joint agreement in a 2 × 2 format, with the responses of each reader as marginal totals. Three general indices of agreement can be derived from Table 1. The overall proportion of agreement, which we will call po, is calculated as follows:

po = (7 + 121)/150 = 0.85.

The proportion is useful for calculations, but the result is usually expressed as a percentage. A po of 0.85 indicates that the two readers agree in regard to 85% of their interpretations. If the number of negative readings is large relative to the number of positive readings, the agreement in regard to negative readings will dominate the value of po and may give a false impression of performance. For example, suppose that 90% of the cases are actually negative, and two readers agree about all of the negative interpretations but disagree about the positive interpretations. The overall agreement will be at least 90% and may be greater depending on the number of positive interpretations on which they agree. As an alternative to the overall agreement, the positive and negative agreement can be estimated separately. This will give an indication of the type of decision on which readers disagree. The positive agreement, which we will call ppos, is the number of positive readings that both readers agree on divided by all of the positive readings for both readers. For the data in Table 1, the positive agreement is calculated with the following equation:

ppos = (7 + 7)/[(10 + 7) + (12 + 7)] = 0.39.

The negative agreement, which we will call pneg, can be calculated in a similar way as follows:

pneg = (121 + 121)/[(10 + 121) + (12 + 121)] = 0.92.

In the example given in Table 1, although the two readers agree 85% of the time overall, they only agree on positive interpretations 39% of the time, whereas they agree on negative interpretations 92% of the time. The advantage of calculation of ppos and pneg is that any imbalance in the proportion of positive and negative responses becomes apparent, as in the example. The disadvantage is that CIs cannot be calculated.
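For readers who wish to verify these indices, a minimal Python sketch for the Table 1 data (the cell labels a–d are mine):

    # Table 1 cell counts: rows = first reader, columns = second reader.
    a, b = 7, 10    # a: both positive; b: first positive, second negative
    c, d = 12, 121  # c: first negative, second positive; d: both negative
    n = a + b + c + d

    po = (a + d) / n                # overall agreement: 0.85
    ppos = 2 * a / (2 * a + b + c)  # positive agreement: 0.39
    pneg = 2 * d / (2 * d + b + c)  # negative agreement: 0.92
    print(round(po, 2), round(ppos, 2), round(pneg, 2))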

COHEN κ

Some of the observer agreement concerning findings of imaging tests can be caused by chance. For example, chance agreement occurs when the readers know in advance that most of the cases are negative and they adopt a reading strategy of reporting a case as negative whenever they are in doubt. Both will have a large percentage of negative agreements because of prior knowledge of the prevalence of negative cases, not because of information obtained from viewing of the images. An index called κ has been developed as a measure of agreement that is corrected for chance. The κ is calculated by subtracting the proportion of the readings that are expected to agree by chance, which we will call pe, from the overall agreement, po, and dividing the remainder by the number of cases on which agreement is not expected to occur by chance. This is demonstrated in Equation (1) as follows:

κ = (po − pe)/(1 − pe). (1)

Another way to view κ is as follows: if the readers read different images and the readings were paired, some agreement, namely po, would be observed. That observed agreement would occur purely by chance. The agreement that is expected to occur by chance, which we shall designate pe, can be calculated. When the readings of different images are compared, the observed value, po, should equal the expected value, pe, because there is no agreement beyond chance, and κ is zero.

The joint agreement that is expected because of chance is calculated for each combination with multiplication of the total responses of each reader contained in the marginal totals of the data table. From Table 1, the agreement expected by chance for the joint positive and joint negative responses is calculated with the following equation:

pe = (17/150 × 19/150) + (133/150 × 131/150) = 0.79.

The value for κ is 0.31, as is calculated with this equation:

κ = (0.85 − 0.79)/(1 − 0.79) = 0.31.

(The value 0.31 results when unrounded values of po and pe are used; with the rounded values shown, the quotient is 0.29.)

The standard error, which we will call SE, of κ for a 2 × 2 table can be estimated with the following equation:

SE(κ) = √[po(1 − po)/(n(1 − pe)²)],

SE(κ) = √[0.85(1 − 0.85)/(150(1 − 0.79)²)] = 0.14. (2)

A more accurate and more complicated equation for the standard error of κ can be found in most books about statistics (6,7).

The 95% CIs of κ can be calculated as follows:

CI95% = κ ± 1.96 × SE(κ). (3)

For example, the 95% CIs are 0.31 − 1.96 × 0.14 = 0.04 and 0.31 + 1.96 × 0.14 = 0.58.
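Equations (1)–(3) combine into a short script. A minimal Python sketch for the Table 1 data (variable names are mine):

    from math import sqrt

    a, b, c, d = 7, 10, 12, 121  # Table 1 cell counts
    n = a + b + c + d

    po = (a + d) / n
    # Chance agreement pe from the marginal totals of Table 1.
    pe = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)

    kappa = (po - pe) / (1 - pe)                    # Equation (1): about 0.31
    se = sqrt(po * (1 - po) / (n * (1 - pe) ** 2))  # Equation (2): about 0.14
    ci = (kappa - 1.96 * se, kappa + 1.96 * se)     # Equation (3): about (0.04, 0.58)
    print(round(kappa, 2), round(se, 2), [round(x, 2) for x in ci])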

Thus, what is the meaning of a κ of 0.31, together with an overall agreement of 0.85? The calculated value of κ can range from −1.00 to +1.00, but for practical purposes the range from zero to +1.00 is of interest. A κ of zero means that there is no agreement beyond chance, and a κ of 1.00 means that there is perfect agreement.

TABLE 1
Joint Judgment of Two Readers about Same 150 Images

                       Second Reader
First Reader           Positive for Disease   Negative for Disease   Total
Positive for disease   7                      10                     17
Negative for disease   12                     121                    133
Total                  19                     131                    150

TABLE 2
Guidelines for Strength of Agreement Indicated with κ Values

κ Value     Strength of Agreement beyond Chance
<0          Poor
0–0.20      Slight
0.21–0.40   Fair
0.41–0.60   Moderate
0.61–0.80   Substantial
0.81–1.00   Almost perfect

Note.—Data are from Landis and Koch (8).

TABLE 3
Joint Judgment of Two Readers about Position of Tubes and Catheters on 100 Portable Chest Images

                       Second Reader
First Reader           Malpositioned   Correctly Positioned   Total
Malpositioned          3               3                      6
Correctly positioned   2               92                     94
Total                  5               95                     100

TABLE 4
Joint Judgment of Two Readers about Presence of Signs of Congestive Heart Failure on 100 Portable Chest Images

               Second Reader
First Reader   CHF   No CHF   Total
CHF            20    12       32
No CHF         8     60       68
Total          28    72       100

Note.—CHF = congestive heart failure.


Interpretations of intermediate values are subjective. Table 2 shows the strength of agreement beyond chance for various ranges of κ that were suggested by Landis and Koch (8). The choice of intervals is entirely arbitrary but has become ingrained with frequent usage. The values calculated from Table 1 show that there is good overall agreement (po = 0.85) but only fair chance-corrected agreement (κ = 0.31). This paradoxical result is caused by the high prevalence of negative cases. Prevalence effects can lead to situations in which the values of κ do not correspond with intuition (9,10). This is illustrated with the data in Tables 3 and 4 that were extrapolated, with a bit of adjustment to make the numbers come out even, from a data set collected during a study of readings in regard to portable chest images obtained in a medical intensive care unit (11). Table 3 shows the agreement of the reports of two of the readers concerning the position of tubes and catheters. An incorrectly positioned tube or catheter was defined as a positive reading. Table 4 shows the agreement in regard to the reports of the same two readers about the presence of radiographic signs of congestive heart failure. The example was chosen because the actual values of κ for the two diagnoses were very close.

The agreement indices for the two types of readings are shown in Table 5. The overall agreement (95%) for the position of tubes and catheters is very high, but so is the agreement according to chance (90%) calculated from the marginal values in Table 3. This results in a low κ of 0.52, which happens to be the same κ as that for congestive heart failure. The result is not intuitively appealing, because a relatively simple decision such as that about the location of a catheter tip should have a higher index of agreement than a more difficult decision such as that concerning a diagnosis of congestive heart failure. Feinstein and Cicchetti (9) have pointed out the paradox of high overall agreement and low κ, and Cicchetti and Feinstein (10) suggest that when investigators report the results of studies of agreement they should include the three indices of κ, positive agreement, and negative agreement. We agree that this is a useful way of showing agreement data, because it provides more details about where disagreements occur and alerts the reader to the possibility of effects caused by prevalence or prior knowledge.

WEIGHTED κ FOR MULTIPLE CATEGORIES

The κ can be calculated for two readers who report results with multiple categories. As the number of categories increases, the value of κ decreases because there is more room for disagreement with more categories. However, when findings are reported by using a ranked variable, the relative importance of disagreement between categories may not be the same for adjacent categories as it is for distant categories. Two readers who consistently disagree about minimal and moderate categories would have the same value for κ calculated in the usual way as would two readers who consistently disagree about minimal and severe categories. A method for calculation of κ has been developed that allows for differences in the importance of disagreements. The usual approach is to assign weights between 1.00 and zero to each agreement pair, where 1.00 represents perfect agreement and zero represents no agreement. Assignment of weights can be very subjective and can confuse comparison of κ values between studies in which different weights were used. For theoretical reasons, Fleiss (7) suggests assignment of weights as follows:

wij = 1 − (i − j)²/(k − 1)², (4)

where w represents weight, i is the number of the row, j is the number of the column, and k is the total number of categories. The weighting is called quadratic because of the squared terms. An example of the method for calculation of weighted κ by using four categories is presented in the Appendix. In the example in the Appendix, the categories of absent, minimal, moderate, and severe are used. The weighted and unweighted values for po and κ are included in Table 6. The calculations were repeated by collapsing the data for four categories first into three and then into two categories: First, minimal and moderate categories were combined, and then minimal, moderate, and severe categories were combined, and these two combinations would be equivalent to normal and abnormal categories, respectively. Table 6 shows that the value of κ increases as the number of categories is decreased, thus indicating better agreement when the fine distinctions are eliminated. The weighted κ is greater than the unweighted κ when multiple categories are used and is the same as the unweighted κ when only two categories are used. Some investigators prefer to use multiple categories because they are a better reflection of actual clinical decisions, and if sensible weighting can be achieved, the weighted κ may reflect the actual agreement better than does the unweighted κ.
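The weighted computation is also compact in code. A minimal Python sketch (the function name is mine) that applies the quadratic weights of Equation (4) to the Table A1 data and reproduces the four-category weighted κ of Table 6:

    def weighted_kappa(table):
        """Quadratic-weighted kappa for a k x k table of two readers' ratings."""
        k = len(table)
        n = sum(sum(row) for row in table)
        row_m = [sum(row) / n for row in table]
        col_m = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
        po_w = pe_w = 0.0
        for i in range(k):
            for j in range(k):
                w = 1 - (i - j) ** 2 / (k - 1) ** 2  # quadratic weight, Equation (4)
                po_w += w * table[i][j] / n          # weighted observed agreement
                pe_w += w * row_m[i] * col_m[j]      # weighted chance agreement
        return (po_w - pe_w) / (1 - pe_w)

    # Table A1 counts: rows = reader 2, columns = reader 1 (absent...severe).
    table_a1 = [[34, 10, 2, 0], [6, 8, 8, 2], [2, 5, 4, 12], [0, 1, 2, 14]]
    print(round(weighted_kappa(table_a1), 2))  # 0.76, as in Table 6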

TABLE 5
Indices of Agreement for Readings of Two Radiologists Regarding Portable Chest Images for Position of Tubes and Catheters and Signs of Congestive Heart Failure

Agreement Index   Type of Agreement   Tubes and Catheters   Congestive Heart Failure
po                Overall             0.95                  0.80
ppos              Positive            0.54                  0.67
pneg              Negative            0.97                  0.86
pe                Chance              0.90                  0.57
κ                 Chance corrected    0.52                  0.52

TABLE 6
Comparison of Unweighted and Weighted po and κ Calculated by Using Four-, Three-, and Two-Response Categories

                  Unweighted       Quadratic Weighting
Categories        po      κ        po(w)    κ(w)
Four-response     0.55    0.37     0.93     0.76
Three-response    0.66    0.48     0.92     0.71
Two-response      0.82    0.62     0.82     0.62

Note.—Values were calculated for data from Table A1.


ESTIMATION OF κ FOR MULTIPLE READERS

When multiple readers are used, some authors calculate the values of κ for pairs of readers and then compute an average κ for all possible pairs (12–14). Fleiss (7) describes a method for calculation of a κ index for multiple readers. It has not been used very much in diagnostic imaging, although it has been reported in some studies along with values for weighted κ (15).

ADVANTAGES AND DISADVANTAGES OF THE κ INDEX

The κ index has the advantage that it is corrected for chance agreement, and there is an accepted method for computing confidence limits and for statistical testing. The main disadvantage of κ is that the scale is not free of dependence on disease prevalence or the number of rating categories. As a consequence, it is difficult to interpret the meaning of any absolute value of κ, although it is still useful in experiments in which a control for prevalence and for the number of categories is used. The prevalence bias makes it difficult to compare the results of clinical studies where disease prevalence may vary; for example, this may occur in studies about the screening and diagnosis of breast cancer. The disease prevalence should always be reported when κ is used to prevent misunderstanding when one is trying to make generalizations.

RELATIONSHIP BETWEEN AGREEMENT AND ACCURACY

High accuracy implies high agreement, but high agreement does not necessarily imply high accuracy. There is no direct way to infer the accuracy in regard to an image-reading task from reader agreement. Accuracy can only be implied from agreement, with the assumption that when readers agree they must be correct. We frequently make this assumption by seeking a consensus diagnosis or by obtaining a second opinion, but it is not always correct. The κ has been shown to be inconsistent with accuracy as measured by the area under the receiver operating characteristic curve (16) and should not be used as a surrogate for accuracy. Different areas under the receiver operating characteristic curve can have the same κ, and the same areas under the receiver operating characteristic curve can have different κ values. For example, Taplin et al (14) studied the accuracy and agreement of single- and double-reading screening mammograms by using the area under the receiver operating characteristic curve and κ. The study included 31 radiologists who read 120 mammograms. The mean area under the receiver operating characteristic curve for single-reading mammograms was 0.85, and that for double-reading mammograms was 0.87. However, the average unweighted κ for patients with cancer was 0.41 for single-reading mammograms and 0.71 for double-reading mammograms. The average unweighted κ for patients without cancer was 0.26 for single-reading mammograms and 0.34 for double-reading mammograms. Double reading of mammograms resulted in better agreement but not in better accuracy.

If we assume that agreement implies accuracy, then we can use measurements of observed agreement to set a lower limit for accuracy. Suppose two readers agree with respect to interpretation in 50% of the cases; then, by implication, they are both correct in the 50% of cases about which they agree, and one of them is correct in half (25% of the total) of the cases about which they disagree. Therefore, the overall accuracy of the readings is 75%. Typically, in radiology, observed between-reader agreement is 70%–80%, implying an accuracy of 85%–90% (ie, 70% + 30%/2 to 80% + 20%/2).
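In formula form, an observed agreement $p_o$ therefore implies a lower bound on accuracy of

$$\text{accuracy} \geq p_o + \frac{1 - p_o}{2} = \frac{1 + p_o}{2},$$

which gives 0.85 for $p_o = 0.70$ and 0.90 for $p_o = 0.80$, as in the figures quoted above.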

Some new approaches to estimation of accuracy from agreement have been proposed. These approaches are based on the assumption that when a majority of readers agree about a diagnosis they are likely to be right (4,17). We have proposed the use of a technique called mixture distribution analysis (4,18). At least five readers report the cases by using either a yes-no response or a rating scale. The agreement of the group of readers about each case is fit to a mathematic model, with the assumption that the sample was drawn from a population that consists of easy normal, easy abnormal, and hard cases. With the computer program, the population that best fits the sample is located, and an overall measure of performance that we call the relative percentage agreement is calculated. We have found that the relative percentage agreement has values similar to those obtained by using receiver operating characteristic curve analysis with proved cases (18,19).

TABLE A1: Frequency of Responses of Two Readers Who Rated a Disease as Absent, Minimal, Moderate, or Severe

                          Reader 2
Reader 1      Absent   Minimal   Moderate   Severe   Total
Absent          34       10         2          0       46
Minimal          6        8         8          2       24
Moderate         2        5         4         12       23
Severe           0        1         2         14       17
Total           42       24        16         28      110

Note.—The frequencies in Table A1 are converted into proportions in Table A2 by dividing by the total number of cases.

TABLE A2: Proportion of Responses of Two Readers Who Rated a Disease as Absent, Minimal, Moderate, or Severe

                          Reader 2
Reader 1      Absent   Minimal   Moderate   Severe   Total
Absent         0.31      0.09      0.02      0        0.42
Minimal        0.05      0.07      0.07      0.02     0.22
Moderate       0.02      0.05      0.04      0.11     0.21
Severe         0         0.01      0.02      0.13     0.15
Total          0.38      0.22      0.15      0.25*    1.00

* Value was rounded.



CONCLUSION

Formal evaluations of imaging technology by using reader agreement started in 1947 with the publication of an article about tuberculosis case finding by using four different chest imaging systems (20). The author of an editorial that accompanied the article expressed surprise that there was so much disagreement (21). History repeated itself when an article about agreement in screening mammography that showed considerable reader variability (22) was published; this article was accompanied by an editorial in which the author expressed surprise in regard to the extent of disagreement (23). The consensus of a group of physicians is frequently the only basis for determination of a difficult diagnostic decision. Studies of pathologists who classify cancer have shown levels of disagreement similar to those associated with hard decisions in radiology (24). Agreement usually results from informal discussion; however, the method used to obtain agreement can have a large influence on the decision outcome (25). Formal procedures that are used to achieve agreement have been proposed (26); although they can minimize individual bias in achieving a consensus, they are rarely used. We hope that this brief review will stimulate greater use of existing statistics for characterization of agreement and further exploration of new methods.

APPENDIX

Consider a data set in Table A1 that consists of four categories. The frequencies in Table A1 are converted into proportions, which are included in Table A2, by dividing the data by the total number of cases.

Table A3 shows the quadratic weights calculated by using Equation (4), as presented earlier:

$$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2},$$

where w represents weight, i is the number of the row, j is the number of the column, and k is the total number of categories. It is assumed that disagreement between adjacent categories (the weight for absent versus minimal is 0.89) is not as important as disagreement between distant categories (the weight for absent versus severe is zero).

The weighted observed agreement is calculated by multiplying the proportion of responses in each cell of the 4 × 4 table by the corresponding weighting factor. The calculations for the first row are as follows: 0.31 × 1.00 = 0.31, 0.09 × 0.89 = 0.08, 0.02 × 0.56 = 0.01, and 0 × 0 = 0.

The results for observed weighted proportions are presented in Table A4. The expected agreement is calculated by multiplying the row and column total for each cell of the 4 × 4 table by the corresponding weighting factor. The calculations for the first row are as follows: (0.42 × 0.38) × 1.00 = 0.16, (0.42 × 0.22) × 0.89 = 0.08, (0.42 × 0.15) × 0.56 = 0.03, and (0.42 × 0.25) × 0 = 0.

The results for expected weighted proportions are presented in Table A4. The sum of all of the cells in regard to observed weighted proportions (sum, 0.93) in Table A4 is the weighted observed agreement, which we call po(w), and the sum of all of the cells in regard to expected weighted proportions (sum, 0.70) in Table A4 is the weighted expected agreement, which we call pe(w). When we apply the equation for κ to the weighted values, we get a weighted κ index of 0.76, which is calculated with the following equation:

$$\kappa(w) = \frac{p_o(w) - p_e(w)}{1 - p_e(w)}.$$

An unweighted κ can be calculated by using the sum of the diagonal cells in Table A2, or 0.31 + 0.07 + 0.04 + 0.13 = 0.55, to calculate the observed agreement and the sum of the diagonal cells in Table A4 with regard to expected weighted proportions, or 0.16 + 0.05 + 0.03 + 0.04 = 0.28, to calculate the expected agreement. The unweighted κ is 0.37.
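The entire appendix calculation can be reproduced in a few lines of Python with NumPy; this is a minimal sketch (the array holds the Table A1 frequencies, and the variable names are ours):

```python
import numpy as np

# Frequencies from Table A1 (rows: Reader 1; columns: Reader 2).
counts = np.array([[34, 10, 2, 0],
                   [6, 8, 8, 2],
                   [2, 5, 4, 12],
                   [0, 1, 2, 14]], dtype=float)

p = counts / counts.sum()              # Table A2: proportions
row = p.sum(axis=1)                    # marginal sums for Reader 1
col = p.sum(axis=0)                    # marginal sums for Reader 2

k = p.shape[0]
i, j = np.indices((k, k))
w = 1 - (i - j) ** 2 / (k - 1) ** 2    # quadratic weights, Equation (4) / Table A3

po_w = (w * p).sum()                   # weighted observed agreement (about 0.93)
pe_w = (w * np.outer(row, col)).sum()  # weighted expected agreement (about 0.70)
kappa_w = (po_w - pe_w) / (1 - pe_w)   # about 0.76

po = np.trace(p)                       # unweighted observed agreement (about 0.55)
pe = (row * col).sum()                 # unweighted expected agreement (about 0.28)
kappa = (po - pe) / (1 - pe)           # about 0.37

print(f"weighted kappa = {kappa_w:.2f}, unweighted kappa = {kappa:.2f}")
```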

The calculation of the appropriate standard error and the use of the standard error for testing either the hypothesis that κ is different from zero or that κ is different from a value other than zero is beyond the scope of this article but is in most basic statistical texts (6,7).

GLOSSARY

Below is a list of common terms and definitions related to the measurement of observer agreement.

Accuracy.—This value is the likelihood of the interpretation being correct when compared with an independent standard.

Agreement.—This term represents the likelihood that one reader will indicate the same responses as another reader.

Attributes.—An attribute is a categorical variable that represents a property of the object being imaged (eg, tumor descriptors such as mass, calcification, and architectural distortion).

Categorical variables.—Categorical variables are variables that can be assigned to specific categories. Categorical variables can be either ranked variables or attributes.

κ.—The κ value is an overall measure of agreement that is corrected for agreement by chance. It is sensitive to disease prevalence.

Marginal sums.—A marginal sum is the sum of the responses in a single row or column of the data table, and it represents the total response of one of the readers.

Measurement variable.—Measurement variables are variables that can be measured or counted. They are generally divided into continuous variables (eg, lesion diameter or volume) and discrete variables (eg, number of lesions, expressed as whole numbers but never as decimal fractions).

TABLE A3: Quadratic Weights for 4 × 4 Table

               Absent, 1   Minimal, 2   Moderate, 3   Severe, 4
Absent, 1        1.00        0.89         0.56          0
Minimal, 2       0.89        1.00         0.89          0.56
Moderate, 3      0.56        0.89         1.00          0.89
Severe, 4        0           0.56         0.89          1.00

Note.—The numbers 1–4 are the category indices (the values of i and j in Equation [4]) that correspond to the respective categories.

TABLE A4: Weighted Proportion of Observed and Expected Responses

Observed Weighted Proportions:
Disease Rating   Absent   Minimal   Moderate   Severe
Absent            0.31      0.08      0.01      0
Minimal           0.05      0.07      0.06      0.01
Moderate          0.01      0.04      0.04      0.10
Severe            0         0.01      0.02      0.13

Expected Weighted Proportions:
Disease Rating   Absent   Minimal   Moderate   Severe
Absent            0.16      0.08      0.03      0
Minimal           0.07      0.05      0.03      0.03
Moderate          0.04      0.04      0.03      0.05
Severe            0         0.02      0.02      0.04



Prevalence.—Prevalence is the proportion of a particular class of cases in the population being studied.

Ranked variables.—Ranked variables are categorical variables that have a natural order, such as stage of a disease, histologic grade, or discrete severity index (ie, mild, moderate, or severe).

Reliability.—Reliability is the likelihood that one reader will provide the same responses as those provided by a large consensus group.

Weighted κ.—The weighted κ is an overall measure of agreement that is corrected for agreement by chance; a weighting factor is applied to each pair of disagreements to account for the importance of the disagreement.

References
1. Baker JA, Kornguth PJ, Floyd CE. Breast imaging reporting and data system standardized mammography lexicon: observer variability in lesion description. AJR Am J Roentgenol 1996; 166:773–778.
2. Markus JB, Somers S, Franic SE, et al. Interobserver variation in the interpretation of abdominal radiographs. Radiology 1989; 171:69–71.
3. Tiitola M, Kivisaari L, Tervahartiala P, et al. Estimation or quantification of tumour volume? CT study on irregular phantoms. Acta Radiol 2001; 42:101–105.
4. Polansky M. Agreement and accuracy: mixture distribution analysis. In: Beutel J, VanMeter R, Kundel H, eds. Handbook of imaging physics and perception. Bellingham, Wash: Society of Professional Imaging Engineers, 2000; 797–835.
5. Henkelman RM, Kay I, Bronskill MJ. Receiver operating characteristic (ROC) analysis without truth. Med Decis Making 1990; 10:24–29.
6. Agresti A. Categorical data analysis. New York, NY: Wiley, 1990; 366–370.
7. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981; 212–236.
8. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33:159–174.
9. Feinstein A, Cicchetti D. High agreement but low kappa. I. The problem of two paradoxes. J Clin Epidemiol 1990; 43:543–549.
10. Cicchetti D, Feinstein A. High agreement but low kappa. II. Resolving the paradoxes. J Clin Epidemiol 1990; 43:551–558.
11. Kundel HL, Gefter W, Aronchick J, et al. Relative accuracy of screen-film and computed radiography using hard and soft copy readings: a receiver operating characteristic analysis using bedside chest radiographs in a medical intensive care unit. Radiology 1997; 205:859–863.
12. Epstein DM, Dalinka MK, Kaplan FS, et al. Observer variation in the detection of osteopenia. Skeletal Radiol 1986; 15:347–349.
13. Herman PG, Khan A, Kallman CE, et al. Limited correlation of left ventricular end-diastolic pressure with radiographic assessment of pulmonary hemodynamics. Radiology 1990; 174:721–724.
14. Taplin SH, Rutter CM, Elmore JG, et al. Accuracy of screening mammography using single versus independent double interpretation. AJR Am J Roentgenol 2000; 174:1257–1262.
15. Robinson PJ, Wilson D, Coral A, et al. Variation between experienced observers in the interpretation of accident and emergency radiographs. Br J Radiol 1999; 72:323–330.
16. Swets JA. Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychol Bull 1986; 99:100–117.
17. Uebersax JS. Modeling approaches for the analysis of observer agreement. Invest Radiol 1992; 27:738–743.
18. Kundel HL, Polansky M. Mixture distribution and receiver operating characteristic analysis of bedside chest imaging using screen-film and computed radiography. Acad Radiol 1997; 4:1–7.
19. Kundel HL, Polansky M. Comparing observer performance with mixture distribution analysis when there is no external gold standard. In: Kundel HL, ed. Medical imaging 1998: image perception. Bellingham, Wash: Society of Professional Imaging Engineers, 1998; 78–84.
20. Birkelo CC, Chamberlain WE, Phelps PS, et al. Tuberculosis case finding: a comparison of the effectiveness of various roentgenographic and photofluorographic methods. JAMA 1947; 133:359–366.
21. The "personal equation" in the interpretation of a chest roentgenogram (editorial). JAMA 1947; 133:399–400.
22. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists' interpretation of mammograms. N Engl J Med 1994; 331:1493–1499.
23. Kopans DB. Accuracy of mammographic interpretation (editorial). N Engl J Med 1994; 331:1521–1522.
24. Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977; 33:363–374.
25. Revesz G, Kundel HL, Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest Radiol 1983; 18:194–198.
26. Hillman BJ, Hessel SJ, Swensson RG, Herman PG. Improving diagnostic accuracy: a comparison of interactive and Delphi consultations. Invest Radiol 1977; 12:112–115.


Kimberly E. Applegate, MD, MS
Richard Tello, MD, MSME, MPH
Jun Ying, PhD

Index term: Statistical analysis

Published online before print
10.1148/radiol.2283021330

Radiology 2003; 228:603–608

1 From the Department of Radiology, Riley Hospital for Children, Indianapolis, Ind (K.E.A.); Department of Radiology, Boston University, 88 E Newton St, Atrium 2, Boston, MA 02118 (R.T.); and Departments of Radiology and Biostatistics, Indiana University School of Medicine, Indianapolis (J.Y.). Received October 16, 2002; revision requested February 18, 2003; final revision received March 13; accepted April 1. Address correspondence to R.T. (e-mail: [email protected]). © RSNA, 2003

Hypothesis Testing III: Counts and Medians1

Radiology research involves comparisons that deal with the presence or absence of various imaging signs and the accuracy of a diagnosis. In this article, the authors describe the statistical tests that should be used when the data are not distributed normally or when they are categorical variables. These nonparametric tests are used to analyze a 2 × 2 contingency table of categorical data. The tests include the χ² test, Fisher exact test, and McNemar test. When the data are continuous, different nonparametric tests are used to compare independent samples (the Mann-Whitney U test, equivalent to the Wilcoxon rank sum test) or paired samples (the Wilcoxon signed rank test and the sign test). These nonparametric tests are considered alternatives to the parametric t tests, especially in circumstances in which the assumptions of t tests are not valid. For radiologists to properly weigh the evidence in the literature, they must have a basic understanding of the purpose, assumptions, and limitations of each of these statistical tests.
© RSNA, 2003

The purpose of hypothesis testing is to allow conclusions to be reached about groups of people by examining samples from the groups. The data collected are analyzed by using statistical tests, which may be parametric or nonparametric, depending on the nature of the data to be analyzed. Statistical methods that require specific distributional assumptions are called parametric methods, whereas those that require no assumptions about how the data are distributed are nonparametric methods. Nonparametric tests are often more conservative than parametric ones; that is, they have less power to reject the null hypothesis (1). Nonparametric tests can be used with discrete variables or data based on weak measurement scales, consisting of rankings (ordinal scale) or classifications (nominal scale).

The purpose of this article is to discuss different nonparametric or distribution-free tests and their applications with continuous and categorical data. For the analysis of continuous data, many radiologists are familiar with the t test, a parametric test that is used to compare two means. However, misuse of the t test is common in the medical literature (2). To perform t tests properly, we need to make sure the data meet the following two critical conditions: (a) The data are continuous, and (b) the populations are distributed normally. In this article, we introduce the application of nonparametric statistical methods when these two assumptions are not met. These methods require less stringent assumptions about the population distributions than those for the t tests. When two populations are independent, the Mann-Whitney U test can be used to compare the two population distributions (3). An additional advantage of the Mann-Whitney U test is that it can be used to compare ordinal data, as well as continuous data. When the observations are in pairs from the same subject, we can use either the Wilcoxon signed rank test or the sign test to replace the paired t test.

For categorical data, the χ² test is often used. The χ² test for goodness of fit is used to study whether two or more mutually independent populations are similar (or homogeneous) with respect to some characteristic (4–12). Another application of the χ² test is a test of independence. Such a test is used to determine whether two or more characteristics are associated (or independent). In our discussion, we will also introduce some extensions of the χ² test, such as the Fisher exact test (13,14) for small samples and the McNemar test for paired data (15).

Statistical Concepts Series


CATEGORICAL DATA

In many cases, investigators in radiology are interested in comparing two groups of count data in a 2 × 2 contingency table. One of the most commonly used statistical tests to analyze categorical data is the χ² test (16). If two groups of subjects are sampled from two independent populations and a binary outcome is used for classification (eg, positive or negative imaging result), then we use the χ² test of homogeneity. Sometimes radiologists are interested in analyzing the association between two criteria of classification. This results in the test of independence by using a similar 2 × 2 contingency table and χ² statistic. When sample sizes are small, we prefer to use the Fisher exact test. If we have paired measurements from the same subject, we use the McNemar test to compare the proportions of the same outcome between these two measurements in the 2 × 2 contingency table.

χ² Test

The χ² test allows comparison of the observed frequency with its corresponding expected frequency, which is calculated according to the null hypothesis in each cell of the 2 × 2 contingency table (Eq [A1], Appendix A). If the expected frequencies are close to the observed frequencies, the model according to the null hypothesis fits the data well; thus, the null hypothesis should not be rejected. We start with the analysis of a 2 × 2 contingency table by considering the following two examples. The same χ² formula is used in both examples, but they are different in the sense that the data are sampled in different ways.

Example 1: test of homogeneity between two groups.—One hundred patients and 100 healthy control subjects are enrolled in a magnetic resonance (MR) imaging study. The MR imaging result can be classified as either "positive" or "negative" (Table 1). The radiologist is interested in finding out if the proportion of positive findings in the patient group is the same as that in the control group. In other words, the null hypothesis is that the proportion of positive findings is the same in the two groups. The alternative hypothesis is that they are different. We call this a test of homogeneity. In this first example, the two groups (patients and subjects) are in the rows, and the two outcomes of positive and negative test results are in the columns. In the statistical analysis, only one variable, the imaging result (classified as positive or negative), is considered.

The results in Table 1 show that 50 patients and 28 control subjects are categorized as having positive findings. The χ² statistic is calculated and yields a P value of .001 (17). Typically, we reject the null hypothesis if the P value is less than .05 (the significance level). In this example, we conclude that there is no homogeneity between the two groups, since the proportions of positive imaging results are different.

Example 2: test of independence between two variables in one group.—A radiologist studies gadolinium-based contrast material enhancement of renal masses at MR imaging in 65 patients (18). Table 2 shows that there are 17 patients with enhancing renal masses, with 14 malignant masses and three benign masses at pathologic examination. Among the 48 patients with nonenhancing renal masses, three masses are malignant and 45 are benign at pathologic examination. In this example, the presence or absence of contrast enhancement is indicated in the rows, and the malignant and benign pathologic findings are in the columns. In this second example, only the total number of 65 patients is fixed; the presence or absence of contrast enhancement is compared with the pathologic result (malignant or benign). The question of interest is whether these two variables are associated. In other words, the null hypothesis is that contrast enhancement of a renal mass is not associated with the presence of a malignant tumor, and the alternative hypothesis is that enhancement and malignancy are associated. In this example, the χ² statistic yields a P value less than .001. We reject the null hypothesis and conclude that the presence of contrast enhancement at MR imaging is associated with renal malignancy.

One potential issue with the χ² test is that the χ² statistic is discrete, since the observed frequencies in the 2 × 2 contingency table are counts. However, the χ² distribution itself is continuous. In 1934, Yates (12) proposed a procedure to correct for this possible bias. Although there is controversy about whether to apply this correction, it is sometimes used when the sample size is small. In the first example discussed earlier, the χ² statistic was 10.17, and the P value was .001. The Yates corrected χ² statistic is 9.27, with a P value of .002. The correction thus yields a smaller χ² statistic and a larger P value, which indicates that the Yates corrected χ² test is less powerful in rejecting the null hypothesis. Some applications of Yates correction in medicine are discussed in the statistical textbook by Altman (19).
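Both examples, including the Yates correction, can be reproduced with the chi2_contingency function in Python's scipy.stats; a minimal sketch (the variable names are ours):

```python
from scipy.stats import chi2_contingency

# Example 1 (Table 1): patients vs control subjects.
table1 = [[50, 50],
          [28, 72]]
chi2, p, dof, expected = chi2_contingency(table1, correction=False)
print(f"chi2 = {chi2:.2f}, P = {p:.3f}")  # 10.17, P = .001

# correction=True applies the Yates continuity correction.
chi2_y, p_y, _, _ = chi2_contingency(table1, correction=True)
print(f"Yates chi2 = {chi2_y:.2f}, P = {p_y:.3f}")  # 9.27, P = .002

# Example 2 (Table 2): contrast enhancement vs pathologic result (P < .001).
table2 = [[14, 3],
          [3, 45]]
chi2, p, _, _ = chi2_contingency(table2, correction=False)
print(f"chi2 = {chi2:.2f}, P = {p:.2g}")
```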

TABLE 1: χ² Test of Homogeneity

Participants          Positive MR Imaging Result   Negative MR Imaging Result   Total
Patients                          50                           50                100
Control subjects                  28                           72                100
Total                             78                          122                200

Note.—Data are the number of participants. χ² statistic, 10.17; P = .001. The null hypothesis is that the two populations are homogeneous. We reject the null hypothesis and conclude that the two populations are different.

TABLE 2: χ² Test of Independence

Presence of Contrast          Pathologic Finding
Enhancement            Malignant Mass   Benign Mass   Total
Enhancement                  14              3          17
No enhancement                3             45          48
Total                        17             48          65

Note.—Data are the number of masses. χ² statistic, 34.65; P < .001. The null hypothesis is that contrast enhancement of a renal mass at MR imaging is not associated with the presence of a malignant tumor, and the alternative hypothesis is that enhancement and malignancy are associated. We reject the null hypothesis and conclude that the presence of contrast enhancement is associated with renal malignancy.


Fisher Exact Test

When sample sizes are small, the χ² test yields poor results, and the Fisher exact test is preferred. A general rule of thumb for its use is when either the sample size is less than 30 or the expected number of observations in any one cell of a 2 × 2 contingency table is fewer than five (20). The test is called an "exact" test because it allows calculation of the exact probability (rather than an approximation) of obtaining the observed results or results that are more extreme. Although radiologists may be more familiar with the traditional χ² test, there is no reason not to use the Fisher exact test in its place, given the ease of use and availability of computer software today.

In example 1, the P value resulting from use of the χ² test was .001, whereas the P value for the same data tested by using the Fisher exact test was .002. Both tests lead to the same conclusion of lack of homogeneity between the patient and control groups. Intuitively, the P value derived by using the Fisher exact test is the probability of positive results becoming more and more discrepant between the two groups. Most statistical software packages provide computation of the Fisher exact test (Appendix B).

Example 3: Fisher exact test.—A radiologist enrolls 20 patients and 20 healthy subjects in a computed tomographic (CT) study. The CT result is classified as either "positive" or "negative." Table 3 shows that 10 patients and four healthy subjects have positive findings at CT. The null hypothesis is that the two populations are homogeneous in the number of positive findings seen at CT.

In this example, the sample sizes in both the patient and control groups are small. The Fisher exact test yields a P value of .10. We retain the null hypothesis because the P value does not indicate a significant difference, and we conclude that these two groups are homogeneous. If we use the χ² test incorrectly, the P value is .05, which suggests the opposite conclusion—that the proportions of positive CT results are different in these two groups.
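A minimal sketch of example 3 using the fisher_exact function in Python's scipy.stats:

```python
from scipy.stats import fisher_exact

# Example 3 (Table 3): 20 patients and 20 control subjects at CT.
table = [[10, 10],
         [4, 16]]
oddsratio, p = fisher_exact(table, alternative='two-sided')
print(f"Fisher exact P = {p:.2f}")  # about .10, as reported above
```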

McNemar Test for Paired Data

A test for assessment of paired count data is the McNemar test (15). This test is used to compare two paired measurements from the same subject. When the sample size is large, the McNemar test follows the same χ² distribution but uses a slightly different formula. Radiology research often involves the comparison of two paired imaging results from the same subject. In a 2 × 2 table, the results of one imaging test are labeled "positive" and "negative" in rows, and the results of another imaging test are labeled similarly in columns. An interesting property of this table is that there are two concordant cells in which the paired results are the same (both positive or both negative) and two discordant cells in which the paired results are different for the same subject (positive-negative or negative-positive).

We are interested in analyzing whether these two imaging tests show equivalent results. The McNemar test uses only the information in the discordant cells and ignores the concordant cell data. In particular, the null hypothesis is that the proportions of positive results are the same for these two imaging tests, versus the alternative hypothesis that they are not the same. Intuitively, the null hypothesis is retained if the discordant pairs are distributed evenly in the two discordant cells. The following example illustrates the problem in more detail.

Example 4: McNemar test for paired data.—There are 200 patients enrolled in a study to compare CT and conventional angiography of coronary bypass grafts for the diagnosis of graft patency (Table 4). Seventy-one patients have positive results with both conventional angiography and CT angiography, 86 have negative results with both, 30 have positive CT results but negative conventional angiographic results, and 13 have negative CT results but positive conventional angiographic results. The McNemar test compares the proportions of the discordant pairs (13 of 200 vs 30 of 200). The P value of the McNemar statistic is .02, which suggests that the proportion of positive results is significantly different for the two modalities. Therefore, we conclude that the ability of these two modalities to demonstrate graft patency is different.

Some radiologists may incorrectly summarize the data in the way shown in Table 5 and perform a χ² test, as discussed in example 1 (21). This is a common mistake in the medical literature. With this approach, the proportions compared are 101 of 200 versus 84 of 200. The problem is the assumption that CT angiography and conventional angiography results are independent; the paired relationship between these two imaging tests is ignored (2,21). The χ² test has less power to reject the null hypothesis than does the McNemar test in this situation and results in a P value of .09. We would incorrectly conclude that there is no significant difference in the ability of these two modalities to demonstrate graft patency.
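A minimal sketch of example 4 using the mcnemar function in the Python statsmodels package; exact=False with the continuity correction reproduces the χ² statistic of 5.95 reported in Table 4:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Example 4 (Table 4): paired CT and conventional angiography results.
# Rows: CT positive/negative; columns: angiography positive/negative.
table = np.array([[71, 30],
                  [13, 86]])

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {result.statistic:.2f}, P = {result.pvalue:.3f}")
```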

TABLE 3: Fisher Exact Test for Small Samples

Participants          Positive CT Result   Negative CT Result   Total
Patients                      10                   10             20
Control subjects               4                   16             20
Total                         14                   26             40

Note.—Data are the number of participants. (Two-sided) Fisher exact test result, P = .10. For small samples, the Fisher exact test is used. The null hypothesis is that the two populations of patients and control subjects are homogeneous at CT—that is, they have the same number of positive results. We retain the null hypothesis because the P value does not indicate a significant difference, and we conclude that these two groups are homogeneous. If we incorrectly used the χ² test for this comparison, the conclusion would have been the opposite: χ² = 3.96, P = .04.

TABLE 4: McNemar Test for Paired Comparisons: Angiography versus CT Results in the Diagnosis of Coronary Bypass Graft Thrombosis

CT Result     Positive Angiography Result   Negative Angiography Result   Total
Positive                  71                            30                 101
Negative                  13                            86                  99
Total                     84                           116                 200

Note.—Data are the number of CT results. McNemar χ² result, 5.95; P = .02. The P value indicates a significant difference, and therefore, we reject the null hypothesis and conclude that there is a difference between these two modalities.

TABLE 5: Incorrect Use of the χ² Test for Paired Data for the Evaluation of Angiography versus CT (when paired data are incorrectly treated as independent)

Modality        Positive Result   Negative Result   Total
CT                   101                99            200
Angiography           84               116            200

Note.—Data are the number of results. For the χ² test with the assumption of two independent samples, P = .09. We would incorrectly conclude that there is no significant difference between these two modalities.


HYPOTHESIS TESTING BY USING MEDIANS

The unpaired and paired t tests require that the population distributions be normal or approximately so. In medicine, however, we often do not know whether a distribution is normal, or we know that the distribution departs substantially from normality.

Nonparametric tests were developed to deal with situations where the population distributions are either not normal or unknown, especially when the sample size is small (<30 samples). These tests are relatively easy to understand and simple to apply and require minimal assumptions about the population distributions. However, this does not mean that they are always preferred to parametric tests. When the assumptions are met, parametric tests have higher testing power than their nonparametric counterparts; that is, it is more likely that a false null hypothesis will be rejected.

Three commonly encountered nonparametric tests include the Mann-Whitney U test (equivalent to the Wilcoxon rank sum test), the Wilcoxon signed rank test, and the sign test.

Comparison of Two Independent Samples: Mann-Whitney U Test

The Mann-Whitney U test is used to compare the difference between two population distributions and assumes the two samples are independent (22). It does not require normal population distributions, and the measurement scale can be ordinal.

The Mann-Whitney U test is used to test the null hypothesis that there is no location difference between two population distributions versus the alternative hypothesis that the location of one population distribution differs from the other. With the null hypothesis, the same location implies the same median for the two populations. For simplicity, we can restate the null hypothesis: The medians of the two populations are the same. Three alternative hypotheses are available: (a) The population medians are not equal, (b) the population median of the first group is larger than that of the second, or (c) the population median of the second group is larger than that of the first. If we put the two random samples together and rank them, then, according to the null hypothesis, which holds that there is no difference between the two population medians, the total rank of one sample would be close to the total rank of the other. On the other hand, if all the ranks of one sample are smaller than the ranks of the other, then we know almost surely that the location of one population is shifted relative to that of the other.

We give two examples of the application of the Mann-Whitney U test, one involving continuous data and the other involving ordinal data.

Example 5: Mann-Whitney U test for continuous data.—The uptake of fluorine 18 choline (hereafter, "fluorocholine") by the kidney can be considered approximately distributed normally (23). Let us say that some results of hypothetical research suggest that fluorocholine uptake above 5.5 (percentage dose per organ) is more common in men than in women. If we are only interested in the patients whose uptake is over 5.5, the distribution is no longer normal but becomes skewed. The Figure shows the uptake over 5.5 in 10 men and seven women sampled from populations imaged with fluorocholine for tumor surveillance. We are interested in finding out if there are any differences in these populations on the basis of patient sex.

We can quickly exclude use of the t test in this example, since the fluorocholine uptakes we have selected are no longer distributed normally. The null hypothesis is that the medians for the men and women are the same. By using the Mann-Whitney U test, the P value is .06, so we retain the null hypothesis. We conclude that the medians of these two populations are the same at the .05 significance level, and therefore, men and women have similar renal uptake of fluorocholine. If we had incorrectly used the t test, the P value would be .02, and we would conclude the opposite.

Example 6: Mann-Whitney U test for ordinal data.—A radiologist wishes to know which of two different MR imaging sequences provides better image quality. Twenty-four patients undergo MR imaging with a T2-weighted fast spin-echo sequence, and 22 other patients are imaged with the same sequence but with the addition of fat saturation (Table 6). The image quality is measured by using a standardized scoring system, ranging from 1 to 100, where 100 is the best image quality. The null hypothesis is that the median scores are the same for the two populations. In the group imaged with the first MR sequence, the images of eight subjects are scored under 25, those of 14 subjects are scored between 25 and 75, and those of two subjects are scored above 75. In the group imaged with the fat-saturated MR sequence, there are three, 12, and seven subjects in these three score categories, respectively.

In this example, each patient's image score is classified into one of three ordinal categories. Since the observations are discrete rather than continuous, the t test cannot be used. Some researchers might consider the data in Table 6 to be a 2 × 3 contingency table and use a χ² statistic to compare the two groups. The P value corresponding to the χ² statistic is .08, and we would conclude that the two groups have similar image quality.

TABLE 6: Mann-Whitney U Test for Ordinal Data

                        Score
Fat Saturation   <25   25–75   >75   Total
No                 8     14      2     24
Yes                3     12      7     22
Total             11     26      9     46

Note.—Data are scores from T2-weighted fast spin-echo MR images obtained with or without fat saturation. Wilcoxon rank sum test, P = .03. The null hypothesis is that the median scores for the two types of MR images are the same. We reject the null hypothesis and conclude that they are not the same. If we incorrectly used the χ² test, we would conclude the opposite: χ² = 5.13, P = .08.

Figure. Mann-Whitney U test result for continuous data (fluorocholine uptake over 5.5 [percentage dose per organ] in human kidneys), P = .06. We retain the null hypothesis that there is no difference in the medians, and we conclude that the fluorocholine renal uptake in men and women is similar at the .05 significance level (the marginal P value suggests a trend toward a significant difference). If we had incorrectly used the t test, we would have concluded the opposite: P = .02.


The problem is that the three image quality score categories are treated only as nominal variables, and their ordinal relationship is not accounted for in the χ² test. An alternative test that allows us to use this information is the Mann-Whitney U test, which yields a P value of .03. We reject the null hypothesis and conclude that the median image scores are different.
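A minimal sketch of example 6 in Python with scipy.stats, coding the three ordinal categories as 1, 2, and 3 (the codes are arbitrary; only their order matters):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Example 6 (Table 6): counts per ordinal score category.
no_fat_sat = np.repeat([1, 2, 3], [8, 14, 2])  # 24 patients: <25, 25-75, >75
fat_sat = np.repeat([1, 2, 3], [3, 12, 7])     # 22 patients

stat, p = mannwhitneyu(no_fat_sat, fat_sat, alternative='two-sided')
print(f"Mann-Whitney P = {p:.2f}")  # about .03, as reported above
```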

Comparison of Paired Samples: Wilcoxon Signed Rank Test

The Wilcoxon signed rank test is an alternative to the paired t test. Each paired sample is dependent, and the data are continuous. The assumption needed to use the Wilcoxon signed rank test is less stringent than the assumptions needed for the paired t test. It requires only that the paired population be distributed symmetrically about its median (24).

The Wilcoxon signed rank test is used to test the null hypothesis that the median of the paired population differences is zero versus the alternative hypothesis that the median is not zero. Since the distribution of the differences is symmetric about the mean, it is equivalent to using the mean for the purpose of hypothesis testing, as long as the sample size is large enough (at least 10 rankings).

We rank the absolute values of the paired differences from the sample. With the null hypothesis, we would expect the total rank of the pairs whose differences are negative to be comparable to the total rank of the pairs whose differences are positive. The following example shows the application of the Wilcoxon signed rank test.

Example 7: paired data.—A sample of 20 patients is used to compare ring enhancement between T1-weighted spin-echo MR images and fat-saturated T1-weighted spin-echo MR images obtained after contrast material administration (Table 7). We notice that the image quality score on fat-saturated T1-weighted spin-echo MR images in case 7 is 98, which is much higher than the others. As a result, the difference in values between the two sequences is also much higher than that for the other paired differences. It would be unwise to use a paired t test in this case, since the t test is sensitive to extreme values in a sample and tends to incorrectly retain a false null hypothesis as a consequence. The nonparametric tests are more robust to data extremes, and thus, the Wilcoxon signed rank test is preferred in this case. The null hypothesis states that the median of the paired MR sequence differences is zero. The Wilcoxon signed rank test provides a P value of .02, so we reject the null hypothesis. We conclude that the fat-saturated MR sequence showed ring enhancement better than did the MR sequence without fat saturation. If we had incorrectly used the paired t test, the P value would be .07, and we would have arrived at the opposite conclusion.

Comparing Paired Samples: Sign Test

The sign test is another nonparametric test that can be used to analyze paired data. Unlike the Wilcoxon signed rank test or the paired t test, this test requires neither a symmetric distribution nor a normal distribution of the variable of interest. The only assumption underlying this test is that the data are continuous. Since the distribution is arbitrary with the sign test, the hypothesis of interest focuses on the median rather than the mean as a measure of central tendency. In particular, the null hypothesis for comparing paired data is that the median difference is zero. The alternative hypothesis is that the median difference is not, is greater than, or is less than zero. This simplistic test considers only the signs of the differences between two measurements and ignores the magnitudes of the differences. As a result, it is less powerful than the Wilcoxon signed rank test, and a false null hypothesis is often not rejected (25).

SUMMARY

Hypothesis testing is a method for developing conclusions about data. Radiology research often produces data that require nonparametric statistical analyses. Nonparametric tests are used for hypothesis testing when the assumptions about the data distributions are not valid or when the data are categorical. We have discussed the most common of these statistical tests and provided examples to demonstrate how to perform them. For radiologists to properly weigh the evidence in our literature, we need a basic understanding of the purpose, assumptions, and limitations of each of these statistical tests. Understanding how and when these methods are used will strengthen our ability to evaluate the medical literature.

TABLE 7: Wilcoxon Signed Rank Test for Paired Comparisons of Ring Enhancement between Two Spin-Echo MR Sequences

             MR Sequence*
Case No.   T1-weighted   Fat-saturated T1-weighted   Difference†   Absolute Value of Difference   Rank‡
1              45                  43                    −2                    2                    5
2              45                  42                    −3                    3                    8
3              49                  47                    −2                    2                    5
4              50                  47                    −3                    3                    8
5              49                  48                    −1                    1                    3
6              44                  50                     6                    6                   13.5
7              42                  98                    56                   56                   20
8              49                  47                    −2                    2                    5
9              39                  44                     5                    5                   11
10             42                  42                     0                    0                    1.5
11             44                  54                    10                   10                   18
12             47                  53                     6                    6                   13.5
13             42                  53                    11                   11                   19
14             45                  54                     9                    9                   16.5
15             44                  48                     4                    4                   10
16             41                  47                     6                    6                   13.5
17             45                  54                     9                    9                   16.5
18             50                  47                    −3                    3                    8
19             51                  51                     0                    0                    1.5
20             42                  48                     6                    6                   13.5

Note.—Wilcoxon signed rank test, P = .02. The null hypothesis is that the median of the paired population differences is zero. The Wilcoxon signed rank test result indicates a significant difference, and therefore, we conclude that the enhanced fat-saturated T1-weighted MR sequence showed ring enhancement better than did the conventional enhanced T1-weighted MR sequence. If we had incorrectly used the paired t test, the P value would be .07, and we would have had the opposite conclusion. Like the t test, the sign test produced a P value of .50, and the conclusion would be that the two sequences are the same.

* Data are image quality scores on MR images after contrast material administration.
† Difference in enhancement values between MR sequences.
‡ Rank of the absolute values of the differences; when there is a tie in the ranking, an average ranking is assigned—for example, rank 16.5 rather than ranks 16 and 17 for the tied case numbers 14 and 17, and rank 13.5 for the four cases (6, 12, 16, and 20) that compose ranks 12, 13, 14, and 15.
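A minimal sketch of example 7 in Python with scipy.stats; zero_method='pratt' keeps the zero differences in the ranking, matching the tied ranks of 1.5 shown in Table 7:

```python
import numpy as np
from scipy.stats import wilcoxon

# Example 7 (Table 7): paired image quality scores for the two sequences.
t1 = np.array([45, 45, 49, 50, 49, 44, 42, 49, 39, 42,
               44, 47, 42, 45, 44, 41, 45, 50, 51, 42])
t1_fs = np.array([43, 42, 47, 47, 48, 50, 98, 47, 44, 42,
                  54, 53, 53, 54, 48, 47, 54, 47, 51, 48])

stat, p = wilcoxon(t1_fs, t1, zero_method='pratt')
print(f"Wilcoxon signed rank P = {p:.2f}")  # about .02, as reported above
```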



APPENDIX A

The χ² formula is based on the following equation:

$$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}, \qquad \text{(A1)}$$

where Fo is the frequency observed in each cell, and Fe is the frequency expected in each cell, which is calculated by multiplying the row frequency by the quotient of the column frequency divided by total sample size.
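As a cross-check, Equation (A1) can be evaluated directly; a minimal sketch in Python with NumPy, applied to the example 1 counts:

```python
import numpy as np

# Observed frequencies from example 1 (Table 1).
Fo = np.array([[50.0, 50.0],
               [28.0, 72.0]])

# Expected frequency in each cell: row total times column total over N.
Fe = np.outer(Fo.sum(axis=1), Fo.sum(axis=0)) / Fo.sum()

chi2 = ((Fo - Fe) ** 2 / Fe).sum()  # Equation (A1)
print(f"chi2 = {chi2:.2f}")         # 10.17, as in example 1
```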

APPENDIX B

Examples of statistical software that is easily capable of calculating χ² and McNemar statistics include SPSS, SAS, StatXact 5, and EpiInfo (EpiInfo allows calculation of the Fisher exact test and may be downloaded at no cost from the Centers for Disease Control and Prevention Web site at www.cdc.gov). Other statistical Web sites include fonsg3.let.uva.nl/Service/Statistics.html, department.obg.cuhk.edu.hk/ResearchSupport/WhatsNew.asp, and www.graphpad.com/quickcalcs/Contingency1.cfm (all Web sites accessed January 30, 2003).

Acknowledgments: The authors thank Miriam Bowdre for secretarial support in preparation of the manuscript and George Parker, MD, Phil Crewson, PhD, and an anonymous statistical reviewer for comments on earlier drafts of the manuscript.

References
1. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 6th ed. New York, NY: Wiley, 1995; 526.
2. Gore SM, Jones IG, Rytter EC. Misuse of statistical methods: critical assessment of articles in BMJ from January to March 1976. BMJ 1977; 1:85–87.
3. Wilcoxon F. Individual comparisons by ranking methods. Biometrics 1945; 1:80–83.
4. Fisher LD, Belle GV. Biostatistics: a methodology for the health sciences. New York, NY: Wiley, 1993.
5. Ott RL. An introduction to statistical methods and data analysis. 4th ed. Belmont, Calif: Wadsworth, 1993.
6. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 6th ed. New York, NY: Wiley, 1995.
7. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall, 1991.
8. Chase W, Bown F. General statistics. 4th ed. New York, NY: Wiley, 2000.
9. Conover WJ. Practical nonparametric statistics. 3rd ed. New York, NY: Wiley, 1999.
10. Rosner B. Fundamentals of biostatistics. 4th ed. Boston, Mass: Duxbury, 1995; 370–565.
11. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981; 112–125.
12. Yates F. Contingency tables involving small numbers and the chi-square test. J Royal Stat Soc Ser B 1934; (suppl) 1:217–235.
13. Fisher RA. Statistical methods for research workers. 5th ed. Edinburgh, Scotland: Oliver & Boyd, 1934.
14. Fisher RA. The logic of inductive inference. J Royal Stat Soc Ser A 1935; 98:39–54.
15. Cochran WG. Some methods for strengthening the common chi-square tests. Biometrics 1954; 10:417–451.
16. Agresti A. Categorical data analysis. New York, NY: Wiley, 1990.
17. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 6th ed. New York, NY: Wiley, 1995; 691.
18. Tello R, Davidson BD, O'Malley M, et al. MR imaging of renal masses interpreted on CT to be suspicious. AJR Am J Roentgenol 2000; 174:1017–1022.
19. Altman D. Practical statistics for medical research. London, England: Chapman & Hall, 1991; 252.
20. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 6th ed. New York, NY: Wiley, 1995; 537.
21. Dwyer AJ. Matchmaking and McNemar in the comparison of diagnostic modalities. Radiology 1991; 178:328–330.
22. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 1947; 18:50–60.
23. DeGrado TR, Reiman R, Price DT, Wang S, Coleman RE. Pharmacokinetics and radiation dosimetry of F-fluorocholine. J Nucl Med 2002; 43:92–96.
24. Altman D. Practical statistics for medical research. London, England: Chapman & Hall, 1991; 194–197.
25. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 6th ed. New York, NY: Wiley, 1995; 569–578.


Nancy A. Obuchowski, PhD

Index terms: Diagnostic radiology; Statistical analysis

Published online
10.1148/radiol.2291010898

Radiology 2003; 229:3–8

Abbreviations: FPR = false-positive rate; MS = multiple sclerosis; ROC = receiver operating characteristic

1 From the Department of Biostatistics and Epidemiology/Wb4, Cleveland Clinic Foundation, 9500 Euclid Ave, Cleveland, OH 44195. Received May 8, 2001; revision requested June 11; revision received August 1; accepted August 2. Address correspondence to the author (e-mail: [email protected]). © RSNA, 2003

Receiver Operating Characteristic Curves and Their Use in Radiology1

Sensitivity and specificity are the basic measures of accuracy of a diagnostic test; however, they depend on the cut point used to define "positive" and "negative" test results. As the cut point shifts, sensitivity and specificity shift. The receiver operating characteristic (ROC) curve is a plot of the sensitivity of a test versus its false-positive rate for all possible cut points. The advantages of the ROC curve as a means of defining the accuracy of a test, construction of the ROC, and identification of the optimal cut point on the ROC curve are discussed. Several summary measures of the accuracy of a test, including the commonly used percentage of correct diagnoses and area under the ROC curve, are described and compared. Two examples of ROC curve application in radiologic research are presented.
© RSNA, 2003

Sensitivity and specificity are the basic measures of the accuracy of a diagnostic test. They describe the abilities of a test to enable one to correctly diagnose disease when disease is actually present and to correctly rule out disease when it is truly absent. The accuracy of a test is measured by comparing the results of the test to the true disease status of the patient. We determine the true disease status with the reference standard procedure.

Consider as an example the test results of 100 patients who have undergone mammography (Table 1). According to biopsy results and/or 2-year follow-up results (ie, the reference standard procedures), 50 patients actually have a malignant lesion and 50 patients do not. If these 100 test results were from 100 asymptomatic women without a personal history of breast cancer, then we might define a positive test result as any that represents a "suspicious" or "malignant" finding and a negative test result as any that represents a "normal," "benign," or "probably benign" finding. We have used a cut point for defining positive and negative test results. The cut point is located between the suspicious and probably benign findings. The estimated sensitivity with this cut point is (18 + 20)/50 = 0.76, and the specificity is (15 + 3 + 18)/50 = 0.72.

Alternatively, if these 100 test results were from 100 asymptomatic women with a personal history of breast cancer, then we might use a different cut point, such that a positive test result represents a probably benign, suspicious, or malignant finding and a negative test result represents a normal or benign finding. The estimates of sensitivity and specificity would change (ie, they would now be 0.96 and 0.36, respectively).

Important point: Sensitivity and specificity depend on the cut point used to define positive and negative test results. As the cut point shifts, the sensitivity increases while the specificity decreases, or vice versa.

COMBINED MEASURES OF SENSITIVITY AND SPECIFICITY

It is often useful to summarize the accuracy of a test by using a single number; for example, when comparing two diagnostic tests, it is easier to compare a single number than to compare both the sensitivity and the specificity values of the tests. There are several such summary measures; I will describe a popular but easily misinterpreted one that is usually referred to simply as accuracy. Using the second cut point in Table 1, we can compute accuracy as the percentage of correct diagnoses in the entire sample—that is, (48 + 18)/100 = 0.66, or 66%. The strength of this measure of accuracy is its simple computation.

Statistical Concepts Series

3

Ra

dio

logy

Page 64: Radiology:Statistical Concepts Series PDF - Cornell …hpr.weill.cornell.edu/divisions/biostatistics/pdf/Radiology_Stats... · Anthony V. Proto, MD, Editor Radiology 2002ÑStatistical

It has several limitations, however: Its magnitude varies as the prevalence of disease varies in the sample, it is calculated on the basis of only one cut point, and false-positive and false-negative results are treated as if they are equally undesirable. As an illustration of the first limitation, note that in Table 2 the prevalence of disease is 5% instead of the 50% in Table 1. The sensitivity and specificity values are the same in Tables 1 and 2, yet the estimated accuracy value in Table 2 drops to (48 + 342)/1,000 = 0.39, or 39%.

Important point: A measure of test accuracy is needed that combines sensitivity and specificity but does not depend on the prevalence of disease.

RECEIVER OPERATING CHARACTERISTIC CURVE

In 1971, Lusted (1) described how receiver operating characteristic (ROC) curves could be used to assess the accuracy of a test. An ROC curve is a plot of test sensitivity (plotted on the y axis) versus its FPR (or 1 − specificity) (plotted on the x axis). Each point on the graph is generated by using a different cut point. The set of data points generated from the different cut points is the empirical ROC curve. We use lines to connect the points from all the possible cut points. The resulting curve illustrates how sensitivity and the FPR vary together.

Figure 1 illustrates the empirical ROC curve for the mammography example. Since in our example there are five categories for the test results, we can compute four cut points for the ROC curve. The two endpoints on the ROC curve are (0, 0) and (1, 1) for (FPR, sensitivity). The points labeled 1 and 2 on the curve correspond to the first and second cut points, respectively, that are defined in the note to Table 1. Estimations of the other points are provided in Table 3.
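A minimal sketch in Python with NumPy of how the empirical (FPR, sensitivity) pairs are generated from the Table 1 counts (the variable names are ours):

```python
import numpy as np

# Counts per rating category (Table 1), from least to most suspicious:
# normal, benign, probably benign, suspicious, malignant.
with_cancer = np.array([2, 0, 10, 18, 20])
without_cancer = np.array([15, 3, 18, 13, 1])

# At each cut point a "positive" result is the cut category or anything more
# suspicious, so cumulative sums from the malignant end give the true-positive
# and false-positive counts.
tp = np.cumsum(with_cancer[::-1])
fp = np.cumsum(without_cancer[::-1])

sens = tp / with_cancer.sum()    # sensitivity at each cut point
fpr = fp / without_cancer.sum()  # false-positive rate (1 - specificity)

for f, s in zip(fpr, sens):
    print(f"FPR = {f:.2f}, sensitivity = {s:.2f}")
```

The pairs for cut points 1 and 2 work out to (0.28, 0.76) and (0.64, 0.96), in agreement with the sensitivity and specificity values computed earlier.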

The ROC plot has many advantages over single measurements of sensitivity and specificity (2). The scales of the curve—that is, sensitivity and FPR—are the basic measures of accuracy and are easily read from the plot; the values of the cut points are often labeled on the curve as well. Unlike the measure of accuracy defined in the previous section (ie, the percentage of correct diagnoses), the ROC curve displays all possible cut points. Because sensitivity and specificity are independent of disease prevalence, so too is the ROC curve. The curve does not depend on the scale of the test results (ie, we can alter the test results by adding or subtracting a constant or taking the logarithm or square root without any change to the ROC curve) (3). Lastly, the ROC curve enables a direct visual comparison of two or more tests on a common set of scales at all possible cut points.

It is often convenient to make some assumptions about the distribution of the test results and then to draw the ROC curve on the basis of the assumed distribution (ie, assumed model). The resulting curve is called the fitted or smooth ROC curve. The fitted curve for the mammography study is plotted in Figure 1; it was constructed from a binormal distribution (ie, two normal distributions: one for the test results of patients without breast cancer and another for the test results of patients with breast cancer) (Fig 2). The binormal distribution is the most commonly used distribution for estimating the smooth ROC curve. There are computer programs (for example, www-radiology.uchicago.edu/sections/roc/software.cgi) for estimating the smooth ROC curve on the basis of the binormal distribution; these programs make use of a statistical method called maximum likelihood estimation.
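A minimal sketch in Python with SciPy of a smooth binormal ROC curve; the parameters a and b below are invented for illustration (they are not fitted to the mammography data), and the closed-form area expression is the standard binormal result rather than a value from this article:

```python
import numpy as np
from scipy.stats import norm

# Binormal parameters: a is the standardized separation of the two normal
# distributions and b is the ratio of their standard deviations.
a, b = 1.2, 0.9

fpr = np.linspace(0.001, 0.999, 199)
sens = norm.cdf(a + b * norm.ppf(fpr))   # smooth curve: TPF = Phi(a + b*Phi^-1(FPF))

auc = norm.cdf(a / np.sqrt(1 + b ** 2))  # area under the binormal ROC curve
print(f"area under the fitted curve = {auc:.2f}")
```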

TABLE 1: Results from Mammography Study with 100 Patients

Cut Point and Reference                    Radiologist's Interpretation
Standard Result            Normal   Benign   Probably Benign   Suspicious   Malignant   Total

Cut point 1*
  Cancer present              2        0           10             18†          20†        50
  Cancer absent              15        3           18             13‡           1‡        50
Cut point 2*
  Cancer present              2        0           10†            18†          20†        50
  Cancer absent              15        3           18‡            13‡           1‡        50

Note.—Data are numbers of patients with the given result in a fictitious study of mammography in which 50 patients had a malignant lesion and 50 did not.
* For cut point 1, a positive result is defined as a test score of suspicious or malignant; for cut point 2, a positive result is defined as a test score of probably benign, suspicious, or malignant.
† Test results considered true-positive (for estimating sensitivity) with this cut point.
‡ Test results considered false-positive (for estimating the false-positive rate [FPR] or specificity) with this cut point.

TABLE 2: Effect of Prevalence on Accuracy

Reference                        Radiologist's Interpretation
Standard Result    Normal   Benign   Probably Benign   Suspicious   Malignant   Total
Cancer present        2        0           10              18           20         50
Cancer absent       285       57          342             247           19        950

Note.—Data are numbers of patients with the given result in a fictitious study of mammography with 1,000 patients. This data set represents a modification of the data set in Table 1 so that the prevalence of cancer is 5%. When cut point 2 (described in the note to Table 1) is used with this data set, the estimated sensitivity ([10 + 18 + 20]/50 = 0.96) and specificity ([285 + 57]/950 = 0.36) are the same as with the data set in Table 1. However, one commonly used estimate of overall accuracy is the percentage of correct diagnoses in the sample. With this data set it is 39% ([10 + 18 + 20 + 285 + 57]/1,000 = 0.39), which is not the same as with the data set in Table 1.


There are computer programs (for example, www-radiology.uchicago.edu/sections/roc/software.cgi) for estimating the smooth ROC curve on the basis of the binormal distribution; these programs make use of a statistical method called maximum likelihood estimation.

An ROC curve can be constructed from objective measurements of a test (eg, serum glucose level as a test for diabetes), objective evaluation of image features (eg, the computed tomographic [CT] attenuation coefficient of a renal mass relative to normal kidney), or subjective diagnostic interpretations (eg, the five-category Breast Imaging Reporting and Data System scale used for mammographic interpretation) (5). The only requirement is that the measurements or interpretations can be meaningfully ranked in magnitude. With objective measurements the cut point is explicit, so one can choose from an infinite number of cut points along the continuum of the test results. For diagnostic tests whose results are interpreted subjectively, the cut points are implicit or latent in that they only exist in the mind of the observer (6). Furthermore, it is assumed that each observer has his or her own set of cut points.

The term receiver operating characteristic curve comes from the idea that, given the curve, we, the receivers of the information, can use (or operate at) any point on the curve by using the appropriate cut point. The clinical application determines which cut point is used. For example, for evaluating women with a personal history of breast cancer, we need a cut point with good sensitivity (eg, cut point 2 in Table 1), even if the FPR is high. For evaluating women without a personal history of breast cancer, we require a lower FPR. For each application the optimal cut point (2,7) can be determined by finding the sensitivity and specificity pair that maximizes the function sensitivity − m(1 − specificity), where m is the slope of the ROC curve, computed as follows:

m = (ProbNorm/ProbDis) × [(CFP − CTN)/(CFN − CTP)],

where ProbNorm is the probability that the patient's condition is normal before the test is performed, ProbDis is the probability that the patient has the disease before the test is performed, CFP is the cost (ie, the financial cost and/or health "cost") of a false-positive result, CTN is the cost of a true-negative result, CFN is the cost of a false-negative result, and CTP is the cost of a true-positive result.
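The following sketch shows how this rule might be applied to the empirical operating points of Table 3. The pretest probability and the four costs are hypothetical placeholders, since the article does not specify numeric values for them.

```python
# Sketch: choosing an operating point by maximizing sensitivity - m*(1 - specificity).
roc_points = [  # (sensitivity, FPR) pairs from the empirical curve (Table 3)
    (0.96, 0.70), (0.96, 0.64), (0.76, 0.28), (0.40, 0.02)]

prob_dis = 0.20                                # assumed pretest probability of disease
prob_norm = 1 - prob_dis
c_fp, c_tn, c_fn, c_tp = 1.0, 0.0, 5.0, 0.0    # hypothetical costs of each outcome

m = (prob_norm / prob_dis) * (c_fp - c_tn) / (c_fn - c_tp)
best = max(roc_points, key=lambda p: p[0] - m * p[1])
print(f"slope m = {m:.2f}; optimal point: sensitivity = {best[0]}, FPR = {best[1]}")
```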

MEASURES OF ACCURACY BASED ON THE ROC CURVE

One of the most popular measures of the accuracy of a diagnostic test is the area under the ROC curve. The ROC curve area can take on values between 0.0 and 1.0. An ROC curve with an area of 1.0 is shown in Figure 3. A test with an area under the ROC curve of 1.0 is perfectly accurate because the sensitivity is 1.0 when the FPR is 0.0. In contrast, a test with an area of 0.0 is perfectly inaccurate. That is, all patients with disease are incorrectly given negative test results and all patients without disease are incorrectly given positive test results. With such a test it would be better to convert it into a test with perfect accuracy by reversing the interpretation of the test results. The practical lower bound for the ROC curve area is then 0.5. The line segment from 0,0 to 1,1 has an area of 0.5; it is called the chance diagonal (Fig 3). If we relied purely on guessing to distinguish patients with from patients without disease, then the ROC curve would be expected to fall along this diagonal line. Diagnostic tests with ROC curve areas greater than 0.5 have at least some ability to discriminate between patients with and those without disease. The closer the ROC curve area is to 1.0, the better the diagnostic test. One method (8) of estimating the area under the empirical ROC curve is described and illustrated in the Appendix. There are other methods (9,10) of estimating the area under the empirical ROC curve and its variance; all of these methods rely on nonparametric statistical methods.

The ROC curve area has several interpretations: (a) the average value of sensitivity for all possible values of specificity, (b) the average value of specificity for all possible values of sensitivity (11,12), and (c) the probability that a randomly selected patient with disease has a test result that indicates greater suspicion than a randomly chosen patient without disease (9).

In Figure 1 the area under the empirical ROC curve for mammography is 0.82; that is, if we select two patients at random—one with breast cancer and one without—the probability is 0.82 that the patient with breast cancer will have a more suspicious mammographic result. The area under the fitted curve is slightly larger at 0.84. When the number of cut points is small, the area under the empirical ROC curve is usually smaller than the area under the fitted curve.
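As a check, the fitted area can be reproduced from the binormal parameters given in the caption of Figure 2, using the closed-form identity AUC = Φ(a/√(1 + b²)). That identity is a standard result for the binormal model rather than something derived in this article, so the sketch below should be read as an illustration under that assumption.

```python
# Sketch: the smooth ROC area implied by the binormal model of Figure 2.
import numpy as np
from scipy.stats import norm

mu1, var1 = 0.0, 1.0      # patients without cancer (fixed by convention)
mu2, var2 = 1.59, 1.54    # patients with cancer (estimated in Figure 2)

a = (mu2 - mu1) / np.sqrt(var2)
b = np.sqrt(var1) / np.sqrt(var2)
auc = norm.cdf(a / np.sqrt(1 + b**2))   # standard binormal AUC identity
print(f"a = {a:.2f}, b = {b:.2f}, fitted AUC = {auc:.2f}")   # AUC is about 0.84
```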

The ROC curve area is a good summary measure of test accuracy because it does not depend on the prevalence of disease or the cut points used to form the curve.

TABLE 3
Construction of Empirical ROC Curve for Mammography Study

Cut Point                                  Sensitivity*     FPR†

Between normal and benign                  0.96 (48/50)     0.70 (35/50)
Between benign and probably benign         0.96 (48/50)     0.64 (32/50)
Between probably benign and suspicious     0.76 (38/50)     0.28 (14/50)
Between suspicious and malignant           0.40 (20/50)     0.02 (1/50)

Note.—These data represent estimations of the points on the empirical ROC curve marked with open circles and depicted in Figure 1. The ROC curve in Figure 1 was constructed on the basis of the data in Table 1, with sensitivity and the FPR estimated at each possible cut point.
* Data in parentheses are those used to calculate the sensitivity value.
† Data in parentheses are those used to calculate the FPR (or 1 − specificity) value.

Figure 1. Graph of the empirical and fitted ROC curves for the mammography study. The points on the empirical curve are marked with open circles and are estimated in Table 3. The points labeled 1 and 2 on the curve correspond to the first and second cut points, respectively, that are defined in the note to Table 1.


However, once a test has been shown to distinguish patients with disease from those without disease well, the performance of the test for particular applications (eg, diagnosis, screening) must be evaluated. At this stage, we may be interested in only a small portion of the ROC curve. Furthermore, the ROC curve area may be misleading when one is comparing the accuracies of two tests. Figure 4 illustrates the ROC curves of two tests with equal area. At the clinically important FPR range (for example, 0.0–0.2), however, the curves are different: ROC curve A demonstrates higher sensitivity than does ROC curve B. Whenever the ROC curves of two tests cross (regardless of whether or not their areas are equal), it means that the test with superior accuracy (ie, higher sensitivity) depends on the FPR range; a global measure of accuracy, such as the ROC curve area, is not helpful here.

Important point: There are situations where we need a more refined measure of diagnostic test accuracy than the area under the ROC curve.

One alternative is to use the ROC curve to estimate sensitivity at a fixed FPR (or, as appropriate, we could use the FPR at a fixed sensitivity). As an example, in Figure 1 the sensitivity at a fixed FPR of 0.10 is 0.60. This measure of accuracy allows us to focus on the portion of the ROC curve that is of clinical relevance.

Another alternative measure of accuracy is the partial area under the ROC curve. It is defined as the area between two FPRs, e1 and e2 (or, as appropriate, the area between two false-negative rates). If e1 = 0 and e2 = 1, then the area under the entire ROC curve is specified. If e1 = e2, then the sensitivity at a fixed FPR is given. The partial area measure is thus a "compromise" between the entire ROC curve area and the sensitivity at a fixed FPR.

To interpret the partial area, we must consider its maximum possible value. The maximum area is equal to the width of the interval—that is, e2 − e1 (13). McClish (13) and Jiang et al (14) recommend standardizing the partial area by dividing it by its maximum value. Jiang et al (14) refer to this standardized partial area as the partial area index. The partial area index is interpreted as the average sensitivity for the range of FPRs examined (or the average FPR for the range of sensitivities examined). As an example, in Figure 1, the partial area in the FPR range of 0.00–0.20 is 0.112; the partial area index is 0.56. In other words, when the FPR is between 0.00 and 0.20, the average sensitivity is 0.56.
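These figures can be approximately reproduced by numerically integrating the fitted binormal curve. In the sketch below, the binormal parameters a and b are derived from the values stated in the caption of Figure 2; the grid resolution is an arbitrary implementation choice.

```python
# Sketch: partial area under the fitted binormal ROC curve and the
# partial area index of Jiang et al (14).
import numpy as np
from scipy.stats import norm

a, b = 1.59 / np.sqrt(1.54), 1.0 / np.sqrt(1.54)   # from Figure 2
e1, e2 = 0.0, 0.20                                  # FPR range of clinical interest

fpr = np.linspace(e1 + 1e-6, e2, 2000)              # avoid Phi^-1(0) = -infinity
sens = norm.cdf(a + b * norm.ppf(fpr))              # binormal ROC: Se = Phi(a + b*Phi^-1(FPR))
partial_area = np.trapz(sens, fpr)
print(f"partial area = {partial_area:.3f}")              # about 0.11
print(f"partial area index = {partial_area/(e2-e1):.2f}")  # about 0.56
```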

EXAMPLES OF ROC CURVES IN RADIOLOGY

There are many examples of the application of ROC curves in radiologic research. I present two examples here. The first example illustrates the comparison of two diagnostic tests and the identification of a useful cut point. The second example describes a multireader study of the differences in diagnostic accuracy of two tests and differences in reader performance.

The first example is the study of Mushlin et al (15) of the accuracy of magnetic resonance (MR) imaging for detecting multiple sclerosis (MS). Three hundred three patients suspected of having MS underwent MR imaging and CT of the head. The images were read separately by two neuroradiologists without knowledge of the clinical course of or final diagnosis given to the patients. The images were scored as definitely showing MS, probably showing MS, possibly showing MS, probably not showing MS, or definitely not showing MS. The reference standard consisted of results of a review of the clinical findings by a panel of MS experts, results of follow-up for at least 6 months, and results of other diagnostic tests; the results of CT and MR imaging were not included to avoid bias.

The estimated ROC curve area for MR imaging was 0.82, indicating a good, but not definitive, test. In contrast, the estimated ROC curve area of CT was only 0.52; this estimated area was not significantly different from 0.50, indicating that CT results were no more accurate than guessing for diagnosing MS.

Figure 2. Graph shows the binormal distribution that best fits the mammography study data. By convention, the distribution of unobserved variables for the patients without cancer is centered at zero (ie, μ1 = 0) with variance (σ1²) equal to 1. For these data, the center of the distribution of the unobserved variables for the patients with cancer is estimated to be 1.59 (ie, μ2 = 1.59) with variance (σ2²) estimated to be 1.54. The binormal distribution can be described by its two parameters (4), a and b, as a = (μ2 − μ1)/σ2 and b = σ1/σ2. The four cut points z1, z2, z3, and z4 define the five categories of test results. That is, a variable with a value below the point defined by z1 indicates a normal result; a variable with a value between z1 and z2, a benign result; a variable with a value between z2 and z3, a probably benign result; a variable with a value between z3 and z4, a suspicious result; and a variable with a value above the point defined by z4, a malignant result. Note that the binormal variables exist only in the mind of the reader (ie, they are unobserved). When the reader applies the cut points z1, z2, z3, and z4 to the unobserved variables, we obtain the observed five categories of test results.

Figure 3. Graph shows comparison of three ROC curves. A perfect test has an area under the ROC curve of 1.0. The chance diagonal has an ROC area of 0.5. Tests with some discriminating ability have ROC areas between these two extremes.


The authors concluded that a "definite MS" reading at MR imaging essentially establishes the diagnosis of MS (MR images in only two of 140 patients without MS were scored as definitely showing MS, for an FPR of 1%). However, a normal MR imaging result does not conclusively exclude the diagnosis of MS (MR images in 35 of 163 patients with MS were scored as definitely not showing MS, for a false-negative rate of 21%).

In the second example, Iinuma et al (16) compared the accuracy of conventional radiography and digital radiography for the diagnosis of gastric cancers. One hundred twelve patients suspected of having gastric cancer underwent conventional radiography, and 113 different patients with similar symptoms and characteristics underwent digital radiography. Six readers interpreted the images from all 225 patients; the readers were blinded to the clinical details of the patients. The images were scored with a six-category scale, in which a score of 1 indicated that cancer was definitely absent; a score of 2, cancer was probably absent; a score of 3, cancer was possibly absent; a score of 4, cancer was possibly present; a score of 5, cancer was probably present; and a score of 6, cancer was definitely present. The diagnostic standard consisted of the findings of a consensus panel of three radiologists (not the same individuals as the six readers) who examined the patients and were told of the findings of other tests, such as endoscopy and histopathologic examination after biopsy.

The ROC curve areas of the six readers were all higher with digital radiography than with conventional radiography; the average ROC curve areas with digital and conventional radiography were 0.93 and 0.80, respectively. By plotting the fitted ROC curves of each of the six readers, the authors determined that for five of the six readers, digital radiography resulted in higher sensitivity for all FPRs; for the sixth reader, digital radiography resulted in considerably higher sensitivity only at a low FPR.

In summary, the ROC curve has many advantages as a measure of the accuracy of a diagnostic test: (a) It includes all possible cut points, (b) it shows the relationship between the sensitivity of a test and its specificity, (c) it is not affected by the prevalence of disease, and (d) from it we can compute several useful summary measures of test accuracy (eg, ROC curve area, partial area). The ROC curve alone cannot provide us with the optimal cut point for a particular clinical application; however, given information about the pretest probability of disease and the relative costs of diagnostic test errors, we can find the optimal cut point on the ROC curve. There are many study design issues (eg, patient and reader selection, verification and diagnostic standard bias) that need to be considered when one is conducting and interpreting the results of a study of diagnostic test accuracy. Many of these issues will be covered in a future article.

APPENDIX

The area under the empirical ROC curve can be estimated as follows: First, consider every possible pairing of patients with disease and patients without disease. Give each pair a score of 1.0 if the test result for the patient with disease is higher (ie, more suspicious for disease), a score of 0.5 if the test results are the same, and a score of 0.0 if the test result for the patient with disease is lower (ie, less suspicious for disease). Second, take the sum of these scores. If there are N nondiseased patients and M diseased patients in the sample, then there are M × N scores. Finally, divide the sum of these scores by (M × N). This gives the estimate of the area under the empirical ROC curve.

Figure A1 depicts a fictitious data set. The process described and illustrated in the figure can be written mathematically as follows (8): Let Xj denote the test score of the jth patient with disease and Yk denote the test score of the kth patient without disease. Then,

A = [1/(M × N)] Σ(j=1 to M) Σ(k=1 to N) score(Xj, Yk),

where A is the estimate of the area under the empirical ROC curve and score(Xj, Yk) is the score assigned to the pair composed of the jth patient with disease and the kth patient without disease. The score equals 1 if Xj is greater than Yk, equals 1/2 if Xj is equal to Yk, and equals 0 if Xj is less than Yk. The symbol in the following formula

Σ(k=1 to N) ck

is called a summation sign. It means that we take the sum of all of the ck values, where k is from 1 to N. So, if N is equal to 12, then

Σ(k=1 to N) ck = c1 + c2 + c3 + · · · + c12.
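Translated into code, the pairwise scoring procedure might look like the sketch below; applying it to the Table 1 counts, expanded to one ordinal score per patient, reproduces the empirical area of about 0.82 quoted earlier.

```python
# Sketch: the pairwise-scoring estimate of the area under the empirical
# ROC curve described in the Appendix (Bamber (8)). Test scores are
# ordinal; higher means more suspicious for disease.
def empirical_auc(diseased, nondiseased):
    """Average pair score: 1 if X > Y, 0.5 if X == Y, 0 if X < Y."""
    total = 0.0
    for x in diseased:
        for y in nondiseased:
            total += 1.0 if x > y else (0.5 if x == y else 0.0)
    return total / (len(diseased) * len(nondiseased))

# Table 1 data, expanded to one ordinal score (1-5) per patient:
diseased    = [1] * 2 + [2] * 0 + [3] * 10 + [4] * 18 + [5] * 20
nondiseased = [1] * 15 + [2] * 3 + [3] * 18 + [4] * 13 + [5] * 1
print(f"empirical AUC = {empirical_auc(diseased, nondiseased):.2f}")  # about 0.82
```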

Figure 4. Graph shows two crossing ROC curves. The ROC areas of the two tests are the same at 0.80; however, for the clinically important range (ie, an FPR of less than 0.20), test A is preferable to test B.

Figure A1. Fictitious data set and example of how to calculate the area under the empirical ROC curve.


References
1. Lusted LB. Signal detectability and medical decision-making. Science 1971; 171:1217–1219.
2. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39:561–577.
3. Campbell G. General methodology I: advances in statistical methodology for the evaluation of diagnostic and laboratory tests. Stat Med 1994; 13:499–508.
4. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal detection theory: a direct solution. Psychometrika 1968; 33:117–124.
5. Dwyer AJ. In pursuit of a piece of the ROC. Radiology 1997; 202:621–625.
6. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989; 29:307–335.
7. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283–298.
8. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975; 12:387–415.
9. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36.
10. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44:837–844.
11. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720–733.
12. Metz CE. Some practical issues of experimental design and data analysis in radiologic ROC studies. Invest Radiol 1989; 24:234–245.
13. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989; 9:190–195.
14. Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 1996; 201:745–750.
15. Mushlin AI, Detsky AS, Phelps CE, et al. The accuracy of magnetic resonance imaging in patients with suspected multiple sclerosis. JAMA 1993; 269:3146–3151.
16. Iinuma G, Ushio K, Ishikawa T, Nawano S, Sekiguchi R, Satake M. Diagnosis of gastric cancers: comparison of conventional radiography and digital radiography with a 4 million-pixel charge-coupled device. Radiology 2000; 214:497–502.


Ilana F. Gareen, PhD
Constantine Gatsonis, PhD

Index term: Statistical analysis

Published online: 10.1148/radiol.2292030324

Radiology 2003; 229:305–310

Abbreviations: FTE = full-time equivalent; ICU = intensive care unit; OR = odds ratio

1 From the Center for Statistical Sciences, Brown University, Box G-H, 167 Angell St, Providence, RI 02912. Received February 19, 2003; revision requested March 25; revision received May 9; accepted May 20. Supported in part by National Cancer Institute grant CA79778. Address correspondence to I.F.G. (e-mail: [email protected]).
© RSNA, 2003

Primer on Multiple Regression Models for Diagnostic Imaging Research1

This article provides an introduction to multiple regression analysis and its application in diagnostic imaging research. We begin by examining why multiple regression models are needed in the evaluation of diagnostic imaging technologies. We then examine the broad categories of available models, notably multiple linear regression models for continuous outcomes and logistic regression models for binary outcomes. The purpose of this article is to elucidate the scientific logic, meaning, and interpretation of multiple regression models by using examples from the diagnostic imaging literature.
© RSNA, 2003

Readers of the diagnostic imaging literature increasingly encounter articles with the results of multiple regression analyses. Typically, these analyses are performed to examine the relation of an outcome and several explanatory factors for the purpose of quantifying the effect of the explanatory factors on the outcome and/or predicting the outcome. For example, multiple regression modeling is used to study which combinations of imaging features are important predictors of the presence of disease. In this article and in the statistics literature, the explanatory variables are also referred to as covariates or independent variables, and the outcome variable is also referred to as the response or dependent variable. If the outcome is represented by a continuous variable, such as cost of care, then linear regression is often used. If the outcome is a dichotomous variable, such as presence or absence of disease, then logistic regression is commonly used. These modeling techniques provide an important tool in medical research. They enhance our ability to disentangle the nature of the relation between multiple factors that affect a single outcome. In this article, we examine why investigators choose to use multiple regression methods and how analyses with these methods should be interpreted. We use examples from the radiology literature as illustrations and focus on the meaning and interpretation of these models rather than on the methods and software used for building them.

WHAT IS REGRESSION ANALYSIS?

Regression analysis provides a quantitative approach to the assessment of the relation between two or more variables, one of which is considered the dependent or response variable, and the others are considered the independent or explanatory variables (also called "covariates"). The purpose of the analysis may be to estimate the effect of a covariate or to predict the value of the response on the basis of the values of the covariates. In both cases, a regression model is developed to predict the value of the response. The way in which a model is built depends on the specific research question and the nature of the data. Each regression model incorporates assumptions regarding the nature of the data. If these assumptions are incorrect, the model may be invalid, and the interpretation of the data that is based on that model may be incorrect. The process of fitting such a model involves the specification of a shape or curve for the expected value of the response and the examination of how closely the data fit this specified shape. For example, the simple linear regression model assumes a straight-line relation between the single independent variable and the expected value of the dependent variable. The slope and intercept of this straight line are estimated from the data. The fitted model can then be used to estimate the effect of the independent variable and can also be used to predict future values of the dependent variable, which correspond to specific values of the independent variable.


WHY ARE MULTIPLE REGRESSION MODELS USED IN DIAGNOSTIC IMAGING?

Multiple Factors of Interest

Multiple regression analyses may be used by investigators to examine the impact of multiple factors (independent variables) on a single outcome of interest (dependent variable). Sunshine and Burkhardt (1), for example, assembled survey data from 87 radiology groups on practice patterns. The authors used these data to examine the relation between the number of procedures each radiologist performed per year (dependent variable) and several independent variables, including academic status of the group, annual hours worked by each full-time radiologist, group size, and the proportion of "high-productivity" procedures (Table 1). High-productivity procedures are defined by the authors as those that require more mental effort, stress, physical effort, and training than do other types of procedures. Such procedures include CT, MR, and interventional or angiographic procedures. On the basis of the results in Table 1, it appears that workload, as measured by the dependent variable, is significantly lighter in academic groups than in non-academic groups and decreases marginally with increasing group size. In this model, it appears that workload is not associated with the other two factors, namely annual hours worked and the proportion of high-productivity procedures. The authors also report that this regression model explains only a modest amount of the variability in the dependent variable and "did not yield very accurate results." We return to the interpretation of Table 1 later in this article.

Adjustment for Potential Confounding

Multiple regression techniques may also be used to adjust analyses for potential confounding factors so that the influence of these extraneous factors is quantified and removed from the evaluation of the association of interest. Extraneous or confounding factors are those that are associated with both the exposure and the outcome of interest but are not consequences of the exposure. Consider, for example, a study to compare two diagnostic procedures (independent variable) on the basis of their impact on a patient outcome (death within a fixed follow-up period in this example). A factor such as patient age may play a role in decisions about which procedure will be performed but may also be related to the outcome. In this case, age would be a potential confounder. Clearly, confounding is a major consideration in observational studies as contrasted with randomized clinical trials, because in the former, participants are not randomly assigned to one imaging modality or another.

Goodman et al (2) compared results in patients evaluated for suspected pulmonary embolism with helical CT with those evaluated with ventilation-perfusion scintigraphy to determine whether there is a difference in the number of pulmonary embolisms and deaths in the 90 days following the diagnostic imaging evaluation. The population of patients evaluated with CT was more likely to have been referred from the intensive care unit (ICU) (and, hence, more likely to have severe co-morbid disease conditions), was older, and was at increased risk of pulmonary embolism due to patient immobilization and patient malignancy than were those evaluated with ventilation-perfusion scintigraphy. To adjust for these potential confounders, the authors included them as independent variables in a logistic regression analysis and evaluated the association between imaging methods and death within 90 days. As a result of this adjustment, the magnitude of the estimated effect of imaging on mortality, as measured by means of the OR, changed from 3.42 to 2.54. We will return to this point later in the discussion of logistic regression.

Prediction

Regression models are also used to predict the value of a response variable using the explanatory variables. For example, to develop optimal imaging strategies for patients after trauma, Blackmore et al used clinical data from trauma patients seen in the emergency room to predict the risk of cervical spine fracture (3). The authors evaluated 20 potential clinical predictors of cervical spine injury. Their final prediction model includes four of these factors: the presence of focal neurologic deficit, presence of severe head injury, cause of injury, and patient age. These independent variables predict cervical spine fracture with a high degree of accuracy (area under the receiver operating characteristic curve = 0.87) (3).

GENERAL FORM OF REGRESSION MODELS

Regression models with multiple independent variables have been constructed for a variety of types of response variables, including continuous and discrete variables. A large class of such models, and the models used most commonly in the medical literature, are the so-called generalized linear models (4). In these models, a linear relation is postulated to exist between the independent variables and the expected value of the dependent variable (or some transformed value of that expected value, such as the logarithm). The observed value of the dependent variable (response) is then the sum of its expected value and an error term. Multiple regression models for continuous, binary, and other discrete dependent variables are discussed in the following sections.

MODELING OF CONTINUOUS OUTCOME DATA

First, let us consider the situation in which a continuous dependent variable and a single independent variable are available.

TABLE 1
Results of Multiple Linear Regression Analysis to Examine the Number of Annual Procedures per FTE Radiologist in Diagnostic Radiology Groups

Variable                                           Coefficient (β)   Standard Error*   P Value

Intercept (β0)                                         10,403             2,154          .001
Academic status (X1)                                   −2,238             1,123          .05
Annual hours per FTE (X2)                                0.43              1.11          .70
Group size (FTE) (X3)                                   −59.7              32.5          .07
Proportion of high-productivity procedures (X4)†       −4,782            11,975          .69

Note.—Adapted and reprinted, with permission, from reference 1.
* Standard error of the estimated coefficient.
† High-productivity procedures included computed tomography (CT) and magnetic resonance (MR) imaging, and interventional or angiographic procedures that required more mental effort, stress, physical effort, and training than did other types of procedures.


Such would be the case, for example, if in the practice pattern data discussed earlier, only the number of annual procedures per radiologist as the dependent variable and the radiology group size as the independent variable were considered. An earlier article in this series (5) introduced the concept of a simple linear regression model in which a linear relation is assumed between the mean of the response and the independent variable. This model is represented as Yi = β0 + β1X1i + ei, where Yi is the term representing the value of the dependent variable for the ith case, X1i represents the value of the independent random variable, β0 represents the intercept, β1 represents the slope of the linear relation between the mean of the dependent and the independent variables, and ei denotes the random error term, which has a mean of zero. The expected value of the dependent variable is then equal to β0 + β1X1i, and the error term is what is left unexplained by the model.

When several independent variables are considered, such as in the analysis of the practice pattern data, multiple regression models are used. For example, assume that in addition to X1, independent variables X2, . . . , Xp are to be included in the analysis. A linear multiple regression model would then be written as

Yi = β0 + β1X1i + β2X2i + · · · + βpXpi + ei.

The parameters β0, β1, . . . , βp from this equation are referred to as the regression coefficients. To interpret the coefficients, again consider first the simple linear regression model. In this model, the parameter β0 represents the intercept, that is, the expected value of the dependent variable when the value of the independent variable is set to zero. The parameter β1 represents the slope of the regression line and measures the average change in the dependent variable Y that corresponds to an increase of one unit in the independent variable X1.

In multiple regression, the relation between the dependent variable Y and the independent variables X1, . . . , Xp is somewhat more complex. The intercept β0 represents the mean value of the response when all of the independent variables are set to zero (that is, X1 = 0, X2 = 0, . . . , Xp = 0).

The slopes of the independent variables in the multiple linear regression model are interpreted in the following way. The slope βj of the jth independent variable measures the change in the dependent variable that corresponds to an increase of one unit in Xj, if all other independent variables are held fixed (that is, the values of the other covariates do not change). For example, the results of a multiple linear regression analysis of the practice pattern data reported by Sunshine and Burkhardt (1) are shown in Table 1. From this survey of 87 radiology groups, the dependent variable Y is the number of procedures performed annually per full-time equivalent (FTE) radiologist. X1 is a dichotomous variable and an indicator of academic status (coded 1 if the group is academic or 0 if the group is non-academic). The remaining variables can be treated as approximately continuous. X2 is the number of annual hours worked by each FTE radiologist, X3 is the practice group size, and X4 is the percentage of procedures that are high productivity. Suppressing the notation for cases, this model is written as

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4.

The terms can be substituted into this model such that it is represented as the following equation: Y = β0 + β1(academic status) + β2(annual hours per FTE) + β3(group size) + β4(percentage high-productivity procedures).

The estimated coefficient of X1, the indicator of academic status in the regression model, is −2,238 (Table 1). Because X1 takes the values of 1 (academic group) or 0 (non-academic group), this coefficient estimate implies that, if all other independent variables remain fixed, academic groups would be expected to perform 2,238 fewer procedures per FTE radiologist annually than non-academic groups (the number decreases because the coefficient is a negative number).

The interpretation of coefficients for continuous independent variables is similar. For example, the model estimates that, if all other independent variables were fixed, an increase of one unit in group size would correspond to an average decrease of 59.7 in the number of procedures performed annually by each FTE radiologist in a group practice. Thus, if all other independent variables remained fixed, and practice size increased by five, the number of procedures per FTE radiologist would be expected to decrease by 5 × 59.7 = 298.5, and so on. One caveat in the interpretation of coefficients is that it is not always possible to give them a direct physical interpretation. In this example, the intercept term in the model does not have a direct interpretation because it corresponds to a setting of all the independent variables to zero, which would be impossible to do. It may also be argued that it is not possible to fix some of the independent variables, such as annual hours per FTE radiologist, while allowing others, such as practice size, to vary.

In Table 1, the standard error for each coefficient provides a measure of the degree of statistical uncertainty about the estimate. The fitting of models to data with a lot of scatter and small sample sizes can lead to large standard errors for the estimated coefficients. The standard error can be used to construct a confidence interval (CI) for the coefficient. The P values in Table 1 correspond to tests of the null hypothesis that a particular coefficient is equal to zero (that is, the hypothesis of "no association" between the particular independent variable and the dependent variable).
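To make the mechanics concrete, the sketch below fits a model of the same form with the statsmodels library. The data are synthetic, generated only so the example runs; they are not the survey data of Sunshine and Burkhardt (1), and the chosen effect sizes are arbitrary.

```python
# Sketch: a multiple linear regression analogous to Table 1, on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 87                                         # same number of groups as the survey
academic  = rng.integers(0, 2, n)              # X1: 1 = academic group
hours     = rng.normal(1800, 200, n)           # X2: annual hours per FTE
size      = rng.uniform(2, 40, n)              # X3: group size (FTE)
high_prod = rng.uniform(0.1, 0.5, n)           # X4: proportion high-productivity

# Outcome: annual procedures per FTE, with made-up effects plus noise.
y = 10_000 - 2_000 * academic - 60 * size + rng.normal(0, 2_500, n)

X = sm.add_constant(np.column_stack([academic, hours, size, high_prod]))
fit = sm.OLS(y, X).fit()
print(fit.summary())   # coefficients, standard errors, and P values, as in Table 1
```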

MODELING OF DICHOTOMOUS OUTCOMES

Logistic regression is commonly used to analyze dichotomous outcomes (dependent variable). The independent variables in these models may be continuous, categoric, or a combination of the two. For simplicity, let us assume that the dichotomous dependent variable is coded as 0 or 1. For example, a dichotomous outcome of interest is whether each patient is dead or alive at the end of the study observation period: Y = 1 if a patient died during the follow-up interval or Y = 0 if a patient was alive at the end of the follow-up interval.

In the logistic model, the expected value of the response Y is equal to the probability that Y = 1, that is, the probability that an event (such as death) occurs. The form of the model, however, is more complex than that in the linear model for continuous responses. In particular, the logit of the expected value, rather than the expected value of Y, is assumed to be a linear function of the covariates. If p denotes the probability that an event will occur, the logit of p is defined as the logarithm of the odds, that is, logit p = log[p/(1 − p)].

Formally, the logistic model with multiple independent variables is written as

logit P(Y = 1) = β0 + β1X1 + · · · + βpXp,

or, equivalently, as

P(Y = 1) = exp(β0 + β1X1 + · · · + βpXp) / [1 + exp(β0 + β1X1 + · · · + βpXp)].


In the logistic model, βj measures the change in log-odds for Y = 1 that corresponds to an increase of one unit in Xj, if all of the other independent variables remain fixed. In contrast to the linear model for continuous responses, the corresponding change in actual odds is multiplicative. Hence, exp(βj) measures the odds ratio (OR) that corresponds to an increase of one unit in Xj. The OR is a frequently used measure of association in the epidemiology literature and is a common way of expressing the logistic regression results (6). The OR measures the odds of an outcome in the index group compared with the odds of the same outcome in the comparison group.

For example, in the study of Goodman et al (2), Y indicates whether the patient is dead (Y = 1) or alive (Y = 0) 3 months after admission for pulmonary embolism work-up. The primary independent variable of interest, diagnostic method, would be represented by X1. Potential confounders (covariates) would be represented by X2, . . . , Xp. In Table 2, X2 is an indicator that the patient was referred from an ICU, X3 is an indicator that the patient was older than 67 years, X4 is an indicator of immobilization, and X5 is an indicator of malignancy.

Substitution of these covariates into this model would result in the following representation: logit P(Y = 1) = β0 + β1(underwent CT: yes/no) + β2(ICU referral: yes/no) + β3(age older than 67 years: yes/no) + β4(immobilization: yes/no) + β5(malignancy: yes/no).

For X1, the indicator of whether or not helical CT was performed, the estimate of the coefficient was 0.93. Therefore, the estimated odds of death among patients who were evaluated with helical CT compared with that among those who were evaluated with lung scintigraphy was exp(0.93) = 2.54. That is, if all other independent variables were fixed, the odds of death within 90 days for patients who underwent CT to diagnose pulmonary embolism were 2.54 times as high as the odds for patients who underwent lung scintigraphy to diagnose pulmonary embolism. Note that this value represents an OR that was "adjusted" for the presence of potential confounders. The "unadjusted" estimate (computed from data presented in reference 2) was 3.42. Because we cannot know the counterfactual occurrence (the number of patients evaluated with CT who would have died had they been evaluated with ventilation-perfusion scintigraphy), we cannot say whether the adjustment was successful and the OR is unbiased. That there is some difference between the unadjusted OR (3.42) and the adjusted OR (2.54) provides an indication that the potential confounders controlled for in the analysis may have been confounding the association between imaging modality and death within 90 days. However, a strong association remains between imaging method and risk of death. A CI for the OR can be obtained (6). In the example, the 95% CI for the OR is (1.36, 4.80).
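To see where such an interval comes from, note that a 95% CI for the coefficient itself is β ± 1.96 × SE; exponentiating its endpoints gives the CI for the OR. With the rounded values for CT in Table 2, exp(0.93 − 1.96 × 0.32) ≈ 1.36 and exp(0.93 + 1.96 × 0.32) ≈ 4.75, which agrees with the reported interval of (1.36, 4.80) up to rounding of the published coefficient and standard error.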

The authors (2) report that "the patients in the CT imaging group had more than twice the odds of dying within 90 days as those in the [ventilation-perfusion] scintigraphy group." They also noted that "the prevalence of clinically apparent pulmonary embolism after a negative helical CT scan was low (1.0%) and minimally different from that after a normal ventilation-perfusion scan (0%)" (2). Part of this association may be due to residual confounding in the analysis. In particular, it is likely that there was confounding by indication in this sample. That is, patients with a higher likelihood of dying from pulmonary embolism were referred selectively for CT. Other more sophisticated statistical techniques may be needed to adjust for this type of confounding (7).

In multiple logistic regression models, the intercept β0 measures the baseline log-odds for Y = 1, that is, the log-odds for Y = 1 for cases in which all independent variables have a value of zero. In the pulmonary embolism example, this would correspond to the subset of patients with all independent variables set to "no," that is, patients who did not undergo CT, were not referred from the ICU, were 67 years old or younger, were not immobilized, and did not have a malignancy. Note that if all covariates are centered by means of subtraction of the average population value, then β0 measures the log-odds for Y = 1 for an "average" case.
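A logistic model of this form can be fitted in a few lines. In the sketch below the five indicator covariates and the outcome are simulated; the intercept of −3.0 and the data are invented, with the true coefficients set to the Table 2 values so the recovered ORs land near the published ones.

```python
# Sketch: a logistic regression of the same form as Table 2, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
X = rng.integers(0, 2, size=(n, 5))    # CT, ICU, age>67, immobilization, malignancy
beta = np.array([0.93, 1.78, 0.75, 1.26, 0.87])   # Table 2 coefficients
logit = -3.0 + X @ beta                # hypothetical baseline log-odds of -3.0
p = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p)                 # simulated death-within-90-days outcome

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
odds_ratios = np.exp(fit.params[1:])   # exp(beta_j) = adjusted OR per covariate
print(np.round(odds_ratios, 2))
```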

POLYNOMIAL TERMS

The models discussed earlier assumed a linear relation between the independent variables and the expected value of the dependent variable. If the relation is thought to follow a non-linear form, alternative models can be considered that involve transformations of the dependent and/or independent variables. Herein, we discuss transformations of the independent variables. In a simple model with a continuous dependent variable and a continuous independent variable, if the slope of the relation appears to change with the value of the independent variable X, then a polynomial in X may be used instead of a straight line. With the example from Sunshine and Burkhardt (Table 1), if the association between the average number of procedures per radiologist and group size was not linear but seemed to be parabolic in nature, with extremes in each tail, then inclusion of a term X3² might more fully describe the observed data. The addition of higher-order terms (eg, cubic or quartic terms) may also enhance model fit (8). In addition to polynomial functions, models with other non-linear functions of the independent variables are available (8).
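As a sketch of the idea, the quadratic term can simply be added as another column of the design matrix. The data here are synthetic, and the parabolic relation is assumed for illustration.

```python
# Sketch: adding a quadratic term for group size to a linear model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
size = rng.uniform(2, 40, 87)                              # group size, synthetic
y = 9_000 + 40 * size - 1.5 * size**2 + rng.normal(0, 500, 87)

X = sm.add_constant(np.column_stack([size, size**2]))      # linear and quadratic terms
print(sm.OLS(y, X).fit().params)   # recovers roughly (9000, 40, -1.5)
```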

MODEL INTERPRETATION: INTERACTIONS

In both linear and logistic regression, the association between the dependent variable and one of the independent variables may vary across values of another independent variable. To graphically depict this concept, the example from the Sunshine and Burkhardt article (1) is used. The relation between the number of procedures per FTE radiologist and group size for academic and nonacademic groups with no interaction terms is shown in the Figure, part a.

TABLE 2
Results of Multiple Logistic Regression Analysis to Examine Death within 90 Days of Evaluation for Pulmonary Embolism

Variable                                 Regression Coefficient (β)    SD     Odds Ratio   95% CI       P Value

Underwent CT (X1)                                  0.93               0.32       2.54      1.36, 4.80    .004
Referral from intensive care unit (X2)             1.78               0.32       5.93      3.09, 11.0    .001
Age older than 67 years (X3)                       0.75               0.34       2.12      1.12, 4.14    .024
Immobilization (X4)                                1.26               0.39       3.52      1.59, 7.58    .002
Malignancy (X5)                                    0.87               0.34       2.39      1.21, 4.63    .012

Note.—Adapted and reprinted, with permission, from reference 2.


Note that the two lines, which correspond to academic and nonacademic group status, are parallel. Now, suppose that the authors want to examine whether, in fact, the two lines are parallel or not. In other words, they want to know whether the relation between the number of procedures and group size depends on academic group status. To examine this question, the authors would consider a model with an interaction term between academic group status and group size. The Figure, part b, shows how statistical interaction with another variable (academic status) might influence the relation between the number of procedures and group size.

Another example is drawn from the article by Mercado et al (9). In this article, the authors examine whether placement of a stent influences the association between post-procedural minimal lumen diameter and restenosis (Table 3). Questions of this type can be addressed by including appropriate "interaction" terms in the regression model. In the restenosis data, a model with an interaction between post-procedural lumen diameter and stent use can be written as follows:

logit P(Y = 1) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X1 × X3,

where Y = 1 if restenosis occurs and 0 otherwise, X1 = stent use, X2 = lesion length, X3 = post-procedural maximum lumen diameter (PMLD), X4 = previous coronary artery bypass graft (CABG), and X5 = diabetes mellitus. X1 × X3 is a (multiplicative) interaction term between X1 and X3, in which × indicates that X1 is multiplied by X3. With the addition of the interaction term, the model would be represented as follows: logit P(Y = 1) = β0 + β1(stent use) + β2(lesion length) + β3(PMLD) + β4(previous CABG) + β5(diabetes mellitus) + β6(stent use × PMLD).

The presence of a significant interaction suggests that the effect of X1 depends on the actual level of X3 and conversely. For example, the OR for the maximal diameter size would be exp(β3 + β6X1). Thus, for patients who did not receive a stent, an increase of one unit in the maximal diameter would multiply the odds of restenosis by exp(β3). However, for patients who received a stent, the odds of restenosis would be multiplied by exp(β3 + β6). Hence, in the presence of interactions, the main effects cannot be interpreted by themselves (4,6,7).
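Reading the odds ratios from Table 3 makes this concrete: exp(β3) = 0.53 and exp(β6) = 0.34, so for patients without a stent each one-unit increase in PMLD multiplies the odds of restenosis by 0.53, whereas for stented patients the multiplier is exp(β3 + β6) = 0.53 × 0.34 ≈ 0.18.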

OTHER FORMS OF MULTIPLE REGRESSION MODELING

Dependent variables that are neither continuous nor dichotomous may also be analyzed by means of specialized multiple regression techniques. Most commonly seen in the radiology literature are ordinal categoric outcomes. For example, in receiver operating characteristic studies, the radiologist's degree of suspicion about the presence of an abnormality is often elicited on the five-point ordinal categoric scale, in which 1 = definitely no abnormality present, 2 = probably no abnormality present, 3 = equivocal, 4 = probably abnormality present, and 5 = definitely abnormality present. Ordinal regression models are available for the study of ordinal categoric outcomes. Such models can be used to fit receiver operating characteristic curves and to estimate the effect of covariates such as patient, physician, or other factors. Examples and further discussion of ordinal regression are available in articles by Tosteson and Begg (10) and Toledano and Gatsonis (11).

TABLE 3
Results of Multiple Regression Analysis to Examine Coronary Restenosis

Variable*                          Odds Ratio   95% CI       P Value

Intercept coefficient (β0) = 0.12    . . .       . . .         . . .
Stent use (X1)                       0.83       0.72, 0.97     .0193
Lesion length (X2)                   1.05       1.04, 1.06    <.001
PMLD (X3)                            0.53       0.46, 0.61    <.001
Previous CABG (X4)                   0.69       0.53, 0.90     .006
Diabetes mellitus (X5)               1.33       1.16, 1.54    <.001
Stent use × PMLD (X6)*               0.34       0.31, 0.39     .002

Source.—Reference 9.
Note.—Table presents results from a multiple logistic regression analysis to examine coronary restenosis as a function of medical treatment and other selected patient characteristics. CABG = coronary artery bypass graft, PMLD = post-procedural maximum lumen diameter.
* In this term, the × indicates that this is an interaction term in which stent use is multiplied by PMLD.

Figure. Examples of fitted regression lines of the relation between the number of procedures per FTE radiologist and group size for academic and nonacademic groups, based on the analyses presented by Sunshine and Burkhardt (1), show (a) no statistical interaction and (b) statistical interaction.



RECENTERING AND RESCALING OF VARIABLES

As noted earlier, in some cases an independent variable cannot possibly take the value of 0, thus making it difficult to interpret the intercept of a regression model. For example, gestational age and age at menarche cannot be meaningfully set to zero. This difficulty can be addressed by subtracting some value from the independent variable before it is used in the model. In practice, the average value of the independent variable is often used, and the "centered" form of the variable now represents the deviation from that average. When independent variables are centered at their averages, the intercept represents the expected response for an "average" case, that is, a case in which all independent variables have been set to their average values.

The rescaling of variables may also enhance the interpretability of the model. Often the units in which data are presented are not those of clinical interest. By rescaling variables, each unit of increase may represent either a more clinically understandable or a more meaningful difference. For example, if gestational age is measured in days, then it may be rescaled by dividing the value for each observation by seven, which yields gestational age measured in weeks. In this case, β1, the regression coefficient, would then represent the difference in risk per unit increase in gestational age in weeks rather than in days.
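In code, centering and rescaling are one-line transformations applied before model fitting; the gestational-age values below are simulated for illustration.

```python
# Sketch: centering and rescaling a covariate before model fitting.
import numpy as np

gest_age_days = np.array([259, 280, 266, 294, 273])   # simulated values
centered = gest_age_days - gest_age_days.mean()       # 0 now means "average" case
gest_age_weeks = gest_age_days / 7.0                  # coefficient becomes per-week
print(centered, gest_age_weeks)
```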

MODEL SELECTION

A detailed discussion of model selection is beyond the scope of this article. We note, however, that selection of the independent variables to include in a model is based on both subject matter and formal statistical considerations. Generally, certain independent variables will be included in the model even if they are not significantly associated with the response because they are known a priori to be related to both the exposure and the outcome of interest or to be potential confounders of the association of interest. Additional independent variables of interest are then evaluated for their contribution to an explanation of the observed variation. Models are sometimes built in a forward "stepwise" fashion in which new independent variables are added in a systematic manner, with additional terms being entered only if their contribution to the model is above a certain threshold. Alternatively, "backward elimination" may be used, starting with all potential independent variables of interest and then sequentially deleting covariates if their contribution to the model is below a fixed threshold. The validity and utility of stepwise procedures for model selection is a matter of debate and disagreement in the statistics literature (6).

In addition to the selection of pertinent independent variables for inclusion in the model, it is essential to ensure that the form of the model is appropriate. A variety of regression diagnostics are available to help the analyst determine the adequacy of the postulated form of the model. Such diagnostics generally focus on examination of the residuals, which are defined as the difference between the observed and the predicted values of the response. The analyst then examines the residuals to detect the presence of patterns that suggest poor model fit (4,8).

CONCLUSION

Multiple regression models offer great utility to radiologists. These models assist radiologists in the examination of multifactorial etiologies, adjustment for multiple confounding factors, and development of predictions of future outcomes. These models are adaptable to continuous, dichotomous, and other types of data, and their use may enhance the radiologist's understanding of complex imaging utilization and clinical issues.

References
1. Sunshine JH, Burkhardt JH. Radiology groups' workload in relative value units and factors affecting it. Radiology 2000; 214:815–822.
2. Goodman LR, Lipchik RJ, Kuzo RS, Liu Y, McAuliffe TL, O'Brien DJ. Subsequent pulmonary embolism: risk after a negative helical CT pulmonary angiogram—prospective comparison with scintigraphy. Radiology 2000; 215:535–542.
3. Blackmore CC, Emerson S, Mann F, Koepsell T. Cervical spine imaging in patients with trauma: determination of fracture risk to optimize use. Radiology 1999; 211:759–765.
4. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other multivariable methods. 3rd ed. Boston, Mass: Duxbury, 1998.
5. Zou KH, Tuncali K, Silverman SG. Correlation and simple linear regression. Radiology 2003; 227:617–628.
6. Hosmer DW, Lemeshow S. Applied logistic regression. New York, NY: Wiley, 1989.
7. D'Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med 1998; 17:2265–2281.
8. Neter J, Kutner M, Nachtsheim C, Wasserman W. Applied linear statistical models. 4th ed. Chicago, Ill: Irwin/McGraw-Hill, 1996.
9. Mercado N, Boersma E, Wijns W, et al. Clinical and quantitative coronary angiographic predictors of coronary restenosis: a comparative analysis from the balloon-to-stent era. J Am Coll Cardiol 2001; 38:645–652.
10. Tosteson AN, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making 1988; 8:204–215.
11. Toledano A, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Stat Med 1996; 15:1807–1826.


Nancy A. Obuchowski, PhD

Index term: Statistical analysis

Published online before print: 10.1148/radiol.2293010899

Radiology 2003; 229:617–621

1 From the Departments of Biostatistics and Epidemiology and Radiology/Wb4, the Cleveland Clinic Foundation, 9500 Euclid Ave, Cleveland, OH 44195. Received May 8, 2001; revision requested June 4; revision received June 21; accepted July 9. Address correspondence to the author (e-mail: [email protected]).
© RSNA, 2003

Special Topics III: Bias1

Researchers, manuscript reviewers, and journal readers should be aware of the many potential sources of bias in radiologic studies. This article is a review of the common biases that occur in selecting patient and reader samples, choosing and applying a reference standard, performing and interpreting diagnostic examinations, and analyzing diagnostic test results. Potential implications of various biases are discussed, and practical approaches to eliminating or minimizing them are presented.
© RSNA, 2003

There are many potential sources of bias in radiologic studies. For those of us who perform studies, review manuscripts, or read the literature, it is important to be aware of these biases, and for the investigators in studies, it is important to know how to avoid or minimize bias. The term bias refers to the situation in which measurements from a study (eg, measurement of a test's sensitivity and specificity) do not correspond to the values that we would obtain if we performed the test in all patients in the relevant population. Of course, we can never perform the test in all patients in the population, so it is imperative that we do our best to design studies without bias.

Bias can occur in selecting the patients, images, and/or readers (ie, radiologists) for a study, in choosing and applying the reference-standard procedure, in performing and interpreting the tests, and in analyzing the results. Many of the biases encountered in radiologic studies have been given names, but there are many unnamed biases that we can identify and avoid by using common sense.

It is best to recognize the potential sources of bias while in the process of designing a study. Then, solutions to the bias problem, or at least ways to minimize the effect of the bias, can be implemented in the study. Note that having a large sample size may reduce the variability (ie, random error) (Table) of our estimates, but it is never a solution to bias (ie, systematic error).

BIAS IN SELECTING THE PATIENT SAMPLE

The objectives of a study determine the type of patients that we recruit. If we have a new test and want to determine if it has any diagnostic value, then we might select a group of patients with clinically evident disease and a group of volunteers who do not have the disease for comparison. Sox et al (3) refer to the patients in such studies as "the sickest of the sick" and "the wellest of the well." If the test does not yield different results for these two groups, then it probably does not have any diagnostic value. In these patients, we cannot measure, without bias, other variables such as the test's sensitivity and specificity or the differences in contrast material uptake. The reason is that our study sample is missing important types of patients—for example, patients with latent disease and control patients with confounding illnesses. Variables such as sensitivity and specificity will most likely be different for these patients.

Selection bias occurs when external factors influence the composition of the sample to the extent that the sample does not represent the population (eg, in terms of patient types, the frequency of the patient types, or both). Spectrum bias (4) is a type of selection bias; it exists when the sample is missing important subgroups. A classic example of spectrum bias is that encountered in screening mammography studies to compare the accuracy of full-field digital mammography with that of conventional mammography. A very large sample size is required to perform a comparison of these two modalities because the prevalence of breast cancer in screening populations is very low. One strategy to reduce the sample size is to consider women who have positive conventional mammography results. These women return for biopsy, and at that time, full-field digital mammography can be performed. However, there is a serious problem with this strategy: The patients with negative conventional mammography results (true- and false-negative cases) have been selected out. The consequence is that the sensitivity of conventional mammography will be greatly overestimated—making full-field digital mammography seem inferior—and the specificity of conventional mammography will be greatly underestimated—making full-field digital mammography seem superior.

Once a new diagnostic test is shown to be capable of yielding different results for "the sickest of the sick" and "the wellest of the well," it is time to challenge the test. We challenge a test by performing it in study patients for whom making a diagnosis is difficult (4). From the results of these studies, we can determine if the test will be reliable in a clinical population that includes both patients for whom it is easy and patients for whom it is difficult to make a diagnosis. However, because of spectrum bias, we still cannot measure, without bias, other variables such as the test's sensitivity and specificity.

Suppose now that we have a well-established test that we know from previous studies is reliable even for difficult-to-diagnose cases. We want to measure, for example, the test's sensitivity and specificity for a particular population of patients. Ideally, we would select our study patients by taking a random sample from the population of patients who present to their primary physician with a certain set of signs and symptoms. In fact, a random sample is the basis of the interpretation of P values calculated in statistical analyses. We then perform the well-established test in these patients and measure the test's sensitivity and specificity. These measurements will be generalizable (Table) to similar patients who present to their primary physicians with the same signs and symptoms.

Sometimes this ideal study design is not workable. Alternatively, for this well-established test, suppose we select our study patients from a population of individuals who are referred to the radiology department for the test. Such a sample is called a referred or convenience sample. These patients have been selected to undergo the test. Other patients from the population may not have been referred for the test, or they may have been referred at a different rate. It is usually impossible to determine the factors that influenced the evaluating physicians' referral patterns. Thus, the measurements taken from a referred sample are generalizable only to the referring physicians in the study, since other physicians will select different patients.

If we must use a referred sample, for example, to minimize costs, then we should at least carefully collect and record important patient characteristics—important in the sense that the measurements taken in the study might vary according to these characteristics—and the relative frequency of these characteristics. We should report the measurements obtained in patients with various characteristics (eg, report the test's sensitivity and specificity for patients with and those without symptoms). This will allow others to compare the characteristics of their patient population with the characteristics of the study sample to determine how generalizable the study results are to their radiology practice.

BIAS IN SELECTING THE READER SAMPLE

In some studies a sample of readers is needed. For example, when studying the diagnostic accuracy of tests such as chest radiography, computed tomography (CT), magnetic resonance (MR) angiography, and mammography, we must recognize that accuracy is a function of both the imaging unit and the reader who uses it (5). Since readers differ in cognitive and perceptual abilities, it is important to include multiple readers in such studies and to make sure that these readers represent the population of radiologists in whom you are interested. Too often radiologic research is performed at tertiary care hospitals by radiology subspecialists who are experts in their given specialties. Thus, the reported estimates of diagnostic test accuracy may be high, but they might not be generalizable to community hospitals where general radiologists practice.

It can be challenging to obtain a truly representative sample of readers for studies. The problem is illustrated in the mammography study performed by Beam et al (6). They identified all of the American College of Radiology–accredited mammography centers in the United States. There were 4,611 such centers in the United States at the time of the study. Then they randomly sampled 125 of the 4,611 centers and mailed letters to these centers to assess their willingness to participate. Only 50 centers (40%) agreed to take part in the study. One hundred eight radiologists from these 50 centers actually interpreted images for the study. There was a clear potential for bias because the highly motivated centers and readers may have been more likely to volunteer, and these centers and readers may not have been representative of the population. It is unclear how to overcome this type of bias.

BIAS IN CHOOSING AND APPLYING THE REFERENCE-STANDARD TEST

This section is focused on studies in which the objective is to measure the accuracy (ie, sensitivity, specificity, and receiver operating characteristic curve) or comparative accuracy of tests. For these studies, a reference-standard, or "gold standard," procedure is needed to determine the true disease status of the patients. A reference-standard procedure is a test or procedure whose results tell us, with nearly 100% accuracy, the true disease status of patients. Choosing a reference-standard procedure and applying it equitably is often the most challenging part of designing a study.

Definition of Common Terms

Random error: Variation in measurements due to inherent differences between patients (or readers) and natural fluctuations within a patient (or reader).
Systematic error: Pattern of variation in measurements attributable to an external factor.
Generalizable: Situation in which a study's results can be assumed to represent and/or predict the situation at another clinical center (1).
Operational standards: Set of definitions and/or rules used to conduct a research study (eg, definitions of presence and absence of disease).
Misclassification: Incorrect diagnosis; examples are false-positive cases (ie, disease-free patients classified as having disease) and false-negative cases (ie, patients with disease classified as being disease free).
Blinding: Process of withholding information (eg, results of the reference-standard procedure) from a technician and/or a reader for the study purpose of determining the value (eg, accuracy) of a test per se.
Randomize: Process of assigning patients (or images) to groups or positions in a list (eg, a list of images to be read in a study) by using various methods that allow each patient (or image) to have a known (usually equal) chance of being assigned to a group or position, but the group or position cannot be predicted (2).

Imperfect standard bias occurs when the reference-standard procedure yields results that are not nearly 100% accurate. An example would be a study of the accuracy of head CT for the diagnosis of multiple sclerosis. If MR imaging were used as the reference-standard test, then the measures of the accuracy of CT would be biased (ie, probably too low in value) (7) because the accuracy of MR imaging in the diagnosis of multiple sclerosis is not near 100%.

Some might argue that there is no such thing as a "gold" standard. Even pathologic analysis results are not 100% accurate because, like radiology, pathology is an interpretative discipline. For all studies it is important to have operational standards (Table) that take into account the condition being studied, the objectives of the study, and the potential effects of any bias. Some common sense is needed as well (8).

There are various solutions to imperfect standard bias. First, we can choose a better reference-standard procedure, if one exists. For the multiple sclerosis study, we could follow up the patients for several months or years to establish a clinical diagnosis and use the follow-up findings as the reference standard for comparison with the results of CT. Sometimes, however, there is no reference-standard procedure. For example, suppose we want to estimate the accuracy of a new test for identifying the location in the brain that is responsible for epileptic seizures. There is no reference-standard test in this case. However, as an alternative to measuring the test's accuracy, we could frame the problem in terms of the clinical outcome (7): We could compare the test results with the patients' seizure status after nerve stimulation to various locations and report the strength of this relationship. Such analysis can yield useful clinical information, even when the test's accuracy cannot be adequately evaluated.

Another solution is to use an expert panel to establish a working diagnosis. Thornbury et al (9) formed an expert panel to determine the diagnoses for patients who underwent MR imaging and CT for acute low back pain. The panel was given the patients' medical histories, physical examination results, laboratory findings, treatment results, and follow-up information to decide whether a herniated disk was present. The determinations of the expert panel regarding the patients' true diagnoses were used as the reference standards with which the MR imaging and CT results were compared. Note that the expert panel was not given the results of MR imaging or CT. This was planned to avoid incorporation bias, which occurs when the results of the diagnostic test(s) under evaluation are incorporated—in full or in part—into the evidence used to establish the definitive diagnosis (4).

A fourth solution to imperfect standard bias is to apply one of several statistical corrections (10). To apply these corrections, one must make some assumptions about the imperfect reference standard (eg, that its sensitivity and specificity are known) and/or the relationship between the results of the test being assessed and the results of the reference-standard test (eg, that the test in question and the reference-standard test make errors independently of one another). There is continuing research of new statistical methods for addressing imperfect standard bias.

In some studies, a reference-standard procedure exists, but it cannot be performed in all of the study patients, usually owing to ethical reasons. An example of such bias is that which may be encountered in a study to assess the accuracy of lung cancer screening with CT. If a patient has negative CT results, then we cannot perform biopsy or surgery to determine his or her true disease status. Verification bias occurs when patients with positive or negative test results are preferentially referred for the reference-standard procedure and then the sensitivity and specificity are based only on those patients who underwent the reference-standard test (11). This bias is counterintuitive in that investigators usually believe that including only the patients for whom there was rigorous verification of the presence or absence of disease will make their study design ideal (12). The opposite is true, however: Studies in which the most stringent verification of disease status is required and the cases with less definitive confirmation are discarded often yield the most biased estimates of accuracy (11,13).

One solution to verification bias is to design the study so that the diagnostic test results will not be used to determine which patients will undergo disease status verification. Rather, the study patients can be selected to undergo the reference-standard procedure on the basis of their signs, symptoms, and other test results—not the results of the test(s) evaluated in the study. This is not always possible because the test(s) under evaluation may be the usual clinical test(s) used to make diagnoses and manage the treatment of these patients.

Another solution is to use different reference-standard procedure(s) for different patients. For example, in evaluating the accuracy of CT for lung cancer screening, some patients may undergo biopsy and surgery and others can be followed up clinically and radiologically for a specified period (eg, 2 years) to detect wrongly diagnosed cases (Table). We cannot simply assume that patients with negative test results are disease free; this assumption can lead to a serious overestimation of test specificity (11).

A third solution to verification bias is to apply a statistical correction to the estimates of accuracy. A number of correction methods exist (14). Most of these methods are based on the assumption that the decision to verify a patient's diagnosis—that is, to refer the patient for further diagnostic work-up, including the reference-standard test used in the study—is a conscious one and thus is based on visible factors, such as the test result and the patient's signs and symptoms. To apply any of the correction methods, it is essential that we record the results of all patients who undergo the test being assessed—not just those of patients who undergo the evaluated test and the reference-standard procedure.
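To make the logic of such corrections concrete, the following Python fragment is a minimal sketch of one simple correction of this kind (the function and variable names are illustrative, not drawn from reference 14). It assumes, as described above, that the decision to verify depends only on the observed test result, so that disease probabilities estimated among verified patients can be reweighted by the test-result frequencies observed among all tested patients.

    def corrected_accuracy(n_pos, n_neg, ver_pos_dis, ver_pos_nodis,
                           ver_neg_dis, ver_neg_nodis):
        # P(disease | test result), estimated among verified patients only
        p_dis_pos = ver_pos_dis / (ver_pos_dis + ver_pos_nodis)
        p_dis_neg = ver_neg_dis / (ver_neg_dis + ver_neg_nodis)
        # P(test result), estimated from ALL patients who underwent the test
        p_pos = n_pos / (n_pos + n_neg)
        p_neg = 1.0 - p_pos
        # Bias-corrected sensitivity and specificity
        sens = p_dis_pos * p_pos / (p_dis_pos * p_pos + p_dis_neg * p_neg)
        spec = (1 - p_dis_neg) * p_neg / ((1 - p_dis_pos) * p_pos
                                          + (1 - p_dis_neg) * p_neg)
        return sens, spec

Note that the correction cannot be computed without n_pos and n_neg, the counts for all patients who underwent the test being assessed, which is exactly the record-keeping requirement stated above.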

BIAS IN PERFORMING AND INTERPRETING TESTS

Tests that are being evaluated in a study must be performed and interpreted without knowledge of the results of competing tests and, when applicable, without knowledge of the results of the reference-standard procedure. If a reference-standard procedure is used in a study, it must be performed and interpreted without knowledge of the results of the diagnostic test(s) being evaluated.

Review bias (4) occurs when a diagnostic test, or the reference-standard test, is performed or interpreted without proper blinding (Table). Consider as an example a study to compare the capability of CT and ultrasonography (US) to depict tumors. When performing US, the technician and radiologist should not be aware of the CT findings because the technician might search with more scrutiny in locations where a tumor was found at CT and the radiologist may have a tendency to "overread" a suspicious area when he or she knows that the CT reader interpreted it to be a tumor. The simplest way to avoid this type of bias is to "blind" both the technician and the reader to the results of the other tests.

In retrospective studies in which the tests have already been performed and interpreted, it is critical that we scrutinize the usual clinical practice in search of review bias. For example, suppose we are reviewing the test findings of all patients who underwent CT and pulmonary angiography for detection of pulmonary emboli. We may find that angiography was almost always performed after CT, and we may suspect that the angiogram was obtained and interpreted with knowledge of the CT findings. For such a study, it may be possible to reinterpret the angiogram while blinded to the CT results. However, one cannot perform the angiographic examination again while blinded to the CT results. In these situations we must be aware that the potential for bias exists and interpret the study findings with the appropriate level of caution.

When two tests—for example, tests A and B—are performed in the same patient and the images are interpreted by the same reader, the images read last—for example, the test B images—will tend to be interpreted more accurately than the images read first—that is, the test A images—if the reader retains any information (15). This situation is called reading-order bias, and it can (a) negate a real difference (ie, if test A is really superior to test B), (b) inflate the true difference (ie, if test B is really superior to test A), or (c) create a difference when no true difference exists.

The simplest way to reduce or eliminate reading-order bias is to vary the order in which the test findings are interpreted (15). For example, suppose 50 patients underwent both test A and test B. The reader could first interpret the results of test A for half of the patients—let us call them group 1. Next, the reader would interpret the results of test B for the second half of the patients—let us call them group 2. After a sufficient time lag, the reader would interpret the test B results for group 1 and then the test A results for group 2. This way, the effect of reading-order bias would be cancelled out, because although the test A results would be read first for half of the patients, the test B results also would be read first for half of the patients.

Note that patients would have to be randomized (Table) to the two groups and the images obtained in the two groups would need to be presented to the readers in random order. The rationale for this protocol is that readers sometimes remember the first (and even second and last) case in a reading session, so by randomizing patients we reduce the effect of any retained information.
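As a sketch of the bookkeeping involved, the following Python fragment implements the counterbalanced, randomized schedule just described; the function name and the (patient, test) tuple format are illustrative assumptions, not part of the original protocol.

    import random

    def counterbalanced_sessions(patient_ids, seed=42):
        # Randomly assign patients to two groups of (nearly) equal size
        rng = random.Random(seed)
        ids = list(patient_ids)
        rng.shuffle(ids)
        half = len(ids) // 2
        group1, group2 = ids[:half], ids[half:]
        # Session 1: test A images for group 1, test B images for group 2
        session1 = [(pid, "A") for pid in group1] + [(pid, "B") for pid in group2]
        # Session 2 (after a sufficient time lag): the other test for each group
        session2 = [(pid, "B") for pid in group1] + [(pid, "A") for pid in group2]
        # Present the cases within each session in random order
        rng.shuffle(session1)
        rng.shuffle(session2)
        return session1, session2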

An additional way to reduce the effect of retained information is to allow a sufficient time lag between the first and subsequent readings of images in the same case. No standard time is appropriate for all studies. Rather, the duration of the time lag should depend on the complexity of the readings and the volume of the study cases and similar clinical cases that the reader is expected to interpret. For example, if the study cases are those from screening examinations and the reader in his or her typical clinical practice interprets the results of many screening examinations, then a short time lag (ie, a few days) is probably sufficient. In contrast, if the study cases are difficult and complex to interpret and thus a great deal of time is required to determine the diagnosis, and/or if the reader does not typically interpret the types of cases included in the study, then a long time lag (ie, several months) is needed to minimize the retained information.

One last bias that I will discuss in this section occurs when tests are interpreted in an artificial environment. Intuitively, in an experimental setting, we might expect readers to interpret cases with more care because they know that their performance is being measured. Egglin and Feinstein (16) addressed another issue that affects reader performance. They performed a study to assess the effect that disease prevalence has on test interpretation. They assembled a test set of pulmonary arteriograms with a depicted pulmonary embolism prevalence of 33% and embedded this set into two larger groups of arteriograms such that group A had an overall prevalence rate of 60% and group B an overall prevalence rate of 20%. After blinded randomized reviews by six readers, they concluded that readers' accuracies differ depending on the context and often improve when the disease prevalence is higher. Egglin and Feinstein (16) defined context bias as the bias in accuracy measurements that occurs when the disease prevalence in the sample differs greatly from the prevalence in the clinical population. They suggested that investigators use a sample with a disease prevalence similar to that in the clinically relevant population.

BIAS IN ANALYZING TEST RESULTS

Some tests yield uninterpretable results (17). Causes of uninterpretable results include insufficient cell specimens from needle biopsy, abdominal gas interfering with pelvic US imaging, and dense breast tissue at mammography screening. In analyzing the results of a study it is critical not to omit these cases. Rather, we must report the frequency and causes of such cases. When comparing tests, the frequencies of uninterpretable results from the tests should be compared. Poynard et al (18) compared three tests for diagnosing extrahepatic cholestasis and found that the clinical usefulness of the three tests was strongly influenced by the frequencies of uninterpretable results of the different examinations.

Another common problem occurs when some study forms are missing or parts of the forms are incomplete or filled out incorrectly. Response bias occurs when we include just the complete data in our analysis and ignore the missing data. The problem is that there is often a pattern to the missing data—for example, patients who are found to be disease free tend to be followed up with less scrutiny compared with patients who have disease, so data on, for example, patient satisfaction are mostly from patients with disease. However, the results might be different for disease-free patients.

Although there are statistical methods to account for data that are missing not at random (19), it is best to minimize the frequency of missing data by properly training the staff who complete the forms and including mechanisms to collect the incomplete data (eg, multiple telephone and mail messages to nonresponders, cross checks in other databases for information on medical utilization and major outcomes).

CONCLUSION

Sources of bias are everywhere, making it very challenging to design and interpret studies. Researchers should implement ways to avoid bias or to minimize its effect while still in the planning phase of their study. We all should be aware of common biases so that we are able to make informed judgments about the generalizability of study results to our clinical practice.

References
1. Altman DG, Bland JM. Generalisation and extrapolation. BMJ 1998; 317:409–410. Available at: www.bmj.com/cgi/content/full/317/7155/409. Accessed September 26, 2003.
2. Altman DG, Bland JM. How to randomise. BMJ 1999; 319:703–704. Available at: www.bmj.com/cgi/content/full/319/7211/703. Accessed September 26, 2003.
3. Sox H, Stern S, Owens D, Abrams HL. Assessment of diagnostic technology in health care: rationale, methods, problems, and directions. Washington, DC: National Academy Press, 1989.
4. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299:926–930.
5. Beam CA, Baker ME, Paine SS, Sostman HD, Sullivan DC. Answering unanswered questions: proposal for a shared resource in clinical diagnostic radiology research. Radiology 1992; 183:619–620.
6. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996; 156:209–213.
7. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990; 93:252–258.
8. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283–298.
9. Thornbury JR, Fryback DG, Turski PA, et al. Disk-caused nerve compression in patients with acute low-back pain: diagnosis with MR, CT myelography, and plain CT. Radiology 1993; 186:731–738.
10. Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998; 7:354–370.
11. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987; 6:411–423.
12. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988; 167:565–569.
13. Black WC. How to evaluate the radiology literature. AJR Am J Roentgenol 1990; 154:17–22.
14. Zhou XH. Correcting for verification bias in studies of a diagnostic test's accuracy. Stat Methods Med Res 1998; 7:337–353.
15. Metz CE. Some practical issues of experimental design and data analysis in radiologic ROC studies. Invest Radiol 1989; 24:234–245.
16. Egglin TK, Feinstein AR. Context bias: a problem in diagnostic radiology. JAMA 1996; 276:1752–1755.
17. Begg CB, Greenes RA, Iglewicz B. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis 1986; 39:575–584.
18. Poynard T, Chaput JC, Etienne JP. Relations between effectiveness of a diagnostic test, prevalence of the disease, and percentages of uninterpretable results: an example in the diagnosis of jaundice. Med Decis Making 1982; 2:285–297.
19. Little RJA, Rubin DB. Statistical analysis with missing data. New York, NY: Wiley, 1987.


Christopher L. Sistrom, MD, MPH
Cynthia W. Garvan, PhD

Index term: Statistical analysis

Published online
10.1148/radiol.2301031028

Radiology 2004; 230:12–19

Abbreviations: OR = odds ratio; RR = relative risk

1 From the Departments of Radiology (C.L.S.) and Biostatistics (C.W.G.), University of Florida College of Medicine, PO Box 100374, Gainesville, FL 32610. Received July 2, 2003; revision requested July 30; revision received August 4; accepted August 13. Address correspondence to C.L.S. (e-mail: [email protected]). © RSNA, 2004

Proportions, Odds, and Risk1

Perhaps the most common and familiar way that the results of medical research and epidemiologic investigations are summarized is in a table of counts. Numbers of subjects with and without the outcome of interest are listed for each treatment or risk factor group. By using the study sample data thus tabulated, investigators quantify the association between treatment or risk factor and outcome. Three simple statistical calculations are used for this purpose: difference in proportions, relative risk, and odds ratio. The appropriate use of these statistics to estimate the association between treatment or risk factor and outcome in the relevant population depends on the design of the research. Herein, the enumeration of proportions, odds ratios, and risks and the relationships between them are demonstrated, along with guidelines for use and interpretation of these statistics appropriate to the type of study that gives rise to the data. © RSNA, 2004

In a previous article in this series (1), the 2 × 2 contingency table was introduced as a way of organizing data from a study of diagnostic test performance. Applegate et al (2) have previously described analysis of nominal and ordinal data as counts and medians. Binary variables are a special case of nominal data where there are only two possible levels (eg, yes/no, true/false). Data in two binary variables arise from a variety of research methods that include cross-sectional, case-control, cohort, and experimental designs. In this article, we will describe three ways to quantify the strength of the relationship between two binary variables: difference of proportions, relative risk (RR), and odds ratio (OR). Appropriate use of these statistics depends on the type of data to be analyzed and the research study design.

Correct interpretation of the difference of proportions, the RR, and the OR is key to the understanding of published research results. Misuse or misinterpretation of them can lead to errors in medical decision making and may even have adverse public policy implications. An example can be found in an article by Schulman et al (3) published in the New England Journal of Medicine about the effects of race and sex on physician referrals for cardiac catheterization. Results of this study of Schulman et al received extensive media coverage about the findings that blacks and women were referred less often than white men for cardiac catheterization. In a follow-up article, Schwartz et al (4) showed how the magnitude of the findings of Schulman et al was overstated, chiefly because of confusion among OR, RR, and probability. The resulting controversy underscores the importance of understanding the nuances of these statistical measures.

Our purpose is to show how a 2 × 2 contingency table summarizes results of several common types of biomedical research. We will describe the four basic study designs that give rise to such data and provide an example of each one from literature related to radiology. The appropriate use of difference in proportion, RR, and OR depends on the study design used to generate the data. A key concept to be developed about using odds to estimate risk is that the relationship between OR and RR depends on outcome frequency. Both graphic and computational correction of OR to estimate RR will be shown. With rare diseases, even a corrected RR estimate may overstate the effect of a risk factor or treatment. Use of difference in proportions (expressed as attributable risk) may give a better picture of societal impact. Finally, we introduce the concept of confounding by factors extraneous to the research question that may lead to inaccurate or contradictory results.

2 × 2 CONTINGENCY TABLES

Let X and Y denote two binary variables that each have only two possible levels. Another term for binary is dichotomous. Results are most often presented as counts of observations at each level. The relationship between X and Y can be displayed in a 2 × 2 contingency table. Another name for a contingency table is a cross-classification table. A 2 × 2 contingency table consists of four cells: the cell in the first row and first column (cell 1-1), the cell in the first row and second column (cell 1-2), the cell in the second row and first column (cell 2-1), and the cell in the second row and second column (cell 2-2). Commonly used symbols for the cell contents include n with subscripts, p with subscripts, and the letters a–d. The n11, n12, n21, and n22 notation refers to the number of subjects observed in the corresponding cells. In general, nij refers to the number of observations in the ith row (i = 1, 2) and jth column (j = 1, 2). The total number of observations will be denoted by n (ie, n = n11 + n12 + n21 + n22). The p11, p12, p21, and p22 notation refers to the proportion of subjects observed in each cell. In general, pij refers to the proportion of observations in the ith row (i = 1, 2) and jth column (j = 1, 2). Note that pij = nij/n. For simplicity, many authors use the letters a–d to label the four cells as follows: a = cell 1-1, b = cell 1-2, c = cell 2-1, and d = cell 2-2. We will use the a–d notation in equations that follow. Table 1 shows the general layout of a 2 × 2 contingency table with symbolic labels for each cell and common row and column assignments for data from medical studies.
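The notation translates directly into arithmetic. The following Python fragment is a minimal illustration (the counts are those given later in Table 4, used here purely as example numbers):

    # Cell counts in Table 1 notation (example numbers from Table 4)
    n11, n12, n21, n22 = 81, 67, 90, 188
    n = n11 + n12 + n21 + n22              # total observations: 426
    # Cell proportions: p_ij = n_ij / n
    p11, p12, p21, p22 = n11 / n, n12 / n, n21 / n, n22 / n
    # The letters a-d are simply synonyms for the four counts
    a, b, c, d = n11, n12, n21, n22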

In many contingency tables, one variable is a response (outcome or dependent variable) and the other is an explanatory (independent) variable. In medical studies, the explanatory variable (X in the rows) is often a risk or a protective factor and the response (Y in the columns) is a disease state. The distribution of observed data in a 2 × 2 table indicates the strength of relationship between the explanatory and the response variables. Figure 1 illustrates possible patterns of observed data. The solid circle represents a cell containing numerous observations. Intuitively, we would expect that if the X and Y variables are associated, then pattern A or B would be observed. Patterns C, D, and E suggest that X and Y are independent of each other (ie, there is no relationship between them).

STUDY DESIGNS THAT YIELD 2 × 2 TABLES

The statistical methods used to analyze research data depend on how the study was conducted. There are four types of designs in which two-by-two tables may be used to organize study data: case-control, cohort, cross-sectional, and experimental. The first three designs are often called observational to distinguish them from experimental studies; of the experimental studies, the controlled clinical trial is the most familiar. The 2 × 2 tables that result from the four designs may look similar to each other. The outcome is typically recorded in columns, and the explanatory variable is listed in the rows. The χ2 statistic may be calculated and used to test the null hypotheses of independence between row and column variables for all four types of studies. These methods are described in a previous article in this series (2). Table 2 summarizes the features of the four types of designs in which 2 × 2 tables are used to organize study data. Each is briefly described next, with a radiology-related example provided for illustration. The ordering of the study designs in the following paragraphs reflects, in general, the strength (ie, weaker to stronger) of evidence for causation obtained from each one. The advantages and disadvantages of the different designs and situations where each is most appropriate are beyond the scope of this article, and the reader is encouraged to seek additional information in biostatistics, research design, or clinical epidemiology reference books (5–7).
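For reference, the independence test mentioned above takes a single call with standard statistical software; this sketch uses SciPy (a choice of convenience, not one prescribed by the article) on the counts given later in Table 4:

    from scipy.stats import chi2_contingency

    table = [[81, 67],    # risk factor present: DVT positive / negative (Table 4)
             [90, 188]]   # risk factor absent:  DVT positive / negative
    chi2, p_value, dof, expected = chi2_contingency(table)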

The statistics used to quantify the relationship between variables (ie, the effect size) are detailed and the examples of study designs are summarized in Table 3. These will be briefly defined now. The difference in proportion is the difference in the fraction of subjects who have the outcome between the two levels of the explanatory variable. The RR is the ratio of proportion (ie, risk) of subjects who have the outcome between different levels of the explanatory variable. The OR is the ratio of the odds that a subject will have the outcome between the two levels of the explanatory variable. Each of these statistics has an associated standard error that can be calculated and used to form confidence intervals (CIs) around the estimate at any chosen level of precision. The calculation of CIs and their use for inference testing are described in a previous article in this series (8).

Cross-sectional Studies

A cross-sectional study does not involve the passage of time. A single sample is selected without regard to disease state or exposure status. Information on disease state and exposure status is determined with data collected at a single time point.

TABLE 1
Notation for 2 × 2 Contingency Table

X†              Y* = Yes‡        Y* = No§
Present‖        n11, p11, a      n12, p12, b
Absent#         n21, p21, c      n22, p22, d

Note.—n = number of subjects in the cell, p = proportion of entire sample in the cell, a–d = commonly used cell labels.
* Response, outcome, or disease status variable.
† Explanatory, risk factor, or exposure variable.
‡ Adverse outcome or disease-positive response.
§ No adverse outcome or disease-negative response.
‖ Exposed or risk-positive group.
# Unexposed or risk-negative group.

Figure 1. Diagram shows possible patterns of observed data in a 2 × 2 table. Cells with black circles contain relatively large numbers of counts. With pattern A, n11 and n22 are large, suggesting that when X is present, Y is "yes." With pattern B, n12 and n21 are large, suggesting that when X is absent, Y is "yes." With pattern C, n11 and n21 are large, suggesting that Y is "yes" regardless of X. With pattern D, n12 and n22 are large, suggesting that Y is "no" regardless of X. With pattern E, n11, n12, n21, and n22 are all about the same, suggesting that Y is "yes" and "no" in equal proportion regardless of X.

Data about exposure status and disease state can be organized into a 2 × 2 contingency table, and the prevalence (ie, the proportion of a group that currently has a disease) can be compared for the exposed and unexposed groups. Effect size from cross-sectional studies may be assessed with difference in proportions, RR, or OR. For example, Cogo et al (9) studied the association between having a major risk factor (eg, immobilization, trauma, and/or recent surgery) and deep vein thrombosis. This study was performed in an outpatient setting by using a cross-sectional design. A total of 426 subjects who were referred by general practitioners underwent contrast material–enhanced venography to determine deep vein thrombosis status (positive or negative). Concurrently, information on major risk factors was recorded as being present or absent. They found that deep vein thrombosis was more likely to occur when a major risk factor was present (81 [55%] of 148) than when none was present (90 [32%] of 278). The data are shown in Table 4, and effect measures are included in Table 3.

Case-Control Studies

In a case-control study, the investigator compares instances of a certain disease or condition (ie, the cases) with individuals who do not have the disease or condition (ie, the "controls" or control subjects). Control subjects are usually selected to match the patients with cases of disease in characteristics that might be related to the disease or condition of interest. Matching by age and sex is commonly used. Investigators look backward in time (ie, retrospectively) to collect information about risk or protective factors for both cases and controls. This is achieved by examining past records, interviewing the subject, or in some other way. The only correct measure of effect size for a case-control study is the OR. However, the calculated OR may be used to estimate RR after appropriate correction for disease frequency in the population of interest, which will be explained later. For example, Vachon et al (10) studied the association between type of hormone replacement therapy and increased mammographic breast density by using a case-control study design. They identified 172 women who were undergoing hormone replacement therapy who had increased breast density (cases) and 172 women who were undergoing hormone replacement therapy who did not have increased breast density (controls). The type of hormone replacement therapy used by all subjects was then determined. They found that combined hormone replacement therapy was associated with increased breast density more often than was therapy with estrogen alone (OR = 2.22). The data are presented in Table 5.

Cohort Studies

A cohort is simply a group of individuals. The term is derived from Roman military tradition; according to this tradition, legions of the army were divided into 10 cohorts. This term now means any specified subdivision or group of people marching together through time. In other words, cohort studies are about the life histories of sections of populations and the individuals who are included in them.

TABLE 2
Comparison of Four Study Designs

Cross-sectional study:
  Sample selection: one sample selected without regard to disease or exposure status
  Proportions that can be estimated: prevalence of disease in the exposed and unexposed groups
  Time reference: present look in time
  Effect measure: OR, difference in proportions*

Case-control study:
  Sample selection: two samples selected, one from the disease-positive population and one from the disease-negative population
  Proportions that can be estimated: proportion of cases and controls that have been exposed to a risk factor
  Time reference: backward look in time
  Effect measure: OR

Cohort study:
  Sample selection: two samples selected, one from the exposed population and one from the unexposed population
  Proportions that can be estimated: incidence of disease in the exposed and unexposed groups
  Time reference: forward look in time
  Effect measure: RR, difference in proportions*

Experimental study:
  Sample selection: one sample selected that is disease negative; sample is randomly assigned to treatment or control group
  Proportions that can be estimated: incidence of disease in treated and untreated (control) groups
  Time reference: forward look in time
  Effect measure: RR, difference in proportions*

* Difference in proportions may be used as an alternate measure of effect.

TABLE 3
Quantification of Effect Size for Two Binary Variables

Calculation for estimate based on sample data:
  Difference of proportions* = n11/(n11 + n12) - n21/(n21 + n22)
  RR† = [n11/(n11 + n12)] / [n21/(n21 + n22)]
  OR‡ = (n11 × n22)/(n12 × n21)

Calculation in terms of a–d cell labels:
  Difference of proportions = a/(a + b) - c/(c + d)
  RR = [a/(a + b)] / [c/(c + d)]
  OR = ad/bc

Example values (difference of proportions, RR, OR):
  Cross-sectional example (Table 4): 0.22, 1.69, 2.52
  Case-control example (Table 5): NA§, NA§, 2.22
  Cohort example (Table 6): 0.046, 1.07, 1.25‖
  Experimental example (Table 7): 0.029, 1.56, 1.61‖

* Proportion with outcome in exposed group minus proportion with outcome in unexposed group.
† Risk of outcome in exposed group divided by risk of outcome in unexposed group.
‡ Odds of outcome in exposed group divided by odds of outcome in unexposed group.
§ NA = not typically calculated or reported, since the statistic is not meaningful for this study design.
‖ ORs may be calculated for cohort and experimental studies, but RR is preferred.

TABLE 4
Example of Cross-sectional Study Data

Major Risk Factor    DVT Positive*    DVT Negative*    Total
Present              81               67               148
Absent               90               188              278

Source.—Reference 9.
* DVT = deep venous thrombosis.

In a prospective study, investigators follow up subjects after study inception to collect information about development of disease. In a retrospective study, disease status is determined from medical records produced prior to the beginning of the study but after articulation of the cohorts. In both types of studies, initially disease-free subjects are classified into groups (ie, cohorts) on the basis of exposure status with respect to risk factors. Cumulative incidence (ie, the proportion of subjects who develop disease in a specified length of time) can be computed and compared for the exposed and unexposed cohorts. The main difference between prospective and retrospective cohort studies is whether the time period in question is before (retrospective) or after (prospective) the study begins. Effect size from a cohort study is typically quantified with RR and/or difference in proportions. For example, Burman et al (11) studied the association between false-positive mammograms and interval breast cancer screening by using a prospective cohort study design. Women in whom a false-positive mammogram was obtained at the most recent screening formed one cohort, and women in whom a previously negative mammogram was obtained formed the other cohort. All of the women were followed up to determine if they obtained a subsequent screening mammogram within the recommended interval (up to 2 years, depending on age). Burman et al found no significant difference in the likelihood that a woman would obtain a mammogram between the two cohorts (RR = 1.07). The data are included in Table 6.

Experimental Studies

The characteristic that distinguishes any experiment is that the investigator directly manipulates one or more variables (not the outcome!). A clinical trial is the most common type of experimental study used in medical research. Here, the investigator selects a sample of subjects and assigns each to a treatment. In many cases, one treatment may be standard therapy or an inactive (ie, placebo) treatment. These subjects are the controls, and they are compared with the subjects who are receiving a new or alternative treatment. Treatment assignment is almost always achieved randomly so that subjects have an equal chance of receiving one of the treatments. Subjects are followed up in time, and the cumulative incidence of the outcome or disease is compared between the treatment groups. RR and/or difference in proportions is typically used to quantify treatment effect on the outcome. An example of such a study can be found in an article by Harrison et al (12) in which they describe their trial of direct mailings to encourage attendance for mammographic screening. At the start of the study, half of the women were randomly assigned to receive a personally addressed informational letter encouraging them to attend screening. The other half (ie, controls) received no intervention. The number of women in each group who underwent mammography during the subsequent 2 years was enumerated. The women to whom a letter was sent were more likely to obtain a mammogram (RR = 1.56). The data are listed in Table 7.

CALCULATION OF PROPORTIONS FROM A 2 × 2 TABLE

Various proportions can be calculated from the data represented in a 2 × 2 contingency table. The cell, row, and column proportions each give different information about the data. Proportions may be represented by percentages or fractions, with the former having the advantage of being familiar to most people. Cell proportions are simply the observed number in each of the four cells divided by the total sample size. Each cell also has a row proportion. This is the number in the cell divided by the total in the row containing it. Likewise, there are four column proportions that are calculated by dividing the number in each cell by the total in the column that contains it.

In a 2 × 2 table organized with outcome in the columns and exposure in the rows, the various proportions have commonly understood meanings. Cell proportions are the fractions of the whole sample found in each of the four combinations of exposure and outcome status. Row proportions are the fractions with and without the outcome. It may seem counterintuitive that row proportions give information about the outcome. However, remembering that cells in a given row have the same exposure status helps one to clarify the issue.

TABLE 5
Example of Case-Control Study Data

Condition       Cases*    Controls†
Exposed‡        111       79
Nonexposed§     50        79

Source.—Reference 10.
* Increased breast density.
† No increased breast density.
‡ Combined therapy.
§ Estrogen alone.

TABLE 6
Example of Cohort Study Data

                                No. Undergoing Mammography within 2 Years*
Result at Last Mammography      Returned      Did Not Return      Total
False-positive                  602           211                 813
True-negative                   3,098         1,359               4,457

Source.—Reference 11.
* Within 2 years after last mammography.

TABLE 7
Example of Experimental Study Data

                 Underwent Mammography within 2 Years
Treatment        No. Who Did      No. Who Did Not      Total
Intervention*    100              1,129                1,229
Control†         64               1,165                1,229

Source.—Reference 12.
* Intervention indicates that subjects received a mailed reminder.
† Control indicates that subjects did not receive a mailed reminder.

Similarly, column proportions are simply the fraction of exposed and unexposed subjects.
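As a concrete illustration, here are the cell, row, and column proportions for the Table 4 counts (a minimal Python sketch using the a–d labels of Table 1):

    a, b = 81, 67     # risk factor present: DVT positive / negative (Table 4)
    c, d = 90, 188    # risk factor absent:  DVT positive / negative
    n = a + b + c + d

    cell_props = [a / n, b / n, c / n, d / n]   # fractions of the whole sample
    row_props = [a / (a + b), b / (a + b),      # outcome fractions within each exposure row
                 c / (c + d), d / (c + d)]
    col_props = [a / (a + c), c / (a + c),      # exposure fractions within each outcome column
                 b / (b + d), d / (b + d)]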

DIFFERENCE IN PROPORTIONS

The difference in proportions is used to compare the response Y (eg, disease: yes or no) according to the value of the explanatory variable X (eg, risk factor: exposed or unexposed). The difference is defined as the proportion with the outcome in the exposed group minus the proportion with the outcome in the unexposed group. By using the a–d letters for cell labels (Table 1), the calculation is as follows:

a/(a + b) - c/(c + d).

For the cross-sectional study data (Table 4) we would calculate the following equation:

81/(81 + 67) - 90/(90 + 188) = 0.547 - 0.324 = 0.223.

The difference in proportions always is between -1.0 and 1.0. It equals zero when the response Y is statistically independent of the explanatory variable X. When X and Y are independent, then there is no association between them. It is appropriate to calculate the difference in proportions for the cohort, cross-sectional, and experimental study designs. In a case-control study, there is no information about the proportions that are outcome (or disease) positive in the population. This is because the investigator actually selects subjects to get a fixed number at each level of the outcome (ie, cases and controls). Therefore, the difference in proportions statistic is inappropriate for estimating the association between exposure and outcome in a case-control study. Table 3 lists the difference in proportions estimate for cross-sectional, cohort, and experimental study examples (9,11,12).
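In code, the statistic is a one-line translation of the formula above (the function name is illustrative):

    def difference_in_proportions(a, b, c, d):
        # a/(a + b) - c/(c + d), using the a-d cell labels of Table 1
        return a / (a + b) - c / (c + d)

    print(difference_in_proportions(81, 67, 90, 188))  # ~0.223 for the Table 4 data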

RISK AND RR

Risk is a term often used in medicine for the probability that an adverse outcome, such as a side effect, development of a disease, or death, will occur during a specific period of time. Risk is a parameter that is completely known only in the rare situation when data are available for an entire population. Most often, an investigator estimates risk in a particular population by taking a representative random sample, counting those that experience the adverse outcome during a specified time interval, and forming a proportion by dividing the number of adverse outcomes by the sample size. For example, the estimate of risk is equal to the number of subjects who experience an event or outcome divided by the sample size.

The epidemiologic term for the resulting rate is the cumulative incidence of the outcome. The incidence of an event or outcome must be distinguished from the prevalence of a disease or condition. Incidence refers to events (eg, the acquisition of a disease), while prevalence refers to states (eg, the state of having a disease). RR is a measure of association between exposure to a particular factor and risk of a certain outcome. The RR is defined to be the ratio of risk in the exposed and unexposed groups. An equivalent term for RR that is sometimes used in epidemiology is the cumulative incidence ratio, which may be calculated as follows: RR is equal to the risk among exposed subjects divided by the risk among unexposed subjects.

In terms of the letter labels for cells (Table 1), RR is calculated as follows:

RR = [a/(a + b)] / [c/(c + d)].

The RR for our cohort study example (Table 6) would be calculated by dividing the fraction with a mammogram in the false-positive cohort (602/813 = 0.74) by the fraction with a mammogram in the true-negative cohort (3,098/4,457 = 0.695). This yields 1.065. The value of RR can be any nonnegative number. An RR of 1.0 corresponds to independence of (or no association between) exposure status and adverse outcome. When RR is greater than 1.0, the risk of disease is increased when the risk factor is present. When RR is less than 1.0, the risk of disease is decreased when the risk factor is present. In the latter case, the factor is more properly described as a protective factor. The interpretation of RR is quite natural. For example, an RR equal to 2.0 means that an exposed person is twice as likely to have an adverse outcome as one who is not exposed, and an RR of 0.5 means that an exposed person is half as likely to have the outcome. Table 3 illustrates the calculation of the RR estimates from the various examples.
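The same calculation in code (function name illustrative):

    def relative_risk(a, b, c, d):
        # [a/(a + b)] / [c/(c + d)], using the a-d cell labels of Table 1
        return (a / (a + b)) / (c / (c + d))

    print(relative_risk(602, 211, 3098, 1359))  # ~1.065 for the Table 6 cohort data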

Any estimate of relative risk must be considered in the context of the absolute risk. Motulsky (5) gives an example to show how looking only at RR can be misleading. Consider a vaccine that halves the risk of a particular infection. In other words, the vaccinated subjects have an RR of 0.5 of getting infected compared with their unvaccinated peers. The public health impact depends not only on the RR but also on the absolute risk of infection. If the risk of infection is two per million unvaccinated people in a year, then halving the risk to one per million is not so important. However, if the risk of infection is two in 10 unvaccinated people in a year, then halving the risk is of immense consequence by preventing 100,000 cases per million. Therefore, it is more informative to compare vaccinated to unvaccinated cohorts by using the difference in proportions. With the rare disease, the difference is 0.000001, while for the common disease it is 0.1. This logic underlies the concepts of number needed to treat and number needed to harm. These popular measures developed for evidence-based medicine allow direct comparison of effects of interventions (ie, number needed to treat) or risk factors (ie, number needed to harm). In our vaccination scenario, the number needed to treat (ie, the number who must be vaccinated to prevent one infection) for the rare disease is 1 million and for the common disease is 10.
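The number needed to treat is just the reciprocal of the difference in proportions; a short sketch of the vaccination scenario (function name illustrative):

    def number_needed_to_treat(risk_control, risk_treated):
        # Reciprocal of the absolute risk difference
        return 1 / (risk_control - risk_treated)

    print(number_needed_to_treat(2e-6, 1e-6))  # 1,000,000 for the rare infection
    print(number_needed_to_treat(0.2, 0.1))    # 10 for the common infection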

ODDS AND THE OR

The OR provides a third way of comparing proportions in a 2 × 2 contingency table. An OR is computed from odds. Odds and probabilities are different ways of expressing the chance that an outcome may occur. They are defined as follows: The probability of outcome is equal to the number of times the outcome is observed divided by the total observations. The odds of outcome is equal to the probability that the outcome does occur divided by the probability that the outcome does not occur.

We are familiar with the concept of odds through gambling. Suppose that the odds a horse named Lulu will win a race are 3:2 (ie, read as "three to two"). The 3:2 is equivalent to 3/2 or 1.5. The probability that Lulu will be the first to cross the finish line can be calculated from the odds, since there is a deterministic relationship between odds and probability (ie, if you know the value of one, then you can find the value of the other). We know that:

Pr = Odds/(1 + Odds)

and

Odds = Pr/(1 - Pr),

where Pr is probability. The probability that Lulu will win the race is 1.5/(1 + 1.5) = 0.60, or 60%.
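The two conversion formulas are conveniently expressed as a pair of helper functions (names illustrative):

    def odds_to_probability(odds):
        return odds / (1 + odds)

    def probability_to_odds(pr):
        return pr / (1 - pr)

    print(odds_to_probability(1.5))   # 0.60: Lulu's 3:2 odds as a win probability
    print(probability_to_odds(0.60))  # 1.5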


Probabilities always range from 0 to 1.0, while odds can be any nonnegative number. The odds of a medical outcome in exposed and unexposed groups are defined as follows: Odds of disease in the exposed group is equal to the probability that the disease occurs in the exposed group divided by the probability that the disease does not occur in the exposed group. Odds of disease in the unexposed group is equal to the probability that the disease occurs in the unexposed group divided by the probability that the disease does not occur in the unexposed group.

It is helpful to define odds in terms of the a-d notation shown in Table 1. Useful mathematical simplifications (where Odds_exp is the odds of outcome in the exposed group and Odds_unex is the odds of outcome in the unexposed group) that arise from this definition are as follows:

Odds_exp = [a/(a + b)]/[b/(a + b)] = a/b

and

Odds_unex = [c/(c + d)]/[d/(c + d)] = c/d.

Note that these simplifications mean that the outcome probabilities do not have to be known in order to calculate odds. This is especially relevant in the analysis of case-control studies, as will be illustrated. The OR is defined as the ratio of the odds and may be calculated as follows: OR is equal to the odds of disease in the exposed group divided by the odds of disease in the unexposed group. By using the a-d labeling,

OR = (a/b)/(c/d) = ad/bc.

The OR for our case-control example (Table 5) would be calculated as follows:

OR = [(111)(79)]/[(79)(50)] = 2.22.
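The same calculation in code, as a sketch of our own; the four cell counts are taken from the in-text arithmetic (ad = 111 x 79 and bc = 79 x 50):

    # Odds ratio (OR) = (a/b)/(c/d) = ad/bc, with the a-d cell labeling.
    def odds_ratio(a, b, c, d):
        return (a * d) / (b * c)

    print(odds_ratio(111, 79, 50, 79))  # about 2.22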

The OR has another property that is particularly useful for analyzing case-control studies. The OR we calculate from a case-control study is actually the ratio of odds of exposure, not outcome. This is because the numbers of subjects with and without the outcome are always fixed in a case-control study. However, the calculation for exposure OR and that for outcome OR are mathematically equivalent, as shown here:

(a/c)/(b/d) = ad/bc = (a/b)/(c/d).

Therefore, in our example, we can correctly state that the odds of increased breast density are 2.2 times greater in those receiving combined hormone replacement therapy than in those receiving estrogen alone. Authors, and readers, must be very circumspect about continuing with the chain of inference to state that the risk of increased breast density in those receiving combined hormone replacement therapy is 2.2 times higher than the risk of increased breast density in those receiving estrogen alone. In doing this, one commits the logical error of assuming causation from association. Furthermore, the OR is not a good estimator of RR when the outcome is common in the population being studied.

RELATIONSHIP BETWEEN RR AND OR

There is a strict and invariant mathematic relationship between RR and OR when they both are from the same population, as may be observed with the following equation:

RR = OR/[(1 - Pro) + (Pro)(OR)],

where Pro is the probability of the outcome in the unexposed group.

The relationship implies that the magnitude of OR and that of RR are similar only when Pro is low (13). In epidemiology, Pro is referred to as the incidence of outcome in the unexposed group. Thus, the OR obtained in a case-control study accurately estimates RR only when the outcome in the population being studied is rare. Figure 2 shows the relationship between RR and OR for various values of Pro. Note that the values of RR and OR are similar only when the Pro is small (eg, 10/100 = 10% or less). At increasing Pro, ORs that are less than 1.0 underestimate the RR, and ORs that are greater than 1.0 overestimate the RR. A rule of thumb is that the OR should be corrected when incidence of the outcome being studied is greater than 10% if the OR is greater than 2.5 or the OR is less than 0.5 (4). Note that with a case-control study, the probability of outcome in the unexposed group must be obtained separately because it cannot be estimated from the sample.

This distinction was at the heart of the critique of Schwartz et al (4) regarding the article by Schulman et al (3). Schulman et al had reported an OR of 0.6 for referral to cardiac catheterization (outcome) between blacks and whites (explanatory). However, referral for catheterization occurred up to 90% of the time, so the corresponding RR should have been 0.93. This information would have created a rather unspectacular news story (ie, blacks referred 7% less often than whites) compared with the initial, and incorrect, headlines stating that blacks were referred 40% less often than whites.
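The correction itself is a one-line function. A sketch (ours), which reproduces the Schwartz et al correction of the Schulman et al OR:

    # Convert an OR to the corresponding RR given Pro, the probability of
    # the outcome in the unexposed group: RR = OR/[(1 - Pro) + Pro*OR].
    def or_to_rr(odds_ratio, pro):
        return odds_ratio / ((1 - pro) + pro * odds_ratio)

    # OR of 0.6 with referral occurring about 90% of the time:
    print(or_to_rr(0.6, 0.9))  # 0.9375, ie, roughly the 0.93 cited above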

In our examples, this relationship is also apparent. In the studies in which both RR and OR were calculated, the OR is larger than the RR, as we now expect. In the experimental study example, the OR of 1.61 is only slightly larger than the RR of 1.56. This is because the probability of mammography (the outcome) is rare at 6.7 per 100. In contrast, the cohort study yields an OR of 1.25, which is considerably larger than the RR of 1.07, with the high overall probability of mammography of 73 per 100 explaining the larger difference.

Figure 2. Graph shows relationship between OR and probability of outcome in the unexposed group. Curves represent values of underlying RR as labeled. Rectangle formed by dotted lines represents suggested bounds on OR and probability of outcome in the unexposed group within which no correction from OR to RR is needed. When OR is more than 2.5 or less than 0.5, or probability of outcome in the unexposed group is more than 0.1, a correction (calculation in text) should be applied.

CONFOUNDING VARIABLES AND THE SIMPSON PARADOX

An important consideration in any study design is the possibility of confounding by additional variables that mask or alter the nature of the relationship between an explanatory variable and the outcome. Consider data from three binary variables, the now familiar X and Y, as well as a new variable Z. These can be represented in two separate 2 x 2 contingency tables: one for X and Y at level 1 of variable Z and one for X and Y at level 2 of variable Z. Alternatively, the X and Y data can be represented in a single table that ignores the values of Z (ie, pools the data across Z).

The problem occurs when the magnitude of association between X (explanatory) and Y (outcome) is different at each level of Z (confounder). Estimation of the association between X and Y from the pooled 2 x 2 table (ignoring Z) is inappropriate and often incorrect. In such cases, it is misleading to even list the data in aggregate form. There are statistical methods to correct for confounding variables, with the assumption that they are known and measurable. A family of techniques attributed to Cochran (14) and Mantel and Haenszel (15) is commonly used to produce summary estimates of association between X and Y corrected for the third variable Z.

The entire rationale for the use of randomized clinical trials is to eliminate the problem of unknown and/or immeasurable confounding variables. In a clinical trial, subjects are randomly assigned to levels of X (the treatment or independent variable). The outcome (Y) is then measured during the course of the study. The beauty of this scheme is that all potentially confounding variables (Z) are equally distributed among the treatment groups. Therefore, they cannot affect the estimate of association between treatment and outcome. For randomization to be effective in elimination of the potential for confounding, it must be successful. This requires scrupulous attention to detail by investigators in conducting, verifying, and documenting the randomization used in any study.

It is possible for the relationship between X and Y to actually change direction if the Z data are ignored. For instance, OR calculated from pooled data may be less than 1.0, while OR calculated with the separate (so-called stratified) tables is greater than 1.0. This phenomenon is referred to as the Simpson paradox, as Simpson is credited with an article in which he explains mathematically how this contradiction can occur (16). Such paradoxic results are not only numerically possible but they actually arise in real-world situations and can have profound social implications.

Agresti (13) illustrated the Simpson paradox by using death penalty data for black and white defendants charged with murder in Florida between 1976 and 1987 (17). He showed that when the victim's race is ignored, the percentage of death penalty sentences was higher for white defendants. However, after controlling for the victim's race, the percentage of death penalty sentences was higher for black defendants. The paradox arose because juries applied the death penalty more frequently when the victim was white, and defendants in such cases were mostly white. The victim's race (Z) dominated the relationship between the defendant's race (X) and the death penalty verdict (Y). Table 8 lists the data stratified by the victim's race and combined across the victim's race. As indicated in the table, the unadjusted RR of receiving the death penalty (white/black) with white victims is 0.495; with black victims, 0.0; and with combined victims, 1.40. The Mantel-Haenszel estimate of the common RR is 0.48, thus resolving the paradox. The method for calculating the Mantel-Haenszel summary estimates is beyond the scope of this article. However, confounding variables, the Simpson paradox, and how to handle them are discussed in any comprehensive text about biostatistics or medical research design (5,6).

TABLE 8
Death Penalty Verdicts Following Murder Convictions in Florida, 1976-1987

Victim's Race   Defendant's Race   Death Penalty: Yes*   Death Penalty: No*   RR, White/Black
White           White              53 (11)               414 (89)             0.495
White           Black              11 (23)               37 (77)              . . .
Black           White              0 (0)                 16 (100)             0
Black           Black              4 (3)                 139 (97)             . . .
Combined        White              53 (11)               430 (89)             UNC, 1.40†; MH, 0.48‡
Combined        Black              15 (8)                176 (92)             . . .

Source.—Reference 17.
* Data are numbers of verdicts. Data in parentheses are percentages of the totals.
† UNC = uncorrected.
‡ MH = Mantel-Haenszel estimate.
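Although the full Mantel-Haenszel method is beyond the scope of the article, the summary RR itself is a short weighted sum. The sketch below is our illustration (not the authors'); it reproduces the 0.48 estimate from the Table 8 counts, treating the white defendants as the "exposed" group:

    # Mantel-Haenszel common risk ratio across strata. Each stratum is
    # (a, b, c, d): exposed with/without the outcome, then unexposed
    # with/without the outcome.
    def mantel_haenszel_rr(strata):
        num = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in strata)
        den = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in strata)
        return num / den

    strata = [(53, 414, 11, 37),  # white victims
              (0, 16, 4, 139)]    # black victims
    print(mantel_haenszel_rr(strata))  # about 0.48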

CONCLUSION

This article has focused on statistical analysis of count data that arise from the four basic designs used in medical research (ie, cross-sectional, case-control, cohort, and experimental study designs). Each of these designs often yields data that are best summarized in a 2 x 2 contingency table. Statistics calculated from such tables include cell, row, and column proportions; differences in proportion; RRs; and ORs. These statistics are used to estimate associated population parameters and are selected to suit the specific aims and design of the study. For inference concerning association between study variables, one must use the correct statistic, allow for variability, and account for any confounding variables.

References
1. Langlotz CP. Fundamental measures of diagnostic examination performance: usefulness for clinical decision making and research. Radiology 2003; 228:3-9.
2. Applegate KE, Tello R, Ying J. Hypothesis testing III: counts and medians. Radiology 2003; 228:603-608.
3. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians' recommendations for cardiac catheterization. N Engl J Med 1999; 340:618-626.

4. Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med 1999; 341:279-283.
5. Motulsky H. Intuitive biostatistics. Oxford, England: Oxford University Press, 1995.
6. Riegelman RK. Studying a study and testing a test. 4th ed. Philadelphia, Pa: Lippincott Williams & Wilkins, 2000.
7. Sackett DL. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston, Mass: Little, Brown, 1991.
8. Medina LS, Zurakowski D. Measurement variability and confidence intervals in medicine: why should radiologists care? Radiology 2003; 226:297-301.
9. Cogo A, Bernardi E, Prandoni P, et al. Acquired risk factors for deep-vein thrombosis in symptomatic outpatients. Arch Intern Med 1994; 154:164-168.
10. Vachon CM, Sellers TA, Vierkant RA, Wu FF, Brandt KR. Case-control study of increased mammographic breast density response to hormone replacement therapy. Cancer Epidemiol Biomarkers Prev 2002; 11:1382-1388.
11. Burman ML, Taplin SH, Herta DF, Elmore JG. Effect of false-positive mammograms on interval breast cancer screening in a health maintenance organization. Ann Intern Med 1999; 131:1-6.
12. Harrison RV, Janz NK, Wolfe RA, et al. Personalized targeted mailing increases mammography among long-term noncompliant Medicare beneficiaries: a randomized trial. Med Care 2003; 41:375-385.
13. Agresti A. Categorical data analysis. Hoboken, NJ: Wiley, 2003.
14. Cochran WG. Some methods for strengthening the common chi-squared tests. Biometrics 1954; 10:417-451.
15. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959; 22:719-748.
16. Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc B 1951; 13:238-241.
17. Radelet ML, Pierce GL. Choosing those who will die: race and the death penalty in Florida. Florida Law Rev 1991; 43:1-34.

Volume 230 " Number 1 Proportions, Odds, and Risk " 19

Ra

dio

logy

Page 88: Radiology:Statistical Concepts Series PDF - Cornell …hpr.weill.cornell.edu/divisions/biostatistics/pdf/Radiology_Stats... · Anthony V. Proto, MD, Editor Radiology 2002ÑStatistical

Jonathan H. Sunshine, PhD
Kimberly E. Applegate, MD, MS

Index terms: Cancer screening; Efficacy study; Radiology and radiologists, outcomes studies; Technology assessment

Published online: 10.1148/radiol.2302031277

Radiology 2004; 230:309–314

1 From the Department of Research, American College of Radiology, 1891 Preston White Dr, Reston, VA 20191 (J.H.S.); Riley Hospital for Children, Indiana University Medical Center, Indianapolis (K.E.A.); and Department of Diagnostic Radiology, Yale University, New Haven, Conn (J.H.S.). Received August 10, 2003; revision requested August 19; revision received and accepted August 21. Address correspondence to J.H.S. (e-mail: [email protected]). © RSNA, 2004

Technology Assessment for Radiologists1

Health technology assessment is the systematic and quantitative evaluation of the safety, efficacy, and cost of health care interventions. This article outlines aspects of technology assessment of diagnostic imaging. First, it presents a conceptual framework of a hierarchy of levels of efficacy that should guide thinking about imaging test evaluation. In particular, the framework shows how the question answered by most evaluations of imaging tests, "How well does this test distinguish disease from the nondiseased state?" relates to the fundamental questions for all health technology assessment, "How much does this intervention improve the health of people?" and "What is the cost of that improvement?" Second, it describes decision analysis and cost-effectiveness analysis, which are quantitative modeling techniques usually used to answer the two core questions for imaging. Third, it outlines design and operational considerations that are vital if researchers who are conducting an experimental study are to make a quality contribution to technology assessment, either directly through their findings or as an input into decision analyses. Finally, it includes a separate discussion of screening, that is, the application of diagnostic tests to nonsymptomatic populations, because the requirements for good screening tests are different from those for diagnostic tests of symptomatic patients and because the appropriate evaluation methods also differ.
© RSNA, 2004

Technologic innovation and diffusion of technology into daily practice in radiology have been nothing short of remarkable in the past several decades. Health technology assessment is the careful evaluation of a medical technology for evidence of its safety, efficacy, cost, cost-effectiveness, and ethical and legal implications (1). Interest and research in health technology assessment are growing in response to the wider application of new technology and the increasing costs of health care today (2).

The goal of this article is to describe some of the rationale and the methods of technology assessment as applied to radiology. For any health care intervention, including diagnostic imaging tests, the ultimate questions are, "How much does this do to improve the health of people?" and "How much does it cost for that gain in health?" We need such an understanding of the radiology services we provide to advocate for our patients and to use our resources efficiently and effectively.

OUTCOMES

Measures of diagnostic accuracy, which are the metrics most commonly used for evaluation of diagnostic tests, answer the question, "How well does this test distinguish disease from the nondiseased state?" The answer to that question often does not provide an answer to the questions about improvement of health and the cost of that improvement, which are the core outcome questions about health care interventions (3,4).

The most productive way to think about this gap between diagnostic accuracy on the one hand and outcomes on the other hand, and to think about the inclusion of relevant outcomes in the evaluation of diagnostic tests, is to use the conceptual scheme of a six-level "hierarchy of efficacy" developed by Fryback and Thornbury (5,6) (Table). They point out that efficacy at any level in their hierarchy is necessary for efficacy at the level with the next highest number but is not sufficient. In their scheme, diagnostic accuracy is at level 2, and patient and societal outcomes are at levels 5 and 6, respectively. Thus, there may be "many a slip between cup and lip", that is, between diagnostic accuracy of an imaging test on the one hand and improved health and adequate cost-effectiveness on the other.


Let us trace partway through the schema, starting at the lowest level, to understand the principle that efficacy at one level is necessary but not sufficient for efficacy at the next level. Technical efficacy (level 1), such as a certain minimum spatial resolution, is necessary for diagnostic accuracy (level 2), but it does not guarantee it. Similarly, diagnostic accuracy is necessary if a test is to affect the clinician's diagnosis (level 3), but it is not sufficient. Rather, other sources of information, such as patient history, may dominate, so that even a highly accurate test may have little or no effect on the diagnosis. In such an instance, fairly obviously, the test does not contribute to the level 5 goal of improving patient health.

As the Table shows, there are multiple measures that can be used to quantify the efficacy of a diagnostic imaging test at any of the six levels. Hence, evaluations of imaging tests can involve a variety of measures. Thinking in terms of the hierarchy is also helpful for identification of the level(s) at which information should be obtained in an evaluation of a diagnostic imaging test. Experience, as well as reflection, has taught some lessons. The most important of these include:

1. Because higher-level efficacy is possible only if lower-level efficacy exists, it is often useful to measure efficacy at relatively low-numbered levels.

2. In particular, in the development of a test, it is helpful to measure aspects of technical efficacy (level 1), such as sharpness, noise level, and ability to visualize the anatomic structures of interest. An important aspect of test development consists of finding the technical parameters (voltage, section thickness, etc) that give the best diagnostic accuracy; these measures of technical efficacy are often key results in that process.

3. Diagnostic accuracy (level 2) is the highest level of efficacy that is characteristic of the test alone. For example, the sensitivity and specificity of a test are not dependent on what other diagnostic information is available, unlike level 3 (diagnosis). Also, the methodology and statistics used in measurement of diagnostic accuracy are relatively fully developed. Therefore, measurement of diagnostic accuracy is usually worthwhile.

4. Above diagnostic accuracy, effect on treatment (level 4), an "intermediate outcome," is relatively attractive to measure. It can be measured fairly easily and reliably in a prospective study, and it is closer in the hierarchy to the ultimate criteria, effect on patient health (level 5) and cost-effectiveness (level 6).

5. Effect on patient health (level 5) is usually observable only after a substantial delay, especially for chronic illnesses, such as cardiovascular disease and cancer, which are currently the predominant causes of mortality in the United States. Also, it is the end result of a multistep process of health care. Because diagnostic tests occur near the beginning of the process, and some random variation enters into the results at every step, the effect of a diagnostic test on final outcomes is usually difficult to observe without an inordinate number of patients. For example, the current principal randomized controlled trial of computed tomographic (CT) screening for lung cancer requires some 50,000 patients and is expected to take 8 years and cost $200 million (7). Thus, effects on patient health (level 5) and cost-effectiveness (level 6) are uncommon as end points in experimental studies on the evaluation of diagnostic tests.

CLINICAL DECISION ANALYSIS AND COST-EFFECTIVENESS ANALYSIS

Instead, assessments of imaging technologies at levels 5 and 6 of the efficacy hierarchy are generally conducted by using decision analysis rather than direct experimental studies. Decision analysis (8-11) is an objective and systematic technique for combining the results of experimental studies that cover different health care steps to estimate effects of care processes more extensive than those directly studied in any single experimental research project. Cost-effectiveness analysis is a form of decision analysis that involves evaluation of the costs of health care, as well as the outcomes (12,13). What follows is a brief explanation of clinical decision analysis and cost-effectiveness analysis and the role they may play in technology assessment in radiology. Although we concentrate on cost-effectiveness analysis, the same methods and applications apply to decision analysis.

Cost-effectiveness analysis recognizes that the results of care are rarely 0% and 100% outcomes but rather are probabilistic (14). It involves the creation of algorithms, usually displayed as decision trees, as shown in Figure 1, which incorporate probabilities of events and, often, the valuations (usually called "utilities") of possible outcomes of these events. Individual or population-based preferences for certain outcomes and treatments are factored into these utilities.

Cost-effectiveness analysis can be divided into three basic steps: defining the problem, building the decision model, and analyzing the model.

Hierarchy of Efficacy for Diagnostic Tests

Level                         Typical Measures
1, Technical efficacy         Resolution of line pairs; pixels per millimeter; section thickness; noise level
2, Diagnostic accuracy        Sensitivity; specificity; area under the receiver operating characteristic curve
3, Diagnosis                  Percentage of cases in which image is judged helpful in making the diagnosis; percentage of cases in which diagnosis made without the test is altered (or altered substantially) when information from the test is received
4, Treatment                  Percentage of cases in which image is judged helpful in planning patient treatment; percentage of cases in which treatment planned without the test is changed after information from the test is received
5, Patient health outcomes    Percentage of patients improved with test conducted compared with that improved without test conducted; percentage difference in specific morbidities with test compared with those without; mean increase in quality-adjusted life years with test compared with that without
6, Societal value             Cost-effectiveness from a societal perspective; cost per life saved, calculated from a societal perspective

Source.—Adapted and reprinted, with permission, from reference 6.

Defining the Problem

For any cost-effectiveness analysis, one of the most difficult tasks is defining the appropriate research question. The issues to address in defining the problem are the population reference case, strategies, time horizon, perspective, and efficacy (outcome) measures. The reference case is a description of the patient population the cost-effectiveness analysis is intended to cover. For example, the reference case for the cost-effectiveness analysis in Figure 1 consists of persons with acute abdominal pain seen in the emergency department.

The issue of strategies is, what are the care strategies that we should compare? Too many strategies may be confusing to compare. Too few may make an analysis suspect of missing possibly superior strategies. The decision tree in Figure 1 compares costs and outcomes of a clinical examination versus an imaging test for the diagnosis of acute appendicitis; in a fuller model, ultrasonography (US) and CT might be considered separate imaging strategies. In general, cost-effectiveness analysis and decision analysis address whether a new diagnostic test or treatment strategy should replace the current standard of care, in which case the current standard and the proposed new approach are the strategies to include. Alternatively, often the issue is which of a series of tests or treatments is best, and these then become the strategies to include.

The time horizon for which the cost-effectiveness analysis model is used to evaluate costs, benefits, and risks of each strategy must be stated and explained. Sometimes, the time horizon may be limited because of incomplete data, but this creates a bias against strategies with long-term benefits.

Finally, cost-effectiveness analysis allows costs to be counted from different perspectives. The perspective might be that of a third-party payer, in which case only insurance payments count as costs, or that of society, in which case all monetary costs, including those paid by the patient, count, and so, at least in some analyses, do nonmonetary costs, such as travel and waiting time involved in obtaining care.

Building the Cost-Effectiveness Analysis Model

Cost-effectiveness analysis is usually based on a decision tree, a visual representation of the research question (Fig 1). These decision trees are created and analyzed with readily available computer software, such as DATA (TreeAge Software, Williamstown, Mass). The tree incorporates the choices, probabilities of events occurring, outcomes, and utilities for each strategy being considered. Each branch of the tree must have a probability assigned to it, and each path in the tree must have a cost and outcome assigned. Data typically come from direct studies of varying quality, from expert opinion (which is usually unavoidable because some needed data values cannot be obtained in any other way), and from some less directly relevant literature. For example, in Figure 1, the probability of a positive test result may be selected from published literature and added to the decision tree under the branch labeled "Positive Test/Surgery." Costs are frequently not ascertained directly, but rather are estimated by using proxies such as Medicare reimbursement rates or the charge and/or cost data of a hospital. Building the decision tree requires experience and judgment.

The complexity of cost-effectiveness analysis sometimes makes it difficult to understand and therefore undervalued (14,15). One way to improve understanding and allow readers to judge for themselves the value of a cost-effectiveness analysis model is to be explicit about the assumptions of the model. Many assumptions are needed simply because of limited data available to answer the research question.

Analyzing the Cost-Effectiveness Analysis Model

Once the model has been created, analysis should then include baseline analysis of cost and effectiveness and sensitivity analysis. The average cost and effectiveness for each strategy, considering all the outcomes to which it might lead, are computed simultaneously. We calculate averages by weighting the end probabilities of each branch and by summing for each strategy by moving from right to left in the tree. In cost-effectiveness analysis decision trees such as that in Figure 1, the costs and utilities for each outcome would be placed in the decision tree at the right end of each branch.

Possible results when comparing two strategies include the following: One strategy is less expensive and more effective than another, one strategy is more expensive and less effective, one strategy is less expensive but less effective, and one strategy is more expensive but more effective. The choice in the first two situations is clear, and the better strategy is called "dominant." The final two situations involve trade-offs in cost versus effectiveness, however. In these situations, one compares strategies by using the incremental cost-effectiveness ratio, which allows evaluation of the ratio of increase in cost to increase in effectiveness. What maximal incremental cost-effectiveness ratio is acceptable is open to debate, but for the United States, $50,000-$100,000 per year of life in perfect health (usually called a "quality-adjusted life-year") is commonly recommended as a maximum.
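To make the fold-back calculation and the incremental comparison concrete, here is a small sketch. The probabilities, costs, and utilities are invented placeholders for illustration, not values from the article or from Figure 1:

    # Expected cost and effectiveness of two strategies by folding back a
    # one-level decision tree, then the incremental cost-effectiveness
    # ratio (ICER). All numbers are hypothetical.
    # Each branch: (probability, cost in dollars, utility in QALYs).
    imaging = [(0.92, 6000, 0.95),    # eg, correct diagnosis
               (0.08, 14000, 0.80)]   # eg, incorrect result
    clinical_exam = [(0.80, 4000, 0.95),
                     (0.20, 15000, 0.75)]

    def expected(branches):
        cost = sum(p * c for p, c, u in branches)
        effectiveness = sum(p * u for p, c, u in branches)
        return cost, effectiveness

    (c1, e1), (c2, e2) = expected(imaging), expected(clinical_exam)
    print((c1 - c2) / (e1 - e2))  # ICER in dollars per QALY gained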

Almost all payers in the United States state that they consider only effectiveness, not cost. Implicitly, then, they accept an indefinitely high incremental cost-effectiveness ratio; it does not matter how much more expensive a strategy is, as long as it is the least bit more effective or the public demands it intensely.

The final task in cost-effectiveness analysis is sensitivity analysis. Sensitivity analysis consists of changing "parameter values" (numerical values, such as probabilities, costs, and valuation of outcomes) in the model to find out what effect they have on the conclusions. A model should be tested in this way for "robustness," or strength of its conclusions with regard to changes in its assumptions and uncertainty in the parameters taken from the literature or expert opinion. If a small change in the value of a parameter leads to a change in the preferred strategy of the model, then the conclusion is said to be sensitive to that parameter, and the conclusion is weak. Sensitivity analysis may persuade doubtful readers of the soundness of the conclusions of the model by showing that the researchers were thorough and unbiased and the conclusions are not sensitive to the assumptions or parameters the readers question. Often, however, sensitivity analysis will show that conclusions are not robust. Alternatively, another cost-effectiveness analysis, conducted by different researchers by using different assumptions and parameters (which is really a form of sensitivity analysis), will reach different conclusions. While discouraging, a similar situation is not uncommon with experimental studies (such as clinical research), with one study having findings different from another. Also, identification of the parameters and assumptions to which the results are sensitive can be very helpful, because it tells researchers what needs to be investigated further through experimental studies to reach reliable conclusions.

Figure 1. Example of a typical imaging decision analysis tree. In this example, an imaging test is compared with clinical examination for the correct diagnosis of acute appendicitis.

CHARACTERISTICS OF HIGH-QUALITY EXPERIMENTAL STUDIES

Whether an experimental study is intended to provide direct findings (principally, as we have seen, at efficacy levels 1 through 4) or to provide findings to be used as input into decision analysis and/or cost-effectiveness analysis (which are then used to assess level 5 and 6 efficacy), several design and operational considerations are important for the study to be of high quality and substantial value (2,16-19). Regrettably, the quality of studies on the evaluation of diagnostic imaging is very often poor (20-23). Therefore, radiologists should be aware of these considerations so that they may read the literature critically and also improve the quality of the technology assessment studies they conduct.

The most important considerations follow. We focus on studies of diagnostic accuracy, since these are most common and constitute the principal focus of radiologists, but most of what is said applies to experimental studies of other levels of the hierarchy of efficacy.

Patient Characteristics

Patients in a study should be like those in whom a test will be applied in practice. Often, in initial studies, a test is applied predominantly to very sick patients or completely healthy individuals. This "spectrum bias" exaggerates the real-world ability of the test to distinguish disease from health because intermediate cases that are less than totally clear cut are eliminated. As a result, initial reports on a new test are often overly optimistic. On the other hand, such spectrum bias can be useful in initial studies to ascertain if a test has any possible promise and to help establish the operating parameters at which the test works best.

Number of Cases

The number of cases included in studies should be adequate. Almost always, the smaller the number of cases, the larger the minimum difference that can reliably be observed. Before a study is begun, a statistician should be asked to perform a power calculation to ascertain the number of cases required to detect, with desired reliability, the minimum difference regarded as clinically important. Often, the number of cases included in actual studies is inadequate (22). Such studies are referred to as "underpowered" and can lead to errors.

Design Considerations

Prospective studies are almost always preferable to retrospective studies. "Well begun is half done" carries a corollary that "poorly begun is hard to salvage." In a retrospective study, one has to work from someone else's design and data collection, and these are typically far from optimal from the standpoint of your purposes.

The temptation to include in the research everything that might be studied should be resisted, lest the study collapse from its own complexity.

Often, the purpose of a study is to compare two diagnostic tests; for example, to compare a proposed new test with an established one. In this situation, unless data on patient health outcomes and cost must be directly obtained, an optimal design consists of applying both tests to all study patients, with interpretation of each test performed while blinded to the results of the other. In contrast, the common practice of using "historical controls" to represent the performance of the established test is usually a poor choice. The patient population in the historical control may be different, and the execution of the historical series may not meet standards of current best practice.

Reference Standard

The reference standard (sometimes less formally called the "gold standard") needs to be chosen carefully. While a perfect reference standard, one with 100% accuracy, often cannot be attained, it is important to do as well as possible. Methodologists routinely warn (4,22,24) that a reference standard that is dependent, even in part, on the test(s) being evaluated involves circular reasoning, and they say it is therefore seriously deficient, but they note that such standards are nonetheless not infrequently used.

Timing

Timing is important because diagnostic imaging is a field that is changing relatively rapidly. There is little point in undertaking a large-scale study when a new technique is in the initial developmental stage and is changing particularly rapidly; results will be obsolete before they are published. On the other hand, it is not wise to wait until a technique is fully mature because, by then, it will often be widely disseminated, making the study too late for its results to readily influence general clinical practice. Use of techniques that lead to rapid completion of a study, such as gathering data from multiple sites, is highly desirable because imaging evolves relatively rapidly.

Efficacy and Effectiveness

Most evaluations of diagnostic tests, and of any other medical care, are studies of efficacy, which is defined as results obtained under ideal conditions, such as those of a careful research project. Initially, efficacy is important to ascertain, but ultimately, one would want to know effectiveness, which is defined as results obtained in ordinary practice. Effectiveness is usually poorer than efficacy. For example, studies in individual academic institutions (that is, efficacy studies) showed that abdominal CT for patients suspected of having appendicitis significantly reduced the perforation rate and unnecessary surgery rate (25,26), but a study of essentially all hospital discharges in Washington state (that is, an effectiveness study) showed no improvement in either rate between 1987 and 1998, a period when laparoscopy and cross-sectional imaging techniques, including CT, became widely available (27). The systematization necessary for an organized study tends to preclude observation of effectiveness: the study protocol ensures uniform application of the test with its parameters set at optimal levels, and people are generally more careful and consistent and do better when they know their activity is being observed (this is called the Hawthorne effect).

Figure 2 lists some additional important considerations for high-quality studies. Sunshine and McNeil (16) discuss the above considerations and those in Figure 2 in more detail.

Figure 2. Additional procedures for enhancement of study quality and rapidity, with particular reference to a study of substantial scale.

SCREENING

Screening (28,29) is the performance of a diagnostic test in an asymptomatic population with the aim of reducing morbidity and/or mortality from disease. The requirements of efficacious screening are somewhat different from those of "conventional" diagnostic testing, that is, testing applied to symptomatic patients. These differences apply to the diagnostic test, available treatment, and evaluation of the test.

The Test

Because the prevalence of disease in a screening population is very low (for example, approximately one-half percent in screening mammography), a screening test must be highly specific. Otherwise, false-positive findings will greatly outnumber true-positive findings (even at the relatively high 90%-95% specificity rate for mammography, ie, a 5%-10% recall rate, false-positive findings outnumber true-positive findings by 10-20 to 1), and the cost and morbidity of working up patients with false-positive findings will outweigh the gains from early detection in those with true-positive findings. Similarly, the cost and morbidity of the screening test itself (which apply to every patient screened) must be relatively low; otherwise, they will outweigh the gains of screening, which can occur only for the very small percentage of patients with true-positive findings.

In contrast, sensitivity can be modest. Forexample, screening mammography has anapproximate 75% sensitivity, yet it allows usto identify three of every four possible breastcancers that could be detected if the test wereperfectly (100%) sensitive. These require-ments for a screening test can be somewhateased if a high-risk population is identified,because the proportion of true-positive find-ings will increase. Note that while a screen-ing test optimally has high specificity andmay only need modest sensitivity, an opti-mal diagnostic test for symptomatic patientsshould have a high sensitivity, but the spec-ificity may be modest.

Treatment

Oddly, the available treatment must be intermediate in efficacy. If treatment is fully efficacious (more specifically, if treatment of symptomatic patients is as efficacious and no more costly than the presymptomatic treatment made possible by screening), then nothing is to be gained by identifying disease before it becomes symptomatic. Conversely, if treatment is completely inefficacious, that is, there is no useful treatment for even presymptomatic disease, there is also no possible gain from screening. Screening can only be beneficial if treatment of presymptomatic disease is more efficacious than treatment of symptomatic disease (29-31). (However, some hold that screening for untreatable genetic diseases and other untreatable diseases can be reasonable because parents can alter reproductive behavior and patients can gain more time to prepare for the consequences of disease.) Given these requirements regarding treatment effectiveness for screening to be sensible, new developments in treatment, for example, the introduction of pharmaceuticals such as donepezil hydrochloride (Aricept; Eisai America, New York, NY) that slow the previously unalterable rate of progression of Alzheimer disease, can completely alter the relevance of screening.

Evaluation of Screening

In general, the efficacy of treatment of presymptomatic disease relative to that of symptomatic disease is not known, although this is a critical issue for screening, as indicated in the previous paragraph. The reason for the lack of knowledge is as follows: if screening has not been done previously, relative efficacy simply is not known because presymptomatic cases have not been identified and treated. On the other hand, if the issue is introduction of a more sensitive screening test, one does not know the efficacy of treating the additional, presumably less advanced cases the new test detects. Partly for this reason, evaluation of screening generally has to consist of a randomized controlled trial in which (a) the intervention consists of the test and the treatment in combination and (b) the end point studied is the death rate, morbidity, or other adverse outcome(s) from the disease being screened for in the intervention population compared with the rates in the control population.

Biases

Three well-known biases (30,32,33) also generally necessitate this randomized controlled trial study design for evaluation of screening tests and generally preclude the use of other end points, such as 5-year survival from time of diagnosis. These three biases should be understood by all radiologists.

"Lead-time bias" refers to the fact that screening will allow detection of disease earlier in its natural history than will waiting for symptoms, so any measurement from time of diagnosis will be biased in favor of screening, regardless of the effectiveness of treatment. Consider an oversimplified example: For lung cancer, 5-year survival from diagnosis is currently 10%-20%. Assume that CT screening advances diagnosis by 5 1/2 years, but treatment has absolutely no value. Then 5-year survival would nonetheless increase to essentially 100% with screening. In short, survival time in a screened group will incorrectly appear to be better than that in a nonscreened group.
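The bias can be made concrete with two lines of arithmetic (a hypothetical illustration of ours, using the 5 1/2-year figure above):

    # Lead-time bias: survival measured from diagnosis lengthens by the
    # lead time even when death is not postponed at all.
    death_age = 70.0
    dx_age_symptoms = 68.0          # diagnosis when symptoms appear
    dx_age_screening = 68.0 - 5.5   # screening advances diagnosis 5.5 years
    print(death_age - dx_age_symptoms)   # 2.0 years of apparent survival
    print(death_age - dx_age_screening)  # 7.5 years, with no true benefit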

"Overdiagnosis bias" or "pseudodisease" (29,31) refers to the fact that applying a diagnostic test to asymptomatic individuals will identify "positive cases" that will never become clinically manifest in a person's lifetime. Prostate cancer provides a striking example. It is the most common nonskin malignancy in men in the United States, affecting 10% of them, but careful histopathologic examination at autopsy shows microscopic prostate cancers in nearly 50% of men over the age of 75 years (34). If an imaging test as sensitive as histologic examination at autopsy were developed, but early detection had absolutely no effect on outcomes, the percentage of "cases" showing adverse outcomes would nonetheless decrease by four-fifths, but only because four-fifths of the "cases" never would have shown any effects of the disease in the absence of screening and treatment. The general point is that, because of overdiagnosis bias, any study of the outcome of cases identified with a screening test will be biased toward screening, for many of the cases identified with screening would never have had any adverse outcomes, even in the absence of treatment. Incidentally, the morbidity and cost of treating such cases is one of the negative consequences of screening.

"Length bias" can be thought of as an attenuated form of pseudodisease. It arises because cases of a disease vary in aggressiveness, with the faster-progressing cases typically also having a natural history with greater morbidity and mortality. Cases detected with screening are typically disproportionately indolent. This is because slow-progressing cases remain longer in the presymptomatic phase in which they are detectable only with screening and do not manifest symptoms. Thus, a test that helps identify asymptomatic cases disproportionately uncovers indolent cases, as Figure 3 shows. Hence, cases detected with screening disproportionately have a relatively favorable prognosis, regardless of the effectiveness of treatment. Thus, any study of outcomes in cases detected with screening (vs those detected when symptoms occur) will be biased toward screening.

Figure 3. Example of length bias. Half of the cases are the more indolent form (longer preclinical phase, longer symptomatic phase, and less severe adverse events, as shown by a smaller x). At any point in time (t1 and t2 are randomly chosen points in time), however, two-thirds of the cases detectable only with screening are indolent.

Other Considerations

While change in morbidity or mortality from the disease being screened for is the prime measure of the effect of screening, changes in other morbidity and mortality possibly caused by screening and/or treatment should also be considered. Concerns of this type include surgical complications, chemotherapy toxicity, radiation treatment-induced secondary cancers, radiation dose from screening, patient anxiety, and changes in patient satisfaction.

The percentage reduction in the risk of an adverse effect from the disease being screened for, called "relative risk reduction," is a common measure of the benefit of screening, but this measure needs to be set in context (35). For example, if screening reduces an individual's risk of dying of a particular disease over the next decade from 1.0% to 0.4%, that is a 60% decrease in relative risk, but only 0.6 of a percentage point increase in the probability of surviving the decade.

In conclusion, for any health care intervention, including diagnostic imaging tests, the ultimate questions are, "How much does this do to improve the health of people?" and "How much does it cost for that gain in health?" By using the methods described in this article, we have the ability to answer these questions as we assess the remarkable imaging technologies available today.

References
1. Perry S, Thamer M. Medical innovation and the critical role of health technology assessment. JAMA 1999; 282:1869-1872.
2. Eisenberg J. Ten lessons for evidence-based technology assessment. JAMA 1999; 282:1865-1869.
3. Clancy CM, Eisenberg JM. Outcomes research: measuring the end results of health care. Science 1998; 282:245-246.
4. Hillman BJ. Outcomes research and cost-effectiveness analysis for diagnostic imaging. Radiology 1994; 193:307-310.
5. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88-94.
6. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. AJR Am J Roentgenol 1994; 162:1-8.
7. American College of Radiology Imaging Network. Contemporary screening for the detection of lung cancer. Available at: www.acrin-nlst.org/6654factsheet.html. Accessed August 20, 2003.
8. Weinstein MC, Fineberg HV. Clinical decision analysis. Philadelphia, Pa: Saunders, 1980.
9. Hunink MG, Glasziou PP, Siegel J, et al. Decision making in health and medicine: integrating evidence and values. Cambridge, England: Cambridge University Press, 2001.
10. Chapman GB, Sonnenberg FA. Decision making in health care: theory, psychology, and applications. Cambridge, England: Cambridge University Press, 2000.
11. Janne D'Othee B, Black WC, Pirard S, Zhuang Z, Bettman MA. Decision analysis in radiology. J Radiol 2001; 82:1693-1698.
12. Singer ME, Applegate KE. Cost-effectiveness analysis in radiology. Radiology 2001; 219:611-620.
13. Soto J. Health economic evaluations using decision analytic modeling: principles and practices—utilization of a checklist to their development and appraisal. Int J Technol Assess Health Care 2002; 18:94-111.
14. Kleinmuntz B. Clinical and actuarial judgment. Science 1990; 247:146-147.
15. Tsevat J. SMDM presidential address: hearsay or heresy—are health decision scientists too left-brained? Med Decis Making 2003; 23:83-87.
16. Sunshine JH, McNeil BJ. Rapid method for rigorous assessment of radiologic imaging technologies. Radiology 1997; 202:549-557.
17. Baum RA, Rutter CM, Sunshine JH, et al. Multicenter trial to evaluate vascular magnetic resonance angiography of the lower extremity. JAMA 1995; 274:875-880.
18. Lilford RJ, Pauker SG, Braunholtz DA, Chard J. Decision analysis and the implementation of research findings. BMJ 1998; 317:405-409.
19. Blackmore CC. The challenge of clinical radiology research. AJR Am J Roentgenol 2001; 176:327-331.
20. Kent DL, Haynor DR, Longstreth WT, Larson EB. The clinical efficacy of magnetic resonance imaging in neuroimaging. Ann Intern Med 1994; 120:856-871.
21. Holman BL. The research that radiologists do: perspective based on a survey of the literature. Radiology 1990; 176:329-332.
22. Blackmore CC, Black WC, Jarvik JG, Langlotz CP. A critical synopsis of the diagnostic and screening radiology outcomes literature. Acad Radiol 1999; 6:S8-S18.
23. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative. Radiology 2003; 226:24-28.
24. Sox H, Stern S, Owens D, Abrams HL. Monograph of the council on health care technology: assessment of diagnostic technology in health care—rationale, methods, problems, and directions. Washington, DC: National Academy Press, 1989.
25. Rao PM, Rhea JT, Novelline RA, Mostafavi AA, Lawrason JN, McCabe CJ. Helical CT combined with contrast material administered only through the colon for imaging of suspected appendicitis. AJR Am J Roentgenol 1997; 169:1275-1280.
26. Sivit CJ, Siegel MJ, Applegate KE, Newman KD. When appendicitis is suspected in children. RadioGraphics 2001; 21:247-262.
27. Flum DR, Morris A, Koepsell T, Dellinger EP. Has misdiagnosis of appendicitis decreased over time? a population based analysis. JAMA 2001; 286:1748-1753.
28. Herman CR, Gill HK, Eng J, Fajardo LL. Screening for preclinical disease: test and disease characteristics. AJR Am J Roentgenol 2002; 179:825-831.
29. Black WC, Welch HG. Screening for disease. AJR Am J Roentgenol 1997; 168:3-11.
30. Black WC, Ling A. Is earlier diagnosis really better? the misleading effects of lead time and length biases. AJR Am J Roentgenol 1990; 155:625-630.
31. Morrison AS. Screening in chronic disease. New York, NY: Oxford University Press, 1992; 125-127.
32. Black WC, Welch HG. Advances in diagnostic imaging and overestimation of disease prevalence and the benefits of therapy. N Engl J Med 1993; 328:1237-1243.
33. Morrison AS. The effects of early treatment, lead time and length bias on the mortality experienced by cases detected by screening. Int J Epidemiol 1982; 11:261-267.
34. Brant WE, Helms CA, eds. Fundamentals of diagnostic radiology. 2nd ed. Philadelphia, Pa: Lippincott Williams & Wilkins, 1999; 825.
35. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med 1988; 318:1728-1733.


John Eng, MD

Index terms: Radiology and radiologists, research; Receiver operating characteristic (ROC) curve; Statistical analysis

Published online: 10.1148/radiol.2303030297

Radiology 2004; 230:606–612

Abbreviations: Az = area under ROC curve; CNR = contrast-to-noise ratio; ROC = receiver operating characteristic

1 From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, Central Radiology Viewing Area, Room 117, 600 N Wolfe St, Baltimore, MD 21287. Received February 21, 2003; revision requested April 10; revision received July 18; accepted July 21. Address correspondence to the author (e-mail: [email protected]). © RSNA, 2004

Sample Size Estimation: A Glimpse beyond Simple Formulas1

Small increments in the complexity of clinical studies can readily take sample size estimation and statistical power analysis beyond the capabilities of simple mathematic formulas. In this article, the method of simulation is presented as a general technique with which sample size may be calculated for complex study designs. Applications of simulation for determining sample size requirements in studies involving correlated data and comparisons of receiver operating characteristic curves are discussed.
© RSNA, 2004

In a previous article in this series (1), I discussed the fundamental concepts involved in determining the appropriate number of subjects that should be included in a clinical investigation. This number is known as the sample size. In the earlier article (1), the necessity for considering sample size, how certain study design characteristics affect sample size (Fig 1), and how to calculate sample size for several simple study designs were discussed. Also discussed was how sample size is related to statistical power, which is the sensitivity of detecting a statistically significant difference in a comparative study when a difference is truly present.

In this article, I will first discuss some important consequences of sample size and power calculations, then go on to discuss issues that arise when these basic principles are applied to real clinical investigations, which are often more complex than the simple situations covered in the previous article and in introductory biostatistics textbooks. My intent is to provide an overview and appreciation of some of the advanced statistical methods for handling some of the complex situations that arise. Since advanced statistical methods for sample size or power calculations cannot receive comprehensive treatment in the setting of an overview article, an investigator needing such methods is advised to seek help from a statistician early in the research project. However, I hope the material herein will at least help bridge the knowledge gap between investigator and statistician so that their interaction can be more productive.

CONSEQUENCES OF SAMPLE SIZE CALCULATIONS

Academic and Ethical Importance

In conjunction with a well-defined research question (2), an adequate sample size can help ensure an academically interesting result, whether or not a statistically significant difference is eventually found in the study. The investigator does not have to be overly concerned that the study will only be interesting (and worth the expenditure of resources) if its results are "positive." For example, suppose a study is conducted to see if a new imaging technique is better than the conventional one. Obviously, the study would be interesting if a statistically significant difference was found between the two techniques. But if no statistically significant difference is found, an adequate sample size allows the investigator to conclude that no clinically important difference was found rather than wonder whether an important difference is being hidden by an inadequate sample size.

An inadequate sample size also has ethical implications. If a study is not designed to include enough individuals to adequately test the research hypothesis, then the study unethically exposes individuals to the risks and discomfort of the research even though there is no potential for scientific gain. Although the connection between research ethics and adequate sample size has been recognized for at least 25 years (3), the performance of clinical trials with inadequate sample sizes remains widespread (4).

Practical Consequences of Mathematic Properties

A more intuitive understanding of the determinants of sample size can be obtained through closer inspection of the formulas for sample size. We saw in the previous article (1) that when the outcome variable of a comparative study is a continuous value for which means are compared, the appropriate sample size (5) is given by

$$N = \frac{4\sigma^2 (z_{\mathrm{crit}} + z_{\mathrm{pwr}})^2}{D^2}, \qquad (1)$$

where N is the total sample size (ie, the total of the two comparison groups), D is the smallest meaningful difference between the two means being compared, σ is the SD of each group, and z_crit and z_pwr are constants determined by the specified significance criterion (eg, .05) and desired statistical power (eg, .8), respectively. Since z_crit and z_pwr are independent of the properties of the data, sample size depends only on the ratio between the smallest meaningful difference and the SD (Fig 2).
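Equation (1) is simple enough to compute directly rather than read off a graph. The following is a minimal sketch in Python (an illustration added here, not code from the article; the function name and defaults are assumptions), with z_crit and z_pwr obtained as standard normal quantiles:

```python
from scipy.stats import norm

def total_n_for_means(D, sd, alpha=0.05, power=0.80):
    """Total sample size (both groups combined) per Equation (1)."""
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided .05 criterion
    z_pwr = norm.ppf(power)           # 0.84 for a power of .8
    return 4 * sd**2 * (z_crit + z_pwr)**2 / D**2

# Example: smallest meaningful difference of 10 mm Hg with an SD of 20 mm Hg
print(total_n_for_means(10, 20))  # roughly 126 subjects in total (63 per group)
```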

Furthermore, because sample size varies with the inverse square of this ratio, anything that can be done to decrease the SD or increase the meaningful difference can substantially reduce the required sample size. The SD could be decreased by reducing measurement variability (eg, by using more precise instruments or procedures) and/or by selecting a more homogeneous study population. The meaningful difference could be increased by employing more sensitive instruments or procedures.

Another property of the comparison of means is that for a given SD, only the arithmetic difference between the comparison groups affects the sample size. For example, the sample size would be the same for detecting a systolic blood pressure difference of 10 mm Hg whether it is to be measured in normotensive individuals (eg, 110 vs 120 mm Hg) or hypertensive individuals (eg, 170 vs 180 mm Hg).

When proportions are being compared—a common task in clinical imaging research—the sample size depends on both the smallest meaningful difference between the proportions and the size of the proportions themselves. That is, when proportions are being compared, in contrast to when means are being compared, the sample size depends not just on the difference alone. The sample size increases dramatically as the meaningful difference between proportions is made smaller (Fig 3). The sample size also increases if the two proportions being compared (ie, the mean of the two proportions) are close to 0.5.
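This dependence on the proportions themselves can be seen in the standard normal-approximation formula for comparing two proportions. The article does not print that formula, so the following Python sketch should be read as the textbook approximation rather than as the author's exact method:

```python
from scipy.stats import norm

def n_per_group_for_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per group for detecting p1 vs p2 (normal approximation)."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_pwr = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_crit * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_pwr * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# The same difference of 0.10 costs far more near 0.5 than away from it:
print(n_per_group_for_proportions(0.50, 0.40))  # roughly 387 per group
print(n_per_group_for_proportions(0.20, 0.10))  # roughly 199 per group
```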

Retrospective Power Analysis

In sample size calculations, appropriate values for the smallest meaningful difference and the estimated SD are often difficult to obtain. Therefore, the formulas are sometimes applied after the study is completed, when the difference and SD actually observed in the study can be substituted in the appropriate sample size formula. Since sample size is also known after the study is completed, the formula will yield statistical power. In this case, power refers to the sensitivity of the study to enable detection of a statistically significant difference of the magnitude observed in the study. This activity, known as retrospective power analysis, is sometimes performed to aid in the interpretation of the statistical results of a study. If the results were not statistically significant, the investigator might explain the result as being due to a low power.

However, it can be shown that the retrospective power—essentially an observed quantity—is inversely related to the observed P value (6). The retrospective power tends to be large in any study with a small (statistically significant) observed P value. Conversely, the retrospective power tends to be small in any study with a large (statistically nonsignificant) observed P value. Therefore, the observed retrospective power cannot provide any information in addition to the observed P value (7,8). The important point is that the smallest meaningful difference is not the same as the observed difference: The former must be set before the study is conducted and is not determined after the study is completed.

Figure 1. Study design characteristics that affect sample size and statistical power.

Figure 2. Graph shows the relationship between sample size and the ratio of meaningful difference to SD for studies in which means are compared. The graph was created by using Equation (1) and illustrates the fact that sample size increases steeply (with the inverse square of the ratio) as the ratio decreases.

Even though calculating the retrospective power is problematic, it remains important to consider the issue of adequate sample size when one is faced with a study whose results indicate there is no difference between comparison groups. Fortunately, several statistical approaches are available to guide the reader in terms of whether or not to “believe” a study that yields negative results (9). These approaches involve calculating CIs or performing χ² tests.

USE OF SIMULATION TO DETERMINE SAMPLE SIZE FOR COMPLEX STUDY DESIGNS

In contrast to the importance of considering sample size and statistical power, relatively few formulas exist for calculating them (10). The simplest formulas, such as those discussed previously (1) and in many introductory biostatistics textbooks, concern the estimation and comparison of means and proportions and are, fortunately, applicable to many situations in clinical radiology research. Beyond these simple formulas, methods have been established to determine sample size for general fixed-effect linear statistical models (of which the t test, ordinary linear regression, and analysis of variance are special cases), two-way contingency tables (of which the analysis of a 2 × 2 table with the χ² test is a special case), correlation coefficient analysis, and simple survival analysis (10). Approximations exist for some other statistical models, most notably logistic regression, but the accuracy of these approximations may be difficult to establish in all situations.

Thus, the list of all statistical tests for which exact sample size calculation methods exist is much smaller than the list of all statistical tests. When no formula exists, as often happens for moderately complex statistical designs, the investigator may try to perform a sample size analysis for a simplified version of the study design and hope that the sample size can be extrapolated to the actual (more complex) study design being planned.

For situations without corresponding formulas, it is becoming more common to estimate sample size by using the technique of simulation (11). The simulation approach is powerful because it can be applied to almost any statistical model, regardless of the model's complexity. In simulation, a mathematic model is used to generate a synthetic data set simulating one that might be collected in the study being planned. The mathematic model contains the dependent and independent variables being measured, along with estimates of each variable's SD. The synthetic data set contains the same number of subjects as the planned sample size.

The planned statistical analysis is performed with this synthetic data set, and a P value is determined. As usual, the null hypothesis is rejected if the P value is less than a certain criterion value (eg, P < .05). This process is repeated a large number of times (perhaps hundreds or thousands of times) by using the mathematic model to generate a different synthetic data set for each iteration. The statistical power is equal to the percentage of these data sets in which the null hypothesis is rejected. In effect, simulation employs a mathematic model to virtually repeat the study an arbitrarily large number of times, allowing the statistical power to be determined essentially by direct measurement.

Since a real data set would contain random statistical error, random statistical error must be modeled in the synthetic data sets. To accomplish this in simulation, a random-number generator is used to add random error (“noise”) to each synthetic data set. Because of their heavy reliance on random-number generators, simulation methods are also known as Monte Carlo methods, after the city in which random numbers also play an important role.

Figure 3. (a) Graph shows the relationship between sample size and the proportions being compared in a study involving comparison of proportions. Sample size increases dramatically as the meaningful difference decreases. Sample size also increases if the proportions being compared (ie, the mean of the two proportions) are near 0.50. (b) Extension of the circled corner of the graph in a, with x and y axes magnified; this corner corresponds to a region that is of particular interest in the design of clinical investigations.

Let us consider a simple example. Suppose we are planning a clinical study to compare the contrast-to-noise ratio (CNR) between two magnetic resonance imaging pulse sequences used to examine each subject in a group of subjects. We would like to know the statistical power of the study to detect a smallest meaningful CNR difference of 2. We would like to plan a study with a power of .8 for detecting this smallest meaningful difference. We have resources to comfortably recruit and evaluate approximately 12 subjects. Suppose that from our previous experience with the pulse sequences, we estimate the SD of the CNR difference to be 4.

The statistical model for this study is

$$\mathrm{CNR}_i = D + \varepsilon_i, \qquad (2)$$

where CNR_i is the observed CNR difference for subject i (of the 12 subjects), D is the true CNR difference (in this example, 2), and ε_i is the random error associated with the observation of subject i. To run the simulation, we use a normally distributed random-number generator for ε_i that generates a different normally distributed random number for each of the 12 observations. The mean of the numbers generated by the random-number generator is 0 and the SD is 4, which we estimated on the basis of previous experience. With these 12 random numbers, we can generate a synthetic data set of 12 observations by using Equation (2). The simulated data set is then subjected to a t test, and the resulting P value is recorded.

The entire simulation process is then repeated, say, 1,000 times. The P value is recorded after each iteration. After completing the iterations, the P values are examined to determine what proportion of the iterations resulted in the detection of a statistically significant difference (indicated by P < .05); this proportion is equal to the power. The simulation for this example was performed with Stata version 7.0 (Stata, College Station, Tex), and the results are shown in the first line of the Table. In this example, the null hypothesis is rejected in 343 of the 1,000 iterations. Therefore, the statistical power of the t test, given the conditions of this example, is .34 (Table).
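The author used Stata; the same Monte Carlo loop can be sketched in Python as follows. This is an illustration added here (the seed and the use of a one-sample t test on the within-subject differences are assumptions of the sketch); it should reproduce a power near .34, up to Monte Carlo error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility
D, sd, n_subjects, n_iter = 2.0, 4.0, 12, 1000

n_rejected = 0
for _ in range(n_iter):
    cnr = D + rng.normal(0.0, sd, size=n_subjects)  # Equation (2)
    p_value = stats.ttest_1samp(cnr, 0.0).pvalue    # H0: true CNR difference is 0
    n_rejected += p_value < 0.05

print(n_rejected / n_iter)  # estimated power, approximately .34
```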

Obviously, it would have been easier to use the formula for comparison of means. But the advantage of simulation is the ability to consider more complex statistical models for which there are no simple sample size formulas. This ability is especially important because seemingly small changes in study design can cause the simple sample size formulas to become invalid.

Returning to the example, we note that the estimated power of our study is lower than desired. The only way to improve the power, given our assumptions, is to increase the number of observations. (For the moment, we only have resources to study 12 subjects.) So, we decide to make four measurements of CNR difference per subject. This strategy will increase the number of observations by a factor of four and will result in an increase in power. However, it is important to realize that this data collection strategy is not the same as increasing the number of subjects by a factor of four, because the four observations within each subject are not independent of one another. Within each subject, the observations are likely to be more similar to each other than to the observations in the other subjects. In statistical terms, this lack of independence is called correlation.

Because of correlation, an additional observation in the same subject does not provide as much additional information as an additional observation in a different subject. The more similar the observations within each subject are, the less additional information will be provided by the repeated observation. If the observations within each subject are identical (100% correlated), then the study would have the same results (and sample size) as it would without the repeated observations, so there would be no benefit from repeating the measurement for the same subjects. Conversely, if the repeated observations within each subject were completely uncorrelated (0% correlation), then the results (and sample size) would be identical to those of a study with the same total number of observations but with enough additional subjects that only one observation per subject is used.

Simulation can easily account for the correlation of the four observations within each subject. The statistical model used is a slight variation of Equation (2):

$$\mathrm{CNR}_{ij} = D + \varepsilon_{ij}, \qquad (3)$$

where CNR_ij is the observed CNR for subject i (of the 12 subjects) and repetition j (of the four repetitions), D is the true CNR difference, and ε_ij is the random error associated with each of the 48 observations. As in Equation (2), ε_ij is generated by a normally distributed random-number generator having a mean of 0 and an SD of 4. In Equation (3), however, ε_ij is calculated in such a way that each error term ε_ij is correlated with the other error terms within the same subject. Correlation of the error terms is the mathematic mechanism for generating correlation in the observations. The amount of correlation is indicated by the correlation coefficient ρ. In this example, we assume a moderate amount of correlation (ρ = 0.5) between observations made within each subject. The results of the simulation are shown in the Table.
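The article does not specify how the correlated errors were generated; one common mechanism (assumed here for illustration) is to build each error from a subject-level component with variance ρσ² plus an observation-level component with variance (1 − ρ)σ², which yields exactly the exchangeable correlation ρ within each subject:

```python
import numpy as np

def simulate_cnr(D=2.0, sd=4.0, rho=0.5, n_subjects=12, n_reps=4, rng=None):
    """Synthetic data per Equation (3), with corr(eps_ij, eps_ik) = rho."""
    rng = rng or np.random.default_rng()
    # Shared subject-level error plus independent observation-level error
    subject_term = rng.normal(0.0, sd * np.sqrt(rho), size=(n_subjects, 1))
    obs_term = rng.normal(0.0, sd * np.sqrt(1.0 - rho), size=(n_subjects, n_reps))
    return D + subject_term + obs_term  # array of shape (n_subjects, n_reps)
```

Each error then has total variance σ², and any two observations in the same subject share the subject-level term, so their correlation is ρ.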

With an ordinary t test, there appears to be enough power in the proposed study design (Table). But an ordinary t test is inappropriate in this case because it treats each of the 48 observations as independent, ignoring the correlation between the four observations within each subject. An appropriate method that accounts for correlation is linear regression with an adjustment for clustering. When this type of linear regression is applied instead of the t test, the simulation reveals that the power is actually .5 (Table), which is lower than the desired power of .8. Results of further simulations indicate that increasing the number of subjects from 12 to 22 would result in adequate power (Table).
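One iteration of the cluster-adjusted analysis might look like the following sketch, which reuses simulate_cnr from the preceding sketch and applies cluster-robust standard errors as implemented in statsmodels (the article does not name a specific software implementation, so this is one of several reasonable choices):

```python
import numpy as np
import statsmodels.api as sm

data = simulate_cnr(n_subjects=12, n_reps=4)  # from the sketch above
y = data.ravel()                              # 48 correlated observations
groups = np.repeat(np.arange(12), 4)          # subject id for each observation
X = np.ones((y.size, 1))                      # intercept only: the mean CNR difference
fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print(fit.pvalues[0] < 0.05)                  # True if this iteration rejects H0
```

Repeating this inside the 1,000-iteration loop and averaging the rejections estimates the powers reported in the Table.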

Results of Simulations of Hypothetical Study in Which Difference between Two Imaging Techniques Is Being Compared within Each Subject

No. of Subjects | No. of Observations per Subject | Correlation between Observations within Each Subject | Statistical Test | No. of Simulated Data Sets | No. of Statistically Significant Results | Power of Statistical Test
12 | 1 | 0.0 | t test | 1,000 | 343 | .34
12 | 4 | 0.5 | t test* | 1,000 | 846 | .85
12 | 4 | 0.5 | Regression† | 1,000 | 528 | .53
22 | 4 | 0.5 | Regression† | 1,000 | 829 | .83

* Not the appropriate statistical test for these data, but done for purposes of illustration.
† Linear regression with adjustment for clustering by subjects.

A discussion of statistical tests that adjust for correlation within subjects is beyond the scope of this article. However, without a simple formula for sample size, and even without extensive knowledge of the statistical test to be used, simulation still enabled the accurate determination of power in the preceding example; this demonstrates the utility and generalizability of simulation. In addition, simulation made it possible to examine the effects of using potentially inappropriate statistical analyses.

One of the barriers to performing simulation is the requirement of iterative computation, which in turn requires fast computers. This barrier is becoming much less important as the speed of commonly available computers continues to increase. Even when the barrier of computational speed is overcome, simulation is successful only if the assumed statistical model accurately describes the study design being evaluated. Therefore, appropriate attention must be paid to establishing the model's validity. Fortunately, it is often easier to develop a mathematic model for a statistical situation (from which it is a straightforward process to determine power and sample size with simulation) than to search for a specific method or formula, if one even exists, for calculating sample size. In the preceding example, the introduction of correlation substantially increased the complexity of the analysis from a statistical point of view but caused only a minor change in the mathematic model and the subsequent simulation.

SAMPLE SIZE CALCULATIONS FOR READER STUDIES

A reader study involving the calculation of receiver operating characteristic (ROC) curves is another kind of study design that is fairly common in radiology and for which sample size and statistical power are difficult to determine. The area under the ROC curve (Az) is commonly used as an indicator of the accuracy of a diagnostic test. A typical ROC study involves multiple readers interpreting images obtained in the same group of subjects who have all undergone imaging with two or more imaging techniques or conditions. The purpose of the study is to compare the imaging techniques. The difficulty in determining sample size and statistical power is a result of the fairly complicated computational process required to calculate Az and the complicated correlations among the observations. In such a study, each reader generates multiple observations, and, likewise, each subject has a part in multiple observations. Therefore, correlation can occur simultaneously among the observations (readings) within the same reader and among the observations within the same subject.

One approach to sample size analysis in complex ROC studies involves an approximation performed by using the F distribution (12,13). Sample size tables created by using this method have been published (14); this method can also be used to calculate sample sizes for situations not addressed by such tables (Appendix). The method may be used to examine the trade-off between sample size, smallest meaningful difference, and number of readers (Fig 4). For most clinical investigations, it is likely to be difficult to include more than 10 readers or 100 cases. Given these constraints, we see that any ROC study will require at least four readers, even with a large meaningful difference of 0.15 in Az. At the other extreme, the smallest meaningful difference in Az that can be detected with 10 readers and 100 cases is 0.07. These two generalizations are based on many assumptions (Fig 4). More cases or readers are required if the interobserver variability or intraobserver variability is higher than assumed. Fewer cases or readers are required if the average Az (ie, accuracy of the readers) is higher than assumed.

Another major method for analyzing data from an ROC study with multiple readers and multiple cases is the jackknifed pseudovalue method (15). In this method, the data are mathematically transformed into pseudovalues that can be analyzed by using a relatively straightforward analysis of variance. Reducing the problem of ROC analysis to a more manageable analysis of variance is a strength of the pseudovalue method. A disadvantage is the lack of exact, or even approximate, formulas for determining sample size and statistical power. The performance and validity of the pseudovalue method have been examined with simulations (16,17), so simulation could also provide a viable method for determining sample size and power for the pseudovalue method.

CONCLUSION

In contrast to the wide variety of statistical tools available to the clinical investigator, relatively few formulas exist for the exact calculation of the sample size and statistical power of a given study design. As demonstrated by the example of correlated data, a frequent occurrence in clinical research, it is relatively easy to construct a study design for which no simple formula exists. The availability of fast computers makes the iterative process of simulation a viable general method for performing sample size and power analysis for complex study designs.

Figure 4. Graph shows the relationship between sample size, smallest meaningful difference, and number of readers in an ROC study in which Az is used as the index of accuracy. Sample sizes were calculated by using the method described in the Appendix. The values used for all variables except J and δ are the same as those in the example in Table A1.

At first glance, simulation may appear artificial and therefore suspicious because it relies on an equation and many assumptions about the terms in the equation, particularly the terms related to the variability of the components of the model. It should be noted, however, that similar (although perhaps less complex) mathematic models are the foundation of most statistical analyses—even simple ones like comparing means with a t test. Furthermore, estimates of variance are also required in sample size and power analysis for simple analyses like the t test. The reason for the large number of assumptions in simulations has more to do with the complexity of the data set being simulated than with the method of simulation itself.

In addition to the factors usually mentioned as affecting sample size, correlation among observations within groups due to nonindependent sampling can also increase sample size and decrease statistical power. Therefore, when planning the sample size, one should take care to account for potential correlation in the data set.

APPENDIX

Although tables for the determination of sample size in ROC studies are available (14), a practical presentation of the equations underlying these tables may be helpful for situations not addressed by the published tables. The equations necessary for calculating sample size in an ROC study that involves a number of readers interpreting images obtained in the same group of subjects who have all undergone imaging with two techniques are as follows (13,14):

$$\sigma_c^2 = \frac{\dfrac{J\delta^2}{2\lambda} - \sigma_b^2(1 - r_b) - \dfrac{\sigma_w^2}{K}}{(1 - r_1) + (J - 1)(r_2 - r_3)}, \qquad (A1)$$

$$A = 1.414 \times \Phi^{-1}(\theta), \qquad (A2)$$

$$N_{\mathrm{pos}} = \frac{0.0099\, e^{-A^2/2} \left[ (5A^2 + 8) + \dfrac{A^2 + 8}{R} \right]}{\sigma_c^2}, \qquad (A3)$$

and

$$N = N_{\mathrm{pos}} (1 + R). \qquad (A4)$$

Note that Equations (A1) and (A3) have been algebraically rearranged from their published form to isolate the dependent variables for more convenient calculation. All symbols are defined in Table A1. To calculate sample size with these equations, first assign values to the variables in the first section of Table A1, then sequentially substitute the values into Equations (A1)–(A4), using the suggested values of the variables in the second section of Table A1 and values from Tables A2 and A3 where indicated.

For example, Table A1 shows all the values involved in the calculation of sample size for a study that includes four readers and is designed to examine the Az difference between two imaging techniques. On the basis of preliminary study results, the expected average Az (θ) of the two techniques is 0.75. The smallest meaningful difference (δ) between the Az values for the two techniques is set to 0.15. Each reader interprets each case once (K = 1), and the study involves an equal number of positive and negative cases (R = 1). The difference in Az (w_b) between the most accurate and least accurate observers is estimated to be 0.05. The values for w_w, r_1, r_2, r_3, and r_b given in Table A1 are those suggested by published reports (14). The calculated sample size (N) is 76.

TABLE A1
Definition of Variables in Calculation of Sample Size for ROC Study in Which a Number of Readers Interpret the Same Set of Cases Obtained by Using Two Different Imaging Techniques

Variable | Definition | Example*

Main design variables:
J | Number of readers | 4
δ | Smallest meaningful difference between Az values associated with the two imaging techniques | 0.15
w_b | Interobserver variability expressed as the expected difference between the Az of the most accurate observer in the study and the Az of the least accurate observer | 0.05
K | Number of times each case is read by each reader for each imaging technique, typically equal to 1 | 1
θ | Expected average Az for the two imaging techniques | 0.75
R | Ratio of number of negative cases to number of positive cases in the study, often equal to 1 | 1

Variables with suggested values:
w_w | Intraobserver variability expressed as the expected difference between Az values of an observer who interprets the same images obtained with the same imaging technique on two different occasions; suggested value is 0.5 × w_b (14) | 0.025
r_1 | Correlation between Az values estimated for the same subjects by the same observer with different imaging techniques; an average value of 0.47 is suggested (14) | 0.47
r_2 | Correlation between Az values estimated for the same subjects by different observers with the same imaging technique; a value of 0 is suggested for r_2 − r_3 (14) | 0
r_3 | Correlation between Az values estimated for the same subjects by different observers with different imaging techniques; a value of 0 is suggested for r_2 − r_3 (14) | 0
r_b | Correlation between Az values estimated for the same subjects by the group of observers with different imaging techniques; a value of 0.8 is suggested (14) | 0.8

Calculated or assigned variables:
λ | Noncentrality parameter of a noncentral F distribution with 1 degree of freedom associated with the numerator and J − 1 degrees of freedom associated with the denominator; values are given in Table A2 | 18.11
σ_b² | Interobserver variability expressed as a variance; σ_b² = (w_b × m)², where m is the multiplier that converts a range into an SD estimate, obtained from Table A3 for n = J | (0.05 × 0.486)²
σ_w² | Intraobserver variability expressed as a variance; σ_w² = (w_w × m)², where m is obtained from Table A3 for n = 2 | (0.025 × 0.886)²
σ_c² | Variance between subjects, calculated from Equation (A1) | 0.003540
Φ⁻¹( ) | Inverse cumulative normal distribution function, equivalent to the NORMSINV() function in Microsoft Excel (Microsoft, Redmond, Wash) |
N_pos | Number of positive cases in the study | 38
N | Total number of cases (positive + negative) in the study—that is, the sample size of cases | 76

* See Appendix for a description of the example values.
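The arithmetic in Equations (A1)–(A4) is easy to get wrong by hand, so a short script is a useful check. The sketch below (Python; the function name and argument layout are choices made for this illustration) reproduces the worked example, taking λ from Table A2 and the multipliers m from Table A3:

```python
from math import exp
from scipy.stats import norm

def roc_total_cases(J, delta, w_b, K, theta, R, lam, m_J, m_2,
                    w_w=None, r1=0.47, r2=0.0, r3=0.0, r_b=0.8):
    """Total number of cases for a multireader ROC study, per Equations (A1)-(A4)."""
    w_w = 0.5 * w_b if w_w is None else w_w      # suggested default (14)
    var_b = (w_b * m_J) ** 2                     # interobserver variance
    var_w = (w_w * m_2) ** 2                     # intraobserver variance
    var_c = ((J * delta**2 / (2 * lam) - var_b * (1 - r_b) - var_w / K)
             / ((1 - r1) + (J - 1) * (r2 - r3)))            # Equation (A1)
    A = 1.414 * norm.ppf(theta)                             # Equation (A2)
    n_pos = (0.0099 * exp(-A**2 / 2)
             * ((5 * A**2 + 8) + (A**2 + 8) / R) / var_c)   # Equation (A3)
    return n_pos * (1 + R)                                  # Equation (A4)

# Worked example from Table A1: lam from Table A2 (numerator df = 1, J - 1 = 3);
# m_J and m_2 from Table A3 for n = 4 and n = 2, respectively.
print(roc_total_cases(J=4, delta=0.15, w_b=0.05, K=1, theta=0.75, R=1,
                      lam=18.11, m_J=0.486, m_2=0.886))     # about 76 cases
```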

References
1. Eng J. Sample size estimation: how many individuals should be studied? Radiology 2003; 227:309–313.
2. Eng J, Siegelman SS. Improving radiology research methods: what is being asked and who is being studied? Radiology 1997; 205:651–655.
3. Newell DJ. Type II errors and ethics (letter). BMJ 1978; 4:1789.
4. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA 2002; 288:358–362.
5. Rosner B. Fundamentals of biostatistics. 5th ed. Pacific Grove, Calif: Duxbury, 2000; 308.
6. Lenth RV. Some practical guidelines for effective sample size determination. Am Stat 2001; 55:187–193.
7. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat 2001; 55:19–24.
8. Thomas L. Retrospective power analysis. Conserv Biol 1997; 11:276–280.
9. Detsky AS, Sackett DL. When was a “negative” clinical trial big enough? How many patients you need depends on what you found. Arch Intern Med 1985; 145:709–712.
10. Castelloe JM. Sample size computations and power analysis with the SAS system. Proceedings of the 25th Annual SAS Users Group International Conference, April 9–12, 2000, Indianapolis, Ind. Cary, NC: SAS Institute, 2000.
11. Feiveson AH. Power by simulation. Stata J 2002; 2:107–124.
12. Obuchowski NA. Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 1995; 2:709–716.
13. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: Wiley, 2002; 298–304.
14. Obuchowski NA. Sample size tables for receiver operating characteristic studies. AJR Am J Roentgenol 2000; 175:603–608.
15. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731.
16. Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol 1997; 4:298–303.
17. Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA. Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Acad Radiol 1998; 5:591–602.
18. Abramowitz M, Stegun IA. Handbook of mathematical functions with formulas, graphs, and mathematical tables. Washington, DC: U.S. Department of Commerce, National Bureau of Standards, 1964. Applied Mathematics Series No. 55.
19. Feinstein AR. Principles of medical statistics. Boca Raton, Fla: Chapman & Hall, 2002; 115–116.
20. Lenth RV. Java applets for power and sample size. Available at: www.stat.uiowa.edu/~rlenth/Power/index.html. Accessed January 5, 2004.
21. Snedecor GW, Cochran WG. Statistical methods. 8th ed. Ames, Iowa: Iowa State University Press, 1989; 469.

TABLE A2
Noncentrality Parameter (λ) of the Noncentral F Distribution Corresponding to a Significance Criterion (α) of .05 and a Power of .8

Numerator Degrees of Freedom | Denominator Degrees of Freedom | Noncentrality Parameter (λ)*
1 | 1 | 266.80
1 | 2 | 31.96
1 | 3 | 18.11
1 | 4 | 14.15
1 | 5 | 12.35
1 | 6 | 11.34
1 | 7 | 10.69
1 | 8 | 10.25
1 | 9 | 9.92
1 | 10 | 9.67
1 | 11 | 9.48
1 | 12 | 9.32
1 | 13 | 9.19
1 | 14 | 9.08
1 | 15 | 8.99
1 | 16 | 8.91
1 | 17 | 8.84
1 | 18 | 8.78
1 | 19 | 8.72

Note.—The noncentral F distribution is a generalized form of the F distribution and contains a ratio of two χ² distributions (18). Each of the two component χ² distributions (in the numerator and denominator) is associated with a respective degrees of freedom parameter of the F distribution. The degrees of freedom signify the number of independent units of information relevant to the calculation of a statistic (19), in this case the component χ² distributions.

* Calculated by iteration, with the Lenth (20) implementation of the noncentral F distribution function.
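These values can be reproduced by the iteration the footnote describes. Below is a sketch of one way to do so, assuming SciPy's noncentral F distribution in place of the Lenth applet cited in the footnote:

```python
from scipy.stats import f, ncf
from scipy.optimize import brentq

def noncentrality(dfn, dfd, alpha=0.05, power=0.80):
    """Solve for the noncentrality parameter that gives the requested power."""
    f_crit = f.ppf(1 - alpha, dfn, dfd)  # rejection threshold under the central F
    # Find lam such that P(F' > f_crit) = power when F' has noncentrality lam.
    return brentq(lambda lam: ncf.sf(f_crit, dfn, dfd, lam) - power, 0.01, 1000.0)

print(round(noncentrality(1, 3), 2))  # 18.11, matching the third row of Table A2
```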

TABLE A3
Multiplier for Converting the Range of a Set of Observations into an Estimate of the SD

No. of Observations (n) | Multiplier (m)
2 | 0.886
3 | 0.591
4 | 0.486
5 | 0.430
6 | 0.395
7 | 0.370
8 | 0.351
9 | 0.337
10 | 0.325
11 | 0.315
12 | 0.307
13 | 0.300
14 | 0.294
15 | 0.288
16 | 0.283
17 | 0.279
18 | 0.275
19 | 0.271
20 | 0.268

Note.—Revised and adapted, with permission, from reference 21. The product of m and the observed range estimates the SD; its square is the variance used in Equation (A1).
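These multipliers are the classical range-to-SD conversion factors, that is, the reciprocals of the expected range of n standard normal observations. A quick Monte Carlo check of that interpretation (an illustration added here):

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed
for n in (2, 4, 10):
    samples = rng.standard_normal((200_000, n))
    mean_range = (samples.max(axis=1) - samples.min(axis=1)).mean()
    print(n, round(1 / mean_range, 3))  # about 0.886, 0.486, and 0.325
```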


Kimberly E. Applegate, MD, MS
Philip E. Crewson, PhD

Statistical Literacy1

One should not go hunting for buried treasure, because buried treasure is found at random, and, by definition, one cannot go searching for something which is found at random.

Attributed to the Talmud; cited by Salsburg (1)

With this issue of Radiology, the Statistical Concepts Series of articles reaches its conclusion. We take this opportunity to thank each of the talented authors for sharing with us their expertise and patience. It takes considerable skill to translate complex concepts into a format understandable to a wide audience. Without their efforts and considerable time commitment, this series would not have been possible. Thank you.

In the current issue of Radiology, the 17th article in the series, by Dr Eng (2), provides an example of how we can use advanced statistical modeling to understand and predict what may work in radiology research (2,3). While this final article is complex, its sophistication provides a window into a world we radiologists rarely visit. The articles that preceded this final article were designed to provide readers of Radiology with an understanding of the basic concepts of statistics, probability, and scientific methods that are used in the medical literature (4). Because the Accreditation Council for Graduate Medical Education now requires of residents a basic grasp of statistical concepts as a component of the six core competencies (5), one must have a sound understanding of the methods and results presented in today's medical literature, whether one is a resident or a seasoned radiologist.

We are all consumers of information. Statistics allow us to organize and objectively evaluate empiric evidence that can ultimately lead to improved patient care.

Nearly all readers of the radiology literature know that understanding study results and determining their applicability to practice requires an understanding of statistical issues. The articles that compose the Statistical Concepts Series in Radiology are meant to increase understanding of the statistics commonly used in radiology research.

Statistical methods revolutionized science in the 20th century (1). Statistical concepts have become an essential aspect of scientific inquiry and even of our common culture. The influence on society is evidenced by the use of statistics in the lay press and media; the use of terms such as probability and correlation in everyday language; and the willingness to collect data, accept scientific conclusions, and set public policy on the basis of averages and estimates (1).

In just over 1 century, statistics have altered our view of science. Together with the rapid evolution of computer capabilities, there are many new statistical methods on the horizon. Recent trends in medical statistics include the use of meta-analysis and clustered-data analysis (6). In addition, some statistical methods, formerly uncommon in medical research, are quickly becoming embedded in our literature. These include the bootstrap method, Gibbs sampler, generalized additive models, classification and regression trees, models for longitudinal data (generalized estimating equations), hierarchic models, and neural networks (6). Regardless of the sophistication of a technique, to take full advantage of its potential it is necessary to understand fundamental statistical methods. The challenge for physicians is to develop and maintain statistical “literacy,” in addition to the scientific literacy of radiology and medicine.

We define a functional level of statistical literacy as that which includes an understanding of methods, the effect of statistics on research design and analysis, and a basic vocabulary of statistical terms (7).

As a profession, how can we encourage statistical literacy? First, we must educate ourselves by requiring the teaching of medical statistics in medical school and residency training. Second, we should encourage the development of consensus guidelines on the proper reporting of scientific research—for example, the CONSORT (Consolidated Standards of Reporting Trials) statement (8) for reporting results of randomized controlled trials, the QUOROM (Quality of Reporting of Meta-analyses) statement (9) for reporting results of meta-analyses and systematic reviews, and the STARD (Standards for Reporting of Diagnostic Accuracy) statement (10) for reporting results of diagnostic accuracy studies. Some journals have published statistical guidelines for contributors to medical journals (11), while others have statistical checklists for manuscript reviewers. In January 2001, Radiology became the first American radiology journal to provide statistical review of all published manuscripts that contain statistical content, to the benefit of both authors and readers (12). Third, we should continue to promote the learning of critical thinking skills and research methodology at our national meetings, such as the seminars held at the 2002 Radiological Society of North America and the 2003 Association of University Radiologists annual meetings. Fourth, we must continue to promote the value of scientifically rigorous reports, relative to that of less scientific ones, through our national organizations, the program content at our scientific meetings, and the support of these concepts through written and oral announcements by our leadership.

Index terms: Education; Statistical analysis

Published online: 10.1148/radiol.2303031661

Radiology 2004; 230:613–614

1 From the Department of Radiology, Riley Hospital for Children, 702 Barnhill Rd, Indianapolis, IN 46202 (K.E.A.); and Health Services Research and Development Service, Department of Veterans Affairs, Washington, DC (P.E.C.). Received October 9, 2003; revision requested October 14; revision received October 17; accepted October 29. Address correspondence to K.E.A. (e-mail: [email protected]). © RSNA, 2004

The goal of the Statistical Concepts Series was to enhance the ability of radiologists to evaluate the literature competently and critically, not to make them statisticians. When contemplating the value of such a basic understanding of statistics, consider that Bland (13) argued that “bad statistics leads to bad research and bad research is unethical.” We must beware of translating bad research into bad medicine and recognize that we have an essential role in increasing the evidence base of medical practice. Such an understanding is perhaps one of the most useful things that radiologists must learn.

References
1. Salsburg D. The lady tasting tea: how statistics revolutionized science in the twentieth century. New York, NY: Freeman, 2001.
2. Eng J. Sample size estimation: a glimpse beyond simple formulas. Radiology 2004; 230:606–612.
3. Proto AV. Radiology 2002—Statistical Concepts Series. Radiology 2002; 225:317.
4. Applegate KE, Crewson PE. An introduction to biostatistics. Radiology 2002; 225:318–322.
5. ACGME outcome project. Available at: www.acgme.org/outcome/comp/compMin.asp. Accessed September 1, 2003.
6. Altman DG. Statistics in medical journals: some recent trends. Stat Med 2000; 19:3275–3289.
7. Mossman KL. Nuclear literacy. Health Phys 1990; 58:639–643.
8. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996; 276:637–639.
9. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF. Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement—Quality of Reporting of Meta-analyses. Lancet 1999; 354:1896–1900.
10. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003; 226:24–28.
11. Altman DG, Gore SM, Gardener MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. BMJ 1983; 286:1489–1493.
12. Proto AV. Radiology 2001—the upcoming year. Radiology 2001; 218:1–2.
13. Bland M. An introduction to medical statistics. 2nd ed. Oxford, England: Oxford University Press, 1995.
