Handbook of Test Development

HANDBOOK OF TEST DEVELOPMENT

Edited by
Steven M. Downing, University of Illinois at Chicago
Thomas M. Haladyna, Arizona State University

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
2006   Mahwah, New Jersey   London

Editorial Director: Lane Akers
Editorial Assistant: Karen Willig-Bates
Cover Design: Tomai Maridou
Full-Service Compositor: TechBooks
Text and Cover Printer: Hamilton Printing Company

Copyright © 2006 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430
www.erlbaum.com

Library of Congress Cataloging-in-Publication Data

Handbook of test development / edited by Steven M. Downing, Thomas M. Haladyna.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-5264-6 (casebound : alk. paper)
ISBN 0-8058-5265-4 (pbk. : alk. paper)
1. Educational tests and measurements--Design and construction--Handbooks, manuals, etc.
2. Psychological tests--Design and construction--Handbooks, manuals, etc.
I. Downing, Steven M. II. Haladyna, Thomas M.
LB3051.H31987 2006
371.26'1--dc22   2005030881

This edition published in the Taylor & Francis e-Library, 2011. To purchase your own copy of this or any of Taylor & Francis or Routledge's collection of thousands of eBooks please go to www.eBookstore.tandf.co.uk.
ISBN 0-203-87477-3 Master e-book ISBN

CONTENTS

Preface (Steven M. Downing and Thomas M. Haladyna)

I. FOUNDATIONS
1. Twelve Steps for Effective Test Development (Steven M. Downing)
2. The Standards for Educational and Psychological Testing: Guidance in Test Development (Robert L. Linn)
3. Contracting for Testing Services (E. Roger Trent and Edward Roeber)
4. Evidence-Centered Assessment Design (Robert J. Mislevy and Michelle M. Riconscente)
5. Item and Test Development Strategies to Minimize Test Fraud (James C. Impara and David Foster)
6. Preparing Examinees for Test Taking: Guidelines for Test Developers and Test Users (Linda Crocker)

II. CONTENT
7. Content-Related Validity Evidence in Test Development (Michael Kane)
8. Identifying Content for Student Achievement Tests (Norman L. Webb)
9. Determining the Content of Credentialing Examinations (Mark Raymond and Sandra Neustel)
10. Standard Setting (Gregory J. Cizek)

III. ITEM DEVELOPMENT
11. Computerized Item Banking (C. David Vale)
12. Selected-Response Item Formats in Test Development (Steven M. Downing)
13. Item and Prompt Development in Performance Testing (Catherine Welch)
14. Innovative Item Formats in Computer-Based Testing: In Pursuit of Improved Construct Representation (Stephen G. Sireci and April L. Zenisky)
15. Item Editing and Editorial Review (Rebecca A. Baranowski)
16. Fairness Reviews in Assessment (Michael Zieky)
17. Language Issues in Item Development (Jamal Abedi)
18. Anchor-Based Methods for Judgmentally Estimating Item Statistics (Ronald K. Hambleton and Stephen J. Jirka)
19. Item Analysis (Samuel A. Livingston)

IV. TEST DESIGN
20. Practical Issues in Designing and Maintaining Multiple Test Forms for Large-Scale Programs (Cathy L. W. Wendler and Michael E. Walker)
21. Vertical Scales (Michael J. Young)
22. Developing Test Forms for Small-Scale Achievement Testing Systems (Paul Jones, Russell W. Smith, and Diane Talley)
23. Designing Ability Tests (Gale H. Roid)
24. Designing Computerized Adaptive Tests (Tim Davey and Mary J. Pitoniak)
25. Designing Tests for Pass-Fail Decisions Using Item Response Theory (Richard M. Luecht)

V. TEST PRODUCTION AND ADMINISTRATION
26. Test Production Effects on Validity (Dan Campion and Sherri Miller)
27. Test Administration (Rose C. McCallin)
28. Considerations for the Administration of Tests to Special Needs Students: Accommodations, Modifications, and More (Martha L. Thurlow, Sandra J. Thompson, and Sheryl S. Lazarus)

VI. POSTTEST ACTIVITIES
29. Practices, Issues, and Trends in Student Test Score Reporting (Joseph M. Ryan)
30. Technical Reporting and Documentation (Douglas F. Becker and Mark R. Pomplun)
31. Evaluating Tests (Chad W. Buckendahl and Barbara S. Plake)
32. Roles and Importance of Validity Studies in Test Development (Thomas M. Haladyna)

VII. EPILOGUE
Epilogue (Cynthia Board Schmeiser)

Author Index
Subject Index

PREFACE

The Handbook of Test Development is a systematic and comprehensive source of information about developing tests of knowledge, skills, and ability. The Handbook presents the state of the art of developing tests in the twenty-first century. Historically, issues and practices concerning test development have received little scholarly, scientific attention in the academic discipline of educational and psychological measurement; most research and writing has focused on the more statistical issues of testing. Yet the development of tests and all the many issues and practices associated with producing achievement and ability tests occupy a central and critically important role in any testing program.

The field of educational and psychological measurement has changed dramatically in the past decade. Computer-based and computer-adaptive testing, Item Response Theory, generalizability theory, and the unified view of validity are now part of the landscape of many testing programs. Even with such major changes to the theory and practice of educational and psychological measurement and the delivery of tests, the activities associated with high-quality test development remain largely unchanged and mostly undocumented. This Handbook seeks to provide information about sound testing practices in a way that will be useful to both test developers and researchers studying issues that affect test development.

The purpose of this Handbook is to present a comprehensive, coherent, and scholarly-but-practical discussion of all the issues and sound practices required to construct tests that have sufficient validity evidence to support their intended score interpretations. Well-respected scholars and practitioners of the art and science of developing tests of achievement and ability have contributed chapters to this volume. Each chapter provides practitioners with a useful resource upon which they can rely for well-supported, practical advice on developing tests.

The art and science of test development is usually learned in testing agencies, primarily as on-the-job training, with standard practices (based on sound educational measurement theory) handed down from master to apprentice. Graduate educational measurement programs may teach a course or offer a seminar on test development issues, but graduate students often complete doctoral educational measurement programs with little or no insight into or experience with the actual hands-on sound practices associated with developing tests. Psychometric theory presents the foundation for all necessary test development activities, but the nexus between theory and actual practice may not always be readily apparent. The Handbook presents a comprehensive guide to test developers for every stage of test development, from the planning stage, through content definition and item or prompt development, to test production, administration, scoring, and reporting.

The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) provide a backdrop for all the test development issues discussed in this volume.
Chapter authors refer the reader to the relevant Standards supporting their discussions, with frequent notation of the specific validity evidence associated with the aspects of test development discussed.

PART I. FOUNDATIONS

Part I of the Handbook covers essential aspects of test development that provide a foundation for any testing program. In Chapter 1, Steven Downing introduces the Handbook of Test Development by overviewing a detailed model for developing effective tests. Each chapter in this book connects with this chapter in some important way. Robert Linn in Chapter 2 discusses the history of standards in test development and provides a rationale for using the Standards for Educational and Psychological Testing, which are imbued throughout this Handbook. In Chapter 3, Roger Trent and Edward Roeber discuss the important process of obtaining bids for testing services and contracting for these services. Chapter 4, by Robert Mislevy and Michelle Riconscente, discusses the exciting development known as evidence-centered assessment design. This chapter shows the potential for integrating validity with test development throughout the process of creating a test. James Impara and David Foster, in Chapter 5, present a perspective about security that is becoming increasingly important in high-stakes testing. The last chapter in Part I, by Linda Crocker, deals with the perils and responsibilities of test preparation.

PART II. CONTENT

One primary type of validity evidence comes from knowing the specific content of a test, particularly for educational achievement and credentialing tests. Michael Kane in Chapter 7 discusses the evolution of the concept of validity evidence that pertains to test content. Chapter 8 by Norman Webb discusses ways in which educators can more effectively define content and, thereby, increase content-related validity evidence for their tests. Chapter 9, written by Mark Raymond and Sandra Neustel, provides a complement to Norman Webb's chapter, discussing how essential validity evidence is established for professional credentialing examinations through practice or task analysis. Chapter 10 by Gregory Cizek treats the important topic of standard setting: establishing defensible cut scores for tests.

PART III. ITEM DEVELOPMENT

The heart of any testing program is its item bank. Chapter 11 by David Vale discusses the challenges of item banking and presents an ideal model of how computerized item banking should work. Steven Downing in Chapter 12 discusses the essentials of selected-response item development, and Catherine Welch in Chapter 13 provides a complementary treatment of constructed-response item development. Chapter 14, by Stephen Sireci and April Zenisky, presents ideas about novel item formats that may be useful for newer computerized testing programs.

The next five chapters are complementary and rich in advice about what we need to do to test items before each item becomes operational in a test. In Chapter 15, Rebecca Baranowski discusses the need for professional item and test editing and relates important editorial concerns to validity. In Chapter 16, Michael Zieky presents information about reviewing items for fairness and sensitivity. In Chapter 17, Jamal Abedi advises on the threat to validity posed by unfair reading demands in our test items.
In Chapter 18, Ronald Hambleton and Stephen Jirka discuss how item writers and content reviewers might use judgments of expected item statistics to augment traditional methods used to increase the size of item banks. Chapter 19, written by Samuel (Skip) Livingston, shows the importance of item analysis in test development, and he augments this emphasis with newer graphical procedures that promise to make item analysis even more informative for evaluating test items.

PART IV. TEST DESIGN

Principles of test design can be found in statistical treatments, but many pragmatic issues remain that are not found in such sources. This section discusses the importance of test design from many perspectives. Chapter 20, by Cathy Wendler and Michael Walker, provides wisdom on designing large-scale multi-form tests. Michael Young's Chapter 21 explicates the methods for creating vertical scales that span more than a single grade level. Paul Jones, Russell Smith, and Diane Talley, in Chapter 22, provide a unique perspective on test design when the number of persons taking the test is small. Gale Roid in Chapter 23 discusses essential methods required for developing tests of ability. Tim Davey and Mary Pitoniak in Chapter 24 discuss the problems encountered in designing a computer-adaptive test in which each item is selected given the examinee's ability estimate. Chapter 25 by Richard Luecht presents an Item Response Theory design for computer-adaptive testing using multiple test modules, in which each module is adapted to the examinee's estimated ability.

PART V. TEST PRODUCTION AND ADMINISTRATION

Although test production and administration may not seem very important in the spectrum of test development, some of the greatest threats to validity come from inappropriate test production and poor administration. Dan Campion and Sherri Miller in Chapter 26 present concepts, principles, and procedures associated with effective test production. Chapter 27 by Rose McCallin provides a perspective from an experienced test administrator on how to give a test that is seamlessly smooth. Finally, Martha Thurlow, Sandra Thompson, and Sheryl Lazarus provide insight into problems we encounter when giving tests to special needs students in Chapter 28.

PART VI. POSTTEST ACTIVITIES

Although the Handbook is not concerned with scoring and analysis of item responses, there are other activities that chronologically follow a test that bear on validity in important ways. Chapter 29 by Joseph Ryan presents new information about reporting test scores and subscores. Chapter 30 by Douglas Becker and Mark Pomplun discusses technical reporting and its link to validation. Chapter 31 by Chad Buckendahl and Barbara Plake discusses the important process of evaluating tests, a much recommended but often neglected step in testing. Chapter 32 by Thomas Haladyna argues that research is needed in all testing programs, both for increasing validity evidence and for solving practical problems that test developers face.

EPILOGUE

Written by Cynthia Board Schmeiser, the epilogue provides indications of where we are in test development as well as what we need to accomplish in the future to make test development activities more sound and meaningful. We hope her comments provide a basis for a new and better Handbook in the future.

WHAT DOES THE HANDBOOK PROVIDE?

The Handbook provides a practical and definitive resource for test developers working in all areas of test development.
This Handbook can also serve as a text or as supplemental reading for academic courses concerned with test development issues, especially at the graduate level. Finally, the Handbook serves as a definitive reference book for researchers who are interested in specific problems associated with test development.

ACKNOWLEDGMENTS

The editors are grateful to Lane Akers of Lawrence Erlbaum Associates, who supported this project from the beginning and assisted the many steps from the birth of this project to its conclusion. We are also grateful to each of the authors who contributed chapters to the Handbook, sacrificing time from their busy, productive careers to write these chapters. Their efforts made the Handbook possible. We also appreciate all the excellent professional support we received from Ms. Joanne Bowser, our project manager at TechBooks.

We are appreciative of all our instructors and mentors, who encouraged our interest and enthusiasm for the art and science of test development; we especially acknowledge the late Robert L. Ebel of Michigan State University, who highly valued test development and inspired his students to do likewise. And to our spouses, Barbara and Beth, who tolerated us during this extra long year, we continue to be ever thankful.

Finally, the editors and chapter authors wish to thank the National Council on Measurement in Education (NCME) for lifelong professional support. We are happy to provide royalties from this book to support NCME's many important efforts to advance educational measurement as a profession.

Steven M. Downing
Thomas M. Haladyna

I. FOUNDATIONS

1. Twelve Steps for Effective Test Development
Steven M. Downing, University of Illinois at Chicago

Effective test development requires a systematic, detail-oriented approach based on sound theoretical educational measurement principles. This chapter discusses twelve discrete test development procedures or steps that typically must be accomplished in the development of most achievement, ability, or skills tests. Following these twelve steps of effective test development, for both selected-response and constructed-response tests, tends to maximize validity evidence for the intended test score interpretation.

These twelve steps are presented as a framework, which test developers may find useful in organizing their approach to the many tasks commonly associated with test development: starting with detailed planning in Step 1, carrying through to discussions of content definition and delineation, to creating test stimuli (items or prompts), and to administering, scoring, reporting, and documenting all important test development activities. Relevant Standards are referenced for each step and validity issues are highlighted throughout.

This chapter provides an overview of the content of the Handbook of Test Development, with each of the twelve steps referenced to other chapters.

Effective test development requires a systematic, well-organized approach to ensure sufficient validity evidence to support the proposed inferences from the test scores.
A myriad of details and issues, both large and small, comprise the enterprise usually associated with the terms test development and test construction. All of these details must be well executed to produce a test that estimates examinee achievement or ability fairly and consistently in the content domain purported to be measured by the test and to provide documented evidence in support of the proposed test score inferences.

This chapter discusses a model of systematic test development, organized into twelve discrete tasks or activities. All of the activities and tasks outlined in this chapter are discussed in detail in other chapters of this volume, such that this chapter may be considered an overview or primer for the content of the Handbook of Test Development. This chapter is not intended to be a comprehensive discussion of every important issue associated with test development or all of the validity evidence derived from such activities. Rather, this chapter presents a potentially useful model or template for test developers to follow so that each important test development activity receives sufficient attention to maximize the probability of creating an effective measure of the construct of interest.

This particular organization of tasks and activities into twelve discrete steps is somewhat arbitrary; these tasks could be organized differently such that there were fewer or more discrete steps. Each of these steps must be accomplished, at some level of detail, for all types of tests, whether the format is selected response (e.g., multiple choice), constructed response (e.g., short-answer essay), or performance (e.g., high-fidelity simulation), and whatever the mode of test administration (traditional paper-and-pencil or computer based). The intensity and technical sophistication of each activity depends on the type of test under development, the test's purpose and its intended inferences, the stakes associated with the test scores, the resources and technical training of the test developers, and so on; but all of the tasks noted in this chapter must be carried out at some level of detail for every test development project.

These twelve steps provide a convenient organizational framework for collecting and reporting all sources of validity evidence for a testing program and also provide a convenient method of organizing a review of the relevant Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], 1999) pertaining to test development. Each of these steps can be thought of as one major organizer of validity evidence to be documented in a technical report which summarizes all the important activities and results of the test. Each of these steps is also associated with one or more Standards (AERA, APA, NCME, 1999) that apply, more or less, given the exact purpose of the test, the consequences of test scores, and the desired interpretation or inferences from the test scores.

Table 1.1 lists the twelve steps of test development and provides a brief summary of tasks, activities, and issues; selected relevant Standards (AERA, APA, NCME, 1999) are also noted in Table 1.1 for each step.
These steps are listed as a linear model or as a sequential timeline, from a discrete beginning to a final end point; however, in practice, many of these activities may occur simultaneously, or the order of some of these steps may be modified. For example, if cut scores or passing scores are required for the test, systematic standard-setting activities may occur (or a process may begin) much earlier in the test development process than Step 9 as shown in Table 1.1. Item banking issues are shown as Step 11, but for ongoing testing programs, many item banking issues occur much earlier in the test development sequence. But many of these activities are prerequisite to other activities; for example, content definition must occur before item development and test assembly, so the sequence of steps, although somewhat arbitrary, is meaningful.

Much of the information contained in this chapter is based on years of experience, learning what works and what does not work, in the actual day-to-day practice of test development. The relevant research literature supporting the best practice of test development comes from many diverse areas of psychometrics and educational measurement. The Millman and Greene chapter (1989) and the Schmeiser and Welch chapter (in press) in Educational Measurement inspire these twelve steps for effective and efficient test development.

TABLE 1.1
Twelve Steps for Effective Test Development (example tasks and example related Standards)

1. Overall plan. Tasks: systematic guidance for all test development activities; construct; desired test interpretations; test format(s); major sources of validity evidence; clear purpose; desired inferences; psychometric model; timelines; security; quality control. Standards 1.1, 3.2, 3.9.
2. Content definition. Tasks: sampling plan for domain/universe; various methods related to purpose of assessment; essential source of content-related validity evidence; delineation of construct. Standards 1.6, 3.2, 3.11, 14.8.
3. Test specifications. Tasks: operational definitions of content; framework for validity evidence related to systematic, defensible sampling of content domain; norm or criterion referenced; desired item characteristics. Standards 1.6, 3.2, 3.3, 3.4, 3.11.
4. Item development. Tasks: development of effective stimuli; formats; validity evidence related to adherence to evidence-based principles; training of item writers and reviewers; effective item editing; CIV owing to flaws. Standards 3.6, 3.7, 3.17, 7.2, 13.18.
5. Test design and assembly. Tasks: designing and creating test forms; selecting items for specified test forms; operational sampling by planned blueprint; pretesting considerations. Standards 3.7, 3.8.
6. Test production. Tasks: publishing activities; printing or CBT packaging; security issues; validity issues concerned with quality control. Standards: N/A.
7. Test administration. Tasks: validity issues concerned with standardization; ADA issues; proctoring; security issues; timing issues. Standards 3.18, 3.19, 3.20, 3.21.
8. Scoring test responses. Tasks: validity issues; quality control; key validation; item analysis. Standards 3.6, 3.22.
9. Passing scores. Tasks: establishing defensible passing scores; relative vs. absolute; validity issues concerning cut scores; comparability of standards; maintaining constancy of score scale (equating, linking). Standards 4.10, 4.11, 4.19, 4.20, 4.21.
10. Reporting test results. Tasks: validity issues of accuracy and quality control; timely; meaningful; misuse issues; challenges; retakes. Standards 8.13, 11.6, 11.12, 11.15, 13.19, 15.10, 15.11.
11. Item banking. Tasks: security issues; usefulness; flexibility; principles for effective item banking. Standard 6.4.
12. Test technical report. Tasks: systematic, thorough, detailed documentation of validity evidence; twelve-step organization; recommendations. Standards 3.1, 6.5.

Abbreviations: ADA, Americans with Disabilities Act; CBT, computer-based testing; CIV, construct-irrelevant variance.

STEP 1: OVERALL PLAN

Every testing program needs some type of overall plan. The first major decisions are: What construct is to be measured? What score interpretations are desired? What test format or combination of formats (selected response or constructed response/performance) is most appropriate for the planned assessment? What test administration modality will be used (paper and pencil or computer based)? One needs to know how exactly and when to begin the program or process, in what sequence tasks must be accomplished, which tasks depend on the successful completion of other tasks, what timeline must be adhered to, who is responsible for carrying out which specific tasks, how to quality control all major aspects of the testing program, plus literally hundreds of other issues, decisions, tasks, and operational details. Step 1, the overall plan, places a systematic framework on all major activities associated with the test development project, makes explicit many of the most important a priori decisions, puts the entire project on a realistic timeline, and emphasizes test security and quality control issues from the outset.

Many of the most fundamental decisions about the testing program must be made prior to beginning formal test development activities. Each of these fundamental decisions, with its clear rationale, ultimately indicates a major source of validity evidence for the test scores resulting from the testing program.

Examples of Step 1-type tasks and decisions include a clear, concise, well-delineated purpose of the planned test. The purpose of testing forms an operational definition of the proposed test and guides nearly all other validity-related decisions related to test development activities. Ultimately, major steps such as content definition, the methods used to define the test content domain, and the construct hypothesized to be measured by the examination are all directly associated with the stated purpose of the test. The choice of psychometric model, whether classical measurement theory or item response theory, may relate to the proposed purpose of the test, as well as the proposed use of the test data and the technical sophistication of the test developers and test users. For example, if the clearly stated purpose of the test is to assess student achievement over a well-defined sequence of instruction or curriculum, the proposed construct, the methods used to select content to test, and the psychometric model to use are each reasonably clear choices for the test developer.
Likewise, if the test's purpose is to estimate ability to select students for a national and highly competitive professional educational program, the inferences to be made from test scores are clearly stated, the major constructs of interest are delineated, and the content-defining methods, psychometric model, and other major test development decisions are well guided.

Many other fundamental decisions must be made as part of an overall test development plan, including: Who creates test items for selected-response tests or prompts or stimuli for performance tests? Who reviews newly written test items, prompts, or other test stimuli? How is the item or prompt production process managed and on what timeline? Who is responsible for the final selection of test items or prompts? Who produces, publishes, or prints the test? How is test security maintained throughout the test development sequence? What quality controls are used to ensure accuracy of all testing materials? When and how is the test administered, and by whom? Is the test a traditional paper-and-pencil test or a computer-based test? If required, how is the cut score or passing score established, and by what method? Who scores the test and how are the scores reported to examinees? Who maintains an item bank or item pool of secure test items or performance prompts? What are the key dates on the timeline of test development to ensure that all major deadlines are met? Who is responsible for the complete documentation of all the important activities, data results, and evaluation of the test?

In many important ways, Step 1 is the most important step of the twelve tasks of test development. A project well begun is often a project well ended. This critical beginning stage of test development outlines all essential tasks to be accomplished for a successful testing project and clearly highlights all the important sources of validity evidence required for the testing program. Timelines and responsibilities are clearly stated, creating a reasonable and efficient plan to accomplish all necessary tasks, including allowing sufficient time for adequate quality control procedures and a correction cycle for all detected errors. Detailed, clear test program planning is an important first step toward adequately accomplishing the end goal of preparing, administering, scoring, and analyzing a test and presenting reasonable sources of validity evidence to support or refute the intended inferences from test scores. All types of tests, used for whatever purpose, whether using traditional paper-and-pencil selected-response formats or performance tests using constructed-response formats or high-fidelity simulations, benefit from the detailed planning activities of Step 1.

The Standards (AERA, APA, NCME, 1999) relating to the tasks of Step 1 discuss the importance of clearly defining the purpose of the test, following careful test development procedures, and providing a definitive rationale for the choice of psychometric model for scoring. For example, in discussing validity evidence, the Standards (AERA, APA, NCME, 1999, p. 17) suggest that ". . . the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. This includes evidence of careful test construction. . . ."
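The planning decisions and questions listed above lend themselves to an explicit, reviewable record. The sketch below is one illustrative way (not a procedure from this chapter) to capture Step 1 decisions in a structured form so they can be documented and revisited; every field name and example value is hypothetical.

```python
# A minimal sketch, assuming only the kinds of decisions named in Step 1.
# Recording them in one structure makes omissions easy to spot and gives the
# later technical report (Step 12) something concrete to document.
from dataclasses import dataclass

@dataclass
class TestDevelopmentPlan:
    purpose: str                      # clear, concise statement of intended score use
    construct: str                    # what the test is intended to measure
    item_formats: list[str]           # selected response, constructed response, performance
    delivery_mode: str                # paper-and-pencil or computer based
    psychometric_model: str           # classical test theory or item response theory
    timeline: dict[str, str]          # milestone -> target date
    responsibilities: dict[str, str]  # task -> responsible role
    security_controls: list[str]      # e.g., limited-access item files
    quality_controls: list[str]       # e.g., independent reviews, correction cycle

# Hypothetical example values, for illustration only.
plan = TestDevelopmentPlan(
    purpose="End-of-course achievement decisions for a test development curriculum",
    construct="Knowledge of test development principles",
    item_formats=["multiple choice"],
    delivery_mode="paper-and-pencil",
    psychometric_model="classical test theory",
    timeline={"content definition": "month 1", "item writing": "months 2-4"},
    responsibilities={"item editing": "professional test editor"},
    security_controls=["limited-access item files"],
    quality_controls=["independent content review", "off-press proof check"],
)
print(plan.purpose)
```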
STEP 2: CONTENT DEFINITION

One of the most important questions to be answered in the earliest stages of test development is: What content is to be tested? All ability and achievement tests rely heavily on content-related validity evidence to make fundamental arguments to support (or refute) specific interpretations of test scores (Kane, chap. 7, this volume). No other issue is as critical, in the earliest stages of developing effective tests, as delineating the content domain to be sampled by the examination. If the content domain is ill defined or not carefully delineated, no amount of care taken with other test development activities can compensate for this inadequacy. The validity of inferences for achievement test scores rests primarily and solidly on the adequacy and defensibility of the methods used to define the content domain operationally, delineate clearly the construct to be measured, and successfully implement procedures to systematically and adequately sample the content domain.

The chapter on practice analysis (Raymond & Neustel, chap. 9, this volume) presents a thorough discussion of this empirical method of content definition, especially in the context of credentialing examinations. The Standards (AERA, APA, NCME, 1999) clearly endorse the use of empirical job analysis for selection and employment examinations (Standard 14.8), but leave the methodology for content definition more open-ended for other types of examinations. Webb (chap. 8, this volume) addresses the defining of content in achievement tests.

Content-defining methods vary in rigor, depending on the purpose of the test, the consequences of decisions made from the resulting test scores, and the amount of defensibility required for any decisions resulting from test scores. For some lower stakes achievement tests, the content-defining methods may be very simple and straightforward, such as instructors making informal (but informed) judgments about the appropriate content to test. For other very high-stakes examination programs, content definition may begin with a multiyear task or job analysis, costing millions of dollars, and requiring the professional services of many testing professionals.

For high-stakes achievement examinations, test content-defining methods must be systematic, comprehensive, and defensible. For instance, a professional school may wish to develop an end-of-curriculum comprehensive achievement test covering the content of a two-year curriculum, with a passing score on this test required to continue on to a third year of professional education. In this example, rigorous and defensible methods of content definition and delineation of the content domain are required; all decisions on content, formats, and methods of content selection become essential aspects of validity evidence. In this situation, faculty and administrators may need to develop systematic, thorough methods to define all the content at risk for such a high-stakes test, looking to curricular documents, teaching syllabi, instructional materials and content, textbook content, and faculty judgment to behaviorally and operationally define content. The input of professional measurement specialists is desirable for such a high-stakes testing environment. Instructors might be asked to complete empirical methods similar to practice or job analysis to rate and rank the "importance to test" of content statements, after comprehensive lists of all content had first been identified.
Their judgments might be further refined by asking representative faculty to rate the "criticality" of content statements or make some judgments about how essential the content statement is to future learning or to the ultimate successful practice of the professional discipline. From such a multistage content-defining process, a complete content domain could result, from which a sampling plan to guide the creation and selection of item content could then be developed.

Credentialing tests in the professions (or occupations) are often required as one of several sources of evidence of competence to be licensed to practice the profession in a jurisdiction. Public safety is typically the greatest concern for licensing tests, such that the persons using the services of the professional have some minimal confidence in the professional's ability to deliver services safely. Clearly, the content-defining and delineation methods used for such high-stakes tests must be much more rigorous and defensible than for many (if not most) other testing situations. Formal practice or task analysis methods (Raymond & Neustel, chap. 9, this volume) are usually required for these high-stakes tests, because the consequences of misclassifications can have serious impact on both society and the individual professional seeking a license to practice.

Certification examinations in the professions share many of the same content-defining requirements as licensure examinations noted above. Traditionally, certification was thought to be voluntary and over and above a basic license to practice. More recently, the distinction between licensure and certification has been blurred in many professions. In some professions, like medicine, certification in a medical specialty (such as General Surgery) is essentially required if the individual hopes to practice in the specialty or subspecialty. The content-defining methods required to support (or refute) validity arguments and support the ultimate defensibility for many such certifying examination programs must be extremely rigorous (Raymond & Neustel, chap. 9, this volume).

The defensibility of the content-defining process is associated with the rigor of the methods employed. One essential feature is the unbiased nature of the methods used to place limits around the universe or domain of knowledge or performance. Further, the requirements for defensibility and rigor of content-defining methods are directly proportional to the stakes of the examination and the consequences of decisions made about individuals from the resulting test scores. Rigor, in this context, may be associated with attributes such as dependability of judgments by content experts or subject matter experts (SMEs), qualifications of the SMEs making the content judgments, the inherent adequacy of the judgmental methods employed (lack of bias), the number, independence, and representativeness of the SMEs, and so on. Empirical methods may be used to support the adequacy of the judgmental methods. For example, it may be important to demonstrate the reproducibility of independent SME judgments by computing and evaluating a generalizability coefficient (Brennan, 2001) for raters or some other statistical index of consistency of ratings.
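As an illustration of the kind of rater-consistency evidence just described, the sketch below estimates variance components for a balanced content-statements-by-raters design and the resulting generalizability coefficients. It is a minimal example of one common approach, not a procedure prescribed by this chapter; the rating matrix is invented for illustration.

```python
# Minimal sketch: variance components for a p x r (statements x raters) design
# and the relative (E rho^2) and absolute (Phi) generalizability coefficients.
import numpy as np

def g_coefficients(ratings: np.ndarray) -> dict:
    """ratings: rows = content statements (objects of measurement), cols = raters."""
    n_p, n_r = ratings.shape
    grand = ratings.mean()
    ss_p = n_r * np.sum((ratings.mean(axis=1) - grand) ** 2)      # statements
    ss_r = n_p * np.sum((ratings.mean(axis=0) - grand) ** 2)      # raters
    ss_pr = np.sum((ratings - grand) ** 2) - ss_p - ss_r          # interaction + error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)
    return {
        "E_rho2": var_p / (var_p + var_pr / n_r),                 # relative decisions
        "Phi": var_p / (var_p + (var_r + var_pr) / n_r),          # absolute decisions
    }

# Hypothetical data: five content statements rated 1-5 by three SMEs.
ratings = np.array([[5, 4, 5],
                    [3, 3, 4],
                    [2, 1, 2],
                    [4, 4, 3],
                    [1, 2, 1]])
print(g_coefficients(ratings))
```

Values near 1.0 would suggest that the SME importance judgments are highly reproducible; low values would argue for more raters or clearer rating instructions before the ratings are used to shape the domain.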
Methods chosen to define test content are critical and depend on the purpose of the test, the consequences of the decisions made on the basis of the test scores, and the validity evidence needed to support test score interpretations. The definition of test content is ultimately a matter of human judgment. Methods and procedures may be developed and used to minimize bias and increase the objectivity of the judgments, but in the end professional judgments by content experts shape the content domain and its definition. Such inherent subjectivity can lead to controversy and disagreement, based on politics rather than content.

STEP 3: TEST SPECIFICATIONS: BLUEPRINTING THE TEST

The process of creating test specifications guides detailed test development activities and completes the operational planning for tests in a systematic manner. Test specifications and test blueprint are sometimes used almost interchangeably. For this chapter, test specifications refers to a complete operational definition of test characteristics, in every major detail, and thus includes what some authors call the test blueprint. For example, at a minimum, the test specifications must describe (1) the type of testing format to be used (selected response or constructed response/performance); (2) the total number of test items (or performance prompts) to be created or selected for the test, as well as the type or format of test items (e.g., multiple choice, three option, single-best answer); (3) the cognitive classification system to be used (e.g., modified Bloom's taxonomy with three levels); (4) whether or not the test items or performance prompts will contain visual stimuli (e.g., photographs, graphs, charts); (5) the expected item scoring rules (e.g., 1 point for correct, 0 points for incorrect, with no formula scoring); (6) how test scores will be interpreted (e.g., norm or criterion referenced); and (7) the time limit for each item. A test blueprint defines and precisely outlines the number (or proportion) of test questions to be allocated to each major and minor content area and how many (what proportion) of these questions will be designed to assess specific cognitive knowledge levels. The higher the stakes or consequences of the test scores, the greater the detail that should be associated with the test specifications; detailed test specifications provide a major source of validity evidence for the test. But, at minimum, all test specifications must include some range of expected items/prompts to be selected for each major content category and each major cognitive process level (Linn, chap. 2, this volume).

The test specifications form an exact sampling plan for the content domain defined in Step 2. These documents and their rationales form a solid foundation for all systematic test development activities and for the content-related validity evidence needed to support score inferences to the domain of knowledge or performance and the meaningful interpretation of test scores with respect to the construct of interest.

How do the content-defining activities of Step 2 get translated into exact test specifications in Step 3? The rigor and sophistication of the blueprinting methods will depend on the consequences of testing. Table 1.2 presents a simple example of a test blueprint for an achievement examination over a curriculum on test development. This simple blueprint operationalizes judgments about the content appropriate for sampling in this achievement test. In this example, the test developer has allocated the item content in proportion to the amount of instructional time devoted to each topic and has further weighted these allocations using some professional judgment about the relative importance of topics.
TABLE 1.2
Example Test Blueprint: Achievement Examination on Test Development
(number of items by content area and cognitive level)

Content area                              Recall   Application   Problem solving   Total
Content definition                           4         10              6             20
Test specifications                          3          8              4             15
Item writing                                 4         10              6             20
Test assembly, printing, administration      2          5              3             10
Test scoring                                 3          7              5             15
Test standards                               4         10              6             20
Totals                                      20%        50%            30%           100%

For example, content dealing with item writing was judged to be about twice as important as content dealing with test assembly, printing, and administration. Such judgments are subjective, but should reflect the relative emphasis in the curriculum as represented by the instructional objectives. This blueprint further suggests that the test developer has some judgmental basis for allocating one half of all test questions to the cognitive level labeled application, whereas only 20 percent is targeted for recall and 30 percent for higher order problem solving. These cognitive-level judgments must reflect the instructional objectives and the instructional methods used for teaching and learning.

For high-stakes, large-scale examinations such as licensing or certifying examinations, the methods used to operationalize the task or practice analysis are much more formal (e.g., Raymond, 2001; Raymond & Neustel, chap. 9, this volume; Spray & Huang, 2000). Webb's chapter (chap. 8, this volume) addresses many of these issues for other types of achievement tests, especially those used for high-stakes accountability tests in the schools. Typically, the detailed results of an empirical practice analysis are translated to test specifications using some combined empirical and rational/judgmental method. For example, the empirical results of the practice analysis may be summarized for groups of representative content experts, who are tasked with using their expert judgment to decide which content areas receive what specific relative weighting in the test, given their expert experience.

The Standards (AERA, APA, NCME, 1999) emphasize documentation of the methods used to establish test specifications and blueprints, their rationale, and the evidence the test developers present to support the argument that the particular test specification fairly represents the content domain of interest. The Standards relating to test specifications and test blueprints emphasize the use of unbiased, systematic, well-documented methods to create test sampling plans, such that the resulting examination can have a reasonable chance of fairly representing the universe of content.
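To make the arithmetic behind a blueprint like Table 1.2 concrete, the sketch below converts judged content weights and cognitive-level targets into per-cell item counts for a fixed-length test. It is an illustrative calculation only: the weights mirror Table 1.2, but the rounding rule is my own assumption, and in practice final cell counts are adjusted judgmentally.

```python
# Minimal sketch: allocate items to blueprint cells from content weights and
# cognitive-level proportions. Rounded cells should be checked so that row and
# column totals still match the intended test length.
content_weights = {                      # relative emphasis judged by the developer
    "Content definition": 20, "Test specifications": 15, "Item writing": 20,
    "Assembly/printing/administration": 10, "Test scoring": 15, "Test standards": 20,
}
cognitive_targets = {"Recall": 0.20, "Application": 0.50, "Problem solving": 0.30}
test_length = 100

total_weight = sum(content_weights.values())
blueprint = {}
for area, weight in content_weights.items():
    area_items = round(test_length * weight / total_weight)
    blueprint[area] = {level: round(area_items * p)
                       for level, p in cognitive_targets.items()}

for area, row in blueprint.items():
    print(f"{area:35s} {row}  (total {sum(row.values())})")
```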
STEP 4: ITEM DEVELOPMENT

This step concentrates on a discussion of methods used to systematically develop selected-response items, using the multiple-choice item form as the primary exemplar. Downing (chap. 12, this volume) discusses the selected-response formats in detail in another chapter and Welch (chap. 13, this volume) discusses development of performance prompts.

Creating effective test items may be more art than science, although there is a solid scientific basis for many of the well-established principles of item writing (Haladyna, Downing, & Rodriguez, 2002). The creation and production of effective test questions, designed to measure important content at an appropriate cognitive level, is one of the greater challenges for test developers.

Early in the test development process, the test developer must decide what test item formats to use for the proposed examination. For most large-scale, cognitive achievement testing programs, the choice of an objectively scorable item format is almost automatic. The multiple-choice format (and its variants), with some ninety years of effective use and an extensive research basis, is the item format of choice for most testing programs (Haladyna, 2004). Test developers need not apologize for using multiple-choice formats on achievement tests; there is strong research evidence demonstrating the high positive correlation between constructed-response and selected-response item scores for measuring knowledge and many other cognitive skills (Rodriguez, 2003). (To measure complex constructs, such as writing ability, a constructed-response format is typically required.)

The multiple-choice item is the workhorse of the testing enterprise, for very good reasons. The multiple-choice item is an extremely versatile test item form; it can be used to test all levels of the cognitive taxonomy, including very high-level cognitive processes (Downing, 2002a). The multiple-choice item is an extremely efficient format for examinees, but is often a challenge for the item writer.

The choice of item format is a major source of validity evidence for the test. A clear rationale for item format selection is required. In practice, the choice of item form (selected response versus constructed response) may quite legitimately rest largely on pragmatic reasons and issues of feasibility. For example, for a large-scale, paper-and-pencil examination program, it may not be cost effective or time efficient to use large numbers of constructed-response questions. And, given the research basis supporting the use of multiple-choice items (e.g., Downing, 2002a, 2004; Haladyna, 2004; Rodriguez, 2003), the test developer need not feel insecure about the choice of a low-fidelity selected-response format, like the multiple-choice format, for an achievement test.

The principles of writing effective, objectively scored multiple-choice items are well established, and many of these principles have a solid basis in the research literature (Downing, 2002b, 2004; Haladyna, 2004; Haladyna & Downing, 1989a, 1989b; Haladyna et al., 2002). Yet, knowing the principles of effective item writing is no guarantee of an item writer's ability to actually produce effective test questions. Knowing is not necessarily doing. Thus, one of the more important validity issues associated with test development concerns the selection and training of item writers (Abedi, chap. 17, this volume; Downing & Haladyna, 1997). For large-scale examinations, many item writers are often used to produce the large number of questions required for the testing program. The most essential characteristic of an effective item writer is content expertise. Writing ability is also a trait closely associated with the best and most creative item writers. For some national testing programs, many other item writer characteristics such as regional geographic balance, content subspecialization, and racial, ethnic, and gender balance must also be considered in the selection of item writers. All of these item writer characteristics and traits are sources of validity evidence for the testing program and must be well documented in the technical report (Becker & Pomplun, chap. 30, this volume).

Item Writer Training

Effective item writers are trained, not born. Training of item writers is an important validity issue for test development.
Without specific training, most novice item writers tend to create poor-quality, flawed, low-cognitive-level test questions that test unimportant or trivial content. Although item writers must be expert in their own disciplines, there is no reason to believe that their subject matter expertise generalizes to effective item writing expertise. Effective item writing is a unique skill and must be learned and practiced. For new item writers, it is often helpful and important to provide specific instruction using an item writer's guide, paired with a hands-on training workshop (Haladyna, 2004). As with all skill learning, feedback from expert item writers and peers is required. The instruction-practice-feedback-reinforcement loop is important for the effective development and maintenance of solid item writing skills (Jozefowicz et al., 2002).

Competent content review and professional editing of test questions is also an important source of validity evidence for high-stakes examinations. All test items should be written against detailed test specifications; these draft or raw items should then be reviewed and edited by a professional test editor, who has specialized editorial skills with respect to testing. Professional editing of test items is an essential validity requirement for high-stakes examinations (see Baranowski, chap. 15, this volume, for a complete discussion). Competent professional editing of test items increases the accuracy and clarity of the questions and often resolves issues or corrects flaws in item content, thus increasing the validity evidence for the test. Professional editors are often responsible for initiating so-called sensitivity reviews for items, with the goal of reducing or eliminating cultural, ethnic, religious, or gender issues or sensitivities, especially in high-stakes tests (e.g., Baranowski, chap. 15, this volume; Abedi, chap. 17, this volume). Professional test editors often find content errors or inconsistencies, which can be clarified by item authors or item content reviewers. Additional content reviews are required, by independent subject matter experts, for high-stakes examinations. Such independent content reviews, after professional editorial review, strengthen the content-related validity evidence for the test.

All test item writers benefit from specialized training. Some of the worst test item examples are found on instructor-developed tests, at all levels of education (e.g., Mehrens & Lehmann, 1991). Poor-quality and flawed test items introduce construct-irrelevant variance (CIV) to the assessment, potentially decreasing student passing rates by increasing the mean item difficulty (Downing, 2002b, 2004). Ideally, all item writers with any responsibility for achievement assessment have some special expertise in item writing and test development, gained through effective training and practice. Unfortunately, this is too often not the case and is one of the major failings of educational assessment at all levels of education, from K-12 to graduate professional education.

The creation of effective test items (or performance prompts) is a challenging but essential step in test development. Because the test item is the major building block for all tests, the methods and procedures used to produce effective test items are a major source of validity evidence for all testing programs.
This fact is reflected in at least five separate standards with respect to test items and their creation and production.

STEP 5: TEST DESIGN AND ASSEMBLY

Assembling a collection of test items (or performance prompts) into a test or test form is a critical step in test development. Although often considered a mundane task, test assembly must be competent and accurate, because the validity of the final test score interpretation relies very much on this process. Quality control is the keyword most associated with test assembly. The absence of errors in the test assembly process often goes unnoticed; errors or serious flaws and omissions in the test assembly process can be obviously glaring and have the potential of seriously reducing the validity evidence for examination scores.

The overall design of the test, planned in detail in Step 1, provides a sound rationale related to the purpose of testing and the planned interpretation and use of the test scores. This formal overall test design creates the theoretical basis for Step 5. Several other chapters in this book address test design issues in detail (see Davey & Pitoniak, chap. 24; Luecht, chap. 25; Roid, chap. 23; Jones, Smith, & Talley, chap. 22; Wendler & Walker, chap. 20; Young, chap. 21, this volume).

The specific method and process of assembling test items into final test forms depends on the mode of examination delivery. If a single test form is to be administered in paper-and-pencil mode, the test can be assembled manually, by skilled test developers (perhaps using computer software to assist manual item selection and assembly). If multiple parallel test forms are to be assembled simultaneously, human test developers using advanced computer software can assemble the tests. If the test is to be administered as computer-based, more specialized computer software will likely be needed to assemble multiple test forms to ensure proper formatting of the fixed-length test form for the computer-delivery software. If the test is to be administered as a computer-adaptive test, very advanced computer software (automatic test assembly software) will likely be required to adequately manage and resolve multiple test item characteristics simultaneously to create many equivalent test forms (Luecht, chap. 25, this volume).
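As a toy illustration of the simplest, non-adaptive case described above, the sketch below fills each content-by-cognitive-level cell of a blueprint from a pool of eligible items. The item bank fields and random tie-breaking rule are my own assumptions; operational assembly software balances many additional constraints (item statistics, enemy items, exposure controls) simultaneously.

```python
# Minimal sketch: blueprint-driven selection of items into a single fixed form.
import random

def assemble_form(item_bank, blueprint, seed=0):
    """item_bank: list of dicts with 'id', 'content_area', 'cog_level' (hypothetical fields).
    blueprint: dict mapping (content_area, cog_level) -> required item count."""
    rng = random.Random(seed)
    form, used = [], set()
    for (area, level), needed in blueprint.items():
        eligible = [it for it in item_bank
                    if it["content_area"] == area
                    and it["cog_level"] == level
                    and it["id"] not in used]
        if len(eligible) < needed:
            raise ValueError(f"Item bank is short for cell ({area}, {level})")
        picked = rng.sample(eligible, needed)
        used.update(it["id"] for it in picked)
        form.extend(picked)
    return form

# Hypothetical usage with a two-cell blueprint.
bank = [{"id": i, "content_area": "Item writing",
         "cog_level": "Recall" if i % 2 else "Application"} for i in range(40)]
form = assemble_form(bank, {("Item writing", "Recall"): 4,
                            ("Item writing", "Application"): 10})
print(len(form), "items selected")
```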
For most achievement tests, whatever the delivery mode, the most important validity-related issues in test assembly are the correspondence of the content actually tested to the content specifications developed in Step 3 and the high-level quality control of this entire process. The test assembly step operationalizes the exacting sampling plan developed in Steps 2 and 3 and lays the solid foundation for inferential arguments relating sample test scores to population or universe scores in the domain. This is the essence of the content-related validity argument, which, to be taken seriously, must be independently verifiable by independent, noninvested content experts (see Kane, chap. 7, this volume, for a complete discussion).

Other major considerations in test assembly, at least for traditional paper-and-pencil tests, relate to formatting issues (see Campion & Miller, chap. 26, this volume, for a complete discussion). Tests must be formatted to maximize the ease of reading and minimize any additional cognitive burden that is unrelated to the construct being tested (e.g., minimize CIV). Traditional principles, such as formatting items so that the entire item, together with any visual or graphical stimuli, appears on the same page (or frame), fall under this rubric of minimizing the potential CIV associated with overly complex item formatting.

Other formatting issues are more psychometric in nature. For example, the placement of pretest (tryout) items within the test form is a formatting issue of interest and concern. Ideally, pretest items are scattered throughout the test form randomly, to minimize any effects of fatigue or lack of motivation (if items are recognized by examinees as pretest-only items). (Such random distribution of pretest items is not always practical or feasible.) Both Standards cited as relating to Step 5 address pretest or tryout items, suggesting thorough documentation of the item sampling plan and the examinee sampling plan.

Key Balance of Options and Other Issues

Balance of the position or location of the correct answer is an important principle for all selected-response items assembled into discrete test forms. The principle is straightforward: there should be an approximately equal frequency of correct responses allocated to the first position (e.g., A), the second position (B), and so on. Approximating this goal is sometimes difficult, given other important constraints from the item-writing principles. For example, all options containing numbers must be ordered from high to low or low to high, such that the location of the correct answer cannot be rearranged in such an item. The reason for this key-balance principle is that both item writers and examinees have a bias toward the middle position: item writers tend to place the correct answer in the middle position, and examinees tend to select a middle-position answer as a natural testwise inclination (Attali & Bar-Hillel, 2003).

The placement of anchor or common items used in a common-item equating design is also a formatting issue with validity implications (Kolen & Brennan, 2004). Ideally, the common items used to link test forms in a classical equating design should appear in a similar location in the new test as they did in the prior test, to eliminate any potential order effect on the common items' difficulty and/or discrimination. (Because of other operational and practical considerations, it is not always possible to place anchor equating items in exactly the same location.)

Throughout the test assembly process, test developers must be keenly aware of test security issues. Although security of tests and test items is a constant concern throughout all steps of test development, the product of the test assembly stage is a nearly final form of a test. If this test form is compromised in any way, the test must be considered insecure. If the stakes are high, the loss of a near-final test form is extremely disruptive and costly (see Impara & Foster, chap. 5, this volume).

Quality control procedures are essential for accurate and successful test assembly methods. Unless effective and thorough quality control methods are devised and implemented, errors of test assembly will occur, and these errors will reduce the legitimacy of the final test score interpretation, thus reducing the validity evidence for the scores.
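One simple quality control check that follows directly from the key-balance principle described earlier in this step is to tally how often each option position carries the keyed answer on an assembled form. The sketch below is a hypothetical illustration only; the answer key shown is invented.

```python
# Quick sketch: frequency of the keyed (correct) answer by option position,
# so that obvious imbalances can be caught before the form is finalized.
from collections import Counter

answer_key = ["B", "C", "A", "D", "B", "A", "C", "D", "B", "C",
              "A", "B", "D", "C", "A", "B", "C", "D", "A", "D"]  # hypothetical form
counts = Counter(answer_key)
n = len(answer_key)
for option in sorted(counts):
    print(f"Option {option}: {counts[option]:2d} ({counts[option] / n:.0%})")
# A strongly skewed distribution (e.g., most keys in the middle positions)
# signals the need to rekey or reorder options where item-writing rules allow.
```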
STEP 6: TEST PRODUCTION

The production, printing, or publication of examinations is another routine step of test development that is often overlooked with respect to its validity aspects. For example, there appear to be no Standards that bear directly on this important test development activity. All the prior test development work comes to fruition in Step 6, where the months or years of prior work are finally "cast in stone." Campion and Miller (chap. 26, this volume) discuss test production issues and their effects on validity in detail.

All tests, whether large-scale national testing programs or much smaller local testing programs, must ultimately be printed, packaged for computer administration, or published in some form or medium. Test production activities and their validity implications apply equally to performance tests and selected-response tests. Step 6, test production, truly operationalizes the examination, making final all test items, their order, and any visual stimuli associated with the test items. This is the test, as it will be experienced by the examinee, for better or worse. Clearly, what the examinee experiences as the final test form has major implications for how scores on the test can be interpreted, and this is the critical validity aspect associated with test production.

Security issues are prominent for test production. Human error is the most likely source of test security breaches, even in this era of high-tech, computer-assisted test production. During the production process, whether it involves physical test booklet printing or packaging activities for computer-based tests, final test items may be available in some form to more individuals than at any prior time during test development. All reasonable security precautions must be taken during test production, during the electronic transmission of secure test items, the secure shipping of printed test copy and printed booklets, and the secure destruction of excess printed materials. Test production security standards and policies must be developed, implemented, and quality controlled for all high-stakes examinations and should reflect the consequences of testing somewhat proportionally. Independent audits of these security procedures should be carried out periodically by security professionals, especially for all computer-based security systems. All secure test materials must be locked in limited-access files at all times they are not in direct use by test developers and production staff. For high-stakes tests, staff should have access only to those materials needed to complete their specified tasks, and only on a time-limited basis; high security control must be maintained and frequently reviewed and updated for all computer systems (Impara & Foster, chap. 5, this volume).

For printed tests, printers usually can provide some type of off-press copy for final review by test development staff.
This final preprinting quality control step is important even for tests that are printed directly from camera-ready copy or from a direct electronic file; typographical errors or other major, potentially item-invalidating errors that were missed at all other proofreading and quality control stages can often be identified (and corrected) at this late stage.

Other quality control issues are equally important for test production. For example, if a test is being printed by a printing company or service, test development staff (in addition to the printer's staff) must take responsibility for many quality assurance procedures. This may mean that test development staff randomly sample some number of final printed booklets to ensure the completeness of the test booklets (e.g., no pages missing) and the overall quality (readability) of the final printed copy, including visual material, and so on.

The quality and readability of the final test printing or production is important to the validity evidence for the test (Campion & Miller, chap. 26, this volume). If test items or visual stimuli are unclearly printed, test booklet pages are omitted, items are misordered, or options are out of order, such items are potentially invalidated and must be omitted from final scoring. Production and printing errors (or formatting errors for computer-based tests) can seriously reduce the validity evidence for a test and create an aura of distrust and anxiety concerning many other aspects of the test, its construction, and its scoring. Great care and effective quality control measures must be exercised during the final production process for all tests, no matter what the modality of delivery. Maintaining complete control over test security, with independently verifiable audit trails, is essential during the production process.

STEP 7: TEST ADMINISTRATION

The administration of tests is the most public and visible aspect of testing. There are major validity issues associated with test administration because much of the standardization of testing conditions relates to the quality of test administration. Whether the test is administered in local school settings by teachers, in large multisite venues by professional proctors, or by trained staff at nationwide computer-based testing centers, many of the basic practices of sound test administration are the same (see McCallin, chap. 27, this volume, for a detailed discussion of test administration issues, and Thurlow, Thompson, & Lazarus, chap. 28, this volume, for a discussion of special-needs test administration issues, including Americans With Disabilities Act issues).

Standardization is a common method of experimental control for all tests. Every test (and each question or stimulus within each test) can be considered a mini experiment (van der Linden & Hambleton, 1997). The test administration conditions (standard time limits, proctoring to ensure no irregularities, environmental conditions conducive to test taking, and so on) all seek to control extraneous variables in the experiment and to make conditions uniform and identical for all examinees. Without adequate control of all relevant variables affecting test performance, it would be difficult to interpret examinee test scores uniformly and meaningfully. This is the essence of the validity issue for test administration considerations.

Security is a major concern for test administration.
For most examinations (e.g., large-scale, high-stakes tests), the entire test development process is highly secure, with extremely limited access to test materials. Much effort and expense are devoted to securing examination items and test forms. This chain of security is necessarily widened during test production and becomes extremely wide for large-scale test administration. For paper-and-pencil examinations administered at multiple sites, printed test forms and all testing materials must be securely shipped to test sites; securely received and maintained by proctors; distributed to examinees in a secure, controlled, and auditable fashion; collected, accounted for, and reconciled; and securely returned to the test sponsor. Detailed organizational and logistical planning is required for competent, secure test administration.

For paper-and-pencil test administration, proctoring is one of the most critical issues. For large-scale testing programs, testing agencies typically have a cadre of well-trained and experienced proctors available to oversee test administration. These professional proctors are key to successful, secure, well-organized, and well-administered national examinations. It is generally preferable to designate a single, highly experienced proctor as chief proctor for the testing site; this chief proctor assumes full responsibility for all aspects of the secure test, including supervision of the other proctors on site (see McCallin, chap. 27, this volume, for a detailed discussion of proctoring).

For large-scale computer-based tests, proctoring is generally delegated to the agency providing the computer test administration network. Computer-based testing changes the test security environment. Some security issues associated with paper-and-pencil testing are eliminated, such as printing test booklets, shipping secure test materials, distributing printed test forms to examinees, and so on. However, other potential test security vulnerabilities are increased, such as the electronic transmission of secure test items and response data, the need to have large numbers of items available on local computer servers, and the possible use of less well-trained and less professional administrative staff at the testing sites.

The Standards associated with test administration generally deal with issues of standardization and examinee fairness, such as time limits, clarity of directions to examinees, and standard conditions for testing.

Test administration is an extremely important component of test development. Competent, efficient, and standardized administration of tests provides important validity evidence for test scores. Deficiencies in the detailed planning and logistics required for high-quality test administration can lead to a serious reduction in validity evidence for the examination. The importance of test administration to validity evidence is equally true for a selected-response test as for a performance examination; a performance test typically adds many challenges and complex logistics for administration. Security problems during test administration can lead to the invalidation of some or all examinee scores and can require the test developers to retire or eliminate large numbers of very expensive test items. Improper handling of issues pertaining to the Americans with Disabilities Act (ADA) can lead to legal challenges for the test developers.
Improper overuse of ADA accommodations can lead to misinterpretations of test scores and other types of validity issues associated with score misinterpretation.

STEP 8: SCORING EXAMINATION RESPONSES

Test scoring is the process of applying a scoring key to examinee responses to the test stimuli. Examinee responses to test item stimuli or performance prompts provide the opportunity to create measurement. The responses are not the measurement; rather, the application of scoring rules, algorithms, or rubrics to the responses results in measurement (Thissen & Wainer, 2001; van der Linden & Hambleton, 1997).

There are many fundamental validity issues associated with test scoring. The most obvious issue relates to the accuracy of scoring. If final test scores are to have valid meaning, especially the meaning anticipated by the test developers (a measure of the construct of interest, an adequate sample of the domain of knowledge, and so on), a scoring key must be applied with perfect accuracy to the examinee item responses. Scoring errors always reduce validity evidence for the test and can invalidate the results. Validity evidence can be reduced either by a faulty (inaccurate) scoring key or by a flawed or inaccurate application of the scoring key to responses. Thus, high levels of quality control of the scoring process are essential to validity.

Scoring can be extremely simple or very complex, depending on the type of test item or stimulus. The responses to single-best-answer multiple-choice items are easily scored by computer software, whereas responses to complex computer simulation problems can be more challenging to score reliably. Selected-response items are generally more efficiently and objectively scored than constructed-response items and performance prompts; however, constructed-response and performance items can be scored accurately and reliably by competently trained and monitored scorers or by computer software.

All scoring issues, for all types of tests and testing formats, from relatively straightforward selected-response formats to extremely complex performance simulations, concern the accurate representation of examinee performance with respect to the measured construct or the domain of knowledge, skills, or ability (e.g., Thissen & Wainer, 2001). Ideally, the scored examination responses correspond closely (nearly one to one) with the examinee's true state with respect to the domain measured by the test or the construct of interest. Insofar as the test scores depart from this one-to-one correspondence, random measurement error or systematic construct-irrelevant measurement error (CIV) has been introduced into the measurement. Much of psychometric theory begins at the scored-response point of test development. In Step 8, a basic test scoring process, appropriate for nearly all achievement tests, is discussed.
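As a simple illustration of applying a scoring key, the Python sketch below dichotomously scores selected-response strings against a verified answer key. The data layout (one response character per item, indexed by an examinee identifier) is an assumption for illustration; operational scoring systems add far more extensive validation, auditing, and handling of omitted or ambiguous responses.

def score_responses(response_strings, answer_key):
    """Apply a verified answer key to examinee response strings; each string
    holds one response character per item. Returns raw (number-correct) scores."""
    scores = {}
    for examinee_id, responses in response_strings.items():
        if len(responses) != len(answer_key):
            raise ValueError(f"Examinee {examinee_id}: unexpected response length")
        scores[examinee_id] = sum(r == k for r, k in zip(responses, answer_key))
    return scores

# Hypothetical five-item key and two examinee response strings
key = list("ACBDA")
print(score_responses({"E001": list("ACBDC"), "E002": list("ACBDA")}, key))
# {'E001': 4, 'E002': 5}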
Preliminary Scoring and Key Validation

A preliminary scoring and item analysis, with a final verification of the scoring key by content experts, is an essential quality control procedure for many tests (see Livingston, chap. 19, this volume, for a complete discussion of item analysis issues). A final key validation or key verification step increases the validity evidence for all examinations. This two-step scoring process is essential for tests containing newly written, non-pretested items, because such items may contain invalidating flaws that were not detected during the item writing and review process. Key validation is the process of preliminary scoring and item analysis of the test data, followed by a careful evaluation of the item-level data to identify potentially flawed or incorrect items prior to final test scoring (Downing & Haladyna, 1997). This key validation process is nearly identical for selected-response tests and performance tests and should be carried out for all test modalities, if possible (e.g., Boulet, McKinley, Whelan, & Hambleton, 2003).

After examinee item responses are scanned from paper-and-pencil answer sheets, response strings are provided from a computer-based test administration, or data are computer entered for performance examinations, an initial scoring and item analysis should be completed. Items that perform anomalously should be identified for final review of the scoring key, or of other potential content problems, by subject matter experts. Item difficulty and item discrimination criteria or ranges for flagging items for key validation should be developed for every testing program (see Haladyna, 2004, p. 228, for example). Typically, items that are very difficult and/or have very low or negative item discrimination indices are identified for further content review. The key validation criteria should be sufficiently sensitive that all potentially problematic questions are identified for final content review. The results of key validation, such as the number and type of questions identified for review and the disposition of these items (scored as is, eliminated from final scoring, or key changed), are a source of validity evidence and should be documented in the technical report.

If large numbers of items are identified by the key validation procedures, either the criteria are too liberal or there are serious problems with the item writing and review process. If large numbers of items are eliminated from the final scoring, for reasons of poor item quality or incorrect content, content-related validity evidence is compromised.

Final Scoring

Final scoring of the examinee responses follows the preliminary scoring and key validation procedures. The final answer key must be carefully proofread and quality controlled for absolute accuracy. Great care must be taken to ensure the complete accuracy of this final scoring key. Subject matter experts who have direct responsibility for the content of the examination must formally approve the final scoring key, especially if testing services are being provided by an outside or contract agency. If multiple forms of the same examination are used, great care must be taken to ensure that the correct scoring key is used for each test form. All pretest or tryout items must be carefully segregated from the final scoring so that pretest questions do not contribute to the final test score in any way (see Livingston, chap. 19, this volume).

A final item analysis should be completed and reviewed carefully. The final item analysis provides another important quality control step to ensure that any changes made at the key validation stage are accurately reflected in the final scoring. A complete final item analysis includes summary test statistics for the test administration. For tests scored using classical measurement theory, these statistics include the raw score mean and standard deviation, the mean item difficulty (p-value), the mean item discrimination (point-biserial or biserial correlation), the range of raw scores, and the test score reliability (coefficient alpha or Kuder-Richardson 20), plus other appropriate indices of overall test quality, such as an index of pass-fail reproducibility (especially for high-stakes examinations). Summary test statistics are critically important validity evidence and must be thoroughly evaluated and documented. Any anomalies identified by the final item analysis or the final summary test statistics must be thoroughly investigated and resolved prior to reporting test scores.
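The Python sketch below illustrates these classical indices and the kind of simple flagging rules described above for key validation. The flagging thresholds (p below .30, point-biserial below .10) are arbitrary illustrations, not recommended criteria, and the point-biserial shown is uncorrected (the item is included in the total score), whereas operational programs often report the corrected value.

import statistics

def classical_item_analysis(score_matrix, flag_p=0.30, flag_rpb=0.10):
    """score_matrix: one row of 0/1 item scores per examinee.
    Returns (per-item statistics, KR-20 reliability, flagged item numbers).
    Flagging thresholds here are illustrative only."""
    n_items = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    mean_total = statistics.mean(totals)
    sd_total = statistics.pstdev(totals)
    item_stats, flagged, sum_pq = [], [], 0.0
    for j in range(n_items):
        responses = [row[j] for row in score_matrix]
        p = statistics.mean(responses)          # item difficulty (proportion correct)
        sum_pq += p * (1 - p)
        if 0 < p < 1 and sd_total > 0:
            mean_correct = statistics.mean(
                t for t, x in zip(totals, responses) if x == 1)
            # uncorrected point-biserial discrimination
            rpb = (mean_correct - mean_total) / sd_total * (p / (1 - p)) ** 0.5
        else:
            rpb = 0.0
        item_stats.append((j + 1, round(p, 2), round(rpb, 2)))
        if p < flag_p or rpb < flag_rpb:
            flagged.append(j + 1)
    variance_total = sd_total ** 2
    kr20 = ((n_items / (n_items - 1)) * (1 - sum_pq / variance_total)
            if variance_total > 0 else float("nan"))
    return item_stats, round(kr20, 2), flagged

# Hypothetical matrix: four examinees by three items
print(classical_item_analysis([[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]))

With real data, the flagged items would go to subject matter experts for the content review described above before any key change or item elimination.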
If test score scaling or equating procedures are to be carried out, these procedures usually follow the final scoring and item analysis (unless pre-equating methods are used).

The Standards related to scoring discuss ensuring that the scoring rules correspond to the stated purpose of testing and that the scoring rules are clear enough to ensure absolutely accurate final scores.

The most important emphasis in the test scoring step is complete accuracy. Extreme quality control procedures are required to ensure the total accuracy of final test scores, especially for very high-stakes examinations. Any scoring errors included in final test scores reduce validity evidence and the credibility of the examination and introduce CIV into the scores (Haladyna & Downing, 2004).

STEP 9: ESTABLISHING PASSING SCORES

Many, but not all, tests require some type of cut score (passing score) or performance standard. For tests that require cut scores, content standard interpretations, or grade levels attributed to certain test scores or score ranges, the methods and procedures used to establish cut scores are a major source of validity evidence and are an integral part of the test development process (see Cizek, chap. 10, this volume, for a complete discussion of standard setting).

For high-stakes tests, the establishment of defensible cut scores is one of the most critical test development issues. For tests of all types with any consequences for examinees, the legitimacy of the methods used to identify cut scores is a major source of validity evidence.

Step 9 highlights and generally overviews some of the important issues concerning standard setting. Standard setting is a complex issue with a sound basis in the research literature. Also, it must be noted that the methods and procedures used to establish passing scores may take place at several different stages of test development, depending on the methods used and the overarching philosophy of standard setting adopted by the test developers and users. The basic decision on the type of standard-setting method to use should be made early in the test development process, because extensive planning may be required to successfully implement certain types of standard-setting procedures. Further, some methods may require multiple exercises or studies, taking place at different times during the test development process and concluding after the test has been administered.

Relative and Absolute Standard-Setting Methods

All methods of establishing passing scores require human judgment and are therefore somewhat arbitrary (e.g., Norcini & Shea, 1997). All examination passing scores answer the question: How much knowledge (skill or ability) is needed to be classified as having passed the examination?
Traditionally, standard-setting methods are dichotomized into two major categories: relative or normative methods and absolute methods. But there are also standard-setting methods that blend characteristics of both relative and absolute methods, such as the relative-absolute or Hofstee method (Hofstee, 1983).

Relative standard-setting methods (normative methods) use actual test performance data, usually from some well-defined group of test takers (e.g., first-time takers of the test), to establish a point on the distribution of test scores to designate as the cut score or passing point (Cizek, 2001). The passing score point on the distribution of scores represents the judgment of someone, or some group of qualified individuals, responsible for making such judgments. For example, relative passing scores might be expressed as a score that is exactly one standard deviation below the mean score of all examinees; a T-score of 45, based on all examinees who took the examination for the first time; or a percentile rank of 30. The emphasis of all normative passing scores is on the relative position of the examinee's score in some distribution of scores. All examinees scoring at or above the selected cut point on the distribution pass the test, and all those scoring below that point fail. (Note that it is the method used to establish the passing score that makes the chosen score relative, not the score metric.) Relative passing scores do not address the absolute competency of examinees or make any judgment about what specific knowledge, skill, or ability has been mastered by the examinee.

Absolute passing-score methods employ systematic procedures to elicit expert judgments from SMEs concerning the amount of knowledge (skill or ability) required on a test to be considered a passing examinee. There are many well-established, well-researched methods commonly used to establish defensible and effective absolute passing scores (Cizek, 2001; Cizek, chap. 10, this volume).

All common methods used to establish absolute passing scores, on all types of examinations, require expert judgments about the expected performance of a borderline examinee. A borderline examinee is usually defined as someone who just barely passes or just barely fails the test; the borderline examinee has an exactly equal probability (50-50) of either passing or failing the test. Some commonly used methods are the Angoff method and its modifications (Angoff, 1971; Impara & Plake, 1997; Downing, Lieska, & Raible, 2003), which require expected passing-score judgments about each individual test question; the Ebel method (Ebel, 1972), which requires judges to make expected passing-score judgments for sets of items that have been classified into difficulty and relevance categories; and the Hofstee method (Hofstee, 1983), with both absolute and relative characteristics, which asks judges to state their own expected minimum and maximum passing score and failure rate (proportion of examinees failing the test) and then plots those expert judgments onto actual test data.

The absolute methods are not without their critics. For example, Zieky (1997) suggests that the judgments required of content experts, regarding the expected passing score for borderline examinees, represent an impossible cognitive task. Other controversies concern the amount of performance data to provide to the content-expert judges, although measurement experts tend toward providing more rather than less empirical performance data to standard-setting judges (e.g., Zieky, 2001).
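As a concrete illustration of the distinction drawn above, the Python sketch below computes two relative cut scores (one standard deviation below the group mean, and a T-score of 45 on the group's own scale) and an absolute, Angoff-style cut score from judge ratings. All numbers are invented for illustration; an operational standard-setting study would involve many more judges, items, and procedural safeguards.

import statistics

def relative_cut_sd(total_scores, sd_below=1.0):
    """Normative cut score: a fixed number of standard deviations below the mean."""
    return statistics.mean(total_scores) - sd_below * statistics.pstdev(total_scores)

def relative_cut_t(total_scores, t_cut=45):
    """Normative cut score expressed as a T-score (mean 50, SD 10) on this group."""
    mean, sd = statistics.mean(total_scores), statistics.pstdev(total_scores)
    return mean + ((t_cut - 50) / 10) * sd

def angoff_cut(judgments):
    """Angoff-style absolute cut: each judge rates the probability that a
    borderline examinee answers each item correctly; ratings are averaged
    over judges and summed over items to give an expected raw cut score."""
    n_items = len(judgments[0])
    item_means = [statistics.mean(judge[i] for judge in judgments)
                  for i in range(n_items)]
    return sum(item_means)

scores = [62, 70, 55, 81, 74, 66, 59, 77]               # hypothetical raw scores
ratings = [[0.6, 0.7, 0.5, 0.8], [0.5, 0.8, 0.6, 0.7]]  # two judges, four items
print(round(relative_cut_sd(scores), 1))   # one SD below the group mean
print(round(relative_cut_t(scores), 1))    # T-score of 45 on this group
print(round(angoff_cut(ratings), 1))       # expected borderline raw score

The three cut points will generally differ, which anticipates the summary point below: no standard-setting method recovers a single true passing score.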
Other, more data-centered methods, such as the contrasting groups method and the borderline group method, are also used for certain types of performance examinations. In the contrasting groups method, the actual test performance of a known group of masters, experts, or highly qualified examinees is plotted against the actual test performance of a known group of nonmasters, those who are known to be nonexpert or not qualified. The intersection of the two curves describes the passing score, which can be adjusted to minimize false-positive or false-negative errors. The borderline group method is similar, but it requires a direct expert judgment concerning which examinees are borderline in their performance; these judgments are then translated into passing scores on the examination (e.g., Wilkinson, Newble, & Frampton, 2001; Kilminster & Roberts, 2003).

The comparability of passing scores across different forms (different test administrations) is a major validity issue (e.g., Norcini & Shea, 1997). If absolute passing scores are used, it is critical that test score equating be used to maintain the constancy of the score scale. If scores are not equated, even slight differences in mean item difficulty across different test administrations make the interpretation of the passing score impossible and may unfairly advantage or disadvantage some examinees (Kolen & Brennan, 2004). Thus, most of the Standards (AERA, APA, NCME, 1999) concerning passing scores discuss equating issues. Other standards for absolute passing-score determination address issues of ensuring that the task of the standard-setting judges is clear and that the judges can in fact make reasonable and adequate judgments. Fairness of cut-score procedures and attention to the consequences or impact of passing scores are emphasized by the Standards (AERA, APA, NCME, 1999).

In summary, different standard-setting methods, whether the traditional relative methods or the more contemporary absolute methods (or other blended methods), produce different cut scores and passing rates. None of these methods is more correct than the others. The task of content-expert judges using absolute standard-setting methods is not to discover some true passing score, but rather to exercise their best professional judgment in answering the question: How much is enough (to pass)? Passing scores reflect policy, values, expert judgment, and politics. The defensibility and strength of the validity evidence for passing scores rely on the reasonableness of an unbiased process, its rationale and research basis, and the psychometric characteristics of the expert judgments.

STEP 10: REPORTING EXAMINATION RESULTS

Score reporting is an important, often complex, step in test development. The contents and format of an examinee score report should be among the many early decisions made for large-scale testing programs. There are multiple validity issues concerning score reporting, and several Standards (AERA, APA, NCME, 1999) address the adequacy of requirements for score reporting.
Ryan (chap. 29, this volume) discusses elements of score reports and the strategies behind different types of score reports.

For large-scale assessments, the score reporting task is complex (e.g., Goodman & Hambleton, 2004) and is often encumbered with nonpsychometric issues. The score reporting issues emphasized by the relevant Standards deal with fairness, timeliness, appropriateness of the score, avoidance of score misunderstanding and misuse, and, tangentially, issues of test retakes (for failing examinees) and test score challenges.

As with so many other prior steps of test development, absolute accuracy is of the highest importance for all reports of scores. Thus, careful and effective quality