
Census 2000 Topic Report No. 3
Census 2000 Testing, Experimentation, and Evaluation Program

TR-3

Issued January 2004

Census 2000 Data Capture

U.S. Department of Commerce
Economics and Statistics Administration

U.S. CENSUS BUREAU

Acknowledgments

The Census 2000 Evaluations Executive Steering Committee provided oversight for the Census 2000 Testing, Experimentation, and Evaluations (TXE) Program. Members included Cynthia Z. F. Clark, Associate Director for Methodology and Standards; Preston J. Waite, Associate Director for Decennial Census; Carol M. Van Horn, Chief of Staff; Teresa Angueira, Chief of the Decennial Management Division; Robert E. Fay III, Senior Mathematical Statistician; Howard R. Hogan, (former) Chief of the Decennial Statistical Studies Division; Ruth Ann Killion, Chief of the Planning, Research and Evaluation Division; Susan M. Miskura, (former) Chief of the Decennial Management Division; Rajendra P. Singh, Chief of the Decennial Statistical Studies Division; Elizabeth Ann Martin, Senior Survey Methodologist; Alan R. Tupek, Chief of the Demographic Statistical Methods Division; Deborah E. Bolton, Assistant Division Chief for Program Coordination of the Planning, Research and Evaluation Division; Jon R. Clark, Assistant Division Chief for Census Design of the Decennial Statistical Studies Division; David L. Hubble, (former) Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; Fay F. Nash, (former) Assistant Division Chief for Statistical Design/Special Census Programs of the Decennial Management Division; James B. Treat, Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; and Violeta Vazquez of the Decennial Management Division.

As an integral part of the Census 2000 TXE Program, the Evaluations Executive Steering Committee chartered a team to develop and administer the Census 2000 Quality Assurance Process for reports. Past and present members of this team include: Deborah E. Bolton, Assistant Division Chief for Program Coordination of the Planning, Research and Evaluation Division; Jon R. Clark, Assistant Division Chief for Census Design of the Decennial Statistical Studies Division; David L. Hubble, (former) Assistant Division Chief for Evaluations, and James B. Treat, Assistant Division Chief for Evaluations of the Planning, Research and Evaluation Division; Florence H. Abramson, Linda S. Brudvig, Jason D. Machowski, and Randall J. Neugebauer of the Planning, Research and Evaluation Division; Violeta Vazquez of the Decennial Management Division; and Frank A. Vitrano (formerly) of the Planning, Research and Evaluation Division.

The Census 2000 TXE Program was coordinated by the Planning, Research and Evaluation Division: Ruth Ann Killion, Division Chief; Deborah E. Bolton, Assistant Division Chief; and Randall J. Neugebauer and George Francis Train III, Staff Group Leaders. Keith A. Bennett, Linda S. Brudvig, Kathleen Hays Guevara, Christine Louise Hough, Jason D. Machowski, Monica Parrott Jones, Joyce A. Price, Tammie M. Shanks, Kevin A. Shaw, George A. Sledge, Mary Ann Sykes, and Cassandra H. Thomas provided coordination support. Florence H. Abramson provided editorial review.

This report was prepared under contract by Donald Kline of the Titan Systems Corporation. The project manager was Kevin A. Shaw of the Planning, Research and Evaluation Division. The following authors and project managers prepared Census 2000 experiments and evaluations that contributed to this report:

Decennial Statistical Studies Division: Joseph D. Conklin

Planning, Research and Evaluation Division: Kevin A. Shaw

Independent contractor: Donald Kline, Titan Systems Corporation

Greg Carroll and Everett L. Dove of the Administrative and Customer Services Division, Walter C. Odom, Chief, provided publications and printing management, graphics design and composition, and editorial review for print and electronic media. General direction and production management were provided by James R. Clark, Assistant Division Chief, and Susan L. Rappa, Chief, Publications Services Branch.



U.S. Department of Commerce
Donald L. Evans, Secretary

Samuel W. Bodman, Deputy Secretary

Economics and Statistics Administration
Kathleen B. Cooper, Under Secretary for Economic Affairs

U.S. CENSUS BUREAU
Charles Louis Kincannon, Director

Census 2000 Topic Report No. 3
Census 2000 Testing, Experimentation, and Evaluation Program

CENSUS 2000 DATA CAPTURE

TR-3

Issued January 2004


Suggested Citation

Donald Kline, Census 2000 Testing, Experimentation, and Evaluation Program Topic Report No. 3, TR-3, Census 2000 Data Capture, U.S. Census Bureau, Washington, DC 20233.


Economics and Statistics Administration
Kathleen B. Cooper, Under Secretary for Economic Affairs

U.S. CENSUS BUREAU

Charles Louis Kincannon, Director

Hermann Habermann, Deputy Director and Chief Operating Officer

Cynthia Z. F. Clark, Associate Director for Methodology and Standards

Preston J. Waite, Associate Director for Decennial Census

Teresa Angueira, Chief, Decennial Management Division

Ruth Ann Killion, Chief, Planning, Research and Evaluation Division

For sale by the Superintendent of Documents, U.S. Government Printing Office

Internet: bookstore.gpo.gov Phone: toll-free 866-512-1800; DC area 202-512-1800

Fax: 202-512-2250 Mail: Stop SSOP, Washington, DC 20402-0001


Contents

Foreword

1. Background

2. Scope and Limitations

3. Data Capture Topics Addressed in this Report
   3.1 Performance of the data capture system
   3.2 The system's ability to capture questionnaire data
   3.3 The impact of data capture requirements on the questionnaire design and other factors
   3.4 The appropriateness of requirements identified for the data capture system

4. Findings
   4.1 Assessing the performance of the data capture system
   4.2 Factors affecting the system's ability to capture questionnaire data
   4.3 Issues relating to the impact of data capture requirements on the questionnaire design and other factors
   4.4 Examining the appropriateness of requirements identified for the data capture system

5. Results of Analysis
   5.1 System performance
   5.2 Capturing respondent data
   5.3 The impact of data capture requirements
   5.4 Fluid requirements posed substantial risks
   5.5 Other salient observations about the Census 2000 data capture system

6. Recommendations
   6.1 Unify the design strategy needed for the data capture system
   6.2 Define requirements early
   6.3 Develop quality assurance standards early
   6.4 Focus on redesigning key data capture problem areas
   6.5 Limit the number of forms
   6.6 Assess future role of automated data capture technology
   6.7 Implement a unified Help Desk
   6.8 Train on the actual working system
   6.9 Expand the time for testing
   6.10 Better define Decennial Management Information Systems for future contracts
   6.11 Provide more information on the scope of documentation requirements
   6.12 Minimize the use of keying from paper
   6.13 Produce real-time cost data

7. Author's Recommendations
   7.1 Implement a more structured approach to defining requirements for the data capture system
   7.2 Research the potential for expanding the all-digital data capture environment

8. Conclusion

References


Foreword

The Census 2000 Testing, Experimentation, and Evaluation Program provides measures of effectiveness for the Census 2000 design, operations, systems, and processes and provides information on the value of new or different methodologies. By providing measures of how well Census 2000 was conducted, this program fully supports the Census Bureau's strategy to integrate the 2010 planning process with ongoing Master Address File/TIGER enhancements and the American Community Survey. The purpose of the report that follows is to integrate findings and provide context and background for interpretation of related Census 2000 evaluations, experiments, and other assessments to make recommendations for planning the 2010 Census. Census 2000 Testing, Experimentation, and Evaluation reports are available on the Census Bureau's Internet site at: http://www.census.gov/pred/www/.

1. Background

This report provides an overall synthesis of issues that were identified in several studies addressing the technical and operational elements of the complex and large-scale Census 2000 data capture system. The U.S. Census Bureau outsourced the two major components of the Census 2000 data capture program. Those components were the Data Capture System 2000 (DCS 2000), which was awarded to Lockheed Martin, and the Data Capture Services Contract (DCSC), awarded to TRW. Lockheed Martin provided equipment for imaging, recognition, and data keying as well as the processing systems for four Data Capture Centers (DCCs). TRW provided staff and services for data capture, facilities management, office equipment, supplies, and office automation for three of the DCCs. (A fourth DCC was managed by the National Processing Center (NPC), a permanent Census Bureau facility in Jeffersonville, Indiana.) Within the report, a distinction is made between the two components, as appropriate.

The underlying system technology was developed through a contract awarded to Lockheed Martin. The contractor characterized this program as one of the largest image processing projects in history. The data capture system processed and captured data from 152 million census forms with an extremely high accuracy rate, which exceeded established goals (see Section 4.1). In actuality, the total number of census forms exceeded this figure. Based on a cost/benefit analysis, low volume forms were deliberately excluded from DCS 2000 as a risk mitigation strategy. The automated system was, in fact, designed to process 80 percent of the forms volume while the remaining 20 percent of low volume forms were processed in a different manner.

Advanced technologies were employed to capture forms by creating a digital image of each page and then interpreting respondents' entries using Optical Mark Recognition (OMR) and Optical Character Recognition (OCR) processes.1 This was the first time that the Census Bureau had used high-speed OCR technology to capture handwritten entries by respondents. Although OMR had been used in 1990, the automation in 2000 was more sophisticated because it included key from image (KFI) and OCR technologies as well as OMR. The system was highly automated but still relied on extensive operational support from contractors. Despite the reliance on technology, manual data entry methods were still needed to capture data in cases where the data were not machine readable, where the form was damaged and could not be scanned, or where the forms were low volume.
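The core idea behind OMR (see footnote 1) reduces to measuring the darkness of a predesignated region and comparing it against a threshold. The sketch below illustrates only that idea; the page representation, box coordinates, and threshold are hypothetical, and the actual DCS 2000 recognizers were far more sophisticated (see Section 3.1).

```python
import numpy as np

def omr_read_box(page: np.ndarray, box: tuple, threshold: float = 0.15) -> bool:
    """Decide whether a predesignated check box is marked.

    page: binarized image, 1 = dark pixel, 0 = background.
    box:  (row, col, height, width) of the box interior (hypothetical layout data).
    Returns True when the fraction of dark pixels exceeds the threshold.
    """
    r, c, h, w = box
    region = page[r:r + h, c:c + w]
    return region.mean() > threshold  # mean of 0/1 pixels = dark-pixel fraction

# Usage: scan every check box defined for this (invented) form layout.
page = np.zeros((3300, 2550))          # a blank 300-dpi letter-size page
page[500:540, 400:440] = 1             # simulate a filled box
boxes = {"sex_male": (500, 400, 40, 40), "sex_female": (500, 500, 40, 40)}
print({name: omr_read_box(page, b) for name, b in boxes.items()})
# {'sex_male': True, 'sex_female': False}
```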

One aspect of the data capture system that has perhaps been overshadowed by the highly visible use of technology is the control processes used to manage the flow of forms through the data capture system and to monitor image quality. This workflow management system was a very effective mechanism that ensured all data were captured. According to Lockheed Martin, 1.2 million forms were rerun through the system. Although this was a small percentage of the overall number of forms that were processed, it nonetheless provided an indication of the stringent controls applied to monitor the process. There was a significant amount of census data associated with those 1.2 million forms. Different types and sizes of forms were processed through DCS 2000 and, in addition to capturing respondent answers, DCS 2000 electronically interpreted identification and control information on the forms. DCS 2000 had an automated monitoring feature that examined image quality by detecting over a dozen types of errors. A form recovery procedure was developed and implemented to handle questionnaires with those types of errors.
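The report does not itemize the dozen-plus error types the monitoring feature detected, but the control flow it describes (detect image-quality errors, route the form to recovery, rerun) can be sketched. All metric names and thresholds below are assumptions for illustration, not DCS 2000 values.

```python
from dataclasses import dataclass

@dataclass
class ImageMetrics:
    # Hypothetical per-image measurements; the actual checks covered
    # over a dozen error types that the report does not enumerate.
    skew_degrees: float
    mean_brightness: float   # 0 = black, 1 = white
    streak_count: int        # scanner artifacts detected

def quality_errors(m: ImageMetrics) -> list[str]:
    """Return the list of detected image-quality errors, empty if clean."""
    errors = []
    if abs(m.skew_degrees) > 1.5:
        errors.append("excessive skew")
    if not 0.6 <= m.mean_brightness <= 0.95:
        errors.append("over/under-exposed")
    if m.streak_count > 0:
        errors.append("scanner streaks")
    return errors

def route(m: ImageMetrics) -> str:
    """Clean images continue through capture; flawed ones go to form recovery."""
    errs = quality_errors(m)
    return "continue" if not errs else "form recovery: " + ", ".join(errs)

print(route(ImageMetrics(0.2, 0.85, 0)))   # continue
print(route(ImageMetrics(3.1, 0.85, 2)))   # form recovery: excessive skew, ...
```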

The data capture system employed a two-pass approach to capture data. The first pass commenced on March 6, 2000 and was completed on September 15, 2000. It captured the 100 percent census data (from both the long and short forms) needed for apportionment. The second pass captured the social and economic data (i.e., the sample data). This was a shorter phase that started on August 28, 2000 and was completed on November 15, 2000. The two-pass approach was used because the original keying rate estimates were too optimistic and the two-pass approach would ensure that data capture deadlines were met. The accuracy rate for OCR and OMR during both passes exceeded program goals. The manual keying accuracy rate also exceeded expectations.

1 OMR technology uses an optical scanner and computer software to scan a page, recognize the presence of marks in predesignated areas, and assign a value to the mark depending on its specific location and intensity on the page; OCR technology uses an optical scanner and computer software to "read" human handwriting and convert it into electronic form.

Lockheed Martin, the prime contractor for DCS 2000, cited the systemic nature of DCS 2000 when explaining how it achieved high accuracy rates (Lockheed Martin, 2001b):

Automated data capture and the quality of the information produced lies at the heart of the DCS 2000 system. Many times in the image processing industry, products or systems claim automated character recognition rates of 99% or higher. But these rates are frequently calculated on preprocessed character test decks that rarely give an indication of how a system will work in an operations environment. DCS 2000 can make the same accuracy claim, but at a question level and on live Census production data. Moreover, this rate is obtained with nearly 80% of the data captured automatically. This level of automated capture did not come from simply a careful selection of commercial products or even by fine tuning the individual OCR and OMR components. These production statistics are the result of in depth tuning and complex integration of every component of the system.

Indeed, there were 15 commercial-off-the-shelf (COTS) products integrated into the system. This approach was necessary given the limited time available to develop, test, and deploy the system. The COTS components provided the following functions: mail check-in and sorting; paper to digital image conversion; database management; workflow management; digital image processing; optical character and mark recognition; data review and correction; digital tape backup and recovery; and system administration. The integration and tuning of these components were major accomplishments given the complexity of the DCS 2000 architecture.

According to the Data Capture Program Master Plan (PMP) (Brinson and Fowler, 2001), of the approximately 152.3 million census forms entered into data capture, approximately 83.9 million were mailback forms, 59.7 million were enumerator forms, 600,000 were Be Counted forms, and 8.1 million were Group Quarters (GQ) forms.2

The Data Capture PMP reported that a cost model projected that the total number of forms to be processed would be 149.7 million. It further stated that approximately 1.5 billion form pages were processed during the data capture period. DCS 2000 output files were transmitted to the Decennial Systems and Contracts Management Office (DSCMO) on a daily basis. In order to manage this enormous workflow, DCS 2000 continually generated progress reports for management.

The overall management of the data capture system was a critical element contributing to the system's success. In addition to the NPC and the three DCCs, an Operations Control Center (OCC) was established in Lanham, Maryland to oversee all data capture operations. To assist the OCC with the management of the DCCs and their associated operations, the DCSC Management Information System (DMIS) was developed to provide a variety of integrated office automation tools. Raw data were transmitted to the DSCMO on a daily basis.

The data capture system succeeded in providing the population data needed for purposes of determining congressional apportionment, redistricting, and the distribution of over $100 billion of federal funds to state and local governments.

2 Actual numbers were reported after the completion of Census 2000. The final PMP issue date was March 30, 2001.


2. Scope and Limitations

The main focus of this report is to address the following four topics: performance of the data capture system; the system's ability to capture questionnaire data; the impact of data capture requirements on the questionnaire design and other factors; and the appropriateness of requirements identified for the data capture system. Other salient observations are included as well in view of their potential importance to future data capture systems and processes. The following documents were reviewed for this report:

1. Data Capture Program Master Plan (PMP) - Data Capture Systems and Operations

2. R.3.d. - Census 2000 Data Capture System Requirements Study by Titan Systems Corporation

3. K.1.b. - Evaluation of the Quality of the Data Capture System and the Impact of Questionnaire Capture and Processing on Data Quality

4. Lockheed Martin - Phase II Lessons Learned (including Appendix A, Technical Lessons Learned White Paper)

5. TRW - Lessons Learned from DCSC Final Report

6. Rochester Institute of Technology Research Corporation - DCS 2000 Data Quality

7. Census 2000 Questionnaire Design Study by Titan Systems Corporation

8. Assessment Report for Data Capture of Paper Questionnaires, prepared by Andrea F. Brinson and Charles F. Fowler, Decennial Management Division

9. Lessons Learned for Census 2000, the Forms Design and Printing Office

10. Memorandum from Howard Hogan, January 24, 2000. Subject: Proposal for Quality Assurance of Census 2000 Data Capture.

11. Memorandum from Daniel H. Weinberg, December 7, 2000. Subject: Actions to Correct Pass 2 Keying Errors in Census Sample Monetary Fields.

Only two of the reference sources (#3 and #6 above) are based on empirical research. All other sources provide qualitative data.

In addressing the topics identified above, this report summarizes the key findings and major recommendations of the documents reviewed and seeks to identify any common themes or conflicting information between them. Therefore, this report is a high-level, integrated assessment rather than a critique of every facet of each study reviewed. It is not the intent of this report to revisit the detailed statistical data contained in the documents that were reviewed.

Limitations stated in other reference sources also indirectly applied to this study. The two Titan studies and the K.1.b evaluation cited the limits identified below. Specific details on each limit are defined within the respective documents and are not fully described here due to space limitations.

Census 2000 Data Capture System Requirements Study

• The perception of those persons participating in the interview process can significantly influence the quality of information gathered

• In some cases, interviews were conducted several months, even years, after the participant had been involved in system development activities

• Each interview was completed within a one to two hour period, with some telephone followup to solicit clarification on interview results

• Every effort was made to identify key personnel and operational customers who actively participated in development efforts

Census 2000 Questionnaire Design Study

• The perception of those persons participating in the interview process can significantly influence the quality of information gathered

• Nearly two years have passed since participants were last involved in supporting Census 2000 activities


• Due to availability problems, Titan analysts were unable to interview the full range of personnel with knowledge about processing issues

K.1.b Evaluation

• Raw data are not a random representative sample of the U.S. population

• Failure to obtain all data originally planned

• Resolution of 666,711 records not matched to the twelve regional census center files

• Subjectivity in interpreting the most likely intent of the respondent

• Data reflect multiple sources of error beyond those attributable to system design

The collection of documents reviewed for this report identified important issues related to data capture topics. Additional evaluations of data capture operations were planned and might have identified more issues. However, these evaluations were either not available by the time this report was completed or were cancelled altogether. Initially, this report was intended to reflect the content of up to 11 documents, but due to the smaller number of references, the consolidated findings and recommendations are not as extensive as originally planned. Despite this limitation, the report still covers a broad range of data capture issues, reflecting both quantitative and qualitative assessments.

3. Data Capture Topics Addressed in this Report

An expansion of each topic is provided below to give an appreciation for the scope of issues that were examined across all of the documents reviewed.

3.1 Performance of the data capture system

In order to address performance issues, a clear definition of the system's objective must be articulated. The data capture system was composed of both automated and manual processes. Data capture equipment and related systems were acquired through a contract awarded in 1997 to Lockheed Martin Mission Systems. The automated system scanned a variety of forms and created digital images that were read by OMR and OCR software. OMR accomplished more than merely identifying marks in boxes. It was capable of recognizing cases when multiple marks appeared and used an Optical Answer Recognition feature that applied an algorithm and logic process to determine the most likely intended response. The OCR component was even more sophisticated. The Lockheed Martin study noted that OCR accuracy was a function of both its inherent ability to recognize characters and its contextual recognition capabilities. The excerpt below (Lockheed Martin, 2001b) explains how the OCR engine achieved its high accuracy rate:

First, not only does it recognize characters with a high degree of accuracy, it also provides multiple choices for each character and corresponding bounding character coordinates. This allows subsequent custom developed contextual processing to validate segmentation results as well as use an analysis of multiple recognition hypotheses in context and their probabilities of occurrence in order to further improve the results. Also, by providing a dictionary lookup capability as well as the description of the processing used to match or reject a word as a dictionary entry, the product allows even more opportunity for downstream analysis of the data during contextual analysis. Finally, because the product provides a vast array of definition parameters, it is also customized to treat each individual field with a high degree of detail and specificity, which will also maximize the accuracy and acceptance rates of the output.
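The contextual processing described in this excerpt can be approximated in miniature: keep several scored hypotheses per character, then accept the most probable combination that survives a dictionary lookup. The hypothesis lists, scores, and lexicon below are invented for illustration and bear no relation to the actual DCS 2000 engine internals.

```python
from itertools import product

# Hypothetical per-character OCR hypotheses: alternatives with confidences.
hypotheses = [
    [("S", 0.90), ("5", 0.08)],
    [("M", 0.55), ("H", 0.40)],
    [("I", 0.70), ("1", 0.25)],
    [("T", 0.95)],
    [("H", 0.60), ("N", 0.35)],
]
dictionary = {"SMITH", "SMYTH"}  # stand-in for a name lexicon

def best_word(hypotheses, dictionary):
    """Pick the highest-probability character combination that is a dictionary word."""
    best, best_p = None, 0.0
    for combo in product(*hypotheses):
        word = "".join(ch for ch, _ in combo)
        p = 1.0
        for _, conf in combo:
            p *= conf
        if word in dictionary and p > best_p:
            best, best_p = word, p
    return best, best_p

print(best_word(hypotheses, dictionary))  # ('SMITH', ~0.198)
```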

An Automated Image Quality Assessment (AIQA) application analyzed each imaged document. It corrected problems and enhanced images where possible. Once the forms were converted into an electronic format, the DCS 2000 software interpreted the data on the forms to the greatest extent possible. In those cases where OMR/OCR could not interpret the data within a certain range of confidence limits, the form image was automatically sent to KFI (key from image), an operation that required an operator to interpret the "low confidence" response data and then manually key the data into the system. Thus, as the Rochester Institute of Technology Research Corporation (RITRC) put it, "KFI got the bulk of the messy or ambiguous responses." The KFI process was described in the Data Capture PMP as follows:

The operators were presented with an image, called a "snippet," of the portion of the form they were to key. If a field required an action, the cursor was positioned on that field. Using their best judgement, the operators then keyed all the characters as they understood them from the image. For several situations, keying rules were provided to assist the operators in interpreting the information and entering standard responses.

The Data Capture PMP notes that fields read by OCR and designated as "low confidence" images, and therefore automatically sent to KFI, were often correct. KFI had its own quality assurance process involving comparisons with OCR and/or a second keyer. Forms that could not be imaged were run through KFP (key from paper) to capture all data manually. KFP involved two keyers, with the second providing verification of the data entered by the first operator.
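A minimal sketch of this routing logic, assuming a single per-field confidence score and an illustrative acceptance threshold (the report does not publish the actual confidence limits):

```python
def route_field(ocr_text, ocr_confidence, image_ok=True, accept_threshold=0.90):
    """Route one captured field, mirroring the report's description:
    high-confidence OCR/OMR results are accepted automatically, low-confidence
    fields go to key-from-image (KFI), and forms that could not be imaged at
    all fall back to key-from-paper (KFP). The threshold is illustrative only.
    """
    if not image_ok:
        return "KFP"        # two keyers; the second verifies the first
    if ocr_confidence >= accept_threshold:
        return "accept"     # captured automatically
    return "KFI"            # operator keys from the image "snippet"

print(route_field("BAKER", 0.97))              # accept
print(route_field("B?KER", 0.42))              # KFI
print(route_field(None, 0.0, image_ok=False))  # KFP
```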

RITRC's sampling of production data looked at the acceptance rate3 for both OCR and OMR for the mailback and enumerator forms (100 percent data). For the D-1 short form and the D-2 long form, the acceptance rate was 83.08 percent for OCR and 99.89 percent for OMR. For the D-1E and D-2E enumerator versions of these forms, the acceptance rates for OCR and OMR were slightly lower at 79.17 percent and 99.78 percent, respectively. Based on these findings, and other considerations, RITRC concluded the data quality from both sets of forms was about the same, with both exceeding the program goals.

The quality of OMR, OCR, KFI, and KFP was constantly monitored. The accuracy rates for OMR and OCR data capture were contractually specified as 99 percent and 98 percent, respectively. (OCR accuracy was actually sub-divided into two separate accuracy rates of 98 percent for alphabetic data and 98.5 percent for numeric data.) Keyer accuracy for KFI and KFP was also measured. The accuracy standard for KFI was 96.5 percent, and KFP was to have no more than a 2 percent error rate.

Given the complexity of the data capture environment, the volume of forms processed, and the use of state-of-the-art technologies, it is instructive to examine the performance of the overall system. The examination of DCS 2000 performance issues is not intended to be a fault finding exercise. Rather, it provides a view into issues that can lead to a better understanding of the effectiveness of the system used during Census 2000. This information can, in turn, benefit future data capture operations by providing insights into the benefits and limitations of the technology and manual systems employed.

3 RITRC defines acceptance rate as the fraction of fields in which the OCR has high confidence, usually expressed as a percent. Accepted fields are the ones RITRC scored for OCR accuracy; they are not sent to keyers except for QA purposes.
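These acceptance and accuracy figures compose in a straightforward way. As a back-of-the-envelope illustration (assuming accepted fields are read at the contractual OCR accuracy standard, rejected fields are keyed at the KFI standard, and ignoring QA rework and OMR fields), the blended accuracy for write-in fields can be estimated as follows. Note how the error rate is simply 100 minus the accuracy rate.

```python
def overall_accuracy(acceptance_rate, auto_accuracy, kfi_accuracy):
    """Blend automated and manual accuracy, weighted by the share of fields
    each path handles. All inputs and the result are percentages."""
    return (acceptance_rate * auto_accuracy
            + (100 - acceptance_rate) * kfi_accuracy) / 100

# Figures from this section: 83.08% OCR acceptance on D-1/D-2 forms,
# 98% contractual OCR accuracy, 96.5% KFI accuracy standard.
acc = overall_accuracy(83.08, 98.0, 96.5)
print(f"blended write-in accuracy: {acc:.2f}%")   # 97.75%
print(f"implied error rate: {100 - acc:.2f}%")    # 2.25%
```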

3.2 The system's ability to capture questionnaire data

A significant drop in the nationwide mail response rate during the 1990 Census led to dramatic changes in questionnaire design strategies for Census 2000. The major impetus for change in the questionnaire design came as a result of Congressional direction, which brought about efforts to make the mailback forms more respondent friendly. The assumption was that respondent friendly forms would lead to an increase in response rates. During the decade leading up to Census 2000, the Census Bureau conducted research into forms design issues in an effort to increase mail response rates and improve data quality. There were a number of methodological tests targeted at improving the Census Bureau's understanding of the array of cognitive, technical, and overall design factors that influence the efficiency and accuracy of data collection and capture processes. The testing included a range of studies that examined the layout of the questions, the testing of matrix formats against individualized formats, and evaluation of different envelope colors.

Reflecting the combination of new design initiatives and the availability of sophisticated scanning technologies, the short form underwent significant changes for Census 2000. The resulting form exhibited an entirely new and more respondent friendly individual space layout (separate panel for each person) and provided features such as a background color, motivational icons, a Census 2000 logo, check boxes, and segmented write-in spaces.

Lockheed Martin came to appreciate the criticality of forms design and its contribution to capturing respondent data (Lockheed Martin, 2001b):

Of all the aspects of an automated data capture system, the absolutely most critical component of the system is the design and printing of the forms. A good form design can increase the stability, flexibility, error detection and recovery, and performance of the system. A poor form design can adversely affect all of these factors and subsequently increase the system cost exponentially. The experiences of DCS 2000 helped to emphasize these points.

The Assessment Report on Data Capture of Paper Questionnaires (Brinson and Fowler, 2003) pointed out some particular forms design decisions and subsequent effects on printing that affected the system's ability to capture data. It observed that "Forms Design and Printing was not coordinated with the data capture technology...until later in the process making it more difficult to design and test the data capture technology." The report suggested that the automation technology available may not have been fully utilized. The report further states the following with regard to how certain forms design and printing issues impacted the OMR and OCR subsystems:

The multiple form types, booklet style formats, question design, and specific colors used made the implementation of OMR and OCR technology more challenging. Also, the lateness in finalizing questionnaires and printing of prototypes made the development of OMR and OCR software more complicated, of higher risk, and more costly.

The need for forms design and printing to be tightly integrated within the overall data capture system development environment is apparent and was echoed in several of the documents reviewed for this report.

3.3 The impact of data capture requirements on the questionnaire design and other factors

There were several image capture specifications that created constraints in the forms design environment. In the Census 2000 Data Capture System Requirements Study, Titan identified the following areas where DCS 2000 had clearly defined specifications (Titan, 2002):

• Size of Scannable Documents. A set of four specific paper size dimensions for single sheets was approved for DCS 2000. The booklet questionnaires also had defined limits for the size of separated sheets. According to DSCMO, DCS 2000 processed six different sized forms, but was not limited to this number.

• Optical Character Recognition (OCR) Write-in Fields. As noted by RITRC, the OCR subsystem was designed to read all of the write-in fields for which there was a high level of confidence. Consequently, there were a variety of very precise criteria defining dimensions and spacing requirements for these fields. The basic purpose of these criteria was to facilitate character recognition by the data capture system.

• Optical Mark Recognition (OMR) Check Boxes. In addition to size and spacing requirements for these boxes, there were also specifications indicating that boxes on one side of the page should not coincide with check boxes or image areas on the reverse side of the page. However, according to DSCMO, paper specifications required a high opacity level to minimize "show through."

• Color and Screen. A major data capture requirement was that the background color must "drop out" and should not contain any black content. A drop out effect can usually be achieved through a combination of the color and screen dot size.

• Margins. Required white space was defined for side, top, and bottom margins.

• Form Type/Page Identifier. There was a set of requirements for the use of the Interleaved 2 of 5 bar code, which served to identify form type and page numbers.

• Document Integrity. Since booklet forms could become separated during the scanning process, a unique identifier had to be included on all sheets of a long form to link the sheets. Another bar code was printed in the margin area to provide the sheet linkage function necessary to ensure document integrity. This identifier was also included on both sides of the short form to mitigate the risk of non-detected double feeds.

The constraints imposed on the forms design by the data capture requirements must be viewed in terms of their contribution to highly efficient data capture processes. In all, 26 different forms were scanned using OMR and OCR technologies, resulting in a substantial labor savings from DCS 2000.
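Taken together, these specifications amount to a checkable contract between forms design and data capture. A minimal sketch of such a check follows; only the Interleaved 2 of 5 symbology and the general nature of the constraints come from the report, while the field names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FormSpec:
    # Capture-driven layout constraints of the kind catalogued above.
    sheet_size_in: tuple           # (width, height); one of the approved sheet sizes
    background_drops_out: bool     # background color must "drop out" at scan time
    margin_in: float               # required white space on all sides
    page_identifier: str           # form type/page barcode symbology
    sheet_id_on_every_page: bool   # document-integrity linkage for booklet sheets

def spec_problems(spec: FormSpec) -> list:
    """Flag layout choices that would violate the capture requirements."""
    problems = []
    if not spec.background_drops_out:
        problems.append("background color will not drop out")
    if spec.margin_in <= 0:
        problems.append("no white-space margin defined")
    if spec.page_identifier != "Interleaved 2 of 5":
        problems.append("unexpected form type/page identifier symbology")
    if not spec.sheet_id_on_every_page:
        problems.append("separated booklet sheets could not be re-linked")
    return problems

d1 = FormSpec((8.5, 11.0), True, 0.5, "Interleaved 2 of 5", True)
print(spec_problems(d1) or "specification satisfied")
```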

3.4 The appropriateness of requirements identified for the data capture system

Like all systems, the Census 2000 data capture system was designed to satisfy a set of requirements. A system cannot provide the right functionality if the requirements defined for it are incomplete. Thus, the efficacy of the requirements definition process determines to a great extent how well the system will work.

Research into new technologies that could make the data capture process more efficient began in the early 1990s. The Rochester Institute of Technology (RIT) tested a variety of commercial off-the-shelf (COTS) products in 1993 and 1995. A Request for Proposals (RFP) was developed in 1996 to procure proven technologies and to outsource the development and operation of the data capture system.

In the pre-award phase, multiple vendors were asked to conduct an operational capabilities demonstration. According to the Census 2000 Data Capture System Requirements Study, this demonstration allowed the Census Bureau to identify the contractor most suited to the task of developing DCS 2000 and served to identify and fine-tune requirements for the data capture system. The award to Lockheed Martin was issued on March 21, 1997, and development activities ensued.

The original statement of work (SOW) was used for development up to the Census 2000 Dress Rehearsal. At that point it was determined that the SOW lacked sufficient detail and required more specifics. Consequently, a Functional Baseline (FBL) document was developed to focus on what functionality was needed. The FBL was given to Lockheed Martin, but the requirements for DCS 2000 continued to evolve throughout the development of the system. Similarly, TRW had to refine the Operations and Facilities Plan as development proceeded. The basic requirements for this plan were generated jointly by the Census Bureau and Lockheed Martin. The requirements were also included in the RFP for DCSC, but TRW was not awarded the contract until nearly a year after Lockheed Martin had commenced development of the system. By this time, refinements to the plan needed to be made by TRW to keep pace with DCS 2000 developments.

The question posed in relation to requirements can be put in the context of defining what the entire data capture system needed to do, not just the automated scanning technology. Therefore, this report looks at requirements issues from both the automated and operational aspects of the data capture system.


4. Findings

This section provides major findings and key issues that were echoed in the documents reviewed for this report. They are discussed within the context of the four main topic areas of this report: assessing performance, factors affecting the system's ability to capture data, issues relating to the impact of data capture requirements, and examining the appropriateness of those requirements.

4.1 Assessing the performance of the data capture system

4.1.1 DCS 2000 exceeded performance goals. As noted in the DCS 2000 Data Quality report prepared by RITRC, Census 2000 was the first time that the Census Bureau had used commercially available scanners and high-speed digital data processing to capture census data. Although not a completely definitive assessment of quality assurance (QA) for DCS 2000, the RITRC analysis concluded that "The results we obtained from Production Data Quality Sampling indicate that DCS 2000 production data quality significantly exceeded program goals." Putting this success into context, the data capture evaluation (Conklin, 2003) noted that "although the automated technology brought increased speed and efficiency to Census 2000 processing, considerable human resources were still required to handle the many millions of write-in fields that posed a problem for it."

Performance goals for accuracy were exceeded for both mailback forms and enumerator forms. RITRC stated that, based on a sampling of images and data, the in-production "OCR accuracy was 99.6 percent, KFI was in excess of 97.8 percent, and the check-box question accuracy was a little over 99.8 percent." RITRC also reported another significant finding: the overall write-in field accuracy for both passes of data capture (merged data combining OCR and KFI) was 99.3 percent. For Title 13 data (merged from OCR and KFI) on the mailback forms, numeric fields were found to have lower error rates than alpha fields. The data capture evaluation found that for all fields, OCR had the lowest error rate, followed by OMR and then KFI.

(Note: The K.1.b findings concerning error rates for the three different data capture modes suggest a discrepancy with respect to RITRC's findings. The OMR accuracy rate reported by RITRC indicates that this mode had the lowest error rate; however, the K.1.b evaluation placed OMR after OCR in terms of their respective error rates (i.e., OCR was lower than OMR). This apparent discrepancy is complicated by the fact that these studies used different performance classification methods. It is important to note that error rates differ from accuracy rates. The error rate is calculated by subtracting the accuracy rate from 100, typically yielding a very small number (e.g., .4 percent), whereas accuracy rates are the fraction of accepted fields that are correct, and are therefore associated with much higher numbers (e.g., 99.6 percent). Also, the studies were based on different data sets. Attempting to compare OCR to OMR to see which is the best is not a valid comparison since the two methods are designed to accomplish different types of recognition tasks. Given the exceptionally high performance of the OCR and OMR subsystems, both of which exceeded program goals, there is no utility in attempting to determine a "winner.")

The data capture evaluation examined the issue of whether or not some fields were sent to KFI more often than others. It concluded that name related fields were more likely to go to KFI unnecessarily. It also posed the question of whether certain fields sent automatically to KFI should be processed instead by the automated technology. The discussion of this issue suggests further research is needed:

We note some fields automatically went to KFI regardless of how well the technology thought it could process them. These were check-box fields where more than one box could be selected and still count as a valid response. Recognizing that KFI is subject to error from factors not affecting the technology, e.g., human fatigue and inattention, a possible future test for the automated technology is to allow it to process multiple response check-box fields. It would be helpful to find out if the technology can be adjusted to accept such fields without the errors of keying.


One important lesson learned cited by Lockheed Martin during the Census 2000 Dress Rehearsal was that the "full-field4 method of keying provided the most cost-effective combination of accuracy and throughput given the requirements of the DCS 2000 system." Other types of keying methods may have increased throughput, though accuracy would have suffered.

4.1.2 Reasons for errors. According to RITRC, some percentage of KFI "errors" in their production data quality sampling was due to ambiguous handwriting, misinterpretation of keying rules, or varying interpretations of handwritten entries (where respondent intent was not clear). Therefore, these were not considered actual keying errors. The data capture evaluation noted that the KFI error is "not necessarily a poor reflection on the automated technology" and observed that the automated technology and the evaluation and production KFI are prone to the following errors:

• failure to read a field on the form

• picking up content that is not really there

• incorrectly capturing the content on the paper

• correctly capturing what the respondent wrote, but not necessarily what the respondent intended

The error rate can be attributed to factors such as the hardware or software of the automated technology. Lockheed Martin identified the two main contributors to system errors: noise (including noise generated by respondents) that interferes with character recognition, and segmentation errors (multiple characters within one box or single characters spanning multiple boxes). This was common to other image processing systems as well as DCS 2000.

Another major cause of keying errors was the use of certain keying rules. For example, the rules called for a write-in field to be filled with "eights" if the number could not be determined, and "nines" were to be used if a value exceeded certain limits. RITRC expressed concern about keyer confusion over the use of "eights" and "nines", particularly with respect to the income field with its imbedded dollar sign, comma, and pre-printed cents. They also noted that "the essential KFI problem...was the conflict between "key what you see" and interpretations required by the keying rules." The Decennial Statistical Studies Division (DSSD) has also reported problems with the income field, citing an "error rate of nearly 45 percent for income fields filled with 9s" and "mistakes...in fields containing all 8s" (Weinberg, 2000).
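The two fill rules are easy to state precisely. A minimal sketch follows, assuming an illustrative six-digit monetary field; the actual field widths, limits, and parsing conventions were defined by the keying rules and are not reproduced here.

```python
def key_income(raw: str, field_width: int = 6, max_value: int = 999_999) -> str:
    """Apply the two fill rules the report describes for monetary fields:
    all 8s when the amount cannot be determined, all 9s when it exceeds
    the field's limits. The parsing below is illustrative only."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not digits:
        return "8" * field_width          # cannot be determined -> eights
    value = int(digits)
    if value > max_value:
        return "9" * field_width          # exceeds limits -> nines
    return str(value).rjust(field_width, "0")

print(key_income("$52,300"))     # 052300
print(key_income("a lot"))       # 888888  (cannot be determined)
print(key_income("$2,500,000"))  # 999999  (exceeds field limit)
```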

The data capture evaluation reported, across various modes of data capture, that the most frequent reasons for failing to capture the intended responses were:

• Extra check-box - the output from the automated technology shows more check-boxes marked than are in the scanned image.

• Missing characters - the output from the automated technology has fewer characters than the scanned image.

• Wrong character - the output from the automated technology and the scanned image have the same number of characters, but output from the technology disagrees with the image in one or more characters.

The same report listed the most common reasons for these problems as:

• Poor handwriting - the respondent's handwriting makes one letter look like another, but a person can tell what the respondent meant.

• No reason found - the response is written clearly, and there is nothing to suggest why it was not captured correctly.

The data capture evaluation reached some significant conclusions. First, it found that "if there is intelligent content in a field, the automated technology will detect it with nearly perfect certainty." Second, despite the fact that the system is not perfect, "a sizeable portion of responses will be captured and interpreted correctly at speeds that are orders of magnitude above KFI." And third, "the largest impediment to automation is not the quality of the hardware or software, but the quality of the responses supplied by human beings." The study suggests that attempting to build a system that could capture nearly any type of response would not be a practical endeavor.

4.1.3 OCR acceptance rate. According to the data capture evaluation, although the automated technology brought increased speed and efficiency to Census 2000 processing, considerable human resources were still required to handle the many millions of write-in fields that posed problems. The percent accepted by OCR for write-in fields (short and long forms) was 78.9 percent, which is quite close to the 81.2 percent reported by RITRC for Pass 1. However, RITRC reported a much lower rate for Pass 2 of 64.8 percent. The lower rate reflects the problems inherent in interpreting the more difficult long form responses.

4 Full-field keying involves re-entering the entire field, as opposed to character keying, where only the low confidence characters in a field are entered.

The OCR acceptance rate for forms completed by enumerators was lower when compared to the rate for mailback forms. RITRC believes that this may have been due to light pencil marks or incomplete erasures. Thus, this lower acceptance rate was probably more a function of the writing instrument (i.e., pencil) than a reflection on the effectiveness of the OCR subsystem.

4.1.4 Cluster concept. DCS 2000 was based on the concept of cluster processing. Clusters were autonomous units of image processing, constructed around the capacity of three scanners. Each DCC was equipped with as many clusters as necessary to process its workload. In Lockheed Martin's opinion, the cluster design was a key factor that contributed to the successful development of the overall system. In this concept, each cluster functioned independently, using a set of modules, with a sequential processing order that verified the previous actions taken by other modules. Lockheed Martin described the operation of the cluster concept as follows:

This efficiency [of verifying each step of the process], which frequently exceeds automated acceptance rates of 80% of the data at greater than 99% accuracy, also incorporates the self-validation themes of cluster processing. While the obvious validation steps are the keying operator functions, there is also significant use of cross character, cross field, and cross image contextual checks that are performed in order to validate and re-validate the data that is produced. This processing is also spread across the various steps of the cluster workflow. At the context step, OMR box, character, field, and image validations are performed to maximize the efficiency of the automated OCR and OMR. Then, another module assures that control data and form integrity have not been compromised. Next, key from image functions (KFI) are then used to complete the capture of data that could not be done automatically. While these keying steps also include real-time edits, functions for operators to reject entire images for quality, and automated quality assurance checks, there are also subsequent full form validation functions that assure that the captured data is consistent with expected responses. All these processes work together to continuously validate and improve the captured data as it progresses through the system.

In short, the cluster concept ensured images were subject to stringent quality checks. The prime DCSC contractor, TRW, noted that quality was designed into the process due to the short-term, high volume nature of the work. They concluded that "this provided a high-quality product with a minimum of QA staff and overhead." In this regard, the prime DCS 2000 contractor, Lockheed Martin, stated only one percent of the forms needed to be pulled and reprocessed. Another benefit of the cluster architecture was that it allowed for continued scanning of forms, even when there were component failures within a cluster.
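The sequential, self-validating module chain Lockheed Martin describes can be sketched as a simple pipeline in which each module asserts that the previous step did its job before adding its own. Module names and the form structure below are paraphrased assumptions, not actual DCS 2000 components.

```python
# Each module validates the previous step's output before contributing its own.
def contextual_checks(form):
    assert "image" in form, "scan step did not produce an image"
    # OMR/OCR with cross-character and cross-field validation (simulated output)
    form["fields"] = {"last_name": ("SMITH", 0.97), "age": ("42", 0.55)}
    return form

def integrity_check(form):
    assert "fields" in form, "recognition step did not run"
    assert form.get("sheet_id"), "control data compromised"
    return form

def kfi_step(form):
    # Complete capture of low-confidence fields via operator keying.
    for name, (value, conf) in form["fields"].items():
        if conf < 0.90:
            form["fields"][name] = (value, 1.0)  # operator-keyed
    return form

PIPELINE = [contextual_checks, integrity_check, kfi_step]

form = {"sheet_id": "000123", "image": b"..."}
for module in PIPELINE:
    form = module(form)
print(form["fields"])  # {'last_name': ('SMITH', 0.97), 'age': ('42', 1.0)}
```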

4.1.5 Inefficiencies in the KFP process. The DMD Assessment Report on Data Capture of Paper Questionnaires characterized KFP as an "inefficient way to capture forms that could not be captured by scanning." The report noted that the software used for KFP was designed for image keying and was therefore "cumbersome." It also took issue with the KFP policy of requiring 100 percent verification, stating that this may have "caused more keying than was required since a sample verification may have been sufficient." In general, the DMD report favored maximizing the use of automation and relying less on keying.

4.2 Factors affecting the system's ability to capture questionnaire data

Noting that many forms processing systems can define a form to run through the system in minutes, Lockheed Martin observed that the complete definition of DCS 2000 was dependent upon utilizing all of the optimizations designed into the form itself. This means the form ultimately reflected an in-depth analysis of the total environment, including Census rules, form characteristics, and respondent tendencies.

4.2.1 Keying rules and methods. While respondent tendencies were factored into the forms design process, there was still a need to apply keying rules and methods for capturing data. According to RITRC, one of the major causes of keying errors was the complexity of the rules, which required keyers to make interpretations while maintaining a high pace of production. According to TRW, the application of keying rules varied between Passes 1 and 2. They noted that the application of rules was limited in Pass 1 because of the limited number of fields; the rules were more critical in Pass 2 because of the broader range of field types. TRW reported that daily audits showed a high degree of accuracy during Pass 2 keying. Nonetheless, in view of the need for interpretations by keyers and the variations of rules between the passes, the more accurately the forms are filled out, the less need there is for keying. Basically, the system's ability to capture questionnaire data can be improved by better forms design practices.

4.2.2 Improvements to forms design. Given the success of DCS 2000 as a high speed, high volume, high quality data capture mechanism, some aspects of the form could still be improved. The data capture evaluation cited and endorsed possible improvements to the form that had been identified in Titan's Census 2000 Questionnaire Design Study.

The data capture evaluation recommended the following be considered as possible improvements:

• Have the person information forhousehold members be filledout from left to right across thepage instead of up and down.

• Allow the use of pencils sorespondents can easily correctmistakes.

• Change the sizes, fonts, appear-ance, etc. of the instructionicons so they are easier to spot(or simply eliminate them).

• Allow more spaces for the lastname fields.

• Include instructions for fillingout or correcting write-in fields.

• Include more detailed instruc-tions for the race and ethnicityquestions. While additionalinstructions may improve recog-nition, DSCMO and others (e.g.,Dillman) expressed concerns

that an overcrowded form withtoo many instructions may hin-der response and data capture.

• Try to make the instructions to the head of household for filling out the form more concise.

• Use headers to separate the Asian ethnicity options from those for Pacific Islander.

• Do not spread the choices for check-box fields over more than one row or column on a page.

• Select a background color with better visual contrast.

Enhancements in these areas have the potential to further improve the quality of data captured and perhaps make the form even friendlier to respondents.

(Note: The Census 2000 Questionnaire Design Study was a qualitative study that reflected the insights and experience of subject matter experts who had extensive knowledge of forms design. While the findings from this study suggested that certain aspects of the questionnaire could be improved, we caution that further research and testing is needed to determine which, if any, of the recommendations should be implemented. It is important to note that one of the main purposes of this study was to identify and highlight forms design issues that could be candidates for future research efforts.)

4.2.3 Few instructions were provided to respondents. Several of the above bullets touch on this subject. The system's ability to capture respondent data could have been impacted, to a certain extent, by a lack of instructions to respondents. As noted in the Census 2000 Questionnaire Design Study, the short form does not provide guidance to respondents on what to do if an answer exceeds the length of a write-in box or how to correct any mistakes on the form. The study suggested that some questions could benefit from expanded instructions, although there are risk/benefit trade-offs that would need to be assessed. It is conceivable that additional instructions for respondents could have reduced problems and enabled the system to capture more respondent data, rather than rejecting it as unreadable.

4.2.4 Use of banking should be minimized. Most interviewees who participated in the Census 2000 Questionnaire Design Study understood the need for banking but felt that it was not a desirable design feature and that its use should be minimized. This technique, especially triple banking, can lead to tightly grouped check boxes. Besides being confusing, it increased the likelihood of a processing problem when a large mark extended beyond the box boundary and into adjacent boxes. Greater spacing between boxes may minimize stray marks that create interpretation problems for OMR.

4.2.5 The race question. Due to the multiple answers allowed by this question, Lockheed Martin found it difficult to arbitrate with a high degree of accuracy. Owing to the high importance placed on accurately capturing race data, when multiple marks were detected by the OMR subsystem, they were passed to keyers for resolution. According to Lockheed Martin, the accuracy rates for this "particularly sensitive area of interest" increased significantly as a result of the manual interpretation.
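
The routing decision described above can be pictured in a few lines. The sketch below is a hypothetical illustration of that decision, not the actual DCS 2000 OMR subsystem; the field names, the confidence threshold, and the score representation are assumptions.

```python
def route_race_field(mark_scores, threshold=0.5):
    """Decide whether an OMR race field can be accepted automatically.

    mark_scores -- hypothetical per-checkbox darkness scores from the scanner,
                   e.g. {"White": 0.9, "Asian": 0.7, "Black": 0.1}
    Returns ("auto", boxes) when exactly one box is confidently marked,
    or ("keyer", boxes) when the field needs manual resolution.
    """
    marked = [box for box, score in mark_scores.items() if score >= threshold]
    if len(marked) == 1:
        return ("auto", marked)  # unambiguous single response
    # Multiple marks (or none): route to a keyer, mirroring the report's
    # statement that multiply-marked race fields were passed to keyers.
    return ("keyer", marked)

print(route_race_field({"White": 0.9, "Asian": 0.7, "Black": 0.1}))
```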


4.3 Issues relating to the impact of data capture requirements on the questionnaire design and other factors

The complete redesign of the questionnaire for Census 2000 produced a more respondent-friendly form based on an individual-space format. To a large extent, the new design of the form was made possible by technological advancements in OCR and OMR technologies.

4.3.1 Space separation between boxes and text. There had to be a space separation between boxes and text of at least .08 inches, and space between boxes (from the bottom of one box to the top of the next box) of at least .05 inches. The DCS 2000 image capture specifications noted that more space between boxes was preferable in order to prevent a large mark from running into another box. Although conforming to these specifications, many boxes on the form were tightly grouped. For example, the Hispanic origin and race questions, and question #2 for Persons 2 - 6, contained numerous boxes that were tightly grouped with minimal vertical space separation. This increased the likelihood of a data capture error when a large mark extended beyond the box boundary into an adjacent check box or segmented boxes. Future use of check boxes should allow greater spacing, as requested in the DCS 2000 image capture specifications.
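
The two minimums quoted above (.08 inches between boxes and text, .05 inches between boxes) are simple enough to machine-check during form layout. The following is a minimal sketch of such a check, assuming a hypothetical layout representation in which each box records its vertical position in inches; it is not part of any actual DCS 2000 tooling.

```python
MIN_BOX_TO_BOX = 0.05  # inches, bottom of one box to top of the next,
                       # per the DCS 2000 image capture specifications

def check_vertical_spacing(boxes):
    """Flag check-box pairs closer than the specified vertical minimum.

    boxes -- list of dicts with 'name', 'top', and 'bottom' coordinates in
             inches, ordered top to bottom (a hypothetical layout format).
    """
    violations = []
    for upper, lower in zip(boxes, boxes[1:]):
        gap = lower["top"] - upper["bottom"]
        if gap < MIN_BOX_TO_BOX:
            violations.append((upper["name"], lower["name"], round(gap, 3)))
    return violations

# Example: two tightly grouped boxes 0.03 inches apart would be flagged.
print(check_vertical_spacing([
    {"name": "White", "top": 1.00, "bottom": 1.15},
    {"name": "Black", "top": 1.18, "bottom": 1.33},
]))
```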

4.3.2 Darker frame around segmented boxes and darker segments. The DCS 2000 image capture specifications prohibited the use of dark outlines surrounding OCR fields. A darker outline could have provided more definition to the boxes and therefore potentially reduced one of the main sources of data capture problems: segmentation errors. The segmentation lines had low contrast and did not show up well, which may have accounted for some of those types of errors where characters spanned more than one box or more than one character was written into a single box. The use of a different background color with a higher contrast to white should alleviate the problem.

4.3.3 Background color was problematic. The choice of the background color (Pantone 129) by the graphic arts firm, Two Twelve Associates, met data processing specifications for "dropping out," or becoming invisible to the scanner. However, according to Titan's Census 2000 Questionnaire Design Study, this particular color did not provide the best contrast. Another study, conducted by RITRC after the color had been selected for the Census Bureau, stated that this particular color "was on the fringe of acceptable drop-out colors for the Kodak scanner used for DCS 2000."

As stated in the Census 2000 Questionnaire Design Study, DSCMO felt the choice of Pantone 129 compromised the ability of the scanners to use more aggressive settings to read lighter shades of answers. The choice of this color did not generally present any problems for dropping out during the forms processing function. This can be attributed to tight QA monitoring during the forms production process. Without an effective QA process, there could have been additional problems with the form background color failing to drop out, causing the scanning equipment to reject a questionnaire. While the color on the form was controlled within the specification parameters by using instruments, according to the interviewees and the reports provided by DSCMO, color generation was not always consistent during the printing process and there were noticeable variations.

Although technically meeting data capture requirements, the selection of Pantone 129 was, in retrospect, far from optimal. Personnel with expertise in data capture operations need to have input into the color selection process to assess the implications for data capture operations.
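
One quantitative way to frame the contrast complaint is a relative-luminance calculation. The sketch below uses the standard WCAG relative-luminance and contrast-ratio formulas; the sRGB value used for Pantone 129 is an approximate rendering supplied for illustration, not a value taken from the report.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 ints."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(rgb1, rgb2):
    """WCAG contrast ratio between two colors (ranges from 1:1 to 21:1)."""
    lighter, darker = sorted(
        (relative_luminance(rgb1), relative_luminance(rgb2)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

WHITE = (255, 255, 255)
PANTONE_129_APPROX = (243, 208, 62)  # approximate sRGB rendering; an assumption

# A light yellow like this sits close to white in luminance, so the ratio is
# small -- consistent with the report's poor-visual-contrast finding.
print(round(contrast_ratio(WHITE, PANTONE_129_APPROX), 2))
```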

4.4 Examining the appropriateness of requirements identified for the data capture system

Through extensive interviews with Census personnel, the Census 2000 System Requirements Study (Titan, 2002) found that many people felt DCS 2000 was the right system for the job and provided an efficient and effective means to capture census data. However, there were several requirements-related areas that could be improved.

4.4.1 Fluid requirements. Leading-edge technologies were being employed on a very large-scale operation, so it is understandable that requirements would be altered and evolve as the plan changed. (Ideally, requirements should be sufficiently flexible to accommodate new technology.) However, the fact that a pre-test version of DCS 2000 was used late in the life cycle, in the Census 2000 Dress Rehearsal, suggests that system planning and the requirements definition process need improvement and that schedules/timelines should have been adjusted to ensure the system was fully prepared for the Dress Rehearsal.

According to the Assessment Report on Data Capture of Paper Questionnaires, numerous requirements (six were identified) had to be added after the Dress Rehearsal based on the lessons learned. The report concluded that "the addition of requirements after the data capture system had been designed or was in the final testing phase provided a significant increase to the contract cost, and risk to the quality of the data, and to the data capture schedule."

Implementing DCS 2000 and the DCSC operations posed significant challenges in integrating new technologies and complex operations. A robust testing program seemed to compensate for any lack of requirements or understanding of exactly how the DCS 2000 components and operational procedures were to function in the production environment. Consequently, there was reliance on extensive testing to simulate the Census environment.

4.4.2 Keying rules. According to the Census 2000 Data Capture System Requirements Study, the keying rules changed after production began and remained an issue throughout the contract, creating risk to data quality and to timely completion of data capture. It is unknown to what extent changing the keying rules impacted the quality of the data capture, but changes of rules between the first and second passes were reported to have occurred. Requirements for keying rules were certainly fluid. For example, the DCSC contractor made staffing decisions based on the expected "key what you see" method, which was subsequently changed. Recognizing that this type of major change presents a significant data quality issue that can greatly increase risks to the program, the Census Bureau should place more emphasis on fully defining firm requirements for keying rules.

4.4.3 Quality assurance. The requirements for QA could have been better defined. Although the framework for the overall quality assurance plan was decided by the Census Management Integration Team (CMIT), QA specialists in the Census Bureau differed with the CMIT on the application of QA standards for DCS 2000. Complicating this situation was the fact that Lockheed Martin and TRW had their own internal QA programs, and the Census Bureau had implemented its own independent QA of the Census 2000 data capture process at the National Processing Center. Both the 100 percent and long form sample data were monitored by the Census Bureau at this facility. Since it was believed that Lockheed Martin would easily find gross errors, the Census Bureau's QA monitoring scheme concentrated on reviewing rarer events (e.g., multiple races or American Indian tribe designations) that were not captured.

Specific recommendations from the Census Bureau for improving the quality assurance aspect of DCS 2000 were provided very late in the development process and, if adopted, would have necessitated a major redesign of the DCS 2000 software and the post-processing operations at Headquarters. The parameters, processes, and responsibilities for QA measurement should be included as part of the requirements definition process.

4.4.4 Archiving requirements. Another problematic requirement was the need for archiving census data. The Census Bureau was originally advised that ASCII files, not images, would be required by the National Archives and Records Administration (NARA). These requirements were later changed to include microfilmed images and an index. According to the Assessment Report on Data Capture of Paper Questionnaires, the cost incurred by the Census Bureau to meet the new archiving requirements was approximately $44 million.

4.4.5 Operational perspective. With the DCSC contract award occurring nearly one year after the award to Lockheed Martin, TRW never had the opportunity to have a major influence on the requirements for DCS 2000. TRW noted that they had "little opportunity to influence system requirements with respect to the user interface and the need for management data on the production floor." Some of the system requirements suggested by TRW during the testing phase could not be implemented due to resource and schedule constraints.

In the Lessons Learned for Census 2000 document, the Forms Design and Printing Office observed that it "did not have a comprehensive set of data capture and processing technical requirements to include in the forms design and print contracts for the DR." The same report noted that after the Dress Rehearsal "late and changing requirements led to major revisions on all forms being electronically data captured" and there was no time to test the revised changes. In conclusion, the report stated that "all these changes put the Census Bureau at risk of failing to meet scheduled deliveries."


5. Results of Analysis

This section assesses the findings and key issues for each of the four areas reviewed and provides an overall, high-level view that reflects the information synthesized from the documents reviewed. It also presents other salient observations about the Census 2000 data capture system that were deemed relevant to this report and to planning for the 2010 Census.

5.1 System performance

Given the massive volume of forms processed with a very high degree of accuracy and acceptance rates, the data capture system was an unqualified success. As mentioned earlier in this document, the system exceeded all of its performance goals. While some errors could be attributed to limitations of the automated system (noise and segmentation errors), many were attributed to ambiguous or poor handwriting by respondents. In its classification of various Pass 2 write-in errors for overall Title 13 data, RITRC cited ambiguous responses, write-overs, cross-outs, and poor handwriting as accounting for a substantial number of errors. RITRC's research also found that "No reason" accounted for over half of the errors in the same data. Similarly, the K.1.b evaluation found a high number of unexplained errors in the data it reviewed. Interestingly, RITRC was comfortable with the large percentage of "No reason" errors. In their assessment, this was a "very good sign...because a well-designed system should have the bulk of its errors in the 'noise.'" The data capture evaluation provided an excellent perspective on system performance: "The largest impediment to automation is not the quality of the hardware or software, but the quality of the responses provided by human beings." The data capture evaluation suggested that attempting to build a system that could capture nearly any type of response would not be a practical endeavor because of the various permutations of human errors.

5.2 Capturing respondent data

Titan’s Census 2000 QuestionnaireDesign Study highlighted thesophistication of the forms designprocess and its awareness of theneed to efficiently capture respon-dent data. The study provided thefollowing description of the formsdesign environment:

The design of the Census 2000 short form questionnaire was a complex undertaking that reflected the combined efforts of many people at the Census Bureau. Every facet of the form was carefully analyzed for its potential effect on response rates and data quality. Because the forms were to be processed by sophisticated automation employing both optical character and mark recognition technologies, the designers faced the extra challenge of also having to meet image capture specifications that placed constraints on the form.

While the success of Census 2000 (as discussed in section 5.1) reflected well on the overall forms design effort, the data capture evaluation noted that "considerable human resources were...required to handle the many millions of write-in fields that posed a problem for it." This suggests the need for continued research into ways of improving forms design to minimize the need for manual keying operations.

5.3 The impact of data capture requirements

Although meeting data capture specifications, the background color was widely recognized as being problematic, both with respect to being on the fringe of acceptable drop-out colors and from a visual contrast standpoint. Selection of this particular color was not the result of a collaborative effort involving subject matter experts or any quantitative analysis. The lesson learned from this, and other related experiences, is that there should be close coordination between data capture personnel and those involved in questionnaire development throughout the forms design life cycle.

Technology may impose some limitations on designers, but that same technology can also enable more sophisticated design techniques. For both of these reasons, tight collaboration and communication between the two groups are essential when requirements are being developed or changed.


5.4 Fluid requirements posed substantial risks

Requirements proved to be subject to revision in several major areas. To a large extent, the deficiencies in requirements were compensated for by extensive system testing that helped to refine the Census 2000 data capture system. This was a very risky approach for such a major, high-profile system. The importance of having a well-defined and disciplined structure for developing requirements cannot be overemphasized, as requirements define what the system needs to do and what the performance metrics are, along with establishing operational processes and QA parameters. Additionally, the selection of qualified contractors depends on a thorough understanding of requirements.

It is worth noting that the effort to define requirements was handicapped in several respects. First, there was no well-established process in place to guide and facilitate the development of requirements. Second, Census Bureau personnel were not accustomed to preparing requirements within a contracting environment. This issue was highlighted in the Assessment Report for Data Capture of Paper Questionnaires, which stated that "this was the first time that the Decennial program areas had to do their work for data capture within contracting guidelines and contracting time constraints." And third, adequate funding was not made available early enough to allow system requirements work to commence in a proactive fashion. The combined effects of these three factors created a challenging environment for requirements development.

5.5 Other salient observations about the Census 2000 data capture system

5.5.1 Agency-contractor partnership was a key success factor. To a large extent, the success of the Census 2000 data capture program was due to a healthy partnership atmosphere that existed between the Census Bureau and the contractors. Considering that the system was not completed in time for the Dress Rehearsal, that requirements were fluid, and that there were differences over QA processes, it is evident that agency-contractor cooperation was a major factor in ensuring the success of the Census 2000 data capture system.

Within the agency-contractor partnership there was also a subtle relationship between the two prime contractors. Even though there were two separate contracts, one for DCS 2000 and another for DCSC, there was a mutual dependence between them because the award fees were tied to shared performance evaluations. This contractual arrangement fostered a cooperative relationship which ultimately benefitted the Census Bureau.

5.5.2 Change control processes were effective. Given the dynamic requirements and the constant incorporation of new technologies being applied to make DCS 2000 a robust platform, discipline in the change control process was a major plus that mitigated risks to the program. Both prime contractors adhered to strict change control processes for system components and operational procedures.

The Assessment Report on Data Capture of Paper Questionnaires generally agreed that change control was a contributing factor to the success of the data capture system. The report specifically noted that the DSCMO contract office had established a "highly effective change control process to track, evaluate, and control changes to the Data Capture program... throughout the development of the program." It added that the process received favorable review from oversight bodies because of its focus on cost control and schedule deadlines. While acknowledging the success of DSCMO, DMD noted that it was responsible for gathering requirements and, unlike DSCMO, did not have a dedicated staff to manage change. It would prefer to see a more centralized requirements change control process implemented.

5.5.3 Comprehensive testing. As noted in the Census 2000 Data Capture System Requirements Study, the thoroughness of the testing compensated for the lack of a solid set of requirements. In fact, requirements were very dynamic, changing throughout the entire development period for DCS 2000 and into production. DCS 2000 underwent a series of tests: Site Acceptance Test (SiteAT), Operational Test and Dry Run (OTDR), integrated Four Site pre-production test, and Beta Site testing.5 According to the Lockheed Martin Phase II Lessons Learned document, these tests were extremely successful. TRW found the OTDRs to be "key contributors to the success of Census 2000 data capture" as they closely resembled live operations, exercised every facet of operations, and included the OCC.


5 According to the Data Capture PMP, the DCS 2000 contractor operated the Bowie Test Site, which housed prototypes of the software and hardware environments used in the DCCs. System, integration, and security testing and troubleshooting for DCS 2000 software was performed at the Bowie Test Site by the contractor.


5.5.4 Operational discipline. Stringent control over operational procedures helped ensure consistency across each of the sites. TRW reported that each of the three contractor-run DCCs was organized in the same way and all had essentially the same functions, such as operation of the data capture process, human resources management, workforce training, QA activities, and facilities management.

5.5.5 Workflow management. As mentioned in the Background section of this report, one aspect of the data capture system that may have been overshadowed by technology is the control process used to manage the flow of forms through the data capture process and monitor image quality. This workflow management system was a very effective and structured mechanism that ensured that all data were captured. Basically, it was responsible for ensuring the complete and orderly flow of forms, identifying problems, and rerouting forms to handle a range of exceptions. Lockheed Martin noted: "The workflow is superficially straightforward, consisting of a series of sequential steps with most processes passing their results to the next, with little forking of workflow cases. However, underneath it is complicated because of the rerouting that is required."

Another inconspicuous aspect of the workflow management system was the underlying software that integrated the DCS 2000 COTS products. The unique Generic Application Interface allowed new or updated workflow COTS products to be easily integrated into the workflow system.

One critical step in the workflow process was checkout. After batches of forms were processed, they arrived at the checkout station, where a verification process ensured that each form received was processed. Any forms that needed to be reprocessed were sent to the Exception Checkout handler. This illustrates that rigid controls were built into the workflow process through the last step of the chain.
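
A minimal sketch of this positive-checkout idea appears below: every form checked in must be accounted for at checkout, and anything unprocessed is rerouted for exception handling. The batch structure, identifiers, and return shape are hypothetical illustrations, not the actual DCS 2000 workflow software.

```python
def checkout_batch(checked_in_ids, processed_ids):
    """Positive checkout: verify every checked-in form was processed.

    checked_in_ids -- IDs recorded when the batch entered the workflow
    processed_ids  -- IDs for which captured data was received
    Returns (cleared, exceptions): cleared forms leave the workflow;
    exceptions are rerouted for reprocessing, as at the checkout station.
    """
    checked_in = set(checked_in_ids)
    processed = set(processed_ids)
    exceptions = sorted(checked_in - processed)  # forms still unaccounted for
    cleared = sorted(checked_in & processed)
    return cleared, exceptions

cleared, exceptions = checkout_batch(["F001", "F002", "F003"], ["F001", "F003"])
# F002 would go to the exception handler rather than being silently dropped.
print(cleared, exceptions)
```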

Most importantly, the workflow process complied with Title 13 protection requirements. The Assessment Report for Data Capture of Paper Questionnaires discussed the importance of questionnaire accountability:

The successful protection and security of the questionnaires was of primary concern during the data capture period and subsequent forms destruction. The accountability for the data capture of paper questionnaires once they were received at the Data Capture Centers used a check-in system, batch processing through scanning or Key From Paper, and a positive checkout system which verified that all forms were processed and that their data was received by Headquarters Data Processing.

Understanding the criticality of an orderly workflow management process and designing efficiency into that process were key success factors for the data capture system.

5.5.6 Modular system design. DCS 2000 had a flexible system architecture that could adapt to changing requirements. For example, a significant process change occurred very late in the program when a decision was made to use a two-pass data capture process. According to Lockheed Martin, because of the system's modular design, workflow, image replay capability, inherent robustness, and high level of configurability, conversion to the two-pass method was a relatively simple matter. A major lesson learned cited by Lockheed Martin addressed the overall system design and the adaptability of the system:

The fact that these changes [switch over to the two-pass method] could be implemented so close to the start of production reemphasizes the positive lessons learned that accompany the benefits of incorporating each of the DCS 2000 design themes at the earliest stages of development.
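
To make the configurability point concrete, the sketch below models a capture workflow as an ordered list of stages, so that moving from a one-pass to a two-pass process is a configuration change rather than a redesign. This is purely illustrative; the stage names and structure are assumptions, not the DCS 2000 design.

```python
# Hypothetical illustration: a workflow as an ordered list of stage functions.
# Swapping configurations changes the process without touching stage code.

def scan(form):
    form["scanned"] = True
    return form

def recognize(form):
    form["recognized"] = "pass1"
    return form

def key_pass2(form):
    form["keyed"] = "pass2"
    return form

def checkout(form):
    form["done"] = True
    return form

ONE_PASS = [scan, recognize, checkout]
TWO_PASS = [scan, recognize, key_pass2, checkout]  # late change: one added stage

def run(workflow, form):
    for stage in workflow:
        form = stage(form)
    return form

print(run(TWO_PASS, {"id": "F001"}))
```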

5.5.7 Forms printing. The Census Bureau recognized that monitoring print quality on-site was integral to success. Lockheed Martin concurred with this direction. They cautioned that "problems can quickly affect enormous quantities of forms" and that the quality of Census 2000 therefore depends on maintaining printing standards and consistency. For this reason, the Census Bureau had an extensive and automated print QA process.


6. Recommendations

This section provides recommendations stemming from the various lessons learned that were cited in the documentation reviewed.

6.1 Unify the design strategy needed for the data capture system

RITRC recommended that future data capture systems be developed within the context of a unified framework. Their view of the system would include not only the data capture components but also the printing, forms design, recognition, edits, and coding components. Lockheed Martin provided a comment that touched on this recommendation:

A technical interface between form design, printing, and data capture was also extremely beneficial to the program's success and should be established very early in the program lifecycle. This worked well on DCS 2000, but could have been established even earlier. All three of these aspects of forms processing must work in concert with each other in order to maximize the productivity of the process as a whole.

In keeping with the theme of a unified data capture system, TRW recommended "starting the system and services contracts at the same time so strong working relationships can be developed from the beginning." They further added that "Having both contractors work requirements together will result in a better system and better operational procedures" if operational perspectives are reflected in the requirements. TRW specifically noted that development of DCS 2000 was initiated without input from the users (i.e., the operations staff). They added that this was a source of frustration at the DCCs and required development of a management information system that was far more extensive than they had originally planned to implement.

The Assessment Report on Data Capture of Paper Questionnaires was in general agreement with the need for a unified system development environment. It recommended "integrated development" involving internal stakeholders early in the planning phase and the need for "better integration of forms design and printing specifications with data capture system development." It found that forms design was largely independent of data capture and processing system designs, a factor that led to questionnaire designers being overconfident in the capabilities of OMR and OCR technologies.

6.2 Define requirements early

As noted in Titan’s Census 2000Data Capture System RequirementsStudy, requirements establish thevery foundation for a system.Their importance cannot be over-stated, especially in an environ-ment where a substantial R&Dinvestment is necessary. Delays indefining requirements, or not fullydefining them, increases the likeli-hood that the system will not meet

data capture expectations or per-form at the level required. Or, inthe case of quality assurancerequirements, waiting until late inthe development cycle may notallow for sufficient time for imple-mentation of the mechanismsneeded to generate appropriatemetrics or ensure adequate quality.Starting the planning and develop-ment earlier would provide agreater chance that all identifiedrequirements will be implementedand that sufficient time will existfor testing and system refinement.

6.3 Develop quality assurance standards early

Given that requirements for QA were not well established, the recommendation is that more emphasis be placed on defining QA measurements, processes, and reports for the 2010 Census. TRW recommended that both the DCS 2000 and DCSC participants develop an integrated QA program. Additionally, as stated in the RITRC report, the Census Bureau should investigate new QA technologies to bring QA evaluation time closer to production time. In this regard, the Assessment Report on Data Capture of Paper Questionnaires pointed out the need for the data capture system to provide real-time access to data for quality assurance review.

6.4 Focus on redesigning key data capture problem areas

As discussed in this report, segmentation errors account for many of the processing problems. While problematic, this facet of forms design provides a potentially high-payback area for future research. DCS 2000 image capture specifications prohibited the use of dark outlines surrounding OCR fields, but the use of a different background color with a higher contrast to white should alleviate the definition problem.

Another example of a significant problem area on the forms is the Hispanic-origin and race questions. Many of the participants in the Census 2000 Questionnaire Design Study agreed that this was an especially troublesome area. Although there were improvements in the presentation and sequencing of these questions in 2000, there is still room for further improvements that could make them easier to understand and less prone to interpretation problems during the data capture process. More empirical research is needed on this particular design topic.

6.5 Limit the number of forms

RITRC observed that the number of forms grew to 75 for Census 2000 (only 26 were data captured by DCS 2000). They expressed concern that, with a separate form definition template being created and tested for each form, this proliferation of forms became "unwieldy." RITRC suggested that the "80-20 rule"6 might apply in regard to the number of forms used for the 2010 Census. That is, the use of fewer, more generic forms might prove to be cost effective from a data capture perspective.

6 Also referred to as Pareto's Law, this is the principle that 20 percent of something accounts for 80 percent of the results.

There are two major issues to consider. First, the level of effort required to develop the templates has major time and cost implications for the Census Bureau. Second, since identical information is being asked across multiple forms, there is a risk that wording or phraseology may not be consistent across all forms. The recommendation is for the Census Bureau to consider combining forms, when possible, so a single form may serve more than one purpose. There is, however, a risk inherent in over-reliance on too many generic forms.

TRW noted that it encountered disruptions to operations due to the need to adapt to different types of forms. They also recommended that the Census Bureau minimize the number of form types and design them as early as possible so the processing implications can be understood before the DCSC develops its procedures.

6.6 Assess future role of automated data capture technology

The data capture evaluation cited two possibilities with regard to the future use of automated data capture and imaging technology in the decennial census. If it is seen as having a supporting role, it would be used primarily for rapidly capturing the clear and easy responses. In this scenario, traditional methods, although resource intensive, would still be used to capture especially difficult responses. On the other hand, automated technology could have a dominant role, assuming that census forms were dramatically streamlined and the long form with its numerous write-in responses was no longer captured in the decennial census. If the long form is retained, the data capture evaluation asked whether it is then worthwhile to put a high priority on improving the quality performance of the automated technology. The data capture evaluation suggested that this issue could be a research question for testing leading up to the 2006 Census Test.

With poor handwriting accounting for many errors, the data capture evaluation suggested giving consideration to reducing some write-in fields to check boxes, reducing the set of questions, or using more enumerators to collect long form data.

6.7 Implement a unified Help Desk

In TRW’s opinion, the existence oftwo different types of help desks(i.e., one for DCS 2000 and one forDCSC) “introduced unnecessarycomplexity, overlap and confu-sion.” They recommend using onlyone integrated help desk.

6.8 Train on the actual working system

Given the size of the workforce for data capture operations (over 10,000 temporary employees during the data capture operations phase), TRW was concerned whether all employees had the necessary skills. TRW recognized the need for consistency in training across all sites. To provide quality training, TRW cited a key lesson learned: "From the start of the project, instructional design teams need as much access to the actual working system and equipment as possible to ensure accurate training documentation. This includes participation in dress rehearsals and OTDRs."

6.9 Expand the time for testing

As previously mentioned, the various tests were extremely beneficial, and exercising the system provided invaluable insights. However, TRW was concerned that the OTDRs started too late in the Census schedule (August 1999) and would like to have seen more time allotted between tests. Since managers from all four DCCs participated in each sequentially scheduled OTDR, their time was divided between participation in an OTDR at another site, preparing their own OTDR, and preparing their site for data capture operations. TRW noted that "preparing for an OTDR is a full-time project that could have been performed better if the participants were not busy at previous tests at the same time that they needed to be preparing for their own test."

6.10 Better define Decennial Management Information Systems for future contracts

TRW expressed concerns about the level of effort required to implement the DCSC Management Information System (DMIS) and the fact that it required more resources than anticipated. Their assessment was stated as follows:

The scope of DMIS was severely underestimated. Instead of a few COTS tools sitting on desktops, it grew into a medium-sized networked information system. This was a difficult task to complete in less than two years. We should have scaled back the scope of DMIS given the time constraint. We were understaffed early in the process. It was difficult to attract developers to the project due to the lack of programming work and the short-term nature of this project.

These concerns should be noted for future planning purposes, and the scope of requirements for systems like DMIS must be fully understood. A management information system typically interfaces with numerous other systems, and this increases the complexity of development and testing. The magnitude of effort and the time required to develop robust information systems essential to the management of large-scale programs need to be factored into planning.

6.11 Provide more information on the scope of documentation requirements

As with DMIS, TRW had similar concerns about the amount of effort it took to produce management documentation. They provided the following perspective on this issue:

In retrospect, when the entire documentation effort is considered, it needs to be recognized that the importance of the documentation was not emphasized in the original RFP and therefore assumed to be less significant by TRW in our proposal. We are sure in hindsight that no one meant to minimize the effort, but it should be recognized that comprehensive technical and administrative procedures were essential to running consistent operations at the data capture centers. The operational procedures and assembly-line-like process necessary to handle the forms required all parties to know and understand what procedures to follow and understand what changes were made to the procedures during operations to ensure optimum productivity. It was also critical that the DCC administrative and facilities staffs have guidelines and procedures to follow to reduce employee-relations issues. And, certain documents were added to the effort and were critical to the program, but were not in Section F [of the contract]. These documents included DCSC Management Information System (DMIS) documentation and operations documentation such as the OCC CONOPS [Concept of Operations] and other OCC related documents.

TRW recommended that documentation requirements be given more emphasis in future RFPs.

6.12 Minimize the use of keying from paper

The Assessment Report on Data Capture of Paper Questionnaires cites inefficiencies associated with KFP due to its cumbersome software and 100 percent validation requirement. With much of the need for KFP being generated by damaged forms, light pencil marks, and incomplete erasures, it may be possible to devise methods to significantly reduce the number of forms submitted for KFP.

6.13 Produce real-time cost data

The Assessment Report on Data Capture of Paper Questionnaires addressed the need for current cost information. The combined costs for DCS 2000 and DCSC were approximately $520 million, and contract modifications accounted for significant increases over the original contract baselines. In spite of the magnitude of the data capture system, real-time labor expenses were not available, nor were cost data by operation for each DCC. To facilitate management of the 2010 Census, the report recommends using more detailed and current cost data.


7. Author's Recommendations

7.1 Implement a more structured approach to defining requirements for the data capture system

More attention needs to be paid to defining a solid set of requirements, especially because the data capture system in 2010 may well incorporate additional functionality and have a more complex architecture. For instance, the Lockheed Martin Phase II Lessons Learned document speculated that:

Additions such as data entry from wireless and internet-based applications or complex address recognition can have significant positive effects on various aspects of the system and its operation. It may also be beneficial to look outside the current bounds of the system for opportunities to improve the overall Census capture operation. Such areas of interest include the integration of post-processing coding systems or uses for two-dimensional bar codes. So, in order to take full advantage of all of the experiences of the 2000 Census, enhancements of these types should be investigated and possibly applied to future systems with similar requirements.

This will require a proactive, disciplined, and systematic approach to requirements definition. A major management commitment will be necessary to make this a reality. Without it, the Census Bureau will face a high degree of risk in implementing data capture systems and operations for the 2010 Census.

7.2 Research the potential for expanding the all-digital data capture environment

In citing lessons learned, one of the recommendations put forward by RITRC suggested that "Due to the unique nature and scope of the census, it requires an early investment in research and development." Titan endorses RITRC's advice, but suggests that research and development efforts focus beyond paper-based systems. To be sure, the capabilities and sophistication of such systems have steadily improved, as evidenced by the DCS 2000 performance and quality issues discussed in this report. However, these systems still incur substantial labor costs for printing and processing forms and for nonresponse followup. They also require major capital investment in scanning technologies in order to process millions of forms. In short, paper-based systems are still cumbersome because of the conversion from paper into images and the subsequent storage in electronic format. This is a very complex process.

Although the data capture system was unquestionably successful during Census 2000, we have transitioned into an era when information is predominately collected, stored, transformed, and transmitted in digital format. In theory, future census activities could be performed in an all-digital environment. In fact, the Census Bureau is already moving toward expanding the use of digital media, having made a commitment to use mobile computing devices in data collection operations involving enumerators. And the feasibility of a Web-based census data collection system was proven during Census 2000 and will provide yet another means of collecting data digitally in the next decennial census. This is especially true if the Census Bureau proactively markets an online option for the 2010 Census. Numerous electronic data files maintained by state and federal agencies may provide more sources of personal data that can be conveniently extracted for nonrespondents.

An ambitious goal for the Census Bureau would be to eliminate the need for processing massive quantities of paper forms by planning for an all-digital data capture environment. Advanced technologies on the horizon are introducing new forms of digital media, some of which are in the early stages of development. These efforts may identify enabling technologies that could further streamline census data capture operations. This may not be achievable in time for the 2010 Census, but research can help pave the way for the transition into the digital world.


8. Conclusion

The Census 2000 data capture system was a very complex, well-managed system that employed advanced technologies to automate data capture processes. It captured massive amounts of data within a relatively short period of time with high accuracy rates. However effective the automation, it still relied on some manual processing to handle millions of write-in fields that posed a problem for the automated character and mark recognition subsystems. The data capture evaluation raises an interesting issue: is it worth attempting to make further refinements to the automated capture processes in order to capture the problematic responses? This is something that the Census Bureau may want to study in order to determine whether the extra cost and effort could be justified. Alternatively, improved forms design may allow more data to be captured with fewer errors. Many of the problematic types of responses were identified in the documentation reviewed for this report and are well understood by Census Bureau staff. Perhaps some of the design improvements listed in this report could help to prevent these types of errors from occurring in the future.

A goal of this report was to look for common themes in the documentation as well as any conflicting perceptions about the data capture system. No significant differences of opinion were detected during the review. To the contrary, there were many similar, complementary views and findings that reinforced our confidence in stating the issues presented in this report.


References

Brinson, A. and Fowler, C. (2003) "Assessment Report for Data Capture of Paper Questionnaires," February 19, 2003.

Conklin, J. (2003) "Evaluation of the Quality of the Data Capture System and the Impact of the Data Capture Mode on the Data Quality (K.1.b)," Final Report, March 12, 2003.

Brinson, A. and Fowler, C. (2001) "Program Master Plan, Data Capture Systems and Operations," March 30, 2001.

Forms Design and Printing Office. (2002) "Forms Design and Printing Lessons Learned," May 30, 2002.

Lockheed Martin. (2001a) "Phase II Lessons Learned," February 21, 2001. (Includes Appendix A, listed below as a separate white paper.)

Lockheed Martin. (2001b) "Appendix A, Technical Lessons Learned," White Paper, Release 1 Version 1, February 23, 2001.

Rochester Institute of Technology. (2002) "DCS 2000 Data Quality, v.2.1 Final," September 20, 2002.

Titan Systems Corporation/System Resource Division. (2002) "Census 2000 Data Capture System Requirements Study (R.3.d)," Final Report, August 23, 2002.

Titan Systems Corporation/Civil Government Services Group. (2003) "Census 2000 Questionnaire Design Study," Final Report, March 19, 2003.

TRW. (2001) "Data Capture Services Contract Final Report," February 9, 2001.

Weinberg, D. (2000) Memorandum from Daniel H. Weinberg, Subject: Actions to Correct Pass 2 Keying Errors in Census Sample Monetary Fields, December 7, 2000.
