Using OCR for Census Data Capture in China National Bureau of Statistics of China

Upload: martin-mathews

Post on 29-Dec-2015


Page 1: Using OCR for Census Data Capture in China National Bureau of Statistics of China

Using OCR for Census Data Capture in China

National Bureau of Statistics of China

Page 2

Background

Five population censuses have been conducted, in 1953, 1964, 1982, 1990 and 2000.

1953 and 1964 censuses: manual tabulation

Since the 1982 census: computers used for data processing

    1982 and 1990 censuses: manual data entry

    2000 census: OCR used for data capture

2006, the second Agricultural Census: OCR also used for data capture

The 2000 Population Census and the 2006 Agricultural Census are the two cases of OCR use for large-volume data capture.

Page 3

Two cases of OCR for large-volume Census Data Capture

The data capture of the 2000 Population Census

Census reference time: Nov. 1, 2000

Data capture cycle: Jan. to June 2001 (6 months)

Scale:

    Types of census form: 4

    Short form: 49 census items, 90% of households (HH), 360 million A4 double sheets in total

    Long form: 95 census items, 10% of households, 40 million A3 double sheets in total

    Other forms (death population, temporary residents): 10 million A4 double sheets in total

    Original census data: about 64 GB

    Image volume: 5.5 TB

Page 4

Two cases of OCR for large-volume Census Data Capture

The Second National Agricultural Census

Census reference time: end of 2006

Data capture cycle: April to mid-July 2007 (100 days)

Scale:

    Types of census form: 8

    Total census items: 541

    Total agricultural families: 250 million

    Total census forms: about 500 million pieces of paper

    Original census data: about 300 GB

    Image data: 40 TB

Page 5

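Organizational Structure for data process

Data capture was decentralized at prefecture offices.

[Diagram: flow of census forms through the administrative levels EA, Village, Town, County, Prefecture, Province and NBS. Stage labels in the diagram: "Checked & packed" (three times), "Coding", "Data capture", "Editing", "Data process". Counts in the diagram: 5 million, 0.9 million (Village), 40000, 2847, 340, 31, 1, 13. Apart from Village, the pairing of counts with levels is not recoverable from the transcript.]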

Page 6

Function framework of OCR data capture (2006 agriculture census)

[Diagram: system functions grouped under the modules task management, system management, scanning, OCR, editing and checkup, data management and image management. The grouping of sub-functions under modules is not recoverable from the transcript. Sub-functions named in the diagram: user management, system initialization, log management, space management, sys management, address base, client management, archiving management, process management, progress monitoring, receiving file, statistics summary, information display, QA, batch form scan, add scan, alternative scan, repeat scan, check scan, forms check, scanner self-inspection, numeric data, Chinese character, English character, special character, numeric data edit, Chinese character edit, English character edit, special character edit, checkup, generation of image management ID, generation of census form management ID, image merger, image reported, enquiry, browse, input, output, import, export, backup, restore, delete.]

Page 7

The Process of OCR data capture

The scanning module generates image files and transmits them to the image management module, and transmits status information to the task management module.

The task management module distributes tasks according to which OCR clients are idle.
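The task distribution described on this slide can be pictured with a minimal Python sketch. The slides do not describe the real scheduler, so everything here is an assumption made for illustration: the names OcrClient and distribute, the single busy flag, and the one-batch-per-idle-client rule are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OcrClient:
    """A recognition workstation (hypothetical model); only idle/busy is tracked."""
    name: str
    busy: bool = False
    assigned: list = field(default_factory=list)


def distribute(batches, clients):
    """Hand queued image batches to idle clients, one batch per idle client.

    Returns the batches still waiting because no idle client was left.
    """
    queue = deque(batches)
    for client in clients:
        if not queue:
            break
        if not client.busy:
            client.assigned.append(queue.popleft())
            client.busy = True
    return list(queue)


clients = [OcrClient("ocr-01"), OcrClient("ocr-02", busy=True), OcrClient("ocr-03")]
waiting = distribute(["batch-A", "batch-B", "batch-C"], clients)
print(waiting)  # batches that must wait for a client to become idle
```

In this sketch a batch that finds no idle client simply stays in the queue; a real system would call the distribution step again whenever a client reports that it has finished.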

Page 8

The Process of OCR data capture

The OCR module recognizes numerical data and Chinese characters, transmits the results to the data management module, and transmits status information to the task management module.

Page 9

The Process of OCR data capture

The task management module dispatches the data to the edit module for editing. If the original image is needed, the image management module fetches the corresponding image for comparison. The cleansed data are returned to the data management module after editing. When all data capture work is finished, the data are reported upward.
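The flow across the three process slides (scan, recognize, edit, then report upward once everything is finished) can be sketched as a small status tracker. This is only an illustration under stated assumptions: the Stage names, the advance helper and the ready_to_report rule are hypothetical and are not taken from the actual system.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical per-batch status as tracked by task management."""
    SCANNED = "scanned"
    RECOGNIZED = "recognized"
    EDITED = "edited"


ORDER = [Stage.SCANNED, Stage.RECOGNIZED, Stage.EDITED]


def advance(stage):
    """Move a batch to the next stage; an edited batch has nowhere further to go."""
    position = ORDER.index(stage)
    if position == len(ORDER) - 1:
        raise ValueError("batch is already edited")
    return ORDER[position + 1]


def ready_to_report(status_by_batch):
    """Data are reported upward only when every batch has been edited."""
    return bool(status_by_batch) and all(
        stage is Stage.EDITED for stage in status_by_batch.values()
    )


status = {"batch-A": Stage.EDITED, "batch-B": Stage.SCANNED}
before = ready_to_report(status)
status["batch-B"] = advance(advance(status["batch-B"]))
after = ready_to_report(status)
print(before, after)
```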

Page 10

Quality Control

To ensure the quality of captured data, quality control is executed in three stages: scanning, recognizing and data editing.

During scanning, the system recognizes the batch cover data and reads the scanner count, then checks that the total page count and total household count of each batch are consistent with the scanning results. It also compares the actual address code with the address code repository to ensure that address codes are valid, unique and correct.
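A minimal Python sketch of these scanning-stage checks follows. The cover fields (pages, households, address_code), the function name check_scanned_batch and the sample address codes are assumptions made for illustration; the slides do not give the real data layout.

```python
def check_scanned_batch(cover, scanned_pages, scanned_households,
                        address_repository, seen_codes):
    """Return a list of problems found for one scanned batch (empty if none).

    `cover` holds the values recognized from the batch cover sheet;
    `seen_codes` collects address codes already accepted, to enforce uniqueness.
    """
    problems = []
    if cover["pages"] != scanned_pages:
        problems.append(
            f"page count: cover says {cover['pages']}, scanner counted {scanned_pages}"
        )
    if cover["households"] != scanned_households:
        problems.append(
            f"household count: cover says {cover['households']}, "
            f"scan found {scanned_households}"
        )
    code = cover["address_code"]
    if code not in address_repository:
        problems.append(f"address code {code} is not in the repository")
    elif code in seen_codes:
        problems.append(f"address code {code} was already scanned")
    else:
        seen_codes.add(code)
    return problems


repository = {"110101001", "110101002"}  # made-up codes for the example
seen = set()
first = check_scanned_batch(
    {"pages": 40, "households": 20, "address_code": "110101001"},
    40, 20, repository, seen,
)
second = check_scanned_batch(
    {"pages": 40, "households": 20, "address_code": "110101001"},
    39, 20, repository, seen,
)
print(first)
print(second)
```

The second call reuses the same address code and has one page missing, so it reports both a count mismatch and a duplicate code.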

During recognition, the system collects real-time statistics on the rejection ratio and the suspect ratio. If either ratio is too high, the task administrator investigates the reason.
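The running statistics can be sketched as follows. The slides do not say how the ratios are defined or what counts as "too high", so this sketch assumes that each recognized character is classed as ok, suspect or rejected, and uses hypothetical thresholds of 5% and 10%.

```python
class RecognitionStats:
    """Running counts of recognition outcomes with illustrative alert thresholds."""

    def __init__(self, max_reject_ratio=0.05, max_suspect_ratio=0.10):
        # Threshold values are assumptions, not figures from the census system.
        self.max_reject_ratio = max_reject_ratio
        self.max_suspect_ratio = max_suspect_ratio
        self.counts = {"ok": 0, "suspect": 0, "rejected": 0}

    def record(self, status):
        """Count one recognized character by its outcome."""
        if status not in self.counts:
            raise ValueError(f"unknown status: {status}")
        self.counts[status] += 1

    @property
    def total(self):
        return sum(self.counts.values())

    def ratio(self, status):
        """Share of characters with the given outcome (0.0 before any data)."""
        return self.counts[status] / self.total if self.total else 0.0

    def needs_review(self):
        """True when either ratio exceeds its threshold."""
        return (self.ratio("rejected") > self.max_reject_ratio
                or self.ratio("suspect") > self.max_suspect_ratio)


stats = RecognitionStats()
for status, n in (("ok", 90), ("suspect", 4), ("rejected", 6)):
    for _ in range(n):
        stats.record(status)
print(stats.ratio("rejected"), stats.ratio("suspect"), stats.needs_review())
```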

Page 11

Quality Control

During editing, the system checks that the recognized record count is consistent with the record count in the control document, and checks basic logical relationships and value ranges. Items with errors in logical relationships or value ranges are flagged; the recognition results and the corresponding items from the original scanned images are displayed side by side in parallel windows, and convenient means of correction are provided for items that need to be modified.

After the whole set of data has been captured, quality is assured by sampling quality checks across all phases.
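The editing-stage checks (record count, value ranges, logical relationships) can be sketched as below. The field names, the ranges and the age/marriage rule are hypothetical examples chosen for illustration; the actual census edit rules are not given in the slides.

```python
# Hypothetical value ranges: field name -> (lowest, highest) allowed value.
RANGES = {"age": (0, 120), "sex": (1, 2)}


def check_record(record):
    """Return the names of items that fail a range or logic check."""
    flagged = []
    for name, (low, high) in RANGES.items():
        value = record.get(name)
        if value is None or not low <= value <= high:
            flagged.append(name)
    # Illustrative logic rule: a person under 15 should not be recorded as married.
    age = record.get("age")
    if age is not None and age < 15 and record.get("married"):
        flagged.append("married")
    return flagged


def check_edit_batch(records, control_count):
    """Compare the record count with the control figure and flag suspect items."""
    count_ok = len(records) == control_count
    flagged = {}
    for index, record in enumerate(records):
        items = check_record(record)
        if items:
            flagged[index] = items
    return count_ok, flagged


records = [
    {"age": 30, "sex": 1, "married": True},
    {"age": 10, "sex": 2, "married": True},
    {"age": 130, "sex": 1, "married": False},
]
count_ok, flagged = check_edit_batch(records, control_count=3)
print(count_ok, flagged)
```

The flagged items are the ones an operator would then compare against the original image before correcting.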

Page 12

Main Problems and Solutions

In large-scale census data capture projects, three problems stand out:

1. How to enhance the recognition capability of OCR.

2. Availability and reliability of the system.

3. Project management.

What we have done:

1. Improve the capability of recognizing numeric characters

Two recognition algorithms, and two recognition engines based on them, were developed. After a series of on-site tests, the engine that better suited the census project was chosen.

2. Improve the recognition capability for Chinese characters

By collecting a large number of actual samples and training the recognizer, the recognition capability for Chinese names was improved.

Page 13

Main Problems and Solutions

3. Improve orientation capability

To handle printing deviation and filling deviation, a smart locating algorithm was developed that minimizes the impact of both.

4. Enhance efficiency of recognition

The scanner's underlying software was improved to achieve the best match between the hardware drivers and the OCR software, which raises recognition efficiency.

5. Improve the quality of form filling

Standards for filling in the forms were prescribed so that both the OCR error rate and the rejection rate are reduced.

Page 14

Main Problems and Solutions

6. Establish regulations, working guidance and processes so that every data entry site works to uniform regulations, processes and standards.

7. Strengthen the training. We organized centralized training and on-site training for the users. Centralized training combined lectures with hands-on operation, which deepened users' familiarity with the system.

8. Organize multi-target pilots. We organized multiple pilots in many locations, aimed at different targets.

Page 15

Lessons Learned

Using advanced technology to raise efficiency

Combining technical and administrative methods to resolve quality problems and security issues

Choose partners with strong system development and service capability

Early project preparation

Manage the project together with partners

Training, pilot projects and management are the key to success

Control the printing quality of the census forms and the quality of census data filling

Project change control

Page 16

Prospect of the 2010 Population Census

Census time: Nov. 1, 2010

Short form, long form and death population form

Enumerating foreigners living in China is under consideration

Data capture in 2011

OCR data capture will be the main data entry method

Modify the existing agricultural census system and make some innovations

Add more OCR equipment

Page 17

THANKS