Automated Data Extraction System For Handwritten Student Information Cards
A Manuscript
Submitted to
the Department of Computer Science
and the Faculty of the
University of Wisconsin-La Crosse
La Crosse, Wisconsin
by
Zhicheng Fu
in Partial Fulfillment of the
Requirements for the Degree of
Master of Software Engineering
May, 2013
Automated Data Extraction System For Handwritten Student Information Cards
By Zhicheng Fu
We recommend acceptance of this manuscript in partial fulfillment of this candidate's requirements for the degree of Master of Software Engineering in Computer Science. The candidate has completed the oral examination requirement of the capstone project for the degree.
Dr. XXXXXXXXXXXXXX    Date
Examination Committee Chairperson

Dr. XXXXXXXXXXXXXX    Date
Examination Committee Member

Dr. XXXXXXXXXXXXXX    Date
Examination Committee Member
Abstract
Fu, Zhicheng, Automated Data Extraction System For Handwritten Student Informa-
tion Cards, Master of Software Engineering, May 2013. Advisor: Kenny Hunt.
The University of Wisconsin-La Crosse visits regional high schools and, as part of its
recruitment effort, asks students to fill out a form that collects a student's contact
information, GPA, ACT score and academic interests. These cards are then hand-entered
into a central database for further recruitment efforts. This project describes the design
of a semi-automated system to extract information from these Student Information Cards.
The project addresses two central problems: extracting handwritten markings from the
surrounding form and recognizing the extracted handwritten characters. In the extraction
phase we mark feature points in a template image and perform image registration with re-
spect to these features. Image subtraction yields an approximate result that is later refined
via specialized ad-hoc filtering rules. The form fields are then fed into a custom character
recognition engine and are semantically checked against a dictionary to improve the overall
recognition rate.
Acknowledgements
I would like to express my sincere thanks to my project advisor, Dr. Kenny Hunt, for
initiating this project and providing support throughout. I also want to thank Jeremiah
Collins from the Admissions Office at the University of Wisconsin-La Crosse for supporting
this project. Finally, I would like to thank the Computer Science Department and the
University of Wisconsin-La Crosse for providing the computing environment for my project.
Contents

1 Introduction . . . 1
2 Requirement Gathering And Analysis . . . 3
   2.1 Gathering Process . . . 3
   2.2 Functional Requirements . . . 4
   2.3 Selection of Life Cycle Model . . . 6
   2.4 GUI Functional Requirements . . . 8
3 Design . . . 11
   3.1 Extraction Application . . . 11
   3.2 Recognition Library . . . 20
4 Implementation . . . 20
   4.1 Extracting Handwritten Information . . . 21
      4.1.1 Image Registration . . . 21
      4.1.2 Handwriting Extraction From Image . . . 25
   4.2 Description of Character Recognition . . . 26
5 Testing . . . 31
6 Conclusion And Future Enhancements . . . 32
List of Figures

1 Student Information Card . . . 2
2 Use Case Diagram for System User . . . 5
3 Iterative and Incremental Development Model . . . 7
4 Main User Interface . . . 9
5 Main User Interface . . . 9
6 Functionalities about 'File' menu . . . 10
7 Operations of processing images . . . 10
8 UML class diagram of the MainPanel class . . . 12
9 UML class diagram of the entity classes . . . 14
10 UML class diagram of the ImageRegistration class . . . 15
11 UML class diagram of the ImageSubtration class . . . 17
12 UML class diagram of the DigitRecognition class . . . 18
13 Key Components In The Template Image . . . 22
14 An Example Of Image Registration . . . 24
15 This shows how noise components are eliminated. The rectangle areas in Figure 15a are noise components to be eliminated. . . . 25
16 Chain Coding Example . . . 26
17 Zip code field and the most likely matches for each digit . . . 28
18 This shows an example where only 4 of the 5 digits are correctly identified . . . 28
19 Test Case 1 . . . 31
20 Test Case 2 . . . 32
21 Connected And Separated Digits In The Image . . . 33
22 A component is cut into several components after extraction of handwritten data . . . 33
List of Tables

1 Functional Requirement to Method Map . . . 12
2 Core Classes . . . 13
3 The Description of Entity Classes . . . 14
4 Description of ImageRegistration Class . . . 16
5 Description of ImageSubtraction Class . . . 17
6 Description of DigitRecognition Class . . . 19
7 The ranked list corresponding with Figure 17 . . . 30
8 The ranked list corresponding with Figure 18 . . . 30
GLOSSARY
Student Information Card
The Admissions Office at UW-L collects information regarding potential students by
visiting high school career fairs across the region and having students fill out information
cards. A student information card has fields for contact information (student name, address,
email, phone number), high school, academic information (GPA, SAT, ACT) and other data
relevant to the admissions process.
Incremental Model
The incremental model is a method of software development where the software is de-
signed, implemented and tested incrementally until the product is finished. The incremental
model is an evolution of the waterfall model, where the waterfall model is incrementally
applied [2].
Chain Coding
Chain coding is a technique for representing the contour of a component. A chain code
defines the contour of a component as each boundary pixel is traversed by describing the
direction of each next contour pixel.
Recognition Library
The recognition library is a third-party library for recognizing handwritten characters.
1 Introduction
The Admissions Office at UW-L collects information regarding potential students by
visiting high school career fairs across the region and having students fill out information
cards. Each card has fields for contact information (student name, address, email, phone
number), high school, academic information (GPA, SAT, ACT) and other data relevant to
the admissions process.
The Admissions Office presently collects thousands of these cards annually and must
manually enter all of the data into a database for later tracking and processing of the
students. Manual data entry is time consuming and error prone. The Admissions Office
recently indicated that hand-entering the data from a recent season of recruiting activities
occupied several interns and took three months.
This project seeks to automate the process of extracting the information on these cards.
The information will be extracted using a handwriting recognition system and automatically
(or semi-automatically) entered into a database. Figure 1 shows an example of such a card.
Figure 1: Student Information Card
2 Requirement Gathering And Analysis
2.1 Gathering Process
The Automated Extraction System for Student Card Project was conceived by Dr. Hunt
and Jeremiah Collins from the Admissions Office of the University of Wisconsin-La Crosse
before the formal beginning of this program. Jeremiah Collins served as the project's
sponsor; Dr. Hunt served as its supervisor.
The project initially included the development of a handwriting recognition engine.
However, the scope of the recognition engine was too large and imposed too much
complexity on the developer. To address this, the developer, with assistance from
Dr. Hunt, researched third-party software packages capable of performing character
recognition. A software package was chosen based on the required features, and the
scope of the program was therefore appropriately reduced by delegating the complex
character recognition to the third-party software.
The sponsor met with the developer several times to gather informal functional require-
ments of this program. These informal functional requirements helped to define the scope
of the program as well as capture the true nature and purpose of the application. During the
first of these informal meetings, the sponsor provided samples of real student information
cards and identified the principal data elements obtained from the student. Based on the in-
formation collected from the meetings, the requirement document version 1.0 was created,
in which the following fundamental requirements are listed:
• This program must provide a graphical user interface.
• The scanned image of the student information card should be full color and 300 dpi. This
scanning step is manual and is performed outside of the software system.
• The format of the scanned image is TIF, PNG or JPEG.
• These images are placed into a folder, and the folder is loaded into the program as
input.
• The information extracted from the scanned image must be stored in a text file. The
sequence of the field information must be arranged according to a standard that will
be provided by the sponsor.
• This program must run on the Windows platform with a JVM.
Overall, the project produced two requirement documents. Version 1.0 included the
detailed functional requirements described in Section 2.2. Version 2.0 added the selected
life cycle model, described in Section 2.3, as well as the GUI requirements described in
Section 2.4.
2.2 Functional Requirements
This program is a stand-alone desktop system. The program does not manage user
accounts, and hence there is only one role, the System User, supported by this system.
Figure 2 gives a Use Case diagram for the System User.
As shown in Figure 2, there are seven use cases in the diagram. Each use case
describes a functional requirement. These functional requirements are described as follows:
• The 'Load Images' function allows a user to load scanned images into the system.
The user is able to load single or multiple images. If an image loaded into the
system does not satisfy the standard described in Section 2.1, the system notifies
the user that an error has occurred.

Figure 2: Use Case Diagram for System User
• The ’View Images’ function allows users to view all the images that have been al-
ready loaded into the system.
• The ’Process Single Image’ function allows users to extract and recognize all of the
handwritten information from a single image.
• The ’Process All Images’ function allows users to extract and recognize all of the
handwritten information from every loaded image.
• The ’Store Information’ function allows users to store the extracted data into a database.
• The ’Modify Information’ function allows users to manually view and modify all the
results of the recognition.
• The ’Exit System’ function allows users to terminate the application.
2.3 Selection of Life Cycle Model
We analyzed the requirements in requirement document version 1.0 and identified the
risks listed below:
• Lack of detailed specification for the GUI.
• Lack of experience on handwriting recognition.
• Potential misunderstanding between sponsor and developer.
• Potential change of the student card format, which will result in potential modifica-
tion of the requirements.
These risks were roughly estimated based on the situations described by the sponsor
and developer. To alleviate them, we selected an adaptive software development model:
the incremental model, in which the software is designed, implemented and tested
incrementally until the product is finished. As shown in Figure 3 [3], the number of
increments is not fixed in advance.
As opposed to the waterfall model, the incremental model offers more flexibility to
fine-tune the current development direction and gradually satisfy the sponsor's
expectations as changes occur, without triggering a chain reaction through the
Requirements, Design, Implementation and Test phases. This is especially valuable when
some system characteristics are uncertain and requirements change often. Through the
iterative improvement in each increment, the development process can be refined accordingly.
As a result, the sponsor was able to experience and review partial functionality and then
give feedback that shaped the requirements for the next increment. Also, should the
sponsor want to change the standards of the student card or scanned image, system
development would be able to respond to these changes rapidly.
Figure 3: Iterative and Incremental Development Model
Each increment includes the completion of several functional requirements. The major
consideration in determining which functionalities to include in the earlier increments
was their importance and their contribution to the development of the entire system.
Functionalities concerning the handwritten extraction procedure and character recognition
therefore had a higher priority than others. The increments that occurred in this project
are listed below:
• Increment 1: Graphical User Interface functionalities related to user interaction.

• Increment 2: Handwritten extraction functionalities related to application interaction.

• Increment 3: Handwriting recognition functionalities related to application interaction.

• Increment 4: Enhancing application review and evaluation.

• Increment 5: System configuration.

• Increment 6: Writing and executing test cases.

• Increment 7: Enhancing interactivity of the Graphical User Interface.
2.4 GUI Functional Requirements
The GUI functional requirements closely mirror the functional requirements. In version
2.0, the GUI shown in Figure 4 was designed based on the sponsor's requirements.
However, the GUI requirements changed after further meetings with the sponsor, so the
main graphical user interface was revised to the one shown in Figure 5, which contains
two menu options, an image panel and a recognition panel.
Figure 4: Main User Interface
Figure 5: Main User Interface
The 'Image Panel' shows the images loaded into the system. The 'Recognition Panel'
reproduces the same form as the one on the student information card. Before the system
user loads images into the system, all the text fields in the recognition panel are
empty and the 'Store Information' button is disabled.
The 'File' menu has two options, 'Load File' and 'Exit', as shown in Figure 6.
Figure 6: Functionalities about ’File’ menu
The 'Operation' menu also has two options, 'Process All Images' and 'Process Single
Image', as shown in Figure 7.
Figure 7: Operations of processing images
After users load scanned images into the system, the 'Process Single Image' and
'Process All Images' options are enabled. After recognition, the results are shown in the
'Recognition Panel' and all the information in the text fields can be modified. Users can
manually view or modify the extracted data to verify that it matches the corresponding
scanned image, and then store the data in the database.
3 Design
In the first version of the design document, the architectural design of this application
was described using UML class diagram notation, but the class diagrams did not include
any methods or attributes. There were no details about the character recognition engine
because we planned to use a third-party recognition engine to assist this application.
Later, we reanalyzed the requirements and updated the class diagrams with attributes and
methods, as shown in Section 3.1, and determined which third-party recognition engine to
use for handwriting recognition. As a whole, the design of the application consists of
two pieces. The extraction application design describes how the main application is
organized and functions. The recognition library is a third-party library for recognizing
handwritten characters and was developed by Dr. Hunt.
3.1 Extraction Application
The extraction application implements all of the GUI Functional Requirements and is
solely responsible for interaction with the system user. The GUI communicates with the
MainPanel class to activate the underlying Functional Requirements.
The MainPanel class, shown in Figure 8, maps each functional requirement to a method.
Each method's name is nearly identical to the requirement name, making it easy to
identify. Table 1 shows the exact mapping of functional requirements to class methods.
Figure 8: UML class diagram of the MainPanel class.
Functional Requirement   Implementation Method Signature
Load Images              loadImagesActionPerformed(ActionEvent evt): void
Exist System             existSystemActionPerformed(ActionEvent evt): void
Process Single Image     processSingleImageActionPerformed(ActionEvent evt): void
Process All Images       processAllImagesActionPerformed(ActionEvent evt): void
Store Information        storeInformationActionPerformed(ActionEvent evt): void
View Images              doPreviousActionPerformed(ActionEvent evt): void
                         doNextActionPerformed(ActionEvent evt): void
Table 1: Functional Requirement to Method Map
The MainPanel class has private methods which perform some of the more complex
operations; for example, processSingleImageActionPerformed() calls many methods of
other classes. Within this application, three core classes are used for extraction and
recognition. The core classes are listed in Table 2.
Class Name         Class Function
ImageRegistration  Provides the methods for image rotation.
ImageSubtraction   Provides the methods for extracting handwritten information from the registered image.
DigitRecognition   Provides the methods for digit recognition.
Table 2: Core Classes
Before introducing the three core classes, we describe the entity classes used in this
application to model corresponding objects. All the entity classes are shown in Figure 9
and their descriptions are listed in Table 3. As shown in Figure 9, the entity classes
present only their attributes, but each attribute has set and get methods to access or
change it.
Figure 9: UML class diagram of the entity classes.
Class Name              Class Function
DoublePoint             A class to model a point whose coordinates are double precision.
Component               A class to model a connected component in the binary image.
Student                 A class to model a student based on the information entered on the student information card.
SingleDigitGuessedList  A class to model the guessed recognition list of a single digit.
State                   A class to model a state based on its zip code and city.
Table 3: The Description of Entity Classes
The ImageRegistration class provides the methods for image rotation. Figure 10 lists all
the methods in the ImageRegistration class. All the methods in this class are public;
some variables are private. We describe each method of the ImageRegistration class in
Table 4.
Figure 10: UML class diagram of the ImageRegistration class.
getBufferedImage(Image image)
    Convert the image to the BufferedImage type.

getBinaryImage(BufferedImage image, int thresholdValue)
    Based on the threshold value, convert the full-color image to a binary image.

setSelectedBoxedToStudent(Student student)
    Store the selected boxes' information to the related student.

getMarkedFeatures(BufferedImage src, BufferedImage dest)
    Find the marked features in the image and store their information.

showMarkedFeatures(BufferedImage src, BufferedImage dest, LinkedList<Regions> regs)
    Highlight the marked features.

setSourceDoublePoint(BufferedImage src)
    Get and store the information of all feature components in the template image.

getRegisterationPoints(BufferedImage src)
    Get and store the information of all feature components in the scanned image.

sortRegions(LinkedList<Regions> reg1, LinkedList<Regions> reg2)
    Reorder the sequence of the marked features in storage.

getCenterPoint(LinkedList<Point> points)
    Get the centroid of each feature component.

getTranslationXY(LinkedList<Point> points)
    Get the translation matrix between the template image and the scanned image.

getTranslationImage(BufferedImage src, DoublePoint temp)
    Get an image after translating the scanned image according to the translation matrix.

getAverageAngle(LinkedList<Point> points)
    Get the rotation angle between the template image and the translated image.

getTranslationImage(BufferedImage src, DoublePoint temp)
    Get a new image after rotating the translated image.

getRidOfBlackEgdes(BufferedImage src, BufferedImage dest)
    Eliminate the noise components at the edges of the image.

checkBounds(BufferedImage src, BufferedImage dest)
    Check the bounds of the image after image rotation.

getRegistrationFinalImage(BufferedImage src)
    Get the registered image.

createCompatibleDestImage(BufferedImage src, ColorModel destCM)
    Create a copy of the input image.
Table 4: Description of ImageRegistration Class
ImageSubtraction is a class providing the methods for extracting handwritten information
from the registered image. All the input images in these methods are binary images. In
addition, the binary template image serves as the baseline image for subtraction. All the
methods are listed in Figure 11 and described in Table 5.
Figure 11: UML class diagram of the ImageSubtration class.
getExtractionImage(BufferedImage sourceBinary, BufferedImage registerdImage)
    Use the registered image to subtract the sourceBinary image, producing a new image containing only handwritten information.

eraseNoiseFromImage(BufferedImage src, int w, int h)
    Eliminate the noise components whose height or width is less than the corresponding input value.
Table 5: Description of ImageSubtraction Class
DigitRecognition is a class providing the methods to connect to the third-party
recognition engine and obtain recognition results for the numeric fields. Figure 12 is
the UML class diagram of this class. We describe each method in Table 6.
Figure 12: UML class diagram of the DigitRecognition class.
getAllRankedZipCodeList(BufferedImage src)
    Takes a binary image containing only the zip code as input, then generates a ranked list of all possible results.

getAllMatchedDigits(BufferedImage src, int type)
    Connects to the third-party recognition engine, loads the input image into the recognizer, and gets a ranked list of possible digits for each individual digit in the image. The 'type' parameter decides which numeric field image is used as the input image.

getAllPossibleResults(LinkedList<SingleDigitGuessedList> t, int index, double confidence, String out, LinkedList<SingleDigitGuessedList> res, int type)
    Combines each possible digit from the ranked lists, generating all possible combinations.

getAllRankedGPAList(BufferedImage src)
    Takes a binary image containing only the GPA as input, then generates a ranked list of all possible results.

getAllRankedACTorSATList(BufferedImage src)
    Takes a binary image containing only the ACT or SAT score as input, then generates a ranked list of all possible results.

getAllRankedGraduationYears(BufferedImage src)
    Takes a binary image containing only the graduation year as input, then generates a ranked list of all possible results.

getAllRankedHomePhoneList(BufferedImage src)
    Takes a binary image containing only the home phone number as input, then generates a ranked list of all possible results.

getAllRankedCellPhoneList(BufferedImage src)
    Takes a binary image containing only the cell phone number as input, then generates a ranked list of all possible results.

getAllRankedBirthdayList(BufferedImage src)
    Takes a binary image containing only the birthday as input, then generates a ranked list of all possible results.

sortDigitsRankedList()
    Reorders the ranked list based on the confidence of each possible result.
Table 6: Description of DigitRecognition Class
3.2 Recognition Library
We use a neural network technique to recognize individual characters within the form.
The neural network is divided into a recognizer for numeric and mixed inputs. The digit
recognizer uses a fully-connected three layer topology consisting of 408 input nodes with
10 output nodes. A connected component is rasterized into a 20x20 grayscale images to
account for 400 of the features. The 8 remaining features are defined by the chain code
histogram. The network was trained on the NIST handwritten image database, one of the
ad-hoc standards of handwritten data. we feed the individual components of this field into
the digit recognition engine. The neural network generates a ranked list of likely matches
for any particular component that is fed into the network. We take the rankings for each
individual digit and generate a short-list of the most likely zip codes that can be constructed
from the individual digits.
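The 400 + 8 = 408 feature layout described above can be sketched in Java. This is a minimal illustration, assuming the raster values are already grayscale intensities in [0, 1] and that the histogram bins are normalized by the code length; the class and method names are hypothetical, not the project's actual API.

```java
public class FeatureVector {
    // Builds a 408-element input vector: 400 grayscale values from a
    // 20x20 raster of the component, followed by an 8-bin chain code
    // histogram. Names and normalization are illustrative assumptions.
    public static double[] build(double[][] raster20x20, int[] chainCode) {
        double[] features = new double[408];
        // First 400 features: the rasterized component, row by row.
        for (int r = 0; r < 20; r++) {
            for (int c = 0; c < 20; c++) {
                features[r * 20 + c] = raster20x20[r][c];
            }
        }
        // Last 8 features: chain code histogram, one bin per direction.
        int[] hist = new int[8];
        for (int d : chainCode) {
            hist[d]++;
        }
        for (int i = 0; i < 8; i++) {
            features[400 + i] = chainCode.length == 0
                    ? 0.0
                    : (double) hist[i] / chainCode.length;
        }
        return features;
    }
}
```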
4 Implementation
This project addresses two central problems: extracting handwritten markings from
the surrounding form and recognizing the extracted handwritten characters. Variations in
the scanned forms introduce complexity into the task of extracting the handwritten data.
Scanning thousands of card images will result in images where the location and
orientation of the fields differ from card to card. Variations in handwriting make
recognition even more difficult, especially when the handwriting is so poor that even
human eyes cannot recognize the characters.
In the extraction phase of this project, we mark feature points in a template image
and perform image registration with respect to these features. Image subtraction yields an
approximate result that is later refined via specialized ad-hoc filtering rules. As to
character recognition, we built our own character recognition engine. To better recognize
digits and letters separately, we created one training data set for letters and another
for digits; we also created our own dictionary to improve the recognition rate. In the
end, we feed the handwritten characters into this engine to obtain the recognition results.
4.1 Extracting Handwritten Information
4.1.1 Image Registration
Scanning thousands of card images will result in images where the location and ori-
entation of the fields in each card are varied. Image registration is used to normalize the
location and orientation of the fields in the scanned image. Image registration is performed
as described below.
1. Assume that we have a template image T. This image, an empty form with zero rotation,
forms the baseline image for the system. We need to identify key components of T to use
for image registration. Key components are components in T that are used to identify the
rotation of the image. To find them, we analyzed the baseline form for representative
components that are easy to identify no matter how the image is rotated. The key
components are identified in advance, as shown by the highlighted elements in Figure 13.
2. Let Kc be the sequence of key components, where each key component has a location,
height, width and centroid. Let Tcs be the sequence of all centroids of the key
components. Let Tcc be the centroid of the centroids.

Figure 13: Key Components In The Template Image
3. Let S be the full-color source image. Let B be the binarized image of S. Identify
the key components in B corresponding with the key components of T. Compute the
centroids of key features in B. Call these centroids Bcs.
4. Compute the centroid of the Bcs and call it Bcc. Let TRANS be a translation matrix
that takes Bcc and maps it to Tcc. Translate all points in Bcs by this amount.
5. For each centroid in Bcs, compute the amount by which that point must be rotated
(around Tcc) to align with the corresponding Tcs point. Compute the average amount of
rotation. Let ROT be a rotation matrix that performs this rotation.
6. Take the source image S and apply TRANS and then ROT.
7. Binarize this rotated source image using an adaptive threshold value. Let Sr be the
binarized image.
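The geometry in steps 4 and 5 (translating so the centroids of centroids coincide, then averaging per-point rotation angles about Tcc) can be sketched as follows. This is a minimal illustration with assumed names, not the manuscript's actual ImageRegistration code.

```java
import java.awt.geom.Point2D;
import java.util.List;

public class RegistrationMath {
    // Centroid of a list of points (used for Tcc and Bcc, the
    // "centroid of centroids" in steps 2 and 4).
    public static Point2D.Double centroid(List<Point2D.Double> pts) {
        double sx = 0, sy = 0;
        for (Point2D.Double p : pts) {
            sx += p.x;
            sy += p.y;
        }
        return new Point2D.Double(sx / pts.size(), sy / pts.size());
    }

    // Average angle (radians) by which the translated scanned centroids
    // must rotate about tcc to align with the template centroids (step 5).
    // tcs and bcsTranslated are assumed to be in corresponding order.
    public static double averageRotation(List<Point2D.Double> tcs,
                                         List<Point2D.Double> bcsTranslated,
                                         Point2D.Double tcc) {
        double sum = 0;
        for (int i = 0; i < tcs.size(); i++) {
            double aT = Math.atan2(tcs.get(i).y - tcc.y, tcs.get(i).x - tcc.x);
            double aB = Math.atan2(bcsTranslated.get(i).y - tcc.y,
                                   bcsTranslated.get(i).x - tcc.x);
            double d = aT - aB;
            // Wrap into (-pi, pi] so angles near the branch cut average sanely.
            while (d > Math.PI) d -= 2 * Math.PI;
            while (d <= -Math.PI) d += 2 * Math.PI;
            sum += d;
        }
        return sum / tcs.size();
    }
}
```

In the actual system the resulting translation and angle would feed a transform (e.g. `java.awt.geom.AffineTransform`) applied to the source image S in step 6.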
Figure 14 shows the process of image registration. The baseline image of the system
is shown in (a); the scanned image of the student card is shown in (b); the thresholded
version of the scanned image is shown in (c); finally, the rotated image is shown in (d).
(a) Image T (b) Image S
(c) Image B (d) Image Sr
Figure 14: An Example Of Image Registration
4.1.2 Handwriting Extraction From Image
In this phase, we take the registered form and eliminate the pixels related to the form
while keeping the pixels related to the handwritten data. Once the handwritten data is
identified, we feed the fields into the character recognition engine. The handwritten
data is obtained as follows.
1. Let T be the baseline form image.
2. Let image Sr be the registered image. Let Sb = Sr - T.
3. Eliminate noise components in Sb whose width or height is less than the predefined
threshold values and get another image Sbn.
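Steps 2 and 3 can be sketched with boolean arrays standing in for binary images. This is a minimal illustration under assumed names, not the project's ImageSubtraction implementation; it erases a component when either its bounding-box width or height falls below the given thresholds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HandwritingExtraction {
    // Step 2: Sb = Sr - T. A pixel survives only if it is foreground
    // in the registered image Sr but background in the template T.
    public static boolean[][] subtract(boolean[][] sr, boolean[][] t) {
        int h = sr.length, w = sr[0].length;
        boolean[][] sb = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                sb[y][x] = sr[y][x] && !t[y][x];
        return sb;
    }

    // Step 3: erase 8-connected components whose bounding box is
    // narrower than minW or shorter than minH (treated as noise).
    public static void eraseNoise(boolean[][] img, int minW, int minH) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                if (!img[y][x] || seen[y][x]) continue;
                // Flood fill to collect one component and its bounds.
                List<int[]> comp = new ArrayList<>();
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{y, x});
                seen[y][x] = true;
                int minY = y, maxY = y, minX = x, maxX = x;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    comp.add(p);
                    minY = Math.min(minY, p[0]); maxY = Math.max(maxY, p[0]);
                    minX = Math.min(minX, p[1]); maxX = Math.max(maxX, p[1]);
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++) {
                            int ny = p[0] + dy, nx = p[1] + dx;
                            if (ny >= 0 && ny < h && nx >= 0 && nx < w
                                    && img[ny][nx] && !seen[ny][nx]) {
                                seen[ny][nx] = true;
                                stack.push(new int[]{ny, nx});
                            }
                        }
                }
                if (maxX - minX + 1 < minW || maxY - minY + 1 < minH)
                    for (int[] p : comp) img[p[0]][p[1]] = false;
            }
    }
}
```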
(a) Image Sb (b) Image Sbn
Figure 15: This shows how noise components are eliminated. The rectangle areas in Figure15a are noise components to be eliminated.
4.2 Description of Character Recognition
We use a neural network technique to recognize individual characters within the form.
The digit recognizer uses a fully-connected three-layer topology consisting of 408 input
nodes and 10 output nodes. A connected component is rasterized into a 20x20 grayscale
image to account for 400 of the features. The 8 remaining features are defined by the
chain code histogram.
Chain coding is a technique for representing the contour of a component. A chain code
defines the contour of a component as each boundary pixel is traversed by describing the
direction of each next contour pixel [1]. Under chain coding, a component is defined by
the location of a foreground pixel lying on the boundary of the component. We will refer
to this pixel as the starting point Ps. A list of directions, as shown in Figure 16a,
defines how to traverse each of the boundary pixels of the component, beginning and
ending with Ps.
Figure 16b gives an example of how chain coding could be used to represent a component.
The start point Ps is given as (0, 0) and the 8-connected code is given by {011033446666}.
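A chain code and start point fully determine the boundary path. The sketch below assumes a standard 8-direction convention (0 = east, numbered counterclockwise), which may differ from the exact convention in Figure 16a; under this assumption, walking the example code {011033446666} from Ps returns to Ps, as a closed contour must.

```java
import java.awt.Point;

public class ChainCode {
    // 8-connected direction offsets, 0 = east, numbered counterclockwise.
    // (Assumed convention; Figure 16a defines the one actually used.)
    private static final int[] DX = {1, 1, 0, -1, -1, -1, 0, 1};
    private static final int[] DY = {0, 1, 1, 1, 0, -1, -1, -1};

    // Reconstructs the boundary path from a start point and chain code.
    // Returns code.length + 1 points; a closed contour ends where it began.
    public static Point[] walk(Point start, int[] code) {
        Point[] path = new Point[code.length + 1];
        path[0] = new Point(start);
        for (int i = 0; i < code.length; i++) {
            Point prev = path[i];
            path[i + 1] = new Point(prev.x + DX[code[i]], prev.y + DY[code[i]]);
        }
        return path;
    }
}
```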
(a) 8-connected chain code [1] (b) Chain coding example [1]
Figure 16: Chain Coding Example
The chain code histogram (CCH) is commonly used as a feature extraction technique
for character recognition [4]. Similar shapes have similar CCH distributions, and the
directional information captured by the CCH is the key to characterizing a shape or
pattern. Let CC be a chain code. The histogram h of CC is defined as
hi = Σ_{j ∈ CC} count(i, j),   i ∈ [0, 7]

count(i, j) = 1 if i = j, 0 if i ≠ j
The CCH of a component can thus be computed in a single counting pass over its chain
code. For the example shown in Figure 16b, the CCH of the gray component is as follows.
i    0  1  2  3  4  5  6  7
hi   2  2  0  2  2  0  4  0
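Computing the CCH is then a single counting pass over the chain code, as in this minimal sketch (illustrative class name). Applied to the code {011033446666} from Figure 16b, it reproduces the table above.

```java
public class CchHistogram {

    // h[i] counts how often direction i occurs in the chain code,
    // i ranging over the eight 8-connected directions [0, 7].
    public static int[] histogram(String chainCode) {
        int[] h = new int[8];
        for (char c : chainCode.toCharArray())
            h[c - '0']++;
        return h;
    }
}
```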
Since similar shapes have similar CCH distributions, we create thousands of samples of
handwritten digits and use these samples as training data for the recognition engine.
Since numeric fields consist purely of numeric data, we feed the individual components
of such a field into the digit recognition engine. For each character in the field image,
the engine generates a list of possible matching digits together with a similarity for each.
Similarities lie in the range [0, 1], where a higher similarity means a better match.
Two handwritten zip code fields are shown in Figure 17 and Figure 18. Each digit is
extracted from the zip code field and fed into the digit recognition engine, and a ranked
list of estimates is generated for each digit. Each digit of Figure 17 is identified
correctly, while only 4 of the 5 digits are recognized correctly in Figure 18.
(a) Zip Code Image
0.654 : [5]  0.560 : [8]  0.544 : [4]  0.534 : [2]  0.527 : [3]  0.526 : [6]  0.524 : [9]  0.521 : [7]
0.564 : [2]  0.546 : [5]  0.523 : [7]  0.512 : [3]  0.511 : [4]  0.508 : [6]  0.505 : [8]  0.499 : [0]
0.665 : [1]  0.598 : [8]  0.564 : [4]  0.551 : [2]  0.538 : [7]  0.533 : [5]  0.524 : [3]  0.513 : [9]
0.632 : [6]  0.623 : [0]  0.555 : [5]  0.491 : [2]  0.489 : [3]  0.486 : [8]  0.463 : [4]  0.433 : [9]
0.565 : [3]  0.532 : [7]  0.513 : [5]  0.502 : [9]  0.487 : [2]  0.480 : [8]  0.475 : [4]  0.458 : [1]
Figure 17: Zip code field and the most likely matches for each digit.
(a) Zip Code Image
0.635 : [5]  0.562 : [3]  0.547 : [0]  0.532 : [6]  0.528 : [8]  0.501 : [9]  0.500 : [2]  0.483 : [4]
0.572 : [8]  0.563 : [1]  0.561 : [7]  0.547 : [5]  0.539 : [9]  0.533 : [3]  0.532 : [2]  0.476 : [4]
0.657 : [0]  0.523 : [6]  0.500 : [3]  0.482 : [5]  0.471 : [2]  0.469 : [8]  0.464 : [9]  0.443 : [4]
0.716 : [1]  0.595 : [8]  0.537 : [2]  0.519 : [5]  0.509 : [4]  0.493 : [6]  0.488 : [3]  0.473 : [9]
0.633 : [4]  0.539 : [9]  0.520 : [7]  0.512 : [8]  0.478 : [2]  0.477 : [6]  0.474 : [5]  0.469 : [1]
Figure 18: This shows an example where only 4 of the 5 digits are correctly identified.
Since we have semantic knowledge about the zip code field, we can use this knowledge
to increase our confidence and accuracy. Most of our student recruitment efforts involve
Wisconsin and its neighboring states, so we construct a database of zip codes from these
regions and search it for the most likely match. Using the same idea, we also construct
databases for the birthday, GPA, graduation year, and ACT/SAT score fields. Algorithm 1
shows how the database entries are ranked by summing the per-digit confidences.
Algorithm 1 Compute ranked list of database
Let DB be a list of all semantic data in the database
for each semantic data Z in DB do
    Sum ← 0
    for each digit D in Z do
        Sum ← Sum + confidence of D from the digit recognition list
    end for
    Z.confidence ← Sum
end for
Sort DB by the total confidence of the data
return DB
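Algorithm 1 amounts to a dictionary scan with per-digit confidence lookups. The Java sketch below is illustrative rather than the project's actual code; in particular, representing each digit position as a map from candidate digit to similarity is an assumed simplification, with absent candidates contributing zero.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SemanticRanker {

    // conf.get(p) maps a candidate digit to its recognizer similarity at
    // position p; digits missing from the map contribute 0 to the sum.
    public static List<Map.Entry<String, Double>> rank(
            List<String> dictionary, List<Map<Character, Double>> conf) {
        Map<String, Double> scored = new HashMap<>();
        for (String entry : dictionary) {
            double sum = 0;
            for (int p = 0; p < entry.length(); p++)
                sum += conf.get(p).getOrDefault(entry.charAt(p), 0.0);
            scored.put(entry, sum);
        }
        // highest total confidence first
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scored.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```

Fed the per-digit similarities of Figure 18 and a small three-entry dictionary, this sketch reproduces the top three rows of Table 8.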
Using the semantic dictionary, the ranked list of zip codes corresponding to Figure 17 is
shown in Table 7, and the results for Figure 18 are shown in Table 8.
Guessed Result                 Total Confidence
52163, Protivin, IA.           3.081
55103, Saint Paul, MN.         3.053
52165, Ridgeway, IA.           3.029
55107, Saint Paul, MN.         3.020
53103, Big Bend, WI.           3.019
52169, Wadena, IA.             3.018
50163, Melcher, IA.            3.016
55165, Saint Paul, MN.         3.011
50103, Garden Grove, IA.       3.006
52803, Davenport, IA.          3.004

Table 7: The ranked list corresponding to Figure 17
Guessed Result                 Total Confidence
51014, Cleghorn, IA.           3.204
55014, Circle Pines, MN.       3.188
53014, Chilton, WI.            3.174
54014, Diamond Bluff, WI.      3.117
51019, Danbury, IA.            3.110
61014, Chadwick, IL.           3.102
55019, Dundas, MN.             3.094
51018, Cushing, IA.            3.083
53019, Eden, WI.               3.080
55017, Dalbo, MN.              3.074

Table 8: The ranked list corresponding to Figure 18
As shown in Figure 18, only 4 of the 5 digits are correctly identified, which would yield
invalid zip codes in the raw ranked list of possible zip codes. Using the semantic
dictionary, we obtain a more precise ranked list of valid zip codes. From Figure 18, the
best raw match is '58014', but this candidate must be discarded because '58014' does not
exist in the dictionary. We therefore conclude that using the semantic dictionary improves
recognition accuracy.
5 Testing
In iterative and incremental development, testing is conducted during each increment
after its implementation phase, so after implementing the functional requirements we
began to test all of the functions. Functional requirements such as LoadImage and
ExistSystem are tested by manual observation, but the extraction functionality is not as
easily tested. To test it properly, we need thousands of actual forms from the Admission
Office. We do not presently have access to actual forms due to privacy issues; instead,
we have collected a few dozen forms filled out with simulated information for testing the
extraction functionality. We cannot determine the accuracy of the system until we have
access to an image set of actual student information cards. Figure 19 and Figure 20 show
two test examples:
Figure 19: Test Case 1
Figure 20: Test Case 2
6 Conclusion And Future Enhancements
With this program, we can successfully extract handwritten information from the scanned
image, and we can recognize numeric fields using the third-party recognition library
provided by Dr. Hunt. In addition, the program can recognize the checkboxes selected by
students, such as the gender field, with high accuracy.
Before any new features are added to the main application, the accuracy testing of numeric
field recognition needs to be completed, which requires thousands of actual forms from the
Admission Office.
As described in Section 4.2, the digit recognizer accepts a numeric field image as input
and segments the image into individual digits. For each digit, the recognizer generates a
ranked list of guesses. In this program, we assume that the digits in the image are not
connected, as shown in Figure 21b. But as shown in Figure 21a, the '6' and '0' are
connected, so if we input Figure 21a into the digit recognizer, it will treat '60' as a
single digit, which is an obvious error. In the future, a function will be added to the
program that intelligently segments such components into separate characters for later
recognition.
(a) Connected Digits (b) Separated Digits
Figure 21: Connected And Separated Digits In The Image
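One possible approach to this segmentation enhancement, sketched below under the assumption that touching digits meet at a thin "neck", is to cut the component at the interior column whose vertical projection is smallest. This is a hypothetical heuristic, not the planned implementation.

```java
public class DigitSplitter {

    // Count foreground pixels in each column (the vertical projection).
    public static int[] verticalProjection(boolean[][] img) {
        int h = img.length, w = img[0].length;
        int[] profile = new int[w];
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                if (img[y][x]) profile[x]++;
        return profile;
    }

    // Propose a cut at the interior column with the thinnest projection,
    // a crude heuristic for the neck where two digits touch.
    public static int proposeCut(boolean[][] img) {
        int[] p = verticalProjection(img);
        int best = 1;
        for (int x = 2; x < p.length - 1; x++)
            if (p[x] < p[best]) best = x;
        return best;
    }
}
```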
As shown in Figure 22b, when we extracted the handwritten data of the email field shown
in Figure 22a, the "@" was cut into two parts. A future enhancement would be to merge
parts that belong to the same component in the registered image back together after
extracting the handwritten data.
(a) Image a (b) Image b
Figure 22: A component is cut into several components after extraction of handwritten data
Recognition of mixed fields that contain digits, symbols, or letters will also be added
to the main application. Essentially, the application should recognize not only numeric
fields but also mixed fields, such as the name field.
We also need to implement the function that stores the extracted data in a database. To
implement this, we need to confirm with our sponsors what data format they want stored in
the database.
References

[1] Kenny Hunt. The Art of Image Processing with Java. CRC Press, 2010.

[2] Roger Pressman. Software Engineering: A Practitioner's Approach. 2:41-42, 2010.

[3] Wen-Kai Shen. "An online scholarship application system". Capstone Project, Department of Computer Science, University of Wisconsin-La Crosse, WI, USA, 2011.

[4] Soyuj Kumar Sahoo, Jitendra Jain, S. R. Mahadeva Prasanna. "Chain code histogram based facial image feature extraction under degraded conditions". Advances in Computing and Communications, Communications in Computer and Information Science, 192:326-333, 2011.