Automated Data Extraction System For Handwritten Student Information Cards
A Manuscript
Submitted to
the Department of Computer Science
and the Faculty of the
University of Wisconsin-La Crosse
La Crosse, Wisconsin
by
Zhicheng Fu
in Partial Fulfillment of the
Requirements for the Degree of
Master of Software Engineering
May, 2013
Automated Data Extraction System For Handwritten Student Information Cards
By Zhicheng Fu
We recommend acceptance of this manuscript in partial fulfillment of this candidate's requirements for the degree of Master of Software Engineering in Computer Science. The candidate has completed the oral examination requirement of the capstone project for the degree.
Dr. XXXXXXXXXXXXXX    Date
Examination Committee Chairperson

Dr. XXXXXXXXXXXXXX    Date
Examination Committee Member

Dr. XXXXXXXXXXXXXX    Date
Examination Committee Member
Abstract
Fu, Zhicheng, Automated Data Extraction System For Handwritten Student Informa-
tion Cards, Master of Software Engineering, May 2013. Advisor: Kenny Hunt.
The University of Wisconsin-La Crosse visits regional high schools and, as part of its
recruitment effort, asks students to fill out a form that collects a student's contact
information, GPA, ACT score and academic interests. These cards are then hand-entered
into a central database for further recruitment efforts. This project describes the design
of a semi-automated system to extract information from these Student Information Cards.
The project addresses two central problems: extracting handwritten markings from the
surrounding form and recognizing the extracted handwritten characters. In the extraction
phase we mark feature points in a template image and perform image registration with re-
spect to these features. Image subtraction yields an approximate result that is later refined
via specialized ad-hoc filtering rules. The form fields are then fed into a custom character
recognition engine and are semantically checked against a dictionary to improve the overall
recognition rate.
Acknowledgements
I would like to express my sincere thanks to my project advisor, Dr. Kenny Hunt, for
initiating this project and providing support throughout. I also want to thank Jeremiah
Collins from the Admissions Office at the University of Wisconsin-La Crosse for supporting
this project. Finally, I would like to thank the Computer Science Department and the
University of Wisconsin-La Crosse for providing the computing environment for my project.
Contents

1 Introduction . . . 1
2 Requirement Gathering And Analysis . . . 3
   2.1 Gathering Process . . . 3
   2.2 Functional Requirements . . . 4
   2.3 Selection of Life Cycle Model . . . 6
   2.4 GUI Functional Requirements . . . 8
3 Design . . . 11
   3.1 Extraction Application . . . 11
   3.2 Recognition Library . . . 20
4 Implementation . . . 20
   4.1 Extracting Handwritten Information . . . 21
      4.1.1 Image Registration . . . 21
      4.1.2 Handwriting Extraction From Image . . . 25
   4.2 Description of Character Recognition . . . 26
5 Testing . . . 31
6 Conclusion And Future Enhancements . . . 32
List of Figures

1 Student Information Card . . . 2
2 Use Case Diagram for System User . . . 5
3 Iterative and Incremental Development Model . . . 7
4 Main User Interface . . . 9
5 Main User Interface . . . 9
6 Functionalities about 'File' menu . . . 10
7 Operations of processing images . . . 10
8 UML class diagram of the MainPanel class . . . 12
9 UML class diagram of the entity classes . . . 14
10 UML class diagram of the ImageRegistration class . . . 15
11 UML class diagram of the ImageSubtration class . . . 17
12 UML class diagram of the DigitRecognition class . . . 18
13 Key Components In The Template Image . . . 22
14 An Example Of Image Registration . . . 24
15 This shows how noise components are eliminated. The rectangle areas in Figure 15a are noise components to be eliminated. . . . 25
16 Chain Coding Example . . . 26
17 Zip code field and the most likely matches for each digit . . . 28
18 This shows an example where only 4 of the 5 digits are correctly identified . . . 28
19 Test Case 1 . . . 31
20 Test Case 2 . . . 32
21 Connected And Separated Digits In The Image . . . 33
22 A component is cut into several components after extraction of handwritten data . . . 33
List of Tables

1 Functional Requirement to Method Map . . . 12
2 Core Classes . . . 13
3 The Description of Entity Classes . . . 14
4 Description of ImageRegistration Class . . . 16
5 Description of ImageSubtraction Class . . . 17
6 Description of DigitRecognition Class . . . 19
7 The ranked list corresponding with Figure 17 . . . 30
8 The ranked list corresponding with Figure 18 . . . 30
GLOSSARY
Student Information Card
The Admissions Office at UW-L collects information regarding potential students by
visiting high school career fairs across the region and having students fill out information
cards. A student information card has fields for contact information (student name, address,
email, phone number), high school, academic information (GPA, SAT, ACT) and other data
relevant to the admissions process.
Incremental Model
The incremental model is a method of software development where the software is de-
signed, implemented and tested incrementally until the product is finished. The incremental
model is an evolution of the waterfall model, where the waterfall model is incrementally
applied [2].
Chain Coding
Chain coding is a technique for representing the contour of a component. A chain code
defines the contour of a component as each boundary pixel is traversed by describing the
direction of each next contour pixel.
Recognition Library
The recognition library is a third-party library for recognizing handwritten characters.
1 Introduction
The Admissions Office at UW-L collects information regarding potential students by
visiting high school career fairs across the region and having students fill out information
cards. Each card has fields for contact information (student name, address, email, phone
number), high school, academic information (GPA, SAT, ACT) and other data relevant to
the admissions process.
The Admissions Office presently collects thousands of these cards annually and must
manually enter all of the data into a database for later tracking and processing of the
students. Manual data entry is time consuming and error prone. The Admissions Office
recently indicated that hand-entering the data from a recent season of recruiting activities
occupied several interns and took three months.
This project seeks to automate the process of extracting the information on these cards.
The information will be extracted using a handwriting recognition system and automatically
(or semi-automatically) entered into a database. Figure 1 shows an example of such a card.
Figure 1: Student Information Card
2 Requirement Gathering And Analysis
2.1 Gathering Process
The Automated Extraction System for Student Card Project was conceived by Dr. Hunt
and Jeremiah Collins from the Admissions Office of the University of Wisconsin-La Crosse
before the formal beginning of this program. Jeremiah Collins served as the project's
sponsor; Dr. Hunt served as its supervisor.
The project initially included the development of a handwriting recognition engine.
However, the scope of the recognition engine was too large and imposed too much
complexity on the developer. To address this, the developer, with assistance from
Dr. Hunt, researched third-party software packages capable of performing character
recognition. A software package was chosen based on the required features, and the
scope of the program was therefore appropriately reduced by delegating the complex
character recognition to the third-party software.
The sponsor met with the developer several times to gather informal functional require-
ments of this program. These informal functional requirements helped to define the scope
of the program as well as capture the true nature and purpose of the application. During the
first of these informal meetings, the sponsor provided samples of real student information
cards and identified the principal data elements obtained from the student. Based on the in-
formation collected from the meetings, the requirement document version 1.0 was created,
in which the following fundamental requirements are listed:
• This program must provide a graphical user interface.
• The scanned image of the student information card should be full color and 300 dpi. This
scanning step is manual and is performed outside of the software system.
• The format of the scanned image is TIF, PNG or JPEG.
• These images are placed into a folder, and the folder is loaded into the program as
input.
• The information extracted from the scanned image must be stored in a text file. The
sequence of the field information must be arranged according to a standard that will
be provided by the sponsor.
• This program must run on the Windows platform with a JVM.
Overall, the project produced two requirement documents. Version 1.0 included the
detailed functional requirements described in Section 2.2. Version 2.0 added the selected
life cycle model, described in Section 2.3, as well as the GUI requirements described in
Section 2.4.
2.2 Functional Requirements
This program is a stand-alone desktop system. The program does not manage user
accounts, and hence there is only one role, the System User, supported by this system.
Figure 2 gives a Use Case diagram for the System User.
As shown in Figure 2, there are seven use cases in the diagram. Each use case
describes a functional requirement. These functional requirements are described as follows:
• The 'Load Images' function allows a user to load scanned images into the system.
The user is able to load single or multiple images. If an image loaded into the
system does not satisfy the standard described in Section 2.1, the system notifies
the user that an error has occurred.

Figure 2: Use Case Diagram for System User
• The ’View Images’ function allows users to view all the images that have been al-
ready loaded into the system.
• The ’Process Single Image’ function allows users to extract and recognize all of the
handwritten information from a single image.
• The ’Process All Images’ function allows users to extract and recognize all of the
handwritten information from every loaded image.
• The ’Store Information’ function allows users to store the extracted data into a database.
• The ’Modify Information’ function allows users to manually view and modify all the
results of the recognition.
• The ’Exit System’ function allows users to terminate the application.
2.3 Selection of Life Cycle Model
We analyzed the requirements in requirement document version 1.0 and identified the
risks listed below:
• Lack of detailed specification for the GUI.
• Lack of experience on handwriting recognition.
• Potential misunderstanding between sponsor and developer.
• Potential change of the student card format, which will result in potential modifica-
tion of the requirements.
These risks were roughly estimated based on the situations described by the sponsor
and developer. To alleviate them, we selected an adaptive software development model:
the incremental model, in which the software is designed, implemented and tested
incrementally until the product is finished. As shown in Figure 3 [3], the number of
increments is not fixed in advance.
As opposed to the waterfall model, the incremental model offers more flexibility to
fine-tune the current development direction and gradually satisfy the sponsor's
expectations as changes occur, without triggering a chain reaction through the
Requirements, Design, Implementation and Test phases. This is especially valuable when
some system characteristics are uncertain and requirements change often. Through the
iterative improvement in each increment, the development process can be refined accordingly.
As a result, the sponsor was able to experience and review partial functionality and then
give feedback that shaped the requirements for the next increment. Also, should the
sponsor want to change the standards of the student card or scanned image, system
development would be able to respond to these changes rapidly.
Figure 3: Iterative and Incremental Development Model
Each increment includes the completion of several functional requirements. The major
consideration in determining which functionalities to include in the earlier increments
was their importance and their contribution to the development of the entire system.
Functionalities concerning the handwritten extraction procedure and character recognition
therefore had a higher priority than others. The increments that occurred in this project
are listed below:
• Increment 1: Graphical User Interface functionalities related to user interaction.

• Increment 2: Handwritten extraction functionalities related to application interaction.

• Increment 3: Handwriting recognition functionalities related to application interaction.

• Increment 4: Enhancing application review and evaluation.

• Increment 5: System configuration.

• Increment 6: Writing and executing test cases.

• Increment 7: Enhancing interactivity of the Graphical User Interface.
2.4 GUI Functional Requirements
The GUI functional requirements closely mirror the functional requirements. In version
2.0, the GUI shown in Figure 4 was designed based on the sponsor's requirements.
However, the GUI requirements changed after further meetings with the sponsor, so the
main graphical user interface was revised to the one shown in Figure 5, which contains
two menu options, an image panel and a recognition panel.
Figure 4: Main User Interface
Figure 5: Main User Interface
The 'Image Panel' shows the images loaded into the system. The 'Recognition Panel'
reproduces the same form as the one on the student information card. Before the system
user loads images into the system, all the text fields in the recognition panel are
empty and the 'Store Information' button is disabled.
The 'File' menu has two options, 'Load File' and 'Exit', as shown in Figure 6.
Figure 6: Functionalities about ’File’ menu
The 'Operation' menu also has two options, 'Process All Images' and 'Process Single
Image', as shown in Figure 7.
Figure 7: Operations of processing images
After users load scanned images into the system, the 'Process Single Image' and
'Process All Images' options are enabled. After recognition, the results are shown in the
'Recognition Panel' and all the information in the text fields can be modified. Users can
manually view or modify the extracted data to verify that it matches the corresponding
scanned image, and then store the data in the database.
3 Design
In the first version of the design document, the architectural design of this application
was described using UML class diagram notation, but the class diagrams did not include
any methods or attributes. There were no details about the character recognition engine
because we planned to use a third-party recognition engine to assist this application.
Later, we reanalyzed the requirements and updated the class diagrams with attributes and
methods, as shown in Section 3.1, and determined which third-party recognition engine to
use for handwriting recognition. As a whole, the design of the application consists of
two pieces. The extraction application design describes how the main application is
organized and functions. The recognition library is a third-party library for recognizing
handwritten characters and was developed by Dr. Hunt.
3.1 Extraction Application
The extraction application implements all of the GUI Functional Requirements and is
solely responsible for interaction with the system user. The GUI communicates with the
MainPanel class to activate the underlying Functional Requirements.
The MainPanel class, shown in Figure 8, maps each functional requirement to a method.
Each method's name is nearly identical to the requirement name, making it easy to
identify. Table 1 shows the exact mapping of functional requirements to class methods.
Figure 8: UML class diagram of the MainPanel class.
Functional Requirement   Implementation Method Signature
Load Images              loadImagesActionPerformed(ActionEvent evt): void
Exist System             existSystemActionPerformed(ActionEvent evt): void
Process Single Image     processSingleImageActionPerformed(ActionEvent evt): void
Process All Images       processAllImagesActionPerformed(ActionEvent evt): void
Store Information        storeInformationActionPerformed(ActionEvent evt): void
View Images              doPreviousActionPerformed(ActionEvent evt): void
                         doNextActionPerformed(ActionEvent evt): void
Table 1: Functional Requirement to Method Map
The MainPanel class has private methods which perform some of the more complex
operations; for example, processSingleImageActionPerformed() calls many methods of
other classes. Within this application, three core classes are used for extraction and
recognition. The core classes are listed in Table 2.
Class Name         Class Function
ImageRegistration  Provides the methods for image rotation.
ImageSubtraction   Provides the methods for extracting handwritten information from the registered image.
DigitRecognition   Provides the methods for digit recognition.
Table 2: Core Classes
Before introducing the three core classes, we describe the entity classes used in this
application to model corresponding objects. All the entity classes are shown in Figure 9
and their descriptions are listed in Table 3. As shown in Figure 9, the entity classes
present only their attributes, but each attribute has set and get methods to access or
change it.
Figure 9: UML class diagram of the entity classes.
Class Name              Class Function
DoublePoint             A class to model a point whose coordinates are double precision.
Component               A class to model a connected component in the binary image.
Student                 A class to model a student based on the information entered on the student information card.
SingleDigitGuessedList  A class to model the guessed recognition list of a single digit.
State                   A class to model a state based on its zip code and city.
Table 3: The Description of Entity Classes
The ImageRegistration class provides the methods for image rotation. Figure 10 lists all
the methods in the ImageRegistration class. All the methods in this class are public;
some variables are private. We describe each method of the ImageRegistration class in
Table 4.
Figure 10: UML class diagram of the ImageRegistration class.
getBufferedImage(Image image)
    Convert the image to the BufferedImage type.

getBinaryImage(BufferedImage image, int thresholdValue)
    Based on the threshold value, convert the full-color image to a binary image.

setSelectedBoxedToStudent(Student student)
    Store the selected boxes' information to the related student.

getMarkedFeatures(BufferedImage src, BufferedImage dest)
    Find the marked features in the image and store their information.

showMarkedFeatures(BufferedImage src, BufferedImage dest, LinkedList<Regions> regs)
    Highlight the marked features.

setSourceDoublePoint(BufferedImage src)
    Get and store the information of all feature components in the template image.

getRegisterationPoints(BufferedImage src)
    Get and store the information of all feature components in the scanned image.

sortRegions(LinkedList<Regions> reg1, LinkedList<Regions> reg2)
    Reorder the sequence of the marked features in storage.

getCenterPoint(LinkedList<Point> points)
    Get the centroid of each feature component.

getTranslationXY(LinkedList<Point> points)
    Get the translation matrix between the template image and the scanned image.

getTranslationImage(BufferedImage src, DoublePoint temp)
    Get an image after translating the scanned image according to the translation matrix.

getAverageAngle(LinkedList<Point> points)
    Get the rotation angle between the template image and the translated image.

getTranslationImage(BufferedImage src, DoublePoint temp)
    Get a new image after rotating the translated image.

getRidOfBlackEgdes(BufferedImage src, BufferedImage dest)
    Eliminate the noise components at the edges of the image.

checkBounds(BufferedImage src, BufferedImage dest)
    Check the bounds of the image after image rotation.

getRegistrationFinalImage(BufferedImage src)
    Get the registered image.

createCompatibleDestImage(BufferedImage src, ColorModel destCM)
    Create a copy of the input image.
Table 4: Description of ImageRegistration Class
ImageSubtraction is a class providing the methods for extracting handwritten information
from the registered image. All the input images in these methods are binary images. In
addition, the binary template image serves as the baseline image for subtraction. All the
methods are listed in Figure 11 and described in Table 5.
Figure 11: UML class diagram of the ImageSubtration class.
getExtractionImage(BufferedImage sourceBinary, BufferedImage registerdImage)
    Use the registered image to subtract the sourceBinary image, producing a new image containing only handwritten information.

eraseNoiseFromImage(BufferedImage src, int w, int h)
    Eliminate the noise components whose height or width is less than the corresponding input value.
Table 5: Description of ImageSubtraction Class
DigitRecognition is a class providing the methods to connect to the third-party
recognition engine and obtain recognition results for the numeric fields. Figure 12 is
the UML class diagram of this class. We describe each method in Table 6.
Figure 12: UML class diagram of the DigitRecognition class.
getAllRankedZipCodeList(BufferedImage src)
    Takes a binary image containing only the zip code as input, then generates a ranked list of all possible results.

getAllMatchedDigits(BufferedImage src, int type)
    Connects to the third-party recognition engine, loads the input image into the recognizer, and gets a ranked list of possible digits for each individual digit in the image. The 'type' parameter decides which numeric field image is used as the input image.

getAllPossibleResults(LinkedList<SingleDigitGuessedList> t, int index, double confidence, String out, LinkedList<SingleDigitGuessedList> res, int type)
    Combines each possible digit from the ranked lists, generating all possible combinations.

getAllRankedGPAList(BufferedImage src)
    Takes a binary image containing only the GPA as input, then generates a ranked list of all possible results.

getAllRankedACTorSATList(BufferedImage src)
    Takes a binary image containing only the ACT or SAT score as input, then generates a ranked list of all possible results.

getAllRankedGraduationYears(BufferedImage src)
    Takes a binary image containing only the graduation year as input, then generates a ranked list of all possible results.

getAllRankedHomePhoneList(BufferedImage src)
    Takes a binary image containing only the home phone number as input, then generates a ranked list of all possible results.

getAllRankedCellPhoneList(BufferedImage src)
    Takes a binary image containing only the cell phone number as input, then generates a ranked list of all possible results.

getAllRankedBirthdayList(BufferedImage src)
    Takes a binary image containing only the birthday as input, then generates a ranked list of all possible results.

sortDigitsRankedList()
    Reorders the ranked list based on the confidence of each possible result.
Table 6: Description of DigitRecognition Class
3.2 Recognition Library
We use a neural network technique to recognize individual characters within the form.
The neural network is divided into a recognizer for numeric and mixed inputs. The digit
recognizer uses a fully-connected three layer topology consisting of 408 input nodes with
10 output nodes. A connected component is rasterized into a 20x20 grayscale images to
account for 400 of the features. The 8 remaining features are defined by the chain code
histogram. The network was trained on the NIST handwritten image database, one of the
ad-hoc standards of handwritten data. we feed the individual components of this field into
the digit recognition engine. The neural network generates a ranked list of likely matches
for any particular component that is fed into the network. We take the rankings for each
individual digit and generate a short-list of the most likely zip codes that can be constructed
from the individual digits.
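The 400 + 8 = 408 feature layout described above can be sketched in Java. This is a minimal illustration, assuming the raster values are already grayscale intensities in [0, 1] and that the histogram bins are normalized by the code length; the class and method names are hypothetical, not the project's actual API.

```java
public class FeatureVector {
    // Builds a 408-element input vector: 400 grayscale values from a
    // 20x20 raster of the component, followed by an 8-bin chain code
    // histogram. Names and normalization are illustrative assumptions.
    public static double[] build(double[][] raster20x20, int[] chainCode) {
        double[] features = new double[408];
        // First 400 features: the rasterized component, row by row.
        for (int r = 0; r < 20; r++) {
            for (int c = 0; c < 20; c++) {
                features[r * 20 + c] = raster20x20[r][c];
            }
        }
        // Last 8 features: chain code histogram, one bin per direction.
        int[] hist = new int[8];
        for (int d : chainCode) {
            hist[d]++;
        }
        for (int i = 0; i < 8; i++) {
            features[400 + i] = chainCode.length == 0
                    ? 0.0
                    : (double) hist[i] / chainCode.length;
        }
        return features;
    }
}
```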
4 Implementation
This project addresses two central problems: extracting handwritten markings from
the surrounding form and recognizing the extracted handwritten characters. Variations in
the scanned forms introduce complexity into the task of extracting the handwritten data.
Scanning thousands of card images will result in images where the location and
orientation of the fields differ from card to card. Variations in handwriting make
recognition even more difficult, especially when the handwriting is so poor that even
human eyes cannot recognize the characters.
In the extraction phase of this project, we mark feature points in a template image
and perform image registration with respect to these features. Image subtraction yields an
approximate result that is later refined via specialized ad-hoc filtering rules. As to
character recognition, we built our own character recognition engine. To better recognize
digits and letters separately, we created one training data set for letters and another
for digits; we also created our own dictionary to improve the recognition rate. In the
end, we feed the handwritten characters into this engine to obtain the recognition results.
4.1 Extracting Handwritten Information
4.1.1 Image Registration
Scanning thousands of card images will result in images where the location and ori-
entation of the fields in each card are varied. Image registration is used to normalize the
location and orientation of the fields in the scanned image. Image registration is performed
as described below.
1. Assume that we have a template image T. This image, an empty form with zero rotation,
forms the baseline image for the system. We need to identify key components of T to use
for image registration. Key components are components in T that are used to identify the
rotation of the image. To find them, we analyzed the baseline form for representative
components that are easy to identify no matter how the image is rotated. The key
components are identified in advance, as shown by the highlighted elements in Figure 13.
2. Let Kc be the sequence of key components, where each key component has a location,
height, width and centroid. Let Tcs be the sequence of all centroids of the key
components. Let Tcc be the centroid of the centroids.

Figure 13: Key Components In The Template Image
3. Let S be the full-color source image. Let B be the binarized image of S. Identify
the key components in B corresponding with the key components of T. Compute the
centroids of key features in B. Call these centroids Bcs.
4. Compute the centroid of the Bcs and call it Bcc. Let TRANS be a translation matrix
that takes Bcc and maps it to Tcc. Translate all points in Bcs by this amount.
5. For each centroid in Bcs, compute the amount by which that point must be rotated
(around Tcc) to align with the corresponding Tcs point. Compute the average amount of
rotation. Let ROT be a rotation matrix that performs this rotation.
6. Take the source image S and apply TRANS and then ROT.
7. Binarize this rotated source image using an adaptive threshold value. Let Sr be the
binarized image.
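The geometry in steps 4 and 5 (translating so the centroids of centroids coincide, then averaging per-point rotation angles about Tcc) can be sketched as follows. This is a minimal illustration with assumed names, not the manuscript's actual ImageRegistration code.

```java
import java.awt.geom.Point2D;
import java.util.List;

public class RegistrationMath {
    // Centroid of a list of points (used for Tcc and Bcc, the
    // "centroid of centroids" in steps 2 and 4).
    public static Point2D.Double centroid(List<Point2D.Double> pts) {
        double sx = 0, sy = 0;
        for (Point2D.Double p : pts) {
            sx += p.x;
            sy += p.y;
        }
        return new Point2D.Double(sx / pts.size(), sy / pts.size());
    }

    // Average angle (radians) by which the translated scanned centroids
    // must rotate about tcc to align with the template centroids (step 5).
    // tcs and bcsTranslated are assumed to be in corresponding order.
    public static double averageRotation(List<Point2D.Double> tcs,
                                         List<Point2D.Double> bcsTranslated,
                                         Point2D.Double tcc) {
        double sum = 0;
        for (int i = 0; i < tcs.size(); i++) {
            double aT = Math.atan2(tcs.get(i).y - tcc.y, tcs.get(i).x - tcc.x);
            double aB = Math.atan2(bcsTranslated.get(i).y - tcc.y,
                                   bcsTranslated.get(i).x - tcc.x);
            double d = aT - aB;
            // Wrap into (-pi, pi] so angles near the branch cut average sanely.
            while (d > Math.PI) d -= 2 * Math.PI;
            while (d <= -Math.PI) d += 2 * Math.PI;
            sum += d;
        }
        return sum / tcs.size();
    }
}
```

In the actual system the resulting translation and angle would feed a transform (e.g. `java.awt.geom.AffineTransform`) applied to the source image S in step 6.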
Figure 14 shows the process of image registration. The baseline image of the system
is shown in (a); the scanned image of the student card is shown in (b); the thresholded
version of the scanned image is shown in (c); finally, the rotated image is shown in (d).
(a) Image T (b) Image S
(c) Image B (d) Image Sr
Figure 14: An Example Of Image Registration
4.1.2 Handwriting Extraction From Image
In this phase, we take the registered form and eliminate the pixels related to the form
while keeping the pixels related to the handwritten data. Once the handwritten data is
identified, we feed the fields into the character recognition engine. The handwritten
data is obtained as follows.
1. Let T be the baseline form image.
2. Let image Sr be the registered image. Let Sb = Sr - T.
3. Eliminate noise components in Sb whose width or height is less than the predefined
threshold values and get another image Sbn.
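Steps 2 and 3 can be sketched with boolean arrays standing in for binary images. This is a minimal illustration under assumed names, not the project's ImageSubtraction implementation; it erases a component when either its bounding-box width or height falls below the given thresholds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HandwritingExtraction {
    // Step 2: Sb = Sr - T. A pixel survives only if it is foreground
    // in the registered image Sr but background in the template T.
    public static boolean[][] subtract(boolean[][] sr, boolean[][] t) {
        int h = sr.length, w = sr[0].length;
        boolean[][] sb = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                sb[y][x] = sr[y][x] && !t[y][x];
        return sb;
    }

    // Step 3: erase 8-connected components whose bounding box is
    // narrower than minW or shorter than minH (treated as noise).
    public static void eraseNoise(boolean[][] img, int minW, int minH) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                if (!img[y][x] || seen[y][x]) continue;
                // Flood fill to collect one component and its bounds.
                List<int[]> comp = new ArrayList<>();
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{y, x});
                seen[y][x] = true;
                int minY = y, maxY = y, minX = x, maxX = x;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    comp.add(p);
                    minY = Math.min(minY, p[0]); maxY = Math.max(maxY, p[0]);
                    minX = Math.min(minX, p[1]); maxX = Math.max(maxX, p[1]);
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++) {
                            int ny = p[0] + dy, nx = p[1] + dx;
                            if (ny >= 0 && ny < h && nx >= 0 && nx < w
                                    && img[ny][nx] && !seen[ny][nx]) {
                                seen[ny][nx] = true;
                                stack.push(new int[]{ny, nx});
                            }
                        }
                }
                if (maxX - minX + 1 < minW || maxY - minY + 1 < minH)
                    for (int[] p : comp) img[p[0]][p[1]] = false;
            }
    }
}
```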
(a) Image Sb (b) Image Sbn
Figure 15: This shows how noise components are eliminated. The rectangle areas in Figure15a are noise components to be eliminated.
4.2 Description of Character Recognition
We use a neural network technique to recognize individual characters within the form.
The digit recognizer uses a fully-connected three-layer topology consisting of 408 input
nodes and 10 output nodes. A connected component is rasterized into a 20x20 grayscale
image to account for 400 of the features. The 8 remaining features are defined by the
chain code histogram.
Chain coding is a technique for representing the contour of a component. A chain code
defines the contour of a component as each boundary pixel is traversed by describing the
direction of each next contour pixel [1]. Under chain coding, a component is defined by
the location of a foreground pixel lying on the boundary of the component. We will refer
to this pixel as the starting point Ps. A list of directions, as shown in Figure 16a,
defines how to traverse each of the boundary pixels of the component, beginning and
ending with Ps.
Figure 16b gives an example of how chain coding could be used to represent a component.
The start point Ps is given as (0, 0) and the 8-connected code is given by {011033446666}.
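A chain code and start point fully determine the boundary path. The sketch below assumes a standard 8-direction convention (0 = east, numbered counterclockwise), which may differ from the exact convention in Figure 16a; under this assumption, walking the example code {011033446666} from Ps returns to Ps, as a closed contour must.

```java
import java.awt.Point;

public class ChainCode {
    // 8-connected direction offsets, 0 = east, numbered counterclockwise.
    // (Assumed convention; Figure 16a defines the one actually used.)
    private static final int[] DX = {1, 1, 0, -1, -1, -1, 0, 1};
    private static final int[] DY = {0, 1, 1, 1, 0, -1, -1, -1};

    // Reconstructs the boundary path from a start point and chain code.
    // Returns code.length + 1 points; a closed contour ends where it began.
    public static Point[] walk(Point start, int[] code) {
        Point[] path = new Point[code.length + 1];
        path[0] = new Point(start);
        for (int i = 0; i < code.length; i++) {
            Point prev = path[i];
            path[i + 1] = new Point(prev.x + DX[code[i]], prev.y + DY[code[i]]);
        }
        return path;
    }
}
```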
(a) 8-connected chain code [1] (b) Chain coding example [1]
Figure 16: Chain Coding Example
The chain code histogram (CCH) is commonly used as a feature extraction technique
for character recognition [4]. Similar shapes have similar CCH distributions, and the
directional information captured by the CCH is the key to characterizing a shape or
pattern. Let CC be a chain code. The histogram h of CC is defined as
hi = Σ_{j ∈ CC} count(i, j),   i ∈ [0, 7]

count(i, j) = 1 if i = j, 0 if i ≠ j
The CCH of a component can thus be computed in a single counting pass over its chain
code. For the example shown in Figure 16b, the CCH of the gray component is as follows.
i    0  1  2  3  4  5  6  7
hi   2  2  0  2  2  0  4  0
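Computing the CCH is then a single counting pass over the chain code, as in this minimal sketch (illustrative class name). Applied to the code {011033446666} from Figure 16b, it reproduces the table above.

```java
public class CchHistogram {

    // h[i] counts how often direction i occurs in the chain code,
    // i ranging over the eight 8-connected directions [0, 7].
    public static int[] histogram(String chainCode) {
        int[] h = new int[8];
        for (char c : chainCode.toCharArray())
            h[c - '0']++;
        return h;
    }
}
```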
Since similar shapes have similar CCH distributions, we create thousands of samples of
handwritten digits and use these samples as training data for the recognition engine.
Since numeric fields consist purely of numeric data, we feed the individual components
of such a field into the digit recognition engine. For each character in the field image,
the engine generates a list of possible matching digits together with a similarity for each.
Similarities lie in the range [0, 1], where a higher similarity means a better match.
Two handwritten zip code fields are shown in Figure 17 and Figure 18. Each digit is
extracted from the zip code field and fed into the digit recognition engine, and a ranked
list of estimates is generated for each digit. Each digit of Figure 17 is identified
correctly, while only 4 of the 5 digits are recognized correctly in Figure 18.
(a) Zip Code Image
0.654 : [5]  0.560 : [8]  0.544 : [4]  0.534 : [2]  0.527 : [3]  0.526 : [6]  0.524 : [9]  0.521 : [7]
0.564 : [2]  0.546 : [5]  0.523 : [7]  0.512 : [3]  0.511 : [4]  0.508 : [6]  0.505 : [8]  0.499 : [0]
0.665 : [1]  0.598 : [8]  0.564 : [4]  0.551 : [2]  0.538 : [7]  0.533 : [5]  0.524 : [3]  0.513 : [9]
0.632 : [6]  0.623 : [0]  0.555 : [5]  0.491 : [2]  0.489 : [3]  0.486 : [8]  0.463 : [4]  0.433 : [9]
0.565 : [3]  0.532 : [7]  0.513 : [5]  0.502 : [9]  0.487 : [2]  0.480 : [8]  0.475 : [4]  0.458 : [1]
Figure 17: Zip code field and the most likely matches for each digit.
(a) Zip Code Image
0.635 : [5]  0.562 : [3]  0.547 : [0]  0.532 : [6]  0.528 : [8]  0.501 : [9]  0.500 : [2]  0.483 : [4]
0.572 : [8]  0.563 : [1]  0.561 : [7]  0.547 : [5]  0.539 : [9]  0.533 : [3]  0.532 : [2]  0.476 : [4]
0.657 : [0]  0.523 : [6]  0.500 : [3]  0.482 : [5]  0.471 : [2]  0.469 : [8]  0.464 : [9]  0.443 : [4]
0.716 : [1]  0.595 : [8]  0.537 : [2]  0.519 : [5]  0.509 : [4]  0.493 : [6]  0.488 : [3]  0.473 : [9]
0.633 : [4]  0.539 : [9]  0.520 : [7]  0.512 : [8]  0.478 : [2]  0.477 : [6]  0.474 : [5]  0.469 : [1]
Figure 18: This shows an example where only 4 of the 5 digits are correctly identified.
Since we have semantic knowledge about the zip code field, we can use this knowledge
to increase our confidence and accuracy. Most of our student recruitment efforts involve
Wisconsin and its neighboring states, so we construct a database of zip codes from these
regions and search it for the most likely match. Using the same idea, we also construct
databases for the birthday, GPA, graduation year, and ACT/SAT score fields. Algorithm 1
shows how the database entries are ranked by summing the per-digit confidences.
Algorithm 1 Compute ranked list of database
Let DB be a list of all semantic data in the database
for each semantic data Z in DB do
    Sum ← 0
    for each digit D in Z do
        Sum ← Sum + confidence of D from the digit recognition list
    end for
    Z.confidence ← Sum
end for
Sort DB by the total confidence of the data
return DB
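Algorithm 1 amounts to a dictionary scan with per-digit confidence lookups. The Java sketch below is illustrative rather than the project's actual code; in particular, representing each digit position as a map from candidate digit to similarity is an assumed simplification, with absent candidates contributing zero.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SemanticRanker {

    // conf.get(p) maps a candidate digit to its recognizer similarity at
    // position p; digits missing from the map contribute 0 to the sum.
    public static List<Map.Entry<String, Double>> rank(
            List<String> dictionary, List<Map<Character, Double>> conf) {
        Map<String, Double> scored = new HashMap<>();
        for (String entry : dictionary) {
            double sum = 0;
            for (int p = 0; p < entry.length(); p++)
                sum += conf.get(p).getOrDefault(entry.charAt(p), 0.0);
            scored.put(entry, sum);
        }
        // highest total confidence first
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scored.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```

Fed the per-digit similarities of Figure 18 and a small three-entry dictionary, this sketch reproduces the top three rows of Table 8.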
Using the semantic dictionary, the ranked list of zip codes corresponding to Figure 17 is
shown in Table 7, and the results for Figure 18 are shown in Table 8.
Guessed Result                 Total Confidence
52163, Protivin, IA.           3.081
55103, Saint Paul, MN.         3.053
52165, Ridgeway, IA.           3.029
55107, Saint Paul, MN.         3.020
53103, Big Bend, WI.           3.019
52169, Wadena, IA.             3.018
50163, Melcher, IA.            3.016
55165, Saint Paul, MN.         3.011
50103, Garden Grove, IA.       3.006
52803, Davenport, IA.          3.004

Table 7: The ranked list corresponding to Figure 17
Guessed Result                 Total Confidence
51014, Cleghorn, IA.           3.204
55014, Circle Pines, MN.       3.188
53014, Chilton, WI.            3.174
54014, Diamond Bluff, WI.      3.117
51019, Danbury, IA.            3.110
61014, Chadwick, IL.           3.102
55019, Dundas, MN.             3.094
51018, Cushing, IA.            3.083
53019, Eden, WI.               3.080
55017, Dalbo, MN.              3.074

Table 8: The ranked list corresponding to Figure 18
As shown in Figure 18, only 4 of the 5 digits are correctly identified, which would yield
invalid zip codes in the raw ranked list of possible zip codes. Using the semantic
dictionary, we obtain a more precise ranked list of valid zip codes. From Figure 18, the
best raw match is '58014', but this candidate must be discarded because '58014' does not
exist in the dictionary. We therefore conclude that using the semantic dictionary improves
recognition accuracy.
5 Testing
In iterative and incremental development, testing is conducted during each increment
after its implementation phase, so after implementing the functional requirements we
began to test all of the functions. Functional requirements such as LoadImage and
ExistSystem are tested by manual observation, but the extraction functionality is not as
easily tested. To test it properly, we need thousands of actual forms from the Admission
Office. We do not presently have access to actual forms due to privacy issues; instead,
we have collected a few dozen forms filled out with simulated information for testing the
extraction functionality. We cannot determine the accuracy of the system until we have
access to an image set of actual student information cards. Figure 19 and Figure 20 show
two test examples:
Figure 19: Test Case 1
Figure 20: Test Case 2
6 Conclusion And Future Enhancements
With this program, we can successfully extract handwritten information from the scanned
image, and we can recognize numeric fields using the third-party recognition library
provided by Dr. Hunt. In addition, the program can recognize the checkboxes selected by
students, such as the gender field, with high accuracy.
Before any new features are added to the main application, the accuracy testing of numeric
field recognition needs to be completed, which requires thousands of actual forms from the
Admission Office.
As described in Section 4.2, the digit recognizer accepts a numeric field image as input
and segments the image into individual digits. For each digit, the recognizer generates a
ranked list of guesses. In this program, we assume that the digits in the image are not
connected, as shown in Figure 21b. But as shown in Figure 21a, the '6' and '0' are
connected, so if we input Figure 21a into the digit recognizer, it will treat '60' as a
single digit, which is an obvious error. In the future, a function will be added to the
program that intelligently segments such components into separate characters for later
recognition.
(a) Connected Digits (b) Separated Digits
Figure 21: Connected And Separated Digits In The Image
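One possible approach to this segmentation enhancement, sketched below under the assumption that touching digits meet at a thin "neck", is to cut the component at the interior column whose vertical projection is smallest. This is a hypothetical heuristic, not the planned implementation.

```java
public class DigitSplitter {

    // Count foreground pixels in each column (the vertical projection).
    public static int[] verticalProjection(boolean[][] img) {
        int h = img.length, w = img[0].length;
        int[] profile = new int[w];
        for (int x = 0; x < w; x++)
            for (int y = 0; y < h; y++)
                if (img[y][x]) profile[x]++;
        return profile;
    }

    // Propose a cut at the interior column with the thinnest projection,
    // a crude heuristic for the neck where two digits touch.
    public static int proposeCut(boolean[][] img) {
        int[] p = verticalProjection(img);
        int best = 1;
        for (int x = 2; x < p.length - 1; x++)
            if (p[x] < p[best]) best = x;
        return best;
    }
}
```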
As shown in Figure 22b, when we extracted the handwritten data of the email field shown
in Figure 22a, the "@" was cut into two parts. A future enhancement would be to merge
parts that belong to the same component in the registered image back together after
extracting the handwritten data.
(a) Image a (b) Image b
Figure 22: A component is cut into several components after extraction of handwritten data
Recognition of mixed fields that contain digits, symbols, or letters will also be added
to the main application. Essentially, the application should recognize not only numeric
fields but also mixed fields, such as the name field.
We also need to implement the function that stores the extracted data in a database. To
implement this, we need to confirm with our sponsors what data format they want stored in
the database.
References

[1] Kenny Hunt. The Art of Image Processing with Java. CRC Press, 2010.

[2] Roger Pressman. Software Engineering: A Practitioner's Approach. 2:41-42, 2010.

[3] Wen-Kai Shen. "An online scholarship application system". Capstone Project, Department of Computer Science, University of Wisconsin-La Crosse, WI, USA, 2011.

[4] Soyuj Kumar Sahoo, Jitendra Jain, S. R. Mahadeva Prasanna. "Chain code histogram based facial image feature extraction under degraded conditions". Advances in Computing and Communications, Communications in Computer and Information Science, 192:326-333, 2011.