1 introduction of optical character recognition patterns

17
1 INTRODUCTION OF OPTICAL CHARACTER RECOGNITION PATTERNS Introduction Statement of the Problem Objective of the Study Advantages of the Study Scope of the Study Limitation of the Study Literature Review of the Study Glossary of the Terms Planning of the Remaining Chapters References

Upload: vanduong

Post on 16-Dec-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

1 INTRODUCTION OF OPTICAL CHARACTER

RECOGNITION PATTERNS

Introduction

Statement of the Problem

Objective of the Study

Advantages of the Study

Scope of the Study

Limitation of the Study

Literature Review of the Study

Glossary of the Terms

Planning of the Remaining Chapters

References

Chapter 1

1

CHAPTER 1

INTRODUCTION OF OPTICAL CHARACTER

RECOGNITION PATTERNS

1.1 Introduction

In corporations, institutes and offices overwhelming volume of paper-based

data challenges their ability to manage documents and records. Most of the

fields are become computerized because Computers can work faster and more

efficiently than human beings. Paperwork is reduced day by day and computer

system took place of it.

But what about already stored papers? All the early paper work is now become

paperless work. To convert any handwritten, printed or historical documents

(e.g. Book) into text document, either we have to type the whole document or

scan the document. If document size is very large, it becomes cumbersome

and time consuming to type the whole document and there may be the

possibilities of typing errors. Another option is scanning of document.

Whenever document size is very large instead of typing, scanning of document

is preferable. Forms can be scanned through scanner. Scanning is merely an

image capture of the original document, so it can’t be edited or searched

through in any way and it also occupies more space because it is stored as an

image. To edit or search from the document, document must be converted to

doc (text) format.

Optical Character Recognition (OCR) is a process of converting scanned

document into text document so it can be easily edited if needed and becomes

searchable. OCR is the mechanical or electronic translation of images of

handwritten or printed text into machine-editable text. Recognition engine of

the OCR system interpret the scanned images and turn images of handwritten

or printed characters into ASCII data (Machine readable characters). It

occupies very less space than the image of the document. The document which

Chapter 1

2

is converted to text document using OCR occupies 2.35KB while the same

document converted to an image occupies 5.38MB. So whenever document

size is very large instead of scanning, OCR is preferable.

1.2 Statement of the Problem

Title of the present study is: “A study of optical character patterns

identified by the different OCR algorithms and generation of a model for

the elimination of deficiency in identifying the patterns of optical

character”.

The prime theme of the study is to study optical character recognition

algorithms and tools, study character patterns by analyzing structural and

statistical features of characters and find deficiency in identifying the optical

character patterns and develop a model to eliminate it. Present study focuses

on offline isolated handwritten English capital characters and digits. English

language consists of different 26 capital alphabets A to Z, small alphabets a to

z and 10 digits 0 to 9.

1.3 Objectives of the Study

The objective of this research is to study character recognition patterns,

character recognition algorithms and tools and eliminate deficiency in

identifying character patterns.

To fulfill this objective some sub objectives were formed which are as

following.

1.3.1 Study different optical character recognition algorithms.

1.3.2 Study online and offline tools of the character recognition.

1.3.3 Identify character recognition patterns by Study and analysis of

English capital characters for structural and statistical features.

1.3.4 Study and analysis of digits 0 to 9 for structural and statistical features.

Chapter 1

3

1.3.5 Collect handwritten data samples and recognize each character and

digit using different character recognition tools.

1.3.6 Identify the common deficiency in most of the character recognition

software/tools by calculating the recognition rate of each character and

digit and find out the characters and digits whose recognition rate is

very less.

1.3.7 Designing and development of the model to eliminate the common

deficiency identified.

1.3.8 Develop the algorithm to implement the above model.

1.3.9 Testing and Performance evaluation by analyzing results of model.

1.4 Advantages of the study

Advantages of the study can be identified as following:

1.4.1 Scanned document stored as an image file which doesn’t allow editing

and searching. The proposed model will convert a document into

digital files so that document becomes searchable and editable.

1.4.2 It recognizes handwritten characters available on bank cheques,

government records, passport, invoice, bank statement and receipt etc.

1.4.3 Data entry errors can be reduced.

1.4.4 Paper work reduces so it becomes easy to access and manage.

1.4.5 It occupies less space as compared to an image file so it saves

computer space.

1.4.6 The model benefiting users who are unfamiliar with optical character

recognition or who might become confused with multiple applications.

1.4.7 It is possible to combine this model with speech recognition software

and it helps to generate speech output (e.g. Kurzweil 1000 and Open

Book) and it becomes very useful for blind people.

Chapter 1

4

1.5 Scope of the study

The study is concerned with character recognition, for conversion of the

scanned image file into editable and searchable text file. The study is

distributed among the following different software components, which are

developed using Java and OpenCV.

Proposed study focuses on analyzing and recognizing isolated offline English

capital characters and digits. For experimental study, researcher has taken all

capital characters and digits but for the proposed model characters G, N, O, Y

and digit 4 are selected because its recognition rate is very less in selected

character recognition tools.

1.6 Limitation of the study

1.6.1 The model supports only English language.

1.6.2 It identifies only English capital alphabets and digit, it can’t identify

small letters.

1.6.3 Accuracy of model is directly dependent on the quality of input

document.

1.6.4 The model is able to recognize only 4 English capital characters.

1.6.5 The model is able to recognize only 1 digit.

1.6.6 The model can identify only individual character. It can’t identify a

word.

1.6.7 It does not contain any word dictionary so spell check is not possible.

1.6.8 If input document is scanned below 300 dpi then it may affect the

quality and accuracy of OCR results.

Chapter 1

5

1.7 Literature Review of the study

1.7.1 A Glance over Related Literature

Character recognition is an art of detecting segmenting and identifying

characters from image. More precisely Character recognition is process of

detecting and recognizing characters from input image and converts it into

ASCII or other equivalent machine editable form [1], [2], [3].

OCR system is most suitable for the applications like data entry for business

documents (e.g. check, passport etc.), automatic number plate recognition,

multi choice examination; almost all kind of form processing system.

� History

In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by

Paul W. Handel who obtained a US patent on OCR in USA in 1933 (U.S.

Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his

method (U.S. Patent 2,026,329). Tauschek's machine was a mechanical device

that used templates and a photo detector [4].

In 1949 RCA engineers worked on the first primitive computer-type OCR to

help blind people for the US Veterans Administration, but instead of

converting the printed characters to machine language, their device converted

it to machine language and then spoke the letters. It proved far too expensive

and was not pursued after testing [4].

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security

Agency in the United States, addressed the problem of converting printed

messages into machine language for computer processing and built a machine

to do this, reported in the Washington Daily News on 27 April 1951 and in the

New York Times on 26 December 1953 after his U.S. Patent 2,663,758 was

issued. Shepard then founded Intelligent Machines Research Corporation

Chapter 1

6

(IMR), which went on to deliver the world's first several OCR systems used in

commercial operation [4].

In 1955, the first commercial system was installed at the Reader's Digest. The

second system was sold to the Standard Oil Company for reading credit card

imprints for billing purposes. Other systems sold by IMR during the late 1950s

included a bill stub reader to the Ohio Bell Telephone Company and a page

scanner to the United States Air Force for reading and transmitting by teletype

typewritten messages. IBM and others were later licensed on Shepard's OCR

patents [4].

In about 1965, Reader's Digest and RCA collaborated to build an OCR

Document reader designed to digitize the serial numbers on Reader's Digest

coupons returned from advertisements. The fonts used on the documents were

printed by an RCA Drum printer using the OCR-A font. The reader was

connected directly to an RCA 301 computer (one of the first solid state

computers). This reader was followed by a specialized document reader

installed at TWA where the reader processed Airline Ticket stock. The readers

processed documents at a rate of 1,500 documents per minute, and checked

each document, rejecting those it was not able to process correctly. The

product became part of the RCA product line as a reader designed to process

"Turn around Documents" such as those utility and insurance bills returned

with payments [4].

The United States Postal Service has been using OCR machines to sort mail

since 1965 based on technology devised primarily by the prolific inventor

Jacob Rabinow. The first use of OCR in Europe was by the British General

Post Office (GPO). In 1965 it began planning an entire banking system, the

National Giro, using OCR technology, a process that revolutionized bill

payment systems in the UK. Canada Post has been using OCR systems since

1971. OCR systems read the name and address of the addressee at the first

mechanized sorting center, and print a routing bar code on the envelope based

Chapter 1

7

on the postal code. To avoid confusion with the human-readable address field

which can be located anywhere on the letter, special ink (orange in visible

light) is used that is clearly visible under ultraviolet light. Envelopes may then

be processed with equipment based on simple barcode readers [4].

In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc.

and led development of the first omni-font optical character recognition

system — a computer program capable of recognizing text printed in any

normal font. He decided that the best application of this technology would be

to create a reading machine for the blind, which would allow blind people to

have a computer read text to them out loud. This device required the invention

of two enabling technologies — the CCD flatbed scanner and the text-to-

speech synthesizer. On January 13, 1976 the successful finished product was

unveiled during a widely-reported news conference headed by Kurzweil and

the leaders of the National Federation of the Blind [4].

In 1978 Kurzweil Computer Products began selling a commercial version of

the optical character recognition computer program. LexisNexis was one of

the first customers, and bought the program to upload paper legal and news

documents onto its nascent online databases. Two years later, Kurzweil sold

his company to Xerox, which had an interest in further commercializing

paper-to-computer text conversion. Kurzweil Computer Products became a

subsidiary of Xerox known as Scan soft, now Nuance Communications [4].

1992-1996 Commissioned by the U.S. Department of Energy (DOE),

Information Science Research Institute (ISRI) conducted the most

authoritative of the Annual Test of OCR Accuracy for 5 consecutive years in

the mid-90s. Information Science Research Institute (ISRI) is a research and

development unit of University of Nevada, Las Vegas. ISRI was established in

1990 with funding from the U.S. Department of Energy. Its mission is to foster

the improvement of automated technologies for understanding machine

printed documents [4].

Chapter 1

8

1.7.2 Process of OCR

Any character recognition system performs following steps, i.e. Image

acquisition, Pre-processing, Segmentation, Feature extraction and

Classification.

1.7.2.1 Image Acquisition

Images of OCR system might be acquired by scanning document or by

capturing photograph of document.

1.7.2.2 Pre-processing

Pre-processing step involves RGB to grayscale image conversion,

thresholding, noise removal etc. In RGB to grayscale image conversion, RGB

image is converted to gray image. Thresholding converts gray image to black

and white image. Proper filter like median filter, Gaussian filter etc may be

applied to remove noise from document [5]. Input document may be resized if

Figure 1.1 Optical Character Recognition Process

Chapter 1

9

needed. Proper image pre-processing has a big impact on the quality of the

optical character recognition process.

1.7.2.3 Segmentation

Image segmentation is the process of dividing an image into multiple parts.

This is typically used to identify objects or other relevant information in

digital images [6]. Generally document is processed in hierarchical way. At

first level lines are segmented using row histogram. From each row, words are

extracted using column histogram and finally characters are extracted from

words [5].

1.7.2.4 Feature Extraction

The heart of any character recognition system is the formation of feature

vector to be used in the recognition stage. Feature extraction can be

considered as finding a set of parameters (features) that define the shape of the

underlying character as precisely and uniquely as possible. The term feature

selection refers to algorithms that select the best subset of the input feature set.

Methods that create new features based on transformations, or combination of

original features are called feature extraction algorithms [7].

1.7.2.5 Classification

The classification stage is the decision making part of a recognition system

and it uses the features extracted in the previous stage [8]. Classifiers compare

the input feature with stored pattern and find out the best matching class for

input. In simple terms, it is this part of the OCR which finally recognizes

individual characters and outputs them in machine editable form [7].

1.7.3 Information and Communication Technology (ICT) for data

collection

Collecting data can be a time-consuming, labor intensive process. So

businesses are constantly looking for ways in which data capture and

analysis can be automated. However, manual data collection is still

Chapter 1

10

common for many business processes. This revision note summarizes the

main kinds of data collection you need to be aware of.

1.7.3.1 Manual Input Methods

I. Keyboard

Keyboard is a familiar input device. It is used to input data into personal

computer applications such as databases and spreadsheets.

II. Touch-sensitive screens

Developed to allow computer monitors to be used as an input device.

Selections are made by users touching areas of a screen. Sensors built into the

screen surround, detect what has been touched. These screens are

increasingly used to help external customers input transactional data -

e.g. Buying transport tickets, paying for car parking or requesting

information.

1.7.3.2 Automated Input Methods

I. Magnetic ink character recognition (MICR)

MICR involves the recognition by a machine of specially-formatted characters

printed in magnetic ink. This is an expensive method to set up and use - but it

is accurate and fast. A good example is the use of magnetic ink characters

on the bottom of each cheque in a cheque book.

II. Optical mark reading (OMR)

Optical Mark Reading (OMR) uses paper based forms which users simply

mark (using a dash) to answer a question. OMR needs no special equipment to

mark a form other than a pen/pencil. Data can be processed very quickly and

with very low error rates. An OMR scanner then processes the forms directly

into the required database. An example you are probably familiar with is the

National Lottery entry forms, or answer sheets for those dreaded multiple

choice exam papers.

Chapter 1

11

III. Intelligent Character Recognition (ICR)

Intelligent Character Recognition (ICR) again uses paper based forms which

responders can enter hand printed text such as names, dates etc. as well as

dash marks with no special equipment needed other than a pen/pencil. An

ICR scanner then processes the forms, which are then verified and stored the

required database.

IV. Bar coding and EPOS

A very important kind of data collection method - in widespread use is a Bar

Code. Bar codes are made up of rectangular bars and spaces in varying widths.

Read optically, these enable computer software to identify products and items

automatically. Numbers or letters are represented by the width and position of

each code's bars and spaces, forming a unique 'tag'. Bar codes are printed on

individual labels, packaging or documents. When the coded item is handled,

the bar code is scanned and the information gained is fed into a computer.

Codes are also often used to track and count items.

V. Electronic Point of Sale (EPOS) technology

It is familiar in super markets and many retail operations. Not only saving

time at checkout, EPOS cuts management costs by providing an

automatic record of what is selling and stock requirements. Customers

receive an accurate record of prices and items purchased. Producers use bar

coding for quick and accurate stock control, linking easily to customers.

Distributors use bar codes as a crucial part of handling goods. Larger

businesses and those with high security requirements can use bar codes

for personnel identification and access records for sensitive areas.

VI. EFTPOS (Electronic Funds Transfer at Point Of Sale)

You will find EFTPOS terminals at the till in certain shops. An

EFTPOS terminal electronically prints out details of a plastic card transaction.

The computer in the terminal gets authorization for the payment amount (to

Chapter 1

12

make sure it's within the credit limit) and checks the card against a list of lost

and stolen cards.

VII. Magnetic stripe cards

A card (plastic or paper) with a magnetic strip of recording material on

which the magnetic tracks of an identification card are recorded.

Magnetic stripe cards are in widespread use as a way of controlling access

(e.g. swipe cards for doors, ticket barriers) and confirming identity (e.g. use in

bank and cash cards).

VIII. Smart cards

A smart card (sometimes also called a "chip card") is a plastic card with an

embedded microchip. It is widely expected that smart cards will eventually

replace magnetic stripe cards in many applications. The smart chip provides

significantly more memory than the magnetic stripe. The chip is also capable

of processing information. The added memory and processing capabilities

are what enable a smart card to offer more services and increased

security. Some smart cards can also run multiple applications on one card, this

reducing the number of cards required by any one person.

One of the key functions of the smart card is its ability to act as a stored value

card, such as Mondex and Visa cash. This enables the card to be used as

electronic cash. Smart cards can also allow secure information storage,

making them ideal as ID cards and security keys.

IX. Optical character recognition (OCR) and scanners

OCR is the recognition of printed or written characters by software that

processes information obtained by a scanner. Each page of text is

converted to a digital using a scanner and OCR is then applied to this

image to produce a text file. This involves complex image processing

algorithms and rarely achieves 100% accuracy so manual proof reading is

recommended.

Chapter 1

13

1.7.4 Review of related literature

The researcher has studied all the above mentioned work, and came to some

conclusion as summarized below:

Recognition of printed document is easily done by OCR. But handwritten

character recognition is still the subject of active research. Handwritten

character recognition is difficult task because handwriting differs from writer

to writer. Every individual person has an own style of writing.

1.8 Glossary of Terms

IMR : Intelligent Machines Research

ICR : Intelligent Character Recognition

ICT : Information and Communication Technology

MICR : Magnetic Ink Character Recognition

OMR : Optical Mark Reading

OCR : Optical Character Recognition

EPOS : Electronic Point of Sale

EFTPOS : Electronic Funds Transfer at Point Of Sale

1.9 Planning of the Remaining Chapters

The researcher has distributed the entire work into five different chapters. The

chapter summary for the remaining chapter i.e. from chapter 2 to chapter 5 is

as follow:

Chapter 2 Study of Optical Character Recognition Algorithms and

Tools

This chapter presents study of different optical character recognition

algorithms and offline and online character recognition tools. Researcher has

collected handwritten data samples of capital characters and digits. Result

analysis of each character and digit is carried out and calculated the

Chapter 1

14

recognition rate of each character and digit. Researcher has found out the

characters and digits whose recognition rate is very less in studied tool.

Chapter 3 Study and Analysis of English Characters and Digits for

Structural and Statistical Feature Extraction

This chapter presents study and analysis of structural and statistical features of

English capital characters and digits. Features like Zoning, Number of

endpoints, End point existence in zone, Zone with zero foreground pixels,

Number of vertical lines and Number of horizontal lines are studied and

analyzed for each capital character and digit. Decision tree method is used for

classification. Classification approach as decision table and decision tree is

presented for characters and digits.

Chapter 4 Designing and development of offline handwritten isolated

character recognition model

In this chapter researcher has designed and developed English handwritten

character recognition model for characters G, N, O and Y and digit 4.

Handwritten data samples are collected for those characters and digits.

Digitization of the handwritten data samples, character segmentation,

preprocessing, feature extraction and classification steps are performed to

recognize the characters and digits. Component development and functionality

of the proposed model is described in detail.

Chapter 5 Results, Conclusion and Future Scope for extension of

research work

In this chapter results and analysis of the result as outcome generated by the

proposed model is described. Writer wise result analysis for the characters G,

N, O, Y and digit 4 in the proposed model and performance analysis of

selected tools with the proposed model is described. Further this chapter

presents conclusion of the proposed research work along with direction for

future scope in the present research area.

Chapter 1

15

Reference:

[1] Kai Ding, Zhibin Liu, Lianwen Jin, Xinghua Zhu, “A Comparative

study of GABOR feature and gradient feature for handwritten 17hinese

character recognition”, International Conference on Wavelet Analysis

and Pattern Recognition, pp. 1182-1186, Beijing, China, 2-4 Nov.

2007

[2] Pranob K Charles, V.Harish, M.Swathi, CH. Deepthi, "A Review on

the Various Techniques used for Optical Character

Recognition",International Journal of Engineering Research and

Applications, Vol. 2, Issue 1, pp. 659-662, Jan-Feb 2012

[3] Om Prakash Sharma, M. K. Ghose, Krishna Bikram Shah, "An

Improved Zone Based Hybrid Feature Extraction Model for

Handwritten Alphabets Recognition Using Euler Number",

International Journal of Soft Computing and Engineering, Vol.2, Issue

2, pp. 504-58, May 2012

[4] http://en.wikipedia.org/wiki/Optical_character_recognition

[5] https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&c

d=1&cad=rja&sqi=2&ved=0CCoQFjAA&url=http%3A%2F%2Ffiles.

figshare.com%2F983548%2F2204.pdf&ei=tmP7Us-9HYf-

rAeu6IHYBA&usg=AFQjCNEekQKpByH4GO-LgK-

13ExJ6k8qQA&bvm=bv.61190604,d.bmk

[6] http://www.mathworks.in/discovery/image-segmentation.html

[7] http://shodhganga.inflibnet.ac.in:8080/jspui/bitstream/10603/4166/10/1

0_chapter%202.pdf

[8] http://arxiv.org/ftp/arxiv/papers/1103/1103.0365.pdf

[9] http://www.dataid.com/aboutocr.htm

[10] http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

[11] http://www.codeproject.com/KB/dotnet/simple_ocr.aspx

16