
Pattern Recognition 34 (2001) 215-233

An Arabic optical character recognition system using recognition-based segmentation

A. Cheung, M. Bennamoun*, N.W. Bergmann

Space Centre for Satellite Navigation, School of Electrical & Electronic Systems Engineering, Queensland University of Technology, GPO 2434, Brisbane, QLD 4001, Australia

Received 15 December 1998; received in revised form 16 November 1999; accepted 16 November 1999

Abstract

Optical character recognition (OCR) systems improve human-machine interaction and are widely used in many areas. The recognition of cursive scripts is a difficult task as their segmentation suffers from serious problems. This paper proposes an Arabic OCR system, which uses a recognition-based segmentation technique to overcome the classical segmentation problems. A newly developed Arabic word segmentation algorithm is also introduced to separate horizontally overlapping Arabic words/subwords. There is also a feedback loop to control the combination of character fragments for recognition. The system was implemented and the results show a 90% recognition accuracy with a 20 chars/s recognition rate. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Cursive script; Word segmentation; Character fragmentation; Recognition; OCR

1. Introduction

Recognizing characters is not a difficult task for humans, who repeat the process thousands of times every day as they read papers or books. However, after more than 40 years of intensive investigation, the ultimate goal of developing an optical character recognition (OCR) system with the same reading capabilities as humans still remains unachieved. OCR is the process of converting a raster image representation of a document into a format that a computer can process. Thus, it involves many subdisciplines of computer science including image processing, pattern recognition, natural language processing, artificial intelligence, and database systems. Comprehensive reviews of OCR histories and developments are given in Refs. [1,2].

* Corresponding author. Tel.: +61-7-3864-1204; fax: +61-7-3864-1516.

E-mail address: m.bennamoun@qut.edu.au (M. Bennamoun).

OCR has attracted immense research interest not only because of the very challenging nature of this problem (to shorten the reading-capabilities gap between machines and humans) but also because it improves human-machine interaction in many applications. Example applications include office automation, cheque verification, and a large variety of banking, business and data entry applications [1]. Most commercially available products are for typed Latin, Chinese and Japanese. These are some of the most commonly used scripts in the world, and their characters are well separated from one another with spaces. This is why OCR techniques and systems for these scripts are easier to develop and more mature.

Arabic is also a popular script. It is estimated that there are more than one billion Arabic script users in the world. If OCR systems are available for Arabic characters, they will have a great commercial value. However, due to the cursive nature of Arabic script, the development of Arabic OCR systems involves many technical problems, especially in the segmentation stage.


Although many researchers are investigating solutions to these problems and some of them have made remarkable achievements, e.g., Amin et al. [3], more research is still needed to improve the systems' performance.

The purpose of this paper is to propose an optical character recognition system using a newly developed word segmentation algorithm and a recognition-based segmentation technique for recognizing Arabic characters. The newly developed word segmentation algorithm can overcome the segmentation problems caused by the word/subword overlapping of Arabic script [4]. The advantage of the recognition-based segmentation technique is that no accurate character segmentation path is necessary. Thus, it bypasses the character segmentation stage: the segmentation of characters comes out as a by-product of the recognition results [5,24]. The structure of the proposed system is composed of five major stages and a feedback loop. The five stages are (1) image acquisition, (2) preprocessing, (3) word segmentation and character fragmentation, (4) feature extraction and (5) classification. The feedback loop, which is linked between the segmentation and recognition stages, conducts a signal to control the combination of character fragments.

The organization of the paper is as follows: some background knowledge about the characteristics of Arabic script and the recognition-based segmentation technique is given in Section 1 below. The structure of the OCR system is given in Section 2. The processes involved in the image acquisition and preprocessing stages are also presented in this section. A detailed description of the processes used to extract document regions is found in Section 3. Section 4 presents the processes of the recognition-based segmentation technique involved in the proposed system. The system's performance is presented in Section 5. Finally, the paper concludes in Section 6 by looking at how the proposed system can be extended and improved.

1.1. Background

An in-depth understanding of the characteristics of the target script is necessary for the development of its OCR system. This knowledge helps us to assess the suitability of existing techniques for the system and may also lead to the development of new techniques. Therefore, the characteristics of Arabic script are presented first in Section 1.2. In Section 1.3, the differences in concept between recognition-based segmentation and dissection segmentation are discussed. Emphasis is placed on the advantages of using recognition-based segmentation for cursive character recognition and on potential problems that may occur.

1.2. The characteristics of Arabic script

Arabic is a cursive language, which is written from right to left, and so recognition should proceed in that direction. There are 28 characters in the Arabic alphabet. Each character has two to four different forms, which depend on its position in the word or subword. As a result, there are 100 classes to be recognized. The Arabic character set is shown in Fig. 1 [5]. Fig. 2 illustrates the variation of the Arabic characters' shape depending on their position in the word.

An Arabic word can have one or more subwords, e.g., there are three subwords in the word shown in Fig. 2. Most characters have dot(s) or zigzag(s) associated with them, and these can be above, below, or inside the character. Many characters have a similar shape; the position or number of these secondary strokes makes the only difference, e.g., B, T, and TH. We have also noticed that Arabic words may horizontally overlap and characters may stack on others. These induce problems for both word and character segmentation. Fig. 3 demonstrates an example of word overlapping; the dotted box is where the overlapping occurs. At this stage, it is not hard to understand that segmentation is a crucial step in the development of an Arabic OCR system.

1.3. Dissection vs. recognition-based character segmentation

Character segmentation can be performed by either a dissection or a recognition-based technique [5]. Dissection means the decomposition of the image into a sequence of subimages using general features. It involves analysis of the image to find the subimage segmentation paths. Each subimage is treated as a character for recognition. It is worth mentioning that classification of characters is carried out at a later stage. Projection analysis, connected component processing, and white space and pitch finding are some of the common dissection techniques used by OCR systems [5]. These techniques are suitable for scripts which have spaces between characters. If a dissection technique is used for cursive scripts, a more "intelligent" and specific analysis technique for the particular script is needed. However, there is still no guarantee that high segmentation accuracy can be achieved.

The basic principle of recognition-based character segmentation is to use a mobile window of variable width to provide tentative segmentations which are confirmed (or not) by the classification [5]. Characters are by-products of the character recognition for systems using such a principle to perform character separation. The main advantage of this technique is that it bypasses serious character separation problems. In principle, no specific segmentation algorithm for the particular script is needed and recognition errors are mainly due to failures during the classification stage. For these reasons, more and more cursive script OCR systems use this technique (e.g., see Refs. [6,7]) to improve recognition accuracy.


Fig. 1. The character set of the Arabic script (EF = end form, MF = middle form, BF = beginning form, and IF = isolated form).

This approach is also known as "segmentation-free" recognition due to the virtual absence of the character separation stage [6,7].

2. An overview of the proposed system

Following the well-defined framework of existing OCR systems, the proposed system comprises five major stages. However, as a recognition-based character segmentation technique is used, a feedback loop is linked between the output of the classification stage and the input of the character fragments combination stage. Fig. 4 illustrates the block diagram of the proposed system. Apart from the feedback loop, notice that a user interface was implemented to present the recognition results to the user through a window-based program, so that the user can edit, reformat and print the recognized document.
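To make the overall flow concrete, the sketch below wires the five stages and the word-level feedback loop of Fig. 4 together in Python. Every name and signature here is an illustrative assumption rather than the authors' implementation; the recognition-based segmentation itself is detailed in Section 4.

# Sketch of the five stages of Fig. 4. All names are illustrative stubs, not the authors' code.

def acquire_image(path): ...              # stage 1: 200 dpi, 8-bit gray-level scan
def preprocess(image): ...                # stage 2: binarization and smoothing
def segment_words(image): ...             # stage 3a: word/subword segmentation
def fragment_word(word): ...              # stage 3b: tentative character fragmentation
def recognize_word(fragments, dbs): ...   # stages 4-5 plus the feedback loop (Section 4)

def recognize_document(path, databases):
    """Overall flow of the proposed system; the feedback loop lives inside recognize_word."""
    image = preprocess(acquire_image(path))
    return [recognize_word(fragment_word(w), databases) for w in segment_words(image)]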


Fig. 2. An example of an Arabic cursive word.

Fig. 3. An example of overlapping Arabic words.

Fig. 4. The structure of the proposed Arabic OCR system.

Image acquisition and preprocessing are the two relatively simple stages, which are presented first.

Image acquisition is at the image representation level of pattern recognition (PR). It is the process of acquiring a digitized representation of a document or an article to be recognized. A flatbed scanner is used at this stage to acquire 200 dpi, 8-bit gray-level images. Preprocessing is at the image-to-image transformation level. It is the process of compensating for a poor-quality original and/or poor-quality scanning. There are two processes to enhance the acquired image in the proposed system: binarization and smoothing.

Binarization is a special case of thresholding in which there are only two output states in the resulting image, either black or white. It reduces the computational requirements of the system and may enable the removal of some noise. A document can be binarized globally or adaptively.


Fig. 5. (a) An Arabic document. (b) The binarization result.

Fig. 6. A smoothed document image.

Unless the document is printed on unevenly coloured paper, global thresholding is good enough to carry out the binarization. Two global thresholding algorithms were studied and implemented: Otsu's [8] and Tsai's [9] algorithms. Their results were compared and Otsu's algorithm was chosen. An example of binarization is shown in Fig. 5.
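For illustration, a compact NumPy implementation of Otsu's global threshold [8] is sketched below. The variable names and the assumption that ink pixels are dark (and mapped to 1) are mine, not taken from the paper.

import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold of an 8-bit gray-level image (maximum between-class variance)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # class-0 probability w0(t)
    mu = np.cumsum(prob * np.arange(256))      # first-order cumulative moments
    mu_total = mu[-1]
    # between-class variance sigma_b^2(t); guard against empty classes
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)
    return int(np.argmax(sigma_b2))

def binarize(gray):
    """Global binarization: dark (ink) pixels -> 1, background -> 0 (polarity assumed)."""
    t = otsu_threshold(gray)
    return (gray <= t).astype(np.uint8)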

A smoothing process was then applied. It uses the spatial filter proposed by Amin et al. [10]. Fig. 6 shows an example of the smoothed result.


Fig. 7. The nature of the last stroke of Arabic characters: a horizontal straight line or an upward curve.

It is important to note that this algorithm not only smooths the image but can also restore missing pixels. For example, comparing Figs. 5(a) and 6, we can note that some pixels in the fourth last word (Arabic text is read from right to left) in line 1 were restored.

3. The document image analysis step

The document image analysis step is at the image-to-image transformation level of PR. It divides an image (usually enhanced) into meaningful regions for analysis. It is an essential and decisive step for Arabic OCR systems due to the cursive nature and overlapping appearance of Arabic script. Three processes are involved in this step of the proposed system: structural layout analysis, word segmentation and character fragmentation. Projection profile analysis is a common method used by many OCR systems (e.g., see Ref. [11]) for structural layout analysis. It is advantageous for pure text documents as it is simple to implement and requires little computational load.

3.1. Word segmentation

As mentioned earlier, Arabic script may horizontally overlap (refer to Fig. 3); therefore, a specific Arabic word segmentation method is needed.

Recall that each Arabic character has two to four different forms (refer to Fig. 1), which depend on its position in the word/subword. Any Arabic word segmentation algorithm will be concerned with three forms: the beginning form, the end form, and the isolated form. We noticed a particular characteristic in the appearance of the isolated and end forms of Arabic characters. Namely, the last strokes of these two forms are either horizontal straight lines or upward curves. Fig. 7 illustrates this characteristic. An Arabic character can possibly overlap with another character within a word if the last stroke of its isolated or end form is a horizontal straight line.

In order to minimize the computational load, the newly developed word segmentation method utilizes another characteristic of Arabic script: it is written from right to left. This means that word overlapping will only occur between the left-hand side contour of a word and the right-hand side contour of the succeeding word. Therefore, it is only necessary to trace through the left-hand side contour of Arabic words. There are two stages in this word segmentation method [4]:

Stage 1: It searches for the preliminary segmentation paths.

Stage 2: It includes secondary strokes.


Fig. 8. The structure of the mask for the determination of word segmentation paths.

Table 1
The algorithm for searching the preliminary segmentation path of Arabic words

Step 0: Place the mask on the top-right-hand corner of the image.
Step 1: Build up the vertical projection profile of this column.
    if its value is greater than 1: record w_start as the starting column number of the word segmentation line, go to Step 2.
    else: move the mask one pixel to the left, go to Step 1.
Step 2: Check whether Z1 is a white pixel.
    if Z1 is a white pixel: put Z1's coordinates in the array Seg[ ], go to Step 3.
    else if Z1 is a black pixel: move the mask one pixel to the left and go to Step 1.
Step 3: Check whether Z2 is a white pixel.
    if Z2 is a white pixel: move the mask one pixel down and go to Step 2.
    else if Z2 is a black pixel: check whether Z3 is a white pixel.
        if Z3 is a white pixel: place the mask at Z3 and go to Step 2.
        else if Z3 is a black pixel: check whether Z4 is a black pixel.
            if Z4 is a black pixel: record the current column number, place the mask at the first row of that column and go to Step 2.
            else if Z4 is a white pixel: check whether Z5 is a black pixel.
                if Z5 is a black pixel: record the current column number, place the mask at the first row of that column and go to Step 2.
                else if Z5 is a white pixel: place the mask on Z4 and go to Step 2.

In this word segmentation algorithm, a specially designed mask, shown in Fig. 8, is employed. Table 1 describes the steps used to search for the preliminary segmentation paths [4]. The steps in Table 1 are repeated until Z1 (see Fig. 8) has reached the last row of the line segment image. The column number is then recorded as w_stop.

Fig. 9 illustrates the preliminary segmentation paths of two lines of text, which are generated after Stage 1.
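A minimal sketch of Step 1 of Table 1 is given below: columns of a binary line image are scanned from the right until the vertical projection exceeds 1. The remaining steps depend on the Z1-Z5 mask geometry of Fig. 8 and are not reproduced; the ink-equals-1 convention is an assumption.

import numpy as np

def find_word_start_column(binary_line):
    """Step 1 of Table 1: scan columns right-to-left over a binary line image
    (ink = 1) and return the first column whose vertical projection exceeds 1,
    i.e. the starting column w_start of a word segmentation path."""
    n_rows, n_cols = binary_line.shape
    projection = binary_line.sum(axis=0)          # vertical projection profile
    for col in range(n_cols - 1, -1, -1):         # Arabic is read right to left
        if projection[col] > 1:
            return col
    return None                                    # blank line: no word found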

There are 17 letters of the Arabic alphabet having dot(s) or zigzag(s), which can be above, below or inside the character. After Stage 1, there is an indication of where to cut the words. However, the segmentation path may not be the shortest and it usually excludes the secondary strokes (dots and zigzags). This is shown in Fig. 9(b). Therefore, Stage 2 is obviously needed to include the secondary strokes and search for the shortest segmentation path. The algorithm is listed in Table 2 [4].

The "nal word segmentation results are shown in Fig.10. This algorithm has also been used to segment hand-written Arabic text and it produces good results in termsof accuracy and e$ciency [4]. Fig. 11 shows the wordsegmentation results of a line of hand-written text.

3.2. Character fragmentation

The purpose of this step is to produce a sequence of tentative character segmentation lines. Although these lines do not necessarily segment characters from a word correctly, their location and number directly affect the accuracy and speed of the OCR system.


Fig. 9. (a) The original line image. (b) The preliminary segmentation result after Stage 1 of word segmentation algorithm.

Table 2
The algorithm for Stage 2 of the Arabic word segmentation

From w_stop to w_start:
    Steps 2 and 3 of Table 1 are repeated.
    The coordinates of Z1 are recorded in the array tmpSeg[ ].
    The coordinates of the current ending of the word segmentation line are recorded as w'_stop.
    if the value of w_stop equals the value of w'_stop:
        Take the values in the array tmpSeg[ ] as the best segmentation path and store them in the array Seg[ ].
        Continue the search for the best segmentation path.
    else:
        The best word segmentation path has been found.
End

Fig. 10. (a) Two lines of printed Arabic text. (b) The word segmentation result after Stage 2.


Fig. 11. (a) A line of handwritten Arabic text. (b) The word segmentation result.

We used two processes to determine the final sequence of tentative lines. The first process is a simplified version of Amin's character segmentation algorithm [12]. The second process is the convex dominant points (CDPs) detection algorithm developed by Bennamoun [13]. It is important to note that this step only produces a sequence of fragments, while the segmentation of characters is confirmed at the classification stage.

Amin proposed a dissection algorithm for Arabic characters using projection profile analysis [12]. This algorithm tried to detect the connectivity points between characters within a word. In the algorithm, the vertical projection profile of the word to be segmented was first determined. The connectivity point will show the least sum of the average value (M_c) [12]:

$$M_c = \frac{1}{N_c}\sum_{i=1}^{N_c} C_{ic}, \qquad (1)$$

where N_c is the number of columns of the word image and C_{ic} is the number of black pixels in the ith column. Hence, each part showing a value less than M_c should be segmented into a character. However, as the above formula over-estimated the number of connectivity points, two more rules were included to eliminate some incorrectly estimated points [12]:

$$|d_i| < \frac{d_L}{3}, \qquad (2)$$

$$L_{i+1} > 1.5\,L_i, \qquad (3)$$

where d_i is the distance between the ith and (i+1)th peaks, d_L is the total width of the character, and L_i is the ith peak in the histogram. A more detailed description of this Arabic character dissection algorithm can be found in Ref. [12]. For the proposed system, it is good to have every possible segmentation point, therefore only Eq. (1) was used.
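Eq. (1) translates into a few lines of NumPy: compute the mean column density M_c of the word's vertical projection and keep the columns whose black-pixel count falls below it as tentative connectivity points. The binary ink-equals-1 convention is an assumption.

import numpy as np

def tentative_cut_columns(word_image):
    """Eq. (1): columns whose black-pixel count C_ic is below the mean M_c
    are kept as tentative connectivity (cut) points. word_image is binary, ink = 1."""
    column_counts = word_image.sum(axis=0)         # C_ic for each column i
    m_c = column_counts.mean()                     # M_c = (1/N_c) * sum_i C_ic
    return np.flatnonzero(column_counts < m_c)     # candidate segmentation columns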

It is understood that most complex objects contain CDPs. We can segment objects into parts by joining a corresponding pair of CDPs, and then model each part with a simple structure. Object recognition is performed by analyzing their structural relationships. This is the principle of Bennamoun's vision system [13]. In the second process of character fragmentation, we tried to detect the CDPs within a word using Bennamoun's object segmentation technique. CDP detection is relevant to our system as each CDP will be the ending point of a stroke, which allows us to obtain a sequence of fine fragmentation points.

There are two stages, namely boundary extraction and part segmentation, involved in detecting CDPs [13]. The boundary extraction stage consists of three subsequent operations, namely edge detection, edge linking and contour following. The word is separated from the background based on the edge information. Edge linking produces a closed edge so that the contour following operation can extract the outermost boundary of the object.

Edge detection extracts the word boundary from the background. It utilizes a hybrid edge detector which combines the first and second derivatives of the Gaussian function [14]. Fig. 12 illustrates the block diagram of the hybrid edge detector. The upper branch is responsible for achieving a precise localization of the edge, whereas the lower branch is used to filter out noise by thresholding the image at an appropriate level [13]. The σ of the Gaussian function was chosen to be 0.5. The method used to link edges is based on analyzing the characteristics of the edge pixels in a small neighborhood. All points having similar properties are linked to form a closed boundary. The reader is referred to Ref. [15] for the edge linking criteria. Fig. 13 shows the results of edge detection and edge linking.
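The exact detector is specified in Refs. [13,14]; the sketch below only follows the spirit of Fig. 12 under stated assumptions: one branch localizes edges at zero crossings of a second-derivative-of-Gaussian response, the other suppresses noise by thresholding the Gaussian gradient magnitude, and the two are ANDed. The SciPy calls and the gradient threshold are my choices, not the authors'.

import numpy as np
from scipy import ndimage

def hybrid_edges(gray, sigma=0.5, grad_thresh=10.0):
    """Hedged sketch of a first/second-derivative-of-Gaussian edge detector:
    zero crossings of the Laplacian of Gaussian (localization branch) ANDed
    with a thresholded Gaussian gradient magnitude (noise-suppression branch)."""
    log = ndimage.gaussian_laplace(gray.astype(float), sigma=sigma)
    # zero crossings: sign changes between a pixel and its right/lower neighbour
    zc = np.zeros_like(log, dtype=bool)
    zc[:, :-1] |= np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
    zc[:-1, :] |= np.signbit(log[:-1, :]) != np.signbit(log[1:, :])
    grad = ndimage.gaussian_gradient_magnitude(gray.astype(float), sigma=sigma)
    return zc & (grad > grad_thresh)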

The main purpose of contour following is to extract the outermost boundary of a word. The first step is to scan the word contour image from the top-right-hand corner.


Fig. 12. The hybrid edge detector [13].

Fig. 13. Examples of some contours of Arabic words.

The scanning operation is carried out column-wise from the top to the bottom row until a point on the contour is met. The first contour point encountered is stored as the contour starting point. The next contour point is found by performing a second scan of the closest neighboring pixels of the starting contour point in a clockwise direction. This scanning is started at the neighboring point of the contour starting point located outside the object along its normal. Once a contour point is found, a similar procedure is used to find the next contour point. The process terminates once the contour starting point is met again [13]. The result of this process is two sequences of contour coordinates, x(t) and y(t).
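The paper uses its own clockwise contour follower; as a stand-in for obtaining the coordinate sequences x(t) and y(t), the sketch below uses scikit-image's marching-squares contour finder. This is an assumption for illustration only; the starting point and traversal direction will generally differ from the procedure described above.

import numpy as np
from skimage import measure

def outer_contour(binary_word):
    """Return the boundary of a binary word image (ink = 1) as coordinate
    sequences x(t), y(t). Stand-in for the clockwise contour-following procedure."""
    contours = measure.find_contours(binary_word.astype(float), level=0.5)
    boundary = max(contours, key=len)          # outermost boundary is the longest one here
    y_t, x_t = boundary[:, 0], boundary[:, 1]  # find_contours returns (row, col) pairs
    return x_t, y_t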

The convex dominant points of a word are detected in the part segmentation stage. The algorithm used in the detection of the CDPs is illustrated in Fig. 14. At first, a contour-smoothing operation is carried out using a Gaussian kernel so that the problem of discontinuity in the calculation of the derivative of curvature can be avoided. The degree of smoothness is governed by the σ_1 value of the Gaussian kernel.


Fig. 14. The extraction of convex-dominant points.

Fig. 15. The CDPs detection results on some Arabic words (the small gray crosses are the CDPs).

Once a smooth contour is produced, the curvature is computed using Eq. (4) [13]:

$$K(t) = \frac{\dot{x}(t)\,\ddot{y}(t) - \dot{y}(t)\,\ddot{x}(t)}{\left(\dot{x}(t)^2 + \dot{y}(t)^2\right)^{3/2}}, \qquad (4)$$

where \dot{x}(t) = d\tilde{x}(t)/dt, \dot{y}(t) = d\tilde{y}(t)/dt, \ddot{x}(t) = d^2\tilde{x}(t)/dt^2, \ddot{y}(t) = d^2\tilde{y}(t)/dt^2, and \tilde{x}(t) and \tilde{y}(t) denote the smoothed versions of the x and y coordinates of the contour, respectively. The upper branch of the block diagram shown in Fig. 14 extracts all the dominant points on the contour by convolving the curvature with the derivative of the Gaussian function, followed by zero-crossing detection. A dominant point is defined as a point for which the derivative of the curvature equals zero, i.e.,

$$\frac{d\hat{K}_s(t,\sigma_2)}{dt} = 0. \qquad (5)$$

The lower branch is responsible for selecting the convex points for which the smoothed curvature \hat{K}_s(t) is greater than a certain threshold T_h. Both branches are ANDed to produce the convex dominant points (CDPs) and each CDP is a tentative fragmentation point. Examples are given in Fig. 15 to illustrate the CDPs detected from some Arabic words. The results of the first and second processes are combined so that every possible fragmentation point is included for analysis in the later stages. Fig. 16 shows the final sequence of fragmentation points for some Arabic word examples.
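A sketch of Eqs. (4) and (5) is given below: smooth the contour coordinates with a Gaussian, compute the curvature, then keep the curvature extrema (zero crossings of the derivative of the smoothed curvature) whose value exceeds the threshold T_h. The default parameter values echo Section 5 (σ_1 = 1.7, T_h = 0.05); the discrete differentiation and the use of a second Gaussian smoothing in place of an explicit derivative-of-Gaussian convolution are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def convex_dominant_points(x_t, y_t, sigma1=1.7, sigma2=3.0, thresh=0.05):
    """Eqs. (4)-(5): detect convex dominant points on a closed contour x(t), y(t)."""
    # Gaussian smoothing of the contour coordinates (wrap mode: the contour is closed)
    xs = gaussian_filter1d(np.asarray(x_t, float), sigma1, mode="wrap")
    ys = gaussian_filter1d(np.asarray(y_t, float), sigma1, mode="wrap")
    dx, dy = np.gradient(xs), np.gradient(ys)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    # Eq. (4): curvature of the smoothed contour (small epsilon avoids division by zero)
    curvature = (dx * ddy - dy * ddx) / ((dx ** 2 + dy ** 2) ** 1.5 + 1e-12)
    k_s = gaussian_filter1d(curvature, sigma2, mode="wrap")   # smoothed curvature K_s
    dk = np.gradient(k_s)                                     # derivative of curvature
    # Eq. (5): dominant points are zero crossings of dK_s/dt ...
    zero_cross = np.flatnonzero(np.signbit(dk[:-1]) != np.signbit(dk[1:]))
    # ... ANDed with the convexity test K_s > T_h
    return zero_cross[k_s[zero_cross] > thresh]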


Fig. 16. The "nal sequence of tentative fragmentation lines.

4. The recognition-based segmentation step

In this section, the functions of the processes involved in the recognition-based segmentation step are described. They include feature extraction, classification and a feedback loop. An example is given in Section 4.3 to demonstrate this step.

4.1. The feature extraction stage

After an image has been segmented into regions, it is ready to enter the next level of PR, the image-to-parameters transformation level, that is, the feature extraction stage. It is not surprising that each OCR system has its own feature extractor, just as each person would describe the same object in a different way.


Fig. 17. The Arabic character "Alif".

However, good features should have four characteristics [16]: (1) discrimination, (2) reliability, (3) independence, and (4) small number. Basically, there are two choices for representing a region [17]: (1) one can represent the reflectivity properties of a region based on its internal characteristics (i.e., the pixels comprising the region); or (2) one may choose to represent the shape characteristics of an object in terms of its external characteristics (i.e., the boundary).

Chain codes, an external feature, were used to represent a boundary by a connected sequence of straight-line segments of specific length and direction. Typically, this representation is based on 4- or 8-connectivity of segments, where the direction of each segment is coded. One example is proposed by Amin [18]. Another method is proposed in Ref. [19]: the contour of an object is extracted by using the hybrid edge detector described in Section 3.2; the tracing process is then started from the top-right-hand black pixel of the object contour and traced through its whole contour, and a sequence of codes is obtained. For example, the sequence of codes generated by this method for the Arabic character "Alif" shown in Fig. 17 is 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 7, 7, 6, 7, 7, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 3, 2, 3, 3, 3, 3, 3, 3. Applying Eq. (6) to smooth the chain code, it becomes 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3. The purpose of these rules is to reduce the effect of noise on the chain code:

$$C_i C_j C_j \rightarrow C_i C_j C_j, \quad C_i C_i C_j \rightarrow C_i C_i C_j, \quad C_i C_j C_i \rightarrow C_i C_i C_i, \quad C_i C_j C_k \rightarrow C_l C_l C_l, \qquad (6)$$

where C_i, C_j, C_k, C_l ∈ {1, 2, ..., 8} are codes and C_l is the resultant direction of C_i, C_j and C_k.

It is clear that the contour of an object is a closed loop. The last five elements of the above sequence are '3'; the same contour can be represented if we move them to the beginning of the sequence. The chain codes are finally concentrated by dividing the run-length of a code by a threshold T_1, provided that the run-length of that code exceeds a threshold T_2. The purpose of T_2 is to give the final code chain a certain degree of robustness to noise. If T_1 is set to 8 and T_2 is set to 2, then the above code chain becomes: 3, 3, 5, 7, 7, 2.
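The sketch below illustrates 8-connectivity (Freeman) chain coding of a traced contour and the run-length concentration controlled by T_1 and T_2. The direction numbering is an assumption (the paper's own code assignment, which yields the "Alif" sequence above, is not reproduced here), and the rounding rule in concentrate() is simply one plausible choice that reproduces the worked example.

import numpy as np

# 8-connected direction codes keyed by (d_row, d_col); the numbering convention is assumed.
DIRECTIONS = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
              (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(contour_points):
    """Encode a closed contour (list of (row, col) pixels in tracing order)
    as a sequence of 8-connectivity direction codes."""
    codes = []
    for (r0, c0), (r1, c1) in zip(contour_points, contour_points[1:] + contour_points[:1]):
        step = (int(np.sign(r1 - r0)), int(np.sign(c1 - c0)))
        if step != (0, 0):                                   # skip repeated points
            codes.append(DIRECTIONS[step])
    return codes

def concentrate(codes, t1=8, t2=2):
    """Concentrate a chain code: runs of length <= T2 are dropped as noise,
    longer runs are reduced to max(1, round(run / T1)) repetitions."""
    result, i = [], 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1
        run = j - i
        if run > t2:
            result.extend([codes[i]] * max(1, round(run / t1)))
        i = j
    return result

With T_1 = 8 and T_2 = 2, and after rotating the trailing run of 3s to the front as described above, concentrate() yields 3, 3, 5, 7, 7, 2 for the smoothed "Alif" code.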

4.2. The classification step

Classi"cation belongs to the parameter-to-decisiontransformation level. It is the "nal step of most OCRsystems. A character will be assigned to one of themdepending on how its feature measurements compare

with the established criteria. The processes of how torecognize an Arabic character using the feedback loopare described below.

String matching [20] was used in the proposed OCR system. It utilizes a state machine to search for the identity of a character.


Fig. 18. (a) An Arabic word. (b) Its fragmentation results. (c) Fragments are numbered from right to left.

A state machine comprises five elements [21]: M = [I, Q, Z, δ, ω], where I is a set of input symbols (i.e., the chain codes in our case), Q is a set of states, Z is a set of output symbols (i.e., the Arabic character codes), δ is a mapping of I × Q into Q, called the "next state function", and ω is a mapping of I × Q onto Z, called the "output function".
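A minimal sketch of the matcher is shown below. For brevity, each database entry (a concentrated chain-code sequence and its character code) is flattened into a dictionary lookup rather than an explicit state-transition table; the entries and character labels are hypothetical placeholders.

def build_database(entries):
    """Build a form database from (chain_code_sequence, character_code) pairs.
    Each sequence corresponds to one path through the state machine M = [I, Q, Z, delta, omega]."""
    return {tuple(codes): char for codes, char in entries}

def match(concentrated_codes, form_db):
    """Output function omega: return the character code reached by the input
    sequence, or None if no state accepts it (character unrecognized)."""
    return form_db.get(tuple(concentrated_codes))

# Hypothetical usage with placeholder entries:
isolated_form_db = build_database([
    ([3, 3, 5, 7, 7, 2], "ALIF_EXAMPLE"),      # codes from the worked example above
    ([2, 4, 6, 0], "SOME_OTHER_CHARACTER"),    # placeholder entry
])
print(match([3, 3, 5, 7, 7, 2], isolated_form_db))   # -> "ALIF_EXAMPLE"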

To recognize characters, words are first fragmented into a sequence of character fragments by the method described in Section 3.2. Each fragment is numbered from right to left. During the recognition process, the first fragment is fed into the feature extraction stage to determine the concentrated chain codes. These chain codes are then input to the string matcher to find the best match. In order to minimize the confusion of character fragments with characters and to save search time, there are four databases. According to the position of the fragment(s) in a word, the corresponding database is used to search for the best match. For example, if a tentative character is formed by the combination of the first and second fragments, the database for the beginning form characters is used. If this tentative character could not be recognized, a signal is fed back to the character combination process to combine the first three character fragments (refer to Fig. 4). The above processes are repeated until a character is recognized. If a character is recognized after the combination of the first n fragments, then the feedback loop will start again at the (n+1)th fragment.


Fig. 19. The (a) right-to-left and (b) left-to-right feedback loops for recognition of the Arabic word shown in Fig. 18(a).

The above feedback loop occurs twice for each word. In the first pass, the fragment combination is directed from right to left of the word. If not all characters in that word are recognized, the second feedback loop proceeds; this time, the fragment combination is directed from left to right of the word. The results from these two feedback loops are combined to form the final recognition results. To illustrate the above processes more clearly, an example is given below.
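The feedback loop itself can be sketched as follows. classify and merge are stand-ins for the feature extraction/string matching and fragment combination stages described above, and treating the left-to-right pass as a reversal of the fragment order is my reading of the description, not the authors' code.

def recognize_pass(fragments, classify, merge, left_to_right=False):
    """One directed feedback pass over a word's fragment sequence (Fig. 19).
    classify(image, position) returns a character code or None; merge(images)
    joins adjacent fragments. Returns the recognized characters and any
    fragments left unrecognized."""
    order = list(range(len(fragments)))
    if left_to_right:                          # the second, left-to-right pass
        order.reverse()
    recognized, start = [], 0
    while start < len(order):
        hit = None
        for end in range(start + 1, len(order) + 1):
            tentative = merge([fragments[k] for k in order[start:end]])
            hit = classify(tentative, position=start)
            if hit is not None:                # "recognized" signal on the feedback loop
                recognized.append(hit)
                start = end                    # next character starts at fragment n+1
                break
        if hit is None:                        # "unrecognized": hand over to the other pass
            return recognized, [fragments[k] for k in order[start:]]
    return recognized, []

The full recognition step would run this pass twice, first right to left and then left to right over whatever remains unrecognized, and concatenate the two results.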

4.3. An example

In this subsection, a simple example is presented to demonstrate the recognition-based segmentation step.

The word shown in Fig. 18(a) comprised three Arabic characters: "Lam", "Ghayn", and "Ra". The word was first fragmented and the results are shown in Fig. 18(b).

As mentioned earlier, fragments were numbered from right to left as shown in Fig. 18(c). Fragment 1 was input to the feature extraction stage to obtain its concentrated chain code sequence. This code sequence was fed into the string matcher, which utilizes a state machine to identify Fragment 1 by comparing it with the database. Recall that there are four databases. "The Beginning Form" database was the one used for recognizing Fragment 1.

It is appropriate to describe the databases. Each database contains a set of sequences of chain codes with states.


Fig. 20. (a) The original document. (b) The recognition result.

Fig. 21. (a) The original document. (b) The recognition result.


Thus, each code sequence can be represented by a state diagram. The number of code sequences in the database depends on the number of characters in the particular form. The databases were constructed from a set of manually segmented characters containing all four forms.

If Fragment 1 could not be identified, a signal indicating that the character is unrecognized is sent from the classifier through the feedback loop to the character fragments combination stage. As it was the first attempt to recognize characters of that word, character fragments were combined in the right-to-left direction; that means Fragments 1 and 2 were combined to form a tentative character. This is why we call this feedback loop a right-to-left feedback loop.

After Fragments 1 and 2 were combined, they were again fed to the feature extraction stage and then to the string matcher. This time, the tentative character was recognized as the beginning form of the character "Lam". Another signal, indicating that the character is recognized, is sent to the character fragments combination stage to indicate that the next recognition will start from Fragment 3. For Fragment 3, the above processes were repeated except that "The Middle Form" database was used this time. Fig. 19(a) illustrates the right-to-left feedback recognition process for the Arabic word shown in Fig. 18(a). If not all characters in the word are recognized, the second directed feedback loop is performed; it is directed from left to right. Fig. 19(b) illustrates the left-to-right feedback recognition process. The recognition results from the first and second feedback loops are compared and concatenated if necessary.

5. Experimental results

In this section, experimental results are presented to illustrate the accuracy and efficiency of the proposed system. The performance of the OCR system was tested using articles from an Arabic book. They were scanned using a flatbed scanner and the images were then binarized and smoothed. Lines of text were then extracted from the images by analyzing their horizontal projection profiles. Word segmentation was performed on each line to segment words. This word segmentation method was specially designed for separating horizontally overlapping Arabic words/subwords. It shows a very good word segmentation accuracy of 99%. Moreover, it is a real-time process [4].

Each word was then further segmented into a sequence of fragments using the simplified version of Amin's segmentation method and Bennamoun's CDP detection method. The first set of parameters to be pre-defined appears here: the sigma values of the Gaussian functions, which control the degree of smoothing at different stages of the CDP detection process.

The σ value of the Gaussian function of the hybrid edge detector was chosen as 0.5. Clear and exact edges of words were detected, and word contour coordinates were then determined by a clockwise contour following process. After that, CDP detection was carried out. We needed another two sigma values and a threshold value: σ_1 = 1.7, σ_3 = 3, and T_h = 0.05. These values were obtained after many experimental trials for the best result of the CDP detection, although they can be determined adaptively [14] at the expense of computational load.

A sequence of character fragments was the input to the recognition-based segmentation step. We pre-defined the two thresholding values, T_1 and T_2, used to concentrate the chain code: T_1 equals 8 and T_2 equals 2. T_1 affects the length of the resultant chain code, while T_2 affects the sensitivity to curve changes and in fact gives a certain degree of noise immunity. Two directed feedback loops were used to recognize characters (at the same time, they segmented characters from words). After many tests were made, a recognition accuracy of 90% at a rate of 20 chars/s was reached. Some experimental results are shown in Figs. 20 and 21.

To justify the performance of the proposed method, we compared it with the two systems presented in Refs. [22,23]. One of them is a statistical-based OCR system and the other is a neural network-based OCR system. They both have a recognition accuracy of 85%. The improvement in accuracy is due to the addition of two methods (the word segmentation method and the recognition-based segmentation technique) which were used to solve some of the segmentation problems of Arabic script (i.e., its cursive nature and overlapping).

6. Conclusions

A segmentation-free [6,7] Arabic optical character recognition system is proposed in this paper. It has a window-based user interface to present the recognition results to the users and allow them to edit, reformat or print the results. As we know, segmentation introduces the most serious problems in the development of a cursive script OCR system. In order to overcome this problem, the system uses a newly developed Arabic word segmentation method and a recognition-based segmentation technique. The word segmentation method can accurately and efficiently separate horizontally overlapping Arabic words/subwords; it is a real-time process. It is hard to develop dissection rules for a cursive script. Therefore, we fragmented Arabic words using their structural properties, namely connectivity points and CDPs. We then recognized characters by combining fragments. This technique bypasses the segmentation step so that we do not have to worry about determining the actual character segmentation points.



However, the proposed system still suffers from problems. Referring to the Arabic character set shown in Fig. 1, we can see that some characters look similar to a part of some other characters, e.g., the middle forms of B, T and TH look similar to a fragment of the middle forms of S, N and SH. As there is no exact character segmentation point provided, confusion between these characters may occur and lead to mis-recognition. Since "recognized" or "unrecognized" is the guideline for combining character fragments, any mis-recognition in the middle of a word will affect all the combinations that follow. That is the reason why the second directed feedback loop was used: it compensates for some of the errors due to mis-recognition in the first pass.

The second problem is that characters may be deformed or stacked on other characters; e.g., in the second line of Fig. 21(a), the first two characters of word 18 are deformed and the first character of word 11 is stacked on its succeeding character. This appearance characteristic makes the characters look different from their original form and thus causes the extracted features to have a large deviation from the database. The simplest solution to this problem is to include all possible deformed characters and stacked character sets in the databases. Alternatively, a horizontal fragmentation algorithm is needed to solve the character stacking problem.

References

[1] V.K. Govindan, A.P. Shivaprasad, Character recognition - a review, Pattern Recognition 23 (7) (1990) 671-683.
[2] S. Mori, C.Y. Suen, K. Yamamoto, Historical review of OCR research and development, Proc. IEEE 88 (1992) 1029-1058.
[3] A. Amin, Off-line Arabic character recognition: the state of art, Pattern Recognition 31 (5) (1998) 517-530.
[4] A. Cheung, M. Bennamoun, N.W. Bergmann, A new word segmentation algorithm for Arabic script, DICTA: Digital Imaging Comput. Tech. Appl. (1997) 431-435.
[5] H. Al-Yousefi, S.S. Udpa, Recognition of Arabic characters, IEEE Trans. Pattern Anal. Mach. Intell. 14 (8) (1992) 853-857.
[6] B. Al-Badr, R.M. Haralick, Segmentation-free word recognition with application to Arabic, ICDAR: Third International Conference on Document Analysis and Recognition, 1995.
[7] C.H. Chen, J.L. DeCurtins, Word recognition in a segmentation-free approach to OCR, Second International Conference on Document Analysis and Recognition, 1993, pp. 573-576.
[8] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Systems Man Cybernet. 9 (1) (1979) 62-66.
[9] W.H. Tsai, Moment-preserving thresholding: a new approach, Comput. Vision Graphics Image Process. 29 (1985) 377-393.
[10] A. Amin, W.H. Wilson, Hand-printed character recognition system using artificial neural network, Proceedings of the Second International Conference on Document Analysis and Recognition, 1993, pp. 943-946.
[11] T. Akiyama, N. Hagita, Automated entry system for printed documents, Pattern Recognition 23 (11) (1990) 1141-1154.
[12] A. Amin, Recognition of Arabic handprinted mathematical formulae, Arabian J. Sci. Eng. 16 (4B) (1991) 532-542.
[13] M. Bennamoun, B. Boashash, A structural description-based vision system for automatic object recognition, IEEE Trans. Systems Man Cybernet. 16 (27) (1997) 893-906.
[14] M. Bennamoun, B. Boashash, A probabilistic approach for automatic parameters selection for the hybrid edge detector, IEICE Trans. Fund. Electron. Commun. Comput. Sci. E80-A (8) (1997).
[15] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Addison-Wesley, New York, 1992.
[16] K.R. Castleman, Digital Image Processing, Prentice-Hall Signal Processing Series, Prentice-Hall Inc., USA, 1979.
[17] R.C. Gonzalez, P. Wintz, Digital Image Processing, 2nd Edition, Addison-Wesley, California, 1987.
[18] A. Amin, J.F. Mari, Machine recognition and correction of printed Arabic text, IEEE Trans. Systems Man Cybernet. 19 (1989) 1300-1306.
[19] A. Cheung, M. Bennamoun, N.W. Bergmann, A recognition-based Arabic optical character recognition system, IEEE International Conference on Systems, Man and Cybernetics, 1989, pp. 4189-4194.
[20] R.J. Schalkoff, Pattern Recognition: Statistical, Structural and Neural Network, Wiley, New York, 1992.
[21] M. Nadler, E.P. Smith, Pattern Recognition Engineering, Wiley-Interscience, New York, 1993.
[22] A. Cheung, M. Bennamoun, N.W. Bergmann, Implementation of a statistical character system, TENCON (1997) 531-534.
[23] A. Cheung, M. Yip, M. Bennamoun, N.W. Bergmann, The Arabic optical character recognition systems: statistical and neural network, IAIF (1997) 293-298.
[24] R.G. Casey, E. Lecolinet, A survey of methods and strategies in character segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 18 (7) (1996) 690-706.


About the Author: ANTHONY CHEUNG received the Bachelor of Electrical and Electronic Engineering and Bachelor of Information Systems from Queensland University of Technology (QUT), Australia, in 1996 and the Master of Engineering by research in "Character Recognition" from QUT in 1998. His areas of interest are Image Processing, Computer Vision and Artificial Intelligence.

About the Author: MOHAMMED BENNAMOUN finished his Master of Science (M.Sc.) degree by research in "Control Theory" at Queen's University, Canada and his Ph.D. at Queensland University of Technology (QUT), Australia. He is currently the Director of the Space Centre for Satellite Navigation, School of Electrical & Electronic Systems Engineering at QUT, and the Vice-Chair of the IEEE Queensland Section. His areas of interest are Signal/Image Processing, Computer Vision, Control Theory, Robotics, Obstacle Avoidance, and Artificial Neural Networks.

About the Author: Dr NEIL BERGMANN is an Associate Professor at the Queensland University of Technology and Project Leader for the High Performance Computing Program within the Cooperative Research Centre for Satellite Systems. He has previously worked at the University of Edinburgh, the University of Queensland and CSIRO/Flinders University. His research interests are in computer systems engineering, particularly the application of reconfigurable computing technology to problems in image and video processing, automated visual inspection, data compression and satellite instrument processing.
