

Page 1: [IEEE 2009 2nd IEEE International Conference on Computer Science and Information Technology - Beijing, China (2009.08.8-2009.08.11)] 2009 2nd IEEE International Conference on Computer

Urdu Compound Character Recognition Using Feed Forward Neural Networks

Zaheer Ahmad, Jehanzeb Khan Orakzai and Inam Shamsher

Center of Information Technology, Institute of Management Sciences, Hayatabad, Peshawar, Pakistan

e-mail: [email protected] , [email protected] and [email protected]

Abstract: Urdu compound character recognition is a scarcely developed area that requires robust techniques, because Urdu, as a member of the Arabic script family, is cursive and written right to left, and its characters change shape and size depending on whether they appear at the beginning, middle or end of a word. The developed system consists of two main modules: segmentation and classification. In the segmentation phase, pixel strength is measured to detect words in a sentence and the joints between characters in a compound (connected) word. In the next phase, the segmented characters are fed to a trained feed-forward neural network for classification and recognition; the network is trained on 56 classes of characters, each with 100 samples. The main purpose of the system is to test the algorithm developed for segmenting compound characters. The prototype, developed in Matlab, currently achieves 70% accuracy on average.

I. INTRODUCTION

OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system makes it possible to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor.

Here an Urdu OCR (UOCR) is designed and developed to recognize images of Urdu text and characters. The sole purpose of the OCR is to test the algorithm developed for feature extraction and segmentation; therefore no robust noise detection or removal techniques are applied. Similarly, the system is fixed to a specified font (Arial, size 36) and does not consider diacritics.

The system takes a single line of Urdu text and segments it into words and then into characters. A multilayer feed-forward neural network is trained to recognize these segments as characters: each segment is fed to the trained network, which shows the correct character on successful recognition and otherwise generates a ‘Character not Recognized’ message. The overall recognition rate of the system is 70%.

II. URDU: A CURSIVE SCRIPT

Urdu is the national language of Pakistan and one of the popular scripts of the Indian subcontinent. It evolved in the subcontinent from a mixture of Arabic, Turkish, Farsi and Hindi, and has a set of 58 characters defined by the National Language Authority of Pakistan [1-16], as shown in figure 2.1. However, only 40 basic characters and one do-chashmi-hey are used to form all composite alphabets; the working set therefore consists of 41 alphabets.

The script is a modification of the Persian alphabet, which is itself a derivative of the Arabic alphabet. Urdu shares a common script and many characteristics with Arabic, with an additional set of alphabets from the Farsi and Hindi character sets. The graphical representation of each alphabet has more than one form, depending on its position and context in the word. In general each letter has four forms (beginning, middle, final and standalone), as shown in Table I.

TABLE I. DIFFERENT FORMS OF URDU CHARACTERS

#      Character   Name
0      ء           hamzah
1      ا           alif
1a     آ           alif madd
2      ب           bē
2h     -           bhē
3      پ           pē
3h     -           phē
4      ت           tē
4h     -           thē
5      ٹ           ṭē
5h     -           ṭhē
6      ث           sē
7      ج           jīm
7h     -           jhē
8      چ           čē = cē
8h     -           čhē = chē
9      ح           baṛī Hē
10     خ           xē = khē
11     د           dāl
11h    ده          dhē
12     ڈ           ḍāl
12h    ڈه          ḍhē
13     ذ           zāl
14     ر           rē
14h    ره          rē
15     ڑ           ṛē
15h    ڑه          ṛhē
16     ز           zē
17     ژ           žē = zhē
18     س           sīn
19     ش           šīn = shīn
20     ص           Sād, Suād
21     ض           Żād, Żuād
22     ط           Tōē
23     ظ           Zōē
24     ع           ‘ain
25     غ           ğain
26     ف           fē
27     ق           qāf
28     ک           kāf
28h    -           khē
29     گ           gāf
29h    -           ghē
30     ل           lām
30h    -           lām
31     م           mīm
31h    -           mīm
32     ن           nūn
32h    -           nūn
32a    ں           nūn-e ğunnah
32ah   -           nūn-e ğunnah
33     و           vāō
33h    وه          vāō
34     ہ           čhōṭī hē
34b    ه           dō-čašmī hē
35     ی           čhōṭī yē
35h    -           čhōṭī yē
35b    ے           baṛī yē

Fig-1. Character Set (58 alphabets) of Urdu Script.

_____________________________ 978-1-4244-4520-2/09/$25.00 ©2009 IEEE

III. PROBLEMS OF URDU SCRIPT

Some problems of the script are presented here from the character recognition point of view [7-15].

1. Urdu is written from right to left in both printed and handwritten forms.

2. No upper or lower case exists in Urdu, but the last character of a word is sometimes considered as upper case because it always remains in its full form.

3. Urdu is always written cursively, and words are separated by spaces. However, six characters can be connected only from the right: ا , د , ذ , ر , و , ز.

4. Urdu characters are ‘normally’ connected on an imaginary line called the baseline, and each alphabet in a character has a fixed size depending on the pen (‘Qalam’) used, which is called ‘khat’.

5. Some Urdu characters have dots associated with them; the dots can be above or below the character.

6. Some characters contain a closed loop (refer to Table I). The loop is an important feature for describing a character. Character ـهـ contains two loops. The open portion of the characters جـ , حـ and خـ is sometimes closed to form a triangle when written by hand. The loop of the characters م , و and ـعـ sometimes becomes so small that the internal opening disappears.

7. Hamza (ء), which has a zigzag shape, is not really a letter, but it can cause difficulty in the segmentation process because it resembles the character ein (ع).

8. Only three characters represent vowels: ا , و and ي. There are also shorter vowels represented by diacritics in the form of overscores or underscores, but these are used less in Urdu than in Arabic.

9. Dots may appear as two separated dots, touching dots, a hat or a stroke.

10. Another style of Urdu handwriting is artistic or decorative calligraphy, which is usually full of overlapping strokes and makes recognition difficult even for human readers, let alone for computers.

IV. FEED FORWARD NEURAL NETWORKS

Neural networks are composed of simple input and output nodes operating in parallel. These elements are inspired by biological nervous systems. As in nature, the network function is determined largely by the connections between elements. A neural network can be trained to perform a particular function by adjusting the values of the connections (weights) between elements. Commonly, neural networks are adjusted, or trained, so that a particular input leads to a specific target output: the network is adjusted, based on a comparison of the output and the target, until the network output matches the target.

Feed-forward neural networks often have one or more hidden layers of nodes followed by an output layer of neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn nonlinear as well as linear relationships between input and output vectors.

There are a number of algorithms for training neural networks. Back-propagation (BP) is the most popular of them and has been used to solve numerous real-life problems. Applied to a multilayer feed-forward network, BP iteratively minimizes a cost function by adjusting the connection weights according to the error between the computed and the desired output values.
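The weight-adjustment loop described above can be illustrated with a minimal sketch in plain Python. It is not the authors' Matlab code: the tiny layer sizes, the learning rate, the squared-error cost, the use of plain gradient descent, and the names `init_network`, `train_step` and `train` are all assumptions made for illustration.

```python
import math
import random


def init_network(n_in, n_hidden, n_out, seed=0):
    """Small random weights and zero biases for a one-hidden-layer network."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, [0.0] * n_hidden, w2, [0.0] * n_out


def train_step(x, target, w1, b1, w2, b2, lr):
    """One back-propagation update; returns the squared error before the update."""
    # Forward pass: tanh hidden layer, logistic-sigmoid output layer.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = [1.0 / (1.0 + math.exp(-(sum(w * hj for w, hj in zip(row, h)) + b)))
         for row, b in zip(w2, b2)]
    err = [yk - tk for yk, tk in zip(y, target)]
    loss = 0.5 * sum(e * e for e in err)
    # Backward pass: error terms scaled by each transfer function's derivative.
    d_out = [e * yk * (1.0 - yk) for e, yk in zip(err, y)]
    d_hid = [(1.0 - hj * hj) * sum(d_out[k] * w2[k][j] for k in range(len(w2)))
             for j, hj in enumerate(h)]
    # Gradient-descent adjustment of the connection weights and biases.
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= lr * d_out[k] * h[j]
        b2[k] -= lr * d_out[k]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] -= lr * d_hid[j] * x[i]
        b1[j] -= lr * d_hid[j]
    return loss


def train(samples, n_hidden=3, epochs=500, lr=0.5, seed=0):
    """Train on (input, target) pairs; returns the total error of every epoch."""
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w1, b1, w2, b2 = init_network(n_in, n_hidden, n_out, seed)
    history = []
    for _ in range(epochs):
        history.append(sum(train_step(x, t, w1, b1, w2, b2, lr) for x, t in samples))
    return history
```

Calling `train([([0.0, 1.0], [1.0]), ([1.0, 0.0], [0.0])])` returns a per-epoch error history; comparing its first and last entries shows whether the iterative minimization is making progress.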

V. URDU CONNECTED CHARACTER RECOGNITION (UOCR)

Any OCR consists of two main modules: one performs feature extraction and segmentation, and the other recognizes the segments as characters. The UOCR works similarly; it is also composed of two main modules with submodules, as shown in Fig. 2.

VI. FEATURE EXTRACTION AND SEGMENTATION

Fig. 2. Character Segmentation and Recognition. Stages shown: Input Urdu Text Image, Preprocessing, Segmentation, Segmented Character, Binary Character (Resized), Character Code (Results).

During this phase, pixel strength is measured to detect words in a sentence and the joints of characters in a word, so that sentences can be segmented into words and words into characters. The pixel strength, or energy, is the number of black pixels in a specific direction. A search is made for a path in different directions (e.g. bottom to top, right to left), during which black pixels are counted, and the path on which the minimum number of black pixels is encountered is selected.

The minimum strength/energy path (seam) is found by first locating the minimum value in the last row, which becomes the (i,j)’th pixel, saving the pixel location and changing its status to 1, and then working backwards by finding the minimum of the 3 neighboring pixels of (i,j) in the (i-1)’th row and saving that pixel to the seam path. After the strength of the seam is found, the pixels that make up the seam are set to 1 in the image to increase their energy level and discourage their contribution to the next search for seams. For word segmentation, vertically straight seams are selected as a first priority. For character segmentation, vertical seams are also preferred, but if the size of a segment exceeds a threshold value, horizontal seams are applied to the same segment to segment it further.

In Table II, the colored cells of columns II, III and IV together make a seam, columns V and VI combined make a seam, and columns I and VIII independently make seams. These seams are selected for segmenting the image into words or characters.
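The backward seam search described in this section can be sketched as follows. This is an illustration in Python rather than the authors' Matlab implementation: it uses the standard cumulative-energy formulation from seam carving [7], which is an assumption about how the pixel strengths are accumulated, and the names `min_vertical_seam`, `mark_seam` and `TABLE_II` are hypothetical. The matrix holds the values of Table II, with 1 standing for a black pixel.

```python
def min_vertical_seam(img):
    """Return (cost, path) of a minimum-energy top-to-bottom seam.

    img is a list of rows of 0/1 pixels (1 = black). path[r] is the column
    of the seam in row r; consecutive columns differ by at most one.
    """
    rows, cols = len(img), len(img[0])
    # Cumulative energy: each pixel adds the cheapest of its 3 upper neighbours.
    cum = [list(img[0])]
    for r in range(1, rows):
        prev = cum[-1]
        cum.append([img[r][c] + min(prev[max(c - 1, 0):min(c + 2, cols)])
                    for c in range(cols)])
    # Start from the minimum value in the last row, then work backwards.
    c = min(range(cols), key=lambda k: cum[-1][k])
    cost = cum[-1][c]
    path = [c]
    for r in range(rows - 1, 0, -1):
        neighbours = range(max(c - 1, 0), min(c + 2, cols))
        c = min(neighbours, key=lambda k: cum[r - 1][k])
        path.append(c)
    path.reverse()
    return cost, path


def mark_seam(img, path):
    """Set the seam pixels to 1 so the next search is steered away from them."""
    for r, c in enumerate(path):
        img[r][c] = 1


# Pixel values of Table II (rows i-vii, columns i-viii).
TABLE_II = [
    [0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
]
```

On this matrix the first search returns the straight seam down column I with no black pixels on it. After `mark_seam` raises the energy of that column, a second search finds another seam that also crosses no black pixels, which is the behaviour the marking step is meant to produce.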

VII. GARBAGE CHARACTERS

During the whole process some garbage characters are produced. These are unnecessary, undesired segments of a character. In many cases they are merged with the parent character (the main part of the character), but in some cases the algorithm is unable to merge these small segments with their relevant segments, and they are treated as characters until they are declared garbage in the recognition phase.

Fig. 3 shows a line of Urdu text (upper line) and the characters segmented from it (lower line). The lower line shows both correctly segmented characters and the garbage characters produced during processing. The 5th segmented character from the right in the second line and the 2nd last segmented character from the right do not take their full or distinguishable forms, and even a human eye would not be able to recognize them correctly, because the segment looks more like ‘re’ (ر) than noon (ن) or noon-ghuna (ں).

VIII. RECOGNITION USING NEURAL NETWORK

The recognition phase is performed by a feed-forward neural network, which works as the second module of the software. It is further divided into training and simulation parts.

TABLE III. CHARACTERS AND GARBAGE PRODUCED

Character                                  Garbage
Noon (ن)
Chotee yee (ی)
Seen, Sheen, Swad, Dwad (س، ش، ص، ض)
be, pe, te and tay (ب، پ، ت، ٹ)
Yee (unsegmented) (ے)

a. NEURAL NETWORK ARCHITECTURE AND TRAINING

The multilayer feed-forward neural network (FFNN) used here for character recognition consists of 21x15 (315) input nodes, a single hidden layer of 2000 nodes and an output layer of 6 nodes. The Matlab transfer functions tansig and logsig are used for the hidden and output layers respectively. The training function ‘trainscg’ was used with all of its defaults because of its optimized memory usage.

The hidden layer of 2000 nodes was selected after testing different layer sizes for the best results, whereas the input layer of 315 nodes was chosen in view of the average size of the characters produced with the Arial font at size 36. With these parameters the FFNN took 2000 epochs to train, that is, to meet the error goal of 0.0005.
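As a rough illustration of this architecture, the sketch below builds a forward pass with tansig and logsig written out in Python and counts the trainable parameters of a 315-2000-6 network. The helper names (`make_layer`, `forward`, `param_count`, `decode`), the random initialisation and the reading of the 6 outputs as a thresholded binary code are assumptions; the paper does not say how the 56 classes are mapped onto 6 output nodes, only that 6 two-valued outputs would allow up to 64 distinct codes.

```python
import math
import random


def tansig(n):
    # Matlab's tansig(n) = 2 / (1 + exp(-2n)) - 1, which equals tanh(n).
    return math.tanh(n)


def logsig(n):
    # Matlab's logsig(n) = 1 / (1 + exp(-n)).
    return 1.0 / (1.0 + math.exp(-n))


def make_layer(n_in, n_out, rng, scale=0.05):
    """Random weights (one row per node) and zero biases; illustrative only."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def apply_layer(x, layer, transfer):
    weights, biases = layer
    return [transfer(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(x, hidden, output):
    """Input vector -> tansig hidden layer -> logsig output layer."""
    return apply_layer(apply_layer(x, hidden, tansig), output, logsig)


def param_count(n_in, n_hidden, n_out):
    """Number of weights and biases in a one-hidden-layer network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out


def decode(outputs, threshold=0.5):
    """Assumed decoding: threshold each output and read the bits as an integer."""
    code = 0
    for y in outputs:
        code = (code << 1) | (1 if y >= threshold else 0)
    return code
```

With the sizes reported above, `param_count(315, 2000, 6)` gives 644006 adjustable values, most of them in the input-to-hidden connections, which is consistent with the long training time reported later in the paper.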

b. TRAINING SET

The 41 alphabets were classified into 56 categories to train the neural net. For example, the characters sheen (ش) and swad (ص) are each used as a single class in all of their forms, but tay (ٹ) is divided into two classes. The same is the case with tee (ت). Some of the training samples are shown in Fig. 3 below.
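Before a segmented character can be used as a training sample or presented to the network, it has to be brought to the fixed 21x15 size that matches the 315 input nodes. The sketch below shows one way this could be done. Nearest-neighbour sampling, the reading of 21x15 as 21 rows by 15 columns, the row-by-row flattening order and the function names are assumptions, since the paper only states that the binary character is resized.

```python
def resize_nearest(img, out_rows=21, out_cols=15):
    """Resize a binary image (list of rows) by nearest-neighbour sampling."""
    in_rows, in_cols = len(img), len(img[0])
    return [[img[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]


def to_input_vector(img, out_rows=21, out_cols=15):
    """Resize the character and flatten it row by row into 315 input values."""
    resized = resize_nearest(img, out_rows, out_cols)
    return [float(p) for row in resized for p in row]
```

A segment of any size, for example 42x30 pixels, is reduced to a vector of exactly 315 values, so every one of the 100 samples per class has the same length regardless of how large the character was in the original image.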

TABLE II. PIXEL SELECTION

       i    ii   iii  iv   v    vi   vii  viii
i      0    0    0    0    1    0    1    0
ii     0    0    1    0    0    0    1    0
iii    0    1    1    1    1    0    1    0
iv     0    1    0    1    0    0    1    0
v      0    1    0    1    0    1    1    0
vi     0    0    0    1    0    1    1    0
vii    0    0    1    1    0    1    1    0

Fig.3. Line of Urdu text (above) Segmented character (below)

Fig.3. Training set of single and two classes

c. SIMULATION RESULTS

Neural network outputs for different characters are shown in Fig. 4. Recognition of the character family of bee (ب), pee (پ), tee (ت), tay (ٹ), cee (ث) and fee (ف) is around 80%. The same is the case for the character family of kaf (ک) and gaf (گ), as these are the simplest characters and, despite their similarity to each other, they are totally different from the other characters.

The character lam (ل), when used in the middle of a word, behaves like an alif (ا), which decreases its recognition percentage, but alif is not mistaken for lam (ل) in most cases. The characters waw (و) and choty yee (ی) are difficult for the NN to differentiate, because the segment of choty yee (ی) that remains after it produces garbage is very similar to waw (و). The characters fee (ف), mem (م) and ein (ع), when used in the middle form, can be confused with each other by the neural network during recognition, which leads to a low recognition percentage for them. The character noon (ن), when used at the beginning, looks like ze (ز) and zal (ذ) and thus produces low results.

In the segmentation part, garbage characters are produced during the segmentation of seen (س), sheen (ش), swad (ص), dwad (ض), noon (ن) and noon ghuna (ں), and in most cases these pass the character test during segmentation. Bee (ب), pee (پ), tee (ت), tay (ٹ), cee (ث) and fee (ف) also produce garbage characters, but in most cases these are identified as garbage. Fortunately, these characters produce garbage only when they are located at the end of a word. The combination of lam (ل) and alif (ا), when used in words like Islam (اسلام), forms something like a new character in the segmentation phase, as shown in Fig. 5. This needs to be treated carefully.

The algorithm produces the same results each time it is used on the same line of text in the same environment. The same holds for the results of a saved neural network on the same line of text.

Compared with the neural network training time of 5-7 hours, the simulation phase requires 0.131 seconds to segment and classify a character through a trained neural network, whereas the segmentation algorithm alone takes 0.078 seconds to segment a single character. It can therefore be deduced that the neural network execution time is approximately 0.05 seconds per character (0.131 - 0.078 = 0.053). The Matlab functions profile and profreport were used on a number of text images to find the average execution time.

IX. CONCLUSION

This paper describes a system for character recognition of compound printed Urdu script. Most of the errors (garbage characters) are produced at the last character of a word, when the word ends in noon or a character with a shape similar to noon. Because it is hard to determine which character is the last one, the problem cannot be overcome easily. A large percentage of the errors is produced by the characters seen (س), sheen (ش), swad (ص), dwad (ض), noon (ن) and noon ghuna (ں), whose garbage in most cases passes the character test during segmentation, whereas bee (ب), pee (پ), tee (ت), tay (ٹ), cee (ث) and fee (ف) also produce garbage characters in some cases.

Fig.4. Character-wise Recognition Percentage

Fig.5. Lam or Alif of Islam


REFERENCES

[1] Zaheer Ahmad, Jehanzeb Khan, “Urdu Nastaleeq OCR (Optical Character Recognition)”, Proceedings of World Academy of Science, Engineering and Technology, Volume 2, ISSN: 1307-6884, December 2007.

[2] “A ‘layman’s’ Urdu Alphabet”, Wikipedia.com. Feb. 13, 2009, available: http://en.wikipedia.org/wiki/Urdu_alphabet. [Accessed: Mar. 3, 2009]

[3] Amin, A. “Arabic Character Recognition”, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997, pp. 398.

[4] Tim Klassen, Towards Neural Network Recognition of Handwritten Arabic Letters, thesis for Master of Computer Science (M.C.Sc.), 2001.

[5] “A ‘layman’s’ Connectors and non-connectors”. available: http://www.columbia.edu/itc/mealac/pritchett/00urdu/urduscript/section00.html?urdu#00_02. [Accessed: Apr. 12, 2008]

[6] “A ‘layman’s’ Devangari and Urdu Alphabets”. Nov. 25, 2008, available: http://freenet-homepage.de/prilop/urdu-alphabet.html. [Accessed: Mar. 3, 2009]

[7] Shai Avidan and Ariel Shamir, “Seam carving for content-aware image resizing”. seamcarving.com, available: www.seamcarving.com. [Accessed: Mar. 3, 2009]

[8] Ahmed M. Zeki and Mohamad S. Zakaria, Challenges in Recognizing Arabic Character, International Islamic University Malaysia (IIUM), Kuala Lumpur, Malaysia, National University of Malaysia (UKM), Bangi, Selangor, Malaysia.

[9] A. Amin, “Off-line Arabic Character Recognition - the State of the Art”, Pattern Recognition, Vol. 31, No. 5, 517-530, 1998.

[10] F. Al-Fakhri, On-Line Computer Recognition of Hand-Written Arabic Text, Master’s Thesis, Science University of Malaysia, 1997.

[11] A. Zeki, Plausible Inference Approach to Character Recognition, Master’s Thesis, National University of Malaysia, 1999.

[12] A. Amin, H. Al-Sadoun and S. Fischer, “Hand-Printed Arabic Character Recognition System using An Artificial Network”, Pattern Recognition, Vol. 29, No. 4, pp. 663-675, 1996.

[13] T. Kanungo, G. Marton and O. Bulbul, “Performance Evaluation of Two Arabic Products”, in Proceeding of AIPR Workshop on Advances in Computer Assisted Recognition, SPIE, Vol. 3584, Washington DC, 1998.

[14] T. Kanungo, G. Marton and O. Bulbul, “OmniPage vs. Sakhr: Paired Model Evaluation of Two Arabic OCR Products”, in Proceeding of SPIE Conference on Document Recognition and Retrieval (VI), Vol. 3651, San Jose, 1999.

[15] A. Amin, “Off line Arabic Character Recognition - A Survey”, in Proceeding of the 4th International Conference Document Analysis and Recognition (ICDAR '97), pp. 596-599, 1997.

[16] K. Jumari and M. Ali, “A Survey and Comparative Evaluation of Selected off-line Arabic handwritten Character Recognition Systems”, Jurnal Teknology, Malaysian University of Technology, 2001.

[17] Inam Shamsheer, Zaheer Ahmad, OCR For Printed Urdu Script Using Feed Forward Neural Network, MLPR 2007: International Conference on Machine Learning and Pattern Recognition, Germany, 2007.

[18] Hyder, S.S., "A System for Generating Urdu/Farsi/Arabic Script", Information Processing 71, North Holland Publishing Co., Amsterdam, pp. 1145-1149, 1972.

[19] Hyder, S.S., Richer, F., "The Theory and Design of a System for Printing and Communicating in Arabic-Urdu-Farsi", Journal of Bio-Sciences Communications, Vol. 3, pp. 181-206, 1977.

[20] Larry Chang & I. Scott MacKenzie, “A Comparison of Two Handwriting Recognizers for Pen-based Computers”, 1994. available: http://www.yorku.ca/mack/CASCON94.html. [Accessed: Aug. 3, 2008]

[21] H. Bunke and P. S. P. Wang. Handbook of Character Recognition and Document Image Analysis. World Scientific Publishing, Singapore, 1997.

[22] S. Mori, H. Nishida, and H. Yamada. Optical Character Recognition, Wiley Interscience, New Jersey, 1999.

[23] Optical Character Recognition and the Years Ahead. The Business Press, Elmhurst, IL, 1969.

[24] No author. Auerbach on Optical Character Recognition. Auerbach Publishers, Inc., Princeton, 1971.

[25] S. V. Rice, G. Nagy, and T. A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, Boston, 1999.

[26] H. F. Schantz. The History of OCR. Recognition Technologies Users Association, Boston, 1982.

[27] C. Y. Suen. Character recognition by computer and applications. In T. Y. Young and K. S. Fu, editors, Handbook of Pattern Recognition and Image Processing. Academic Press, Inc., Orlando, FL, 1986, pp. 569–586.