
Confronting the Constraints for Optical Character Segmentation from Printed Bangla Text Image

Abu Saleh Md. Abir
United International University
Dhaka, [email protected]

Sanjana Rahman
United International University

Samia Ellin
United International University

Maisha Farzana
United International University

Md. Hridoy Manik
United International University

Chowdhury Rafeed Rahman
United International University


ABSTRACT

In a world of digitization, optical character recognition holds the key to automating the preservation of written history. An optical character recognition system converts printed images into editable text for better storage and usability. To be fully functional, the system needs to go through crucial stages such as pre-processing and segmentation. Pre-processing removes noise and skew from the printed data efficiently, whereas segmentation splits the image precisely into lines, words and characters for better conversion. These steps open the door to better accuracy and reliable results when a printed image is prepared for conversion. Our proposed algorithm is able to segment characters from both ideal and non-ideal cases of scanned or captured images, giving a reliable outcome.

KEYWORDS

Bangla characters, image processing, segmentation

ACM Reference Format:
Abu Saleh Md. Abir, Sanjana Rahman, Samia Ellin, Maisha Farzana, Md. Hridoy Manik, and Chowdhury Rafeed Rahman. 2020. Confronting the Constraints for Optical Character Segmentation from Printed Bangla Text Image. In Proceedings of . , 7 pages.

1 INTRODUCTION

Optical character recognition (OCR) is the process of converting printed text images into editable text. Physical media such as books, newspapers and important files can be destroyed easily or can deteriorate after a certain period of time, so converting them into a more persistent medium is the only option. An OCR system works as a digital medium that can store valuable information from physical media in an effective way. It is the key to a mechanism that is time-efficient, effortless and productive. Though Bangla is a popular language, it does not have a proper OCR system compared to other languages such as English. Bangla is a complex language whose writing structure differs from that of other languages. It has consonants (Fig. 1), vowels (Fig. 2), modified vowels (Fig. 3) and around 170 compound characters (Fig. 4) [17]. Such a complex writing structure needs a better segmentation process for conversion into digital media, which makes building applications for it difficult.

© 2020 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00

Figure 1: Bangla Consonants.

Figure 2: Bangla Vowels.

Figure 3: Bangla Modified Vowels.

Figure 4: Some Bangla Compound Characters.

For an OCR system to work properly, each character must be segmented correctly. A printed text image needs to be pre-processed and segmented before it can be converted into editable text. The main challenge is to prepare the image for segmentation. The pre-processing phase removes the skew of the image, straightens curved lines, eliminates unwanted noise and more. For segmentation to work, these obstacles need to be removed with care. Then, for an image to be ready for transformation into editable text, the text needs to be segmented into lines, words and characters accurately.

arXiv:2003.08384v4 [cs.CV] 11 Aug 2020

Given an image as input, we provide as output the segmented image of each character in the input image, along with each segmented character's line number and word number on that page. Our method works well even if the image capture condition is far from ideal. We do not tackle the challenges of character recognition in this research. When an image is scanned, it enters the pre-processing stage. First, the image is cropped to remove any borders around the text. Then the orientation of the image is corrected, and an algorithm is used to correct warped images and straighten curved text lines. If there is any noise in the image, it is removed. Finally, binarization is performed to convert the pixel values of the image into 1s and 0s. After pre-processing, the image goes through the segmentation phase. First, the image is scanned horizontally and each line is segmented. Then each line is scanned vertically to segment each word. Finally, the segmented words are used to segment each character, giving the final outcome.

We have reviewed Bangla OCR research from the perspective of finding limitations, and have implemented the reviewed methods to gain a full understanding of the challenges associated with Bangla OCR development. We have developed our own approach to overcome the current limitations of the pre-processing and segmentation phases of Bangla OCR systems. The accuracy of character recognition and sentence reconstruction depends largely on the precision of these phases. We have collected images of many non-ideal cases and have confronted the common obstacles in this research. Results of our algorithm on such non-ideal images show the effectiveness of our method.

2 LITERATURE REVIEW

Some problem scopes of Bangla OCR systems are mentioned in [13], which include the lack of standard samples and the complex structure of documents. One drawback of OCR systems is that they do not work properly if the resolution of the image is less than 300 dpi [14, 15]. A printed Bangla OCR system was developed using a single hidden BLSTM-CTC architecture which includes pre-processing, line detection and recognition [14]. But the system only works for the fixed font size which was used to train the model. Another approach is the two zone approach for character segmentation [20], which does not work for all fonts and sizes. The accuracy of the model was reduced due to some connected characters. A group of researchers suggested a complete Bangla OCR system methodology where they tested their methods on different fonts and sizes and got a good accuracy rate only for larger font sizes [3]. Tesseract is an open source OCR engine which works for many languages [18]. Recently, it has started to work on Bangla script as well. But one drawback of this OCR engine is that it does not work for all Bangla fonts and sizes. A complete model for a Bangla OCR system was proposed in [5], but it was evaluated only on three fixed fonts and no compound characters were included in the training data. One study showed degradation of printed Bangla script due to connected characters, broken characters and lightly or heavily printed documents [16].

A method proposed for character segmentation in handwritten documents uses vertex characterization of outer isothetic polygonal covers. This method performs poorly for some pathological cases, as it suffers from under-segmentation and over-segmentation problems [4]. To improve character segmentation accuracy, a new algorithm was proposed using a back propagation neural network [2], though the accuracy was not good. A high performance OCR for Bangla script was presented in [7], which works only for some specific documents and articles. The proposed method also has problems with segmentation, which lower the accuracy of the system. Another OCR system was proposed in [9] to recognize segmented characters using an artificial neural network. But this approach shows poor accuracy for character segmentation and fails to recognize similar Bangla characters.

To improve the output of Bangla OCR systems, different correction algorithms were proposed. The main problem of this line of research is the misclassification of some compound characters [1]. To recognize these compound characters, a deep CNN with ReLU as the nonlinear function was proposed in [6], but it only works for a fixed dataset and cannot recognize all the character classes. A performance evaluation of different algorithms for Bangla character recognition has been presented in [11], though this approach misclassifies some similar characters. Another recognition method [8] was proposed for handwritten Bangla numerals, which uses three variants of Local Binary Pattern (LBP) with a neural network and compares the performance of these three variants.

3 IDEAL CASE VS. NON-IDEAL CASE SCENARIO

Many of the works performed on Bangla OCR systems have dealt with ideal case scenarios only. Fig. 5 shows an example of an ideal and a non-ideal case scenario. An ideal case image is cropped precisely at the border of the page and the text lines are straight. The image is not skewed and there is no noise in the image. A non-ideal image may include areas outside of the page and needs to be cropped. The text lines are not always straight, and the page may be warped and may contain noise. It is therefore easier to work with ideal case images and segment lines, words and characters from them. The performance of our proposed algorithm is satisfactory even for non-ideal cases.

4 METHODOLOGY

We tackle the limitations of the steps shown in Fig. 6 in this research through smart algorithm design. Each step of our algorithm is described as follows:


Figure 5: Ideal Case (left) and Non-ideal Case (right) Scenario.

Figure 6: OCR Pre-Processing and Segmentation Phases

4.1 Image Cropping

When an image is captured from a book or document, the image may contain some area outside of the text page. One of the challenges here is to identify the text in the image and crop the image so that the unwanted parts outside of the text are eliminated. When binarized, these unwanted parts produce a chunk of black pixels, which results in poor segmentation of lines. Fig. 7 shows an unwanted chunk of black pixels marked with a red box.

To solve the problem, the input image is cropped. For appropriate cropping, we perform page-layout analysis to find out where the text is. Fig. 8 shows the step-by-step procedure for cropping a text image, which is described below:

Figure 7: Unwanted Chunk of Black Pixels.

(1) We are given a casually captured input image containing text.

(2) We use Canny edge detection to detect all pixels where there is an edge. The sources of edges in the image are the borders of the page and the text.

(3) To remove the borders, we use rank filters. The text areas have lots of white pixels, but the borders consist of thin single-pixel lines. A rank filter replaces each pixel with the median of its neighboring pixels. Applying a vertical and a horizontal rank filter therefore eliminates the one-pixel lines, removing the border pixels.

(4) The next step is to find the contours of the connected components, which now belong only to the text.

(5) We form a bounding box around the text and crop the image.

(6) Our output is the cropped image containing only text, free of the heterogeneous background.
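The border-removal and cropping steps above can be sketched on a toy binary edge map. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain Python lists instead of an image library, takes the edge map from step (2) as given, assumes a window of 3 for the rank (median) filter, and uses the bounding box of all surviving pixels in place of contour tracing. The names `median_1d`, `rank_filter` and `crop_to_text` are hypothetical.

```python
def median_1d(values, k=3):
    """Median-filter a 1-D sequence with window k, zero-padding both ends."""
    half = k // 2
    padded = [0] * half + list(values) + [0] * half
    return [sorted(padded[i:i + k])[half] for i in range(len(values))]


def rank_filter(edges, k=3):
    """Apply a vertical, then a horizontal median filter to a binary edge map."""
    # Vertical pass: filtering along each column removes 1-pixel horizontal lines.
    filtered_cols = [median_1d(col, k) for col in zip(*edges)]
    vertical = [list(row) for row in zip(*filtered_cols)]
    # Horizontal pass: filtering along each row removes 1-pixel vertical lines.
    return [median_1d(row, k) for row in vertical]


def crop_to_text(image, edges, k=3):
    """Crop image to the bounding box of the edge pixels that survive filtering."""
    kept = rank_filter(edges, k)
    rows = [r for r, row in enumerate(kept) if any(row)]
    cols = [c for c, col in enumerate(zip(*kept)) if any(col)]
    if not rows:
        return image  # nothing detected; leave the image untouched
    top, bottom, left, right = rows[0], rows[-1] + 1, cols[0], cols[-1] + 1
    return [row[left:right] for row in image[top:bottom]]


# Toy edge map: a page border along row 0 and column 0, plus a dense text blob.
edges = [[0] * 8 for _ in range(8)]
for i in range(8):
    edges[0][i] = 1
    edges[i][0] = 1
for r in range(3, 7):
    for c in range(3, 7):
        edges[r][c] = 1

cropped = crop_to_text(edges, edges)
```

In this toy case the single-pixel border lines vanish after the two passes, while the 4x4 blob survives and defines the crop box.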

4.2 Skew Correction

Input images may be skewed due to the position of the camera or the position of the book on the scanner. Many Bangla OCR systems fail in this common scenario. It is important to rotate the image to an angle at which the image is upright. We have used connected component analysis to check for the text lines. The sum of each row of pixels is calculated to generate a row-wise pixel sum histogram. The image is then rotated by a different candidate angle in each iteration. To determine the skew angle, we compare the maximum difference between the histogram peaks of each rotated image with that of the initial image. The skew of the image is corrected using the angle at which the difference between the peaks is maximum. Fig. 9 shows an example of an image before and after skew correction.
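The histogram-based angle search can be illustrated with a small sketch. The assumptions here are ours: ink pixels are given as (x, y) coordinates, candidate angles are whole degrees from -10 to 10, and histogram sharpness is scored as the sum of squared row counts, which stands in for the paper's "difference between the peaks" criterion. The function names are hypothetical.

```python
import math
from collections import Counter


def row_histogram(points, angle_deg):
    """Rotate ink-pixel coordinates by angle_deg and count pixels per row."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    return Counter(round(x * sin_a + y * cos_a) for x, y in points)


def peak_score(hist):
    """Sharpness of the row histogram: large when ink piles up in few rows."""
    return sum(count * count for count in hist.values())


def estimate_skew(points, candidates=range(-10, 11)):
    """Return the candidate rotation (degrees) giving the sharpest histogram."""
    return max(candidates, key=lambda a: peak_score(row_histogram(points, a)))


# Three synthetic text lines, then skewed by 5 degrees.
straight = [(x, y) for y in (0, 10, 20) for x in range(100)]
t = math.radians(5)
skewed = [(x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))
          for x, y in straight]

correction = estimate_skew(skewed)  # rotation that undoes the skew
```

Rotating only the coordinates, rather than resampling the whole image for every candidate, keeps the search cheap. The image itself would be rotated once, by the winning angle.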

4.3 Image Dewarping

Image dewarping is associated with perspective correction. Geometric distortion of the text lines in a captured image is a common real life scenario. Curved lines formed due to the view angle of the camera or a warped page lead to poor line segmentation, as most of the lines overlap with each other. As a result, multiple lines get segmented as a single line. Fig. 10 shows such a problematic scenario.

Figure 8: Steps of Text Image Significant Portion Cropping

The steps of our proposed image dewarp algorithm are as follows.

• We obtain the page boundaries, which consist of the four corners of the image that we cropped earlier.

• We detect text contours using connected component analysis. Each line is detected as a connected component and is assembled as a span.

• Each span of text is then remapped using estimated parameters.

• Finally, we optimize the remapping of the spans to minimize the re-projection error.

Fig. 11 shows an example of a geometrically distorted image before and after image dewarping.


Figure 9: Before (left) and After (right) Image Skew Correction Execution.

Figure 10: Curved Lines Due to View Angle of Camera or Warped Page.

Figure 11: Before (left) and After (right) Image Dewarp Execution.

4.4 Noise Reduction

Images can contain unwanted noise, which is mainly of two types. Salt and pepper noise is created by sudden disturbances in the image, while background noise occurs due to poor intensity during image capture. Since Bangla characters are complex, this noise reduction step is essential. Otherwise, some of the noise can be mistaken for parts of the characters. We use a denoising technique based on a Gaussian smoothing filter to eliminate this noise.
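As a rough illustration of Gaussian smoothing, the sketch below applies a separable Gaussian kernel to a grayscale image stored as nested lists. The kernel radius, the sigma value and the replicate-border handling are assumptions made for this example. The paper does not state its filter parameters.

```python
import math


def gaussian_kernel(sigma=1.0, radius=2):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    result = []
    for row in image:
        last = len(row) - 1
        result.append([
            sum(w * row[min(max(x + k - radius, 0), last)]
                for k, w in enumerate(kernel))
            for x in range(len(row))
        ])
    return result


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = convolve_rows(image, kernel)
    transposed = [list(col) for col in zip(*rows_done)]
    cols_done = convolve_rows(transposed, kernel)
    return [list(col) for col in zip(*cols_done)]


# A single bright "salt" pixel on a dark background.
noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 255
smoothed = gaussian_blur(noisy)
```

The isolated spike is spread over its neighborhood and its peak drops sharply, so a later threshold is less likely to mistake it for ink.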

4.5 Binarization

Binarization converts each pixel value of the image to either 1 or 0. This step is very important, as it helps us distinguish between the text and the background when we perform segmentation. First the image is converted into a gray-scale image, and then an adaptive threshold algorithm is used to binarize it.
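One common form of adaptive thresholding compares each pixel with the mean of its local neighborhood. The sketch below is a minimal version of that idea. The window radius, the offset of 10 gray levels, and the convention that text pixels become 1 are assumptions, since the paper does not specify which adaptive algorithm or parameters it uses.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """Mark a pixel as text (1) if it is darker than its local mean by `offset`."""
    height, width = len(gray), len(gray[0])
    binary = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Local window, clipped at the image borders.
            window = [gray[rr][cc]
                      for rr in range(max(0, r - radius), min(height, r + radius + 1))
                      for cc in range(max(0, c - radius), min(width, c + radius + 1))]
            local_mean = sum(window) / len(window)
            binary[r][c] = 1 if gray[r][c] < local_mean - offset else 0
    return binary


# Background fades from 200 (left) to 130 (right); two dark stroke pixels.
gray = [[200 - 10 * c for c in range(8)] for _ in range(5)]
gray[2][2] = 40
gray[2][5] = 30

binary = adaptive_binarize(gray)
```

Because the threshold follows the local brightness, the uneven background stays 0 across the whole gradient, whereas a single global cut-off between 130 and 200 would turn part of the background into ink.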

4.6 Line Segmentation

Lines are easily detected by scanning an image horizontally. First, the pixel values are summed for each row and the row sums are compared. Line segmentation is performed where the sum of the pixel values is close to zero (Fig. 12).
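The row-sum rule can be written as a short scan over a binary image. In this sketch, text pixels are assumed to be 1, and `eps` is a hypothetical tolerance for "close to zero" (0 by default).

```python
def segment_lines(binary, eps=0):
    """Return (top, bottom) row spans, bottom exclusive, for each text line."""
    row_sums = [sum(row) for row in binary]
    lines, start = [], None
    for r, total in enumerate(row_sums):
        if total > eps and start is None:
            start = r                    # first inked row of a new line
        elif total <= eps and start is not None:
            lines.append((start, r))     # blank row closes the current line
            start = None
    if start is not None:                # line touching the bottom edge
        lines.append((start, len(row_sums)))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
lines = segment_lines(page)  # [(1, 3), (4, 5)]
```

Raising `eps` slightly makes the scan tolerant of stray noise pixels in the gaps between lines.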

Figure 12: Line Segmentation.

Here, we faced another challenge while working with multiple font sizes on a single page, where the lines fail to be segmented properly. When multiple font sizes are present in the same image, line segmentation is performed for the bigger font size. As a result, not all the lines are segmented correctly. As shown in Fig. 13, the first two lines get segmented together due to different font sizes.

Figure 13: Problem with Multiple Font Size in Line Segmentation.

We resolve this problem using a smart line segmentation technique. The steps of this technique are as follows.

• First, we segment all the lines and check whether all the lines are of the same height.

• If any segmented line has a height greater than the average height, that segmented line image contains fonts of multiple sizes. To separate the different fonts, we scan and segment the image horizontally.

• Line segmentation is performed again on the two segmented images to separate any lines that were segmented together due to multiple fonts.

• The larger font text portion is resized and attached to the first segmented line.
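The first three steps above can be sketched as follows. This is one possible reading of the technique, with several assumptions: a line counts as "tall" when its height exceeds the average line height, the differently sized text blocks sit side by side and are separated by at least `col_gap` blank columns, and the final resize-and-attach step is omitted. The names `ink_runs`, `resegment` and `col_gap` are hypothetical.

```python
def ink_runs(profile, min_gap=1):
    """Spans (start, end) of inked entries; gaps shorter than min_gap are bridged."""
    spans, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def resegment(binary, col_gap=2):
    """Split over-tall lines into column blocks and re-run line segmentation."""
    lines = ink_runs([sum(row) for row in binary])
    if not lines:
        return []
    average = sum(bottom - top for top, bottom in lines) / len(lines)
    boxes = []
    for top, bottom in lines:
        band = binary[top:bottom]
        if bottom - top <= average:
            boxes.append((top, bottom, 0, len(binary[0])))
            continue
        # Taller than average: assume mixed font sizes sit side by side.
        col_sums = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_sums, min_gap=col_gap):
            block = [row[left:right] for row in band]
            for t, b in ink_runs([sum(row) for row in block]):
                boxes.append((top + t, top + b, left, right))
    return boxes


# A 3-row "large font" glyph beside two 1-row "small font" lines, then two normal lines.
page = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1, 1],
    [0] * 8,
    [1] * 8,
    [0] * 8,
    [1] * 8,
]
boxes = resegment(page)  # (top, bottom, left, right), ends exclusive
```

In the toy page, the first band is three rows tall against an average of about 1.7, so it is split into the large glyph and the two small lines that were originally merged with it.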

4.7 Word Segmentation

Words are easily segmented from segmented line images by scanning vertically. First, the pixel values are summed for each column of a segmented line image. Word segmentation is performed where the sum of the pixel values is close to zero (Fig. 14).
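Combining row sums and column sums gives each word a line number and a word number, as in the output format described in the introduction. The sketch below assumes text pixels are 1 and that any fully blank column separates two words, which is plausible for Bangla because characters within a word are joined by the matra line. The function names and the tuple layout are illustrative.

```python
def runs(profile):
    """Half-open (start, end) spans where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(list(profile) + [0]):   # sentinel closes the last run
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    return spans


def segment_words(binary):
    """Return (line_no, word_no, (top, bottom, left, right)) for every word."""
    words = []
    row_sums = [sum(row) for row in binary]
    for line_no, (top, bottom) in enumerate(runs(row_sums)):
        col_sums = [sum(col) for col in zip(*binary[top:bottom])]
        for word_no, (left, right) in enumerate(runs(col_sums)):
            words.append((line_no, word_no, (top, bottom, left, right)))
    return words


page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1],
]
words = segment_words(page)
```

Each entry can be used directly to crop the word image, for example `[row[left:right] for row in page[top:bottom]]`.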

Figure 14: Word Segmentation.

4.8 Character Segmentation

Character segmentation is performed on segmented word images. It is the most difficult of the three segmentation phases. Bangla characters are connected to each other by a headline known as the matra line. Bangla characters also have overlapping modified vowels, which make the segmentation process more complicated. Separating these overlapping characters and connecting them back again makes the task difficult. Character segmentation is divided into two parts: removal of the detected matra line and segmentation of each character.

Figure 15: Detection of Matra Line Region.

To separate each character, we need to detect the matra line and then remove it. To detect the matra line properly, we divide the word image in half horizontally, and the matra line is detected in the upper half of the image where the sum of the pixel values of a row is greater than 60%. Fig. 15 shows the matra line region for a word.
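The detection rule can be sketched in a few lines. We assume here that the 60% threshold refers to the fraction of the word width covered by ink in a row, which the text does not state explicitly. `find_matra_rows` and `ratio` are hypothetical names.

```python
def find_matra_rows(word, ratio=0.6):
    """Rows in the upper half whose ink covers more than `ratio` of the width."""
    height, width = len(word), len(word[0])
    return [r for r in range(height // 2) if sum(word[r]) > ratio * width]


word = [
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],   # matra line: ink across the full width
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],   # 80% ink, but in the lower half, so ignored
    [0, 0, 0, 1, 0],
]
matra_rows = find_matra_rows(word)  # [1]
```

Restricting the search to the upper half keeps wide horizontal strokes near the bottom of a character from being mistaken for the matra line.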

After removing the matra line, we can find gaps between the characters. Characters can then be detected by scanning the image vertically. First, the pixel values are summed for each column. Character segmentation is performed where the sum of the pixel values is close to zero (Fig. 16). A character segmentation is considered correct if a consonant, a vowel or a compound character is segmented alone or together with a modified vowel. Fig. 17 shows examples of some correctly segmented characters.

Figure 16: Character Segmentation.
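A minimal sketch of this step is given below. It takes the matra row indices as an argument, leaves those rows out of the column profile, and splits the word at empty columns. The function name and the half-open span format are assumptions for illustration.

```python
def segment_characters(word, matra_rows):
    """Ignore the matra rows, then split the word at empty columns."""
    skip = set(matra_rows)
    body = [row for r, row in enumerate(word) if r not in skip]
    col_sums = [sum(col) for col in zip(*body)]
    spans, start = [], None
    for c, total in enumerate(col_sums + [0]):   # sentinel closes the last span
        if total and start is None:
            start = c
        elif not total and start is not None:
            spans.append((start, c))
            start = None
    return spans


word = [
    [1, 1, 1, 1, 1, 1, 1],   # matra line joining the characters
    [1, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0],
]
with_matra = segment_characters(word, matra_rows=[])      # [(0, 7)]
without_matra = segment_characters(word, matra_rows=[0])  # [(0, 3), (4, 6)]
```

With the matra line left in place, every column contains ink and the whole word comes back as one span. Once the matra row is ignored, the gap at column 3 separates the two characters.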

Figure 17: Correctly Segmented Characters.

5 RESULTS AND DISCUSSION

Segmentation of Bangla text images produces segmented lines, words and characters. Our algorithm was run on 10 different Bangla text images with different fonts. The results are shown in Table 1. We have been able to segment all the lines and almost all the words accurately. For character segmentation, the result is satisfactory, with 94.32% accuracy. In spite of some limitations in our work, the accuracy levels of line, word and character segmentation are high.

We have not performed character recognition in the current research. State-of-the-art Convolutional Neural Network (CNN) models are capable of performing multi-output classification on an input image if provided with properly labeled training samples [10, 12, 19]. We can formulate character recognition as a two-task problem. Task one is main character recognition, where each of the main vowels, consonants and compound characters is a separate class. Task two is modified vowel recognition, where each modified vowel is a separate class. Both tasks must have a null class, because a segmented character image may contain only a modified vowel or only a main character. We leave this data-driven extension as future work.
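To make the two-task formulation concrete, the sketch below encodes the content of a segmented character image as a pair of class indices, one per task, with a null class in each. The vocabularies are tiny illustrative samples and the `NULL` token and `encode` helper are hypothetical. They are not part of the paper.

```python
NULL = "<none>"  # hypothetical null class used by both tasks

# Tiny illustrative vocabularies; a real system would list every class.
MAIN_CLASSES = [NULL, "ক", "খ", "অ", "ক্ষ"]
MODIFIER_CLASSES = [NULL, "া", "ি", "ু"]


def encode(main=None, modifier=None):
    """Map a segmented image's content to (main_id, modifier_id)."""
    main_id = MAIN_CLASSES.index(NULL if main is None else main)
    modifier_id = MODIFIER_CLASSES.index(NULL if modifier is None else modifier)
    return main_id, modifier_id


labels = [
    encode("ক", "া"),      # consonant together with a modified vowel
    encode("ক্ষ"),          # compound character alone
    encode(modifier="ি"),  # only a modified vowel was segmented
]
```

A multi-output classifier would then predict both indices for each image, and index 0 in either task signals that the corresponding part is absent.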

Our segmentation process does not always give perfectly segmented characters. Some common problems occur during removal of the matra line and during character segmentation.

Table 1: Results of Line, Word and Character Segmentation

Types       No. of Samples   No. Segmented   Accuracy (%)
Line        312              312             100
Word        3525             3496            99.1
Character   11208            10572           94.32

When we straighten a curved line, a few words may stay tilted. In such cases, the matra line goes undetected, because the horizontal pixel sum criterion does not work here. This makes it harder for us to eliminate the matra line. Fig. 18 shows the matra line region being detected for a slightly tilted word, along with the row-wise black pixel histogram. Fig. 19 shows a matra line region which stays undetected and was not removed. As a result, we fail to segment such words into characters properly. Fig. 20 shows some examples where the character segmentation of the words is not done correctly because removal of the matra line was not successful.

Figure 18: Region of Matra Line Detected.

Figure 19: Region of Matra Line Undetected.

Figure 20: Matra Line Detection Failure Scenario Examples.

Some of the Bangla printed characters overlap with each other depending on font type. It is difficult to segment such overlapping characters as there is no gap between them, which leads to poor character segmentation. If we try to segment such overlapping characters, some pixels from one character will be segmented with the other overlapping character. Fig. 21 shows some of the failure cases in character segmentation due to overlapping of the characters.

Figure 21: Character Overlap Example Scenario.

6 CONCLUSION

This research aims at confronting the common challenges associated with non-ideal capture conditions of printed Bangla text images. As a prerequisite for perfecting the segmentation phase of a Bangla OCR system, we have gone through a number of pre-processing steps. We have also identified and confronted common limitations of the line, word and character segmentation phases. Properly segmented characters are the key to accurate recognition and reconstruction of the characters in digital media. Future research may aim at resolving the character overlap and matra line detection problems in the character segmentation phase. To develop a fully functional Bangla OCR system, the current research has to be extended with a character recognition algorithm capable of identifying the main character (vowel, consonant or compound character) and the modified vowel from a segmented character image.

REFERENCES

[1] Md Sajib Ahmed, Teresa Gonçalves, and Hasan Sarwar. 2016. Improving Bangla OCR output through correction algorithms. (2016), 338–343.
[2] Shamim Ahmed and Mohammod Abul Kashem. 2013. Enhancing the Character Segmentation Accuracy of Bangla OCR using BPNN. International Journal of Science and Research (IJSR) ISSN (Online) (2013), 2319–7064.
[3] Md Mahbub Alam and M Abul Kashem. 2010. A complete Bangla OCR system for printed characters. JCIT 1, 01 (2010), 30–35.
[4] Soumen Bag, Partha Bhowmick, Gaurav Harit, and Arindam Biswas. 2011. Character segmentation of handwritten Bangla text by vertex characterization of isothetic covers. (2011), 21–24.
[5] Ahmed Asif Chowdhury, Ejaj Ahmed, Shameem Ahmed, Shohrab Hossain, and Chowdhury Mofizur Rahman. 2002. Optical Character Recognition of Bangla Characters using neural network: A better approach. (2002).
[6] Asfi Fardous and Shyla Afroge. 2019. Handwritten Isolated Bangla Compound Character Recognition. (2019), 1–5.
[7] Md Abul Hasnat, SM Murtoza Habib, and Mumit Khan. 2008. A high performance domain specific OCR for Bangla script. (2008), 174–178.
[8] Tasnuva Hassan and Haider Adnan Khan. 2015. Handwritten bangla numeral recognition using local binary pattern. (2015), 1–4.
[9] SK Alamgir Hossain and Tamanna Tabassum. 2014. Neural net based complete character recognition scheme for Bangla printed text books. (2014), 71–75.
[10] Francisco Massa, Renaud Marlet, and Mathieu Aubry. 2016. Crafting a multi-task CNN for viewpoint estimation. arXiv preprint arXiv:1609.03894 (2016).
[11] Syed Irfan Ali Meerza, Moinul Islam, and Md Mohiuddin Uzzal. 2019. Performance Evaluation of Different Algorithms for Handwritten Isolated Bangla Character Recognition. (2019), 412–416.
[12] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2016. Ordinal regression with multiple output cnn for age estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4920–4928.
[13] Farjana Yeasmin Omee, Shiam Shabbir Himel, Md Bikas, and Abu Naser. 2012. A complete workflow for development of bangla OCR. arXiv preprint arXiv:1204.1198 (2012).
[14] Debabrata Paul and Bidyut Baran Chaudhuri. 2019. A BLSTM Network for Printed Bengali OCR System with High Accuracy. arXiv preprint arXiv:1908.08674 (2019).
[15] Ahnaf Farhan Rownak, Md Fazle Rabby, Sabir Ismail, and Md Saiful Islam. 2016. An efficient way for segmentation of Bangla characters in printed document using curved scanning. (2016), 938–943.
[16] Manoj Kumar Shukla and Haider Banka. 2012. A study of different kinds of degradation in printed Bangla script. (2012), 119–123.
[17] Md Sifat, Habibur Rahman, Chowdhury Rafeed Rahman, Mohammad Rafsan, Md Rahman, et al. 2020. Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern. arXiv preprint arXiv:2003.03484 (2020).
[18] Ray Smith. 2007. An overview of the Tesseract OCR engine. 2 (2007), 629–633.
[19] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2015. HCP: A flexible CNN framework for multi-label image classification. IEEE transactions on pattern analysis and machine intelligence 38, 9 (2015), 1901–1907.
[20] Tasnim Zahan, Muhammed Zafar Iqbal, Mohammad Reza Selim, and Mohammad Shahidur Rahman. 2018. Connected Component Analysis Based Two Zone Approach for Bangla Character Segmentation. (2018), 1–4.
