


Expert Systems with Applications 39 (2012) 494-507, doi:10.1016/j.eswa.2011.07.040

A knowledge-based system for extracting text-lines from mixed and overlapping text/graphics compound document images

Yen-Lin Chen a, Zeng-Wei Hong b, Cheng-Hung Chuang b,*
a Department of Computer Science and Information Engineering, National Taipei University of Technology, 1, Sec. 3, Chung-hsiao E. Rd., Taipei 10608, Taiwan
b Department of Computer Science and Information Engineering, Asia University, 500 Liufeng Rd., Wufeng, Taichung 41354, Taiwan

Article info

Keywords: Document image analysis; Knowledge-based systems; Text extraction; Region segmentation; Complex compound document images


* Corresponding author. E-mail address: [email protected] (C.-H. Chuang).

Abstract

This paper presents a new knowledge-based system for extracting and identifying text-lines from various real-life mixed text/graphics compound document images. The proposed system first decomposes the document image into distinct object planes to separate homogeneous objects, including textual regions of interest, non-text objects such as graphics and pictures, and background textures. A knowledge-based text extraction and identification method then obtains the text-lines with different characteristics in each plane. The proposed system offers high flexibility and expandability: merely updating rules allows it to cope with new types of real-life complex document images. Experimental and comparative results prove the effectiveness of the proposed knowledge-based system and its advantages in extracting text-lines with a large variety of illumination levels, sizes, and font styles from various types of mixed and overlapping text/graphics complex compound document images.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Despite the recent adoption of electronic documents and books, paper-based published documents and books remain widespread. Because paper-based publications are less convenient than electronic publications for archiving, modification, and retrieval, practical applications of document image analysis for paper-based documents and books have recently attracted a lot of attention. Examples of these applications include text information extraction and analysis, optical character recognition, document retrieval, compression, and archiving (Doermann, 1998; O'Gorman & Kasturi, 1995). Among these applications, textual information extraction is the most essential task in document image analysis. Therefore, researchers have presented several studies on textual information extraction and analysis from monochromatic document images (Fisher, Hinds, & D'Amato, 1990; Fletcher & Kasturi, 1988; Lee, Choy, & Cho, 2000; Niyogi & Srihari, 1996; Shih, Chen, Hung, & Ng, 1992). Most of these methods rely on prior publication-specific knowledge of printed text-lines on monochromatic document images with regular typesetting and layouts. Recent advances in multimedia publishing and mixed text/graphics printing technology have enabled an increasing number of real-life paper publications that print various stylistic text-lines with graphical, pictorial, and non-text decorative objects, often on colorful, textured backgrounds. However, conventional text extraction methods do not perform well when extracting text-lines from real-life mixed and overlapping text/graphics compound document images. Extracting text-lines from mixed text/graphics compound document images is much more complicated than extracting them from monochromatic document images. This is because text-lines in document images are often printed in various colors or levels of illumination, and superimposed on graphics, pictures, or other textured backgrounds. Therefore, a system that can efficiently locate and extract text-lines printed in the pictorial and textured regions of complex compound documents remains an open research topic in document image analysis.

Researchers have developed various methods of extracting text regions from mixed text/graphics compound document images. Some of these methods are based on the fact that most textual regions show distinctive texture features unlike those of other non-text background regions (Hasan & Karam, 2000; Jain & Bhattacharjee, 1992; Wu, Manmatha, & Riseman, 1999; Yuan & Tan, 2001). Such methods adopt texture detection filters to extract the texture features of possible text regions, and use these features to extract text from document images. Jain and Bhattacharjee's (1992) method extracts the texture features of text regions by applying Gabor filters, and segments the text regions of interest based on these texture features. However, the limitation of this method is that its text-line extraction performance may be strongly influenced by variations in font size and style. Wu et al.'s (1999) Textfinder system uses nine second-order Gaussian derivative filters to obtain texture feature vectors for each pixel at three different scales, and applies the K-means clustering process to these texture feature vectors to classify the corresponding pixels into text regions. Hasan and Karam (2000) introduced a morphological texture extraction scheme that recursively applies morphological dilation and erosion operations to the extracted closure edge textures to locate text regions. Texture information is useful for detecting the existence of textual objects in a specific region, and texture-based extraction methods are capable of identifying most textual regions in mixed text/graphics document images. However, most of these methods fail to provide consistent accuracy in locating text-lines, which in turn reduces the performance of subsequent document analysis processes. Moreover, texture feature extraction is very time consuming for practical document image processing applications. When textual objects border or overlap graphical objects, non-text texture patterns, or backgrounds with similar texture features, these non-text objects may be identified as textual objects. Such non-text objects smear the text-lines in the extracted regions.

Researchers have recently proposed several color-segmentation-based methods for text extraction from color document images. Jain and Yu (1998) used bit-dropping quantization and the single-link color-clustering algorithm to decompose a color document into a set of foreground images in the RGB color space. Strouthopoulos et al.'s (2002) adaptive color reduction technique utilizes an unsupervised neural network classifier and a tree-search procedure to determine prototype colors. Some alternative color spaces can determine prototype colors for finding textual objects of interest. Yang and Ozawa (1999) used the HSI color space to segment homogeneous color regions and extract bibliographic information from book covers, while Sobottka, Kronenberg, Perroud, and Bunke (2000) presented a hybrid method combining top-down and bottom-up analysis techniques to extract text-lines from color journal and book covers. Hase, Shinokawa, Yoneda, and Suen (2001) applied a histogram-based approach to select prototype colors in the CIELab color space, and adopted a multi-stage relaxation approach to label and classify extracted homogeneous connected-components to obtain character strings. However, most of these methods have difficulty extracting text-lines embedded in complex backgrounds or touching other pictorial and graphical objects. This is because the prototype colors are determined from a global view, making it difficult to select appropriate prototype colors that differentiate textual objects from nearby pictorial objects and complex backgrounds without sufficient contrast.

Moreover, few of the methods above can cope with the various types of real-life complex compound document images. The extensibility and flexibility that a knowledge-based system provides are suitable for a large variety of practical applications. As a result, many researchers have recently applied knowledge-based systems to image processing (Avci & Avci, 2009; Cho, Quek, Seah, & Chong, 2009; Cucchiara, Piccardi, & Mello, 2000; Kang & Bae, 1997; Lee et al., 2000; Levine & Nazif, 1985; Niyogi & Srihari, 1996; Subasic, Loncaric, & Birchbauer, 2009). Levine and Nazif (1985) proposed an efficient three-level knowledge-based model for low-level scene image segmentation. For thresholding applications in object segmentation, Kang and Bae (1997) developed an adaptive image thresholding method that integrates the fuzzy inference method with the logical level technique to extract character objects with linearity features. Avci and Avci (2009) proposed an expert system for analyzing fuzzy entropy; their system selects an optimal threshold for segmenting foreground objects. Subasic et al. (2009) presented an expert-system-based face segmentation system that integrates a low-level image segmentation module and a multi-stage rule-based labeling system. Cucchiara et al. (2000) and Cho et al. (2009) successfully applied knowledge-based systems to real-time video-based traffic monitoring applications. In previous studies on document image analysis, Fisher, Hinds, and D'Amato (1990), Niyogi and Srihari (1996), and Lee et al. (2000) applied the concepts of knowledge-based systems to the structural and geometric analysis of typical document images such as newspaper and journal images. However, their methods are only applicable to publication-specific monochromatic documents with regular and ordered layouts, and cannot easily process various types of mixed and overlapping text/graphics compound document images.

Levine and Nazif (1985), Niyogi and Srihari (1996), and Lee et al. (2000) applied a three-level rule-based model, consisting of knowledge, control, and strategy rules, to low-level scene image segmentation and the structural analysis of newspaper and journal images. This rule-based reasoning model provides feasible modeling for applications in the image analysis domain, and offers high flexibility for further improving and extending the system by updating the knowledge rules in the inference mechanism. The knowledge-based system proposed in this study adopts this efficient three-level rule-based reasoning model for text-line extraction and identification in real-life complex compound document images. Knowledge rules encode the geometric and statistical features of text-lines, such as colors, illumination levels, sizes, and font styles, and form two rule sets: text region extraction rules and text-line identification rules.

This study proposes a novel knowledge-based system for extracting text-lines from various types of mixed and overlapping text/graphics compound document images that contain text-lines with different illumination levels, sizes, and font styles. The text-lines can be superimposed on various background objects with uneven, gradational, and sharp variations in contrast, illumination, and texture, such as figures, photographs, pictures, or other background textures. This system first applies the multi-plane segmentation technique to decompose the document image into distinct object planes, extracting and separating homogeneous objects including textual regions of interest, non-text objects such as graphics and pictures, and background textures (Chen & Wu, 2009). Because this multi-plane segmentation technique processes document images regionally and adaptively based on their local features, the proposed method can easily handle text-lines that border or overlap pictorial objects and backgrounds with uneven, gradational, and sharp variations in contrast, illumination, and texture. The system then applies a knowledge-based text extraction and identification procedure to the resulting planes to detect, extract, and identify text-lines with various characteristics in each plane. This method consists of two processing phases: the text region extraction phase and the text-line identification phase. Knowledge rules encode the geometric and statistical features of text-lines, such as different illumination levels, sizes, and font styles, and establish the two rule sets of text region extraction and text-line identification according to the phase in which they are performed. To perform text-line extraction and identification on real-life complex compound document images, the inference engine of the proposed system is based on hierarchically structured control and strategy rules. The proposed system enables high flexibility and expandability: merely updating rules allows it to cope with new and various types of real-life complex document images. Experimental results demonstrate that the proposed knowledge-based approach provides accurate extraction of text-lines with different illumination levels, sizes, and font styles from various complex compound document images.

2. Multi-plane segmentation approach

Complex compound document images often contain text-lines with different illumination levels, sizes, and font styles, printed on varying or inhomogeneous background objects with uneven, gradational, and sharp variations in contrast, illumination, and texture. Examples of these backgrounds include illustrations, photographs, pictures, or other background patterns. A critical problem in text extraction is that no global segmentation technique works well for such document images. Statistical features computed from a global view cannot identify these text-lines when the regions of the text-lines of interest, which consist of multiple colors or gray intensities, are smaller than those of the touching pictorial objects and complex backgrounds with indistinct contrast. Fig. 1(a) shows a typical example of these characteristics. This sample image consists of differently colored text-lines printed on a varying and shaded background. More sample images of mixed and overlapping text/graphics compound document images can also be found in Section 4 and our experimental sample database.

Fig. 1. An example of object planes obtained from the test image by the multi-plane segmentation (image size = 1388 × 1368).

For the purpose of extracting textual objects of interest from such compound document images, observing some localized regions makes it much easier to distinguish the statistical features of the textual objects, pictorial objects, and backgrounds. Therefore, this study applies a regional and adaptive technique, called the multi-plane segmentation approach, presented in our previous work (Chen & Wu, 2009), to segment textual objects from complex document images. This approach decomposes the document image into separate object planes by applying two processing stages: automatic localized histogram multilevel thresholding, and multi-plane region matching and assembling. The first stage decomposes distinct objects embedded in block regions into separate "sub-block regions (SRs)" by applying the localized histogram multilevel thresholding process. Afterward, the second stage applies the multi-plane region matching and assembling process to the resulting sub-block regions to classify and arrange them into homogeneous object planes. This two-stage process extracts homogeneous objects, including textual regions of interest, non-text objects such as graphics and pictures, and background textures, and separates them into distinct object planes. The proposed knowledge-based text-line extraction and identification approach is then performed on the resulting planes to extract and identify textual objects with different characteristics in the respective planes.

2.1. Localized multilevel thresholding

The multi-plane segmentation process begins by applying a color-to-grayscale transformation to the RGB components of the pixels in a color document image to obtain its illumination image Y. The resulting illumination image Y is sectored into non-overlapping rectangular block regions measuring M_H × M_V. To facilitate analysis in the following stage, the objects of interest must be extracted from these localized block regions into separate "sub-block regions," each of which contains objects with homogeneous features. To achieve this goal, an efficient multilevel thresholding technique must automatically determine a suitable number of thresholds for segmenting each block region into different decomposed object regions. Using the properties of discriminant analysis, we have developed an automatic multilevel global thresholding technique for image segmentation (Wu, Chen, & Chiu, 2005). This technique extends the concept of the discriminant criterion to analyzing the separability among the gray levels in an image; it automatically determines the best number of thresholds, and utilizes a fast recursive selection strategy to select the optimal thresholds for segmenting the image into separate objects with similar features in a computationally frugal way. In this study, this localized histogram multilevel thresholding technique is adopted to decompose distinct objects with homogeneous features in localized regions into separate sub-block regions.
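As a rough illustration of this stage, the sketch below applies a per-block multilevel threshold in Python. It is a minimal sketch, not the authors' implementation: scikit-image's threshold_multiotsu with a fixed number of classes stands in for the automatic discriminant-based selection of the number of thresholds (Wu, Chen, & Chiu, 2005), and the block size and function name are illustrative.

    import numpy as np
    from skimage.filters import threshold_multiotsu

    def localized_multilevel_threshold(Y, mh=64, mv=64, classes=3):
        """Decompose each MH x MV block of the illumination image Y into
        per-block gray-level classes, i.e. 'sub-block regions (SRs)'."""
        h, w = Y.shape
        labels = np.zeros((h, w), dtype=np.int32)
        for y0 in range(0, h, mv):
            for x0 in range(0, w, mh):
                block = Y[y0:y0 + mv, x0:x0 + mh]
                try:
                    th = threshold_multiotsu(block, classes=classes)
                except ValueError:
                    continue  # near-uniform block: keep a single class (label 0)
                # np.digitize maps each pixel to the class between thresholds
                labels[y0:y0 + mv, x0:x0 + mh] = np.digitize(block, bins=th)
        return labels

In the authors' method the number of classes per block is chosen automatically by the discriminant criterion rather than fixed, which is what lets blocks with different local complexity decompose into different numbers of SRs.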

2.2. Multi-plane region matching and assembling process

After decomposing all localized block regions into several separate classes of pixels using the localized multilevel thresholding procedure, the various objects embedded or superimposed in different background objects and textures are separated into relevant SRs. This necessitates a methodology for grouping them into meaningful objects, especially textual objects of interest, for further extraction. Therefore, this study applies an effective segmentation approach, called the multi-plane region matching and assembling process, introduced in our previous study (Chen & Wu, 2009). This segmentation process adopts both the localized spatial dissimilarity relation and global feature information to classify and assemble the SRs into a set of "object planes" with homogeneous features, especially textual regions of interest. The multi-plane region matching and assembling process recursively performs the following three phases: the initial plane selection phase, the matching phase, and the plane construction phase.

To assemble the object planes of interest, the process first performs the initial plane selection phase on the unclassified SRs to determine a representative set of seed SRs, and then sets up N initial object planes using these selected seed SRs. Afterward, the matching phase is performed on the rest of the unclassified SRs in the Pool and these initial planes to determine the association and belongingness of these SRs to the existing object planes. For the unclassified SRs that have perceptibly distinct features compared to the currently existing planes, the plane construction phase creates and initializes an appropriate new plane for assembling SRs with such features into a new plane. This forms another homogeneous object region in the subsequent matching phase recursion. After performing the first pass of the multi-plane region matching and assembling process, the segmentation process recursively performs the matching phase and the plane construction phase on the rest of the unclassified SRs in the Pool and the emerging planes until each SR has been classified and associated with a particular plane, and the Pool is cleared. As a result, the whole illumination image Y is segmented into a set of separate object planes {P_q : q = 0, ..., L − 1} after completing the multi-plane region matching and assembling process. Each of these object planes consists of homogeneous objects with connected and similar features, such as textual regions of interest, non-text objects such as graphics and pictures, and background textures. Consequently,

∪_{q=0}^{L−1} P_q = Y, with P_{q1} ∩ P_{q2} = ∅ for q1 ≠ q2   (1)

where L is the number of resulting planes obtained.

Fig. 1 shows an example of the processing results for the sample image in Fig. 1(a) using the multi-plane segmentation process. In the sample image of Fig. 1(a), several differently colored textual regions are printed on a varying and shaded background. After obtaining the SRs from the localized multilevel thresholding procedure, the region matching and assembling process is applied to the obtained SRs. The resulting SRs are analyzed and assembled by recursively performing the matching and plane construction phases. These phases segment the homogeneous objects, including all textual objects and background textures, into separate object planes. This multi-plane segmentation process produces five major object planes P0-P4 (insignificant planes are discarded), as Fig. 1(b)-(f) show. Within these object planes, the planes P1 and P2 in Fig. 1(c) and (d), respectively, contain the text-lines of interest. These planes reveal that several textual regions with different characteristics are separated. The following section presents a knowledge-based system for extracting and identifying actual text-lines from these object planes.
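The matching and plane-construction loop described above can be illustrated with the following much-simplified sketch: each SR is summarized by a single mean gray-level feature, and an SR joins the nearest existing plane when the feature difference falls below an illustrative tolerance; otherwise it is left for a later pass to seed a new plane. The actual process (Chen & Wu, 2009) additionally uses localized spatial dissimilarity relations; the function name and tolerance value here are our assumptions.

    import numpy as np

    def assemble_planes(sr_features, tol=24.0):
        """sr_features: mean gray level of each sub-block region (SR).
        Returns a list of planes, each a list of SR indices."""
        planes, plane_feats = [], []         # member lists and running mean features
        pool = list(range(len(sr_features)))
        while pool:                          # each pass seeds one new plane
            seed = pool.pop(0)               # initial plane selection phase
            planes.append([seed])
            plane_feats.append(sr_features[seed])
            remaining = []
            for i in pool:                   # matching phase
                diffs = [abs(sr_features[i] - pf) for pf in plane_feats]
                k = int(np.argmin(diffs))
                if diffs[k] < tol:           # belongs to an existing plane
                    planes[k].append(i)
                    plane_feats[k] += (sr_features[i] - plane_feats[k]) / len(planes[k])
                else:
                    remaining.append(i)      # plane construction phase seeds it later
            pool = remaining
        return planes

The loop terminates because each pass removes at least the seed SR from the Pool, mirroring the paper's recursion until the Pool is cleared.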

3. Knowledge-based text-line extraction and identification

Having performed the multi-plane segmentation process, the entire image is decomposed into various object planes. Each object plane may contain various objects of interest, such as textual regions, graphical and pictorial objects, background textures, or other objects. Here, each individual object plane Pq is binarized by setting its object pixels to black and other non-object pixels to white. This creates a "binarized plane," denoted as BPq, for each plane Pq. Performing a text-line extraction and identification process on each individual binary plane BPq then reveals any text-lines of interest. To obtain the character-like components from each binary plane BPq, a fast connected-component extraction technique (Suzuki, Horiba, & Sugie, 2003) first locates the connected-components of the black pixels in BPq. Fig. 2 illustrates the obtained connected-components of possible textual objects in the binarized object plane BP1 derived from Fig. 1. These connected-components may represent character components, graphical and pictorial objects, or background textures.
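A minimal sketch of this step follows, with scipy.ndimage standing in for the fast connected-component technique of Suzuki, Horiba, and Sugie (2003); the bounding-box representation mirrors the t/b/l/r features defined in Section 3.2, and the function name is ours.

    import numpy as np
    from scipy import ndimage

    def extract_components(bp):
        """bp: boolean array of a binarized plane BPq (True = object pixel).
        Returns one (t, b, l, r) bounding box per connected-component."""
        # ndimage.label uses 4-connectivity by default; pass
        # structure=np.ones((3, 3)) for 8-connectivity.
        labeled, num = ndimage.label(bp)
        boxes = []
        for ys, xs in ndimage.find_objects(labeled):
            boxes.append((ys.start, ys.stop - 1, xs.start, xs.stop - 1))
        return boxes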

This study presents a knowledge-based text-line extraction and identification system for grouping connected-components into applicable sets and extracting possible text-lines from each of the binarized planes. A resulting set of connected-components may include an actual text-line, a larger graphical object, or a group of isolated background components within the character strings. The proposed knowledge-based text-line identification process examines each of these connected-component sets, denoted as CS, to determine whether they are actual text-lines.

Fig. 2. The located connected-components of possible textual objects in the binarized object plane BP1.


3.1. The knowledge-based system

The proposed knowledge-based text-line extraction and identification system consists of a multi-plane segmentation module, a partitioned global data structure, and a rule-based reasoning system, as Fig. 3 shows.

Fig. 3. The proposed knowledge-based system.

The proposed knowledge-based system adopts a three-level rule-based reasoning model, which comprises knowledge, control, and strategy rules, for efficient text-line extraction and identification in real-life mixed and overlapping text/graphics compound document images. This knowledge-based system provides high flexibility for further improving and extending the system by updating the knowledge rules in the inference mechanism. The knowledge rules of the proposed system are organized into two sets, text region extraction rules and text-line identification rules, which encode the geometric and statistical features of connected-components and connected-component groups contained in text-lines having a variety of colors, illuminations, sizes, font styles, and arrangements.

The control rules, which compose the inference engine, determine which connected-components and connected-component groups to evaluate and which subsequent process to perform. These control rules fall into two categories: focus-of-attention rules and meta-rules. The focus-of-attention rules determine the next connected-component or connected-component group to evaluate using the knowledge rules, while the meta-rules determine the processing phases and feature configurations that specify which set of knowledge rules to perform next. The strategy rules decide the invocation process of a given set of control rules and determine their execution order on the connected-components and connected-component groups.

To facilitate the rule-based reasoning process for extracting and identifying text-lines, the proposed method adopts a global data structure containing domain and control data partitions, which keeps the critical processing information of the connected-components and connected-component groups being processed, along with immediate control statuses. The domain data partition includes the features and information about the connected-components and connected-component groups from the binarized planes to be processed by the text-line extraction and identification modules. The control data partition includes control information about the statuses of the extraction and identification processes, and the detailed records of any results kept in the global data structure.
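One plausible way to organize these three rule levels and the partitioned data structure in code is sketched below. The class names, the dict-based global data structure, and the forward-chaining loop are our assumptions for illustration, not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Rule:
        name: str                            # e.g. "K-1-1", "M-1-2", "S-1-1"
        condition: Callable[[dict], bool]    # IF-part: reads the global data structure
        action: Callable[[dict], None]       # THEN-part: updates it

    @dataclass
    class InferenceEngine:
        strategy: List[Rule] = field(default_factory=list)
        control: List[Rule] = field(default_factory=list)    # focus-of-attention + meta
        knowledge: List[Rule] = field(default_factory=list)  # extraction + identification

        def run(self, gds: dict) -> dict:
            """gds holds the 'domain' and 'control' data partitions. Rule actions
            must update gds so that their conditions eventually become false."""
            fired = True
            while fired:                     # forward-chain until quiescence
                fired = False
                for rule in self.strategy + self.control + self.knowledge:
                    if rule.condition(gds):
                        rule.action(gds)
                        fired = True
            return gds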

3.2. Knowledge-based text region extraction

The domain data of the textual components used in the knowledge-based text extraction and identification process can be categorized into two types:

(1) The component units: These units represent the geometric and connection features of connected-components, denoted by C, and reveal potential character components.

(2) The component-set units: These units represent the logical groups of the component units, denoted as CS, and reveal the potential text-lines.

First, the features that represent a component unit are as follows:

(a) The identification number of the current connected-component of the current BPq, denoted by Ci.

(b) The width and height of a given component Ci, denoted by W(Ci) and H(Ci).

(c) The location features of the bounding box of a component Ci employed in the knowledge-based text region extraction process: its top, bottom, left, and right positions, denoted by t(Ci), b(Ci), l(Ci), and r(Ci), respectively.

(d) The identification number of the component-set to which the current component unit belongs. This number is assigned during the knowledge-based text region extraction process.

(e) The identification numbers of the previous and next blocks linked to the current component unit in the text reading order, which are filled in during the processing of the knowledge-based system.

To facilitate the spatial clustering of the component units in the knowledge-based text region extraction process, the following spatial features of paired component units are defined:

(1) The horizontal and vertical distances between a pair of component units are

D_h(C_i, C_j) = max[l(C_i), l(C_j)] − min[r(C_i), r(C_j)]   (2)

D_v(C_i, C_j) = max[t(C_i), t(C_j)] − min[b(C_i), b(C_j)]   (3)

If the two components overlap in the horizontal or vertical direction, the value of D_h(C_i, C_j) or D_v(C_i, C_j) will be negative.

(2) The measures of overlap between the horizontal and vertical projections of a pair of component units are

P_h(C_i, C_j) = −D_h(C_i, C_j) / min[W(C_i), W(C_j)]   (4)

P_v(C_i, C_j) = −D_v(C_i, C_j) / min[H(C_i), H(C_j)]   (5)
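Eqs. (2)-(5) translate directly into code. The sketch below assumes bounding boxes stored as (t, b, l, r) tuples in image coordinates, a representation of our choosing:

    def d_h(ci, cj):
        """Eq. (2): horizontal distance; negative when the boxes overlap."""
        return max(ci[2], cj[2]) - min(ci[3], cj[3])

    def d_v(ci, cj):
        """Eq. (3): vertical distance; negative when the boxes overlap."""
        return max(ci[0], cj[0]) - min(ci[1], cj[1])

    def p_h(ci, cj):
        """Eq. (4): normalized horizontal overlap measure."""
        return -d_h(ci, cj) / min(ci[3] - ci[2], cj[3] - cj[2])

    def p_v(ci, cj):
        """Eq. (5): normalized vertical overlap measure."""
        return -d_v(ci, cj) / min(ci[1] - ci[0], cj[1] - cj[0])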

Next, the feature data used to represent a component-set unit include the following:

(a) The identification number of the current set of connected-components, denoted by CSj. Here, a component-set unit CSj represents a preliminary general text-line.

(b) The number of connected-components contained in CSj, denoted by Ncc(CSj).



(c) The listed records of the component units it contains, i.e., CSj = {Ci : i = 0, 1, 2, ..., Ncc(CSj) − 1}.

(d) The width and height of CSj, denoted by W(CSj) and H(CSj), respectively. These values are determined by the bounding box enclosing all the component units belonging to CSj.

(e) The location features of CSj, including its top, bottom, left, and right positions, denoted by t(CSj), b(CSj), l(CSj), and r(CSj), respectively. These values are determined by the bounding box enclosing its contained component units.

(f) The identified type of CSj, such as a text-line or a non-text object, labeled by the knowledge-based text identification process.

Similar to the computation of the spatial features of component units, the horizontal and vertical distances and the horizontal and vertical projection overlap measures of two component-set units CSj and CSk can be determined from the bounding boxes of their contained component units as Dh(CSj, CSk), Dv(CSj, CSk), Ph(CSj, CSk), and Pv(CSj, CSk), as in Eqs. (2)-(5), respectively.

Based on the domain data of the textual components defined above, the knowledge-based text region extraction process recursively performs the horizontal and vertical clustering procedures on the extracted component units of each binary plane BPq, using the following knowledge, control, and strategy rules.

First, the horizontal clustering procedure H-cluster(CSin) (where the subscript "in" refers to the input component-set unit) uses the following rules:

Meta rule (M-1-1):
IF:
(a) The text region segmentation is being performed.
(b) The current process corresponds to the "horizontal clustering procedure."
THEN:
(1) Project all the bounding boxes of the component units contained in the input component-set unit CSin horizontally onto the vertical y-axis.
(2) Apply the "horizontal clustering rules" to the component units.

Focus-of-Attention rule (F-1-1):
IF:
(a) The status is "text region extraction."
(b) The current process corresponds to the "horizontal clustering procedure."
THEN:
(1) Scan the horizontal projections of all component units contained in the input component-set unit CSin on the y-axis, and determine their "projection overlapping segments" on the y-axis (as Fig. 4(a) illustrates).
(2) Find the component units that share the same horizontal projection overlapping segments.

Meta rule (M-1-2):
IF:
(a) More than one resulting component-set unit CSk (where k = 0, 1, 2, ..., K − 1, and K is the number of resulting sets) is obtained after performing the horizontal clustering rules.
THEN:
(1) Perform all the "vertical clustering rules" on each of the resulting component-set units.

Knowledge rule (K-1-1):
IF:
(a) The current process corresponds to the "horizontal clustering procedure."
(b) The component units share the same horizontal projection overlapping segment on the y-axis; that is, the two component units C1 and C2 overlap each other, which can be determined by the horizontal overlapping condition: Pv(B(C1), B(C2)) > 0.
THEN:
(1) Assign any two component units that are aligned with each other to the same component-set unit CSj.

For character components in a text-line, the corresponding bounding boxes of the component units should be well aligned with each other. Therefore, the alignment-condition can reveal the alignment between component units that share the same projection overlapping segments by applying the following knowledge rule:

Knowledge rule (K-1-2):
IF:
(a) The pair of bounding boxes of connected-components C1 and C2 satisfy the following alignment-condition,

[H(C_s) − H(C_s ∩ C_T)] / H(C_s) < T_a   (6)

where C_s is the shorter of the two component units C1 and C2, and C_T is the taller one; T_a is a pre-defined threshold, experimentally set to 0.33 in this study. The left-hand side of this condition can be simplified as

[H(C_s) − H(C_s ∩ C_T)] / H(C_s) = 1 − P_v(C_1, C_2)   (7)

In other words, the alignment-condition indicates that the non-overlapping part of the projection of the shorter component unit (i.e., the part of its projection that does not overlap the taller component unit) is less than one-third of its height.
THEN:
(1) Merge these two component units into the same component-set unit CSk.

Then the vertical clustering procedure V-cluster(CSin) is conducted by applying the following rules:

Meta rule (M-1-3):
IF:
(a) The text region segmentation is being performed.
(b) The current process is the "vertical clustering procedure."
THEN:
(1) Project all the bounding boxes of the component units contained in the input component-set unit CSin vertically onto the x-axis.
(2) Apply the "vertical clustering rules" to the component units.

Focus-of-Attention rule (F-1-2):
IF:
(a) The status is "text region extraction."
(b) The current process corresponds to the "vertical clustering procedure."
THEN:
(1) Scan the vertical projections of all component units contained in the input component-set unit CSin on the x-axis, and determine the projection overlapping segments (as Fig. 4(b) illustrates).
(2) Find the component units that share the same vertical projection overlapping segments.

Knowledge rule (K-1-3):
IF:
(a) The current process corresponds to the "vertical clustering procedure."
(b) The component units share the same vertical projection overlapping segment on the x-axis; that is, the two component units C1 and C2 overlap each other, which can be determined by the vertical overlapping condition: Ph(B(C1), B(C2)) > 0.
THEN:
(1) Cluster two component units that are aligned with each other into the respective component-set unit CSl.

Meta rule (M-1-4):
IF:
(a) The vertical clustering rules produce more than one resulting component-set unit CSl (where l = 0, 1, 2, ..., L − 1, and L is the number of resulting sets).
THEN:
(1) Apply the "horizontal clustering rules" to each of the resulting component-set units.

Knowledge rule (K-1-4):
IF:
(a) There exist partially determined component-set units CSl.
(b) The horizontal space between two adjacent component-set units CSk1 and CSk2 is small enough to indicate that they should belong to the same textual object set. This can be determined by the following horizontal space condition between the two adjacent component-set units CSk1 and CSk2:

D_h(CS_k1, CS_k2) < max[W(CS_k1)/N_cc(CS_k1), W(CS_k2)/N_cc(CS_k2)]   (8)

where the term W(CS)/N_cc(CS) reflects the average width of all component units that belong to a given component-set unit CS.
THEN:
(1) Merge these two adjacent component-set units CSk1 and CSk2 (including the component units they comprise) into the same component-set unit.

Strategy rule (S-1-1):
IF:
(a) Any more resulting component-set units can still be incrementally obtained.
THEN:
(1) Apply all the control rules of the text region segmentation to each component-set unit CSl until no more component-set units can be obtained.

Fig. 4. Illustration of the projection overlapping segments.

Initially, the input component-set unit CSin includes all of the component units of the currently processed binary plane BPq. Based on the two procedures above, the knowledge-based text region extraction process begins by performing the horizontal clustering rules on the initial component-set unit CSin. If the first recursion of the horizontal clustering rules cannot divide the initial component-set unit CSin into more than one component-set, the vertical clustering rules are performed on CSin. The clustering process proceeds by recursively applying the horizontal and vertical clustering rules until the resulting component-set units cannot be divided into further sub-sets.
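The recursion described above is essentially an alternating projection-based grouping. The condensed sketch below shows this control flow under simplifying assumptions: components are plain (t, b, l, r) boxes, grouping uses raw interval overlap only (rules K-1-1/K-1-3), and the alignment condition (Eq. (6)) and the gap-merging rule K-1-4 are omitted for brevity.

    def overlap_1d(a, b):
        """Overlap length of two intervals (lo, hi); <= 0 means disjoint."""
        return min(a[1], b[1]) - max(a[0], b[0])

    def cluster(components, axis):
        """Group (t, b, l, r) boxes whose projections overlap on one axis
        (axis=0: onto the y-axis, horizontal clustering; axis=1: onto the x-axis)."""
        interval = (lambda c: (c[0], c[1])) if axis == 0 else (lambda c: (c[2], c[3]))
        groups = []                              # list of (members, running span)
        for c in sorted(components, key=lambda c: interval(c)[0]):
            if groups and overlap_1d(interval(c), groups[-1][1]) > 0:
                members, span = groups[-1]
                members.append(c)
                groups[-1] = (members, (span[0], max(span[1], interval(c)[1])))
            else:
                groups.append(([c], interval(c)))
        return [members for members, _ in groups]

    def extract_text_regions(cs, axis=0):
        """Alternate horizontal/vertical clustering until no CS divides further."""
        parts = cluster(cs, axis)
        if len(parts) == 1:
            if axis != 0:
                return [cs]                      # already one group in the other direction
            parts = cluster(cs, 1)               # try the vertical direction
            if len(parts) == 1:
                return [cs]
            return [r for p in parts for r in extract_text_regions(p, 0)]
        return [r for p in parts for r in extract_text_regions(p, 1 - axis)]

The recursion terminates because every successful split strictly reduces the number of components in each resulting sub-set, matching the stopping condition stated above.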

Fig. 5 depicts the proposed knowledge-based text region extraction process. As Fig. 5(a) shows, the corresponding component units of possible textual objects in the binary plane BP1 (as in Fig. 2) are clustered into several resulting CSs, enclosed by green rectangles, after the first recursion of the horizontal clustering procedure. Fig. 5(a) shows that these resulting CSs consist of component units that share the same horizontal projection overlapping segments on the y-axis and have compatible heights. For example, the horizontal clustering rules cluster the CSs of the five text-lines with compatible heights ("Guglielmo Tell," "La gazza ladra," etc.) near the artist's large portrait without improperly merging them with the portrait. Next, the vertical clustering procedure is performed on the resulting CSs in Fig. 5(a), and Fig. 5(b) shows the resulting CSs. This figure shows that the text-line "La scala di seta," which was merged with the artist's eyes in Fig. 5(a), is now appropriately clustered into one separate CS, while the text-lines "Semiramide" and "Il Barbiere di Siviglia" are now separated from the formerly merged artist's nose and mouth thanks to the vertical projection information. The second recursion of the horizontal clustering procedure is applied to the CSs obtained in Fig. 5(b), and Fig. 5(c) shows the resulting CSs. Fig. 5(c) shows that the merged text-lines "Semiramide" and "Il Barbiere di Siviglia" in Fig. 5(b) are clustered into two individual CSs thanks to the horizontal projection information. Finally, since further recursions of the vertical clustering procedure cannot separate the resulting CSs in Fig. 5(c) into more CSs, the knowledge-based text region extraction process on BP1 terminates with the results of Fig. 5(c). The knowledge-based text region extraction process is applied to all the binary planes derived from Fig. 1(a) to obtain their associated CSs. These final CSs are text-line candidates, and are processed by the knowledge-based text identification process presented in the following subsection to recognize the actual text-lines among them.

3.3. Knowledge-based text-line identification

The knowledge-based text identification process determines whether each of the obtained component-set units CSs comprises an actual text-line or non-textual objects. Before distinguishing and extracting text-lines, the proposed method first identifies halftone pictorial objects and background regions using normalized correlation features (Pavlidis & Zhou, 1992). The normalized correlation features for each CS are computed from the bounding box region covered by its contained components. If the normalized correlation features of a CS meet the discrimination rules of halftone pictorial objects, it is a pictorial object or a background region.

After identifying and eliminating pictorial objects and background regions, text identification is performed on the remaining CSs. If a CS actually comprises a text-line, it has the following distinguishing characteristics: (1) its contained component units are aligned, and their quantity is proportional to the width of the whole CS; (2) the object pixels in the enclosed region of the CS show distinctive spatial variations. The second characteristic can be determined by the following statistical feature: considering that "0" represents object pixels and "1" background pixels, the number of transition pixels Tp in the enclosing box of the CS is determined by counting the "0"-to-"1" and "1"-to-"0" transitions. A CS is an actual text-line if it satisfies both of the above-mentioned distinguishing characteristics.
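As a small worked example of the second characteristic, the sketch below counts transition pixels row-wise over the CS's enclosing box and normalizes them as used later in Eq. (12). The array convention (0 = object, 1 = background) follows the text; the function name is ours.

    import numpy as np

    def transition_ratio(box):
        """box: binary array over the CS's enclosing box, 0 = object pixel,
        1 = background pixel. Returns Tp / NCol (see Eq. (12))."""
        box = np.asarray(box, dtype=np.int8)
        tp = int(np.count_nonzero(np.diff(box, axis=1)))        # 0->1 and 1->0 per row
        n_col = int(np.count_nonzero((box == 0).any(axis=0)))   # columns with object pixels
        return tp / max(n_col, 1)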

Based on the above-mentioned concepts, the text identification process employs the following knowledge-based processing rules to identify whether a component-set unit CS consists of a text-line or non-text objects. A CS is a real text-line if the following knowledge rule is satisfied.

Focus-of-Attention rule (F-2-1):
IF:
(a) The status is "text-line identification."
(b) A component-set unit CS has been selected.
THEN:
(1) Select an adjacent CS that is not identified as a text-line or a non-textual object.

Fig. 5. Illustrative examples of the proposed knowledge-based text region extraction process.

Fig. 6. Examples of the knowledge-based text extraction and identification process.


Meta rule (M-2-1):
IF:
(a) The text-line identification is being performed.
(b) Any CSs remain unidentified.
THEN:
(1) Select one unidentified CS.
(2) Apply the text-line identification rules to the selected CS.

Knowledge rule (K-2-1):
IF:
(a) The ratio of the width W to the height H of the enclosing bounding box of this CS satisfies the size-ratio condition,

W(CS) / H(CS) ≥ s_r   (9)

where the threshold s_r of the size-ratio condition is 2.0, reflecting the rectangular appearance of a text-line.

(b) The number of contained component units N_cc of the CS satisfies the condition

s_n1 · (W(CS)/H(CS)) ≤ N_cc(CS) ≤ s_n2 · (W(CS)/H(CS)), and N_cc(CS) > s_n3   (10)

where the values of s_n1, s_n2, and s_n3 are 0.5, 8.0, and 3, respectively, according to our analysis of the typical arrangement and quantity characteristics of the characters in a text-line.

(c) The ratio of the total area of the bounding boxes of the contained component units of the CS to the area of its enclosing box meets the condition

s_a1 ≤ Σ_{C_i ∈ CS} A(C_i) / (W(CS) · H(CS)) ≤ s_a2   (11)

where C_i is the ith component unit contained in the CS, and A(C_i) is the area of the bounding box of C_i; the thresholds s_a1 and s_a2 are 0.5 and 0.95, respectively, and reveal the alignment property of the characters appearing in a text-line.

(d) The horizontal transition pixel ratio of the CS satisfies the condition

s_t1 ≤ T_p(CS) / N_Col(CS) ≤ s_t2   (12)

where N_Col is the number of columns in which object pixels are present; the values of s_t1 and s_t2 are 1.2 and 3.6, respectively, reflecting the typical pixel transition features of character strokes.

(e) The density of object pixels in the CS satisfies the condition

s_d1 ≤ Σ_{C_i ∈ CS} O_p(C_i) / (W(CS) · H(CS)) ≤ s_d2   (13)

where O_p(C_i) is the number of object pixels of the ith component unit of the CS; the thresholds s_d1 and s_d2 are 0.3 and 0.8, respectively, revealing the typical occupation characteristic of pixels within a text-line.

THEN:
(1) Identify this CS as a "text-line."

Strategy rule (S-2-1):
IF:
(a) Any unidentified component-set units CSs remain.
THEN:
(1) Apply all the control rules of the text-line identification to each CS until there are no more unidentified CSs.

The discrimination parameters utilized in the knowledge rules were obtained by analyzing many experimental results of processing document images with text strings of various types, lengths, and sizes, and yield good performance in most general cases. Fig. 6(a) shows that the proposed knowledge-based text identification process accurately identifies eight actual text-lines (enclosed by blue rectangles) from the candidate component-set units CSs in Fig. 5(c). After applying the knowledge-based text-line extraction and identification process to all binarized object planes, the text-lines extracted and identified from these planes are composed into a textual plane, as Fig. 6(b) shows.
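Collecting the five conditions of rule K-2-1 with the parameter values quoted above gives a compact predicate. The sketch below is our paraphrase of the rule; the argument names are of our choosing, and the aggregate features are assumed to have been computed as defined in Section 3.2.

    def is_text_line(W, H, Ncc, cc_area, Tp, NCol, Op):
        """W, H: CS bounding-box width/height; Ncc: number of component units;
        cc_area: total area of the component bounding boxes; Tp: transition
        pixels; NCol: columns containing object pixels; Op: total object pixels."""
        ratio = W / H
        return (ratio >= 2.0                                       # Eq. (9)
                and 0.5 * ratio <= Ncc <= 8.0 * ratio and Ncc > 3  # Eq. (10)
                and 0.5 <= cc_area / (W * H) <= 0.95               # Eq. (11)
                and 1.2 <= Tp / NCol <= 3.6                        # Eq. (12)
                and 0.3 <= Op / (W * H) <= 0.8)                    # Eq. (13)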

4. Experimental results

This section evaluates the performance of the proposed knowledge-based text-line extraction technique and compares it to Jain and Yu's color-based method (Jain & Yu, 1998). The document image database used in this study consists of 50 real-life complex mixed and overlapping text/graphics compound document images. These images contain text-lines printed in various colors or illuminations, font styles, and sizes, including sparse and dense textual regions, and adjoining or overlapping pictorial, watermarked, textured, shaded, or unevenly illuminated objects and background regions. The knowledge rules for text-line extraction and identification were constructed according to the geometric and spatial statistical characteristics of the text-lines and their connected-components printed in a variety of real-life complex compound document images in the sample database. The proposed technique was implemented on a 2.8 GHz Pentium-4 personal computer using the C++ programming language. For a typical A4-sized document page scanned at 200-400 dpi resolution, the average image size is 2,400 by 3,200 pixels, with an average processing time of 1.25 s.

Fig. 7. Text-line extraction results of the sample image 1 (size: 3118 × 4498) by the proposed knowledge-based approach, and Jain and Yu's method.


Fig. 7(a) shows a representative example of a mixed text/graphics compound document image. In this test image, 30 text-lines of various sizes, colors, and styles are printed and superimposed on pictorial objects and textured backgrounds. For instance, the title text-line "A life time Switches what will last" consists of character strings with two different colors, sizes, and styles. Fig. 7(b) shows that after applying the multi-plane segmentation and knowledge-based text extraction processes, these 30 text-lines are extracted. Some marks are also extracted as text-lines, because they are composed of characters and have features similar to those of text-lines. The text-line extraction results produced by Jain and Yu's color-segmentation-based method in Fig. 7(c) show that several pieces of text-lines are missing or fragmented. This is due to unsatisfactory global color segmentation of the text regions that overlap pictorial objects and textured backgrounds.

Fig. 8. Text-line extraction results of the sample image 2 (size: 2309 × 3102) by the proposed knowledge-based approach, and Jain and Yu's method.

Fig. 9. Text-line extraction results of the sample image 3 (size: 2412 × 3312) by the proposed knowledge-based approach, and Jain and Yu's method.

Figs. 8 and 9(a) illustrate two more test images, from English magazine documents, with several notable characteristics of text-line arrangement. The sample image in Fig. 8(a) has 18 multi-colored text-lines of different sizes and arrangements appearing on background regions with indistinct contrast, while the test image in Fig. 9(a) includes 56 differently sized and colored text-lines packed closely together, mixed in various arrangement styles, and printed on shaded and pictorial background regions. Figs. 8 and 9(b) show that the proposed knowledge-based method appropriately segments and extracts most of the text-lines, from large to small character sizes and across different font types and colors, under the various difficulties associated with the complexity of the background images. The exception is one vertically oriented character string, "http://www.computer.org," which is not extracted. Figs. 8(c) and 9(c) show that Jain and Yu's method does not provide satisfactory results when extracting several text-lines of interest, particularly text-lines without sufficient contrast with the background regions, as in Fig. 8(c), and text-lines with a large variety of sizes and arrangements, as in Fig. 9(c).

Fig. 10. Text-line extraction results of the sample image 4 (size: 2197 × 2840) by the proposed knowledge-based approach, and Jain and Yu's method.

Fig. 11. Text-line extraction results of the sample image 5 (size: 2300 × 3155) by the proposed knowledge-based approach, and Jain and Yu's method.

Figs. 10 and 11 depict two more representative experimental examples, of Chinese complex document images. These images contain 11 and 38 text-lines, respectively. In these two Chinese document images, the Chinese text-lines are printed on textured and shaded non-text pictorial objects and backgrounds with degraded and sharply varying contrast. These text-lines include different sizes, colors, and types of mixed Chinese/English characters in various formations. Unlike an English character, a Chinese character is usually composed of multiple connected-components, which may cause difficulties for conventional text-line extraction methods designed for processing English text-lines. Figs. 10(b) and 11(b) show that the proposed knowledge-based method efficiently extracts and identifies most of these text-lines, even when they appear across several sharply varying and degraded non-text pictorial background regions. This is true for the text-lines across the degraded and textured background regions of the rope pillar in Fig. 10(b). Figs. 10(c) and 11(c) show the text-lines obtained by Jain and Yu's method for comparison. These figures show that several large pieces of text-lines are lost or fragmented. This is because the global color quantization of Jain and Yu's method is unable to select appropriate representative colors for text-lines across different shaded and textured background regions with degraded contrast. Further, their text-line extraction method is designed for English text-lines, and does not perform well when extracting text-lines that include multi-component Chinese characters.

Table 1. Experimental data of Jain and Yu's method and our proposed approach.

Method          Total number of extracted text-lines    Recall rate (%)    Precision rate (%)
Jain and Yu's   1071                                    75.9               94.2
The proposed    1402                                    99.3               98.9

This study applies two performance metrics, the recall rate and the precision rate, for the quantitative evaluation of text-line extraction and identification. These metrics are frequently used to evaluate performance in information retrieval, and are written as

Recall Rate = (No. of correctly extracted text-lines) / (No. of actual text-lines)   (14)

Precision Rate = (No. of correctly extracted text-lines) / (No. of extracted text-like CSs)   (15)

The recall rate reflects a given method's text-line extraction performance on different complex compound documents. A high recall rate reveals that the false-negative detection rate is low. The precision rate represents a method's ability to identify actual text-lines without confusing them with non-text objects having similar features. A high precision rate indicates low false-positive detection of spurious non-text objects. The recall and precision rates for the text-line extraction and identification results on our sample images were computed by manually counting the number of text-lines actually printed on the document images, the number of extracted text-like connected-component sets (CSs), and the number of correctly extracted text-lines.

Table 1 compares the experimental results of Jain and Yu's method with those of the proposed knowledge-based approach. The quantitative evaluation was performed on our test database of 50 complex document images, totaling 1412 readable text-lines comprising 27,820 visible characters. The quantitative results in Table 1 show that the proposed knowledge-based approach provides better text-line extraction performance than Jain and Yu's method: it not only effectively extracts text-lines from widely varying types of complex compound document images, but also efficiently rejects spurious non-text objects with similar features, preserving the quality of the extracted text-lines. Accordingly, the experimental results in Figs. 7–11 and Table 1 show that even when text-lines include a large variety of colors, illumination levels, sizes, styles, and arrangements, or overlap pictorial objects and shaded and textured backgrounds with uneven, degraded, or sharp variations in contrast and illumination, the proposed knowledge-based text-line extraction and identification method successfully extracts and identifies almost all of them.

5. Conclusions

This study presents a new knowledge-based system for extracting text-lines from various types of mixed and overlapping text/graphics complex compound document images. Text-lines in such document images may appear in different colors, illumination levels, sizes, or font styles, and are printed on and overlapped with various background objects, such as figures, photographs, pictures, and other background textures, exhibiting uneven, shaded, and sharp variations in contrast, illumination, and texture. The first step of the knowledge-based text-line extraction system is the multi-plane segmentation process, which processes document images regionally and adaptively based on their local features. This step produces distinct object planes, making it possible to separate homogeneous objects, including textual regions of interest, from non-text objects such as graphics, pictures, and background textures. Next, a knowledge-based text extraction and identification procedure extracts and identifies text-lines with various characteristics in the obtained object planes. The proposed knowledge-based method consists of two processing phases: text region extraction and text-line identification. The knowledge rules for text-line extraction and identification were established and encoded according to the geometric and spatial statistical characteristics of text-lines of different colors, illumination levels, sizes, and font styles printed in the real-life complex compound document images of our sample database. These knowledge rules comprise two rule sets, the text region extraction rules and the text-line identification rules, according to the phase in which they are applied. The inference engine of the proposed knowledge-based system is a hierarchical structure of control and strategy rule sets, and efficiently performs the text-line extraction and identification processes on real-life complex compound document images. In this way, the proposed knowledge-based system provides high flexibility and expandability: merely updating rules allows it to cope with new and various types of real-life complex compound document images. Experimental results demonstrate that the proposed knowledge-based system provides efficient extraction and identification results for text-lines with various illumination levels, colors, sizes, styles, and arrangements from a large variety of real-life complex compound document images with overlapping pictorial objects and shaded and textured backgrounds exhibiting uneven, degraded, or sharp variations in contrast and illumination.
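To make the two-phase rule organization summarized above concrete, the sketch below models a minimal rule-based pipeline in Python. It is a rough illustration under assumed interfaces: all class names, rule predicates, and feature keys are invented for this example and do not reproduce the actual rule sets or inference engine of the proposed system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CandidateRegion:
    """A connected-component set (CS) obtained from one object plane."""
    features: Dict[str, float]
    is_text: bool = False

Rule = Callable[[CandidateRegion], bool]

@dataclass
class RuleSet:
    name: str
    rules: List[Rule] = field(default_factory=list)

    def satisfied_by(self, region: CandidateRegion) -> bool:
        # A region passes a rule set only if every rule in the set fires.
        return all(rule(region) for rule in self.rules)

def run_pipeline(regions: List[CandidateRegion],
                 extraction_rules: RuleSet,
                 identification_rules: RuleSet) -> List[CandidateRegion]:
    """Phase 1 keeps candidate text regions; phase 2 confirms text-lines."""
    candidates = [r for r in regions if extraction_rules.satisfied_by(r)]
    for r in candidates:
        r.is_text = identification_rules.satisfied_by(r)
    return [r for r in candidates if r.is_text]

# Usage with two hypothetical rules: a plausible character height for
# extraction, and a minimum count of aligned components for identification.
extraction = RuleSet("text region extraction",
                     [lambda r: 5 <= r.features["height"] <= 200])
identification = RuleSet("text-line identification",
                         [lambda r: r.features["aligned_components"] >= 3])
regions = [CandidateRegion({"height": 18, "aligned_components": 7}),
           CandidateRegion({"height": 400, "aligned_components": 1})]
print(len(run_pipeline(regions, extraction, identification)))  # -> 1
```

Organizing the rules this way mirrors the flexibility claim above: adapting the system to a new document type amounts to appending or replacing rules in the appropriate rule set rather than modifying the pipeline itself.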

Acknowledgements

This work was supported by the National Science Council of the R.O.C. under Contract Nos. NSC-99-2221-E-027-100, NSC-99-2221-E-468-022, and NSC-100-2219-E-027-006.

References

Avci, E., & Avci, D. (2009). An expert system based on fuzzy entropy for automatic threshold selection in image processing. Expert Systems with Applications, 36, 3077–3085.

Chen, Y.-L., & Wu, B.-F. (2009). A multi-plane segmentation approach for text extraction from complex document images. Pattern Recognition, 42(7), 1419–1444.

Cho, S.-Y., Quek, C., Seah, S.-X., & Chong, C.-H. (2009). HebbR2-Taffic: A novel application of neuro-fuzzy network for visual based traffic monitoring system. Expert Systems with Applications, 36, 6343–6356.

Cucchiara, R., Piccardi, M., & Mello, P. (2000). Image analysis and rule-based reasoning for a traffic monitoring system. IEEE Transactions on Intelligent Transportation Systems, 1(2), 119–130.

Doermann, D. (1998). The indexing and retrieval of document images: A survey. Computer Vision and Image Understanding, 70, 287–298.

Fisher, J. L., Hinds, S. C., & D'Amato, D. P. (1990). Rule-based system for document image segmentation. In Proceedings of the 10th International Conference on Pattern Recognition (pp. 567–572).

Fletcher, L. A., & Kasturi, R. (1988). A robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), 910–918.

Hasan, Y. M. Y., & Karam, L. J. (2000). Morphological text extraction from images. IEEE Transactions on Image Processing, 9(11), 1978–1983.

Hase, H., Shinokawa, T., Yoneda, M., & Suen, C. Y. (2001). Character string extraction from color documents. Pattern Recognition, 34, 1349–1365.

Jain, A. K., & Bhattacharjee, S. (1992). Text segmentation using Gabor filters for automatic document processing. Machine Vision and Applications, 5, 169–184.

Jain, A. K., & Yu, B. (1998). Automatic text location in images and video frames. Pattern Recognition, 31(12), 2055–2076.

Kang, B. H., & Bae, Y. L. (1997). Binary character/graphic image extraction using fuzzy inference and logical level methods. Expert Systems with Applications, 12, 65–70.

Lee, K.-H., Choy, Y.-C., & Cho, S.-B. (2000). Geometric structure analysis of document images: A knowledge-based approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1224–1240.

Levine, M. D., & Nazif, A. M. (1985). Dynamic measurement of computer generated image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 155–164.

Niyogi, D., & Srihari, S. N. (1996). An integrated approach to document decomposition and structural analysis. International Journal of Imaging Systems and Technology, 7, 330–342.

O'Gorman, L., & Kasturi, R. (1995). Document image analysis. Silver Spring, MD: IEEE Computer Society Press.

Pavlidis, T., & Zhou, J. (1992). Page segmentation and classification. Computer Graphics and Image Processing, 54(6), 484–496.

Shih, F. Y., Chen, S. S., Hung, D. C. D., & Ng, P. A. (1992). Document segmentation, classification and recognition system. In Proceedings of the 2nd International Conference on Systems Integration (pp. 258–267).

Sobottka, K., Kronenberg, H., Perroud, T., & Bunke, H. (2000). Text extraction from colored book and journal covers. International Journal of Document Analysis & Recognition, 2, 163–176.

Strouthopoulos, C., Papamarkos, N., & Atsalakis, A. E. (2002). Text extraction in complex color documents. Pattern Recognition, 35, 1743–1758.

Subasic, M., Loncaric, S., & Birchbauer, J. (2009). Expert system segmentation of face images. Expert Systems with Applications, 36, 4497–4507.

Suzuki, K., Horiba, I., & Sugie, N. (2003). Linear-time connected-component labeling based on sequential local operations. Computer Vision and Image Understanding, 89, 1–23.

Wu, B.-F., Chen, Y.-L., & Chiu, C.-C. (2005). A discriminant analysis based recursive automatic thresholding approach for image segmentation. IEICE Transactions on Information and Systems, E88-D(7), 1716–1723.

Wu, V., Manmatha, R., & Riseman, E. M. (1999). Textfinder: An automatic system to detect and recognize text in images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(11), 1224–1229.

Yang, H., & Ozawa, S. (1999). Extraction of bibliography information based on the image of book cover. IEICE Transactions on Information and Systems, E82-D(7), 1109–1116.

Yuan, Q., & Tan, C. L. (2001). Text extraction from gray scale document images using edge information. In Proceedings of the 6th International Conference on Document Analysis & Recognition (pp. 302–306).