Chapter 2
Literature Survey and Objectives
2.1 Literature Survey
In India, there are 18 official languages recognized by the Indian Constitution. Two or more of these languages may be written in one script, and twelve different scripts are used for writing these languages. Many Indian documents are required to be written in three languages, namely English, Hindi and the state official language, as per the three-language formula. For example, a money order form in the state of Tamil Nadu is written in English, Hindi and Tamil, because Tamil is the official language of that state. The need for some form of automated or semi-automated OCR has been recognized for decades. Since segmentation is a crucial phase of OCR, particular attention should be given to it. Today, there are numerous algorithms that perform this task, each with its own strengths and weaknesses. In this survey, a number of papers related to the present work are reviewed and presented.
Dunn and Wang [1992] surveyed techniques for segmenting images of handwritten text into individual characters. The topic is divided into two categories: straight segmentation techniques and segmentation-recognition techniques. Straight segmentation, discussed first in the paper, is the technique of forming rules to identify members of a character set without identifying their specific classification. It is useful for printed character sets but less effective for cursive text. It greatly reduces the complexity of the search for a word hypothesis, since the character boundaries are predetermined. Several approaches to segmentation-recognition are also discussed in the paper, and each is analyzed for its relevance to printed, cursive, online and offline input data.
Segmentation-recognition strategies are more expensive due to the increased complexity of searching for optimum word hypotheses. However, the inherent ambiguity of cursive text requires this type of segmentation.
Fujisawa et al. [1992] presented a pattern-oriented segmentation method for optical character recognition that leads to document structure analysis. As a case study, segmentation of handwritten numerals which touch each other is taken up first. Connected pattern components are extracted, and spatial interrelations between components are measured and grouped into meaningful character patterns. Stroke shapes are analyzed and, on the basis of that analysis, a method is described to find the touching positions; it separates almost all connected numerals correctly. The authors handled ambiguities by generating multiple hypotheses and verifying them through recognition. An extended form of pattern-oriented segmentation, tabular form recognition, is also considered. Images of tabular forms are analyzed, and frames in the tabular structure are segmented. By identifying semantic relationships between label frames and data frames, information on the form can be properly recognized.
Abulhaiba and Ahmed [1993] presented an automatic offline character recognition system for totally unconstrained handwritten numerals using fuzzy logic. The system was trained and tested on field data collected by the U.S. Postal Service from dead-letter envelopes. It was trained on one thousand seven hundred and sixty-three unnormalized samples. The training process produced a feasible set of one hundred and five Fuzzy Constrained Character Graph Models (FCCGMs). FCCGMs tolerate large variability in size, shape and writing style. Characters were recognized by applying a set of rules to match a character tree representation to an FCCGM. A character tree is obtained by first converting the character skeleton into an approximate polygon and then transforming the polygon into a tree structure suitable for recognition purposes. The system was tested on one thousand eight hundred and twelve unnormalized samples (not including the training set) and proved powerful in recognition rate and in its tolerance to variations in writer, pen, paper texture and ink color.
Akindele and Belaid [1993] described a page segmentation method that allows
one to cut a document page image into polygonal blocks as well as into classical
rectangular blocks. The inter-column and inter-paragraph gaps are extracted as horizontal and vertical lines, and an intersection table is built from these lines. The points of intersection between these lines are treated as vertices of polygonal blocks. With the aid of 4-connected chain codes and the derived intersection table, simple isothetic polygonal blocks are constructed from these points of intersection. The method is robust
enough to be applied to obtain polygonal blocks of any shape and any number of sides.
Pavlidis [1993] stated that research in optical character recognition (OCR) has focused on the shape analysis of binarized images, under the assumption of good-quality documents and isolated characters. Such assumptions are challenged by the conditions met in practice. Binarization is difficult for low-contrast documents, where characters often touch each other, not only on the sides but also between lines. The author discussed current efforts to deal with OCR as a signal processing problem, where the causes of noise and distortion as well as the idealized images (definitions of typefaces) are modeled and subjected to quantitative analysis. The key idea of the analysis is that while printed text images may be binary in an ideal state, the images seen by the sensors are gray scale because of convolution distortion and other causes. Finally, it is stated that binarization should be carried out at the same time as feature extraction.
Liang et al. [1994] proposed a new discrimination function for segmenting touching characters. This function is based on both pixel projection and profile projection. A dynamic recursive segmentation algorithm is developed for effectively segmenting touching characters. Contextual information and a spelling checker are used to correct errors caused by incorrect recognition and segmentation. As per the paper, the proposed algorithm achieved good recognition accuracy.
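For illustration, the sketch below computes one plausible per-column discrimination score over a binary word image, mixing the pixel projection (ink count per column) with a profile-based projection (vertical extent between the top and bottom profiles), and returns the weakest columns as cut candidates; the weighting alpha and the scoring formula are illustrative assumptions, not the exact function of Liang et al.

```python
import numpy as np

def cut_candidates(word, alpha=0.5, n=5):
    """Hedged sketch: score each column by mixing pixel projection with a
    profile-based projection; low scores suggest cuts between touching
    characters. word: 2-D array, 1 = ink."""
    ink = word > 0
    rows = word.shape[0]
    has_ink = ink.any(axis=0)
    pixel_proj = ink.sum(axis=0).astype(float)          # ink pixels per column
    top = np.where(has_ink, ink.argmax(axis=0), rows)   # first ink row
    bottom = np.where(has_ink, rows - 1 - ink[::-1].argmax(axis=0), -1)
    profile_proj = np.clip(bottom - top + 1, 0, None).astype(float)
    score = alpha * pixel_proj + (1 - alpha) * profile_proj
    score[~has_ink] = np.inf                            # ignore blank columns
    return np.argsort(score)[:n]                        # n weakest columns
```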
Seni and Cohen [1994] described techniques to separate a line of unconstrained
(written in a natural manner) handwritten text into words. When the writing style is
unconstrained, recognition of individual components may be unreliable, so these components must be grouped together into word hypotheses before recognition algorithms, which may require dictionaries, can be used. The proposed system uses original algorithms to determine distances between components in a text line and to detect punctuation. The algorithms are tested on a number of handwritten text lines extracted from
postal address blocks. A detailed performance analysis of the complete system and its
components is presented in the paper.
Avi-Itzhak et al. [1995] stated that optical character recognition (OCR) refers to a process by which printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In the proposed study, a neural network approach is introduced to perform high-accuracy recognition on multi-size and multi-font characters. A novel centroid dithering training process with a low-noise-sensitivity normalization procedure is used to achieve highly accurate results. The study is divided into two parts. The first part focuses on single-size, single-font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-point Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all font sizes from 8 to 32 points and for 12 commonly used fonts. The performance of these two networks is evaluated on a database of more than one million character images from the testing data set.
Congedo et al. [1995] presented a procedure for the segmentation of handwritten numeric strings. The proposed procedure follows a hypothesis-then-verification strategy. In the paper, multiple segmentation algorithms, based on contiguous row partitioning, work sequentially on the binary image until an acceptable segmentation is obtained. To achieve this purpose, a new set of algorithms simulating a "drop falling" process is introduced. Drop-fall algorithms attempt to build a segmentation path by mimicking an object falling or rolling between the two characters which make up a connected component. There are four primary types of drop-fall algorithms, which differ in the direction and starting point of the fall. These are top-left (or left descending), top-right (or right descending), bottom-left (or left ascending), and bottom-right (or right ascending). The experimental tests demonstrate the effectiveness of the new algorithms in obtaining high-confidence segmentation hypotheses.
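To make the falling-drop behaviour concrete, the following is a minimal sketch of a top-left ("left descending") drop fall on a binary image; the tie-breaking order and the cut-through rule when the drop is fully blocked are simplifying assumptions rather than the exact rules of the paper.

```python
import numpy as np

def drop_fall_top_left(img, start_col):
    """Hedged top-left drop-fall sketch. img: 2-D array, 1 = ink (foreground).
    Returns the (row, col) cells visited by the falling 'drop'; the visited
    cells form a candidate segmentation path."""
    rows, cols = img.shape
    r, c = 0, start_col
    path = [(r, c)]
    while r < rows - 1:
        if img[r + 1, c] == 0:                       # fall straight down
            r += 1
        elif c + 1 < cols and img[r + 1, c + 1] == 0:
            r, c = r + 1, c + 1                      # slide down-right
        elif c - 1 >= 0 and img[r + 1, c - 1] == 0:
            r, c = r + 1, c - 1                      # slide down-left
        else:
            r += 1                                   # blocked: cut through ink
        path.append((r, c))
    return path
```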
Lu [1995] provided insight into character segmentation. Though the information in this paper relates to machine-printed characters, it gives a basis for understanding segmentation. According to the paper, segmentation can be divided into three approaches. The first is the classical approach, in which segments are identified based on character-like properties; this process of cutting up the image into meaningful components is called dissection. The second is recognition-based segmentation, in which the system searches the image for components that match classes in an alphabet. The third is holistic methods, in which the system seeks to recognize words as a whole, thus avoiding the need to segment into characters.
Casey and Lecolinet [1996] aimed at providing an appreciation of the range of character segmentation techniques that have been developed. The techniques are listed under four headings. The classical approach consists of methods that partition the input image into sub-images, which are then classified. The second class of methods segments the image either explicitly, by classification of pre-specified windows, or implicitly, by classification of subsets of spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules but using classification to select from the range of admissible segmentation possibilities offered by these sub-images. Finally, the holistic approach avoids segmentation by recognizing entire character strings as units.
Lee [1996] proposed a new scheme for offline recognition of totally unconstrained handwritten numerals using a simple multilayer cluster neural network trained with the back-propagation algorithm. The method highlighted that the use of genetic algorithms avoids the problem of local minima encountered when training the multilayer cluster neural network with the gradient descent technique, and hence the recognition rates are improved. In the proposed scheme, Kirsch masks are adopted for extracting feature vectors, and a three-layer cluster neural network with five independent subnetworks is developed for classifying similar numerals efficiently. To verify the performance of the proposed multilayer cluster neural network, experiments were carried out on a handwritten numeral database and good recognition rates were obtained.
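Kirsch masks are directional 3x3 edge operators whose responses yield directional feature maps. As a rough illustration (not Lee's exact feature extraction, which also involves zoning and normalization), four of the eight masks and a plain numpy valid-region correlation might look as follows:

```python
import numpy as np

# Four of the eight 3x3 Kirsch masks: horizontal, vertical, two diagonals.
KIRSCH = {
    "horizontal": np.array([[ 5,  5,  5], [-3, 0, -3], [-3, -3, -3]]),
    "vertical":   np.array([[ 5, -3, -3], [ 5, 0, -3], [ 5, -3, -3]]),
    "diag_right": np.array([[-3,  5,  5], [-3, 0,  5], [-3, -3, -3]]),
    "diag_left":  np.array([[ 5,  5, -3], [ 5, 0, -3], [-3, -3, -3]]),
}

def kirsch_response(img, mask):
    """Correlate img with one 3x3 mask over the valid region, pure numpy."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dr in range(3):
        for dc in range(3):
            out += mask[dr, dc] * img[dr:dr + h - 2, dc:dc + w - 2]
    return out

# A feature vector can then be formed, e.g., from the mean response of each
# directional map over a coarse grid of zones of the character image.
```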
Lu and Shridhar [1996] presented an overview of the most important techniques used in segmenting characters from handwritten words. It is well recognized that it is difficult to segment individual characters from handwritten words without support from recognition and context analysis. One common characteristic of all the existing handwritten word recognition algorithms is that the character segmentation process is closely coupled with the recognition process. The review consists of three major portions: hand-printed word segmentation, handwritten numeral segmentation and cursive word segmentation. Every algorithm discussed in the paper is accompanied by a flow chart to give a clear grasp of the algorithm. One section summarizes the terms and measurements commonly used in handwritten character segmentation.
Messelodi and Modena [1996] presented an algorithm for text segmentation and
recognition mainly suited for complex problems where many merged characters are
present. The basic idea is to define a distance, between lines of text and strings, which
helps to postpone the final decision about text segmentation and character classification
until the contextual analysis is performed. The distance takes into account both the
hypotheses about segmentation generated by a text segmentation module and the
hypotheses about character classification produced by a probabilistic classifier. The
algorithm has been tested by reading text on book covers. The experimental results highlight the quality of the proposed solution.
Trier et al. [1996] presented an overview of feature extraction methods for offline recognition of segmented (isolated) characters. The selection of a feature extraction method is probably the single most important factor in achieving high recognition performance in character recognition systems. The feature extraction methods discussed in the paper are categorized with reference to invariance properties, reconstructability, and the expected distortions and variability of the characters. The paper also addressed the problem of choosing the appropriate feature extraction method for a given application, since different feature extraction methods are designed for different representations of the characters.
Yu and Jain [1996] proposed a robust and fast skew detection algorithm based on a hierarchical Hough transform. It is capable of detecting the skew angle of various document images, including technical articles, postal labels, handwritten text, forms, drawings and bar codes. The algorithm is robust even when black margins introduced by photocopying are present in the image and when the document is scanned at a resolution as low as 50 dpi. The algorithm has two steps. In the first step, the centroids of connected components are quickly extracted using a graph data structure. Then, in the second step, a hierarchical Hough transform (at two different angular resolutions) is applied to the selected centroids. The skew angle corresponds to the location of the highest peak in the Hough space. The performance of the algorithm is demonstrated on a number of document images collected from various application domains. The algorithm is not very sensitive to its parameters.
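A minimal sketch of the two-resolution idea is given below: each candidate angle projects the centroids onto its normal, and the angle whose strongest projection bin collects the most centroids wins; a coarse pass is refined by a fine pass around the coarse winner. The angular ranges, bin width and two fixed resolutions are assumptions for illustration.

```python
import numpy as np

def skew_from_centroids(centroids, rho_res=2.0):
    """Hedged hierarchical-Hough sketch. centroids: (n, 2) array of (x, y)
    connected-component centroids. Returns the skew angle in degrees."""
    pts = np.asarray(centroids, dtype=float)

    def best_angle(angles):
        best, best_votes = 0.0, -1
        for a in angles:
            t = np.deg2rad(a)
            # points on a text line of skew a share rho = y*cos(a) - x*sin(a)
            rho = pts[:, 1] * np.cos(t) - pts[:, 0] * np.sin(t)
            bins = max(1, int((rho.max() - rho.min()) / rho_res) + 1)
            votes = np.histogram(rho, bins=bins)[0].max()
            if votes > best_votes:
                best, best_votes = a, votes
        return best

    coarse = best_angle(np.arange(-15.0, 15.1, 1.0))                # coarse pass
    return best_angle(np.arange(coarse - 1.0, coarse + 1.05, 0.1))  # fine pass
```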
Chaudhuri and Pal [1997 a] proposed an OCR system that can read two Indian language scripts, Bangla and Devnagari (Hindi), which are the most popular ones in the Indian subcontinent. These scripts, having the same origin in the ancient Brahmi script, have many features in common, and hence a single system can be modeled to recognize them. The proposed system performs document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and character grouping into basic, modifier and compound character categories. These are done for both scripts by the same set of algorithms. The feature sets, classification tree and knowledge base (required for error correction, such as the lexicon) differ for Bangla and Devnagari. The system shows good performance for single-font scripts printed on clear documents.
Chaudhuri and Pal [1997 b] considered skew angle detection of scanned documents in Devnagari and Bangla. Most characters in these scripts have horizontal lines at the top, called head lines. The character head lines mostly join one another in a word, and the word then appears as a single component. In the proposed method the components are labeled. The upper envelope of a component is found by column-wise scanning from an imaginary line above the component. Portions of the upper envelope satisfying the properties of a digital straight line are detected, and they are clustered as belonging to individual text lines. Estimates from individual clusters are combined to get the skew angle. An advantage of the method is that character segmentation and zone detection can be readily done from the head line information, which is useful in Optical Character Recognition approaches for these scripts.
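The column-wise envelope scan is straightforward; a minimal sketch is shown below (the digital-straight-line test and the clustering of estimates described in the paper are omitted, and the least-squares fit is an illustrative stand-in):

```python
import numpy as np

def upper_envelope(component):
    """For each column that contains ink, record the first ink row seen when
    scanning from above. component: 2-D array, 1 = ink."""
    ink = component > 0
    cols = np.where(ink.any(axis=0))[0]
    rows = ink.argmax(axis=0)[cols]       # argmax gives the first True row
    return cols, rows

def envelope_skew_degrees(component):
    """One per-component skew estimate: slope of a straight line fitted to
    the upper envelope points."""
    cols, rows = upper_envelope(component)
    slope = np.polyfit(cols, rows, 1)[0]  # least-squares line fit
    return np.degrees(np.arctan(slope))
```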
Chung and Yoon [1997] presented a performance comparison of several feature selection methods based on neural network node pruning. It is assumed that features are extracted and presented as the inputs of a three-layer perceptron classifier. Under this assumption, the authors applied five feature selection methods before, during or after neural network training, in order to prune only the input nodes of the network. Four of them are node pruning methods, namely the node saliency method, the node sensitivity method, and two interactive pruning methods using different contribution measures. The last one is a statistical method based on principal component analysis (PCA). The first two prune input nodes during training, whereas the last three do so before or after network training. Using gradient features and up-down, left-right hole-concavity features, several experiments on handwritten English alphabet and digit recognition were performed with and without pruning, using the five feature selection algorithms respectively. The experimental results show that the node saliency method outperforms the others.
Peake and Tan [1997] presented a detailed review of current script and language identification techniques, and proposed a method based on texture analysis for script identification which does not require character segmentation, whereas the existing schemes rely on either connected component analysis or character segmentation. A uniform text block, on which texture examination can be performed, is produced from a document image by simple processing. Multiple-channel (Gabor) filters and grey-level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified on the basis of the features of training documents using a k-NN classifier. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
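Classification in this scheme reduces to nearest-neighbour matching of texture feature vectors; a minimal k-NN sketch (with Euclidean distance and majority voting as assumed choices) is given below:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query, k=3):
    """Classify one text block from its texture feature vector (e.g. Gabor
    channel energies or grey-level co-occurrence statistics).
    train_feats: (n, d) array; train_labels: (n,) array; query: (d,)."""
    dists = np.linalg.norm(train_feats - query, axis=1)   # Euclidean distance
    nearest = np.asarray(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]                        # majority vote
```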
Alpaydin [1998] suggested that learners based on different paradigms can be combined for improved accuracy. Each learning method assumes a certain model that comes with a set of assumptions, which may lead to error if the assumptions do not hold. Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances. It is stated that classifiers based on these paradigms generalize differently, fail on different patterns and, to a certain extent, complement each other, so ways of combining them for higher accuracy are explored. One way to get complementary classifiers is to use different input representations. The methods investigated are voting, mixtures of experts, stacking and cascading. The proposed methods are evaluated on real-world applications such as optical handwritten digit recognition and pen-based handwritten digit recognition, and the paper reports satisfactory results.
Chaudhuri and Pal [1998] presented a complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world. This is the first OCR system among all the script forms used in the Indian subcontinent. The captured image is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation, using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about seventy-five in number and occupy about ninety-six percent of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template matching approach. The feature detection is simple and robust, and preprocessing such as thinning and pruning is avoided.
Madhvanath and Govindaraju [1998] proposed a methodology of coarse holistic features and heuristic prediction of ideal features from ASCII to address certain issues. One of the issues is perceptual holistic features: visually obvious features of the word shape that have been cited in reading studies as being utilized in fluent reading. While these features have been used for word recognition when the lexicon of possible words is small and static, their application to the general problem of omni-scriptor handwritten word recognition is limited by their variability at the word level and the paucity of samples for word-level training. Real-world examples of handwritten words are instances of the ideal paradigm of the word class, distorted by the scriptor, stylus, medium and intervening electronic imaging processes. This provides the basis for the proposed methodology. The proposed scheme has applications in verification and lexicon reduction for handwritten word recognition.
Reddy and Nagabhushan [1998] described a three-dimensional (3-D) neural network recognition system for conflict resolution in the recognition of unconstrained handwritten numerals. The neural network classifier is a combination of a modified self-organizing map (MSOM) and learning vector quantization (LVQ). The 3-D neural network recognition system has many layers of such neural network classifiers, and the number of layers forms the third dimension. The proposed scheme was evaluated by employing SOM, MSOM, SOM with LVQ, and MSOM with LVQ networks. These experiments on a database of unconstrained handwritten samples show that the combination of MSOM and LVQ performs better than the other networks in terms of classification, recognition and training time. The 3-D neural network eliminates the substitution error.
Tang et al. [1998] presented an offline recognition system based on multiple features and multilevel classification for handwritten Chinese characters. Ten classes of features, such as peripheral shape features, stroke density features, and stroke direction features, are used in the proposed system. The multilevel classification scheme consists of a group classifier and a five-level character classifier, for which two new techniques, overlap clustering and a Gaussian distribution selector, are developed. Experiments were conducted on a number of commonly used Chinese characters, and a high recognition rate is claimed in the paper.
Jung et al. [1999] proposed a segmentation method for machine-printed character strings of arbitrary length. It exploits recognition-based segmentation, combined with heuristic and holistic methods. The merged part of touching characters generates patterns whose shapes differ from the primitive character patterns; however, the far-left and far-right patterns of the touching characters are not affected by the touching. The algorithm first constructs a line adjacency graph (LAG) from a word image; blobs are found as connected components of the LAG, and small dot noise is removed. Secondly, since a word in English can be divided into three typographical zones (the ascender, x-height and descender zones), the locations of the connected components among those zones are also examined. Thirdly, the right profile of the touching characters is compared with those of the sample characters in the prototype, and the touching characters are then segmented with the width of one of the candidates in the prototype. Finally, the upward, downward and left profiles of the segmented pattern are compared with those of the candidate, respectively. The third and final steps are repeated until successful matchings of the resulting character patterns confirm the segmentation. The method has been tested on touching characters in the 'Times' and 'Helvetica' fonts, which are proportional-pitch fonts, and was found to be promising.
Lee and Kim [1999] proposed an integrated segmentation and recognition method using a cascade neural network. In the proposed method, a new type of cascade neural network is developed to learn the spatial dependencies in connected handwritten numerals. This cascade neural network is an extension of the multilayer feed-forward neural network, and the extension improves its discrimination and generalization power. The performance of the proposed method is verified through recognition experiments. As the experimental results make clear, the proposed method has higher discrimination and generalization power than previous integrated segmentation and recognition (ISR) methods, and its network size is smaller than that of the previous ISR methods.
Lehal and Singh [1999] described a feature extraction and hybrid classification scheme for machine recognition of Gurmukhi characters, using a binary decision tree and a nearest-neighbour classifier. The classification process is completed in three stages. In the first stage, the characters are grouped into sets depending on their zonal positions. In the second stage, the characters in the middle-zone set are further distributed into smaller subsets by a binary decision tree using a set of robust and font-independent features. In the third stage, the nearest-neighbour classifier is applied using special features that distinguish the characters. The significant point of this scheme is that a character image is tested against only certain subsets of classes at each stage, which enhances computational efficiency.
Oh et al. [1999] proposed a new approach to combining multiple features in handwriting recognition based on two ideas: feature-selection-based combination and class-dependent features. A non-parametric method is used for feature evaluation. The first part of the paper is devoted to the evaluation of features in terms of their class separation and recognition capabilities. In the second part, multiple feature vectors are combined to produce a new feature vector. Based on the fact that a feature has different discriminating powers for different classes, a new scheme of selecting and combining class-dependent features is proposed. In this scheme, a class is considered to have its own optimal feature vector for discriminating itself from the other classes. Using an architecture of modular neural networks as the classifier, a series of experiments was conducted on unconstrained handwritten numerals. The results indicate that the selected features are effective in separating pattern classes, and the new feature vector, derived from a combination of two types of features, further improves the recognition rate.
Arica and Yarman [2000] introduced a set of one-dimensional features to represent two-dimensional shape information for the HMM (hidden Markov model) based handwritten optical character recognition problem. The proposed feature set embeds two-dimensional information into a one-dimensional observation sequence, selected from a code book. It provides a consistent normalization among distinct classes of shapes, which is very convenient for HMM-based shape recognition schemes. The normalization parameters, which maximize the recognition rate, are dynamically estimated in the training stage of the HMM. The proposed character recognition system was tested on handwritten data from the NIST database and a local database, and the experimental results indicate very high recognition rates.
Chen and Wang [2000] proposed a new approach to segmenting single or multiple touching handwritten numeral strings (two digits). Most of the available algorithms for the segmentation of connected digits focus mainly on the analysis of foreground pixels; some concentrate on the analysis of background pixels only, and others depend on a recognizer-based concept. In this paper, however, a combination of background and foreground analysis is used to segment single or multiple touching handwritten numeral strings. Both the foreground and background regions of the connected numeral string image are first thinned, and the feature points on the foreground and background skeletons are extracted. Several possible segmentation paths are then constructed, with useless strokes removed in the process. Finally, the parameters of the geometric properties of each possible segmentation path are determined and analyzed with a mixture Gaussian probability function to decide on the best segmentation path or to reject all of them. Experimental results show that the proposed algorithm achieves a good accuracy rate.
Kim et al. [2000 a] presented a methodology which combines an HMM (hidden Markov model) and an MLP (multilayer perceptron) for cursive word recognition. An explicit segmentation scheme based on the HMM is designed and combined with an implicit-segmentation-based MLP using weighting coefficients. The main idea behind the proposed methodology is that distinct classifiers can better complement each other. A new probability measure for the hybrid classifier, as well as conventional combining schemes, is also introduced. The results reported in the paper showed good segmentation.
Kim et al. [2000 b] described a scheme for recognizing unconstrained handwritten numeral strings by a composite segmentation method. Two concepts, recognition-free and recognition-based segmentation, are combined. A digit group detector has been designed to separate touching digits from isolated digits by the recognition-free segmentation method. Subsequently, touching digits are segmented by prioritizing segmentation points, which are obtained by analyzing the ligature and touching types. Four special kinds of candidate segmentation points and six touching types are defined to obtain more stable segmentation points. As per the claim made in the paper, the proposed algorithm achieved a good success rate.
Lehal and Singh [2000] presented a system for recognition of machine-printed Gurmukhi script. Character recognition in Gurmukhi script faces major problems, mainly related to the unique characteristics of the script, such as the connectivity of characters on the head line, a large number of similar characters, and two or more characters in a word having intersecting minimum bounding rectangles. A set of very simple and easy-to-compute features is used, and a hybrid classification scheme consisting of a binary decision tree and nearest neighbours is employed.
Nicchiotti and Scagliola [2000] proposed a simple procedure for the over-segmentation of cursive words, based on the analysis of handwriting profiles and on the extraction of "white holes". Straight segmentation tries to decompose the image into a set of sub-images, each corresponding to a character. In segmentation-recognition strategies, the image is subdivided into a set of sub-images (strokes) whose combinations are used to generate character candidates. The number of sub-images is greater than the number of characters, and the process is therefore also referred to as over-segmentation. Recognition is then used to select the correct character hypothesis from the character candidates. The procedure follows the policy of using simple rules on complex data and sophisticated rules on simpler data. Experimental results show robustness and performance comparable with the best reported in the literature.
Plamondon and Srihari [2000] described how handwriting has continued to persist as a means of communication and of recording information in day-to-day life, even with the introduction of new technologies. Since handwriting has significance in human transactions, machine recognition of handwriting has practical significance, as in reading handwritten notes, postal addresses on envelopes, amounts on bank cheques, handwritten fields in forms, etc. The overview describes the nature of handwritten language and how it is transduced into electronic data, and gives insight into the concepts behind written language recognition algorithms. Both the online case (which pertains to the availability of trajectory data during writing) and the offline case (which pertains to scanned images) are considered. Algorithms for preprocessing, character and word recognition, and performance with practical systems are indicated. Other fields of application, such as signature verification, writer authentication and handwriting learning tools, are also considered in the paper.
Alimoglu and Alpaydin [2001] investigated techniques to combine multiple representations of a handwritten digit to increase classification accuracy without significantly increasing system complexity or recognition time. In pen-based recognition, the input is the dynamic movement of the pen tip over a pressure-sensitive tablet; there is also the image formed as a result of this movement. On a real-world database containing more than eleven thousand handwritten digits, the authors noticed that the two multi-layer perceptron (MLP) based classifiers using these representations make errors on different patterns, implying that a suitable combination of the two would lead to higher accuracy. They therefore implemented and compared voting, mixtures of experts, stacking and cascading. By combining the two MLP classifiers, higher accuracy is achieved, because the two classifiers/representations fail on different patterns. The multistage cascading scheme is especially advocated, where the second, costlier image-based classifier is employed only in a small percentage of cases.
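The contrast between voting and cascading can be sketched in a few lines; the class-posterior inputs, equal weights and confidence threshold below are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np

def vote(p_dyn, p_img, w=(0.5, 0.5)):
    """Weighted voting: combine the class posteriors of the two
    representation-specific classifiers and pick the best class."""
    return int(np.argmax(w[0] * np.asarray(p_dyn) + w[1] * np.asarray(p_img)))

def cascade(p_dyn, image_classifier, x_img, threshold=0.9):
    """Cascading: accept the cheap dynamic-representation classifier when it
    is confident; otherwise fall back to the costlier image-based one."""
    p = np.asarray(p_dyn)
    if p.max() >= threshold:
        return int(p.argmax())
    return image_classifier(x_img)   # invoked only on the hard cases
```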
Arica and Yarman [2001] served as an update for readers working in the character recognition area. First, an overview of character recognition systems and their evolution over time is presented. Then, the available character recognition (CR) techniques are reviewed along with their strengths and weaknesses. Finally, the current status of CR is discussed and directions for future research are suggested. Special attention is given to offline handwriting recognition, since this area requires more research to reach the ultimate goal of machine simulation of human reading.
Madhvanath and Govindaraju [2001] presented a survey taking a fresh look at the potential role of the holistic paradigm in handwritten word recognition. In the holistic paradigm, a word is treated as a single, indivisible entity, and recognition is attempted from its overall shape, as opposed to its character contents. The survey gives an overview of studies of the reading process which provide evidence for the existence of a parallel holistic reading process in both developing and skilled readers. Handwriting recognition approaches are characterized as forming a continuous spectrum based on the visual complexity of the unit of recognition employed, and an attempt is made to interpret well-known paradigms of word recognition in this framework. An overview of the features, methodologies, representations, and matching techniques employed by holistic approaches is also presented in the paper.
Srihari et al. [2001] undertook a study to objectively validate the hypothesis that handwriting is individualistic. Handwriting samples of one thousand five hundred individuals, representative of the US population with respect to gender, age, ethnic group, etc., were obtained. Differences in handwriting were analyzed using computer algorithms for extracting features from scanned images of handwriting, and attributes characteristic of the handwriting were obtained. The attributes chosen were line separation, slant, character shapes, etc. These attributes, which are a subset of the attributes used by expert document examiners, were used to quantitatively establish individuality by means of machine learning approaches. Using global attributes of handwriting and very few characters in the writing, the ability to determine the writer with a high degree of confidence was established. The work is a step towards providing scientific support for admitting handwriting evidence in court. The mathematical approach and the resulting software also hold promise for aiding the expert document examiner.
Acharyya and Kundu [2002] presented an efficient and computationally fast method for segmenting the text and graphics parts of document images based on textural cues, assuming that the graphics part has different textural properties from the non-graphics (text) part. The segmentation method uses the notions of multiscale wavelet analysis and statistical pattern recognition. The authors used M-band wavelets, which decompose an image into M×M bandpass channels. Various combinations of these channels represent the image at different scales and orientations in the frequency plane. The objective is to transform the edges between textures into detectable discontinuities and to create feature maps which give a measure of the local energy around each pixel at different scales. From these feature maps a scale-space signature is derived: the vector of features taken at each single pixel of the image across the different scales. It is claimed in the paper that segmentation is achieved by simple analysis of the scale-space signature with traditional k-means clustering. The proposed segmentation scheme assumes no prior information regarding the font size, scanning resolution, type of layout, etc. of the document.
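As a rough illustration of such a feature map (with the wavelet filtering assumed already done, and the mean squared response over a square window standing in for the paper's exact local energy measure), one channel's map can be computed with an integral image:

```python
import numpy as np

def local_energy(channel, win=8):
    """Local energy map of one bandpass channel: mean squared response over
    each win x win window, via an integral image. channel: 2-D array."""
    sq = np.square(channel.astype(float))
    ii = np.pad(sq, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = sq.shape
    r = np.arange(h - win + 1)[:, None]
    c = np.arange(w - win + 1)[None, :]
    # window sum from four integral-image lookups
    window_sum = (ii[r + win, c + win] - ii[r + win, c]
                  - ii[r, c + win] + ii[r, c])
    return window_sum / (win * win)
```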
Arica and Yarman [2002] proposed a new analytic scheme, which uses a sequence of image segmentation and recognition algorithms, for the offline cursive handwriting recognition problem. First, some global parameters, such as slant angle, baselines, stroke
recognition problem. First, some global parameters, such as slant angle, baselines, stroke
width and height, are estimated. Second, a segmentation method finds character
segmentation paths by combining gray scale and binary information. Third, a hidden
Markov model (HMM) is employed for shape recognition to label and rank the character
candidates. For this purpose, a string of codes is extracted from each segment to represent
the character candidates. The estimation of feature space parameters is embedded in the
HMM training stage together with the estimation of the HMM model parameters. Finally,
information from a lexicon and from the HMM ranks is combined in a graph optimization
problem for word level recognition. This method corrects most of the errors produced by
the segmentation and HMM ranking stages by maximizing an information measure in an
efficient graph search algorithm. The experiments indicate higher recognition rates
compared to the available methods reported in the literature.
Ashwin and Sastry [2002] described an OCR system for printed text documents in Kannada, a South Indian language. A scanned image of a page written in Kannada is given as input to the system, and a machine-editable file, compatible with most typesetting software, is produced as output. The proposed system extracts words from the document image, and the segmented words are further divided into sub-character-level pieces. The structure of the script is used in the proposed scheme for segmentation. A novel set of features for the recognition problem, computationally simple to extract, is proposed. The final recognition is achieved by employing a number of two-class classifiers based on the Support Vector Machine (SVM) method. The recognition is independent of the font and size of the printed text.
Garain and Chaudhuri [2002] observed that one of the important reasons for the poor recognition rates of optical character recognition (OCR) systems is error in character segmentation, and that the existence of touching characters in scanned documents is a major obstacle to designing an effective character segmentation procedure. In this paper, a new technique based on fuzzy multifactorial analysis is presented for the identification and segmentation of touching characters. A predictive algorithm is developed for effectively selecting possible cut columns for segmenting the touching characters. The proposed method has been applied to printed documents in Devnagari and Bangla, as the authors considered these two scripts the most popular of the Indian subcontinent. The results obtained from a test set of considerable size show that a reasonable improvement in recognition rate can be achieved with a modest increase in computation.
Kapoor et al. [2002] proposed an accurate and exhaustive approach to detecting the skew angle of images of words and characters of cursive Devanagari script. The approach was applied to 235 writing samples and a total collection of around 6000 samples. It is efficient in terms of time and is simpler than existing processes. The method is an extension of the work carried out by Pal and Chaudhuri. A heuristic approach has been applied to detect the skew angle, and the inherent dominating features of the structure of the Devanagari script have been used to accurately calculate the skew of a Devanagari word.
Pal et al. [2002] dealt with a new scheme for the automatic segmentation of unconstrained handwritten connected numerals. The approach is mainly based on the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch, and is obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. At first, considering reservoir location and size, the touching positions are decided. Next, by analyzing the reservoir boundary, the touching position and the topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features, the cutting path for segmentation is generated.
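The top reservoir can be visualized as the classic water-trapping computation on the upper profile of the component; the sketch below is an illustrative simplification of the paper's reservoir extraction (columns with no ink are treated as letting the water fall through):

```python
import numpy as np

def top_reservoir_depths(img):
    """Pour water from above onto a binary component and measure, per column,
    how deep it collects; contiguous positive runs are top reservoirs, whose
    location and size hint at the touching position. img: 2-D array, 1 = ink."""
    ink = img > 0
    rows = img.shape[0]
    top = np.where(ink.any(axis=0), ink.argmax(axis=0), rows)  # first ink row
    height = rows - top                       # upper profile, measured upward
    left_max = np.maximum.accumulate(height)
    right_max = np.maximum.accumulate(height[::-1])[::-1]
    water = np.minimum(left_max, right_max) - height
    water[height == 0] = 0                    # empty column: water falls through
    return water
```

Pouring from the bottom is the same computation applied to the vertically flipped image.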
Pal and Datta [2003] proposed a robust scheme to segment unconstrained handwritten Bangla text into lines, words and characters. For line segmentation, the text is first divided into vertical stripes; the stripe width is computed by statistical analysis of the text height in the document. The horizontal histograms of these stripes and the relationship of the minimal values of the histograms are used to segment text lines. Lines are then segmented into words based on the vertical projection profile. For the segmentation of characters, the water reservoir principle is used: first, isolated and touching characters in a word are identified, and then the touching characters of the word are segmented based on the reservoir base-area points and the structural features of the component.
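The histogram-based line cut can be sketched per stripe as follows (using zero-ink rows as the minima that separate lines, which is a simplification of the paper's minima analysis):

```python
import numpy as np

def segment_lines(stripe, min_height=2):
    """Cut one vertical stripe into text lines at blank rows of its horizontal
    projection. stripe: 2-D array, 1 = ink. Returns (start_row, end_row) pairs;
    runs shorter than min_height are discarded as specks."""
    profile = (stripe > 0).sum(axis=1)     # ink pixels in each row
    lines, start = [], None
    for r, ink_count in enumerate(profile):
        if ink_count > 0 and start is None:
            start = r                      # a text line begins
        elif ink_count == 0 and start is not None:
            if r - start >= min_height:
                lines.append((start, r - 1))
            start = None                   # blank row ends the line
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines
```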
Devessar et al. [2003] suggested a new approach to segmenting machine-printed Gurmukhi text. To resolve the issue of touching characters, a two-pass mechanism is used: in pass one the segmentation point is approximated, while in pass two the cutting point is optimized. The approach has been very successful in segmenting pairs as well as triplets of touching characters, and can easily be extended to other Indian scripts, such as Devnagari and Bangla, which have horizontal lines at the top called head lines.
Pal and Sarkar [2003] worked on an Optical Character Recognition system for printed Urdu. The document image is captured using a flatbed scanner and passed through skew correction, line segmentation and character segmentation modules, which are developed by combining conventional and newly proposed techniques. Next, individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust, and the approach achieves good character-level accuracy on average.
Pal et al. [2003 a] dealt with a new technique for the automatic segmentation of unconstrained handwritten connected numerals. To take care of the variability in the writing styles of different individuals, a robust scheme is presented in the paper. The scheme is mainly based on features obtained from the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch, and is obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. At first, considering reservoir location and size, the touching position (top, middle or bottom) is decided. Next, by analyzing the reservoir boundary, the touching position and the topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features, the cutting path for segmentation is generated.
Pal et al. [2003 b] stated that a document page may contain two or more different scripts. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate the different scripts before feeding them to their individual OCR systems. In the paper, an automatic scheme is presented to identify text lines of different Indian scripts in a document. For the separation task, the scripts are first grouped into a few classes according to script characteristics. In the next step, features based on the water reservoir principle, contour tracing, profiles, etc. are employed to identify them without using any expensive OCR-like algorithms.
Zhang et al. [2003] analyzed handwritten characters (allographs), which play an important role in forensic document examination; however, so far there has been a lack of comprehensive and quantitative study of the individuality of handwritten characters. Based on a large number of handwritten characters extracted from the handwriting samples of one thousand individuals in the US, the individuality of handwritten characters was quantitatively measured through identification and verification models. The study shows that, in general, alphabetic characters bear more individuality than numerals, and that the use of a certain number of characters will significantly outperform the global features of handwriting samples in handwriting identification and verification. Moreover, the quantitative measurement of the discriminative powers of characters offers general guidance for selecting the most informative characters in examining forensic documents.
Grau et al. [2004] presented a new image segmentation system. The system is based on the calculation of a tree representation of the original image, in which image regions are assigned to tree nodes, followed by a correspondence process with a model tree, which embeds prior knowledge about the images. An algorithm is proposed in the paper which minimizes an error function quantifying the difference between the input image tree and the model tree. Another algorithm is also proposed for automatically calculating the model tree from a set of manually segmented images. Results on synthetic and MR brain images are presented in the paper.
Pal and Roy [2004] stated that there are printed artistic documents in which the text lines of a single page may not be parallel to each other; these text lines may have different orientations, or may be curved in shape. For optical character recognition (OCR) of such documents, these lines need to be extracted properly. A novel scheme, mainly based on the concept of the water reservoir analogy, is proposed to extract individual text lines from printed Indian documents containing multioriented and/or curved text lines. In the proposed scheme, connected components are initially labeled and identified as either isolated or touching. Next, each touching component is classified as either straight type (S-type) or curve type (C-type), depending on the reservoir base area and envelope points of the component. Based on the type (S-type or C-type) of a component, two candidate points are computed from each touching component. Finally, candidate regions (neighborhoods of the candidate points) of each component are detected, and after analyzing these candidate regions, components are grouped to get individual text lines.
Tripathy and Pal [2004] proposed a scheme based on the water reservoir concept for the segmentation of unconstrained Oriya handwritten text into individual characters. First, the text image is segmented into lines, then the lines are segmented into individual words, and the words are segmented into individual characters. For line segmentation, the document is divided into vertical stripes; by analyzing the heights of the water reservoirs obtained from the different components of the document, the width of a stripe is calculated. Stripe-wise horizontal histograms are then computed, and the relationship of the peak-valley points of the histograms is used for line segmentation. Based on the vertical projection profile and the structural features of Oriya characters, text lines are segmented into words. For character segmentation, the isolated and connected characters in a word are first detected; using structural, topological and water reservoir concept based features, the touching characters of the word are then segmented.
Zheng et al. [2004] addressed the problem of identifying text in noisy document images. The paper focuses on segmenting and discriminating between handwriting and machine-printed text, because handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content, and moreover the segmentation and recognition techniques required for machine-printed and handwritten text are significantly different. The proposed scheme treats noise as a separate class and models it based on selected features. Trained Fisher classifiers are used to distinguish machine-printed text and handwriting from noise, and context is further exploited to refine the classification. A Markov Random Field based approach is used to model the geometrical structure of the printed text, handwriting and noise, in order to rectify misclassifications. As the results in the paper make clear, the scheme can significantly improve page segmentation in noisy document collections.
Jindal et al. [2005] identified the different kinds of degradation found in printed Gurmukhi script, namely touching characters, broken characters, heavily printed characters, faxed documents and typewritten documents. The problems associated with each kind of degradation have been discussed, and some possible solutions have also been presented.
Pal and Tripathy [2005] proposed a scheme for the recognition of Indian stylistic documents. Using features based on the water reservoir concept, the characters are segmented from the stylistic documents without any skew correction. Next, individual characters are recognized. For recognition, the contour distances of the outer contour points of a character are calculated from its centroid. These contour distances are then arranged in a particular order to obtain size- and rotation-invariant features. Finally, by computing statistical features on these arranged contour distances, the input character is recognized.
Jindal et al. [2006] stated that multiple horizontally overlapping lines are normally found in the printed newspapers of almost every language, due to the high-compression methods used for printing newspapers. For any optical character recognition (OCR) system, the presence of horizontally overlapping lines decreases the recognition accuracy drastically. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines. The whole document is divided into strips, and the proposed algorithm is applied for segmenting horizontally overlapping lines and associating small strips with their respective lines. The results reveal that the algorithm is almost ninety percent accurate when applied to the Gurmukhi script.
Li et al. [2006] dealt with curvilinear text line detection and segmentation in handwritten documents. Given no prior knowledge of script, the authors modeled text line detection as an image segmentation problem, enhancing the text line structure using a Gaussian window and adopting the level set method to evolve text line boundaries. Experiments show that the proposed method achieves high accuracy in detecting text lines in both handwritten and machine-printed documents with many scripts.
Jindal et al. [2007] stated that horizontally overlapping lines are normally found in printed newspapers of any Indian script. Along with these overlapping lines, broken components of a line (stripes) containing less text than a complete line are also found. The horizontally overlapping lines and other stripes make it very difficult to estimate the boundary of a line, leading to incorrect line segmentation, which in turn decreases recognition accuracy. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines and solved the problem of the other stripes in the eight most widely used printed Indian scripts. The whole document is divided into stripes, and the proposed algorithm is applied for segmenting horizontally overlapping lines and associating small stripes with their respective lines.
Sulem et al. [2007] surveyed line segmentation and described the huge number of historical documents in libraries and various national archives that have not been converted to electronic form. Although the automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and the extraction of specific fields are in use today. For all these tasks, a major step is to segment the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The authors presented a survey of existing methods developed during the last decade and dedicated to documents of historical interest.
Jindal et al. [2008] stated that the performance of an OCR system depends on the printing quality of the input document. A number of OCRs have been designed which correctly recognize finely printed documents in Indian and other scripts, but little reported work deals with the recognition of degraded documents. Consequently, when a standard OCR that works well on finely printed documents is tested on degraded documents, its performance decreases. Feature extraction is an important task in designing an OCR for recognizing degraded documents. In this paper, the authors discussed efficient structural features selected for recognizing degraded printed Gurmukhi script characters.
Li et al. [2008] proposed a novel approach based on density estimation and a state-of-the-art image segmentation technique called the level set method. From an input document image, a probability map is estimated, in which each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundaries of neighboring text lines by evolving an initial estimate. The proposed algorithm does not use any script-specific knowledge. Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that the algorithm performs consistently well.
Palacios and Gupta [2008] described the problems related to the processing of bank
cheques. In many countries, bank cheques are preprinted with the account number and the
cheque number in a special ink and format, so these two numeric fields can easily be read
and processed using automated techniques. However, the amount on a filled cheque is
usually read by human eyes, which involves significant time and cost. The system
described in this paper uses the scanned image of a bank cheque to 'read' the cheque. It
includes three main modules which, implemented together, allow for fully automated bank
cheque processing: the detection of strings within the image, the segmentation and
recognition of strings in a feedback loop, and the post processing that helps to ensure
higher recognition accuracy. The major benefit of the integrated system is its ability to
address the complex problem of reading handwritten bank cheques by implementing
efficient algorithms for each processing step. As per the paper, all modules have been
implemented and subsequently tested for reading the value of the cheque using different
image databases. Due to the particular requirements of this application, the system can be
tuned to yield low levels of incorrect readings. This leads to higher levels of rejection than
the levels encountered in other handwriting recognition applications. A 'rejected' cheque
can be read subsequently by human eyes or by other more advanced automated
approaches. However, a cheque 'read' incorrectly is more difficult to deal with, in terms of
the cost and time involved in rectifying the mistake. As such, the proposed architecture can
be geared towards producing the most suitable balance between inaccurate readings and
rejection level, in accordance with user preferences. The experimental results presented in
the paper do not focus on the best
possible results for a particular database of cheques; rather, they show the benefits attained
independently by each of the proposed modules.
Bukhari et al. [2009] stated that handwritten document images contain text lines with
multiple orientations, touching and overlapping characters within consecutive text lines,
and small inter line spacing, making text line segmentation a difficult task. In the paper,
the authors modeled text line extraction as a general image segmentation task. The central
line of each part of a text line is computed using ridges over the smoothed image. Then
state of the art active contours (snakes) are adapted over the ridges, which results in text
line segmentation.
Chaudhuri and Bera [2009] dealt with text line identification in handwritten Indian
scripts. The scripts discussed in the paper include Bangla, Hindi, Gurmukhi and
Malayalam, as well as English. A new dual method based upon the interdependency
between text lines and inter line gaps is proposed in the paper. The curves are drawn by
the proposed scheme simultaneously through the text and inter line gap points found from
strip wise histogram peaks and inter peak valleys. The curves start from the left and move
right, while points of one type guide the curve of the other type so that the curves do not
intersect. These curves are then allowed to evolve iteratively so that the text line curves
cross more character strokes while the inter line curves cross fewer character strokes, and
yet the curves remain as straight as possible. After several iterations, the curves stabilize
and define the final text lines and inter line gaps. The approach works well on text of
different scripts with various geometric layouts, including poetry.
Philip and Samuel [2009] described an Optical Character Recognition (OCR) system
for printed text documents in Malayalam, one of the South Indian languages. It is a known
fact that Indian scripts are rich in patterns, and the combinations of such patterns make the
problem even more complex. In the paper, however, these complex patterns
are exploited to obtain the solution. The proposed system decomposes the scanned
document image into text lines, words, and further into characters and sub characters. The
proposed segmentation algorithm is influenced by the structure of the script. A novel set of
features, computationally simple to extract, is proposed. The proposed approaches are
based on the distinctive structural features of machine printed text lines written in these
scripts. A lateral cross sectional analysis is performed along each row of the normalized
binary image matrix, resulting in distinct features. The final recognition is done through
classifiers based on the Support Vector Machine (SVM) method. The proposed algorithms
have been tested on a variety of printed Malayalam characters and give good results.
Yin and Liu [2009] suggested a novel text line segmentation algorithm based on
minimal spanning tree (MST) clustering with distance metric learning. Given a distance
metric, the connected components (CCs) of the document image are grouped into a tree
structure, from which text lines are extracted by dynamically cutting the edges using a new
hyper volume reduction criterion and a straightness measure. By learning the distance
metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made
robust enough to handle various documents with multi skewed and curved text lines. The
results presented in the paper suggest that the proposed method works very well.
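The learned distance metric is the central contribution of the paper; the Python sketch below substitutes a plain Euclidean distance between component centroids and a median based edge cutting rule, so it only illustrates the MST clustering skeleton. The cut_factor heuristic and the toy page are assumptions.

import numpy as np
from scipy.ndimage import label, center_of_mass
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_text_lines(binary, cut_factor=2.0):
    """Cluster connected components into text lines: build a minimum
    spanning tree over component centroids, cut edges much longer than
    the median edge, and return the resulting groups of labels."""
    labels, n = label(binary)
    if n < 2:
        return [np.arange(1, n + 1)]
    cents = np.array(center_of_mass(binary, labels, range(1, n + 1)))
    mst = minimum_spanning_tree(squareform(pdist(cents))).toarray()
    edges = mst[mst > 0]
    mst[mst > cut_factor * np.median(edges)] = 0   # cut implausibly long edges
    n_groups, group = connected_components(mst, directed=False)
    return [np.where(group == g)[0] + 1 for g in range(n_groups)]

page = np.zeros((32, 40), dtype=int)
page[2:5, 2:8] = page[2:5, 12:18] = 1        # components of line 1
page[24:27, 2:8] = page[24:27, 12:18] = 1    # components of line 2
print(mst_text_lines(page))   # [array([1, 2]), array([3, 4])]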
Das et al. [2010] addressed the segmentation of overlapping text lines and characters in
Telugu text. Segmentation is an important task of any OCR system, and the accuracy of an
OCR system depends mainly on the segmentation algorithm being used. Segmentation of
Telugu text is difficult compared with Latin based languages because of its structural
complexity and increased character set. It contains vowels, consonants and compound
characters, and some of the characters may overlap. Profile based methods can only
segment non overlapping lines and characters. The proposed algorithm
is based on projection profiles, connected components and spatial vertical relationships.
In this method, to segment the image into lines and characters, the connected components
are first extracted from the document image and labeled. For each connected component,
the top, bottom, left and right positions are identified. Then, the nearest neighborhood
method is used to cluster the connected components. As per the results shown in the
paper, good character segmentation accuracy can be achieved even with overlapping lines
and characters.
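The first two steps, extracting and labeling the connected components and recording the top, bottom, left and right of each, can be sketched in Python as follows; this is a minimal illustration with scipy, not the authors' implementation.

import numpy as np
from scipy.ndimage import label, find_objects

def component_boxes(binary):
    """Label the connected components of a binary image (1 = ink) and
    return (label, top, bottom, left, right) for each component."""
    labels, n = label(binary)
    boxes = []
    for lab, slc in enumerate(find_objects(labels), start=1):
        rows, cols = slc
        boxes.append((lab, rows.start, rows.stop - 1, cols.start, cols.stop - 1))
    return boxes

img = np.zeros((10, 10), dtype=int)
img[1:4, 1:4] = 1   # component 1
img[6:9, 5:9] = 1   # component 2
print(component_boxes(img))   # [(1, 1, 3, 1, 3), (2, 6, 8, 5, 8)]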
Kumar and Sengar [2010] described line, word, character and top character
segmentation for printed Hindi text in the Devanagari and Gurmukhi scripts. The global
horizontal projection method computes the sum of all black pixels in every row and
constructs the corresponding histogram. Based on the peak/valley points of the histogram,
individual lines and words are separated.
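The same projection idea works column-wise within a segmented line. As an illustration (the gap threshold min_gap is an assumption, and the peak/valley analysis of the paper is more elaborate), the Python sketch below splits a binary line image into words by treating wide runs of blank columns as inter word gaps.

import numpy as np

def segment_words(line_img, min_gap=3):
    """Split a binary line image (1 = ink) into words: maximal runs of
    inked columns, merging runs separated by fewer than min_gap blank
    columns, which are treated as intra-word spacing."""
    inked = line_img.sum(axis=0) > 0        # vertical projection > 0
    words, start, blanks = [], None, 0
    for col, has_ink in enumerate(inked):
        if has_ink:
            if start is None:
                start = col
            blanks = 0
        elif start is not None:
            blanks += 1
            if blanks >= min_gap:           # a real inter-word gap
                words.append((start, col - blanks))
                start, blanks = None, 0
    if start is not None:
        words.append((start, int(np.flatnonzero(inked)[-1])))
    return words

line = np.zeros((8, 40), dtype=int)
line[2:6, 2:10] = line[2:6, 12:18] = 1   # two characters, narrow gap
line[2:6, 25:35] = 1                     # next word after a wide gap
print(segment_words(line))               # [(2, 17), (25, 34)]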
Nallapareddy et. al. [2010] proposed a robust method for segmentation of individual
text lines based on the modified histogram obtained from run length based smearing. A
complete line and word segmentation system for some popular Indian printed languages is
presented in the paper. Both foreground and background information is used here for
accurate line segmentation. There may be touching or overlapping characters between two
consecutive text lines, and most line segmentation errors are generated by such touching
and overlapping character occurrences. Sometimes, small inter line spacing and noise also
make line segmentation a difficult task. The proposed method can take care of these
situations accurately. Word segmentation from individual lines is also discussed here. The
results of the proposed method on documents of Bangla, Devnagari, Kannada, Telugu
scripts as well as some multi script documents are shown in the paper.
Aradhya and Naveena [2011] proposed a novel method for text line segmentation of
unconstrained handwritten Kannada script. The proposed method consists of two phases.
In the first phase, a mathematical morphology technique is used to bridge the gaps
between character components. In the second phase, a component extension technique is
used for text line extraction.
Mahender and Kale [2011] stated that writing, which has been the most natural method
of collecting, storing and transmitting information through the centuries, now serves not
only for communication among humans, but also for communication between humans and
machines. Free style handwriting recognition is difficult not only because of the great
amount of variation in the shapes of characters, but also because of the overlapping and
interconnection of neighboring characters. The authors presented a structure based feature
extraction and rule based recognition scheme for handwritten Marathi words.
Pradeep et al. [2011] presented an off line handwritten alphabetic character recognition
system using a multilayer feed forward neural network. A new method, called diagonal
based feature extraction, is introduced for extracting the features of the handwritten
alphabets. The proposed recognition system performs quite well, yielding higher levels of
recognition accuracy compared with systems employing the conventional horizontal and
vertical methods of feature extraction. The system is suitable for converting handwritten
documents into structured text form and for recognizing handwritten names.
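As a rough illustration of the diagonal idea (a sketch of the general technique, not necessarily the exact zoning or feature counts of the paper), the Python code below splits a normalized character image into fixed size zones and, in each zone, averages the ink along every diagonal before collapsing these values into a single feature per zone. The zone size and image size are assumptions.

import numpy as np

def diagonal_features(char_img, zone=10):
    """Zone-wise diagonal features: split a normalized binary character
    image into zone x zone blocks; in each block, average the ink along
    every diagonal, then average those values into one feature."""
    h, w = char_img.shape
    assert h % zone == 0 and w % zone == 0, "image must tile into zones"
    feats = []
    for r in range(0, h, zone):
        for c in range(0, w, zone):
            block = char_img[r:r + zone, c:c + zone]
            diags = [np.diagonal(block, offset=k).mean()
                     for k in range(-zone + 1, zone)]   # 2*zone - 1 diagonals
            feats.append(np.mean(diags))
    return np.array(feats)

char = (np.random.rand(90, 60) > 0.5).astype(float)  # stand-in character
print(diagonal_features(char).shape)                 # (54,) for 9 x 6 zones

The resulting feature vector would then be fed to the feed forward network for classification.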
From the above literature, it can be summarized that although there is a rapidly
growing body of work on segmenting scanned documents in international scripts as well
as Indian languages, relatively few studies have examined how to effectively segment a
document written in Gurmukhi script, and the studies that are available for Gurmukhi
script mostly deal with machine printed text.
2.2 Need of the Study
Lu [1995] categorized text in order of increasing segmentation difficulty: well
separated and unbroken characters in proportional spacing, in which characters occupy
different amounts of horizontal space depending on their shapes; broken characters, where
a single character has more than one component; touching characters, where more than
one character lies in a single connected component; and text in which characters are both
broken and touching. In most OCR systems, character recognition is performed on
individual characters. The pre processing stage yields a ‘clean’ document in the sense that
a sufficient amount of shape information, high compression and low noise on a
normalized image is obtained.
According to Pal et al. [2003 b], in India, there are 18 official (Indian constitution
accepted) languages. Two or more of these languages may be written in one script. Twelve
different scripts are used for writing these languages. Under the three language formula,
many of the Indian documents are written in three languages namely, English, Hindi and
the state official language. For example, a money order form in the Punjab state may be
written in English, Hindi and Gurmukhi, because Gurmukhi (Punjabi) is the state official
language of Punjab. Some properties common to Indian language scripts are given below.
2.2.1 Properties of Indian Language Scripts
Assamese, Bangla, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri,
Malayalam, Marathi, Nepali, Oriya, Panjabi, Rajasthani, Sanskrit, Tamil, Telugu and Urdu
are the official languages of India. Hindi is the most popular language in India, followed
by Bangla, which is the second most popular. On the global scene, English is the most
popular language, while Hindi and Bangla are the 4th and 5th most popular languages in
the world. The scripts used for the Indian languages are
not all different. One script may be used to write several languages. For example, the
Bangla script is used to write the Assamese and Bangla (Bengali) languages, while the
Devnagari script is used to write the Hindi, Marathi, Rajasthani, Sanskrit and Nepali
languages. Constitution wise, there are twelve different scripts used to write these 18
languages. Pal et al. [2003 b] stated that these scripts are Urdu, Tamil, Telugu, Gurmukhi
(Panjabi), Devnagari, Bangla, English, Gujarati, Kannada, Kashmiri, Malayalam, and Oriya.
Examples of different script lines are shown in figure 2.1.
Figure 2.1: Different Indian script lines (from top to bottom: Devnagari, Bangla, Gurmukhi, Malayalam,
Kannada, English, Tamil, Telugu, Urdu, Kashmiri, Gujarati, Oriya)
Most Indian scripts have an alphabet system of basic characters, which are the vowel
and consonant characters. Apart from these basic characters, there are compound
characters formed by combining two or more basic characters. The shape of a compound
character is usually more complex than that of its constituent basic characters.
In some scripts (like Gurmukhi, Devnagari or Bangla), many characters of the alphabet
system have a horizontal line at the upper part. In Devnagari this line is called ‘sirorekha’,
while in Bangla it is called ‘matra’. In the present study, it is referred to as the head line.
When two or more characters are put side by side to form a word, the head line portions of
these characters touch one another and generate a long head line, which is used as a
feature for script identification. In most Indian languages, a text line may be partitioned
into three zones: the higher zone, the heart zone and the lower zone. The different zones
are shown in figure 2.2.
Figure 2.2: Different zones of English, Devnagari and Gurmukhi text line
The higher zone denotes the portion above the head line. The portion below the head
line is known as the heart zone; this zone covers the main portion of basic as well as
compound characters. The lower zone is the portion below the base line. For texts whose
script lines do not contain a head line, the mean line separates the higher zone and the
heart zone. The base line separates the heart zone and the lower zone. Pal et al. [2003 b]
opined that the mean line can be defined as an imaginary line where most of the uppermost
(lowermost) points of the characters of a text line lie. The uppermost and lowermost
boundary lines of a text line are named the upper line and the lower line.
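Following this definition, the mean line and base line of a text line can be estimated as the modal uppermost and lowermost rows of its connected components. The Python sketch below is a minimal interpretation of that definition; the toy line image is an assumed input.

import numpy as np
from scipy.ndimage import label, find_objects

def mean_and_base_lines(line_img):
    """Estimate the mean line (modal uppermost row of the components)
    and the base line (modal lowermost row) of a binary text line."""
    labels, n = label(line_img)
    tops, bottoms = [], []
    for slc in find_objects(labels):
        tops.append(slc[0].start)
        bottoms.append(slc[0].stop - 1)
    mean_line = np.bincount(tops).argmax()      # most common top row
    base_line = np.bincount(bottoms).argmax()   # most common bottom row
    return mean_line, base_line

line = np.zeros((12, 30), dtype=int)
for c in (2, 8, 14, 20):           # four 'characters' sharing rows 3 to 8
    line[3:9, c:c + 4] = 1
line[9:11, 26:29] = 1              # one descender component
print(mean_and_base_lines(line))   # (3, 8)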
2.2.2 Features of Indian Languages and Scripts
A feature is something present in a symbol or character of a script; for example, a
feature can be a side bar, a loop, and so on. A character may contain one feature, a
combination of certain features, or none at all. Kumar et al. [2003] are of the opinion that
there are certain features present in or common to Indian scripts; some of these are given
in the following sections.
2.2.2.1 Common Alphabet: The alphabets of the Indian languages have been derived
from the Sanskrit alphabet. Usually, there is a common set of alphabets containing 33
consonants and 15 vowels. In addition, there are three to four consonants and two to three
vowels which are used in specific languages or in the classical forms of others. This
difference is not very significant in practice. The basic letters of the alphabet are formed
by individual consonants and vowels. The only exception is the Tamil language, which
uses twelve fewer consonants. However, the structure of Tamil is not too different either,
as this change can be modeled as dropping some of the consonants from the master list.
2.2.2.2 Akshara or Akhar: Akshara is the notion used for the basic unit, i.e. the character,
of Indian languages; with reference to Gurmukhi it is also known as an Akhar. It forms the
fundamental linguistic unit, like a character in English. An akhar can be made up of 0, 1,
2 or 3 consonants and a vowel. The combination of one or more akhars makes a word.
As the languages are completely phonetic, each akhar can be pronounced independently.
Samyuktaksharas are combinations of akhars with more than one consonant; they are also
called combo characters. The last of the consonants is the main one in a samyuktakshara.
2.2.2.3 Diverse Graphemes: The commonality in the alphabet does not mean that the
same graphic forms are used to express it in print. Each language uses a different script
consisting of dissimilar graphemes for printing. Thus, printed matter of one language is
unapproachable to readers of another language, even when the underlying alphabet is
common. As noted above, there are twelve major scripts in India. The Devanagari script is
the most widely used one, being used to write Hindi, Marathi, Konkani, and Nepali (the
language of the neighboring nation of Nepal). Different scripts follow different
philosophies for the individual graphemes and their combinations. Some have a head line
while others have non touching graphemes. The grapheme of one of the consonants is
usually at the heart of the printed akshara. The vowel appears as a matra or vowel
modifier. These can appear above, below, or to the right or left of it, or in combinations.
The supporting consonants of a samyuktakshara also appear as modifier graphemes above,
below, or to the right or left of the main one. These modifiers may be truncated or scaled
down forms of the basic consonant, but they may also be completely different. They may
touch each other or the main consonant in some cases, or may be separated. These rules
are not consistent even within a script, and certainly not across scripts.
2.2.2.4 Formless Font Design: With the wide use of information technology over the last
few decades, different fonts have been designed for each Indian script. The fonts are built
from glyphs and follow the graphical structure of each script, which differs from language
to language. It is not possible to use a consistent set of rules for this step across all scripts,
and no conventions have been followed.
2.2.3 Gurmukhi Script
Lehal and Singh [2002] concluded that the word Gurmukhi is derived from the
combination of the two words “Guru” and “Mukh”. Gurmukhi means to record the sayings
from the mukh (mouth or lips) of the Gurus. The credit for originating this script goes to
Guru Angad Dev Ji, who not only rearranged but also modified and shaped certain letters
into a script. He gave a new shape and order to the alphabet and made it precise and
accurate. Those letters were retained which depicted sounds of the then spoken language.
There was also some rearrangement of the letters; for example, s and h were shifted to the
first line and a was given the first place in the new alphabet. It is believed that Gurmukhi
belongs to the Brahmi family. The Aryans developed a script known as Brahmi, which
was adapted as per their local needs and introduced between the 8th and 6th centuries B.C.
Gurmukhi script is primarily used for the Punjabi language, which is the world’s 14th
most widely spoken language. Gurmukhi script is a logical composition of its constituent
symbols in two dimensions. It is an alphabetic script. Lehal and Singh [1999] explained
that Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters
which lie at the feet of consonants. These vowels and consonants are shown in figures 2.3
and 2.4 respectively. Besides the consonants and the vowels, the other constituent symbols
in Gurmukhi are a set of vowel modifiers, called matras, placed to the left, right, above or
at the bottom of a character or conjunct, and pure consonant forms corresponding to some
consonants (also called half letters) which, when combined with other consonants, yield
conjuncts.
Figure 2.3: Vowels and Vowel diacritics (Laga Matra)
Figure 2.4: Consonants (Vianjans) of Gurmukhi Script
The writing style is from left to right, and the concept of upper/lower case (as in
English) is absent. Most of the characters have a horizontal line at the upper part; this line,
called the headline, usually connects the characters of a word. Lehal and Singh [2000]
suggested that a word in Gurmukhi script can also be partitioned into horizontal zones.
The upper zone denotes the region above the headline. The area below the headline, where
the major part of the character is located, is the centre zone or heart zone. These zones are
shown in figure 2.5.
a) Upper zone from line number 1 to 2 b) Heart zone from line number 3 to 4
c) Lower zone from line number 4 to 5
Figure 2.5: Three zones in Gurmukhi script
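Since the head line is usually the row with the densest ink near the top of a word, a common heuristic, sketched below in Python, locates it with a horizontal projection; everything above that row then falls in the upper zone. This is only an illustration, not the method proposed later in this thesis, and the band fraction is an assumption.

import numpy as np

def find_head_line(word_img, band=0.4):
    """Locate the head line of a binary Gurmukhi word image (1 = ink) as
    the row with the maximum ink count within the top band fraction of
    the image; rows above it form the upper zone."""
    profile = word_img.sum(axis=1)
    upper = int(len(profile) * band) or 1
    return int(np.argmax(profile[:upper]))

word = np.zeros((20, 40), dtype=int)
word[6, 2:38] = 1                         # the connecting head line
word[7:16, 5:9] = word[7:16, 20:24] = 1   # consonant bodies below it
print(find_head_line(word))               # 6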
Gurmukhi script has the following characteristics:
• Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters
which lie at the feet of consonants.
• The characters of words are mostly connected by a horizontal line called the head line.
• All Gurmukhi letters have uniform height.
• All letters in Gurmukhi can be written between two parallel horizontal lines; the only
exception is a, whose top curve extends beyond the upper line.
• From left to right, letters have almost uniform length; only A (aira) and g (ghaggha)
may be slightly longer than the rest.
• The form of a letter is not affected when a vowel symbol or diacritic is attached to it,
the only exception being a, to which an additional curve representing two syllables is
added.
• A word in Gurmukhi script can be partitioned into three horizontal zones. The upper
zone denotes the region above the head line, where vowels reside; the middle zone or
heart zone represents the area below the head line, where the consonants and some sub
parts of vowels are present; and the lower zone represents the area below the middle
zone, where some vowels and certain half characters lie at the feet of consonants.
• The half characters in the lower zone frequently touch the consonants lying in the zone
above.
• There are many multi component characters in Gurmukhi script. A multi component
character is a character that can be decomposed into isolated parts.
• The bounding boxes of 2 or more characters in a word may intersect or overlap
vertically.
Lehal and Singh [2002] asserted that the Gurmukhi script is a two dimensional
composition of consonants, vowels and half characters which requires segmentation in the
vertical as well as the horizontal direction. Thus, the segmentation of Gurmukhi text calls
for a two dimensional analysis instead of the one dimensional analysis commonly used for
the Roman script. The literature survey reveals that, for the following reasons, a unique
segmentation method is required for handwritten Gurmukhi script.
• The letters in cursive writing are often connected.
• The individual letters in a cursive word are often written so as to be unidentifiable
as isolated characters.
• There is variance in writing style.
• If the handwritten line is slanted, it is difficult to segment.
• The writing quality of a handwritten document is not uniform throughout the
document.
• The font size cannot be guessed, which is very important for character segmentation.
• Some handwritten letters, like m (in English), can also be interpreted as the pair nn,
as shown in figure 2.6 (a). Similarly, in Gurmukhi, the character g can be segmented
as rw, as shown in figure 2.6 (b).
Figure 2.6 (a): Incorrect Segmentation of a character in English
Figure 2.6 (b): Incorrect Segmentation of a character in Gurmukhi
2.3 Objectives of Study
The objectives of the proposed work are:
I. To devise an algorithm to segment a line of handwritten Gurmukhi script.
II. To develop an algorithm to segment a word of a line of handwritten Gurmukhi script.
III. To build an algorithm to segment a character of a word of handwritten Gurmukhi
script.
IV. To compare the developed algorithms with the classical approach / recognition based
approach so as to present a comparative analysis.
2.4 Research Methodology
The proposed work consists of the following components:
i). Detailed literature reviews were carried out to ascertain the significance of
segmentation in Gurmukhi script.
ii). A survey of various segmentation methodologies was carried out through various
case studies.
iii). In the initial stage, efforts were made to develop an algorithm to segment a line
of handwritten Gurmukhi script.
iv). At the next stage, an algorithm was developed to segment a word of a line of
handwritten Gurmukhi script.
v). Finally, an algorithm was developed to segment a character of a word of
handwritten Gurmukhi script.
vi). The developed work was compared with the existing approach so as to present a
comparative analysis.