Chapter 2
Literature Survey and Objectives
2.1 Literature Survey
In India, there are 18 official languages recognized by the Indian Constitution. Two or more of these languages may be written in one script, and twelve different scripts are used for writing these languages. Many Indian documents are required to be written in three languages, namely English, Hindi and the state official language, as per the three-language formula. For example, a money order form in the state of Tamil Nadu is written in English, Hindi and Tamil, because Tamil is the official language of that state. The need for some form of automated or semi-automated OCR has been recognized for decades. Since segmentation is a crucial phase of OCR, particular attention should be given to it. Today, there are numerous algorithms that perform this task, each with its own strengths and weaknesses. In this survey, a number of papers related to the present work are reviewed and presented.
Dunn and Wang [1992] surveyed techniques for segmenting images of handwritten text into individual characters. The topic is divided into two categories: straight segmentation techniques and segmentation-recognition techniques. Straight segmentation, discussed first in the paper, is the technique of forming rules to identify members of a character set without identifying their specific classification. It is useful for printed character sets but less effective for cursive text. It greatly reduces the complexity of the search for a word hypothesis, since the character boundaries are predetermined. Several approaches to segmentation-recognition are also discussed in the paper, and each is analyzed for its relevance to printed, cursive, online and offline input data.
Segmentation-recognition strategies are more expensive due to the increased complexity of searching for optimum word hypotheses. However, the inherent ambiguity of cursive text requires this type of segmentation.
Fujisawa et al. [1992] presented a pattern-oriented segmentation method for optical character recognition that leads to document structure analysis. As a case study, segmentation of handwritten numerals which touch each other is taken up first. Connected pattern components are extracted, and spatial interrelations between components are measured and grouped into meaningful character patterns. Stroke shapes are analyzed and, on the basis of that analysis, a method is described to find the touching positions; it separates almost all connected numerals correctly. The authors handled ambiguities by generating multiple hypotheses and verifying them through recognition. An extended form of pattern-oriented segmentation, tabular form recognition, is also considered. Images of tabular forms are analyzed, and frames in the tabular structure are segmented. By identifying semantic relationships between label frames and data frames, information on the form can be properly recognized.
Abulhaiba and Ahmed [1993] presented an automatic offline character recognition system for totally unconstrained handwritten numerals using fuzzy logic. The system was trained and tested on field data collected by the U.S. Postal Service from dead-letter envelopes. It was trained on one thousand seven hundred and sixty-three unnormalized samples. The training process produced a feasible set of one hundred and five Fuzzy Constrained Character Graph Models (FCCGMs). FCCGMs tolerate large variability in size, shape and writing style. Characters were recognized by applying a set of rules to match a character tree representation to an FCCGM. A character tree is obtained by first converting the character skeleton into an approximate polygon and then transforming the polygon into a tree structure suitable for recognition purposes. The system was tested on one thousand eight hundred and twelve unnormalized samples (not including the training set) and proved powerful in recognition rate and in its tolerance to variations in writer, pen, paper texture and ink color.
Akindele and Belaid [1993] described a page segmentation method that allows
one to cut a document page image into polygonal blocks as well as into classical
rectangular blocks. The inter-column and inter-paragraph gaps are extracted as horizontal and vertical lines, and an intersection table is built from these lines. The points of intersection between these lines are treated as vertices of polygonal blocks. With the aid of 4-connected chain codes and the derived intersection table, simple isothetic polygonal blocks are constructed from these points of intersection. The method is robust
enough to be applied to obtain polygonal blocks of any shape and any number of sides.
Pavlidis [1993] stated that research in optical character recognition (OCR) has focused on the shape analysis of binarized images, under the assumption of good-quality documents and isolated characters. Such assumptions are challenged by the conditions met in practice. Binarization is difficult for low-contrast documents, where characters often touch each other, not only on the sides but also between lines. The author discussed current efforts to deal with OCR as a signal processing problem, where the causes of noise and distortion as well as the idealized images (definitions of typefaces) are modeled and subjected to quantitative analysis. The key idea of the analysis is that while printed text images may be binary in an ideal state, the images seen by the sensors are gray scale because of convolution distortion and other causes. Finally, it is stated that binarization should be carried out at the same time as feature extraction.
Liang et al. [1994] proposed a new discrimination function for segmenting touching characters. This function is based on both pixel projection and profile projection. A dynamic recursive segmentation algorithm is developed for effectively segmenting touching characters. Contextual information and a spelling checker are used to correct errors caused by incorrect recognition and segmentation. As per the paper, the proposed algorithm achieved good recognition accuracy.
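For illustration, the sketch below computes one plausible per-column discrimination score over a binary word image, mixing the pixel projection (ink count per column) with a profile-based projection (vertical extent between the top and bottom profiles), and returns the weakest columns as cut candidates; the weighting alpha and the scoring formula are illustrative assumptions, not the exact function of Liang et al.

```python
import numpy as np

def cut_candidates(word, alpha=0.5, n=5):
    """Hedged sketch: score each column by mixing pixel projection with a
    profile-based projection; low scores suggest cuts between touching
    characters. word: 2-D array, 1 = ink."""
    ink = word > 0
    rows = word.shape[0]
    has_ink = ink.any(axis=0)
    pixel_proj = ink.sum(axis=0).astype(float)          # ink pixels per column
    top = np.where(has_ink, ink.argmax(axis=0), rows)   # first ink row
    bottom = np.where(has_ink, rows - 1 - ink[::-1].argmax(axis=0), -1)
    profile_proj = np.clip(bottom - top + 1, 0, None).astype(float)
    score = alpha * pixel_proj + (1 - alpha) * profile_proj
    score[~has_ink] = np.inf                            # ignore blank columns
    return np.argsort(score)[:n]                        # n weakest columns
```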
Seni and Cohen [1994] described techniques to separate a line of unconstrained
(written in a natural manner) handwritten text into words. When the writing style is
unconstrained, recognition of individual components may be unreliable, so these components must be grouped together into word hypotheses before recognition algorithms, which may require dictionaries, can be used. The proposed system uses original algorithms to determine distances between components in a text line and to detect punctuation. The algorithms are tested on a number of handwritten text lines extracted from
postal address blocks. A detailed performance analysis of the complete system and its
components is presented in the paper.
Avi-Itzhak et al. [1995] stated that optical character recognition (OCR) refers to a process by which printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In the proposed study, a neural network approach is introduced to perform high-accuracy recognition on multi-size and multi-font characters. A novel centroid dithering training process with a low-noise-sensitivity normalization procedure is used to achieve highly accurate results. The study is divided into two parts. The first part focuses on single-size, single-font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-point Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all font sizes from 8 to 32 points and for 12 commonly used fonts. The performance of these two networks is evaluated on a database of more than one million character images from the testing data set.
Congedo et al. [1995] presented a procedure for the segmentation of handwritten numeric strings. The proposed procedure follows a hypothesis-then-verification strategy. In the paper, multiple segmentation algorithms, based on contiguous row partitioning, work sequentially on the binary image until an acceptable segmentation is obtained. To achieve this purpose, a new set of algorithms simulating a "drop falling" process is introduced. Drop-fall algorithms attempt to build a segmentation path by mimicking an object falling or rolling between the two characters which make up a connected component. There are four primary types of drop-fall algorithms, which differ in the direction and starting point of the fall. These are top-left (or left descending), top-right (or right descending), bottom-left (or left ascending), and bottom-right (or right ascending). The experimental tests demonstrate the effectiveness of the new algorithms in obtaining high-confidence segmentation hypotheses.
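To make the falling-drop behaviour concrete, the following is a minimal sketch of a top-left ("left descending") drop fall on a binary image; the tie-breaking order and the cut-through rule when the drop is fully blocked are simplifying assumptions rather than the exact rules of the paper.

```python
import numpy as np

def drop_fall_top_left(img, start_col):
    """Hedged top-left drop-fall sketch. img: 2-D array, 1 = ink (foreground).
    Returns the (row, col) cells visited by the falling 'drop'; the visited
    cells form a candidate segmentation path."""
    rows, cols = img.shape
    r, c = 0, start_col
    path = [(r, c)]
    while r < rows - 1:
        if img[r + 1, c] == 0:                       # fall straight down
            r += 1
        elif c + 1 < cols and img[r + 1, c + 1] == 0:
            r, c = r + 1, c + 1                      # slide down-right
        elif c - 1 >= 0 and img[r + 1, c - 1] == 0:
            r, c = r + 1, c - 1                      # slide down-left
        else:
            r += 1                                   # blocked: cut through ink
        path.append((r, c))
    return path
```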
Lu [1995] provided insight into character segmentation. Though the information in this paper relates to machine-printed characters, it gives a basis for understanding segmentation. According to the paper, segmentation can be divided into three approaches. The first is the classical approach, in which segments are identified based on character-like properties; this process of cutting up the image into meaningful components is called dissection. The second is recognition-based segmentation, in which the system searches the image for components that match classes in an alphabet. The third is holistic methods, in which the system seeks to recognize words as a whole, thus avoiding the need to segment into characters.
Casey and Lecolinet [1996] aimed at providing an appreciation of the range of character segmentation techniques that have been developed. The techniques are listed under four headings. The classical approach consists of methods that partition the input image into sub-images, which are then classified. The second class of methods segments the image either explicitly, by classification of pre-specified windows, or implicitly, by classification of subsets of spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules but using classification to select from the range of admissible segmentation possibilities offered by these sub-images. Finally, the holistic approach avoids segmentation by recognizing entire character strings as units.
Lee [1996] proposed a new scheme for offline recognition of totally unconstrained handwritten numerals using a simple multilayer cluster neural network trained with the back-propagation algorithm. The method highlighted that the use of genetic algorithms avoids the problem of local minima encountered when training the multilayer cluster neural network with the gradient descent technique, and hence the recognition rates are improved. In the proposed scheme, Kirsch masks are adopted for extracting feature vectors, and a three-layer cluster neural network with five independent subnetworks is developed for classifying similar numerals efficiently. To verify the performance of the proposed multilayer cluster neural network, experiments were carried out on a handwritten numeral database and good recognition rates were obtained.
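Kirsch masks are directional 3x3 edge operators whose responses yield directional feature maps. As a rough illustration (not Lee's exact feature extraction, which also involves zoning and normalization), four of the eight masks and a plain numpy valid-region correlation might look as follows:

```python
import numpy as np

# Four of the eight 3x3 Kirsch masks: horizontal, vertical, two diagonals.
KIRSCH = {
    "horizontal": np.array([[ 5,  5,  5], [-3, 0, -3], [-3, -3, -3]]),
    "vertical":   np.array([[ 5, -3, -3], [ 5, 0, -3], [ 5, -3, -3]]),
    "diag_right": np.array([[-3,  5,  5], [-3, 0,  5], [-3, -3, -3]]),
    "diag_left":  np.array([[ 5,  5, -3], [ 5, 0, -3], [-3, -3, -3]]),
}

def kirsch_response(img, mask):
    """Correlate img with one 3x3 mask over the valid region, pure numpy."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dr in range(3):
        for dc in range(3):
            out += mask[dr, dc] * img[dr:dr + h - 2, dc:dc + w - 2]
    return out

# A feature vector can then be formed, e.g., from the mean response of each
# directional map over a coarse grid of zones of the character image.
```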
Lu and Shridhar [1996] presented an overview of the most important techniques used in segmenting characters from handwritten words. It is well recognized that it is difficult to segment individual characters from handwritten words without support from recognition and context analysis. One common characteristic of all the existing handwritten word recognition algorithms is that the character segmentation process is closely coupled with the recognition process. The review consists of three major portions: hand-printed word segmentation, handwritten numeral segmentation and cursive word segmentation. Every algorithm discussed in the paper is accompanied by a flow chart to give a clear grasp of the algorithm. One section summarizes the terms and measurements commonly used in handwritten character segmentation.
Messelodi and Modena [1996] presented an algorithm for text segmentation and
recognition mainly suited for complex problems where many merged characters are
present. The basic idea is to define a distance, between lines of text and strings, which
helps to postpone the final decision about text segmentation and character classification
until the contextual analysis is performed. The distance takes into account both the
hypotheses about segmentation generated by a text segmentation module and the
hypotheses about character classification produced by a probabilistic classifier. The
algorithm has been tested by reading text on book covers. The experimental results highlight the quality of the proposed solution.
Trier et al. [1996] presented an overview of feature extraction methods for offline recognition of segmented (isolated) characters. The selection of a feature extraction method is probably the single most important factor in achieving high recognition performance in character recognition systems. The feature extraction methods discussed in the paper are categorized with reference to invariance properties, reconstructability, and the expected distortions and variability of the characters. The paper also addressed the problem of choosing the appropriate feature extraction method for a given application, since different feature extraction methods are designed for different representations of the characters.
Yu and Jain [1996] proposed a robust and fast skew detection algorithm based on a hierarchical Hough transform. It is capable of detecting the skew angle of various document images, including technical articles, postal labels, handwritten text, forms, drawings and bar codes. The algorithm is robust even when black margins introduced by photocopying are present in the image and when the document is scanned at a resolution as low as 50 dpi. The algorithm has two steps. In the first step, the centroids of connected components are quickly extracted using a graph data structure. Then, in the second step, a hierarchical Hough transform (at two different angular resolutions) is applied to the selected centroids. The skew angle corresponds to the location of the highest peak in the Hough space. The performance of the algorithm is demonstrated on a number of document images collected from various application domains. The algorithm is not very sensitive to its parameters.
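A minimal sketch of the two-resolution idea is given below: each candidate angle projects the centroids onto its normal, and the angle whose strongest projection bin collects the most centroids wins; a coarse pass is refined by a fine pass around the coarse winner. The angular ranges, bin width and two fixed resolutions are assumptions for illustration.

```python
import numpy as np

def skew_from_centroids(centroids, rho_res=2.0):
    """Hedged hierarchical-Hough sketch. centroids: (n, 2) array of (x, y)
    connected-component centroids. Returns the skew angle in degrees."""
    pts = np.asarray(centroids, dtype=float)

    def best_angle(angles):
        best, best_votes = 0.0, -1
        for a in angles:
            t = np.deg2rad(a)
            # points on a text line of skew a share rho = y*cos(a) - x*sin(a)
            rho = pts[:, 1] * np.cos(t) - pts[:, 0] * np.sin(t)
            bins = max(1, int((rho.max() - rho.min()) / rho_res) + 1)
            votes = np.histogram(rho, bins=bins)[0].max()
            if votes > best_votes:
                best, best_votes = a, votes
        return best

    coarse = best_angle(np.arange(-15.0, 15.1, 1.0))                # coarse pass
    return best_angle(np.arange(coarse - 1.0, coarse + 1.05, 0.1))  # fine pass
```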
Chaudhuri and Pal [1997 a] proposed an OCR system that can read two Indian language scripts, Bangla and Devnagari (Hindi), which are the most popular ones in the Indian subcontinent. These scripts, having the same origin in the ancient Brahmi script, have many features in common, and hence a single system can be modeled to recognize them. The proposed system performs document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and character grouping into basic, modifier and compound character categories. These are done for both scripts by the same set of algorithms. The feature sets, classification tree and knowledge base (required for error correction, such as the lexicon) differ for Bangla and Devnagari. The system shows good performance for single-font scripts printed on clear documents.
Chaudhuri and Pal [1997 b] considered skew angle detection of scanned documents in Devnagari and Bangla. Most characters in these scripts have horizontal lines at the top, called head lines. The character head lines mostly join one another in a word, and the word then appears as a single component. In the proposed method the components are labeled. The upper envelope of a component is found by column-wise scanning from an imaginary line above the component. Portions of the upper envelope satisfying the properties of a digital straight line are detected, and they are clustered as belonging to individual text lines. Estimates from individual clusters are combined to get the skew angle. An advantage of the method is that character segmentation and zone detection can be readily done from the head line information, which is useful in Optical Character Recognition approaches for these scripts.
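The column-wise envelope scan is straightforward; a minimal sketch is shown below (the digital-straight-line test and the clustering of estimates described in the paper are omitted, and the least-squares fit is an illustrative stand-in):

```python
import numpy as np

def upper_envelope(component):
    """For each column that contains ink, record the first ink row seen when
    scanning from above. component: 2-D array, 1 = ink."""
    ink = component > 0
    cols = np.where(ink.any(axis=0))[0]
    rows = ink.argmax(axis=0)[cols]       # argmax gives the first True row
    return cols, rows

def envelope_skew_degrees(component):
    """One per-component skew estimate: slope of a straight line fitted to
    the upper envelope points."""
    cols, rows = upper_envelope(component)
    slope = np.polyfit(cols, rows, 1)[0]  # least-squares line fit
    return np.degrees(np.arctan(slope))
```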
Chung and Yoon [1997] presented a performance comparison of several feature selection methods based on neural network node pruning. It is assumed that features are extracted and presented as the inputs of a three-layer perceptron classifier. Under this assumption, the authors applied five feature selection methods before, during or after neural network training, in order to prune only the input nodes of the network. Four of them are node pruning methods, namely the node saliency method, the node sensitivity method, and two interactive pruning methods using different contribution measures. The last one is a statistical method based on principal component analysis (PCA). The first two prune input nodes during training, whereas the last three do so before or after network training. Using gradient features and up-down, left-right hole-concavity features, several experiments on handwritten English alphabet and digit recognition were performed with and without pruning, using the five feature selection algorithms respectively. The experimental results show that the node saliency method outperforms the others.
Peake and Tan [1997] presented a detailed review of current script and language identification techniques, and proposed a method based on texture analysis for script identification which does not require character segmentation, whereas the existing schemes rely on either connected component analysis or character segmentation. A uniform text block, on which texture examination can be performed, is produced from a document image by simple processing. Multiple-channel (Gabor) filters and grey-level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified on the basis of the features of training documents using a k-NN classifier. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
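Classification in this scheme reduces to nearest-neighbour matching of texture feature vectors; a minimal k-NN sketch (with Euclidean distance and majority voting as assumed choices) is given below:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query, k=3):
    """Classify one text block from its texture feature vector (e.g. Gabor
    channel energies or grey-level co-occurrence statistics).
    train_feats: (n, d) array; train_labels: (n,) array; query: (d,)."""
    dists = np.linalg.norm(train_feats - query, axis=1)   # Euclidean distance
    nearest = np.asarray(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]                        # majority vote
```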
Alpaydin [1998] suggested that learners based on different paradigms can be combined for improved accuracy. Each learning method assumes a certain model that comes with a set of assumptions, which may lead to error if the assumptions do not hold. Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances. It is stated that classifiers based on these paradigms generalize differently, fail on different patterns and, to a certain extent, complement each other, so ways of combining them for higher accuracy are explored. One way to get complementary classifiers is to use different input representations. The methods investigated are voting, mixtures of experts, stacking and cascading. The proposed methods are evaluated on real-world applications such as optical handwritten digit recognition and pen-based handwritten digit recognition, and the paper reports satisfactory results.
Chaudhuri and Pal [1998] presented a complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world. This is the first OCR system among all the script forms used in the Indian subcontinent. The captured image is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation, using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about seventy-five in number and occupy about ninety-six percent of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template matching approach. The feature detection is simple and robust, and preprocessing such as thinning and pruning is avoided.
Madhvanath and Govindaraju [1998] proposed a methodology of coarse holistic features and heuristic prediction of ideal features from ASCII to address certain issues. One of the issues is perceptual holistic features: visually obvious features of the word shape that have been cited in reading studies as being utilized in fluent reading. While these features have been used for word recognition when the lexicon of possible words is small and static, their application to the general problem of omni-scriptor handwritten word recognition is limited by their variability at the word level and the paucity of samples for word-level training. Real-world examples of handwritten words are instances of the ideal paradigm of the word class, distorted by the scriptor, stylus, medium and intervening electronic imaging processes. This provides the basis for the proposed methodology. The proposed scheme has applications in verification and lexicon reduction for handwritten word recognition.
Reddy and Nagabhushan [1998] described a three-dimensional (3-D) neural network recognition system for conflict resolution in the recognition of unconstrained handwritten numerals. The neural network classifier is a combination of a modified self-organizing map (MSOM) and learning vector quantization (LVQ). The 3-D neural network recognition system has many layers of such neural network classifiers, and the number of layers forms the third dimension. The proposed scheme was evaluated by employing SOM, MSOM, SOM with LVQ, and MSOM with LVQ networks. These experiments on a database of unconstrained handwritten samples show that the combination of MSOM and LVQ performs better than the other networks in terms of classification, recognition and training time. The 3-D neural network eliminates the substitution error.
Tang et al. [1998] presented an offline recognition system based on multiple features and multilevel classification for handwritten Chinese characters. Ten classes of features, such as peripheral shape features, stroke density features, and stroke direction features, are used in the proposed system. The multilevel classification scheme consists of a group classifier and a five-level character classifier, for which two new techniques, overlap clustering and a Gaussian distribution selector, are developed. Experiments were conducted on a number of commonly used Chinese characters, and a high recognition rate is claimed in the paper.
Jung et al. [1999] proposed a segmentation method for machine-printed character strings of arbitrary length. It exploits recognition-based segmentation, combined with heuristic and holistic methods. The merged part of touching characters generates patterns whose shapes differ from the primitive character patterns; however, the far-left and far-right patterns of the touching characters are not affected by the touching. The algorithm first constructs a line adjacency graph (LAG) from a word image; blobs are found as connected components of the LAG, and small dot noise is removed. Secondly, since a word in English can be divided into three typographical zones (the ascender, x-height and descender zones), the locations of the connected components among those zones are also examined. Thirdly, the right profile of the touching characters is compared with those of the sample characters in the prototype, and the touching characters are then segmented with the width of one of the candidates in the prototype. Finally, the upward, downward and left profiles of the segmented pattern are compared with those of the candidate, respectively. The third and final steps are repeated until successful matchings of the resulting character patterns confirm the segmentation. The method has been tested on touching characters in the 'Times' and 'Helvetica' fonts, which are proportional-pitch fonts, and was found to be promising.
Lee and Kim [1999] proposed an integrated segmentation and recognition method using a cascade neural network. In the proposed method, a new type of cascade neural network is developed to learn the spatial dependencies in connected handwritten numerals. This cascade neural network is an extension of the multilayer feed-forward neural network, and the extension improves its discrimination and generalization power. The performance of the proposed method is verified through recognition experiments. As the experimental results make clear, the proposed method has higher discrimination and generalization power than previous integrated segmentation and recognition (ISR) methods, and its network size is smaller than that of the previous ISR methods.
Lehal and Singh [1999] described a feature extraction and hybrid classification scheme for machine recognition of Gurmukhi characters, using a binary decision tree and a nearest-neighbour classifier. The classification process is completed in three stages. In the first stage, the characters are grouped into sets depending on their zonal positions. In the second stage, the characters in the middle-zone set are further distributed into smaller subsets by a binary decision tree using a set of robust and font-independent features. In the third stage, the nearest-neighbour classifier is applied using special features that distinguish the characters. The significant point of this scheme is that a character image is tested against only certain subsets of classes at each stage, which enhances computational efficiency.
Oh et al. [1999] proposed a new approach to combining multiple features in handwriting recognition based on two ideas: feature-selection-based combination and class-dependent features. A non-parametric method is used for feature evaluation. The first part of the paper is devoted to the evaluation of features in terms of their class separation and recognition capabilities. In the second part, multiple feature vectors are combined to produce a new feature vector. Based on the fact that a feature has different discriminating powers for different classes, a new scheme of selecting and combining class-dependent features is proposed. In this scheme, a class is considered to have its own optimal feature vector for discriminating itself from the other classes. Using an architecture of modular neural networks as the classifier, a series of experiments was conducted on unconstrained handwritten numerals. The results indicate that the selected features are effective in separating pattern classes, and the new feature vector, derived from a combination of two types of features, further improves the recognition rate.
Arica and Yarman [2000] introduced a set of one-dimensional features to represent two-dimensional shape information for the HMM (hidden Markov model) based handwritten optical character recognition problem. The proposed feature set embeds two-dimensional information into a one-dimensional observation sequence, selected from a code book. It provides a consistent normalization among distinct classes of shapes, which is very convenient for HMM-based shape recognition schemes. The normalization parameters, which maximize the recognition rate, are dynamically estimated in the training stage of the HMM. The proposed character recognition system was tested on handwritten data from the NIST database and a local database, and the experimental results indicate very high recognition rates.
Chen and Wang [2000] proposed a new approach to segmenting single or multiple touching handwritten numeral strings (two digits). Most of the available algorithms for the segmentation of connected digits focus mainly on the analysis of foreground pixels; some concentrate on the analysis of background pixels only, and others depend on a recognizer-based concept. In this paper, however, a combination of background and foreground analysis is used to segment single or multiple touching handwritten numeral strings. Both the foreground and background regions of the connected numeral string image are first thinned, and the feature points on the foreground and background skeletons are extracted. Several possible segmentation paths are then constructed, with useless strokes removed in the process. Finally, the parameters of the geometric properties of each possible segmentation path are determined and analyzed with a mixture Gaussian probability function to decide on the best segmentation path or to reject all of them. Experimental results show that the proposed algorithm achieves a good accuracy rate.
Kim et al. [2000 a] presented a methodology which combines an HMM (hidden Markov model) and an MLP (multilayer perceptron) for cursive word recognition. An explicit segmentation scheme based on the HMM is designed and combined with an implicit-segmentation-based MLP using weighting coefficients. The main idea behind the proposed methodology is that distinct classifiers can better complement each other. A new probability measure for the hybrid classifier, as well as conventional combining schemes, is also introduced. The results reported in the paper showed good segmentation.
Kim et al. [2000 b] described a scheme for recognizing unconstrained handwritten numeral strings by a composite segmentation method. Two concepts, recognition-free and recognition-based segmentation, are combined. A digit group detector has been designed to separate touching digits from isolated digits by the recognition-free segmentation method. Subsequently, touching digits are segmented by prioritizing segmentation points, which are obtained by analyzing the ligature and touching types. Four special kinds of candidate segmentation points and six touching types are defined to obtain more stable segmentation points. As per the claim made in the paper, the proposed algorithm achieved a good success rate.
Lehal and Singh [2000] presented a system for recognition of machine-printed Gurmukhi script. Character recognition in Gurmukhi script faces major problems, mainly related to the unique characteristics of the script, such as the connectivity of characters on the head line, a large number of similar characters, and two or more characters in a word having intersecting minimum bounding rectangles. A set of very simple and easy-to-compute features is used, and a hybrid classification scheme consisting of a binary decision tree and nearest neighbours is employed.
Nicchiotti and Scagliola [2000] proposed a simple procedure for the over-segmentation of cursive words, based on the analysis of handwriting profiles and on the extraction of "white holes". Straight segmentation tries to decompose the image into a set of sub-images, each corresponding to a character. In segmentation-recognition strategies, the image is subdivided into a set of sub-images (strokes) whose combinations are used to generate character candidates. The number of sub-images is greater than the number of characters, and the process is therefore also referred to as over-segmentation. Recognition is then used to select the correct character hypothesis from the character candidates. The procedure follows the policy of using simple rules on complex data and sophisticated rules on simpler data. Experimental results show robustness and performance comparable with the best reported in the literature.
Plamondon and Srihari [2000] described how handwriting has continued to persist as a means of communication and of recording information in day-to-day life, even with the introduction of new technologies. Since handwriting has significance in human transactions, machine recognition of handwriting has practical significance, as in reading handwritten notes, postal addresses on envelopes, amounts on bank cheques, handwritten fields in forms, etc. The overview describes the nature of handwritten language and how it is transduced into electronic data, and gives insight into the concepts behind written language recognition algorithms. Both the online case (which pertains to the availability of trajectory data during writing) and the offline case (which pertains to scanned images) are considered. Algorithms for preprocessing, character and word recognition, and performance with practical systems are indicated. Other fields of application, such as signature verification, writer authentication and handwriting learning tools, are also considered in the paper.
Alimoglu and Alpaydin [2001] investigated techniques to combine multiple representations of a handwritten digit to increase classification accuracy without significantly increasing system complexity or recognition time. In pen-based recognition, the input is the dynamic movement of the pen tip over a pressure-sensitive tablet; there is also the image formed as a result of this movement. On a real-world database containing more than eleven thousand handwritten digits, the authors noticed that the two multi-layer perceptron (MLP) based classifiers using these representations make errors on different patterns, implying that a suitable combination of the two would lead to higher accuracy. They therefore implemented and compared voting, mixtures of experts, stacking and cascading. By combining the two MLP classifiers, higher accuracy is achieved, because the two classifiers/representations fail on different patterns. The multistage cascading scheme is especially advocated, where the second, costlier image-based classifier is employed only in a small percentage of cases.
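The contrast between voting and cascading can be sketched in a few lines; the class-posterior inputs, equal weights and confidence threshold below are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np

def vote(p_dyn, p_img, w=(0.5, 0.5)):
    """Weighted voting: combine the class posteriors of the two
    representation-specific classifiers and pick the best class."""
    return int(np.argmax(w[0] * np.asarray(p_dyn) + w[1] * np.asarray(p_img)))

def cascade(p_dyn, image_classifier, x_img, threshold=0.9):
    """Cascading: accept the cheap dynamic-representation classifier when it
    is confident; otherwise fall back to the costlier image-based one."""
    p = np.asarray(p_dyn)
    if p.max() >= threshold:
        return int(p.argmax())
    return image_classifier(x_img)   # invoked only on the hard cases
```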
Arica and Yarman [2001] served as an update for readers working in the character recognition area. First, an overview of character recognition systems and their evolution over time is presented. Then, the available character recognition (CR) techniques are reviewed along with their strengths and weaknesses. Finally, the current status of CR is discussed and directions for future research are suggested. Special attention is given to offline handwriting recognition, since this area requires more research to reach the ultimate goal of machine simulation of human reading.
Madhvanath and Govindaraju [2001] presented a survey taking a fresh look at the potential role of the holistic paradigm in handwritten word recognition. In the holistic paradigm, a word is treated as a single, indivisible entity, and recognition is attempted from its overall shape, as opposed to its character contents. The survey gives an overview of studies of the reading process which provide evidence for the existence of a parallel holistic reading process in both developing and skilled readers. Handwriting recognition approaches are characterized as forming a continuous spectrum based on the visual complexity of the unit of recognition employed, and an attempt is made to interpret well-known paradigms of word recognition in this framework. An overview of the features, methodologies, representations, and matching techniques employed by holistic approaches is also presented in the paper.
Srihari et al. [2001] undertook a study to objectively validate the hypothesis that handwriting is individualistic. Handwriting samples of one thousand five hundred individuals, representative of the US population with respect to gender, age, ethnic group, etc., were obtained. Differences in handwriting were analyzed using computer algorithms for extracting features from scanned images of handwriting, and attributes characteristic of the handwriting were obtained. The attributes chosen were line separation, slant, character shapes, etc. These attributes, which are a subset of the attributes used by expert document examiners, were used to quantitatively establish individuality by means of machine learning approaches. Using global attributes of handwriting and very few characters in the writing, the ability to determine the writer with a high degree of confidence was established. The work is a step towards providing scientific support for admitting handwriting evidence in court. The mathematical approach and the resulting software also hold promise for aiding the expert document examiner.
Acharyya and Kundu [2002] presented an efficient and computationally fast method for segmenting the text and graphics parts of document images based on textural cues, assuming that the graphics part has different textural properties from the non-graphics (text) part. The segmentation method uses the notions of multiscale wavelet analysis and statistical pattern recognition. The authors used M-band wavelets, which decompose an image into M×M bandpass channels. Various combinations of these channels represent the image at different scales and orientations in the frequency plane. The objective is to transform the edges between textures into detectable discontinuities and to create feature maps which give a measure of the local energy around each pixel at different scales. From these feature maps a scale-space signature is derived: the vector of features taken at each single pixel of the image across the different scales. It is claimed in the paper that segmentation is achieved by simple analysis of the scale-space signature with traditional k-means clustering. The proposed segmentation scheme assumes no prior information regarding the font size, scanning resolution, type of layout, etc. of the document.
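As a rough illustration of such a feature map (with the wavelet filtering assumed already done, and the mean squared response over a square window standing in for the paper's exact local energy measure), one channel's map can be computed with an integral image:

```python
import numpy as np

def local_energy(channel, win=8):
    """Local energy map of one bandpass channel: mean squared response over
    each win x win window, via an integral image. channel: 2-D array."""
    sq = np.square(channel.astype(float))
    ii = np.pad(sq, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = sq.shape
    r = np.arange(h - win + 1)[:, None]
    c = np.arange(w - win + 1)[None, :]
    # window sum from four integral-image lookups
    window_sum = (ii[r + win, c + win] - ii[r + win, c]
                  - ii[r, c + win] + ii[r, c])
    return window_sum / (win * win)
```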
Arica and Yarman [2002] proposed a new analytic scheme, which uses a sequence of image segmentation and recognition algorithms, for the offline cursive handwriting recognition problem. First, some global parameters, such as slant angle, baselines, stroke
recognition problem. First, some global parameters, such as slant angle, baselines, stroke
width and height, are estimated. Second, a segmentation method finds character
segmentation paths by combining gray scale and binary information. Third, a hidden
Markov model (HMM) is employed for shape recognition to label and rank the character
candidates. For this purpose, a string of codes is extracted from each segment to represent
the character candidates. The estimation of feature space parameters is embedded in the
HMM training stage together with the estimation of the HMM model parameters. Finally,
information from a lexicon and from the HMM ranks is combined in a graph optimization
problem for word level recognition. This method corrects most of the errors produced by
the segmentation and HMM ranking stages by maximizing an information measure in an
efficient graph search algorithm. The experiments indicate higher recognition rates
compared to the available methods reported in the literature.
Ashwin and Sastry [2002] described an OCR system for printed text documents in Kannada, a South Indian language. A scanned image of a page written in Kannada is given as input to the system, and a machine-editable file, compatible with most typesetting software, is produced as output. The proposed system extracts words from the document image, and the segmented words are further divided into sub-character-level pieces. The structure of the script is used in the proposed scheme for segmentation. A novel set of features for the recognition problem, computationally simple to extract, is proposed. The final recognition is achieved by employing a number of two-class classifiers based on the Support Vector Machine (SVM) method. The recognition is independent of the font and size of the printed text.
Garain and Chaudhuri [2002] observed that one of the important reasons for the poor recognition rates of optical character recognition (OCR) systems is error in character segmentation, and that the existence of touching characters in scanned documents is a major obstacle to designing an effective character segmentation procedure. In this paper, a new technique based on fuzzy multifactorial analysis is presented for the identification and segmentation of touching characters. A predictive algorithm is developed for effectively selecting possible cut columns for segmenting the touching characters. The proposed method has been applied to printed documents in Devnagari and Bangla, as the authors considered these two scripts the most popular of the Indian subcontinent. The results obtained from a test set of considerable size show that a reasonable improvement in recognition rate can be achieved with a modest increase in computation.
Kapoor et al. [2002] proposed an accurate and exhaustive approach to detecting the skew angle of images of words and characters of cursive Devanagari script. The approach was applied to 235 writing samples and a total collection of around 6000 samples. It is efficient in terms of time and is simpler than existing processes. The method is an extension of the work carried out by Pal and Chaudhuri. A heuristic approach has been applied to detect the skew angle, and the inherent dominating features of the structure of the Devanagari script have been used to accurately calculate the skew of a Devanagari word.
Pal et al. [2002] dealt with a new scheme for the automatic segmentation of unconstrained handwritten connected numerals. The approach is mainly based on the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch, and is obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. At first, considering reservoir location and size, the touching positions are decided. Next, by analyzing the reservoir boundary, the touching position and the topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features, the cutting path for segmentation is generated.
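The top reservoir can be visualized as the classic water-trapping computation on the upper profile of the component; the sketch below is an illustrative simplification of the paper's reservoir extraction (columns with no ink are treated as letting the water fall through):

```python
import numpy as np

def top_reservoir_depths(img):
    """Pour water from above onto a binary component and measure, per column,
    how deep it collects; contiguous positive runs are top reservoirs, whose
    location and size hint at the touching position. img: 2-D array, 1 = ink."""
    ink = img > 0
    rows = img.shape[0]
    top = np.where(ink.any(axis=0), ink.argmax(axis=0), rows)  # first ink row
    height = rows - top                       # upper profile, measured upward
    left_max = np.maximum.accumulate(height)
    right_max = np.maximum.accumulate(height[::-1])[::-1]
    water = np.minimum(left_max, right_max) - height
    water[height == 0] = 0                    # empty column: water falls through
    return water
```

Pouring from the bottom is the same computation applied to the vertically flipped image.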
Pal and Datta [2003] proposed a robust scheme to segment unconstrained handwritten Bangla text into lines, words and characters. For line segmentation, the text is first divided into vertical stripes; the stripe width is computed by statistical analysis of the text height in the document. The horizontal histograms of these stripes and the relationship of the minimal values of the histograms are used to segment text lines. Lines are then segmented into words based on the vertical projection profile. For the segmentation of characters, the water reservoir principle is used: first, isolated and touching characters in a word are identified, and then the touching characters of the word are segmented based on the reservoir base-area points and the structural features of the component.
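The histogram-based line cut can be sketched per stripe as follows (using zero-ink rows as the minima that separate lines, which is a simplification of the paper's minima analysis):

```python
import numpy as np

def segment_lines(stripe, min_height=2):
    """Cut one vertical stripe into text lines at blank rows of its horizontal
    projection. stripe: 2-D array, 1 = ink. Returns (start_row, end_row) pairs;
    runs shorter than min_height are discarded as specks."""
    profile = (stripe > 0).sum(axis=1)     # ink pixels in each row
    lines, start = [], None
    for r, ink_count in enumerate(profile):
        if ink_count > 0 and start is None:
            start = r                      # a text line begins
        elif ink_count == 0 and start is not None:
            if r - start >= min_height:
                lines.append((start, r - 1))
            start = None                   # blank row ends the line
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines
```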
Devessar et al. [2003] suggested a new approach to segmenting machine-printed Gurmukhi text. To resolve the issue of touching characters, a two-pass mechanism is used: in pass one the segmentation point is approximated, while in pass two the cutting point is optimized. The approach has been very successful in segmenting pairs as well as triplets of touching characters, and can easily be extended to other Indian scripts, such as Devnagari and Bangla, which have horizontal lines at the top called head lines.
Pal and Sarkar [2003] worked on an Optical Character Recognition system for printed Urdu. The document image is captured using a flatbed scanner and passed through skew correction, line segmentation and character segmentation modules, which are developed by combining conventional and newly proposed techniques. Next, individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust, and the approach achieves good character-level accuracy on average.
Pal et al. [2003 a] dealt with a new technique for the automatic segmentation of unconstrained handwritten connected numerals. To take care of the variability in the writing styles of different individuals, a robust scheme is presented in the paper. The scheme is mainly based on features obtained from the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch, and is obtained by considering the accumulation of water poured from the top or from the bottom of the numerals. At first, considering reservoir location and size, the touching position (top, middle or bottom) is decided. Next, by analyzing the reservoir boundary, the touching position and the topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features, the cutting path for segmentation is generated.
Pal et al. [2003 b] stated that a document page may contain two or more different scripts. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate the different scripts before feeding them to their individual OCR systems. In the paper, an automatic scheme is presented to identify text lines of different Indian scripts in a document. For the separation task, the scripts are first grouped into a few classes according to script characteristics. In the next step, features based on the water reservoir principle, contour tracing, profiles, etc. are employed to identify them without using any expensive OCR-like algorithms.
Zhang et al. [2003] analyzed handwritten characters (allographs), which play an important role in forensic document examination; however, so far there has been a lack of comprehensive and quantitative study of the individuality of handwritten characters. Based on a large number of handwritten characters extracted from the handwriting samples of one thousand individuals in the US, the individuality of handwritten characters was quantitatively measured through identification and verification models. The study shows that, in general, alphabetic characters bear more individuality than numerals, and that the use of a certain number of characters will significantly outperform the global features of handwriting samples in handwriting identification and verification. Moreover, the quantitative measurement of the discriminative powers of characters offers general guidance for selecting the most informative characters in examining forensic documents.
Grau et al. [2004] presented a new image segmentation system. The system is based on the calculation of a tree representation of the original image, in which image regions are assigned to tree nodes, followed by a correspondence process with a model tree, which embeds prior knowledge about the images. An algorithm is proposed in the paper which minimizes an error function quantifying the difference between the input image tree and the model tree. Another algorithm is also proposed for automatically calculating the model tree from a set of manually segmented images. Results on synthetic and MR brain images are presented in the paper.
Pal and Roy [2004] stated that there are printed artistic documents in which the text lines of a single page may not be parallel to each other; these text lines may have different orientations, or may be curved in shape. For optical character recognition (OCR) of such documents, these lines need to be extracted properly. A novel scheme, mainly based on the concept of the water reservoir analogy, is proposed to extract individual text lines from printed Indian documents containing multioriented and/or curved text lines. In the proposed scheme, connected components are initially labeled and identified as either isolated or touching. Next, each touching component is classified as either straight type (S-type) or curve type (C-type), depending on the reservoir base area and envelope points of the component. Based on the type (S-type or C-type) of a component, two candidate points are computed from each touching component. Finally, candidate regions (neighborhoods of the candidate points) of each component are detected, and after analyzing these candidate regions, components are grouped to get individual text lines.
Tripathy and Pal [2004] proposed a scheme based on the water reservoir concept for the segmentation of unconstrained Oriya handwritten text into individual characters. First, the text image is segmented into lines, then the lines are segmented into individual words, and the words are segmented into individual characters. For line segmentation, the document is divided into vertical stripes; by analyzing the heights of the water reservoirs obtained from the different components of the document, the width of a stripe is calculated. Stripe-wise horizontal histograms are then computed, and the relationship of the peak-valley points of the histograms is used for line segmentation. Based on the vertical projection profile and the structural features of Oriya characters, text lines are segmented into words. For character segmentation, the isolated and connected characters in a word are first detected; using structural, topological and water reservoir concept based features, the touching characters of the word are then segmented.
Zheng et al. [2004] addressed the problem of identifying text in noisy document images. The paper focuses on segmenting and discriminating between handwriting and machine-printed text, because handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content, and moreover the segmentation and recognition techniques required for machine-printed and handwritten text are significantly different. The proposed scheme treats noise as a separate class and models it based on selected features. Trained Fisher classifiers are used to distinguish machine-printed text and handwriting from noise, and context is further exploited to refine the classification. A Markov Random Field based approach is used to model the geometrical structure of the printed text, handwriting and noise, in order to rectify misclassifications. As the results in the paper make clear, the scheme can significantly improve page segmentation in noisy document collections.
Jindal et al. [2005] identified the different kinds of degradation found in printed Gurmukhi script, namely touching characters, broken characters, heavily printed characters, faxed documents and typewritten documents. The problems associated with each kind of degradation have been discussed, and some possible solutions have also been presented.
Pal and Tripathy [2005] proposed a scheme for the recognition of Indian stylistic documents. Using features based on the water reservoir concept, the characters are segmented from the stylistic documents without any skew correction. Next, individual characters are recognized. For recognition, the contour distances of the outer contour points of a character are calculated from its centroid. These contour distances are then arranged in a particular order to obtain size- and rotation-invariant features. Finally, by computing statistical features on these arranged contour distances, the input character is recognized.
Jindal et al. [2006] stated that multiple horizontally overlapping lines are normally found in the printed newspapers of almost every language, due to the high-compression methods used for printing newspapers. For any optical character recognition (OCR) system, the presence of horizontally overlapping lines decreases the recognition accuracy drastically. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines. The whole document is divided into strips, and the proposed algorithm is applied for segmenting horizontally overlapping lines and associating small strips with their respective lines. The results reveal that the algorithm is almost ninety percent accurate when applied to the Gurmukhi script.
Li et al. [2006] dealt with curvilinear text line detection and segmentation in handwritten documents. Given no prior knowledge of script, the authors modeled text line detection as an image segmentation problem, enhancing the text line structure using a Gaussian window and adopting the level set method to evolve text line boundaries. Experiments show that the proposed method achieves high accuracy in detecting text lines in both handwritten and machine-printed documents with many scripts.
Jindal et al. [2007] stated that horizontally overlapping lines are normally found in printed newspapers of any Indian script. Along with these overlapping lines, broken components of a line (stripes) containing less text than a complete line are also found. The horizontally overlapping lines and other stripes make it very difficult to estimate the boundary of a line, leading to incorrect line segmentation, which in turn decreases recognition accuracy. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines and solved the problem of the other stripes in the eight most widely used printed Indian scripts. The whole document is divided into stripes, and the proposed algorithm is applied for segmenting horizontally overlapping lines and associating small stripes with their respective lines.
Sulem et al. [2007] surveyed line segmentation and described the huge number of historical documents in libraries and various national archives that have not been converted to electronic form. Although the automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and the extraction of specific fields are in use today. For all these tasks, a major step is to segment the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The authors presented a survey of existing methods developed during the last decade and dedicated to documents of historical interest.
Jindal et al. [2008] stated that the performance of an OCR system depends on the printing quality of the input document. A number of OCRs have been designed which correctly recognize finely printed documents in Indian and other scripts, but little reported work deals with the recognition of degraded documents. Consequently, when a standard OCR that works well on finely printed documents is tested on degraded documents, its performance decreases. Feature extraction is an important task in designing an OCR for recognizing degraded documents. In this paper, the authors discussed efficient structural features selected for recognizing degraded printed Gurmukhi script characters.
Li et al. [2008] proposed a novel approach based on density estimation and a state-of-the-art image segmentation technique called the level set method. From an input document image, a probability map is estimated, in which each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundaries of neighboring text lines by evolving an initial estimate. The proposed algorithm does not use any script-specific knowledge. Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that the algorithm performs consistently well.
Palacios and Gupta [2008] described the problems related to the processing of bank
cheques. In many countries, bank cheques are preprinted with the account number and the
cheque number in a special ink and format, so these two numeric fields can easily be read
and processed using automated techniques. However, the amount on a filled cheque is
usually read by human eyes, which involves significant time and cost. The system
described in this paper uses the scanned image of a bank cheque to 'read' the cheque. It
includes three main modules which, implemented together, allow for fully automated bank
cheque processing: the detection of strings within the image, the segmentation and
recognition of strings in a feedback loop, and the post processing that helps to ensure
higher recognition accuracy. The major benefit of the integrated system is its ability to
address the complex problem of reading handwritten bank cheques by implementing
efficient algorithms for each processing step. As per the paper, all modules have been
implemented and subsequently tested for reading the value of the cheque using different
image databases. Due to the particular requirements of this application, the system can be
tuned to yield low levels of incorrect readings. This leads to higher levels of rejection than
the levels encountered in other handwriting recognition applications. A 'rejected' cheque
can be read subsequently by human eyes or by other more advanced automated
approaches. However, a cheque 'read' incorrectly is more difficult to deal with, in terms of
the cost and time involved in rectifying the mistake. As such, the proposed architecture can
be geared towards producing the most suitable balance between inaccurate readings and
rejection level, in accordance with user preferences. The experimental results presented in
the paper do not focus on the best
possible results for a particular database of cheques; rather, they show the benefits attained
independently by each of the proposed modules.
Bukhari et al. [2009] stated that handwritten document images contain text lines with
multiple orientations, touching and overlapping characters within consecutive text lines,
and small inter line spacing, making text line segmentation a difficult task. In the paper,
the authors modeled text line extraction as a general image segmentation task. The central
line of each part of a text line is computed using ridges over the smoothed image. Then
state of the art active contours (snakes) are adapted over the ridges, which results in text
line segmentation.
Chaudhuri and Bera [2009] dealt with text line identification in handwritten Indian
scripts. The scripts discussed in the paper include Bangla, Hindi, Gurmukhi and
Malayalam, as well as English. A new dual method based upon the interdependency
between text lines and inter line gaps is proposed in the paper. The curves are drawn by
the proposed scheme simultaneously through the text and inter line gap points found from
strip wise histogram peaks and inter peak valleys. The curves start from the left and move
right, while points of one type guide the curve of the other type so that the curves do not
intersect. These curves are then allowed to evolve iteratively so that the text line curves
cross more character strokes while the inter line curves cross fewer character strokes, and
yet the curves remain as straight as possible. After several iterations, the curves stabilize
and define the final text lines and inter line gaps. The approach works well on text of
different scripts with various geometric layouts, including poetry.
Philip and Samuel [2009] described an Optical Character Recognition (OCR) system
for printed text documents in Malayalam, one of the South Indian languages. It is a known
fact that Indian scripts are rich in patterns, and the combinations of such patterns make the
problem even more complex. In the paper, however, these complex patterns
are exploited to obtain the solution. The proposed system decomposes the scanned
document image into text lines, words, and further into characters and sub characters. The
proposed segmentation algorithm is influenced by the structure of the script. A novel set of
features, computationally simple to extract, is proposed. The proposed approaches are
based on the distinctive structural features of machine printed text lines written in these
scripts. A lateral cross sectional analysis is performed along each row of the normalized
binary image matrix, resulting in distinct features. The final recognition is done through
classifiers based on the Support Vector Machine (SVM) method. The proposed algorithms
have been tested on a variety of printed Malayalam characters and give good results.
Yin and Liu [2009] suggested a novel text line segmentation algorithm based on
minimal spanning tree (MST) clustering with distance metric learning. Given a distance
metric, the connected components (CCs) of the document image are grouped into a tree
structure, from which text lines are extracted by dynamically cutting the edges using a new
hyper volume reduction criterion and a straightness measure. By learning the distance
metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made
robust enough to handle various documents with multi skewed and curved text lines. The
results presented in the paper suggest that the proposed method works very well.
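The learned distance metric is the central contribution of the paper; the Python sketch below substitutes a plain Euclidean distance between component centroids and a median based edge cutting rule, so it only illustrates the MST clustering skeleton. The cut_factor heuristic and the toy page are assumptions.

import numpy as np
from scipy.ndimage import label, center_of_mass
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_text_lines(binary, cut_factor=2.0):
    """Cluster connected components into text lines: build a minimum
    spanning tree over component centroids, cut edges much longer than
    the median edge, and return the resulting groups of labels."""
    labels, n = label(binary)
    if n < 2:
        return [np.arange(1, n + 1)]
    cents = np.array(center_of_mass(binary, labels, range(1, n + 1)))
    mst = minimum_spanning_tree(squareform(pdist(cents))).toarray()
    edges = mst[mst > 0]
    mst[mst > cut_factor * np.median(edges)] = 0   # cut implausibly long edges
    n_groups, group = connected_components(mst, directed=False)
    return [np.where(group == g)[0] + 1 for g in range(n_groups)]

page = np.zeros((32, 40), dtype=int)
page[2:5, 2:8] = page[2:5, 12:18] = 1        # components of line 1
page[24:27, 2:8] = page[24:27, 12:18] = 1    # components of line 2
print(mst_text_lines(page))   # [array([1, 2]), array([3, 4])]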
Das et al. [2010] addressed the segmentation of overlapping text lines and characters in
Telugu text. Segmentation is an important task of any OCR system, and the accuracy of an
OCR system depends mainly on the segmentation algorithm being used. Segmentation of
Telugu text is difficult compared with Latin based languages because of its structural
complexity and increased character set. It contains vowels, consonants and compound
characters, and some of the characters may overlap. Profile based methods can only
segment non overlapping lines and characters. The proposed algorithm
is based on projection profiles, connected components and spatial vertical relationships.
In this method, to segment the image into lines and characters, the connected components
are first extracted from the document image and labeled. For each connected component,
the top, bottom, left and right positions are identified. Then, the nearest neighborhood
method is used to cluster the connected components. As per the results shown in the
paper, good character segmentation accuracy can be achieved even with overlapping lines
and characters.
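The first two steps, extracting and labeling the connected components and recording the top, bottom, left and right of each, can be sketched in Python as follows; this is a minimal illustration with scipy, not the authors' implementation.

import numpy as np
from scipy.ndimage import label, find_objects

def component_boxes(binary):
    """Label the connected components of a binary image (1 = ink) and
    return (label, top, bottom, left, right) for each component."""
    labels, n = label(binary)
    boxes = []
    for lab, slc in enumerate(find_objects(labels), start=1):
        rows, cols = slc
        boxes.append((lab, rows.start, rows.stop - 1, cols.start, cols.stop - 1))
    return boxes

img = np.zeros((10, 10), dtype=int)
img[1:4, 1:4] = 1   # component 1
img[6:9, 5:9] = 1   # component 2
print(component_boxes(img))   # [(1, 1, 3, 1, 3), (2, 6, 8, 5, 8)]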
Kumar and Sengar [2010] described line, word, character and top character
segmentation for printed Hindi text in the Devanagari and Gurmukhi scripts. The global
horizontal projection method computes the sum of all black pixels in every row and
constructs the corresponding histogram. Based on the peak/valley points of the histogram,
individual lines and words are separated.
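The same projection idea works column-wise within a segmented line. As an illustration (the gap threshold min_gap is an assumption, and the peak/valley analysis of the paper is more elaborate), the Python sketch below splits a binary line image into words by treating wide runs of blank columns as inter word gaps.

import numpy as np

def segment_words(line_img, min_gap=3):
    """Split a binary line image (1 = ink) into words: maximal runs of
    inked columns, merging runs separated by fewer than min_gap blank
    columns, which are treated as intra-word spacing."""
    inked = line_img.sum(axis=0) > 0        # vertical projection > 0
    words, start, blanks = [], None, 0
    for col, has_ink in enumerate(inked):
        if has_ink:
            if start is None:
                start = col
            blanks = 0
        elif start is not None:
            blanks += 1
            if blanks >= min_gap:           # a real inter-word gap
                words.append((start, col - blanks))
                start, blanks = None, 0
    if start is not None:
        words.append((start, int(np.flatnonzero(inked)[-1])))
    return words

line = np.zeros((8, 40), dtype=int)
line[2:6, 2:10] = line[2:6, 12:18] = 1   # two characters, narrow gap
line[2:6, 25:35] = 1                     # next word after a wide gap
print(segment_words(line))               # [(2, 17), (25, 34)]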
Nallapareddy et. al. [2010] proposed a robust method for segmentation of individual
text lines based on the modified histogram obtained from run length based smearing. A
complete line and word segmentation system for some popular Indian printed languages is
presented in the paper. Both foreground and background information is used here for
accurate line segmentation. There may be touching or overlapping characters between two
consecutive text lines, and most line segmentation errors are generated by such touching
and overlapping character occurrences. Sometimes, small inter line spacing and noise also
make line segmentation a difficult task. The proposed method can take care of these
situations accurately. Word segmentation from individual lines is also discussed here. The
results of the proposed method on documents of Bangla, Devnagari, Kannada, Telugu
scripts as well as some multi script documents are shown in the paper.
Aradhya and Naveena [2011] proposed a novel method for text line segmentation of
unconstrained handwritten Kannada script. The proposed method consists of two phases.
In the first phase, a mathematical morphology technique is used to bridge the gaps
between character components. In the second phase, a component extension technique is
used for text line extraction.
Mahender and Kale [2011] stated that writing, which has been the most natural method
of collecting, storing and transmitting information through the centuries, now serves not
only for communication among humans, but also for communication between humans and
machines. Free style handwriting recognition is difficult not only because of the great
amount of variation in the shapes of characters, but also because of the overlapping and
interconnection of neighboring characters. The authors presented a structure based feature
extraction and rule based recognition scheme for handwritten Marathi words.
Pradeep et al. [2011] presented an off line handwritten alphabetic character recognition
system using a multilayer feed forward neural network. A new method, called diagonal
based feature extraction, is introduced for extracting the features of the handwritten
alphabets. The proposed recognition system performs quite well, yielding higher levels of
recognition accuracy compared with systems employing the conventional horizontal and
vertical methods of feature extraction. The system is suitable for converting handwritten
documents into structured text form and for recognizing handwritten names.
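As a rough illustration of the diagonal idea (a sketch of the general technique, not necessarily the exact zoning or feature counts of the paper), the Python code below splits a normalized character image into fixed size zones and, in each zone, averages the ink along every diagonal before collapsing these values into a single feature per zone. The zone size and image size are assumptions.

import numpy as np

def diagonal_features(char_img, zone=10):
    """Zone-wise diagonal features: split a normalized binary character
    image into zone x zone blocks; in each block, average the ink along
    every diagonal, then average those values into one feature."""
    h, w = char_img.shape
    assert h % zone == 0 and w % zone == 0, "image must tile into zones"
    feats = []
    for r in range(0, h, zone):
        for c in range(0, w, zone):
            block = char_img[r:r + zone, c:c + zone]
            diags = [np.diagonal(block, offset=k).mean()
                     for k in range(-zone + 1, zone)]   # 2*zone - 1 diagonals
            feats.append(np.mean(diags))
    return np.array(feats)

char = (np.random.rand(90, 60) > 0.5).astype(float)  # stand-in character
print(diagonal_features(char).shape)                 # (54,) for 9 x 6 zones

The resulting feature vector would then be fed to the feed forward network for classification.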
From the above literature, it can be summarized that although there is a rapidly
growing body of work on segmenting scanned documents in international scripts as well
as Indian languages, relatively few studies have examined how to effectively segment a
document written in Gurmukhi script, and the studies that are available for Gurmukhi
script mostly deal with machine printed text.
2.2 Need of the Study
Lu [1995] categorized text in order of increasing segmentation difficulty: well
separated and unbroken characters in proportional spacing, in which characters occupy
different amounts of horizontal space depending on their shapes; broken characters, where
a single character has more than one component; touching characters, where more than
one character lies in a single connected component; and text in which characters are both
broken and touching. In most OCR systems, character recognition is performed on
individual characters. The pre processing stage yields a ‘clean’ document in the sense that
a sufficient amount of shape information, high compression and low noise on a
normalized image is obtained.
According to Pal et al. [2003 b], in India, there are 18 official (Indian constitution
accepted) languages. Two or more of these languages may be written in one script. Twelve
different scripts are used for writing these languages. Under the three language formula,
many of the Indian documents are written in three languages namely, English, Hindi and
the state official language. For example, a money order form in the Punjab state may be
written in English, Hindi and Gurmukhi, because Gurmukhi (Punjabi) is the state official
language of Punjab. Some properties common to Indian language scripts are given below.
2.2.1 Properties of Indian Language Scripts
Assamese, Bangla, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri,
Malayalam, Marathi, Nepali, Oriya, Panjabi, Rajasthani, Sanskrit, Tamil, Telugu and Urdu
are the official languages of India. Hindi is the most popular language in India, followed
by Bangla, which is the second most popular. On the global scene, English is the most
popular language, while Hindi and Bangla are the 4th and 5th most popular languages in
the world. The scripts used for the Indian languages are
not all different. One script may be used to write several languages. For example, the
Bangla script is used to write the Assamese and Bangla (Bengali) languages, while the
Devnagari script is used to write the Hindi, Marathi, Rajasthani, Sanskrit and Nepali
languages. Constitution wise, there are twelve different scripts used to write these 18
languages. Pal et al. [2003 b] stated that these scripts are Urdu, Tamil, Telugu, Gurmukhi
(Panjabi), Devnagari, Bangla, English, Gujarati, Kannada, Kashmiri, Malayalam, and Oriya.
Examples of different script lines are shown in figure 2.1.
Figure 2.1: Different Indian script lines (from top to bottom: Devnagari, Bangla, Gurmukhi, Malayalam,
Kannada, English, Tamil, Telugu, Urdu, Kashmiri, Gujarati, Oriya)
Most Indian scripts have an alphabet system of basic characters, which are the vowel
and consonant characters. Apart from these basic characters, there are compound
characters formed by combining two or more basic characters. The shape of a compound
character is usually more complex than that of its constituent basic characters.
In some scripts (like Gurmukhi, Devnagari or Bangla), many characters of the alphabet
system have a horizontal line at the upper part. In Devnagari this line is called ‘sirorekha’,
while in Bangla it is called ‘matra’. In the present study, it is referred to as the head line.
When two or more characters are put side by side to form a word, the head line portions of
these characters touch one another and generate a long head line, which is used as a
feature for script identification. In most Indian languages, a text line may be partitioned
into three zones: the higher zone, the heart zone and the lower zone. The different zones
are shown in figure 2.2.
Figure 2.2: Different zones of English, Devnagari and Gurmukhi text line
The higher zone denotes the portion above the head line. The portion below the head
line is known as the heart zone; this zone covers the main portion of basic as well as
compound characters. The lower zone is the portion below the base line. For texts whose
script lines do not contain a head line, the mean line separates the higher zone and the
heart zone. The base line separates the heart zone and the lower zone. Pal et al. [2003 b]
opined that the mean line can be defined as an imaginary line where most of the uppermost
(lowermost) points of the characters of a text line lie. The uppermost and lowermost
boundary lines of a text line are named the upper line and the lower line.
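Following this definition, the mean line and base line of a text line can be estimated as the modal uppermost and lowermost rows of its connected components. The Python sketch below is a minimal interpretation of that definition; the toy line image is an assumed input.

import numpy as np
from scipy.ndimage import label, find_objects

def mean_and_base_lines(line_img):
    """Estimate the mean line (modal uppermost row of the components)
    and the base line (modal lowermost row) of a binary text line."""
    labels, n = label(line_img)
    tops, bottoms = [], []
    for slc in find_objects(labels):
        tops.append(slc[0].start)
        bottoms.append(slc[0].stop - 1)
    mean_line = np.bincount(tops).argmax()      # most common top row
    base_line = np.bincount(bottoms).argmax()   # most common bottom row
    return mean_line, base_line

line = np.zeros((12, 30), dtype=int)
for c in (2, 8, 14, 20):           # four 'characters' sharing rows 3 to 8
    line[3:9, c:c + 4] = 1
line[9:11, 26:29] = 1              # one descender component
print(mean_and_base_lines(line))   # (3, 8)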
2.2.2 Features of Indian Languages and Scripts
A feature is something present in a symbol or character of a script; for example, a
feature can be a side bar, a loop, and so on. A character may contain one feature, a
combination of certain features, or none at all. Kumar et al. [2003] are of the opinion that
there are certain features present in or common to Indian scripts; some of these are given
in the following sections.
2.2.2.1 Common Alphabet: The alphabets of the Indian languages have been derived
from the Sanskrit alphabet. Usually, there is a common set of alphabets containing 33
consonants and 15 vowels. In addition, there are three to four consonants and two to three
vowels which are used in specific languages or in the classical forms of others. This
difference is not very significant in practice. The basic letters of the alphabet are formed
by individual consonants and vowels. The only exception is the Tamil language, which
uses twelve fewer consonants. However, the structure of Tamil is not too different either,
as this change can be modeled as dropping some of the consonants from the master list.
2.2.2.2 Akshara or Akhar: Akshara is the notion used for the basic unit, i.e. the character,
of Indian languages; with reference to Gurmukhi it is also known as an Akhar. It forms the
fundamental linguistic unit, like a character in English. An akhar can be made up of 0, 1,
2 or 3 consonants and a vowel. The combination of one or more akhars makes a word.
As the languages are completely phonetic, each akhar can be pronounced independently.
Samyuktaksharas are combinations of akhars with more than one consonant; they are also
called combo characters. The last of the consonants is the main one in a samyuktakshara.
2.2.2.3 Diverse Graphemes: The commonality in the alphabet does not mean that the
same graphic forms are used to express it in print. Each language uses a different script
consisting of dissimilar graphemes for printing. Thus, printed matter of one language is
unapproachable to readers of another language, even when the underlying alphabet is
common. As noted above, there are twelve major scripts in India. The Devanagari script is
the most widely used one, being used to write Hindi, Marathi, Konkani, and Nepali (the
language of the neighboring nation of Nepal). Different scripts follow different
philosophies for the individual graphemes and their combinations. Some have a head line
while others have non touching graphemes. The grapheme of one of the consonants is
usually at the heart of the printed akshara. The vowel appears as a matra or vowel
modifier. These can appear above, below, or to the right or left of it, or in combinations.
The supporting consonants of a samyuktakshara also appear as modifier graphemes above,
below, or to the right or left of the main one. These modifiers may be truncated or scaled
down forms of the basic consonant, but they may also be completely different. They may
touch each other or the main consonant in some cases, or may be separated. These rules
are not consistent even within a script, and certainly not across scripts.
2.2.2.4 Formless Font Design: With the wide use of information technology over the last
few decades, different fonts have been designed for each Indian script. The fonts are built
from glyphs and follow the graphical structure of each script, which differs from language
to language. It is not possible to use a consistent set of rules for this step across all scripts,
and no conventions have been followed.
2.2.3 Gurmukhi Script
Lehal and Singh [2002] concluded that the word Gurmukhi is derived from the
combination of the two words “Guru” and “Mukh”. Gurmukhi means to record the sayings
from the mukh (mouth or lips) of the Gurus. The credit for originating this script goes to
Guru Angad Dev Ji, who not only rearranged but also modified and shaped certain letters
into a script. He gave a new shape and order to the alphabet and made it precise and
accurate. Those letters were retained which depicted sounds of the then spoken language.
There was also some rearrangement of the letters; for example, s and h were shifted to the
first line and a was given the first place in the new alphabet. It is believed that Gurmukhi
belongs to the Brahmi family. The Aryans developed a script known as Brahmi, which
was adapted as per their local needs and introduced between the 8th and 6th centuries B.C.
Gurmukhi script is primarily used for the Punjabi language, which is the world’s 14th
most widely spoken language. Gurmukhi script is a logical composition of its constituent
symbols in two dimensions. It is an alphabetic script. Lehal and Singh [1999] explained
that Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters
which lie at the feet of consonants. These vowels and consonants are shown in figures 2.3
and 2.4 respectively. Besides the consonants and the vowels, the other constituent symbols
in Gurmukhi are a set of vowel modifiers, called matras, placed to the left, right, above or
at the bottom of a character or conjunct, and pure consonant forms corresponding to some
consonants (also called half letters) which, when combined with other consonants, yield
conjuncts.
Figure 2.3: Vowels and Vowel diacritics (Laga Matra)
Figure 2.4: Consonants (Vianjans) of Gurmukhi Script
The writing style is from left to right, and the concept of upper/lower case (as in
English) is absent. Most of the characters have a horizontal line at the upper part; this line,
called the headline, usually connects the characters of a word. Lehal and Singh [2000]
suggested that a word in Gurmukhi script can also be partitioned into horizontal zones.
The upper zone denotes the region above the headline. The area below the headline, where
the major part of the character is located, is the centre zone or heart zone. These zones are
shown in figure 2.5.
a) Upper zone from line number 1 to 2 b) Heart zone from line number 3 to 4
c) Lower zone from line number 4 to 5
Figure 2.5: Three zones in Gurmukhi script
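Since the head line is usually the row with the densest ink near the top of a word, a common heuristic, sketched below in Python, locates it with a horizontal projection; everything above that row then falls in the upper zone. This is only an illustration, not the method proposed later in this thesis, and the band fraction is an assumption.

import numpy as np

def find_head_line(word_img, band=0.4):
    """Locate the head line of a binary Gurmukhi word image (1 = ink) as
    the row with the maximum ink count within the top band fraction of
    the image; rows above it form the upper zone."""
    profile = word_img.sum(axis=1)
    upper = int(len(profile) * band) or 1
    return int(np.argmax(profile[:upper]))

word = np.zeros((20, 40), dtype=int)
word[6, 2:38] = 1                         # the connecting head line
word[7:16, 5:9] = word[7:16, 20:24] = 1   # consonant bodies below it
print(find_head_line(word))               # 6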
Gurmukhi script has the following characteristics:
• Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters
which lie at the feet of consonants.
• The characters of words are mostly connected by a horizontal line called the head line.
• All Gurmukhi letters have uniform height.
• All letters in Gurmukhi can be written between two parallel horizontal lines; the only
exception is a, whose top curve extends beyond the upper line.
• From left to right, letters have almost uniform length; only A (aira) and g (ghaggha)
may be slightly longer than the rest.
• The form of a letter is not affected when a vowel symbol or diacritic is attached to it,
the only exception being a, to which an additional curve representing two syllables is
added.
• A word in Gurmukhi script can be partitioned into three horizontal zones. The upper
zone denotes the region above the head line, where vowels reside; the middle zone or
heart zone represents the area below the head line, where the consonants and some sub
parts of vowels are present; and the lower zone represents the area below the middle
zone, where some vowels and certain half characters lie at the feet of consonants.
• The half characters in the lower zone frequently touch the consonants lying in the zone
above.
• There are many multi component characters in Gurmukhi script. A multi component
character is a character that can be decomposed into isolated parts.
• The bounding boxes of 2 or more characters in a word may intersect or overlap
vertically.
Lehal and Singh [2002] asserted that the Gurmukhi script is a two dimensional
composition of consonants, vowels and half characters which requires segmentation in the
vertical as well as the horizontal direction. Thus, the segmentation of Gurmukhi text calls
for a two dimensional analysis instead of the one dimensional analysis commonly used for
the Roman script. The literature survey reveals that, for the following reasons, a unique
segmentation method is required for handwritten Gurmukhi script.
• The letters in cursive writing are often connected.
• The individual letters in a cursive word are often written so as to be unidentifiable
as isolated characters.
• There is variance in writing style.
• If the handwritten line is slanted, it is difficult to segment.
• The writing quality of a handwritten document is not uniform throughout the
document.
• The font size cannot be guessed, which is very important for character segmentation.
• Some handwritten letters, like m (in English), can also be interpreted as the pair nn,
as shown in figure 2.6 (a). Similarly, in Gurmukhi, the character g can be segmented
as rw, as shown in figure 2.6 (b).
Figure 2.6 (a): Incorrect Segmentation of a character in English
Figure 2.6 (b): Incorrect Segmentation of a character in Gurmukhi
2.3 Objectives of Study
The objectives of the proposed work are:
I. To devise an algorithm to segment a line of handwritten Gurmukhi script.
II. To develop an algorithm to segment a word of a line of handwritten Gurmukhi script.
III. To build an algorithm to segment a character of a word of handwritten Gurmukhi
script.
IV. To compare the developed algorithms with the classical approach / recognition based
approach so as to present a comparative analysis.
2.4 Research Methodology
The proposed work consists of the following components:
i). Detailed literature reviews were carried out to ascertain the significance of
segmentation in Gurmukhi script.
ii). A survey of various segmentation methodologies was carried out through various
case studies.
iii). In the initial stage, efforts were made to develop an algorithm to segment a line
of handwritten Gurmukhi script.
iv). At the next stage, an algorithm was developed to segment a word of a line of
handwritten Gurmukhi script.
v). Finally, an algorithm was developed to segment a character of a word of
handwritten Gurmukhi script.
vi). The developed work was compared with the existing approach so as to present a
comparative analysis.