
  • 8/8/2019 Zhu 2009 Pattern-Recognition


    Pattern Recognition 42 (2009) 3184 -- 3191

    Contents lists available at ScienceDirect

    Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

    Language identification for handwritten document images using a shape codebook

Guangyu Zhu, Xiaodong Yu, Yi Li, David Doermann

    Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA

ARTICLE INFO

Article history: Received 8 August 2008

    Received in revised form 24 November 2008

    Accepted 21 December 2008

    Keywords:

    Language identification

    Shape descriptor

    Shape codebook

    Handwriting recognition

    Document image analysis

ABSTRACT

Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine printed text, using image descriptors constructed from a codebook of shape features. We encode local text structures using scale and rotation invariant codewords, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape codebook from training by clustering and partitioning similar feature types through graph cuts. Our approach is easily extensible and does not require skew correction, scale normalization, or segmentation. We quantitatively evaluate our approach using a large real-world document image collection, which is composed of 1512 documents in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) and contains a complex mixture of handwritten and machine printed content. Experiments demonstrate the robustness and flexibility of our approach, and show exceptional language identification performance that exceeds the state of the art.

© 2009 Elsevier Ltd. All rights reserved.

    1. Introduction

Language identification continues to be a fundamental research problem in document image analysis and retrieval. For systems that process diverse multilingual document images, such as Google Book Search [1] or an automated global expense reimbursement application [2], the performance of language identification is crucial for a broad range of tasks, from determining the correct optical character recognition (OCR) engine for text extraction to document indexing, translation, and search. In character recognition, for instance, almost all existing work requires that the script and/or language of the processed document be known [3].

Prior research in the field of language identification has focused almost exclusively on the homogeneous content of machine printed text. Document image collections in practice, however, often contain a diverse and complex mixture of machine printed and unconstrained handwritten content, and vary tremendously in font and style. Language identification on document images containing handwriting is still an open research problem [6].

Categorization of unconstrained handwriting presents a number of fundamental challenges. First, handwriting exhibits much larger variability compared to machine printed text. Handwriting variations due to style, cultural, and personalized differences are typical [7], which significantly increase the diversity of shapes found in handwritten words. Text lines in handwriting are curvilinear and the gaps between neighboring words and lines are far from uniform. There are no well-defined baselines for handwritten text lines, even by linear or piecewise-linear approximation [4]. Second, automatic processing of off-line document image content needs to be robust in the presence of unconstrained document layouts, formatting, and image degradations.

Corresponding author at: Language and Media Processing Laboratory, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA. Tel.: +1 240 676 2573; fax: +1 301 314 9115.
E-mail addresses: [email protected] (G. Zhu), [email protected] (X. Yu), [email protected] (Y. Li), [email protected] (D. Doermann).

0031-3203/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.12.022

Our approach is based on the view that the intricate differences between languages can be effectively captured using low-level segmentation-free shape features and structurally indexed shape descriptors [25]. Low-level local shape features are well suited for this purpose because they can be detected robustly in practice, without detection or segmentation of high-level entities, such as text lines or words. As shown in Fig. 2, visual differences between handwriting samples across languages are well captured by different configurations of neighboring line segments, which provide a rich description of local text structures.

We propose a novel approach for language identification using image descriptors built from a codebook of generic shape features that are translation, scale, and rotation invariant. To construct a structural index among a large number of shape features extracted from diverse content, we dynamically partition the space of shape primitives by clustering similar feature types. We formulate feature partitioning as a graph cuts problem with the objective of obtaining a concise and globally balanced index by sampling from training data. Each cluster in the shape codebook is represented by an exemplary codeword, making association of the feature type efficient. We obtain very competitive language identification performance on real document collections using a multi-class SVM classifier.

Fig. 1. Examples from the Maryland multilingual handwriting database [4] and IAM handwriting DB3.0 database [5]. Languages in the top row are Arabic, Chinese, English, and Hindi, and those in the second row are Japanese, Korean, Russian, and Thai.

Fig. 2. Shape differences in text are captured locally by a large variety of neighboring line segments. (a)–(d) Examples of handwriting from four different languages. (e)–(h) Detected line segments after local fitting, each shown in a different color.

The structure of this paper is as follows. Section 2 reviews related work. In Section 3, we describe the algorithm for learning the codebook from diverse low-level shape features, and present a computational framework for language identification using the shape codebook. We discuss experimental results in Section 4 and conclude in Section 5.

    2. Related work

We first present a comprehensive overview of existing techniques for script and language identification and discuss their limitations on document images with unconstrained handwriting. We then highlight work related to our approach on contour-based learning.

    2.1. Language identification

Prior literature on script and language identification has largely focused on the domain of machine printed documents. These works can be broadly classified into three categories: statistical analysis of text lines [8–12], texture analysis [13,14], and template matching [15].

Statistical analysis using discriminating features extracted from text lines, including distribution of upward concavities [8,9], horizontal projection profile [10,11], and vertical cuts of connected components [12], has proven to be effective on homogeneous collections of machine printed documents. These approaches, however, do have a few major limitations for handwriting. First, they are based on the assumption of uniformity among printed text, and require precise baseline alignment and word segmentation. Freestyle handwritten text lines are curvilinear, and in general, there are no well-defined baselines, even by linear or piecewise-linear approximation [4]. Second, it is difficult to extend these methods to a new language, because they employ a combination of hand-picked and trainable features and a variety of decision rules. In fact, most of these approaches require effective script identification to discriminate between a selected subset of languages, and use different feature sets for script and language identification, respectively.

Script identification using rotation invariant texture features, including multi-channel Gabor filters [13] and wavelet log co-occurrence features [14], was demonstrated on small blocks of printed text with similar characteristics. However, no results were reported on full-page documents that involve variations in layouts and fonts. Script identification for printed words was explored in [16,17] using texture features between a small number of scripts.

The template matching approach computes the most likely script by probabilistic voting on matched templates, where each template pattern is of fixed size and is rescaled from a cluster of connected components. The script and language identification system developed by Hochberg et al. [15] at Los Alamos National Laboratory based on template matching is able to process 13 machine printed scripts without explicit assumptions of distinguishing characteristics for a selected subset of languages. Template matching is intuitive and delivers state-of-the-art performance when the content is constrained (i.e. printed text in similar fonts). However, templates are not flexible enough to generalize across large variations in fonts or handwriting styles that are typical in diverse datasets [15]. From a practical point of view, the system also has to learn the discriminability of each template through labor-intensive training to achieve the best performance. This requires a tremendous amount of supervision and further limits applicability.

There exists very little literature on language identification for unconstrained handwriting. To the best of our knowledge, the experiments of Hochberg et al. [7] on script and language identification of handwritten document images are the only reference on this topic in the literature. They used linear discriminant analysis based on five simple features of connected components, including relative centroid locations and aspect ratio. The approach is shown to be sensitive to large variations across writers and diverse document content [7]. Irregularities in the document, including machine printed text, illustrations, markings, and handwriting in a different orientation from the main body, were removed manually by image editing from their evaluation dataset.

    2.2. Contour-based learning

Learning contour features is an important aspect of many computer vision problems. Ferrari et al. [18] proposed scale-invariant adjacent segment (kAS) features extracted from the contour segment network of image tiles, and used them in a sliding window scheme for object detection. By explicitly encoding both geometric and spatial arrangement among the segments, the kAS descriptor demonstrates state-of-the-art performance in shape-based object detection, and outperforms descriptors based on interest points and histograms of gradient orientations [19]. However, the kAS descriptor is not rotation invariant, because segments are rigidly ordered from left to right. This limits the repeatability of high-order kAS, and the best performance in [18] is reported when using 2AS.

In the handwriting recognition literature, one interesting work also motivated by the idea of learning a feature codebook is the study on writer identification by Schomaker et al. [20]. Writer identification assumes that the language is known beforehand and aims to distinguish between writers based on specific characteristics of handwriting. Schomaker et al. used closed-contour features of ink blobs directly, without any shape descriptor. The difference between contour features is computed using Hamming distance. The low-dimensional codebook representation presented in [20] is based on the Kohonen self-organizing map (SOM) [21]. Their approach demands good segmentation and is not scale or rotation invariant. To account for size variations, for instance, SOMs need to be computed at multiple scales, which requires large training data and reportedly takes 122 hours on a 128-node Linux cluster.

    3. Constructing a shape codebook

Recognition of diverse text content needs to account for the large intra-class variations typical in real document collections, including unconstrained document layouts, formatting, and handwriting styles. The scope of the problem intuitively favors low-level shape features that can be detected robustly without word or line detection and segmentation. Rather than focusing on selection of class-specific features, our approach aims to distinguish differences between text collectively using the statistics of a large variety of generic, geometrically invariant feature types (shape codewords) that are structurally indexed. Our emphasis on the generic nature of codewords presents a different perspective on recognition, which has traditionally focused on finding sophisticated features or visual selection models that may limit generalization performance.

We explore the kAS contour feature recently introduced by Ferrari et al. [18], which consists of a chain of k roughly straight, connected contour segments. Specifically, we focus on the case of triple adjacent segments (TAS) [22], which strike a balance between lower-order contour features that are not discriminative enough and higher-order ones that are less likely to be detected robustly. As shown in Fig. 3, a contour feature C is compactly represented by an ordered set of lengths and orientations of c_i for i ∈ {1, 2, 3}, where c_i denotes line segment i in C.

    3.1. Extraction of contour features

Feature detection in our approach is very efficient. First, we compute edges using the Canny edge detector [23], which gives precise localization and a unique response on text content. Second, we group contour segments by connected components and fit them locally into line segments using an algorithm that breaks a line segment into two only when the deviation is beyond a threshold [24]. Then, within each connected component, every triplet of connected line segments that starts from the current segment is extracted. Fig. 2 provides a visualization of the quality of contour features detected by our approach, using randomly permuted colors for neighboring line segments.
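The triplet-enumeration step above can be sketched as follows, treating segment connectivity within a connected component as an adjacency map. The data structure and function name are our illustration, not the authors' code; each simple three-segment chain is counted once regardless of traversal direction:

```python
def extract_triplets(adjacency):
    """Enumerate triplets of connected line segments (TAS candidates).

    adjacency maps each segment id to the ids of segments it touches
    within one connected component. Every simple chain a-b-c starting
    from each segment is emitted; a canonical ordering collapses a-b-c
    and c-b-a into one triplet.
    """
    triplets = set()
    for a, nbrs in adjacency.items():
        for b in nbrs:
            for c in adjacency.get(b, ()):
                if c != a:
                    # canonical order so a-b-c and c-b-a count once
                    triplets.add((a, b, c) if a < c else (c, b, a))
    return triplets
```

For a simple four-segment chain 0-1-2-3 this yields the two TAS candidates (0, 1, 2) and (1, 2, 3), consistent with the linear-time behavior claimed below.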

Our feature detection scheme requires only linear time and space in the number of contour fragments n, and is highly parallelizable. It is more efficient and stable than [18], which requires construction of a contour segment network and depth-first search from each segment, leading to O(n log n) time on average and O(n²) in the worst case.

We encode text structures in a translation, scale, and rotation invariant fashion by computing orientations and lengths with reference to the first detected line segment. This is distinct from the motivation of the kAS descriptor, which attempts to enumerate spatial arrangements of contours within local regions. Furthermore, the kAS descriptor does not take rotation invariance into account.

Fig. 3. Examples of (a) C-shape, (b) Z-shape and (c) Y-shape triple adjacent segment (TAS) shape features that capture local structures. TAS shape descriptors are segmentation-free, and are translation, scale and rotation invariant.

Fig. 4. The 25 most frequent exemplary codewords for (a) Arabic, (b) Chinese, (c) English, and (d) Hindi document images capture the distinct shape features of different languages. (e)–(h) show codeword instances in the same cluster as the first five exemplary codewords for each language (the leftmost column), ordered by ascending distances to the centers of their associated clusters. Scaled and rotated versions of similar features are clustered together.
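A minimal sketch of this relative encoding, under our own conventions (segments given as (length, orientation) pairs, orientation in radians; the function name is illustrative):

```python
import math

def encode_tas(segments):
    """Encode a 3-segment chain relative to its first segment.

    Lengths become ratios to the first segment's length and orientations
    become offsets from the first segment's orientation, so the encoding
    is unchanged by translation, uniform scaling, and rotation. This is
    an illustration of the idea in Section 3.1, not the authors' code.
    """
    l0, o0 = segments[0]
    lengths = [l / l0 for l, _ in segments]
    orients = [(o - o0) % (2 * math.pi) for _, o in segments]
    return lengths, orients
```

Scaling all lengths by a constant and adding a constant to all orientations leaves the encoding unchanged, which is exactly the invariance claimed above.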

    3.2. Measure of dissimilarity

The overall dissimilarity between two contour features can be quantified by the weighted sum of the distances in lengths and orientations. We use the following generalized measure of dissimilarity between two contour features C_a and C_b:

d(C_a, C_b, k) = k_length^T d_length + k_orient^T d_orient,   (1)

where vectors d_length and d_orient are composed of the individual distances between contour lengths and orientations, respectively, and k_length and k_orient are their corresponding weight vectors, providing sensitivity control over the tolerance of line fitting. One natural measure of dissimilarity in lengths between two contour segments is their log ratio. We compute the orientation difference between two segments by normalizing the absolute value of their angle difference to π. In our experiments, we use a larger weighting factor for orientation to de-emphasize the difference in the lengths, because lengths may be less accurate due to line fitting.
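A minimal numpy sketch of the weighted-sum measure in Eq. (1). The angle wrapping and the scalar weights (larger for orientation, as the text suggests) are our assumptions, not values from the paper:

```python
import numpy as np

def tas_dissimilarity(ca, cb, k_length=1.0, k_orient=2.0):
    """Weighted dissimilarity between two TAS features, as in Eq. (1).

    ca, cb: (lengths, orientations) triples for three segments. Length
    distance is the absolute log ratio; orientation distance is the
    absolute angle difference wrapped to [0, pi] and normalized by pi.
    The weight values are illustrative.
    """
    la, oa = np.asarray(ca[0], float), np.asarray(ca[1], float)
    lb, ob = np.asarray(cb[0], float), np.asarray(cb[1], float)
    d_length = np.abs(np.log(la / lb))
    diff = np.abs(oa - ob) % (2 * np.pi)
    diff = np.minimum(diff, 2 * np.pi - diff)   # wrap to [0, pi]
    d_orient = diff / np.pi                     # normalize to [0, 1]
    return k_length * d_length.sum() + k_orient * d_orient.sum()
```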

    3.3. Learning the shape codebook

We extract a large number of shape codewords by sampling from training images, and construct an indexed shape codebook by clustering and partitioning the codewords. A codebook provides a concise structural organization for associating large varieties of low-level features [25], and is efficient because it enables comparison to much fewer feature types.

    3.3.1. Clustering shape features

Prior to clustering, we compute the distance between each pair of codewords and construct a weighted graph G = (V, E), in which each node represents a codeword. The weight w(C_a, C_b) on an edge connecting two nodes C_a and C_b is defined as a function of their distance:

w(C_a, C_b) = exp(−d(C_a, C_b)² / σ_D²),   (2)

where we set the parameter σ_D to 20% of the maximum distance among all pairs of nodes.
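The graph construction can be sketched as a brute-force pairwise computation (the function name is ours, and `dist_fn` stands in for the measure d of Eq. (1)):

```python
import numpy as np

def affinity_matrix(features, dist_fn):
    """Pairwise Gaussian edge weights following Eq. (2).

    The bandwidth sigma is set to 20% of the maximum pairwise distance,
    as the text specifies. O(N^2) distance evaluations.
    """
    n = len(features)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = dist_fn(features[i], features[j])
    sigma = 0.2 * d.max()
    return np.exp(-(d ** 2) / (sigma ** 2))
```

The result is a symmetric matrix with unit diagonal, suitable as the weight matrix W used in the partitioning step that follows.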

We formulate feature clustering as a spectral graph partitioning problem, in which we seek to group the set of vertices V into balanced disjoint sets {V_1, V_2, ..., V_K}, such that, by the measure defined in (1), the similarity between vertices within the same set is high and that between different sets is low.

More concretely, let the N×N symmetric weight matrix for all the vertices be W, with N = |V|. We define the degree matrix D as an N×N diagonal matrix whose i-th element D(i) along the diagonal satisfies D(i) = Σ_j W(i, j). We use an N×K matrix X to represent a graph partition, i.e. X = [X_1, X_2, ..., X_K], where each element of X is either 0 or 1. We can show that the feature clustering formulation that seeks a globally balanced graph partition is equivalent to the normalized cuts criterion [26], and can be written as

maximize ε(X) = (1/K) Σ_{l=1}^{K} (X_l^T W X_l) / (X_l^T D X_l),   (3)

subject to X ∈ {0, 1}^{N×K} and Σ_j X(i, j) = 1.   (4)

Minimizing normalized cuts exactly is NP-complete. We use a fast algorithm [27] for finding its discrete near-global optima, which is robust to random initialization and converges faster than other clustering methods.
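As an illustration of this clustering step, the following uses the standard spectral relaxation of normalized cuts: take the top-K eigenvectors of the normalized affinity D^(-1/2) W D^(-1/2) and cluster their rows with a small k-means. This is a generic stand-in sketch, not the discrete solver of [27]:

```python
import numpy as np

def spectral_partition(w, k):
    """Approximate K-way balanced partition of affinity matrix w.

    Standard spectral relaxation of normalized cuts; label assignment
    uses a plain k-means with deterministic farthest-point seeding.
    """
    d_isqrt = 1.0 / np.sqrt(w.sum(axis=1))
    m = d_isqrt[:, None] * w * d_isqrt[None, :]      # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(m)                      # eigenvalues ascending
    emb = vecs[:, -k:]                               # top-k eigenvectors
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # deterministic farthest-point initialization, then plain k-means
    centers = [emb[0]]
    for _ in range(1, k):
        gaps = np.min([((emb - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(gaps))])
    centers = np.array(centers)
    for _ in range(50):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = emb[labels == j].mean(axis=0)
    return labels
```

On an affinity matrix with two well-separated blocks, the partition recovers the blocks; the discrete solver of [27] additionally searches for a near-optimal 0/1 assignment rather than relying on k-means.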

    3.3.2. Organizing features in the codebook

For each cluster, we select the feature instance closest to the center of the cluster as the exemplary codeword. This ensures that an exemplary codeword has the smallest sum of squared distances to all the other features within the cluster. To effectively capture the variations among each feature type, we compute the cluster radius of each exemplary codeword. The radius of a cluster is defined as the maximum distance from the cluster center to all the other features within the cluster. The constructed codebook C is composed of all exemplary codewords.
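The exemplar selection and radius computation described above can be sketched directly; the function name is ours, and the scalar features in the usage below are purely illustrative:

```python
import numpy as np

def exemplar_and_radius(cluster, dist_fn):
    """Pick the exemplary codeword and cluster radius for one cluster.

    The exemplar is the member with the smallest sum of squared
    distances to all other members; the radius is the maximum distance
    from the exemplar to any member.
    """
    d = np.array([[dist_fn(a, b) for b in cluster] for a in cluster])
    idx = int(np.argmin((d ** 2).sum(axis=1)))
    return cluster[idx], d[idx].max()
```

For the toy cluster [0.0, 1.0, 2.0] under absolute difference, the middle element is the exemplar and the radius is 1.0.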

    Table 1

    Languages and varieties in dataset used for evaluation of language identification.

    Language Arabic Chinese English Hindi Japanese Korean Russian Thai

    Pages 245 184 175 186 161 168 204 189

    Writers 22 18 26 17 8 25 5 10

    Table 2

    Confusion table of our approach for the eight languages.

    A C E H J K R T

    A 99.7 0.3 0 0 0 0 0 0

    C 1.4 85.0 4.0 1.0 6.7 1.0 0.7 0.2

    E 1.6 0 95.9 0.2 0 1.1 0.6 0.6

    H 0.2 0.2 0 98.8 0.8 0 0 0

    J 0 1.3 1.0 0.2 96.2 1.3 0 0

    K 0 0.8 0.1 1.9 0.5 96.0 0.5 0.1

    R 0.5 0 2.0 0 0 0 97.1 0.4

    T 0 0.3 1.6 0.9 0.6 0.3 0 96.3

    Fig. 5. Confusion tables for language identification using (a) LBP [28], (b) template matching [7], (c) kAS [18], (d) our approach (A: Arabic, C: Chinese, E: English, H: Hindi,

    J: Japanese, K: Korean, R: Russian, T: Thai, U: Unknown).

Fig. 4 shows the 25 most frequent exemplary shape codewords for Arabic, Chinese, English, and Hindi, which are learned from 10 documents of each language. Distinguishing features between languages, including the cursive style in Arabic, 45° and 90° transitions in Chinese, and various configurations due to long horizontal lines in Hindi, are learned automatically. Each row in Fig. 4(e)–(h) lists examples of codewords in the same cluster, ordered by ascending distances to the centers of their associated clusters. Through clustering, translated, scaled and rotated versions of similar features are grouped together.

Since each codeword represents a generic local shape feature, intuitively a majority of codewords should appear in images across content categories, even though their frequencies of occurrence deviate significantly. In our experiments, we find that 86.3% of codeword instances appear in document images across all eight languages.

    3.4. Constructing the image descriptor

We construct a histogram descriptor for each image, which provides statistics of the frequency at which each feature type occurs. For each codeword W detected in the image, we compute the nearest exemplary codeword C_k in the shape codebook. We increment the descriptor entry corresponding to C_k only if

d(W, C_k) < r_k,   (5)

where r_k is the cluster radius associated with the exemplary codeword C_k. This quantization step ensures that unseen features that deviate considerably from training features are not used for image description. In our experiments, we found that less than 2% of the contour features could not be found in the shape codebook learned from the training data.
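Eq. (5) amounts to a simple quantization rule when building the histogram. The following sketch (our naming) discards any feature whose nearest exemplar lies beyond that exemplar's cluster radius:

```python
import numpy as np

def image_descriptor(codewords, exemplars, radii, dist_fn):
    """Histogram over the shape codebook, following Section 3.4.

    Each detected codeword votes for its nearest exemplar, but only when
    the distance falls within that exemplar's cluster radius (Eq. (5));
    otherwise the feature is treated as unseen and discarded.
    """
    hist = np.zeros(len(exemplars))
    for w in codewords:
        dists = [dist_fn(w, e) for e in exemplars]
        k = int(np.argmin(dists))
        if dists[k] < radii[k]:
            hist[k] += 1
    return hist
```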

    4. Experiments

    4.1. Dataset

We use 1512 document images in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) from the University of Maryland multilingual database [4] and the IAM handwriting DB3.0 database [5] for evaluation on language identification (see Fig. 1). Both databases are well-known, large, real-world collections, which contain the source identity of each image in the ground truth. This enables us to construct a diverse dataset that closely mirrors the true complexities of heterogeneous document image repositories in practice. Table 1 summarizes the statistics of the dataset.

    4.2. Overview of the experiments

We compare our approach with three state-of-the-art approaches that are well motivated by different views of image categorization: template matching [15], local binary patterns (LBP) [28], and kAS [18]. Proposed by Ojala et al. [28], LBP is a widely used rotation invariant approach for texture analysis. It captures the spatial structure of local image texture in circular neighborhoods across angular space and resolution, and has demonstrated excellent results in a wide range of whole-image categorization problems involving diverse data [29,30].

For effective comparison, we used multi-class SVM classifiers [31] trained on the same pool of randomly selected handwritten document images from each language class in the following experiments, for LBP, kAS, and our approach, respectively. We further evaluate the generalization performance of our approach as the size of the training data varies.

    4.3. Results and discussions

The confusion tables for LBP, template matching, kAS, and our approach are shown in Fig. 5. Our approach gives excellent results on all eight languages, with a mean diagonal of 95.6% and a standard deviation of 4.5%. Table 2 lists all entries in the confusion table of our approach for the eight languages.

kAS, with a mean diagonal of 88.2%, is also shown to be very effective. Neither kAS nor our approach has difficulty generalizing across large variations such as font types or handwriting styles, as is evident from their relatively small standard deviations along the diagonal entries of the confusion tables, compared to LBP and template matching (see Fig. 6).

The performance of template matching varies significantly across languages, with a 68.1% mean diagonal and a 20.5% standard deviation along the diagonal. One big confusion of template matching is between Japanese and Chinese, since a document in Japanese may contain a varying number of Kanji (Chinese characters). Rigid templates are not flexible enough for identifying discriminative partial features, and the bias in the voting decision towards the dominant candidate causes less frequently matched templates to be ignored. Another source of error that lowers the performance of template matching is undetermined cases (see the unknown column in Fig. 5(b)), where probabilistic voting cannot decide between languages with roughly equal votes. Texture-based LBP could not effectively recognize differences between languages on a diverse dataset, because distinctive layouts and unconstrained handwriting exhibit irregularities that are difficult to capture using texture; its mean diagonal is only 55.1%.

Fig. 6. Comparison of language identification performances of different approaches.

Fig. 7. Examples of error cases of Chinese handwritten documents due to image degradations and content mixture.

Fig. 6 quantitatively compares the overall language identification performance of the different approaches by the means and standard deviations of the diagonal entries in their confusion tables. Shape-based approaches, including our approach and kAS, show higher recognition rates and smaller performance deviations, compared to those based on models of textures and rigid templates.

It gives further insight to analyze the language identification errors made by our approach. Among all eight languages, recognition of Chinese handwriting is shown to be the most challenging task. As shown in Fig. 5, the Chinese language identification performances for all four approaches are significantly lower compared to those for the other languages. We show several typical error cases in Fig. 7, where Chinese handwritten documents are wrongly recognized as other languages. These error cases include document images with severe image degradations, second-generation documents captured from low-resolution source copies (e.g. photocopies of fax documents), and documents containing a significant amount of other content.

Fig. 8. Recognition rates of our approach as the size of the training data varies. Our approach achieves excellent language identification performance even using a small number of handwritten document images per language for training.

Good generalization is important for the success of document analysis systems that need to process diverse, unconstrained data. Fig. 8 shows the recognition rates of our approach as the size of the training set varies. We observe highly competitive language identification performance on this challenging dataset even when a small amount of training data per language class is used. This demonstrates the effectiveness of using generic low-level shape features when mid- or high-level vision representations may not be general or flexible enough for the task.

Our results on language identification are very encouraging from a practical point of view, as the training in our approach requires considerably less supervision. Our approach only needs the class label of each training image, and does not require skew correction, scale normalization, or segmentation.

    5. Conclusion

In this paper, we have proposed a computational framework for language identification using a shape codebook composed of a wide variety of local shape features. Each codeword represents a characteristic structure that is generic enough to be detected repeatably, and the entire shape codebook is learned from training data with little supervision. Low-level, segmentation-free shape features and the geometrically invariant shape descriptor improve robustness against the large variations typical in unconstrained handwriting, and make our approach flexible to extend. In addition, the codebook provides a principled approach to structurally indexing and associating a vast number of feature types. Experiments on large, complex, real-world document image collections show excellent language identification results that exceed the state of the art.

    Acknowledgments

Portions of this paper appeared in previous conference publications [25,32]. This research was supported by the US Department of Defense under contract MDA-9040-2C-0406. The authors would like to thank the anonymous reviewers for their constructive comments, which have significantly improved the paper.

    References

    [1] L. Vincent, Google Book Search: document understanding on a massive scale, in: Proceedings of the International Conference on Document Analysis and Recognition, 2007, pp. 819–823.

    [2] G. Zhu, T.J. Bethea, V. Krishna, Extracting relevant named entities for automated expense reimbursement, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 1004–1012.

    [3] S. Rice, G. Nagy, T. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, Dordrecht, 1999.

    [4] Y. Li, Y. Zheng, D. Doermann, S. Jaeger, Script-independent text line segmentation in freestyle handwritten documents, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1313–1329.

    [5] U. Marti, H. Bunke, The IAM-database: an English sentence database for off-line handwriting recognition, International Journal of Document Analysis and Recognition 5 (2006) 39–46. Available: http://www.iam.unibe.ch/~fki/iamDB/.

    [6] R. Plamondon, S.N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 63–84.

    [7] J. Hochberg, K. Bowers, M. Cannon, P. Kelly, Script and language identification for handwritten document images, International Journal of Document Analysis and Recognition 2 (2–3) (1999) 45–52.

    [8] D.-S. Lee, C.R. Nohl, H.S. Baird, Language identification in complex, unoriented, and degraded document images, in: Proceedings of the IAPR Workshop on Document Analysis Systems, 1996, pp. 17–39.

    [9] A. Spitz, Determination of script and language content of document images, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (3) (1997) 235–245.

    [10] J. Ding, L. Lam, C.Y. Suen, Classification of oriental and European scripts by using characteristic features, in: Proceedings of the International Conference on Document Analysis and Recognition, 1997, pp. 1023–1027.

    [11] C.Y. Suen, S. Bergler, N. Nobile, B. Waked, C. Nadal, A. Bloch, Categorizing document images into script and language classes, in: Proceedings of the International Conference on Advances in Pattern Recognition, 1998, pp. 297–306.

    [12] S. Lu, C.L. Tan, Script and language identification in noisy and degraded document images, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2) (2008) 14–24.

    [13] T. Tan, Rotation invariant texture features and their use in automatic script identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (7) (1998) 751–756.

    [14] A. Busch, W.W. Boles, S. Sridharan, Texture for script identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (11) (2005) 1720–1732.

    [15] J. Hochberg, P. Kelly, T. Thomas, L. Kerns, Automatic script identification from document images using cluster-based templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (2) (1997) 176–181.

    [16] H. Ma, D. Doermann, Word level script identification on scanned document images, in: Proceedings of the Document Recognition and Retrieval, 2004, pp. 124–135.

    [17] S. Jaeger, H. Ma, D. Doermann, Identifying script on word-level with informational confidence, in: Proceedings of the International Conference on Document Analysis and Recognition, 2005, pp. 416–420.

    [18] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of adjacent contour segments for object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (1) (2008) 36–51.

    [19] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.

    [20] L. Schomaker, M. Bulacu, K. Franke, Automatic writer identification using fragmented connected-component contours, in: Proceedings of the International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 185–190.

    [21] T. Kohonen, Self-Organization and Associative Memory, third ed., Springer, Berlin, 1989.

    [22] X. Yu, Y. Li, C. Fermuller, D. Doermann, Object detection using a shape codebook, in: Proceedings of the British Machine Vision Conference, 2007, pp. 1–10.

    [23] J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (6) (1986) 679–697.

    [24] P.D. Kovesi, MATLAB and Octave functions for computer vision and image processing, 2000. Available: http://www.csse.uwa.edu.au/~pk/research/matlabfns/.

    [25] G. Zhu, X. Yu, Y. Li, D. Doermann, Learning visual shape lexicon for document image content recognition, in: Proceedings of the European Conference on Computer Vision, vol. 2, 2008, pp. 745–758.

    [26] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–905.

    [27] S.X. Yu, J. Shi, Multiclass spectral clustering, in: Proceedings of the International Conference on Computer Vision, 2003, pp. 11–17.

    [28] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987.

    [29] M. Heikkila, M. Pietikainen, A texture-based method for modeling the background and detecting moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 657–662.

    [30] T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (12) (2006) 2037–2041.

    [31] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    [32] G. Zhu, X. Yu, Y. Li, D. Doermann, Unconstrained language identification using a shape codebook, in: Proceedings of the International Conference on Frontiers in Handwriting Recognition, 2008, pp. 13–18.

    About the Author: GUANGYU ZHU received the B.Eng. degree (First Class Honours) in Electrical and Electronic Engineering from Nanyang Technological University, Singapore, in 2001. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Maryland, College Park. From 2001 to 2003, he was a technical solution architect at IBM Singapore. He worked at the IBM Almaden Research Center, San Jose, California, in 2006 and at Yahoo! Search and Advertising Sciences, Santa Clara, California, in 2008. Since 2004, he has been a member of the Language and Media Processing Laboratory, University of Maryland. His research interests include document image analysis, machine learning, pattern recognition, and information retrieval. He is a member of the IEEE and IAPR.


    G. Zhu et al. / Pattern Recognition 42 (2009) 3184 -- 3191 3191

    About the Author: XIAODONG YU received the B.Eng. in Electrical Engineering from the University of Science and Technology of China and the M.S. in Computer Science from the National University of Singapore in 1999 and 2002, respectively. He was with the Institute for Infocomm Research, Singapore, and the Nanyang Technological University, Singapore, from 2001 to 2004. He is now a Ph.D. student in the Department of Electrical and Computer Engineering, University of Maryland, College Park. His research interests include object recognition and image and video retrieval. He is a member of the IEEE.

    About the Author: YI LI received the B.Eng. and M.Eng. degrees from South China University of Technology, Guangzhou, China, in 2001 and 2004, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering at the University of Maryland, College Park. He is with the Maryland Terrapin Team, which received Second Place in the First Semantic Robot Vision Challenge (SRVC) at the 22nd Conference on Artificial Intelligence (AAAI 2007), sponsored by the US National Science Foundation. His research interests include document image processing, shape analysis, object categorization, robotics, and computer vision. He is the Future Faculty Fellow of the A. James Clark School of Engineering at the University of Maryland. He received the Best Student Paper Award from the 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR 2006). He is a student member of the IEEE.

    About the Author: DAVID DOERMANN received the B.S. degree in Computer Science and Mathematics from Bloomsburg University in 1987 and the M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1989 and 1993, respectively. Since 1993, he has been a codirector of the Laboratory for Language and Media Processing in the Institute for Advanced Computer Studies at the University of Maryland and an adjunct member of the graduate faculty. His team of 15–20 researchers focuses on topics related to document image analysis and multimedia information processing. The team's recent intelligent document image analysis projects include page decomposition, structural analysis and classification, page segmentation, logo recognition, document image compression, duplicate document image detection, image-based retrieval, character recognition, generation of synthetic OCR data, and signature verification. In video processing, projects have centered on the segmentation of compressed-domain video sequences, structural representation and classification of video, detection of reformatted video sequences, and the performance evaluation of automated video analysis algorithms. He is a coeditor of the International Journal on Document Analysis and Recognition. He has more than 25 journal publications and almost 100 refereed conference proceedings. He is a member of the IEEE.