full report 13 dec 2010 -...
ABSTRACT
Writer identification based on Chinese handwriting is an active research topic in
pattern recognition and computer vision, and many innovative methods and approaches
have recently been developed for it. Unlike the characters of Western alphabets such
as English, German and French, some oriental scripts such as Korean, Arabic and
Chinese have structural characteristics. Chinese characters in particular have a
complex structure, owing to the numerous strokes that are warped into cursive shapes
and to the much larger character set. Hence, more features need to be generated
before the classification phase to achieve better identification, and these features
must be well represented for identification purposes. In this study, an improved
discretization is therefore implemented to transform the range of continuous
quantitative values of a writer's features into a number of appropriate intervals,
each denoted by an integer label. Several experiments were conducted with two types
of datasets: pre-discretized and post-discretized. The post-discretized datasets
contain the extracted features after the discretization process has been applied,
while the pre-discretized datasets contain the original features obtained with the
Direction-based Feature Extraction (DFE) technique. To assess identification
performance reliably, 10-, 7- and 5-fold cross-validation (CV) was tested on both
datasets. The experiments show that the best overall results are obtained with
discretized data, with identification accuracy above 94.0%, compared with below
50.0% for pre-discretized data. It can be concluded that the discretization process
is effective at representing writers' features and yields higher identification
rates for better forensic document analysis.
ABSTRAK
Identification of Chinese handwriting is an interesting research thrust in the
fields of pattern recognition and computer vision. Many recent innovative methods
and approaches have been developed by peer researchers for the purpose of
identifying writers. Unlike Western characters such as English, German and French,
most oriental characters such as Korean, Arabic and Chinese have structural
characteristics. These structural characteristics, especially the structure of
Chinese characters, are complex because a variety of curved shapes form many cursive
representations. Therefore, many features need to be generated before classification
to obtain good identification. However, the features need to be well represented for
quality identification. This study therefore implements an improved discretization
that transforms the range of continuous quantitative values of a writer's features
into appropriate intervals; this is known as integer labelling. Various experiments
were carried out using two different datasets, namely pre-discretized data and
post-discretized data. The post-discretized dataset is the data whose features have
been extracted after the discretization process, whereas the pre-discretized dataset
consists of the original features extracted using the Direction-based Feature
Extraction method (Penyarian Fitur berdasarkan Arah, PFA). To obtain good
identification performance, 10-, 7- and 5-fold cross-validation was used to test
both types of dataset. The findings show that almost all of the best results were
produced by the post-discretized dataset, with accuracy exceeding 94%, compared with
accuracy below 50% for the pre-discretized dataset. It can therefore be concluded
that the discretization process is highly effective at representing writers'
features to obtain high identification rates in forensic document analysis.
TABLE OF CONTENTS
CHAPTER TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 Overview
1.2 Problem Background
1.3 Problem Statement
1.4 Dissertation Aim
1.5 Objectives
1.6 Dissertation Scope
1.7 Significance of the Dissertation
1.8 Organization of the Dissertation
2 LITERATURE REVIEW
2.1 Introduction
2.2 Overview of Pattern Recognition
2.2.1 Pattern Recognition Hierarchy (Recognition, Identification and Verification)
2.2.2 Identification based handwriting in Pattern Recognition
2.3 Overview of Chinese Handwriting
2.3.1 Writer Identification on Multiple Languages
2.3.2 Writer Identification on Chinese Handwriting
2.4 Process in Chinese Handwritten Identification
2.4.1 Pre-processing
2.4.1.1 Normalisation
2.4.1.2 Thinning
2.4.2 Feature Extraction on Handwritten Character
2.4.3 Discretization Process on Chinese Handwritten Data
2.4.3.1 Previous Discretization Method
2.4.3.2 Beneficial of Discretization Method
2.4.4 Writer Identification
2.4.4.1 Classification based on Soft Computing Approaches
2.4.4.1.1 Rough Set Theory
2.5 Summary
3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Framework
3.3 Pre-Processing Phase
3.3.1 Chinese Handwritten Datasets
3.3.2 Defining Chinese Handwritten Data Primitives
3.3.3 Normalisation on Chinese handwriting
3.3.4 Feature Extraction
3.3.4.1 Directional based Feature Extraction technique
3.4 Discretization Phase
3.4.1 Invariant Discretization Method
3.5 Identification Phase
3.5.1 Rough set classification
3.5.2 Performance Measurement
3.5.2.1 Confusion Matrix
3.5.2.2 Cross Validation
3.6 Summary
4 EXPERIMENTAL RESULTS
4.1 Introduction
4.2 Chinese Handwritten datasets
4.3 Pre-processing Phase
4.3.1 Standard size and Elimination of Background’s Noise
4.3.2 Binarization
4.3.3 Feature Extraction
4.4 Discretization Phase
4.4.1 Discretization Procedure
4.5 Training Process
4.6 Testing Process and Classification for Chinese Writer Identification
4.7 Experimental results of Pre-Discretize and Post-Discretize on Chinese datasets
4.8 Summary
5 CONCLUSION
5.1 Introduction
5.2 Dissertation Contributions
5.3 Future Works
5.4 Summary
REFERENCES
LIST OF TABLES
TABLE NO. TITLE
1.1 Advantages and disadvantages of feature extraction and identification methods
2.1 Hierarchy in Pattern Recognition
2.2 Various methods on multiple languages
2.3 Offline and online writer identification on Chinese handwriting
2.4 Feature extraction techniques for identification process
2.5 Categories of Discretization
2.6 Various Discretization methods
2.7 Techniques based on Soft computing approaches
3.1 Information for writer identification
4.1 Feature vectors from 12 samples of handwriting
4.2 Reduction process on both datasets classification
4.3 Comparisons of identification rates with different training and testing datasets
4.4 Cross Validation experiment settings for both types of datasets
4.5 Comparisons of identification rates using 70% training and 30% testing data with 10, 7 and 5-fold Cross Validation (CV)
4.6 Comparisons of identification rates using 60% training and 40% testing data with 10, 7 and 5-fold Cross Validation (CV) on both datasets
LIST OF FIGURES
FIGURE NO. TITLE
2.1 Basic framework of Pattern Recognition System (Selim Aksoy, 2010)
2.2 Writer identification framework (Saphin Gupta, 2008)
2.3 Writer verification framework (Saphin Gupta, 2008)
2.4 Same Chinese character ( ) with different number of strokes (Fang Hsuan Cheng, 1997) a) 8 strokes character b) 6 strokes character
2.5 Examples of similar Chinese character with different meaning (Po Hsien Wu, 2003)
2.6 Examples of new created word and its meaning
2.7 (a) Four handwritings from two writers (b) Grid microstructure features on handwritings (c) Differences measurement between two handwritings (Xin Li and Xiaoqing Ding, 2009)
2.8 An example of the original image and the normalisation result of the Chinese handwritten text (Yuchen et al., 2009) a) before normalisation b) after normalisation
2.9 An example of the original image and the thinning result of the Chinese character (Fang Hsuan Cheng, 1997) a) before thinning b) after thinning
2.10 Rough Set Theory (Pawlak, 1982)
2.11 Approximation role in Rough Set Theory (Pawlak, 1982)
3.1 Discretization on writer identification based Chinese handwriting framework
3.2 Sample of Chinese handwritten text (Tonghua et al., 2006)
3.3 Example of normalized Chinese handwritten image (Yuchen et al., 2009) (a) before normalisation (b) after normalisation
3.4 Image of 9 equal sized zones of 1 Chinese character,
3.5 Starters are rounded in red colour
3.6 Intersections are rounded in red
3.7 Minor starters are rounded in red
3.8 Feature vectors of the zoned Chinese character image,
3.9 Traverse process in a Chinese character,
3.10 Statistic feature in ROSETTA Toolkit
3.11 Annotation feature in ROSETTA Toolkit
3.12 Confusion Matrix of pre-discretized datasets with Johnson’s classification
3.13 Confusion Matrix of post-discretized datasets with Johnson’s classification
3.14 Interface for 10 fold Cross Validation
3.15 Interface for command file and log file
3.16 Cross Validation log file
4.1 4 variation of Chinese handwritten characters from 3 writers
4.2 Image of 50x50 pixels effect of pre-processing algorithm (a) before pre-processing (b) after pre-processing
4.3 Binary value from the input character for writer 1
4.4 Pre-discretized Chinese datasets of 15 writers ready for ID process
4.5 Examples of Discretization process for writer 1 and writer 13
4.6 Post-discretized Chinese datasets for 15 writers
4.7 Identification rate for post-discretized datasets of writer 1
4.8 Confusion Matrix for post-discretized datasets of all writers
4.9 JohnsonReducer rules and its BatchClassifier command
4.10 CV script for Holte 1R, Genetic Algorithm and Exhaustive method
4.11 Visualization of divergence level between pre-discretized and post-discretized Chinese datasets using 70% training and 30% testing data
4.12 Visualization of divergence level between pre-discretized and post-discretized Chinese datasets using 60% training and 40% testing data
LIST OF ABBREVIATIONS
ACO Ant Colony Optimisation
ANN Artificial Neural Network
ASI Aspect Scale Invariant
ATM Automated Teller Machine
BMP Bit Map Picture
CV Cross Validation
DFE Directional based Feature Extraction
DCR Discretization Class Reduction
EIG Equal Information Gain
EIW Equal Interval Width
FFT Fast Fourier Transformation
GGD Generalized Gaussian Distribution
GM Geometrical Moments
GMM Gaussian Mixture Model
GSC Gradient, Structural, Concavity
GUI Graphical User Interface
HIT MW Harbin Institute of Technology, Multi Writer
HMM Hidden Markov Model
HPP Horizontal Projection Profile
ID Invariant Discretization
IDC Information Distance Criterion
LM Language Model
MAE Mean Absolute Error
ME Maximum Entropy
MSC Multimedia Super Corridor
OCR Optical Character Recognition
RFID Radio Frequency Identification
RNG Random Number Generator
SD Standard Deviation
SC Shape Codes
SVM Support Vector Machine
TSC Temporal Sequence Code
UMI United Moment Invariant
VPP Vertical Projection Profile
WPT Wavelet Packet Transform
CHAPTER 1
INTRODUCTION
1.1 Overview
Since the launch of the Multimedia Super Corridor (MSC) initiative in Malaysia,
studies in computer vision and pattern recognition have shown a great deal of
interest in biometric security and in information and communication technology. As
a result, the computer system has become increasingly vital as a tool and medium
for information and communication. The MSC themes identify a few focus areas,
including Digital Content, RFID Technology, Advanced Materials and Biometric
Technology. Biometric technology has become an important research area nowadays. A
biometric system is an established kind of system that applies pattern recognition
technology. Its operation begins with an individual's input, followed by feature
extraction, and ends with a comparison between the extracted features and the
models stored in a database to obtain the result. This growth has led to many
biometric applications being analysed, developed and commercialised for security
and crime identification purposes, including Handwriting (Srihari and Ball, 2009;
Guo, Christian and Alex, 2010), Signature (Cheng, Beng, and Connie, 2009; Eusebiu
Marcu, 2010), Iris (Ramkumar et al., 2005; Kevin Bowyer, 2009), Facial Features
(Thirimachos and Josef, 2009; Ehsan et al., 2010), Fingerprints (Duncan et al.,
2009) and others. This study concerns writer identification based on Chinese
handwriting, an interesting topic in the area of Pattern Recognition.
Pattern recognition in handwriting is a wide-ranging term that covers many
application fields, including identification based on handwriting (Guo, Christian
and Alex, 2010), verification based on handwriting (Srihari and Ball, 2009),
authentication (Muzaffar and Jurgen, 2009; Behzad and Mohsen, 2010) and character
recognition (Tonghua et al., 2009). Each of these approaches has a different
intention. Guo, Christian and Alex (2010) proposed a character prototype approach as
a model template to support their alphabet knowledge base. The approach provides
additional information to support the writer identification process while
preserving the individuality of the handwriting style. Their experimental results
show an increase in writer identification accuracy from 66% to 87%. Other studies
focus on the early stages, such as pre-processing (including normalisation) and
feature extraction, in order to attain better performance in recognition and
identification tasks. It is vital to understand the history and complexity of the
handwriting structure before it can be classified and categorized. Comprehensive
reviews of research on writer identification for Chinese handwriting using various
methods are given by LongZuo and Tieniu Tan (2002) and Cheng and En (2009).
In this study, the effectiveness of the discretization process for individual
identification based on Chinese handwriting is investigated. The discretization in
the proposed framework is based on the Invariant Discretization proposed by Azah
Kamilah Muda, Siti Mariyam Shamsuddin, and Maslina Darus (2008). According to the
authors, the Mean Absolute Error (MAE) is used to describe the authorship
invariance of the handwriting, which is then combined with the proposed Invariant
Discretization to organize the obtained features into the related writer class in
order to improve identification performance.
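The role of MAE as an invariance measure can be illustrated with a small sketch. The
Python below is not the authors' implementation: the feature vectors and variable
names are hypothetical, and it only assumes that two samples from the same writer
should give a smaller MAE than samples from different writers.

```python
def mean_absolute_error(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical feature vectors (illustration only, not data from the study).
writer1_sample1 = [0.20, 0.50, 0.90]
writer1_sample2 = [0.25, 0.45, 0.95]
writer2_sample1 = [0.70, 0.10, 0.40]

# Intra-writer error: two samples written by the same person.
intra = mean_absolute_error(writer1_sample1, writer1_sample2)
# Inter-writer error: samples written by different people.
inter = mean_absolute_error(writer1_sample1, writer2_sample1)

print(f"intra-writer MAE = {intra:.3f}, inter-writer MAE = {inter:.3f}")
```

Under this assumption, features are useful for identification when the intra-writer
value stays clearly below the inter-writer value.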
1.2 Problem Background
Identification based on Chinese handwriting is an active research topic in pattern
recognition and computer vision. Recently, many innovative methods and approaches
have been developed for writer identification, using Dynamic Feature on Strokes
(Bangy Li and Tieniu Tan, 2009), Fusion on both Dynamic and Static Features
(Wenfeng Jin, Yunhong Wang and Tieniu Tan, 2005), and Textual Analysis (JunFeng and
Xu Gao, 2009). These approaches manage to overcome the complexity of Chinese
characters and make the identification task easier. However, each has advantages
and disadvantages, which are summarized in Table 1.1. Some background is also
useful for early understanding, such as the development of Hidden Markov Model
(HMM) based approaches for Chinese character recognition (Kim and Govindaraju,
1997; Tonghua et al., 2009). These approaches are well proven and established, but
some weaknesses remain. The HMM approach often requires a large amount of data for
computation, and it is complicated because a classifier has to be developed for
each character when dealing with a large character set. Yu Chen et al. (2009)
pointed out three typical kinds of randomness that occur in Chinese handwriting
during feature extraction: the writing style, the content and the position of the
characters. Their proposed method is reported to cope with this randomness,
achieving accuracies of 98% and 95% on databases of 100 and 500 persons
respectively (Table 1.1). Unfortunately, the method has a high computational cost.
Nafiz and Fatos (2001) provide a summary of character recognition prior to the
identification task. The authors discuss the status of character recognition at
the time and its weaknesses, and offer suggestions that are helpful for new
research in the field. Unlike the characters of Western alphabets such as English,
German and French, some oriental scripts such as Korean, Arabic and Chinese have
structural characteristics.
Table 1.1: Advantages and disadvantages of feature extraction and identification methods

| Authors | Approaches | Pros | Cons | Result (%) |
|---|---|---|---|---|
| Tonghua et al. (2009) | Segmentation Free Strategy based on HMM | State of art strategy | Depends on CPU time, memory, experiences of designer. | Confidence with more than 99%. |
| YuChen et al. (2009) | Spectral Feature Extraction Method based on Fast Fourier Transformation | Minimizes randomness in Chinese character. Feasible for large volume of data set. Successfully commercialised. | Higher computation costs. | Handwriting samples: 100 persons = 98%; 500 persons = 95% |
| WenFeng Jin et al. (2009) | Sum Rule, Weighted Sum Rule and User Specific Sum Rule applied to combine dynamic and static features | Minimizes the complexity in Chinese character. | Focuses on 12 primary Chinese strokes. More training data needed. | Dynamic + Static features improved identification accuracy better than dynamic features |
| Bangy Li and Tieniu Tan (2009) | Temporal Sequence Code (TSC) and Shape Codes (SC) | Require small number of character. Work effectively for English and Chinese character. | Depends on emotions and physical state of writers. | Chinese template acc. >90%; English template acc. >95% |
| He and Tan (2004) | Textual Analysis Approaches using Gabor Filters and Autocorrelation Function | Applicable to both text dependent and text independent Chinese handwriting. | Long calculation time. High computation cost. | Top 1 = 90%; Top 3 = 96%; Top 5 = 100%; Top 10 = 100% |
| Kim et al. (1997) | Hidden Markov Model | Work well with variety of cursive strokes. Uses small memory. | Large data (18,000 characters). Time complexity and not applicable to real time application. Depends on input pattern. | Accuracy rate of 90.3% with speed of 1.83 s per character. |
The difficulties of Chinese characters are well known. First, Chinese has a very
large character set with a variety of categories. Second, Chinese handwriting
involves many types of stroke features, such as dynamic strokes, sub-strokes and
others. A novel method for the dynamic stroke problem in Chinese characters was
proposed by Bangy Li et al. (2007). Unfortunately, this method is only useful for
online text-independent identification.
Motivated by these common issues in handling data on the various Chinese stroke
features, we were attracted by the discretization approaches described by Alexis et
al. (2009) and Fabrice and Ricco (2005). According to the authors, these approaches
can handle complex and continuous attributes and are applied before classification.
Thus, in this study, the discretization concept is adopted in the proposed
identification framework to seek an improvement in writer identification accuracy.
The identification results are used to show the effectiveness of discretization on
Chinese handwriting. The discretization applied to Chinese handwriting in this
study is based on the method proposed by Azah Kamilah, Siti Mariyam and Maslina
Darus (2008), who proposed an Invariant Discretization method for handwriting
identification. Their results show an accuracy of 99.9% for writer identification
using post-discretized data rather than pre-discretized data.
1.3 Problem Statement
Discretization methods can be classified as global versus local, supervised versus
unsupervised, and static versus dynamic. Supervised methods need class label
information, while unsupervised methods do not. Gennady Agre and Stanimir (2002)
compared supervised and unsupervised discretization methods for continuous
attributes. Their results showed that classification accuracy improved
significantly in both cases. This indicates that discretization is an important
pre-processing step for machine learning and data mining applications.
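The difference between the supervised and unsupervised families can be sketched as
follows. This is an illustrative assumption, not the procedure compared by Gennady
Agre and Stanimir (2002): the unsupervised cut is simply the midpoint of the value
range (two equal-width intervals), while the supervised cut is the boundary that
minimizes the weighted class entropy of the two sides. The values and writer labels
are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def unsupervised_cut(values):
    """Equal-width cut into two intervals; class labels are not used."""
    return (min(values) + max(values)) / 2


def supervised_cut(values, labels):
    """Cut point that minimizes the weighted class entropy of both sides."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (low + high) / 2, score
    return best_cut


# Hypothetical values of one feature for two writers, "A" and "B".
values = [0.10, 0.15, 0.20, 0.30, 0.80, 0.90]
labels = ["A", "A", "A", "B", "B", "B"]

print("unsupervised cut:", unsupervised_cut(values))
print("supervised cut:", supervised_cut(values, labels))
```

In this toy data the label-aware cut falls between the two writers, whereas the
equal-width cut places one sample of writer "B" in the same interval as writer "A".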
As the problem background shows, many efforts have been made at either the
pre-processing or the post-processing stage to achieve the best classification
performance. Chinese is known to have a very large character set, and its
characters have structural characteristics built from many stroke features (Fang
Hsuan Cheng, 1997; Tieniu Tan et al., 2000). This creates a large amount of feature
data, which slows down computation and leads to misclassification during learning
and classification. Misclassification is often caused by overlapping feature
spaces; this is true for both handwriting and noise.
In practice, too much pre-processing at an early stage, such as normalisation to
remove noise from the characters in order to refine classification, would almost
certainly trade away important individual characteristics of the writer's
handwriting (Andreas and Horst, 2005) and hence lead to poor classification.
These issues can significantly affect the overall performance of the handwriting
identification task in terms of both identification capability and competency.
Thus, discovering a set of decision rules in the datasets is one of the most
important tasks in machine learning and data mining. For instance, if a raw dataset
is large and consists of continuous values, conditions need to be created through
appropriate decision rules to find the threshold values or specific ranges in the
dataset that represent the objects. In other words, a better data representation is
created. This process is called discretization.
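As a minimal sketch of this idea, assuming simple Equal Interval Width (EIW) binning
rather than the Invariant Discretization used later in the dissertation, continuous
feature values can be mapped to integer interval labels as follows. The function
name, the sample values and the choice of four intervals are hypothetical.

```python
def equal_width_labels(values, n_bins):
    """Map continuous values to integer interval labels 0 .. n_bins - 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]  # a single value needs only one interval
    width = (high - low) / n_bins
    labels = []
    for v in values:
        index = int((v - low) / width)
        # The maximum value would fall just past the last interval, so clamp it.
        labels.append(min(index, n_bins - 1))
    return labels


# Hypothetical continuous feature values for one writer.
features = [0.02, 0.10, 0.35, 0.40, 0.77, 1.00]

print(equal_width_labels(features, 4))  # -> [0, 0, 1, 1, 3, 3]
```

Each integer label stands for a range of the original values, so nearby feature
values collapse into the same symbol before classification.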
Accordingly, discretization based on the idea proposed by Azah Kamilah et al.
(2008) is added to the writer identification framework for Chinese handwriting in
this dissertation. The proposed Invariant Discretization is chosen as the basic
structure to assist the classification phase because their experimental results
demonstrated 99.9% identification accuracy with discretized data as opposed to
non-discretized data.
Hence, the hypothesis of this study can be stated as follows:
The discretization process would enhance the performance of individual
identification on Chinese handwriting.
1.4 Dissertation Aim
The aim of this study is to investigate the impact of Discretization process on
Chinese handwriting for individual identification.
1.5 Objectives
There are four main objectives to be achieved in this study:
1. To propose a new framework of Discretization process for Chinese writer
identification.
2. To develop Azah’s Invariant Discretization algorithm on Chinese characters.
3. To evaluate the effectiveness of Discretization process on Chinese
handwriting.
4. To compare identification on post-discretized Chinese handwriting data,
produced with Azah’s Invariant Discretization method, against identification on
pre-discretized Chinese handwriting data.
1.6 Dissertation Scope
Several concerns were considered before defining the scope of this study. As
acknowledged, common problems arise in Chinese handwriting, including the large
number of Chinese character features and their complicated structure, which would
probably take a very long time to address if started from scratch. Because this
dissertation focuses mainly on the discretization process, only a small amount of
pre-processing is carried out on the Chinese handwriting. Normalization is applied
here only to standardize all character images to an equal size and to enhance the
quality of the images for clearer depiction. Overall, the scope of the study covers
the following:
1. The HIT MW handwritten Chinese database is chosen as the experimental
database.
2. MATLAB is used for simulation and visualization.
3. Programs are developed in Visual C++.
4. The ROSETTA Toolkit is used for classification.
1.7 Significance of the Dissertation
This study provides some basic ideas and can be treated as a benchmark for
researchers or practitioners who are interested in finding an alternative approach
to identifying individuals from handwriting. Moreover, this study could form part
of research on biometric and natural approaches to personal authentication, such as
the work of Eusebiu Marcu (2010), since accurate personal identification could help
to prevent crime and forgery.
1.8 Organization of the Dissertation
The dissertation is organized as follows:
1. Introduction
The first chapter gives a general overview of the dissertation, including an
introduction to pattern recognition technology and Chinese handwriting for
identification, common issues that arise in Chinese handwriting, and the scope,
objectives and significance of the study.
2. Literature Review
The second chapter discusses the background and history of the related work in
more detail.
3. Research Methodology
In this chapter, the writer identification framework is constructed. The relevant
data, and the methods and tools best suited to the work, are briefly described.
4. Experimental Results
Chapter 4 presents the experimental results after pre-processing, feature
extraction and classification. The results obtained are analysed, compared and
evaluated.
5. Conclusion
This chapter summarizes the whole study. Future work and the contributions of this
dissertation are also included.