CHAPTER 2
LITERATURE SURVEY
2.1 SURVEY ON TEXT MINING
Eighty percent of the information in the world is currently
stored in unstructured textual format (Kalogeratos and Likas, 2011).
Although techniques such as Natural Language Processing (NLP) can
accomplish limited text analysis, there are currently no computer
programs available to analyse and interpret text for diverse information
extraction needs. Therefore text mining is a dynamic and emerging area.
The world is fast becoming information intensive, with specialized
information being collected into very large data sets. One example is the
extraction of information from Chinese handwritten documents (Koo and
Cho, 2012).
The Internet, for instance, contains a vast amount of online text
documents, which rapidly change and grow. It is nearly impossible to
manually organize such vast and rapidly evolving data. The necessity to
extract useful and relevant information from such large data sets (Chen et
al, 2010) has led to an important need to develop computationally
efficient text mining algorithms. An example problem is to automatically
assign natural language text documents to predefined sets of categories
based on their content.
Other examples of problems involving large data sets include
searching for targeted information in scientific citation databases such
as those of the Institute of Electrical and Electronics Engineers (IEEE),
the Association for Computing Machinery (ACM) and Elsevier's Scopus
(SCOPUS); searching, filtering and categorizing web pages by topic
(Dimopoulos et al., 2010); and routing relevant email to the appropriate
addresses. A particular problem of interest here is that of classifying
documents into a set of user-defined categories based on their content.
As the document size increases, the dimension of the hyperspace in which
text classification is done becomes enormous, resulting in high
computational cost (Luo et al, 2009).
However, the dimensionality can be reduced through feature
extraction algorithms. In topic summarization (Forestier et al, 2010),
the summaries are reported to be superior, in terms of content coverage,
coherence, and consistency, to those derived from existing summarization
methods based on human-composed reference summaries.
Text mining is the automatic and semi-automatic extraction of
implicit, previously unknown, and potentially useful information and
patterns, from a large amount of unstructured textual data, such as
natural-language texts. In text mining, each document is represented as a
vector, whose dimension is approximately the number of distinct
keywords in it, which can be very large. One of the main challenges in
text mining is to classify textual data with such high dimensionality
(Song et al, 2013).
In addition to high dimensionality, text-mining algorithms
should also deal with word ambiguities such as pronouns, synonyms,
noisy data, spelling mistakes, abbreviations, acronyms and improperly
structured text. Text mining algorithms are of two types: supervised
learning and unsupervised learning. In addition to these, a meta-learning
approach has also been applied to optimization (Kordik et al, 2010).
Supervised learning (Zhiding et al, 2010) is a technique in
which the algorithm uses predictor and target attribute value pairs to
learn the predictor and the target value relation. The training data consist
of pairs of predictor and target values. Each predictor value is tagged
with a target value. If the algorithm can predict a categorical value for a
target attribute, it is called a classification function. Class is an example
of a categorical variable. Positive and negative can be two values of the
categorical variable class. Categorical values do not have partial
ordering. If the algorithm can predict a numerical value then it is called
regression. Numerical values have partial ordering.
The traditional k-means type of algorithm is limited to
numeric data. Ahmad and Dey (2007) present a clustering algorithm
based on the k-means paradigm that works well for data with mixed numeric
and categorical features. The authors propose a new cost function and
distance measure based on the co-occurrence of values. The measures also
take into account the significance of an attribute towards the clustering
process (Birant and Kut, 2007).
Unsupervised learning (Ilin, 2012) is a technique in which the
algorithm uses only the predictor attribute values. There is no target
attribute value and the learning task is to gain some understanding of
relevant structural patterns in the data. Each row in a data set represents a
point in n-dimensional space and unsupervised learning algorithms
investigate the relationship between these various points in n-
dimensional space. Examples of unsupervised learning are clustering,
density estimation and feature extraction.
Text collections contain millions of unique terms, which makes
the text mining process difficult. Therefore, feature extraction is used
when applying machine learning methods. A feature is a combination of
attributes (keywords), which captures important characteristics of the
data. A feature extraction method creates a new set of features far
smaller than the number of original attributes by decomposing the
original data. Therefore it enhances the speed of supervised learning.
Zha et al (2001) combined k-means with spectral analysis, and
Kotsiantis et al (2004) extended and improved the k-means algorithm.
Spatial mining was carried out by Ng and Han
(1994). Jain et al (1999) improved this concept by introducing
Principal Component Analysis, which has been adopted for analysis
in image processing.
Unsupervised algorithms like Principal Components Analysis
(PCA), singular value decomposition, and Nonnegative Matrix
Factorization (NMF) involve factoring the document-word matrix, based
on different constraints for feature extraction (Ghosh et al, 2011).
Nonnegative matrix factorization is a new unsupervised algorithm for
efficient feature extraction from text documents. It decomposes a text
data matrix into a user-defined number of features, giving a reduced
representation of the original text data.
Each document of a text collection can be represented as a
linear combination of basis text document vectors or “feature” vectors. A
document ‘Doc1’ (the first column of the matrix) can be constructed as a
linear combination of the basis vectors ‘W1’, ‘W2’, … ‘Wk’, with the
corresponding coefficients ‘h11’, ‘h21’, … ‘hk1’ from the k × n matrix
‘H’. Thus, once the model is built and the feature vectors are
constructed, any document can be represented in terms of ‘k’
coefficients, resulting in a reduced dimensionality (Yan et al, 2011)
from ‘m’ to ‘k’. In this example, document ‘Doc1’ is a linear combination
of the feature vectors ‘W1’, ‘W2’, ‘W3’, … ‘W10’ and their corresponding
weights.
The NMF decomposition is non-unique; the matrices ‘W’ and
‘H’ depend on the NMF algorithm employed and the error measure used
to check convergence. Some of the NMF algorithm types are the
multiplicative update algorithm, the gradient descent algorithm and the
alternating least squares algorithm. The NMF algorithm iteratively
updates the factorization based on a given objective function. The
general objective function is to minimize the Euclidean distance between
each column of the matrix and its approximation. Xu and Wunsch (2010)
proved that the above update rules achieve monotonic convergence.
Clearly, the accuracy of the approximation depends on the
value of ‘k’, which is the number of feature vectors. In this work, ‘k’ is
user defined. A systematic study has been carried out to investigate the
influence of k on the accuracy of the model.
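The iterative factorization can be sketched in a few lines. The sketch below assumes the widely used multiplicative update rules for the Euclidean objective (the text does not reproduce the exact rules it refers to), so it is an illustration rather than the implementation used in this work. The function names, the toy 4 × 3 matrix and the iteration count are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Euclidean distance between V and its approximation W*H."""
    WH = matmul(W, H)
    return sum((v - a) ** 2 for rv, ra in zip(V, WH) for v, a in zip(rv, ra))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor a nonnegative m x n matrix V into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

# Toy term-document matrix (assumed data): 4 terms (rows) x 3 documents (columns).
V = [[2, 0, 1], [1, 0, 1], [0, 3, 1], [0, 2, 0]]
W, H, errors = nmf(V, k=2)
```

Each column of `H` holds the ‘k’ coefficients of one document, and the `errors` list makes it easy to compare the approximation quality for different values of ‘k’.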
In text documents, two important aspects are term weight and
similarity measure (Zhang et al, 2012). In text mining each document is
represented as a vector in which each word is a dimension and each
element reflects the frequency of a term in the document. Each word in a
document has a weight. These weights can be of two types: local and
global. If local weights are used, term weights are normally expressed
as term frequencies (TF).
If global weights are used, the Inverse Document Frequency
(IDF) value gives the weight of a term. Better term weighting is possible
by multiplying the ‘tf’ values by the ‘IDF’ values, thereby
considering both local and global information. The total weight of a
term is therefore ‘tf * IDF’. This is commonly referred to as ‘tf * IDF’
weighting.
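The weighting scheme can be illustrated with a short sketch. It assumes raw counts as the local weight and log(N / df) as the global weight, which is one common choice among several IDF variants. The function name `tf_idf` and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Global information: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local information: raw term frequency
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Assumed sample collection.
docs = ["text mining finds patterns in text",
        "clustering groups similar documents",
        "text clustering groups text documents"]
weights = tf_idf(docs)
```

A term that occurs in every document receives an IDF of zero under this variant, which is why frequent but uninformative words contribute little to the document vector.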
Unlike previous document clustering methods based
on latent semantic indexing or NMF, Locality Preserving Indexing
(LPI), studied by Agrafiotis and Xu (2002) and Cai et al (2005), tries
to discover both the geometric and the discriminating structures of the
document space. In the LPI,
information retrieval is provided using a rough set filtering
method based on a support vector machine.
This was further modified by Cai et al. (2011), in which the
authors used NMF for text categorization. NMF can only be performed
in the original feature space of the data points, and it gives acceptable
results compared with existing systems.
In LPI, the documents can be projected into a lower
dimensional semantic space in which the documents related to the same
semantics are close to each other. Cai et al (2011) further modified LPI
as Locally Consistent Concept Factorization (LCCF) by using the graph
Laplacian to smooth the document-to-concept mapping. The LCCF can
extract concepts with respect to the intrinsic manifold structure and thus
documents associated with the same concept can be well clustered. These
modifications aim to improve the performance of the algorithm, which is
limited by the large number of epochs and repeated iterations.
The divide-and-merge methodology (Cheng et al., 2006) and the metric
learning model (Lebanon, 2006) proposed in the literature have
performance limitations due to more epochs and repeated iterations. The
divide-and-merge methodology clusters a set of objects by
combining a top-down “divide” phase with a bottom-up “merge” phase. In
contrast, previous algorithms use either top-down or bottom-up methods
to construct a hierarchical clustering, or produce a flat clustering using
local search (e.g., k-means). Divide and merge is used by many
researchers; Cheng et al (2006) proposed a spectral algorithm for
the divide phase. Sentiment analysis, or opinion mining, aims to use
automated tools to detect subjective information such as opinions,
attitudes, and feelings expressed in text.
If two documents describe similar topics, employing nearly the
same keywords, these texts are similar and their similarity measure
should be high. Usually the dot product represents the similarity of the
documents. To normalize the dot product, it can be divided by the
product of the Euclidean lengths (norms) of the two document vectors
(He et al, 2011). This ratio is the cosine of the angle between the
vectors, with values between ‘0’ and ‘1’ for non-negative term weights.
This is called cosine similarity.
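As a small worked illustration, the normalized dot product can be computed as follows. The function name and the three toy term-frequency vectors are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

# Assumed term-frequency vectors over the same four-word vocabulary.
doc_a = [2, 1, 0, 1]
doc_b = [4, 2, 0, 2]   # same keywords as doc_a, twice as long
doc_c = [0, 0, 3, 0]   # shares no keyword with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because of the normalization, `doc_b` is treated as maximally similar to `doc_a` even though it is longer, while `doc_c` has similarity zero.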
Soft margin classification - If the training set is linearly
separable, the classification is called hard margin classification. If
the training set is not linearly separable, slack variables ‘ξi’, with
ξi ≥ 0, i = 1 … n, can be added to allow some misclassification of
difficult or noisy examples. This procedure is called soft margin
classification (Wang et al, 2012).
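To make the role of the slack variables concrete, the sketch below evaluates them for a fixed hyperplane using the standard relation ξi = max(0, 1 − yi(w·xi + b)) with labels in {−1, +1}. The hyperplane `w`, `b` and the four sample points are hypothetical values chosen by hand, not the result of training.

```python
def slack_variables(X, y, w, b):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)) for labels y_i in {-1, +1}."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks

# Assumed data: the last point lies on the wrong side of the hyperplane.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [2.5, 1.0]]
y = [+1, +1, -1, -1]
w, b = [1.0, 1.0], -3.0   # hypothetical separating hyperplane
slacks = slack_variables(X, y, w, b)
```

Points on or outside the margin have zero slack. A value above 1, as for the last point, indicates a misclassified example that the soft margin formulation tolerates at a cost.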
Non-linear classifiers (Charu et al, 2012) - The slack variable
approach is not a very efficient technique for classifying non-separable
classes in input space. In this case soft margin classification is not
applicable because the data is not linearly separable. Non-linear
classifiers require a feature map ‘Φ’, which is a function that maps the
input data patterns into a higher dimensional space. For example, a
two-dimensional input space may show two non-separable classes as
circles and triangles.
The input data space is then mapped to a three-dimensional
feature space using a feature map ‘Φ’. In the feature space,
a support vector machine can find a linear classifier that separates these
classes easily by a hyperplane. For data of ‘100’ dimensions, there are
about 5000 second order features, so the feature map approach inflates
the input representation. It is not scalable unless a small subset of
features is used. The explicit computation of the feature map ‘Φ’ can be
avoided if the learning algorithm depends only on inner products; the
Support Vector Machine (SVM) decision function (Mu et al, 2012) is
always expressed in terms of dot products.
Kernel functions - Kernel functions are used for mapping the
input space to a feature space instead of an explicit feature map ‘Φ’,
provided the operations on the data are always dot products (Wu, 2012).
In this way the complexity of calculating ‘Φ’ can be reduced. The main optimization
function of SVM can be re-written in the dual form where data appears
only as inner product between data points. Kernel, ‘K’ is a function that
returns the inner product of two data points. Computing kernel, ‘K’ is
equivalent to mapping data patterns into a higher dimensional space and
then taking the dot product there.
Using this kernel approach, SVM exploits information about
the inner products between data points in the feature space. Kernels map
data points into a feature space where they are more likely to be linearly
separable. The kernel technique is therefore a better approach for
classifying non-separable classes. SVM performs a nonlinear mapping of
the input vector from the input space into a higher dimensional Hilbert
space, where the mapping is determined by the kernel function. Two
typical kernel functions are: 1) the polynomial kernel, where ‘d’ is the
degree and ‘C’ is a constant; 2) the Gaussian kernel, where ‘σ’ is the
bandwidth of a Gaussian curve.
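The two kernels can be sketched as below. Their formulas are not reproduced in the text, so the commonly used forms K(x, z) = (x·z + C)^d and K(x, z) = exp(−‖x − z‖² / (2σ²)) are assumed here. The explicit map `phi_degree2` is a hypothetical helper, valid only for two-dimensional inputs and d = 2, included to show that the kernel value equals a dot product in the higher dimensional space.

```python
import math

def polynomial_kernel(x, z, d=2, c=1.0):
    """Assumed form: (x . z + c) ** d."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Assumed form: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_degree2(x, c=1.0):
    """Explicit feature map matching polynomial_kernel for 2-D inputs, d = 2."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2,
            math.sqrt(2.0 * c) * x1, math.sqrt(2.0 * c) * x2, c]

x, z = [1.0, 2.0], [3.0, 0.5]
# Dot product computed after mapping both points into the 6-D feature space.
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(z)))
# Same quantity computed directly in the 2-D input space.
implicit = polynomial_kernel(x, z)
```

Both routes give the same number, but the kernel never builds the six-dimensional vectors, which is the saving the kernel approach relies on as the input dimension grows.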
Many methods for local optimization are based on the notion
of a direction of local descent at a given point. A local improvement of
the point in hand can be made using this direction. As a rule, modern
methods for global optimization do not use directions of global descent
for global improvement of the point in hand. From this point of view, the
Global OPtimization (GOP) algorithm based on a dynamical systems
approach is an unusual method. Ali and Babak (2010) proposed a hybrid
GOP whose structure is similar to that used in local
optimization: a new iterate is obtained as an improvement on the
previous one along a certain direction. In contrast with local methods,
this is a direction of global descent, and for more diversification it
is combined with Tabu search.
Multi-class and Multi-target problems - Text classification is
usually a multi-target problem. Each document can be in multiple
categories, exactly one category or no category. Examples of multi-target
problems in medical diagnosis are that a disease may belong to multiple
categories and a gene can have multiple functions. A multi-target
problem is the same as building K independent binary problems, where
K is the number of targets.
Each binary problem sets the rows belonging to its target to one
value and all the other rows to the opposite class. In a multi-target case a
document can belong to more than one class with high probability. For
example, suppose that a given document can belong to any of ‘4’ classes:
Circle, Square, Triangle and Diamond. In this case, ‘4’ independent
binary problems are needed. After the models are built, when a new
document arrives, the mining process uses its ‘4’ binary models and
determines that the document belongs to one or more of the ‘4’ classes.
A document, in a multi-target problem (Wang, et al, 2011),
belongs to more than one class. If a document belongs only to a single
class, it would be a multi-class problem. Each binary problem is built
using all the data.
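The decomposition into K independent binary problems can be sketched as follows, using the four classes of the example. The function name `binary_targets` and the four sample label sets are assumptions. The sketch only builds the target columns and does not train any model.

```python
CLASSES = ["Circle", "Square", "Triangle", "Diamond"]

def binary_targets(doc_labels, classes=CLASSES):
    """Build one 0/1 target column per class from each document's label set."""
    return {c: [1 if c in labels else 0 for labels in doc_labels]
            for c in classes}

# Assumed label sets: one class, two classes, no class, two classes.
doc_labels = [{"Circle"}, {"Square", "Diamond"}, set(), {"Circle", "Triangle"}]
targets = binary_targets(doc_labels)
```

Every column covers all four documents, which reflects the statement that each binary problem is built using all the data.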
2.2 REVIEWS ON DATA CLUSTERING
Clustering, or cluster analysis, is a set of methodologies
for the classification of samples into a number of groups, such that
samples in one group are similar and samples belonging to different
groups are dissimilar. The input of clustering is a set of
samples, and the process of clustering measures the similarity and/or
dissimilarity between the given samples. The output of the clustering is a
number of groups or clusters, presented in the form of graphs (Scarselli
et al 2009), histograms or ordinary computer output showing group
numbers, as in Figure 2.1.
Clustering is a well-established technique for data
interpretation. It usually requires prior information, e.g., about the
statistical distribution of the data or the number of clusters to detect.
Clustering attempts to identify natural clusters in a data set. It does this
by partitioning the entities in the data such that each partition consists of
entities that are close (or similar), according to some distance (similarity)
function based on entity attributes (Luhr and Lazarescu, 2009).
Conversely, entities in different partitions are relatively far apart
(dissimilar).
Existing clustering algorithms such as K-means, Partitioning
Around Medoids (PAM), Clustering Large Applications based on
RANdomized Search (CLARANS) and Density Based Spatial Clustering of
Applications with Noise (DBSCAN) are designed to find clusters
that fit some static models. For example, K-means, PAM and CLARANS
assume that clusters are hyper-ellipsoidal or hyper-spherical and are of
similar sizes. The DBSCAN assumes that all points of a cluster are
density reachable and points belonging to different clusters are not.
However, all these algorithms can break down if the choice of
parameters in the static model is incorrect with respect to the data set
being clustered, or the model does not capture the characteristics of the
clusters (e.g., size or shape). Because the objective is to discern structure
in the data, the results of a clustering are then examined by a domain
expert to see if the groups suggest something.
For example, crop production data from an agricultural region
may be clustered according to various combinations of factors, including
soil type, cumulative rainfall, average low temperature, solar radiation,
availability of irrigation, strain of seed used and type of fertilizer applied.
Interpretation by a domain expert is needed to determine whether a
discerned pattern, such as a propensity for high yields to be associated
with heavy applications of fertilizer, is meaningful, because other factors
may actually be responsible (e.g., if the fertilizer is water soluble and
rainfall has been heavy).
Figure 2.1: Cluster analysis process. (a) Initial data (b) Output in three
clusters (c) Output in four clusters
Many clustering algorithms that work well with traditional
data deteriorate when executed on geospatial data (which often are
characterized by a high number of attributes or dimensions), resulting in
increased running times or poor-quality clusters. For this reason, recent
research has centered on the development of clustering methods for
large, highly dimensioned data sets, particularly techniques that execute
in linear time as a function of input size or that require only one or two
passes through the data. Recently developed spatial clustering methods
that seem particularly appropriate for geospatial data include
partitioning, hierarchical, density based, grid based and cluster based
analysis.
Hierarchical methods build clusters through top-down (by
splitting) or bottom-up (through aggregation) methods. Density based
methods define clusters as regions of space with a relatively large
number of spatial objects; unlike other methods, these can find
arbitrarily-shaped clusters. Grid based methods divide space into a raster
tessellation and cluster objects based on this structure. Model based
methods find the best fit of the data relative to specific functional
forms. Constraint based methods can capture spatial restrictions on
clusters or the relationships that define these clusters.
An input to a cluster analysis can be described as an ordered
pair (X, s), or (X, d), where ‘X’ is a set of descriptions of samples and ‘s’
and ‘d’ are measures of similarity or dissimilarity (distance) between
samples, respectively. The output from the
clustering system is a partition A = {G1, G2, …, GN}, where each Gk,
k = 1, …, N, is a crisp subset of ‘X’ such that equations (2.1) and (2.2)
hold:
G1 ∪ G2 ∪ … ∪ GN = X (2.1)
Gi ∩ Gj = Ø, i ≠ j (2.2)
G1, G2, …, GN are the clusters.
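A short check of the two partition conditions may help. The sketch reads (2.1) as "the clusters cover X" and (2.2) as "the clusters are pairwise disjoint", which is the usual meaning of a crisp partition. The function name and the sample document identifiers are hypothetical.

```python
def is_crisp_partition(X, groups):
    """True if the groups cover X (2.1) and are pairwise disjoint (2.2)."""
    union = set().union(*groups)
    covers = union == set(X)
    # Pairwise disjoint exactly when no element is counted in two groups.
    disjoint = sum(len(set(g)) for g in groups) == len(union)
    return covers and disjoint

# Assumed sample identifiers.
X = ["d1", "d2", "d3", "d4", "d5"]
good = [{"d1", "d2"}, {"d3"}, {"d4", "d5"}]
overlapping = [{"d1", "d2"}, {"d2", "d3"}, {"d4", "d5"}]   # d2 appears twice
incomplete = [{"d1", "d2"}, {"d4", "d5"}]                  # d3 is missing
```

Only `good` satisfies both conditions. `overlapping` violates disjointness and `incomplete` violates coverage.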
Most clustering algorithms are based on the following four
popular approaches:
(1) Partitioning methods
(2) Hierarchical clustering
(3) Iterative square-error partitioned clustering
(4) Density based clustering
• Partitioning methods: Given a database of ‘n’ objects or data
tuples, a partitioning method constructs ‘k’ (k ≤ n) partitions of the
data, where each partition represents a cluster. That is, it
classifies the data into ‘k’ groups, which together satisfy the
following requirements:
• Each group must contain at least one object
• Each object must belong to exactly one group
Notice that the second requirement can be relaxed in some
fuzzy partitioning techniques (Tang et al, 2010). Such a
partitioning method creates an initial partitioning. It then uses
an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
Representative algorithms include k-means, k-medoids and
CLARANS.
• Hierarchical clustering methods: Hierarchical techniques
organize data (Saha et al, 2010) in a nested sequence of groups,
which can be displayed in the form of a dendrogram or a tree
structure. A hierarchical method creates a hierarchical
decomposition of a given set of data objects. Hierarchical
methods can be classified as agglomerative (bottom-up) or
divisive (top-down), based on how the hierarchical
decomposition is formed. Agglomerative nesting and divisive
analysis are examples of agglomerative and divisive methods,
respectively.
• Iterative square-error partitioned clustering methods:
Square-error partitioned algorithms attempt to obtain that
partition which minimizes the within-cluster scatter or
maximizes the between-cluster scatter (Li et al, 2008). These
methods are nonhierarchical because all resulting clusters are
groups of samples at the same level of partition. To guarantee
that an optimum solution has been obtained, one has to
examine all possible partitions of ‘N’ samples of ‘n’ dimensions
into K clusters (for a given K), but that process is not
computationally feasible.
• Density based clustering methods: Most partitioning methods
cluster objects based on the distance between objects (Guha et
al, 2001). Such methods can find only spherical-shaped clusters
and encounter difficulty in discovering clusters of arbitrary
shape. Other clustering methods have been developed based on
the notion of density. Their general idea is to continue growing
a given cluster as long as the density (the number of objects or
data points) in the “neighbourhood” exceeds a threshold. Such
a method is able to filter out noise (outliers) and discover
clusters of arbitrary shape. Representative algorithms include
DBSCAN, OPTICS and density based clustering (Kantere et al,
2009).
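The difference between iterative relocation and density based growth can be seen on a small example. The sketch below gives a minimal k-means and a minimal DBSCAN in plain Python. Both are simplified illustrations rather than tuned implementations, and the eight sample points, the initial centres, `eps` and `min_pts` are assumed values.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iters=100):
    """Iterative relocation: reassign objects, then recompute group centres."""
    centroids = list(centroids)
    assignment = None
    for _ in range(iters):
        new_assignment = [min(range(len(centroids)),
                              key=lambda j: sq_dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:   # no object moved: stop
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # an emptied group keeps its old centre
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sq_dist(points[i], q) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # j is a core point: keep growing
                queue.extend(nbrs)
    return labels

# Assumed data: two dense groups and one isolated point at (5, 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
km_labels, km_centres = kmeans(points, centroids=[points[0], points[4]])
db_labels = dbscan(points, eps=1.5, min_pts=3)
```

With these settings k-means must place the isolated point in one of its two groups, whereas DBSCAN labels it as noise because its neighbourhood never reaches the density threshold.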
The above traditional clustering methods give good
results in other data mining applications. In text mining, however, they
perform less well, because the input of text mining differs from that of
other mining applications.
The input of text mining is a collection of strings, which has
complicated characteristics such as polysemy and synonymy. Polysemy
means that one word has multiple meanings, and synonymy means that
multiple words have the same meaning. Therefore, new lines of
research are being pursued to retrieve the meaning of documents
when text mining is carried out.
2.3 ADVANCEMENTS IN TRADITIONAL MINING MODELS
Text mining is a new and ongoing research domain, which
needs efficient clustering methods. In the initial stages of data mining
research, various classifiers using association rules were applied to
knowledge discovery. Most of these classifiers use positive rules as
similarity measures. Kundu et al. (2008) propose negative rules for an
associative classifier. The generation of negative associations from
datasets has been attacked from different perspectives by various authors
and has proved to be a very computationally expensive task. The authors
report that their classifier, termed “Associative Classifier with Negative
rules”, is not only time-efficient but also achieves significantly better
accuracy than four other state-of-the-art classification methods in
experiments on benchmark datasets.
The comparison by Mazid et al. (2009) gives a
detailed study of association rule based mining models, in which
rule based mining (which may be performed through either supervised
or unsupervised learning techniques) is compared with recent
research proposals using predefined test sets. In terms of accuracy and
computational complexity, the authors concluded that Apriori is a better
choice for the rule based mining task.
Later in 2009, a hybrid mining model was proposed for
classification, for example the concept classification proposed by Brown
and Forouraghi (2009) and Rahman et al. (2010). As already noted,
Apriori is a well-known algorithm which is used extensively in
market-basket analysis and data mining. The algorithm is used for learning
association rules from transactional databases and is based on simple
counting procedures. In the hybrid model, Apriori is further improved by
the C4.5 decision tree and k-means clustering algorithms, respectively.
El-far et al. (2011) proposed a k-means classifier for data mining,
applied to three-dimensional data models used to visualize realistic
objects. This study proposes k-means for applications such as medical
simulations, games and virtual reality. There are two major approaches to
searching a database of 3D objects: the search can be done via requests
that are either (1) 3D objects or (2) some 2D views of the 3D
object. The study's contribution is to extract characteristic views of 3D
models using data mining algorithms comprising Apriori, Charm, Close+
and the extraction of association rules. The work was tested using a
database that contains 120 3D models selected from the Princeton Shape
Benchmark, with 342 2D views.
An advancement of DBSCAN (Chen and Chen, 2012) defines
an event as a significant theme development that continues for a period
of time. In general, these events are temporally disjoint and, taken
together, form the message of the topic. Moreover, events in
different themes may be associated because of their temporal proximity
and context similarity. The authors propose a model to identify the themes
and the events in a given document, together with the associated events.
Recent developments in conceptual text mining include
string mining that concentrates on low memory usage (Dhaliwal et al.,
2012) and a text deduction methodology (Chenghua et al, 2012), which
proposes a novel probabilistic modelling framework called the Joint
Sentiment-Topic model, based on Latent Dirichlet Allocation, that detects
sentiment and topic simultaneously from text. These are recommended
examples of recent research.
Changes in the coordinates of text documents are a
major critical issue in text mining. Handling such changes in
coordinates has therefore attracted researchers, for example Wright and
Grothendieck (2012). Document classification on time series data is also
a frequent application. Iwata et al (2012) proposed sequential modelling
for multiple time series databases.
For a detailed survey of text mining, the survey of evolutionary
algorithms by Barros et al (2012) and “Survey of Twenty Years of
Mixture of Experts” by Yuksel (2012) are recommended.
2.4 SURVEY ON ARTIFICIAL NEURAL NETWORK (ANN) BASED LEARNING MODEL
The performance of an ANN depends on its architecture
(Franco et al 2009), training and learning methods (Dam et al
2008; Pavel et al 2010), pre-processing methods (Rudy et al 2008), and
the ratio of training to testing data.
ANN is a Self-Organized, Distributed, and Adaptive Rule
based Induction System (Folino et al, 2009). Prediction model based
implementations in various engineering domains include predicting system
imbalance volumes (Maria and Daniel 2006), predicting business failure
(Li et al 2010), time series forecasting (Khashei et al 2010), speech
recognition (Dede et al 2010, Gulin and Murat 2010), predicting short
term wind power (Andrew and Wenyan 2010), optimization of energy
consumption (Andrew et al 2010), face recognition (Sheryl and Loris
2010), Web-services classification (Ramakanta et al 2010), mutual fund
performance evaluation (Kehluh and Szuwai 2010), performance
evaluation of cognitive radio systems (Katidiotis et al 2010) and
assessing the vulnerability of a power system (Ahmed et al 2010).
Jolai and Ghanbari (2010) presented an improved ANN
approach for solving the travelling salesman problem. Hopfield neural
networks (HNN) and data transformation techniques are
employed together to improve the accuracy of the results and reach
optimal tours with less total distance. To get an optimal result, Z-score and
logarithmic approaches are integrated with HNN. These powerful unified
methods have recently culminated with the HNN method. It is innovative
across various scientific and engineering fields. For example, Huang and
Liu (1997) employed HNN and genetic algorithms together for the
purpose of pattern recognition.
Yen (2009) employed the same tools for identifying
probability density functions. The author has applied HNN for motion
planning. Wang and Zhou (2009) employed the stochastic optimal
competitive HNN to solve clustering problem. The advantages of HNN
are as follows:
• First, HNN is capable of solving both continuous and
combinatorial problems, although some conventional
methods are capable of this as well.
• Second, it is a parallel-processing version of the gradient
method and thus can be more powerful than most previous
methods.
HNN has a few drawbacks. One of the most concerning
is that it sometimes finds locally minimum solutions instead
of globally minimum solutions.
Jasna and Vesna (2010) use the Feed Forward Neural Network
(FFNN). The authors propose ‘z’ score scaling as the pre-processing
method and a 70:30 ratio of training to testing data. The
authors note that increasing the prediction accuracy of wind power to
be produced at future time periods is often bounded by the prediction
model complexity and the computational time involved. A trade-off between
these two conflicting objectives is addressed in the report. First, a
set of the most relevant parameters, i.e. predictors, is selected using the
underlying physics and the patterns immersed in the data. Second, the most
promising clustering scenario is applied to produce a model for each
clustered subspace.
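The pre-processing step mentioned above can be illustrated as follows. This is a sketch under stated assumptions: the ten sample values are invented, the split is a random shuffle, and the scaling statistics are estimated on the training part only and then reused for the held-out part (a common practice, not something the cited study is known to prescribe).

```python
import math
import random

def split_70_30(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(train_fraction * len(rows))
    return rows[:cut], rows[cut:]

def fit_z_score(values):
    """Mean and (population) standard deviation of the training values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def apply_z_score(values, mean, std):
    return [(v - mean) / std for v in values]

# Assumed sample measurements.
power = [3.0, 5.0, 7.0, 9.0, 11.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, held_out = split_70_30(power)
mean, std = fit_z_score(train)
train_scaled = apply_z_score(train, mean, std)
held_out_scaled = apply_z_score(held_out, mean, std)
```

After scaling, the training values have zero mean and unit variance, so inputs measured in different units contribute on a comparable scale.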
Kehluh and Szuwei (2010) designed a Fast Adaptive Neural
Network Classifier (FANNC). FANNC is a newly developed model
which combines features of adaptive resonance theory and field theory.
The results show that the FANNC approach requires much less time
than the Back Propagation Neural Network (BPNN) approach to evaluate
mutual fund performance, and that FANNC is also superior in terms of
Root Mean Square (RMS).
Gulin and Murat (2010) developed three different neural
network models, which are Multilayer Back Propagation, Elman Neural
Networks (ENN) and Probabilistic Neural Networks (PNN). The
developed models are applied to speech recognition. The speech
recognition problem is a branch of pattern recognition. Some popular
techniques to tackle this problem are artificial neural networks, dynamic
time warping, and hidden Markov modelling.
A recent study on isolated Malay digit recognition reports that
dynamic time warping and hidden Markov modelling techniques have
recognition rates of 80.5% and 90.7%, respectively. Meanwhile,
recognition rates obtained by neural networks for similar applications,
as in this study, are often above these figures. For this reason, ANN
appears to be a convenient classifier for the speech recognition problem.
The ENN is a type of recurrent neural network built on a
two-layer BPNN. Unlike other BPNNs, it has a feedback loop from
the output of the first hidden layer to that layer’s input. The
ENN topology designed for this application has the following
parameters: hidden layer 1 has 40 neurons and hidden layer 2 has
30 neurons. The two hidden layers use hyperbolic tangent and linear
activation functions, respectively, and the output layer uses the
logarithmic sigmoid activation function.
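The ENN topology described above can be sketched as a forward pass. Only the layer sizes (40 and 30), the activation functions and the feedback loop come from the text; the input size of 12 features, the 10 outputs, the random placeholder weights and the omission of bias terms are assumptions made for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weights; placeholders, not trained values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))


class ElmanSketch:
    def __init__(self, n_in, n_h1=40, n_h2=30, n_out=10, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(n_h1, n_in, rng)    # input -> hidden layer 1
        self.w_ctx = make_matrix(n_h1, n_h1, rng)   # previous hidden 1 -> hidden 1
        self.w_h2 = make_matrix(n_h2, n_h1, rng)    # hidden 1 -> hidden 2
        self.w_out = make_matrix(n_out, n_h2, rng)  # hidden 2 -> output
        self.context = [0.0] * n_h1                 # stored hidden-1 output

    def step(self, frame):
        """Process one input frame and update the feedback (context) state."""
        from_input = matvec(self.w_in, frame)
        from_context = matvec(self.w_ctx, self.context)
        h1 = [math.tanh(a + b) for a, b in zip(from_input, from_context)]
        self.context = h1                            # feedback loop
        h2 = matvec(self.w_h2, h1)                   # linear activation
        return [logsig(v) for v in matvec(self.w_out, h2)]


net = ElmanSketch(n_in=12)
frame = [0.5] * 12          # hypothetical feature frame
first = net.step(frame)
second = net.step(frame)    # same input, different context
print(len(first), first != second)
```

Feeding the same frame twice gives different outputs because the second step also sees the stored output of the first hidden layer, which is what distinguishes this sketch from a plain feed-forward pass.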
PNN is a network topology that makes use of the probability
distribution function for the calculation of network connection weights.
In the first hidden layer, the distance from the input data to the
training data is calculated, and in the second hidden layer, these
calculated distances are summed up, producing the resultant output
vector. Thus, model classes
are obtained. In the output layer, the output of the network is determined
as the most probable model class.
The design process for PNN differs from that of the other two
network topologies in terms of training, because in PNN the weights for
the input–output matches fed to the network are altered by a distribution
constant. The PNN topology designed for this application has three
parameters: distribution constant: 0.1, hidden layer 1: 310
neurons, hidden layer 2: 10 neurons.
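The three PNN stages described above (distances to the training data, summation per class, selection of the most probable class) can be sketched as follows. The use of a Gaussian kernel, the interpretation of the 0.1 constant as the kernel width `sigma`, and the toy training points are assumptions; the layer sizes quoted above would be consistent with one first-layer unit per training sample and one second-layer unit per class, but the source does not state this.

```python
import math


def pnn_classify(x, train_x, train_y, sigma=0.1):
    """Return (predicted label, per-class scores) for input vector x."""
    # First hidden layer: one unit per training sample, Gaussian of the
    # squared distance between the input and that sample.
    activations = []
    for sample in train_x:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, sample))
        activations.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))

    # Second hidden layer: one unit per class, averaging the activations
    # of the training samples that belong to that class.
    totals, counts = {}, {}
    for act, label in zip(activations, train_y):
        totals[label] = totals.get(label, 0.0) + act
        counts[label] = counts.get(label, 0) + 1
    scores = {label: totals[label] / counts[label] for label in totals}

    # Output layer: the most probable class.
    return max(scores, key=scores.get), scores


# Hypothetical two-class training set, invented for illustration only.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
train_y = ["zero", "zero", "one", "one"]

label_a, _ = pnn_classify([0.05, 0.05], train_x, train_y)
label_b, _ = pnn_classify([0.95, 0.95], train_x, train_y)
print(label_a, label_b)
```

No iterative training takes place in this sketch: the training samples themselves act as the first-layer weights, and only the width constant has to be chosen.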
The main objective of Ramakanta et al (2010) is to develop
various classification models based on intelligent techniques, namely
BPNN, PNN, and Lease Vector Machine, to predict the quality of a web
service based on a number of QoS attributes. These models are
developed from past data comprising the QoS attributes as
explanatory variables and the quality of web services as the dependent
variable. Since each QoS attribute defines a different dimension of
the quality of web services, and since the attributes collectively
influence that quality, the authors assume that these QoS attributes are
non-linearly related to the quality of web services and approximate this
nonlinear relationship with the help of several intelligent techniques.
Li et al (2010) developed a model for predicting business
failures. Several of the top 10 data mining methods have become very
popular alternatives in business failure prediction, e.g., support vector
machine and k nearest neighbour. In comparison with the other
classification mining methods, the advantages of classification and
regression tree methods include simplicity of results, easy
implementation, nonlinear estimation, being non-parametric, accuracy
and stability.
In Andrew et al (2010), the model is solved with a particle swarm
optimization algorithm. Parameter selection is performed to
eliminate uncontrollable parameters of less importance. An appropriate
parameter selection and dimensionality reduction can improve the
comprehensibility, scalability, and, possibly, accuracy of the resulting
models. The boosting tree algorithm and a wrapper are used to perform the
parameter selection on the data set. In that report, the
boosting tree algorithm shares the advantages of decision tree
induction and tends to be robust in the removal of irrelevant parameters.
In the boosting tree algorithm, a split at every node of any regression tree
is based on certain criteria, e.g., minimization of the total regression
error.
In the process of generating successive trees, the statistical
importance of each variable at each split of every tree is accumulated and
normalized. Predictors with a higher importance rank indicate a larger
contribution to the predicted output parameter. Wrappers are also
commonly used methods to reduce the dimensionality of the variable
space.
In the wrapper approach, a specific search algorithm searches the
space of all possible variables and evaluates each subset of variables
after building a model based on this subset. Considering the expensive
computational cost, pace regression is used as the evaluator, and a
genetic algorithm is used as the search algorithm. The population size is
set at 20, the maximum number of iterations is 20, the crossover
probability is 0.6, and the mutation probability is 0.033.
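The wrapper search described above can be sketched with a small genetic algorithm. Only the population size (20), the number of iterations (20), the crossover probability (0.6) and the mutation probability (0.033) come from the text. Tournament selection, single-point crossover, elitism and the toy scoring function are assumptions; in particular, `toy_score` is a hypothetical stand-in for the pace regression evaluator, which is not implemented here.

```python
import random


def ga_select(n_vars, score, pop_size=20, n_iter=20,
              p_cross=0.6, p_mut=0.033, seed=0):
    """Search variable subsets (boolean masks) that maximise score(mask)."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=score)
    history = [score(best)]

    def pick():
        # Tournament of two: keep the better of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if score(a) >= score(b) else b

    for _ in range(n_iter):
        nxt = [list(best)]                       # elitism: keep the best mask
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < p_cross:           # single-point crossover
                cut = rng.randrange(1, n_vars)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            # Flip each bit with the mutation probability.
            child = [(not g) if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=score)
        history.append(score(best))
    return best, history


RELEVANT = {0, 2, 5}   # hypothetical informative variables out of 8


def toy_score(mask):
    """Stand-in evaluator: reward relevant variables, penalise the rest."""
    chosen = {i for i, keep in enumerate(mask) if keep}
    return len(chosen & RELEVANT) - 0.5 * len(chosen - RELEVANT)


best, history = ga_select(8, toy_score)
print(best, history[-1])
```

In a real wrapper, `toy_score` would be replaced by a function that builds a model on the selected variables and returns its validation performance, which is why the evaluator cost dominates the run time.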
In Sheryl and Loris (2010), a control system based on double
neural networks for a parallel mechanism is presented with the objective
of nonlinear modelling and control. The control system is
composed of one Neural Network Controller for compensating the
nonlinear modelling and one Neural Network Identification for the
controlling model. Simulation results have shown that the response time,
movement accuracy and resistance to load disturbance of the parallel
mechanism system can be improved using the double neural networks.
In the proposed domain, i.e. computer networks, the traffic flow
(Ya Gao and Shiliang 2010) is highly dynamic in nature; therefore,
congestion control algorithms have some limitations, as they rely on
a mathematical model. As an alternative, a prediction model based on
routing is proposed, which in turn predicts a traffic-free path using ANN.
The performance of ANN is subject to some requirements and
limitations, such as the choice of the optimal number of hidden layers.
If the number of hidden layers is increased, the accuracy of the system
increases but the system converges more slowly, and vice versa.