CHAPTER 2

LITERATURE SURVEY

2.1 SURVEY ON TEXT MINING

Eighty percent of the information in the world is currently

stored in unstructured textual format (Kalogeratos and Likas, 2011).

Although techniques such as Natural Language Processing (NLP) can

accomplish limited text analysis, there are currently no computer

programs available to analyse and interpret text for diverse information

extraction needs. Therefore text mining is a dynamic and emerging area.

The world is fast becoming information intensive, and specialized
information is being collected into very large data sets. One example is
the extraction of information from Chinese handwritten documents (Koo and
Cho, 2012).

For example, the Internet contains a vast amount of online text

documents, which rapidly change and grow. It is nearly impossible to

manually organize such vast and rapidly evolving data. The necessity to

extract useful and relevant information from such large data sets (Chen et

al, 2010) has led to an important need to develop computationally

efficient text mining algorithms. An example problem is to automatically

assign natural language text documents to predefined sets of categories

based on their content.

Other examples of problems involving large data sets include
searching for targeted information in scientific citation databases such
as those of the Institute of Electrical and Electronics Engineers (IEEE)
and the Association for Computing Machinery (ACM) and Elsevier's Scopus
(SCOPUS), searching, filtering and categorizing web pages by topic
(Dimopoulos et al., 2010), and routing relevant email to the appropriate
addresses. A particular problem of interest here is that of classifying
documents into a set of user defined categories based on the content. As
the document size increases, the dimension of the hyperspace in which text
classification is done becomes enormous, resulting in high computational
cost (Luo et al, 2009).

However, the dimensionality can be reduced through feature
extraction algorithms. In topic summarization (Forestier et al, 2010), the
summaries are reported to be superior, in terms of content coverage,
coherence, and consistency, to those derived from existing summarization
methods when evaluated against human-composed reference summaries.

Text mining is the automatic and semi-automatic extraction of

implicit, previously unknown, and potentially useful information and

patterns, from a large amount of unstructured textual data, such as

natural-language texts. In text mining, each document is represented as a

vector, whose dimension is approximately the number of distinct

keywords in it, which can be very large. One of the main challenges in

text mining is to classify textual data with such high dimensionality

(Song et al, 2013).

In addition to high dimensionality, text-mining algorithms
should also deal with word ambiguities such as pronouns and synonyms, and
with noisy data, spelling mistakes, abbreviations, acronyms and improperly
structured text. Text mining algorithms are of two types: supervised
learning and unsupervised learning. In addition to supervised and
unsupervised learning, a meta-learning approach is also applied to
optimization (Kordik et al, 2010).

Supervised learning (Zhiding et al, 2010) is a technique in
which the algorithm uses predictor and target attribute value pairs to
learn the relation between the predictor and the target values. The
training data consist of pairs of predictor and target values; each
predictor value is tagged with a target value. If the algorithm predicts a
categorical value for a target attribute, it is called a classification
function. Class is an example of a categorical variable, and positive and
negative can be two values of the categorical variable class. Categorical
values have no inherent ordering. If the algorithm predicts a numerical
value, it is called regression. Numerical values are ordered.

The use of traditional k-means type algorithms is limited to
numeric data. Ahmad and Dey (2007) present a clustering algorithm based on
the k-means paradigm that works well for data with mixed numeric and
categorical features. The authors proposed a new cost function and a
distance measure based on the co-occurrence of values. The measures also
take into account the significance of an attribute towards the clustering
process (Birant and Kut, 2007).

Unsupervised learning (Ilin, 2012) is a technique in which the

algorithm uses only the predictor attribute values. There is no target

attribute value and the learning task is to gain some understanding of

relevant structural patterns in the data. Each row in a data set represents a

point in n-dimensional space and unsupervised learning algorithms

investigate the relationship between these various points in n-

dimensional space. Examples of unsupervised learning are clustering,

density estimation and feature extraction.

Text collections contain millions of unique terms, which makes
the text-mining process difficult. Therefore, feature extraction is used
when applying machine learning methods. A feature is a combination of

attributes (keywords), which captures important characteristics of the

data. A feature extraction method creates a new set of features far

smaller than the number of original attributes by decomposing the

original data. Therefore it enhances the speed of supervised learning.

Zha et al (2001) combined k-means with spectral analysis, and
Kotsiantis et al (2004) extended the k-means algorithm to improve it.
Spatial mining was carried out by Ng and Han (1994). Jain et al (1999)
improved this concept by introducing Principal Component Analysis, and
this has been adopted for analysis in image processing.

Unsupervised algorithms like Principal Components Analysis

(PCA), singular value decomposition, and Nonnegative Matrix

Factorization (NMF) involve factoring the document-word matrix, based

on different constraints for feature extraction (Ghosh et al, 2011).

Nonnegative matrix factorization is a new unsupervised algorithm for
efficient feature extraction of text documents. NMF is a feature
extraction algorithm that decomposes text data by creating a user-defined
number of features, giving a reduced representation of the original text
data. It decomposes the m × n text data matrix into the product of an
m × k non-negative matrix ‘W’ of basis vectors and a k × n non-negative
matrix ‘H’ of coefficients.

Each document of a text collection can be represented as a

linear combination of basis text document vectors or “feature” vectors. A

document, ‘Doc1’ (first column of the matrix) can be constructed as a

linear combination of the basis vectors ‘W1’, ‘W2’ … ‘Wk’, with the

corresponding coefficients ‘h11’, ‘h21’, … ‘hk1’ from matrix Hkn. Thus,

once the model is built and the feature vectors are constructed, any

document can be represented in terms of ‘k’ coefficients; resulting in a

reduced dimensionality (Yan et al, 2011) from ‘m’ to ‘k’. In this example

document ‘Doc1’ is a linear combination of feature vectors ‘W1’, ‘W2’,

‘W3’…’W10’ and its corresponding weights.

The NMF decomposition is non-unique; the matrices ‘W’ and

‘H’ depend on the NMF algorithm employed and the error measure used

to check convergence. Some of the NMF algorithm types are the
multiplicative update algorithm, the gradient descent algorithm and the
alternating least squares algorithm. The NMF algorithm iteratively

updates the factorization based on a given objective function. The

general objective function is to minimize the Euclidean distance between

each column of the matrix and its approximation. Xu and Wunsch (2010)

proved that the above update rules achieve monotonic convergence.
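The factorization described above can be illustrated with a minimal sketch of the multiplicative update algorithm for the Euclidean objective. This is only an illustration under stated assumptions: the function name `nmf_multiplicative`, the small constant `eps`, the random initialization and the toy term-by-document matrix `A` are all hypothetical and are not taken from the surveyed works.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative m x n matrix A into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random non-negative starting point (an assumption; other schemes exist).
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Euclidean (Frobenius) distance between A and its approximation.
        errors.append(np.linalg.norm(A - W @ H))
    return W, H, errors

# Toy matrix: rows are terms, columns are documents (hypothetical data).
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 1, 3],
              [1, 0, 0, 1]], dtype=float)
W, H, errors = nmf_multiplicative(A, k=2)

# 'Doc1' (first column) as a linear combination of the k basis vectors.
doc1_approx = W @ H[:, 0]
```

Here each column of `H` holds the ‘k’ coefficients of one document, so a document of ‘m’ terms is described by ‘k’ numbers once `W` is fixed.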

Clearly, the accuracy of the approximation depends on the

value of ‘k’, which is the number of feature vectors. In this work, ‘k’ is

user defined. A systematic study has been carried out to investigate the

influence of k on the accuracy of the model.

In text documents, two important aspects are Term weight and

Similarity measure (Zhang et al, 2012). In text mining each document is

represented as a vector. Each word is a dimension, and the elements of
the vector reflect the frequency of terms in the document. Each word in a
document has a weight. These weights can be of

two types: Local and global weights. If local weights are used, then term

weights are normally expressed as term frequencies (TF).

If global weights are used, the Inverse Document Frequency
(IDF) value gives the weight of a term. Better term weighting is possible
by multiplying the ‘tf’ values by the ‘IDF’ values, which takes both local
and global information into account. The total weight of a term is
therefore ‘tf * IDF’. This is commonly referred to as ‘tf * IDF’ weighting.
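The weighting scheme can be sketched as follows. The sketch assumes one common IDF variant, log(N / df), where N is the number of documents and df is the number of documents containing the term; the source does not state which variant is intended. The function name `tf_idf`, the whitespace tokenization and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document {term: tf * idf} weights and the idf table."""
    n_docs = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Global weight (assumed variant): idf = log(N / df).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local weight: raw term frequency
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights, idf

docs = [
    "text mining finds patterns in text",
    "clustering groups similar documents",
    "text clustering groups text documents",
]
weights, idf = tf_idf(docs)
```

In this toy collection the term "mining" occurs once in one document, while "text" occurs twice in the first document but appears in two of the three documents, so "mining" receives the larger weight in the first document.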

Unlike previous document clustering methods based on latent semantic
indexing or NMF, Locality Preserving Indexing (LPI), studied by Agrafiotis
and Xu (2002) and Cai et al (2005), tries to discover both the geometric
and the discriminating structures of the document space. In LPI,
information retrieval is provided using a rough set filtering method based
on the support vector machine.

This was further modified by Cai et al. (2011), in which the

authors used NMF for text categorization. NMF can only be performed
in the original feature space of the data points, and it gives acceptable
results compared with existing systems.

In LPI, the documents can be projected into a lower

dimensional semantic space in which the documents related to the same

semantics are close to each other. Cai et al (2011) further modified LPI

as Locally Consistent Concept Factorization (LCCF) by using the graph

Laplacian to smooth the document-to-concept mapping. The LCCF can

extract concepts with respect to the intrinsic manifold structure and thus

documents associated with the same concept can be well clustered. These
modifications are intended to improve the performance of the algorithm,
which is limited by the large number of epochs and repeated iterations.

The divide-and-merge methodology (Cheng et al., 2006) and the metric
learning model (Lebanon, 2006) have been proposed in the literature, but
they have performance limitations due to more epochs and repeated
iterations. The divide-and-merge methodology clusters a set of objects by
combining a top-down “divide” phase with a bottom-up “merge” phase. In
contrast, previous algorithms use either top-down or bottom-up methods
to construct a hierarchical clustering, or produce a flat clustering using
local search (e.g., k-means). Divide and merge is used by many
researchers; Cheng et al (2006) proposed a spectral algorithm for the
divide phase. Sentiment analysis or opinion mining aims to use

automated tools to detect subjective information such as opinions,

attitudes, and feelings expressed in text.

If two documents describe similar topics, employing nearly the
same keywords, these texts are similar and their similarity measure
should be high. Usually the dot product represents the similarity of the
documents. To normalize the dot product, it can be divided by the product
of the Euclidean lengths (norms) of the two document vectors (He et al,
2011). This ratio is the cosine of the angle between the vectors, with
values between ‘0’ and ‘1’ for non-negative term weights. This is called
cosine similarity.
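A minimal sketch of the cosine similarity computation is given below. The function name `cosine_similarity`, the convention of returning 0 for an all-zero vector and the sample term-frequency vectors are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of two document vectors divided by the product of their norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical term-frequency vectors over a three-word vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, only longer
doc_c = [0, 0, 5]   # shares no terms with doc_a

same_topic = cosine_similarity(doc_a, doc_b)
no_overlap = cosine_similarity(doc_a, doc_c)
```

Because of the normalization, a long document and a short document that use the same keywords in the same proportions obtain the maximum similarity.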

Soft margin classification - If the training set is linearly
separable, a separating hyperplane can be found with no training errors;
this is called hard margin classification. If the training set is not
linearly separable, slack variables ‘ξi’ ≥ 0, i = 1 … n, can be added to
allow some misclassification of difficult or noisy examples. This
procedure is called soft margin classification (Wang et al, 2012).
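The role of the slack variables can be illustrated with a short sketch. It assumes the standard soft margin formulation, in which the slack of example i is max(0, 1 - yi (w · xi + b)) and the objective is 0.5 ||w||² plus a penalty C times the sum of the slacks; the hyperplane `w`, `b`, the penalty `C` and the four sample points are hypothetical values chosen for illustration, not a trained model.

```python
import numpy as np

def slack_variables(X, y, w, b):
    """Slack of each example: zero when it lies on the correct side of the margin."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def soft_margin_objective(X, y, w, b, C=1.0):
    """Standard soft margin objective: 0.5 * ||w||^2 + C * sum of slacks."""
    xi = slack_variables(X, y, w, b)
    return 0.5 * float(w @ w) + C * float(xi.sum())

# Hypothetical data: the last point is labelled -1 but lies on the +1 side.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

xi = slack_variables(X, y, w, b)
objective = soft_margin_objective(X, y, w, b, C=1.0)
```

A slack between 0 and 1 marks a point inside the margin but still correctly classified, while a slack greater than 1 marks a misclassified point.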

Non-linear classifiers (Charu et al, 2012) - The slack variable

approach is not a very efficient technique for classifying non-separable

classes in input space. In this case soft margin classification is not

applicable because the data is not linearly separable. Non-linear

classifiers require a feature map ‘Φ’, which is a function that maps the

input data patterns into a higher dimensional space. For example, two

dimensional input spaces show two non-separable classes as circles and

triangles.

After that the input data space is mapped to a three-

dimensional feature space using a feature map ‘Φ’. In the feature space

support vector machine can find a linear classifier that can separate these

classes easily by a hyperplane. For ‘100’ dimensional data, there are
about 5000 second order features. The feature map approach inflates the
input representation and is not scalable unless a small subset of features
is used. The explicit computation of the feature map Φ can be avoided if
the learning algorithm depends only on inner products; the Support
Vector Machine (SVM) decision function (Mu et al, 2012) is always
expressed in terms of dot products.

Kernel functions - Kernel functions are used for mapping the

input space to a feature space instead of a feature map ‘Φ’, if the

operations on classes are always dot products (Wu, 2012). In this way the

complexity of calculating ‘Φ’ can be reduced. The main optimization

function of SVM can be re-written in the dual form where data appears

only as inner product between data points. Kernel, ‘K’ is a function that

returns the inner product of two data points. Computing kernel, ‘K’ is

equivalent to mapping data patterns into a higher dimensional space and

then taking the dot product there.
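The equivalence between a kernel evaluation and a dot product in the feature space can be checked numerically. The sketch below assumes a homogeneous second order polynomial kernel and the usual explicit map from two to three dimensions; the names `phi`, `poly2_kernel` and `n_second_order_features` are hypothetical. The count d(d + 1)/2 gives 5050 second order monomials for 100 dimensions, in line with the figure of roughly 5000 quoted above.

```python
import numpy as np

def phi(x):
    """Explicit feature map from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Kernel K(x, z) = (x . z)^2, computed without building phi."""
    return float(np.dot(x, z)) ** 2

def n_second_order_features(d):
    """Number of distinct second order monomials x_i * x_j (i <= j) in d dimensions."""
    return d * (d + 1) // 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # map first, then take the dot product
implicit = poly2_kernel(x, z)             # kernel evaluated in input space
```

Both routes give the same number, but the kernel route never forms the inflated feature vectors, which is what makes the approach scalable.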

Using this kernel approach, SVM exploits information about
the inner products between data points in the feature space. Kernels map
data points into a feature space where they are more likely to be linearly
separable. The kernel technique is therefore a better approach for
classifying non-separable classes. SVM performs a nonlinear mapping of
the input vector from the input space into a higher dimensional Hilbert
space, where the mapping is determined by the kernel function. Two typical
kernel functions are 1) the Polynomial Kernel, K(x, y) = (x · y + C)^d,
where ‘d’ is the degree of the polynomial and ‘C’ is a constant, and 2) the
Gaussian Kernel, K(x, y) = exp(-||x - y||^2 / (2σ^2)), where ‘σ’ is the
bandwidth of a Gaussian curve.
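The two kernels can be sketched as plain functions. The forms used here are the standard ones, (x · y + C)^d for the polynomial kernel and exp(-||x - y||² / (2σ²)) for the Gaussian kernel; the default parameter values, the helper `gram_matrix` and the three sample points are assumptions made for illustration.

```python
import numpy as np

def polynomial_kernel(x, y, d=2, C=1.0):
    """Polynomial kernel (x . y + C)^d with exponent d and constant C."""
    return (np.dot(x, y) + C) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) with bandwidth sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Matrix of kernel values between every pair of data points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Hypothetical data points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gram_matrix(X, gaussian_kernel)
```

The matrix `K` is all that a dual-form SVM solver needs from the data, since the training points enter the optimization only through these pairwise values.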

Many methods for local optimization are based on the notion
of a direction of local descent at a given point. A local improvement of
the point in hand can be made using this direction. As a rule, modern
methods for global optimization do not use directions of global descent
for global improvement of the point in hand. From this point of view, the
Global OPtimization (GOP) algorithm based on a dynamical systems
approach is an unusual method. A hybrid GOP was proposed by Ali and
Babak (2010), whose structure is similar to that used in local
optimization: a new iterate is obtained as an improvement on the
previous one along a certain direction. In contrast with local methods,
this is a direction of global descent, and for more diversification it is
combined with Tabu search.

Multi-class and Multi-target problems - Text classification is
usually a multi-target problem. Each document can be in multiple
categories, exactly one category or no category. Examples of multi-target
problems in medical diagnosis are that a disease may belong to multiple
categories and that a gene can have multiple functions. A multi-target
problem is the same as building K independent binary problems, where
K is the number of targets.

In each binary problem, the rows that belong to its target are set to one
class value and all the other rows are set to the opposite class. In a
multi-target case a document can belong to more than one class with high
probability. For example, suppose that a given document can belong to any
of ‘4’ classes: Circle, Square, Triangle and Diamond. In this case, ‘4’
independent binary problems are needed. After a model is built, when a new
document arrives, the mining algorithm uses its ‘4’ binary models and
determines that the document belongs to one or more of the ‘4’ classes.

A document in a multi-target problem (Wang, et al, 2011) may belong
to more than one class. If each document belongs to only a single
class, it is a multi-class problem. Each binary problem is built
using all the data.
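The construction of the K binary problems can be sketched as follows, using the four class names from the example above. The function names `binary_targets` and `predict_labels`, the sample label sets, the scores and the 0.5 threshold are hypothetical; the binary classifiers themselves are not shown.

```python
def binary_targets(doc_labels, classes):
    """Build one 0/1 target column per class; every column covers all documents."""
    return {
        c: [1 if c in labels else 0 for labels in doc_labels]
        for c in classes
    }

def predict_labels(scores, threshold=0.5):
    """Combine the outputs of the binary models into a set of classes."""
    return {c for c, s in scores.items() if s >= threshold}

classes = ["Circle", "Square", "Triangle", "Diamond"]

# Hypothetical training labels: a document may have several classes or none.
doc_labels = [{"Circle"}, {"Square", "Triangle"}, set(), {"Diamond", "Circle"}]
targets = binary_targets(doc_labels, classes)

# Hypothetical scores returned by the four binary models for a new document.
new_doc_scores = {"Circle": 0.9, "Square": 0.2, "Triangle": 0.7, "Diamond": 0.1}
assigned = predict_labels(new_doc_scores)
```

Each target column has one entry per document, which reflects the statement that every binary problem is built using all the data.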

2.2 REVIEWS ON DATA CLUSTERING

Clustering, or cluster analysis, is a set of methodologies
for the classification of samples into a number of groups, such that
samples in one group are similar to each other and samples belonging to
different groups are dissimilar. The input of clustering is a set of
samples, and the process of clustering measures the similarity and/or
dissimilarity between the given samples. The output of the clustering is a
number of groups or clusters in the form of graphs (Scarselli et al 2009),
histograms or ordinary computer output showing the group number, as in
Figure 2.1.

Clustering is a well-established technique for data

interpretation. It usually requires prior information, e.g., about the

statistical distribution of the data or the number of clusters to detect.

“Clustering” attempts to identify natural clusters in a data set. It does this

by partitioning the entities in the data such that each partition consists of

entities that are close (or similar), according to some distance (similarity)

function based on entity attributes (Luhr and Lazarescu, 2009).

Conversely, entities in different partitions are relatively far apart

(dissimilar).

Existing clustering algorithms such as K-means, Partitioning
Around Medoids (PAM), Clustering Large Applications based on
RANdomized Search (CLARANS) and Density Based Spatial Clustering of
Applications with Noise (DBSCAN) are designed to find clusters

that fit some static models. For example, K-means, PAM and CLARANS

assume that clusters are hyper-ellipsoidal or hyper-spherical and are of

similar sizes. The DBSCAN assumes that all points of a cluster are

density reachable and points belonging to different clusters are not.

However, all these algorithms can break down if the choice of

parameters in the static model is incorrect with respect to the data set

being clustered, or the model does not capture the characteristics of the

clusters (e.g., size or shape). Because the objective is to discern structure

in the data, the results of a clustering are then examined by a domain

expert to see if the groups suggest something.

For example, crop production data from an agricultural region

may be clustered according to various combinations of factors, including

soil type, cumulative rainfall, average low temperature, solar radiation,

availability of irrigation, strain of seed used and type of fertilizer applied.

Interpretation by a domain expert is needed to determine whether a
discerned pattern, such as a propensity for high yields to be associated
with heavy applications of fertilizer, is meaningful, because other factors

may actually be responsible (e.g., if the fertilizer is water soluble and

rainfall has been heavy).

Figure 2.1: Cluster analysis process. (a) Initial data (b) Output in three clusters (c) Output in four clusters

Many clustering algorithms that work well with traditional

data deteriorate when executed on geospatial data (which often are

characterized by a high number of attributes or dimensions), resulting in

increased running times or poor-quality clusters. For this reason, recent
research has centred on the development of clustering methods for

large, highly dimensioned data sets, particularly techniques that execute

in linear time as a function of input size or that require only one or two

passes through the data. Recently developed spatial clustering methods

that seem particularly appropriate for geospatial data include

partitioning, hierarchical, density based, grid based and cluster based

analysis.

Hierarchical methods build clusters through top-down (by

splitting) or bottom-up (through aggregation) methods. Density based

methods define clusters as regions of space with a relatively large

number of spatial objects; unlike other methods, these can find

arbitrarily-shaped clusters. Grid based methods divide space into a raster

tessellation and cluster objects based on this structure. Model based

methods find the best fit of the data relative to specific functional

forms. Constraints based methods can capture spatial restrictions on

clusters or the relationships that define these clusters.

An input to a cluster analysis can be described as an ordered
pair (X, s), or (X, d), where ‘X’ is a set of descriptions of samples and ‘s’
and ‘d’ are measures of similarity and dissimilarity (distance) between
samples, respectively. Output from the clustering system is a partition
A = {G1, G2, …, GN}, where Gk, k = 1, …, N, is a crisp subset of ‘X’ such
that equations (2.1) and (2.2) hold:

G1 ∪ G2 ∪ … ∪ GN = X (2.1)

Gi ∩ Gj = Ø, i ≠ j (2.2)

G1, G2, …, GN are the clusters.
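The two conditions on a crisp partition can be checked with a short sketch. It assumes the usual reading of the conditions, namely that the groups together cover ‘X’ and that no sample belongs to two groups; the function name `is_crisp_partition` and the sample sets are hypothetical.

```python
from itertools import combinations

def is_crisp_partition(X, groups):
    """Check that the groups cover X and that no two groups share a sample."""
    groups = [set(g) for g in groups]
    # Coverage: the union of all groups equals the sample set X.
    covers = set().union(*groups) == set(X)
    # Crispness: every pair of distinct groups is disjoint.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(groups, 2))
    return covers and disjoint

X = range(6)
valid = is_crisp_partition(X, [{0, 1}, {2, 3}, {4, 5}])
overlapping = is_crisp_partition(X, [{0, 1}, {1, 2, 3}, {4, 5}])
incomplete = is_crisp_partition(X, [{0, 1}, {2, 3}])
```

Fuzzy clustering methods relax the disjointness check, since a sample may then belong to several groups with different degrees of membership.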

Most clustering algorithms are based on the following four

popular approaches:

(1) Partitioning methods

(2) Hierarchical clustering

(3) Iterative square-error partitioned clustering

(4) Density based clustering

• Partitioning methods: Given a database of ‘n’ objects or data
tuples, a partitioning method constructs ‘k’ (k ≤ n) partitions of the
data, where each partition represents a cluster. That is, it
classifies the data into ‘k’ groups, which together satisfy the
following requirements:

• Each group must contain at least one object

• Each object must belong to exactly one group

Notice that the second requirement can be relaxed in some

fuzzy partitioning techniques (Tang et al, 2010). Such a

partitioning method creates an initial partitioning. It then uses

an iterative relocation technique that attempts to improve the

partitioning by moving objects from one group to another.

Representative algorithms include k-means, k-medoids and

CLARANS algorithm.

• Hierarchical clustering methods: Hierarchical techniques

organize (Saha et al, 2010) data in a nested sequence of groups,

which can be displayed in the form of a dendrogram or a tree

structure. A hierarchical method creates a hierarchical

decomposition of a given set of data objects. Hierarchical

methods can be classified as agglomerative (bottom-up) or

divisive (top-down), based on how the hierarchical

decomposition is formed. Agglomerative nesting and divisive

analysis are examples of agglomerative and divisive methods,

respectively.

• Iterative square-error partitioned clustering methods:

Square-error partitioned algorithms attempt to obtain that

partition which minimizes the within-cluster scatter or

maximizes the between-cluster scatter (Li et al, 2008). These

methods are nonhierarchical because all resulting clusters are

groups of samples at the same level of partition. To guarantee

that an optimum solution has been obtained, one has to

examine all possible partitions of ‘N’ samples of n-dimensions
into K clusters (for a given K), but that exhaustive search is not
computationally feasible.

• Density based clustering methods: Most partitioning methods

cluster objects based on the distance between objects (Guha et

al, 2001). Such methods can find only spherical-shaped clusters

and encounter difficulty in discovering clusters of arbitrary

shape. Other clustering methods have been developed based on

the notion of density. Their general idea is to continue growing

a given cluster as long as the density (the number of objects or

data points) in the “neighbourhood” exceeds a threshold. Such

a method is able to filter out noise (outliers) and discover

clusters of arbitrary shape. Representative algorithms include

DBSCAN, OPTICS and density based clustering (Kantere et al,

2009).
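The iterative relocation idea described for partitioning methods in the list above can be illustrated with a minimal k-means sketch. It is not the implementation of any surveyed work: the function name `k_means`, the random choice of initial centroids, the stopping rule and the sample points are assumptions made for illustration.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Create an initial partitioning, then relocate points until nothing changes."""
    rng = np.random.default_rng(seed)
    # Initial partitioning: k distinct data points serve as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign every object to the group with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no object moved, so the partitioning is stable
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two well separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = k_means(X, k=2)
```

Because assignment uses the distance to a centroid, the sketch shares the limitation noted above for distance based methods: it favours compact, roughly spherical clusters.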

The above traditional clustering methods give good results in
other data mining applications. They perform less well in text mining,
because the input of text mining differs from that of other mining
applications.

The input of text mining is a group of strings, which has
complicating characteristics such as polysemy and synonymy. Polysemy
means that one word has multiple meanings, and synonymy means that
multiple words have the same meaning. Therefore, new research
approaches are being pursued to retrieve the meaning of documents
when text mining is carried out.

2.3 ADVANCEMENTS IN TRADITIONAL MINING MODELS

Text mining is a new and on-going research domain, which
needs efficient clustering methods. In the initial stages of data mining
research, various classifiers using association rules were applied to
knowledge discovery. Most of the classifiers use positive rules as
similarity measures. Kundu et al. (2008) propose negative rules for an
associative classifier. The generation of negative associations from
datasets has been attacked from different perspectives by various authors
and has proved to be a very computationally expensive task. The authors
report, from experiments on benchmark datasets, that their classifier,
termed “Associative Classifier with Negative rules”, is not only
time-efficient but also achieves significantly better accuracy than four
other state-of-the-art classification methods.

The comparison by Mazid et al. (2009) gives a detailed study of
the association rule based mining model, in which rule based mining (which
may be performed through either supervised or unsupervised learning
techniques) is compared with recent research proposals using predefined
test sets. In terms of accuracy and computational complexity, the authors
concluded that Apriori is a better choice for the rule based mining task.

Later, hybrid mining models were proposed for classification, for
example the concept classification proposed by Brown and Forouraghi (2009)
and Rahman et al. (2010). As noted above, Apriori is a well-known
algorithm which is used extensively in market-basket analysis and data
mining. The algorithm is used for learning association rules from
transactional databases and is based on simple counting procedures. In the
hybrid models, Apriori is further improved by the C4.5 decision tree and
k-means clustering algorithms, respectively.

El-far et al. (2011) proposed a k-means classifier for data mining,
applied to three-dimensional data models used to visualize realistic
objects. This study proposes k-means for applications such as medical
simulations, games and virtual reality. There are two major approaches for
drawing or building 3D objects: (1) the search in the database can be done
via requests that are 3D objects, or (2) via some 2D views of the 3D
object. This study contributes the extraction of characteristic views of
3D models using a data mining algorithm which comprises Apriori, Charm,
Close+ and the extraction of association rules. The work was tested using a
database that contains 120 3D models selected from the Princeton Shape
Benchmark, with 342 2D views.

An advancement of DBSCAN (Chen and Chen, 2012) defines
an event as a significant theme development that continues for a period
of time. In general, these events are temporally disjoint and, taken
together, form the message of the topic. Moreover, events in different
themes may be associated because of their temporal proximity and context
similarity. The authors propose a model to identify the themes and the
events in a given document, together with the associated events.

Recent developments in conceptual text mining include
string mining, which concentrates on low memory usage (Dhaliwal et al.,
2012), and a text deduction methodology (Chenghua et al, 2012), which
proposes a novel probabilistic modelling framework called the Joint
Sentiment-Topic model, based on Latent Dirichlet Allocation, that detects
sentiment and topic simultaneously from text. Both are recommended as
implementations of recent research.

The changes in the coordinates of the text documents are

major critical issues in text mining. Hence handling such changes in

coordinates attracts researchers, for example, Wright and Grothendieck

(2012). Also, document classification on time series data is a frequent
application. Iwata et al (2012) proposed sequential modelling for
multiple time series databases.

For a detailed survey of text mining, the survey of evolutionary
algorithms by Barros et al (2012) and “Twenty Years of Mixture of
Experts” by Yuksel (2012) are recommended.

2.4 SURVEY ON ARTIFICIAL NEURAL NETWORK (ANN) BASED LEARNING MODEL

The performance of ANN depends on the architecture of the
ANN (Franco et al 2009), the training and learning methods (Dam et al
2008) (Pavel et al 2010), the pre-processing methods (Rudy et al 2008),
and the training and testing data set ratio.

ANN is a Self-Organized, Distributed, and Adaptive Rule

based Induction System (Folino et al, 2009). Predicting the system

imbalance volumes (Maria and Daniel 2006), Predicting business failure

(Li et al 2010), time series forecasting (Khashei et al 2010), Speech

recognition (Dede et al 2010, Gulin and Murat 2010), predicting short

term wind power (Andrew and Wenyan 2010), optimization of energy

consumption (Andrew et al 2010), face recognition (Sheryl and Loris

2010), Web-services classification (Ramakanta et al 2010), mutual fund

performance evaluation (Kehluh and Szuwai 2010), Performance

evaluation of cognitive radio systems (Katidiotis et al 2010) and

vulnerability of a power system (Ahmed et al 2010) are some of the
prediction model based implementations in various engineering domains.

Jolai and Ghanbari (2010) presented an improved ANN
approach for solving the travelling salesman problem. Hopfield neural
networks (HNN) and data transformation techniques are employed together
to improve the accuracy of the results and reach optimal tours with less
total distance. To get an optimal result, Z-score and logarithmic
approaches are integrated with HNN. These unified methods have recently
been combined with the HNN method, which has been applied across various
scientific and engineering fields. For example, Huang and Liu (1997)
employed HNN and genetic algorithms together for the purpose of pattern
recognition.

Yen (2009) employed the same tools for identifying

probability density functions. The author has applied HNN for motion

planning. Wang and Zhou (2009) employed the stochastic optimal

competitive HNN to solve clustering problem. The advantages of HNN

are as follows:

• First of all, HNN is capable of solving both the continuous and

the combinatorial problems though some conventional

methods are as well.

• Second, it is a parallel-processing version of the gradient

method and thus can be more powerful than most previous

methods.

The HNN has a few drawbacks. One of the most concerning
is that it sometimes finds locally minimum solutions instead
of global minimum solutions.

Jasna and Vesna (2010) use the Feed Forward Neural Network (FFNN).
The authors propose ‘z’ score scaling as the pre-processing method and a
70:30 ratio of training to testing data. The authors note that increased
accuracy in predicting the wind power to be produced at future time
periods is often bounded by the prediction model complexity and the
computational time involved. A trade-off between the two conflicting
objectives is addressed in the above report. First, a set of the most
relevant parameters, i.e. predictors, is selected using the underlying
physics and the patterns immersed in the data. Second, the most promising
clustering scenario is applied to produce a model for each clustered
subspace.
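The pre-processing step mentioned above can be sketched as follows. The sketch assumes a random 70:30 split and that the scaling statistics are estimated on the training portion only and then reused for the test portion; this is a common design choice and is not stated in the source. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Shuffle the rows and keep 70% for training and 30% for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(round(0.7 * len(X)))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

def z_score_fit(X):
    """Estimate the per-feature mean and standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return mean, std

def z_score_apply(X, mean, std):
    """Scale the features to zero mean and unit variance."""
    return (X - mean) / std

# Synthetic data: 20 samples with 2 features each (hypothetical).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split_70_30(X, y)
mean, std = z_score_fit(X_train)            # statistics from training data only
X_train_z = z_score_apply(X_train, mean, std)
X_test_z = z_score_apply(X_test, mean, std)
```

Reusing the training statistics for the test portion keeps information about the test data out of the model-building step.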

Kehluh and Szuwei (2010) designed a Fast Adaptive Neural
Network Classifier (FANNC). FANNC is a newly-developed model
which combines features of adaptive resonance theory and field theory.
The results show that FANNC requires much less time than the Back
Propagation Neural Network (BPNN) approach to evaluate mutual fund
performance, and the Root Mean Square (RMS) measure is also superior
for FANNC.

Gulin and Murat (2010) developed three different neural

network models: Multilayer Back Propagation, Elman Neural

Networks (ENN) and Probabilistic Neural Networks (PNN). The

developed models are applied to speech recognition. The speech

recognition problem is a branch of pattern recognition. Some popular

techniques to tackle this problem are artificial neural networks, dynamic

time warping, and hidden Markov modelling.

A recent study on isolated Malay digit recognition reports that

dynamic time warping and hidden Markov modelling techniques have

recognition rates of 80.5% and 90.7%, respectively. Meanwhile,

recognition rates obtained by neural networks for similar applications,

as in this study, are often higher. For this reason, ANN appears to be

a convenient classifier for the speech recognition problem.

The ENN is a type of recurrent neural network, and basically

contains a two-layer BPNN. Unlike other BPNNs, it has a feedback

loop from the output of the first hidden layer to that layer’s input. The

ENN topology designed for this application has the following parameters:

hidden layer 1: 40 neurons, hidden layer 2: 30 neurons. In these two

hidden layers, hyperbolic tangent and linear activation functions are

used, respectively. In the output layer, the logarithmic sigmoid activation

function is used.
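
The topology described above can be sketched as a forward pass. The layer sizes (40 and 30 neurons) and the activation functions follow the text; the input size of 12, the output size of 10, the random weight initialisation, the omission of bias terms and training, and the class name ElmanSketch are all assumptions made for illustration.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix with entries in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def logsig(x):
    """Logarithmic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

class ElmanSketch:
    def __init__(self, n_in, n_out, n_h1=40, n_h2=30, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(n_h1, n_in, rng)
        self.w_ctx = make_matrix(n_h1, n_h1, rng)  # feedback loop weights
        self.w_h2 = make_matrix(n_h2, n_h1, rng)
        self.w_out = make_matrix(n_out, n_h2, rng)
        self.context = [0.0] * n_h1  # previous output of hidden layer 1

    def step(self, x):
        """Process one input frame and update the context."""
        drive = matvec(self.w_in, x)
        feedback = matvec(self.w_ctx, self.context)
        h1 = [math.tanh(a + b) for a, b in zip(drive, feedback)]  # tanh layer
        h2 = matvec(self.w_h2, h1)                                # linear layer
        y = [logsig(v) for v in matvec(self.w_out, h2)]           # logsig output
        self.context = h1
        return y

net = ElmanSketch(n_in=12, n_out=10)
first = net.step([0.5] * 12)
second = net.step([0.5] * 12)  # same input, different context
```

Because the context carries the previous hidden state, the same input frame can produce different outputs at different time steps, which is what distinguishes this network from a plain feed-forward BPNN.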

PNN is a network topology that makes use of the probability

distribution function for the calculation of network connection weights.

In the first hidden layer, the distance from the input data to the training data is

calculated, and in the second hidden layer, these calculated distances are

summed up, producing the resultant output vector. Thus, model classes

are obtained. In the output layer, the output of the network is determined

as the most probable model class.

The design process for PNN is a bit different from the other two

network topologies in terms of training, because in PNN the weights for

input–output matches fed to the network are altered by a distribution

constant. The PNN topology designed for this application has three

parameters: distribution constant: 0.1, hidden layer 1: 310

neurons, hidden layer 2: 10 neurons.
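
The three PNN stages described above (distance computation, per-class summation, selection of the most probable class) can be sketched as follows. The Gaussian kernel, the way the distribution constant enters it as a spread, and the function name pnn_classify are assumptions; the survey only gives the value 0.1 and the layer sizes.

```python
import math

def pnn_classify(x, train_x, train_y, spread=0.1):
    """Return the most probable class for x and the per-class scores."""
    scores = {}
    for xi, yi in zip(train_x, train_y):
        # First hidden layer: squared distance from the input to one training sample,
        # passed through a Gaussian kernel controlled by the spread.
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        activation = math.exp(-d2 / (2.0 * spread ** 2))
        # Second hidden layer: sum the activations of each class.
        scores[yi] = scores.get(yi, 0.0) + activation
    # Output layer: pick the class with the largest summed activation.
    return max(scores, key=scores.get), scores

# Hypothetical two-class training set.
train_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
train_y = ["a", "a", "b", "b"]
label, scores = pnn_classify([0.05, 0.02], train_x, train_y)
```

In this reading, the first hidden layer holds one unit per training sample and the second one unit per class. A small spread such as 0.1 makes the decision depend almost entirely on the nearest training samples, so inputs are usually scaled to a comparable range first.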

The main objective of Ramakanta et al (2010) is to develop

various classification models based on intelligent techniques, namely

BPNN, PNN, and Lease Vector Machine, to predict the quality of a web

service based on a number of QoS attributes. These models are

developed based on past data comprising QoS attributes as

explanatory variables and the quality of web services as the dependent

variable. Each QoS attribute defines a different dimension of

the quality of web services, and together they influence that

quality. The authors therefore assume that these QoS attributes are

non-linearly related to the quality of web services, and approximate this

nonlinear relationship with the help of several intelligent techniques.

Li et al (2010) developed a model for predicting

business failures. Several of the top 10 data mining methods have become very

popular alternatives in business failure prediction, e.g., support vector

machine and k nearest neighbour. In comparison with the other

classification mining methods, the advantages of classification and

regression tree methods include simplicity of results, easy

implementation, nonlinear estimation, being non-parametric, accuracy

and stability.

The model of Andrew et al (2010) is solved with a particle swarm

optimization algorithm. Parameter selection is performed to

eliminate uncontrollable parameters of less importance. Appropriate

parameter selection and dimensionality reduction can improve the

comprehensibility, scalability, and, possibly, accuracy of the resulting

models. The boosting tree algorithm and a wrapper are used to perform the

parameter selection on the data set. In that report, the

boosting tree algorithm shares the advantages of decision tree

induction and tends to be robust in the removal of irrelevant parameters.

In the boosting tree algorithm, a split at every node of any regression tree

is based on certain criteria, e.g., minimization of the total regression

error.

In the process of generating successive trees, the statistical

importance of each variable at each split of every tree is accumulated and

normalized. A predictor with a higher importance rank makes a larger

contribution to the predicted output parameter. Wrappers are also

commonly used to reduce the dimensionality of the variable

space.
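
The accumulate-and-normalize step for variable importance can be sketched as below. The input format (one record per split, giving the variable used and the error reduction it achieved), the normalization to a total of 1, and the function names are assumptions for illustration; the survey does not specify the statistic or the normalization used in the cited report.

```python
from collections import defaultdict

def normalized_importance(split_records):
    """Sum the gain of every split per variable, then scale the totals to sum to 1."""
    totals = defaultdict(float)
    for variable, gain in split_records:
        totals[variable] += gain
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {v: 0.0 for v in totals}
    return {v: t / grand_total for v, t in totals.items()}

def rank(importance):
    """Variables ordered from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)

# Hypothetical splits collected over several successive trees.
records = [("x1", 4.0), ("x2", 1.0), ("x1", 2.0), ("x3", 1.0)]
importance = normalized_importance(records)
ordering = rank(importance)
```

Variables at the bottom of the ranking are the candidates for removal during parameter selection.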

In the wrapper approach, a specific search algorithm searches the

space of all possible variables and evaluates each subset of variables

after building a model based on this subset. Considering the expensive

computational cost, pace regression is used as the evaluator, and a

genetic algorithm is used as the search algorithm. The population size is

set at 20, the maximum number of iterations is 20, the crossover

probability is 0.6, and the mutation probability is 0.033.
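
A wrapper search with these genetic algorithm settings can be sketched as follows. The population size, iteration count, crossover probability and mutation probability are the values quoted above. Everything else is an assumption for illustration: the tournament selection, the single-point crossover, the bit-flip mutation, and in particular the toy scoring function, which merely stands in for the pace regression evaluator and is not an implementation of it.

```python
import random

def tournament(pop, evaluate, rng):
    """Pick the better of two randomly chosen individuals."""
    a, b = rng.sample(pop, 2)
    return a if evaluate(a) >= evaluate(b) else b

def ga_feature_search(n_features, evaluate, pop_size=20, generations=20,
                      p_crossover=0.6, p_mutation=0.033, seed=0):
    """Search over boolean masks; mask[i] is True if variable i is selected."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=evaluate)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1 = tournament(pop, evaluate, rng)
            p2 = tournament(pop, evaluate, rng)
            c1, c2 = list(p1), list(p2)
            if n_features > 1 and rng.random() < p_crossover:
                cut = rng.randrange(1, n_features)  # single-point crossover
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            for child in (c1, c2):
                for i in range(n_features):
                    if rng.random() < p_mutation:   # bit-flip mutation
                        child[i] = not child[i]
                children.append(child)
        pop = children[:pop_size]
        gen_best = max(pop, key=evaluate)
        if evaluate(gen_best) > evaluate(best):
            best = gen_best
    return best

# Toy evaluator (hypothetical): reward three "relevant" variables, penalize subset size.
RELEVANT = {0, 2, 5}

def toy_score(mask):
    hits = sum(1 for i in RELEVANT if mask[i])
    return hits - 0.1 * sum(mask)

best_mask = ga_feature_search(8, toy_score)
```

In a real wrapper, the evaluator would build a model on the selected variables and return its validation performance, which is why a cheap evaluator matters: with these settings it is called for every individual in each of the 20 generations.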

In Sheryl and Loris (2010), a control system based on double

neural networks for a parallel mechanism is presented with the objective of

nonlinear modelling and control. The control system is

composed of one Neural Network Controller for compensating the

nonlinear modelling and one Neural Network Identification for the

controlling model. Simulation results have shown that the response time,

movement accuracy and resistance to load disturbance of the parallel

mechanism system can be improved using the double neural networks.

In the proposed domain, i.e. computer networks, the traffic flow

(Ya Gao and Shiliang 2010) is highly dynamic in nature; therefore, the

congestion control algorithm has some limitations, as it involves a

mathematical model. As an alternative, a prediction model based on

routing is proposed, which in turn predicts a traffic-free path using ANN.

The performance of ANN has some requirements and

limitations, such as choosing the optimal number of hidden layers. If the number of

hidden layers is increased, then the accuracy of the system will increase

but the system will converge more slowly, and vice versa.
