
DATA MINING: CONCEPTS AND TECHNIQUES
UNIT-III
Part-I: Classification and Prediction
DATA MINING CSE@HCST

Classification and Prediction (Outline)

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
*Rule-based classification
Classification by back propagation (neural networks)
*Support Vector Machines (SVM)
*Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
*Prediction
*Accuracy and error measures
*Ensemble methods
*Model selection
Summary

Objectives (figure slide)

Classification vs. Prediction

Classification:
Predicts categorical class labels (discrete or nominal).
Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.

Prediction:
Models continuous-valued functions, i.e., predicts unknown or missing values.

Typical applications: credit approval, document categorization, target marketing, medical diagnosis, treatment effectiveness analysis, fraud detection.

Classification Types (figure slide)

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
The test set is independent of the training set, otherwise over-fitting will occur.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
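A minimal sketch of the two steps in Python with scikit-learn; the iris dataset, the 70/30 split, and the decision-tree learner are illustrative assumptions, not part of the slides:

```python
# Step 1: build a model on a training set.
# Step 2: estimate its accuracy on an independent test set before using it on unknown tuples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                          # labelled tuples (illustrative dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier().fit(X_train, y_train)     # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))   # accuracy on the independent test set
print(f"accuracy = {accuracy:.2f}")
# If the accuracy is acceptable, model.predict() can be applied to tuples
# whose class labels are not known.
```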


Example-1: Model Construction (figure slide)

Example-1: Using the Model in Prediction (figure slide)

Example-2, Process (1): Model Construction

Training data is fed to a classification algorithm, which produces the classifier (model). The learned rule here is:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data, then applied to unseen data.

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4)  ->  Tenured?
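A small Python sketch of applying the learned rule to the test tuples and the unseen tuple; the rule and tuples come from the slides, the function name is mine:

```python
# Learned classifier from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [("Tom", "Assistant Prof", 2), ("Merlisa", "Associate Prof", 7),
             ("George", "Professor", 5), ("Joseph", "Assistant Prof", 7)]

for name, rank, years in test_data:         # compare predictions with the known labels
    print(name, classify(rank, years))
# 3 of the 4 test tuples are classified correctly (Merlisa, labelled 'no', is predicted 'yes'),
# giving a 75% accuracy estimate on this test set.

print("Jeff ->", classify("Professor", 4))  # unseen data: (Jeff, Professor, 4) -> 'yes'
```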

How Does Classification Work? (figure slide)

Supervised vs. Unsupervised Learning

Supervised learning (classification):
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.

Unsupervised learning (clustering):
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Issues: Data Preparation

Data cleaning: preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection): remove irrelevant or redundant attributes.
Data transformation: generalize and/or normalize data.

Issues: Evaluating Classification Methods

Accuracy: classifier accuracy (predicting class labels) and predictor accuracy (guessing values of predicted attributes).
Speed: time to construct the model (training time) and time to use the model (classification/prediction time).
Robustness: handling noise and missing values.
Scalability: efficiency on disk-resident databases.
Interpretability: understanding and insight provided by the model.
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules.

Decision Tree Induction: Training Dataset

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

This follows an example of Quinlan's ID3 (Playing Tennis).

Decision Tree (figure slides)

Decision Tree Induction (figure slides)
Decision Tree Boundary (figure slide)

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
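A small Python sketch of these formulas applied to the buys_computer training data above; the helper names are mine, the counts (9 'yes' / 5 'no', the age partitions) come from the table:

```python
from math import log2
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j), splitting on one attribute."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attr_values, labels):
        partitions.setdefault(value, []).append(label)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# buys_computer column of the 14-tuple training set: 9 "yes", 5 "no"
labels = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
ages   = ["<=30","<=30","31...40",">40",">40",">40","31...40",
          "<=30","<=30",">40","<=30","31...40","31...40",">40"]

print(round(info(labels), 3))            # Info(D)   = 0.940
print(round(info_gain(ages, labels), 3)) # Gain(age) = 0.246
```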


Gain Ratio for Attribute Selection (C4.5) (figure slide)
Gini Index (CART, IBM IntelligentMiner) (figure slide)
Comparisons of Attribute Selection Measures (figure slide)

Other Attribute Selection Measures

CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence.
C-SEP: performs better than information gain and the Gini index in certain cases.
G-statistic: has a close approximation to the χ2 distribution.
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
Multivariate splits (partitions based on combinations of multiple variables): CART finds multivariate splits based on a linear combination of attributes.

Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.

Decision Tree Induction [IMPORTANT] (figure slides)
EXAMPLE: Decision Tree Induction (figure slides)
EXAMPLE: Calculating Gain Ratio (figure slide)
Gini Index (figure slide)
Calculating Gini Index (figure slide)

Overfitting and Tree Pruning

Overfitting: an induced tree may overfit the training data.
Too many branches, some of which may reflect anomalies due to noise or outliers.
Poor accuracy for unseen samples.

Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if doing so would make the goodness measure fall below a threshold. It is difficult to choose an appropriate threshold.
Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".


Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics

Total probability theorem:  P(B) = \sum_{i=1}^{M} P(B | A_i) P(A_i)

Bayes' theorem:  P(H | X) = P(X | H) P(H) / P(X)

Let X be a data sample ("evidence") whose class label is unknown, and let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X.
P(H) (prior probability): the initial probability, e.g., that X will buy a computer, regardless of age, income, etc.
P(X): the probability that the sample data is observed.
P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31...40 with medium income.

Prediction Based on Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H | X) = P(X | H) P(H) / P(X)

Informally: posterior = likelihood x prior / evidence.

Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes.
Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost.

Classification Is to Derive the Maximum Posteriori

Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X). From Bayes' theorem:

P(C_i | X) = P(X | C_i) P(C_i) / P(X)

Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized.

Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) \times P(x_2 | C_i) \times ... \times P(x_n | C_i)

This greatly reduces the computation cost: only the class distribution needs to be counted.
If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D).
If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

and P(x_k | C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}).

Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = 'yes';  C2: buys_computer = 'no'.
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair).
The training data are the 14 tuples shown in the Decision Tree Induction training dataset above.

Naïve Bayes Classifier: An Example

Priors, P(C_i):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no")  = 5/14 = 0.357

Compute P(X|C_i) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no")  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no")  * P(buys_computer = "no")  = 0.007

Therefore, X belongs to the class buys_computer = "yes".
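A minimal Python sketch that reproduces these numbers directly from the training table by counting (the data layout and variable names are mine):

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer): the 14 training tuples
data = [
    ("<=30","high","no","fair","no"),          ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"),      (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"),          (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"),         (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"),     (">40","medium","no","excellent","no"),
]
X = ("<=30", "medium", "yes", "fair")          # tuple to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    # P(X|Ci) = product over attributes of P(x_k|Ci), then multiply by the prior P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        likelihood *= n_match / n_c
    scores[c] = likelihood * (n_c / len(data))

print(scores)                       # {'no': ~0.007, 'yes': ~0.028}
print(max(scores, key=scores.get))  # 'yes' -> X buys a computer
```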

Avoiding the Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)

will be zero.
Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10).
Use the Laplacian correction (Laplacian estimator): add 1 to each case.
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The "corrected" probability estimates are close to their "uncorrected" counterparts.
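A one-line Python sketch of the Laplacian correction for the income example above (add one count per attribute value, hence the denominator 1000 + 3 = 1003):

```python
counts = {"low": 0, "medium": 990, "high": 10}   # income counts among 1000 tuples
total = sum(counts.values())
corrected = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003
```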

Naïve Bayes Classifier: Comments

Advantages:
Easy to implement. Good results obtained in most cases.

Disadvantages:
The class conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
E.g., in hospital data a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.) are not independent.
Such dependencies cannot be modeled by the naïve Bayes classifier.
How to deal with these dependencies? Bayesian belief networks.

Classification by Backpropagation

Backpropagation: a neural network learning algorithm.
Started by psychologists and neurobiologists to develop and test computational analogues of neurons.
A neural network: a set of connected input/output units where each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
Also referred to as connectionist learning because of the connections between units.

Neural Network as a Classifier

Weaknesses:
Long training time.
Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure".
Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.

Strengths:
High tolerance to noisy data.
Ability to classify untrained patterns.
Well suited for continuous-valued inputs and outputs.
Successful on a wide array of real-world data.
Algorithms are inherently parallel.
Techniques have recently been developed for extracting rules from trained neural networks.

A Neuron (= a Perceptron)

The n-dimensional input vector x is mapped into the output y by means of the scalar product with the weight vector w and a nonlinear activation function:

y = sign\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)

(Figure: inputs x_0, x_1, ..., x_n with weights w_0, w_1, ..., w_n feed a weighted sum, which passes through the activation function f, with bias/threshold \mu_k, to produce the output y.)
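A minimal Python sketch of this weighted sum with a sign activation; the particular weights, inputs, and bias value are made up for illustration:

```python
def perceptron(x, w, bias):
    """y = sign(sum_i w_i * x_i - bias); 'bias' plays the role of the threshold mu_k."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1 if s >= 0 else -1

x = [0.5, -1.0, 2.0]   # input vector (illustrative)
w = [0.4, 0.3, 0.9]    # weight vector (illustrative)
print(perceptron(x, w, bias=1.0))   # -> 1, since 0.2 - 0.3 + 1.8 - 1.0 = 0.7 >= 0
```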

A Multi-Layer Feed-Forward Neural Network

The input vector X feeds the input layer; weighted connections w_{ij} feed the hidden layer; the hidden layer feeds the output layer, which emits the output vector.

Net input to unit j:        I_j = \sum_i w_{ij} O_i + \theta_j
Output of unit j:           O_j = \frac{1}{1 + e^{-I_j}}
Error of an output unit:    Err_j = O_j (1 - O_j)(T_j - O_j)
Error of a hidden unit:     Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}
Weight update:              w_{ij} = w_{ij} + (l)\, Err_j O_i
Bias update:                \theta_j = \theta_j + (l)\, Err_j

where l is the learning rate and T_j is the known target value.

How Does a Multi-Layer Neural Network Work?

The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer.
They are then weighted and fed simultaneously to a hidden layer.
The number of hidden layers is arbitrary, although usually only one is used.
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

Defining a Network Topology

First decide the network topology: the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
Use one input unit per domain value, each initialized to 0.
For classification with more than two classes, use one output unit per class.
Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer; hence "backpropagation".

Steps:
Initialize weights (to small random numbers) and biases in the network.
Propagate the inputs forward (by applying the activation function).
Backpropagate the error (by updating weights and biases).
Check the terminating condition (e.g., when the error is very small).
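A compact Python/NumPy sketch of these steps for a single hidden layer, using the update rules from the network slide above. The tiny XOR-style dataset, the 2-4-1 layer sizes, the learning rate, and the epoch count are assumptions for illustration, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny illustrative dataset (XOR): 2 inputs -> 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden weights/biases
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output weights/biases
l = 0.5                                                      # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        # Propagate the inputs forward
        O1 = sigmoid(x @ W1 + b1)                        # hidden-layer outputs
        O2 = sigmoid(O1 @ W2 + b2)                       # output-layer outputs
        # Backpropagate the error
        err2 = O2 * (1 - O2) * (t - O2)                  # Err_j for output units
        err1 = O1 * (1 - O1) * (W2 @ err2)               # Err_j for hidden units
        # Update weights and biases: w_ij += l*Err_j*O_i, theta_j += l*Err_j
        W2 += l * np.outer(O1, err2); b2 += l * err2
        W1 += l * np.outer(x, err1);  b1 += l * err1

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # should approach [[0],[1],[1],[0]]
```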

Multilayer Neural Network (figure slides)

Lazy vs. Eager Learning

Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
Lazy learning takes less time in training but more time in predicting.
Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space.

Lazy Learner: Instance-Based Methods

Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
k-nearest neighbor approach: instances are represented as points in a Euclidean space.
Locally weighted regression: constructs a local approximation.
Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function can be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
(Figure: a query point x_q surrounded by + and - training examples.)

Page 81: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST81

Page 82: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST82

Page 83: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST83

Page 84: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST84

Page 85: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST85

Page 86: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST86

Page 87: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST87

Page 88: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST88

Page 89: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST89

Page 90: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST90

Page 91: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST91

Page 92: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST92

Page 93: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST93

Page 94: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST94

Page 95: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1

April 21, 2023DATA MINING CSE@HCST95

Example: k-Nearest Neighbor Classifier

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Example: k-Nearest Neighbor Classifier (continued)

Distance from David (income in thousands):
John:   sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel: sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah: sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom:    sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie: sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

With k = 3, the nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David's predicted response is Yes.
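A small Python sketch that reproduces these distances and the majority vote for David (income expressed in thousands as in the table; k = 3 is an assumption consistent with the slide's answer):

```python
from math import sqrt

# (name, age, income in K, no. of credit cards, response)
training = [("John", 35, 35, 3, "No"), ("Rachel", 22, 50, 2, "Yes"),
            ("Hannah", 63, 200, 1, "No"), ("Tom", 59, 170, 1, "No"),
            ("Nellie", 25, 40, 4, "Yes")]
david = (37, 50, 2)

def distance(p, q):
    """Euclidean distance between two attribute vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

neighbours = sorted(training, key=lambda r: distance(r[1:4], david))
for name, *features, response in neighbours:
    print(f"{name:7s} {distance(features, david):7.2f} {response}")

k = 3
votes = [r[-1] for r in neighbours[:k]]                      # Rachel, John, Nellie
print("Prediction for David:", max(set(votes), key=votes.count))   # -> Yes
```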

Genetic Algorithms (GA-Part-I)

A genetic algorithm is based on an analogy to biological evolution.
An initial population is created consisting of randomly generated rules.
Each rule is represented by a string of bits; e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100. If an attribute has k > 2 values, k bits can be used.
Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring.
The fitness of a rule is represented by its classification accuracy on a set of training examples.
Offspring are generated by crossover and mutation.
The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold.
Slow, but easily parallelizable.
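A minimal Python sketch of this bit-string rule encoding and fitness evaluation. The mapping of the third bit to the class (0 meaning C2), the restriction of accuracy to the tuples the rule covers, and the tiny dataset are my own illustrative assumptions:

```python
# Each rule is a bit string: bit 0 = A1 required true, bit 1 = A2 required true, bit 2 = class.
# "IF A1 AND NOT A2 THEN C2" -> "100" (A1 = 1, A2 = 0, class bit 0 taken to mean C2).
def rule_matches(rule, a1, a2):
    return a1 == int(rule[0]) and a2 == int(rule[1])

def predicted_class(rule):
    return "C1" if rule[2] == "1" else "C2"

def fitness(rule, examples):
    """Classification accuracy of the rule on the training examples it covers."""
    covered = [(a1, a2, c) for a1, a2, c in examples if rule_matches(rule, a1, a2)]
    if not covered:
        return 0.0
    return sum(c == predicted_class(rule) for _, _, c in covered) / len(covered)

examples = [(1, 0, "C2"), (1, 0, "C2"), (1, 1, "C1"), (0, 1, "C1")]  # illustrative data
print(fitness("100", examples))   # 1.0: the rule classifies its covered tuples correctly
```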


Genetic Algorithms (GA)

To use a genetic algorithm, you must encode solutions to your problem in a structure that can be stored in the computer. This object is a genome (or chromosome).
The genetic algorithm creates a population of genomes, then applies crossover and mutation to the individuals in the population to generate new individuals.
It uses various selection criteria to pick the best individuals for mating (and subsequent crossover).
The objective function determines how 'good' each individual is.

The genetic algorithm is very simple, yet it performs well on many different types of problems.
There are many ways to modify the basic algorithm, and many parameters that can be 'tweaked'.
Basically, if you get the objective function, the representation, and the operators right, then variations on the genetic algorithm and its parameters will result in only minor improvements.

Representation

You can use any representation for the individual genomes in the genetic algorithm. Holland worked primarily with strings of bits, but you can use arrays, trees, lists, or any other object.
You must define genetic operators (initialization, mutation, crossover, comparison) for whatever representation you decide to use.
Each individual must represent a complete solution to the problem you are trying to optimize.


Mutation operators (trees)

These are some sample tree mutation operators (figure).
You can use more than one operator during an evolution.
The mutation operator introduces a certain amount of randomness into the search. It can help the search find solutions that crossover alone might not encounter.


Crossover operators

These are some sample tree crossover operators (figure).
Typically, crossover is defined so that two individuals (the parents) combine to produce two more individuals (the children). But you can define asexual crossover or single-child crossover as well.
The primary purpose of the crossover operator is to pass genetic material from the previous generation to the subsequent generation.


Mutation operators (lists)

These are some sample list mutation operators (figure). Lists may be fixed or variable length. Also common are order-based lists, in which the sequence is important and nodes cannot be duplicated during the genetic operations.
You can use more than one operator during an evolution.
The mutation operator introduces a certain amount of randomness into the search and can help the search find solutions that crossover alone might not encounter.


Genetic Algorithms (GA)

Two of the most common genetic algorithm implementations are 'simple' and 'steady state'.
The simple GA is a generational algorithm in which the entire population is replaced each generation.
The steady-state genetic algorithm is used by the Genitor program. In this algorithm, only a few individuals are replaced each 'generation'. This type of replacement is often referred to as overlapping populations.

Source: http://lancet.mit.edu/mbwall/presentations

Genetic Algorithm

Outline

Introduction to Genetic Algorithms (GA)
GA components: representation, recombination, mutation, parent selection, survivor selection
Example

Introduction to GA (1)

(Figure: taxonomy of search techniques.) Search techniques include calculus-based techniques (e.g., Fibonacci search), enumerative techniques (BFS, DFS, dynamic programming), and guided random search techniques (tabu search, hill climbing, simulated annealing, evolutionary algorithms). Genetic programming and genetic algorithms are evolutionary algorithms.

Introduction to GA (2)

"Genetic Algorithms are good at taking large, potentially huge search spaces and navigating them, looking for optimal combinations of things, solutions you might not otherwise find in a lifetime." (Salvatore Mangano, Computer Design, May 1995)

Originally developed by John Holland (1975).
The genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution.
It uses the concepts of "natural selection" and "genetic inheritance" (Darwin, 1859).

Use of GA

Widely used in business, science, and engineering.
Optimization and search problems.
Scheduling and timetabling.

Let's Learn Biology (1)

Our body is made up of trillions of cells. Each cell has a core structure (nucleus) that contains your chromosomes.
Each chromosome is made up of tightly coiled strands of deoxyribonucleic acid (DNA). Genes are segments of DNA that determine specific traits, such as eye or hair color. You have more than 20,000 genes.
A gene mutation is an alteration in your DNA. It can be inherited or acquired during your lifetime, as cells age or are exposed to certain chemicals. Some changes in your genes result in genetic disorders.

Let's Learn Biology (2) and (3) (figure slides)
Source: http://www.riversideonline.com/health_reference/Tools/DS00549.cfm

Let's Learn Biology (4)

Natural selection, from Darwin's theory of evolution: only the organisms best adapted to their environment tend to survive and transmit their genetic characteristics in increasing numbers to succeeding generations, while those less adapted tend to be eliminated.

Source: http://www.bbc.co.uk/programmes/p0022nyy

GA Is Inspired by Nature

A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators.

Nature vs. GA

The computer model introduces simplifications relative to the real biological mechanisms, BUT surprisingly complex and interesting structures have emerged out of evolutionary algorithms.

High-level Algorithm

produce an initial population of individuals
evaluate the fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine between individuals
    mutate individuals
    evaluate the fitness of the modified individuals
    generate a new population
end while

GA Components (figure slide)
Source: http://www.engineering.lancs.ac.uk

GA Components with an Example

The MAXONE problem: suppose we want to maximize the number of ones in a string of L binary digits.
It may seem trivial because we know the answer in advance.
However, we can think of it as maximizing the number of correct answers, each encoded by 1, to L difficult yes/no questions.

GA Components: Representation (Encoding)

An individual is encoded (naturally) as a string of L binary digits.
Let's say L = 10. Then 1 = 0000000001 (10 bits).


Initial Population

We start with a population of n random strings. Suppose that L = 10 and n = 6.
We toss a fair coin 60 times and get the following initial population:

s1 = 1111010101
s2 = 0111000101
s3 = 1110110101
s4 = 0100010011
s5 = 1110111101
s6 = 0100110000


Fitness Function: f()

The fitness of a string is the number of ones it contains:

s1 = 1111010101   f(s1) = 7
s2 = 0111000101   f(s2) = 5
s3 = 1110110101   f(s3) = 7
s4 = 0100010011   f(s4) = 4
s5 = 1110111101   f(s5) = 8
s6 = 0100110000   f(s6) = 3

Total fitness = 34


Selection (1)

Next we apply fitness-proportionate selection with the roulette-wheel method: individual i is chosen with probability

f(i) / \sum_j f(j)

(roulette wheel: each individual's slice of the wheel is proportional to its fitness value).
We repeat the extraction as many times as the number of individuals we need, to keep the same parent population size (6 in our case).
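A small Python sketch of this fitness-proportionate draw; random.choices performs the weighted selection, and the seed is only there to make a run repeatable:

```python
import random

population = {"s1": 7, "s2": 5, "s3": 7, "s4": 4, "s5": 8, "s6": 3}   # individual -> fitness
total = sum(population.values())                                       # 34

random.seed(42)
# Each individual i is drawn with probability f(i) / sum_j f(j); repeat 6 times.
parents = random.choices(list(population), weights=list(population.values()),
                         k=len(population))
print(parents)   # e.g. ['s5', 's1', ...]: fitter strings tend to appear more often
```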

Selection (2)

Suppose that, after performing selection, we get the following population:

s1` = 1111010101 (s1)
s2` = 1110110101 (s3)
s3` = 1110111101 (s5)
s4` = 0111000101 (s2)
s5` = 0100010011 (s4)
s6` = 1110111101 (s5)


Recombination (1)

Recombination is also known as crossover.
For each couple, we decide according to a crossover probability (for instance 0.6) whether to actually perform crossover or not.
Suppose that we decide to actually perform crossover only for the couples (s1`, s2`) and (s5`, s6`).
For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second.

Recombination (2) (figure slide: the two crossovers, cut after bits 2 and 5)
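A Python sketch of the one-point crossover performed here: (s1`, s2`) are cut after bit 2 and (s5`, s6`) after bit 5, which yields exactly the s`` strings shown on the next slide (the function name is mine):

```python
def crossover(parent1, parent2, point):
    """Swap the tails of two bit strings after the given crossover point."""
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

s1c, s2c = crossover("1111010101", "1110110101", point=2)   # (s1`, s2`) cut after bit 2
s5c, s6c = crossover("0100010011", "1110111101", point=5)   # (s5`, s6`) cut after bit 5
print(s1c, s2c)   # 1110110101 1111010101
print(s5c, s6c)   # 0100011101 1110110011
```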


Mutation (1)

Before applying mutation:
s1`` = 1110110101
s2`` = 1111010101
s3`` = 1110111101
s4`` = 0111000101
s5`` = 0100011101
s6`` = 1110110011

After applying mutation:
s1``` = 1110100101
s2``` = 1111110100
s3``` = 1110101111
s4``` = 0111000101
s5``` = 0100011101
s6``` = 1110110001

Mutation (2)

The final step is to apply random mutation: for each bit that we copy to the new population, we allow a small probability of error (for instance 0.1).
Mutation causes movement in the search space (local or global) and restores lost information to the population.
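A Python sketch of this per-bit mutation step, flipping each copied bit with probability 0.1 as on the slide; because the flips are random, a given run will generally not reproduce the exact s``` strings listed above:

```python
import random

def mutate(bits, p_flip=0.1):
    """Flip each bit independently with probability p_flip."""
    return "".join(b if random.random() > p_flip else str(1 - int(b)) for b in bits)

random.seed(7)
for s in ["1110110101", "1111010101", "1110111101", "0111000101", "0100011101", "1110110011"]:
    print(mutate(s))
```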


Fitness of the New Population

After applying mutation:
s1``` = 1110100101   f(s1```) = 6
s2``` = 1111110100   f(s2```) = 7
s3``` = 1110101111   f(s3```) = 8
s4``` = 0111000101   f(s4```) = 5
s5``` = 0100011101   f(s5```) = 5
s6``` = 1110110001   f(s6```) = 6

Total fitness = 37

Example (End)

In one generation, the total population fitness changed from 34 to 37, an improvement of about 9%.
At this point, we go through the same process all over again, until a stopping criterion is met.
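Putting the pieces together, a compact Python sketch of a whole MAXONE run with L = 10, n = 6, crossover probability 0.6, and mutation probability 0.1 as in the example; the 50-generation limit and the seed are my own choices, and the operators are the ones sketched earlier, repeated so the block is self-contained:

```python
import random

L, N, P_CROSS, P_MUT, GENERATIONS = 10, 6, 0.6, 0.1, 50

fitness = lambda s: s.count("1")                 # MAXONE: number of ones in the string

def select(pop):                                 # fitness-proportionate (roulette-wheel) selection
    return random.choices(pop, weights=[fitness(s) for s in pop], k=len(pop))

def crossover(p1, p2):
    if random.random() < P_CROSS:
        point = random.randint(1, L - 1)
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1, p2

def mutate(s):
    return "".join(b if random.random() > P_MUT else str(1 - int(b)) for b in s)

random.seed(0)
population = ["".join(random.choice("01") for _ in range(L)) for _ in range(N)]
for gen in range(GENERATIONS):
    parents = select(population)
    offspring = []
    for i in range(0, N, 2):                     # pair up parents and produce children
        c1, c2 = crossover(parents[i], parents[i + 1])
        offspring += [mutate(c1), mutate(c2)]
    population = offspring
    print(gen, sum(map(fitness, population)), max(population, key=fitness))
```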

Distribution of Individuals (figure slide: distribution of individuals in generation 0 vs. generation N)

Issues

Choosing basic implementation options: representation; population size, mutation rate, etc.; selection and deletion policies; crossover and mutation operators.
Termination criteria.
Performance and scalability.
The solution is only as good as the evaluation function (often the hardest part).

When to Use a GA

Alternative solutions are too slow or overly complicated.
You need an exploratory tool to examine new approaches.
The problem is similar to one that has already been successfully solved by using a GA.
You want to hybridize with an existing solution.
The benefits of GA technology meet key problem requirements.

Conclusion

Genetic algorithms are inspired by nature, have many areas of application, and are powerful.


END