prediction by n.gopinath ap/cse december 22, 2015data mining: concepts and techniques1

PredictionBy

N.GopinathAP/CSE

April 21, 2023Data Mining: Concepts and

Techniques 1


Techniques 2

Linear Regression

Linear regression: involves a response variable y and a single predictor variable x

y = w0 + w1 x

where w0 (y-intercept) and w1 (slope) are regression coefficients

Method of least squares: estimates the best-fitting straight line

Multiple linear regression: involves more than one predictor variable

Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)

Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2

Solvable by extension of least square method or using SAS, S-Plus

Many nonlinear functions can be transformed into the above

||

1

2

||

1

)(

))((

1 D

ii

D

iii

xx

yyxxw xwyw

10


Techniques 3

Some nonlinear models can be modeled by a polynomial function

A polynomial regression model can be transformed into linear regression model. For example,

y = w0 + w1 x + w2 x2 + w3 x3

convertible to linear with new variables: x2 = x2, x3= x3

y = w0 + w1 x + w2 x2 + w3 x3

Other functions, such as power function, can also be transformed to linear model

Some models are intractable nonlinear (e.g., sum of exponential terms) possible to obtain least square estimates through

extensive calculation on more complex formulae

Nonlinear Regression


Techniques 4

Chapter 6. Classification and Prediction

What is classification? What

is prediction?

Issues regarding

classification and prediction

Classification by decision

tree induction

Bayesian classification

Rule-based classification

Classification by back

propagation

Support Vector Machines

(SVM)

Associative classification

Other classification methods

Prediction

Accuracy and error measures

Ensemble methods

Model selection

Summary


Techniques 5

Classifier Accuracy Measures

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M

Error rate (misclassification rate) of M = 1 – acc(M) Given m classes, CMi,j, an entry in a confusion matrix, indicates #

of tuples in class i that are labeled by the classifier as class j Alternative accuracy measures (e.g., for cancer diagnosis)

sensitivity = t-pos/pos /* true positive recognition rate */specificity = t-neg/neg /* true negative recognition rate */precision = t-pos/(t-pos + f-pos)accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos +

neg) This model can also be used for cost-benefit analysis

classes buy_computer = yes

buy_computer = no

total recognition(%)

buy_computer = yes

6954 46 7000 99.34

buy_computer = no

412 2588 3000 86.27

total 7366 2634 10000

95.52

C1 C2

C1 True positive False negative

C2 False positive

True negative


Techniques 6

Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual known value

Loss function: measures the error betw. yi and the predicted value

yi’

Absolute error: | yi – yi’|

Squared error: (yi – yi’)2

Test error (generalization error): the average loss over the test set Mean absolute error: Mean squared error:

Relative absolute error: Relative squared error:

The mean squared-error exaggerates the presence of outliers

Popularly use (square) root mean-square error, similarly, root relative squared error

d

yyd

iii

1

|'|

d

yyd

iii

1

2)'(

d

ii

d

iii

yy

yy

1

1

||

|'|

d

ii

d

iii

yy

yy

1

2

1

2

)(

)'(


Techniques 7

Evaluating the Accuracy of a Classifier or Predictor (I)

Holdout method Given data is randomly partitioned into two independent sets

Training set (e.g., 2/3) for model construction Test set (e.g., 1/3) for accuracy estimation

Random sampling: a variation of holdout Repeat holdout k times, accuracy = avg. of the

accuracies obtained Cross-validation (k-fold, where k = 10 is most popular)

Randomly partition the data into k mutually exclusive subsets, each approximately equal size

At i-th iteration, use Di as test set and others as training set Leave-one-out: k folds where k = # of tuples, for small sized

data Stratified cross-validation: folds are stratified so that class

dist. in each fold is approx. the same as that in the initial data


Techniques 8

Evaluating the Accuracy of a Classifier or Predictor (II)

Bootstrap Works well with small data sets Samples the given training tuples uniformly with replacement

i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set

Several boostrap methods, and a common one is .632 boostrap Suppose we are given a data set of d tuples. The data set is sampled

d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 – 1/d)d ≈ e-1 = 0.368)

Repeat the sampling procedue k times, overall accuracy of the model: ))(368.0)(632.0()( _

1_ settraini

k

isettesti MaccMaccMacc


Techniques 9



is prediction?

Issues regarding



tree induction




propagation


(SVM)



Prediction


Ensemble methods

Model selection

Summary


Techniques 10

Ensemble Methods: Increasing the Accuracy

Ensemble methods Use a combination of models to increase accuracy Combine a series of k learned models, M1, M2, …, Mk,

with the aim of creating an improved model M* Popular ensemble methods

Bagging: averaging the prediction over a collection of classifiers

Boosting: weighted vote with a collection of classifiers Ensemble: combining a set of heterogeneous classifiers


Techniques 11

Bagging: Boostrap Aggregation

Analogy: Diagnosis based on multiple doctors’ majority vote Training

Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., boostrap)

A classifier model Mi is learned for each training set Di

Classification: classify an unknown sample X Each classifier Mi returns its class prediction The bagged classifier M* counts the votes and assigns the

class with the most votes to X Prediction: can be applied to the prediction of continuous values

by taking the average value of each prediction for a given test tuple

Accuracy Often significant better than a single classifier derived from D For noise data: not considerably worse, more robust Proved improved accuracy in prediction


Techniques 12

Boosting

Analogy: Consult several doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy

How boosting works? Weights are assigned to each training tuple A series of k classifiers is iteratively learned After a classifier Mi is learned, the weights are updated to allow

the subsequent classifier, Mi+1, to pay more attention to the

training tuples that were misclassified by M i

The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy

The boosting algorithm can be extended for the prediction of continuous values

Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data


Techniques 13

Adaboost (Freund and Schapire, 1997)

Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd) Initially, all the weights of tuples are set the same (1/d) Generate k classifiers in k rounds. At round i,

Tuples from D are sampled (with replacement) to form a training set Di of the same size

Each tuple’s chance of being selected is based on its weight A classification model Mi is derived from Di

Its error rate is calculated using Di as a test set If a tuple is misclssified, its weight is increased, o.w. it is

decreased Error rate: err(Xj) is the misclassification error of tuple Xj.

Classifier Mi error rate is the sum of the weights of the misclassified tuples:

The weight of classifier Mi’s vote is )(

)(1log

i

i

Merror

Merror d

jji errwMerror )()( jX


Techniques 14



is prediction?

Issues regarding



tree induction




propagation


(SVM)



Prediction


Ensemble methods

Model selection

Summary


Techniques 15

Model Selection: ROC Curves

ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models

Originated from signal detection theory Shows the trade-off between the true

positive rate and the false positive rate The area under the ROC curve is a

measure of the accuracy of the model Rank the test tuples in decreasing

order: the one that is most likely to belong to the positive class appears at the top of the list

The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model

Vertical axis represents the true positive rate

Horizontal axis rep. the false positive rate

The plot also shows a diagonal line

A model with perfect accuracy will have an area of 1.0


Techniques 16



is prediction?

Issues regarding



tree induction




propagation


(SVM)



Prediction


Ensemble methods

Model selection

Summary


Techniques 17

Summary (I)

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.

Effective and scalable methods have been developed for decision trees induction, Naive Bayesian classification, Bayesian belief network, rule-based classifier, Backpropagation, Support Vector Machine (SVM), associative classification, nearest neighbor classifiers, and case-based reasoning, and other classification methods such as genetic algorithms, rough set and fuzzy set approaches.

Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.


Techniques 18

Summary (II)

Stratified k-fold cross-validation is a recommended method for

accuracy estimation. Bagging and boosting can be used to

increase overall accuracy by learning and combining a series of

individual models.

Significance tests and ROC curves are useful for model selection

There have been numerous comparisons of the different

classification and prediction methods, and the matter remains a

research topic

No single method has been found to be superior over all others for

all data sets

Issues such as accuracy, training time, robustness, interpretability,

and scalability must be considered and can involve trade-offs,

further complicating the quest for an overall superior method


Techniques 19

References (1)

C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.

C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.

P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.

W. Cohen. Fast effective rule induction. ICML'95. G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule

groups for gene expression data. SIGMOD'05. A. J. Dobson. An Introduction to Generalized Linear Models. Chapman

and Hall, 1990. G. Dong and J. Li. Efficient mining of emerging patterns: Discovering

trends and differences. KDD'99.


Techniques 20

References (2)

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and Sons, 2001

U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line

learning and an application to boosting. J. Computer and System Sciences, 1997. J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision

tree construction of large datasets. VLDB’98. J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree

Construction. SIGMOD'99. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data

Mining, Inference, and Prediction. Springer-Verlag, 2001. D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The

combination of knowledge and statistical data. Machine Learning, 1995. M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision

tree induction: Efficient classification in data mining. RIDE'97. B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule. KDD'98. W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on

Multiple Class-Association Rules, ICDM'01.


Techniques 21

References (3)

T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.

J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.

M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96.

T. M. Mitchell. Machine Learning. McGraw Hill, 1997. S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-

Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.


Techniques 22

References (4)

R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. VLDB’98.

J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. VLDB’96.

J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.

P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.

S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991.

S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.

X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03

H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.

prediction by n.gopinath ap/cse december 22, 2015data mining: concepts and techniques1

Documents

error betw

yi yi squared error

relative squared error

abovedata mining

linear regression model

d data

relative absolute error

test setmean absolute