
Page 1: slides

Data Mining and Machine Learning

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Page 2: slides

Reference Books

• “Data Mining” by Ian Witten and Eibe Frank.

• “Data Mining” by Jiawei Han and Micheline Kamber.

Page 3: slides

Observations and Challenges in the Information Age

• A huge volume of information has been, and is still being, digitized and stored in computers.

• Because of this volume, effective exploitation of the information is beyond the capability of human beings without the aid of intelligent computer software.

Page 4: slides

An Example of Data Mining

• Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of the objects?

Page 5: slides

Data Set

Data     Class    Data     Class    Data     Class
(15,33)    O      (18,28)    ×      (16,31)    O
( 9,23)    ×      (15,35)    O      ( 9,32)    ×
( 8,15)    ×      (17,34)    O      (11,38)    ×
(11,31)    O      (18,39)    ×      (13,34)    O
(13,37)    ×      (14,32)    O      (19,36)    ×
(18,32)    O      (25,18)    ×      (10,34)    ×
(16,38)    ×      (23,33)    ×      (15,30)    O
(12,33)    O      (21,28)    ×      (13,22)    ×

Page 6: slides

Distribution of the Data Set

[Figure: scatter plot of the data set. The "O" samples form a single cluster and the "×" samples lie around it; x-axis ticks at 10, 15, 20 and y-axis tick at 30.]

Page 7: slides

Rule Based on Observation

If 11 ≤ x ≤ 18 and 30 ≤ y ≤ 35, then class = O; else class = X.
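
The following is a minimal Python sketch of this rule, applied to the data set from Page 5. The thresholds (11 ≤ x ≤ 18 and 30 ≤ y ≤ 35) are those stated above, which fit all 24 training samples; the class "×" is written "X" in the code.

```python
# Minimal sketch of the rule above, applied to the Page 5 data set.
# Class "×" is written "X" here for ASCII convenience.
data = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"),  ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"),  ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]

def predict(x, y):
    # Inside the rectangle -> "O"; outside -> "X".
    return "O" if 11 <= x <= 18 and 30 <= y <= 35 else "X"

correct = sum(predict(x, y) == label for (x, y), label in data)
print(f"{correct}/{len(data)} classified correctly")  # 24/24
```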

Page 8: slides

Identifying the Boundary between Different Classes of Objects

Page 9: slides

Boundary Identified

Page 10: slides

Data Mining /Knowledge Discovery

• The main theme of data mining is to discover unknown and implicit knowledge in a large dataset.

• There are three main categories of data mining algorithms:
• classification;
• clustering;
• mining association rules/correlation analysis.

Page 11: slides

Data Classification

• In a data classification problem, each object is described by a set of attribute values and belongs to one of the predefined classes.

• The goal is to derive, from a given set of training samples, a set of rules that predicts which class a new object belongs to. Data classification is also called supervised learning.

Page 12: slides

Applications of Data Classification

• One example is a bank that wants to develop an automatic mechanism for deciding whether a credit card application should be approved, based on its existing customers' records.

• Another example is a hospital that wants to determine whether a new patient belongs to the high-risk group for a particular disease, based on the patient's health record.

Page 13: slides

An Example of Data Classification Applications

Attributes                                              Class
Education    Annual Income  Age     Own House  Sex     Credit Rating
High school  Middle         Young   No         Male    Poor
College      Middle         Middle  -----      Female  Good
High school  Low            Middle  No         Male    Poor
College      Middle         Young   No         Female  Good
High school  High           Old     Yes        Female  Good
College      High           Old     Yes        Male    Poor
High school  Middle         Young   Yes        Female  Poor
High school  Low            Middle  No         Male    Good
College      High           Old     Yes        Male    Good

Page 14: slides

• The rule derived is as follows:
• If (education = high school) and ~(income = high), then credit rating = poor.
• Otherwise, credit rating = good.

• Most of the time, the rules derived are not perfect; in other words, mispredictions are unavoidable in most cases. In this example, the accuracy is 7/9 ≈ 78%.
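
As a check, here is a small Python sketch that applies the derived rule to the nine records on the previous slide; only the two attributes the rule actually uses (education and annual income) are kept.

```python
# The derived rule applied to the nine training records; each record
# is (education, annual income, actual credit rating).
records = [
    ("High school", "Middle", "Poor"),
    ("College",     "Middle", "Good"),
    ("High school", "Low",    "Poor"),
    ("College",     "Middle", "Good"),
    ("High school", "High",   "Good"),
    ("College",     "High",   "Poor"),
    ("High school", "Middle", "Poor"),
    ("High school", "Low",    "Good"),
    ("College",     "High",   "Good"),
]

def credit_rating(education, income):
    # If (education = high school) and ~(income = high) -> poor.
    if education == "High school" and income != "High":
        return "Poor"
    return "Good"

hits = sum(credit_rating(e, i) == r for e, i, r in records)
print(f"accuracy = {hits}/{len(records)}")  # 7/9
```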

Page 15: slides

Representation and Inference of Knowledge

• Knowledge represented in an interpretable form, such as rules, is one of the most important outputs of data classification software.

• Some classification algorithms, e.g. neural networks and support vector machines, may perform well in prediction/classification but do not output knowledge or rules.

Page 16: slides

• In some data classification applications, we are not concerned about the knowledge on which the decisions are based. For example, a credit card company may want an automatic mechanism that determines the credit limits of new applicants.

• However, in many applications, it is of interest to learn the knowledge and even to conduct inference.

Page 17: slides

Rule Generated by an RBF Network Based Learning Algorithm for the Previous Example

Let

f_o(v) = Σ_{i=1}^{10} [1 / (2π(σ_i^o)²)] · exp( −‖v − c_i^o‖² / (2(σ_i^o)²) )

and

f_x(v) = Σ_{j=1}^{14} [1 / (2π(σ_j^x)²)] · exp( −‖v − c_j^x‖² / (2(σ_j^x)²) ),

where c_i^o, σ_i^o and c_j^x, σ_j^x are the training samples of class "O" and class "×" and their kernel widths, listed on the next slide.

If f_o(v) ≥ f_x(v), then prediction = "O".

Otherwise, prediction = "X".

Page 18: slides

Class "O" training samples and their widths:

c_i^o     σ_i^o
(15,33)   1.723
(11,31)   2.745
(18,32)   2.327
(12,33)   1.794
(15,35)   1.973
(17,34)   2.045
(14,32)   1.794
(16,31)   1.794
(13,34)   1.794
(15,30)   2.027

Class "×" training samples and their widths:

c_j^x     σ_j^x
(9,23)    6.458
(8,15)    10.08
(13,37)   2.939
(16,38)   2.745
(18,28)   5.451
(18,39)   3.287
(25,18)   10.86
(23,33)   5.322
(21,28)   5.070
(9,32)    4.562
(11,38)   3.463
(19,36)   3.587
(10,34)   3.232
(13,22)   6.260
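
Below is a minimal Python sketch of this decision rule. The per-sample widths are taken from the tables above; the 1/(2πσ²) normalization of each Gaussian kernel is an assumption, since the original formula on the slide is only partially recoverable.

```python
import math

# Sketch of the RBF decision rule. Assumption: each training sample
# contributes one 2-D Gaussian kernel with the per-sample width sigma
# listed above; the 1/(2*pi*sigma^2) normalization is assumed.
o_samples = [((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327),
             ((12, 33), 1.794), ((15, 35), 1.973), ((17, 34), 2.045),
             ((14, 32), 1.794), ((16, 31), 1.794), ((13, 34), 1.794),
             ((15, 30), 2.027)]
x_samples = [((9, 23), 6.458),  ((8, 15), 10.08),  ((13, 37), 2.939),
             ((16, 38), 2.745), ((18, 28), 5.451), ((18, 39), 3.287),
             ((25, 18), 10.86), ((23, 33), 5.322), ((21, 28), 5.070),
             ((9, 32), 4.562),  ((11, 38), 3.463), ((19, 36), 3.587),
             ((10, 34), 3.232), ((13, 22), 6.260)]

def f(v, samples):
    # Sum of Gaussian kernels centred on the training samples.
    total = 0.0
    for (cx, cy), sigma in samples:
        d2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    return total

def classify(v):
    return "O" if f(v, o_samples) >= f(v, x_samples) else "X"

print(classify((14, 33)))  # a point inside the "O" cluster -> "O"
```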

Page 19: slides

Alternative Data Classification Algorithms

• Decision tree (C4.5 and C5.0);
• Instance-based learning (KNN);
• Naïve Bayesian classifier;
• Support vector machine (SVM);

• Novel approaches, including the RBF network based classifier that we have recently proposed.

Page 20: slides

Accuracy of Different Classification Algorithms

Data set (training, test)   RBF     SVM     1NN     3NN
Satimage (4435, 2000)       92.30   91.30   89.35   90.60
Letter (15000, 5000)        97.12   97.98   95.26   95.46
Shuttle (43500, 14500)      99.94   99.92   99.91   99.92
Average                     96.45   96.40   94.84   95.33

Page 21: slides

Comparison of Execution Time(in seconds)

Task               Data set   RBF without      RBF with         SVM
                              data reduction   data reduction
Cross validation   Satimage   670              265              64622
                   Letter     2825             1724             386814
                   Shuttle    96795            59.9             467825
Make classifier    Satimage   5.91             0.85             21.66
                   Letter     17.05            6.48             282.05
                   Shuttle    1745             0.69             129.84
Test               Satimage   21.3             7.4              11.53
                   Letter     128.6            51.74            94.91
                   Shuttle    996.1            5.85             2.13

Page 22: slides

More Insights

                                                  Satimage   Letter   Shuttle
# of training samples in the original data set    4435       15000    43500
# of training samples after data reduction        1815       7794     627
% of training samples remaining                   40.92%     51.96%   1.44%
Classification accuracy after data reduction      92.15      96.18    99.32
# of support vectors identified by LIBSVM         1689       8931     287

Page 23: slides

Instance-Based Learning

• In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, …, vm) and assign the new instance to the class that is most common among those k samples.

• Classifiers that adopt instance-based learning are commonly called KNN classifiers.
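
A minimal Python sketch of a KNN classifier as described above (the helper name knn_predict is illustrative):

```python
import math
from collections import Counter

# Minimal KNN sketch: majority vote among the k nearest training
# samples under Euclidean distance.
def knn_predict(samples, query, k):
    # samples: list of ((v1, ..., vm), label) pairs
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# For example, with a few samples from the Page 5 data set:
training = [((15, 33), "O"), ((16, 31), "O"), ((18, 28), "X")]
print(knn_predict(training, (16, 32), k=3))  # "O" by a 2-to-1 vote
```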

Page 24: slides

Example of the KNN Classifiers

• If a 1NN classifier is employed, the test instance is predicted to be "×".

• If a 3NN classifier is employed, the test instance is predicted to be "O".

Page 25: slides

Data Clustering

• Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.
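
The slides do not prescribe a particular clustering algorithm; as one concrete illustration, here is a minimal Python sketch of k-means, a widely used clustering method:

```python
import math
import random

# Minimal k-means sketch (one common clustering algorithm; the
# slides do not commit to a specific one).
def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(xs) / len(cluster)
                                   for xs in zip(*cluster))
    return centers, clusters
```

For the banking example on the next slide, running a method like this with k = 3 on customer feature vectors might recover the three investor groups.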

Page 26: slides

Applications of Data Clustering

• One application is to cluster the customers of a bank so that the bank can provide services more effectively.

• For example, the bank may find the following clusters in its customers:
• aggressive investors;
• conservative investors;
• balanced investors.

Page 27: slides

Related Challenging Issues

• Two challenging issues are associated with data clustering and classification:
• feature selection;
• outlier detection.

Page 28: slides

Importance of Feature Selection

• Inclusion of features that are not correlated with the classification decision may make the problem even more complicated.

• For example, in the data set shown on the following page, including the feature corresponding to the y-axis causes the test instance to be mispredicted when a 3NN classifier is employed.

Page 29: slides

• It is apparent that the "o"s and "×"s are separated by x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the test instance correctly.

[Figure: scatter plot with the vertical class boundary x = 10; axes labelled x and y.]
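
A small Python sketch of the effect described above, using hypothetical points (invented for illustration, not taken from the slides): the classes are separated by x = 10, the y attribute is pure noise, and a 3NN classifier errs with both features but succeeds with x alone.

```python
import math
from collections import Counter

# Hypothetical data: classes are separated by x = 10, while the
# y attribute is pure noise (points invented for illustration).
training = [((8, 20), "O"), ((7, 14), "O"), ((6, 25), "O"),
            ((12, 6), "X"), ((13, 4), "X"), ((11, 7), "X")]
query = (9, 5)  # true class "O", since x < 10

def knn(samples, q, k):
    nearest = sorted(samples, key=lambda s: math.dist(s[0], q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn(training, query, 3))        # "X": the noisy y attribute misleads 3NN
x_only = [((x,), label) for (x, _), label in training]
print(knn(x_only, (query[0],), 3))    # "O": x alone classifies correctly
```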

Page 30: slides

Summary

• Data clustering and data classification have been widely used in biological and medical data analysis.

• Statistical analysis is probably the most important tool underlying the various data mining algorithms.