
Page 1: Temple University – CIS Dept. CIS664– Knowledge Discovery and Data Mining

Temple University – CIS Dept. CIS664 – Knowledge Discovery and Data Mining

V. Megalooikonomou

Introduction to Decision Trees

(based on notes by Jiawei Han and Micheline Kamber and on notes by Christos Faloutsos)

Page 2:

More details…

Decision Trees
- Problem
- Approach
  - classification through trees
  - building phase - splitting policies
  - pruning phase (to avoid over-fitting)

Page 3:

Decision Trees

Problem: Classification – i.e., given a training set (N tuples, with M attributes, plus a label attribute), find rules to predict the label for newcomers

Pictorially:

Page 4:

Decision trees

Age | Chol-level | Gender | … | CLASS-ID
----|------------|--------|---|---------
30  | 150        | M      | … | +
…   | …          | …      | … | -
…   | …          | …      | … | ??

Page 5:

Decision trees

Issues:
- missing values
- noise
- ‘rare’ events

Page 6:

Decision trees

Types of attributes:
- numerical (= continuous) - e.g., ‘salary’
- ordinal (= integer) - e.g., ‘# of children’
- nominal (= categorical) - e.g., ‘car-type’

Page 7:

Decision trees

Pictorially, we have:

[scatter plot: points labeled ‘+’ and ‘-’ over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Page 8:

Decision trees

…and we want to label ‘?’:

[scatter plot of the ‘+’ and ‘-’ points over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level), plus one unlabeled point ‘?’]

Page 9:

Decision trees

so we build a decision tree:

[scatter plot of the ‘+’ and ‘-’ points and the point ‘?’, partitioned by splits at age = 50 and chol-level = 40]

Page 10:

Decision trees

so we build a decision tree:

age < 50?
  Y: +
  N: chol. < 40?
       Y: -
       N: ...
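The same tree as code (a toy sketch; the thresholds come from the slide, and the unspecified ‘...’ branch is left undefined):

```python
def classify(age, chol):
    """Follow the decision tree above: split on age at 50, then on chol-level at 40."""
    if age < 50:
        return '+'
    if chol < 40:
        return '-'
    return None  # the '...' subtree is not specified on the slide
```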

Page 11:

Decision trees

Typically, two steps:
- tree building
- tree pruning (for over-training/over-fitting)

Page 12:

Tree building

How?

[scatter plot of the ‘+’ and ‘-’ points over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level)]

Page 13:

Tree building

How? A: Partition, recursively - pseudocode:

Partition(dataset S):
  if all points in S have the same label, then return
  evaluate splits along each attribute A
  pick the best split, to divide S into S1 and S2
  Partition(S1); Partition(S2)

Q1: how to introduce splits along attribute Ai?

Q2: how to evaluate a split?
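The pseudocode above can be made runnable. A minimal sketch, assuming purely numerical attributes, binary splits, and weighted entropy (introduced on the following slides) as the evaluation measure; all function names are illustrative:

```python
import math

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def best_split(points, labels):
    """Try every (attribute, threshold) pair; return the one minimizing
    the weighted entropy of the two resulting subsets."""
    best = None
    for a in range(len(points[0])):
        for t in sorted({p[a] for p in points}):
            left = [l for p, l in zip(points, labels) if p[a] < t]
            right = [l for p, l in zip(points, labels) if p[a] >= t]
            if not left or not right:
                continue
            cost = (len(left) * entropy_of(left)
                    + len(right) * entropy_of(right)) / len(points)
            if best is None or cost < best[0]:
                best = (cost, a, t)
    return best

def partition(points, labels):
    """Recursive partitioning, following the slide's pseudocode."""
    if len(set(labels)) == 1:          # all points in S have the same label
        return {'label': labels[0]}
    split = best_split(points, labels)
    if split is None:                  # duplicate points with mixed labels
        return {'label': max(set(labels), key=labels.count)}
    _, a, t = split
    li = [i for i, p in enumerate(points) if p[a] < t]
    ri = [i for i, p in enumerate(points) if p[a] >= t]
    return {'attr': a, 'thresh': t,
            'left': partition([points[i] for i in li], [labels[i] for i in li]),
            'right': partition([points[i] for i in ri], [labels[i] for i in ri])}
```

`partition` returns the tree as nested dicts: leaves carry a class 'label', internal nodes an attribute index and a threshold.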

Page 14:

Tree building

Q1: how to introduce splits along attribute Ai?

A1:
- for numerical attributes: binary split, or multiple split
- for categorical attributes: compute all subsets (expensive!), or use a greedy approach
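A sketch of candidate-split generation under these two policies (midpoints for numerical values; exhaustive subsets for categorical ones, the ‘expensive’ option; function names are mine):

```python
from itertools import combinations

def numerical_candidates(values):
    """Binary-split thresholds: midpoints between consecutive distinct values."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def categorical_candidates(values):
    """All non-trivial subsets -- exponential in the number of categories."""
    cats = sorted(set(values))
    return [frozenset(c) for r in range(1, len(cats))
            for c in combinations(cats, r)]
```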

Page 15:

Tree building

Q2: how to evaluate a split?
A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Page 16:

Tree building

entropy: H(p+, p-) = -p+ log2(p+) - p- log2(p-)

[plot of H against p+, for p+ from 0 to 1: H is 0 at p+ = 0 and p+ = 1, and peaks at 1 for p+ = 0.5]

p+: relative frequency of class + in S

Any other measure?

Page 17:

Tree building

entropy: H(p+, p-)

[plot of H against p+, for p+ from 0 to 1: 0 at the endpoints, peak at p+ = 0.5]

‘gini’ index: 1 - p+^2 - p-^2

[plot of the gini index against p+, for p+ from 0 to 1: 0 at the endpoints, peak at p+ = 0.5]

p+: relative frequency of class + in S
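Both measures in code, as a small sketch (the two-class entropy formula is standard; function names are mine):

```python
import math

def entropy(p_pos):
    """H(p+, p-) = -p+ log2(p+) - p- log2(p-); maximal (1 bit) at p+ = 0.5."""
    h = 0.0
    for p in (p_pos, 1 - p_pos):
        if p > 0:
            h -= p * math.log2(p)
    return h

def gini(p_pos):
    """1 - p+^2 - p-^2; maximal (0.5) at p+ = 0.5."""
    return 1 - p_pos**2 - (1 - p_pos)**2
```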

Page 18:

Tree building

Intuition:
- entropy: # bits to encode the class label
- gini: classification error, if we randomly guess ‘+’ with prob. p+

Page 19:

Tree building

Thus, we choose the split that reduces entropy/classification-error the most. E.g.:

[scatter plot of the ‘+’ and ‘-’ points over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level), with a candidate split into two halves]

Page 20:

Tree building

Before the split we need

(n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13)

bits in total, to encode all the class labels.

After the split we need 0 bits for the first half, and

(2+6) * H(2/8, 6/8)

bits for the second half.
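Checking the slide's arithmetic with the two-class entropy:

```python
import math

def entropy(p):
    """Two-class entropy H(p, 1-p) in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

before = (7 + 6) * entropy(7 / 13)    # ~12.94 bits before the split
after = 0 + (2 + 6) * entropy(2 / 8)  # 0 bits (pure half) + ~6.49 bits
assert after < before                  # the split pays for itself
```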

Page 21:

Tree pruning

What for?

[scatter plot of the ‘+’ and ‘-’ points over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level); the splitting can continue (‘...’)]

Page 22:

Tree pruning

Q: How to do it?

[scatter plot of the ‘+’ and ‘-’ points over num. attr#1 (e.g., ‘age’) and num. attr#2 (e.g., chol-level); the splitting can continue (‘...’)]

Page 23:

Tree pruning

Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
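A sketch of this train/test (‘reduced-error’) pruning, assuming the tree is stored as nested dicts ({'label': c} at leaves; 'attr', 'thresh', 'left', 'right' at internal nodes). One simplification is mine: each pruned subtree becomes a leaf predicting the majority class of the held-out examples reaching it.

```python
def classify(tree, point):
    """Follow the splits down to a leaf."""
    while 'label' not in tree:
        tree = tree['left'] if point[tree['attr']] < tree['thresh'] else tree['right']
    return tree['label']

def accuracy(tree, points, labels):
    return sum(classify(tree, p) == l for p, l in zip(points, labels)) / len(points)

def prune(tree, points, labels):
    """Bottom-up: collapse a subtree into a leaf whenever the leaf classifies
    the held-out examples reaching that subtree at least as well."""
    if 'label' in tree or not labels:
        return tree
    a, t = tree['attr'], tree['thresh']
    li = [i for i, p in enumerate(points) if p[a] < t]
    ri = [i for i, p in enumerate(points) if p[a] >= t]
    tree = dict(tree,
                left=prune(tree['left'], [points[i] for i in li], [labels[i] for i in li]),
                right=prune(tree['right'], [points[i] for i in ri], [labels[i] for i in ri]))
    leaf = {'label': max(set(labels), key=labels.count)}
    if accuracy(leaf, points, labels) >= accuracy(tree, points, labels):
        return leaf
    return tree
```

One standard answer to the slide's “Drawbacks?”: the held-out ‘testing’ set spends labeled data that could otherwise have been used for building the tree.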

Page 24:

Tree pruning

Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)

A2: or, rely on MDL (= Minimum Description Length) - in detail:

Page 25:

Tree pruning

Envision the problem as compression, and try to minimize the # bits needed to compress:
(a) the class labels, AND
(b) the representation of the decision tree

Page 26:

(MDL) a brilliant idea - e.g., the best degree-n polynomial to compress these points is the one that minimizes (sum of errors + n)
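This idea is easy to sketch (assuming NumPy; the cost ‘sum of squared errors + n’ treats one polynomial coefficient as one unit of description length, a deliberately crude choice of mine):

```python
import numpy as np

def mdl_best_degree(x, y, max_degree=8):
    """Pick the polynomial degree minimizing (sum of squared errors + n):
    a data-encoding cost plus a crude model-encoding cost."""
    best_deg, best_cost = 0, float('inf')
    for n in range(max_degree + 1):
        coeffs = np.polyfit(x, y, n)
        resid = y - np.polyval(coeffs, x)
        cost = float(resid @ resid) + n
        if cost < best_cost:
            best_deg, best_cost = n, cost
    return best_deg
```

A degree that is too low pays in errors; one that is too high pays in model bits, so the minimum sits at the ‘right’ complexity.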

Page 27:

Major Issues in Data Mining

Issues relating to the diversity of data types:
- Handling relational as well as complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)

Issues related to applications and social impacts:
- Application of discovered knowledge: domain-specific data mining tools, intelligent query answering, process control and decision making
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy

Page 28:

Summary

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining