

Chapter 9

Classification and regression trees

Fall 2014, Department of IME, Hanyang University


9.2 Classification trees

• Data-driven method for classification (classification tree) and prediction (regression tree)
• Tree method developed by Breiman et al. → CART
• Observations are separated into subgroups based on predictors
• Goal: classify or predict an outcome based on a set of predictors
• The output is a set of rules
• Recursive partitioning and pruning, with consideration of the homogeneity (purity) of the resulting subgroups


Example:

• Goal: classify a record as "will accept credit card offer" or "will not accept"
• A rule might be: "IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)" (see the sketch after this list)
• Also called CART, Decision Trees, or just Trees
• Rules are represented by tree diagrams
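To make this concrete, here is the example rule as executable code: a minimal sketch in Python, assuming a record is a plain dict keyed by predictor name (an illustrative representation, not part of the slides).

```python
# A minimal sketch: the example rule written as executable code.
# Representing a record as a dict keyed by predictor name is an
# illustrative assumption.

def classify(record):
    """Apply the example rule; returns 0 (nonacceptor) if the rule fires."""
    if record["Income"] > 92.5 and record["Education"] < 1.5 and record["Family"] <= 2.5:
        return 0  # nonacceptor
    return None  # other branches of the tree handle the remaining cases

print(classify({"Income": 100.0, "Education": 1.0, "Family": 2}))  # -> 0
```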

9.1 Introduction


[Figure: tree diagram for the personal loan example]
• Number in a circle (decision) node: splitting value
• Terminal (square) node: acceptor (1) or nonacceptor (0)
• Number on a fork: number of records
• The split variable (predictor) is shown with each decision node


“IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)”


9.2 Classification trees

Recursive partitioning

• Divide the p-dimensional predictor space into two regions by a split value $s$ of a predictor $x_j$, so that the records fall into $\{i = 1, \ldots, n : x_{ij} \le s\}$ and $\{i = 1, \ldots, n : x_{ij} > s\}$
• Measure how "pure" or homogeneous each of the resulting portions is ("pure" = containing records of mostly one class)
• The algorithm tries different values of $j$ and $s$ to maximize purity in the split
• After the "maximum purity" split is found, repeat the process for a second split, and so on
• Repeat until the leaves are "pure" (each belonging to one class); see the sketch below
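A minimal sketch of one recursive partitioning step, assuming a NumPy feature matrix, Gini impurity as the purity measure, and midpoints between sorted values as the candidate splits (all illustrative choices, not the textbook's code):

```python
# One step of recursive partitioning: try every (variable, split value)
# pair and keep the split with the lowest weighted Gini impurity.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search over all predictors and candidate split values."""
    best = (None, None, np.inf)  # (variable index, split value, impurity)
    n = len(y)
    for j in range(X.shape[1]):
        # candidate splits: midpoints between consecutive sorted values
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if impurity < best[2]:
                best = (j, s, impurity)
    return best  # recurse on each side until a stopping rule fires
```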


9.2 Classification trees

Example 1: Riding mowers

• 12 owners and 12 non-owners of riding mowers
• Predictors: Income (x1) and Lot_Size (x2)

Income  Lot_Size  Ownership
60.0    18.4      owner
85.5    16.8      owner
64.8    21.6      owner
61.5    20.8      owner
87.0    23.6      owner
110.1   19.2      owner
108.0   17.6      owner
82.8    22.4      owner
69.0    20.0      owner
93.0    20.8      owner
51.0    22.0      owner
81.0    20.0      owner
75.0    19.6      non-owner
52.8    20.8      non-owner
64.8    17.2      non-owner
43.2    20.4      non-owner
84.0    17.6      non-owner
49.2    17.6      non-owner
59.4    16.0      non-owner
66.0    18.4      non-owner
47.4    16.4      non-owner
33.0    18.8      non-owner
51.0    14.0      non-owner
63.0    14.8      non-owner

The first split: Lot_Size = 19,000


9.2 Classification trees

First Split – The Tree

[Figure: single-split tree asking "Is Lot_Size ≤ 19,000?" with Yes and No branches]

9.3 Measure of impurity

Gini impurity index

• Assume there are $m$ classes, and let $p_k$ = the proportion of observations in rectangle $A$ belonging to class $k$

$$I(A) = 1 - \sum_{k=1}^{m} p_k^2, \qquad 0 \le I(A) \le (m-1)/m$$

• If all observations belong to the same class, $I(A) = 0$ (why?)
• If the observations are equally distributed across the $m$ classes, $I(A) = (m-1)/m$
• In the two-class case, $I(A)$ is maximized at $p = 0.5$
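A direct transcription of the Gini index for checking the boundary cases above; taking a list of class proportions as input is an illustrative choice.

```python
# Gini index I(A) = 1 - sum_k p_k^2, taking the class proportions directly.
def gini_index(proportions):
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini_index([1.0, 0.0]))  # all obs in one class -> 0.0
print(gini_index([0.5, 0.5]))  # two equal classes -> 0.5, the two-class maximum
```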


9.3 Measure of impurity

Entropy measure

• Assume there are $m$ classes, and let $p_k$ = the proportion of observations in rectangle $A$ belonging to class $k$

$$\mathrm{Entropy}(A) = -\sum_{k=1}^{m} p_k \log_2(p_k), \qquad 0 \le \mathrm{Entropy}(A) \le \log_2(m)$$

• If all observations belong to the same class, $\mathrm{Entropy}(A) = 0$ (why?)
• If the observations are equally distributed across the $m$ classes, $\mathrm{Entropy}(A) = \log_2(m)$
• In the two-class case, $\mathrm{Entropy}(A)$ is maximized at $p = 0.5$
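The entropy measure transcribes the same way; zero proportions are skipped because $p \log_2(p)$ tends to 0 as $p$ tends to 0.

```python
# Entropy(A) = -sum_k p_k * log2(p_k), taking the class proportions directly.
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0, 0.0]))  # pure node -> 0.0
print(entropy([0.5, 0.5]))  # two equal classes -> 1.0 = log2(2), the maximum
```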


9.3 Measure of impurity

[Figure: node impurity (Gini and entropy) as a function of p]

Note that both impurity measures are at their peak when p = 0.5, i.e., when the data contain 50% of each of the two classes.


9.3 Measure of impurity

In Example 1:

Before the split (12 owners and 12 nonowners): Gini = 0.5, Entropy = 1

After the split (x2 ≥ 19 and x2 < 19):
• Upper rectangle (9 owners and 3 nonowners): Gini = 0.375, Entropy = 0.811
• Lower rectangle (3 owners and 9 nonowners): Gini = 0.375, Entropy = 0.811
• Combined impurity (weighted average): Gini = 0.375, Entropy = 0.811

Both impurity indices decrease after the split.
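These numbers can be checked directly; a short sketch that restates the two impurity functions above so it runs on its own:

```python
# Checking the Example 1 numbers.
import math

def gini_index(ps):
    return 1.0 - sum(p * p for p in ps)

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

before = [12 / 24, 12 / 24]   # 12 owners, 12 nonowners
upper = [9 / 12, 3 / 12]      # upper rectangle: 9 owners, 3 nonowners
lower = [3 / 12, 9 / 12]      # lower rectangle: 3 owners, 9 nonowners

print(gini_index(before), entropy(before))                # 0.5 1.0
# each rectangle holds 12 of the 24 records, so both weights are 0.5
print(0.5 * gini_index(upper) + 0.5 * gini_index(lower))  # 0.375
print(0.5 * entropy(upper) + 0.5 * entropy(lower))        # 0.811...
```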

9.2 Classification trees

Second split: Income = $84,750
Third split: Income = $57,150


9.2 Classification trees

[Figure: tree after three splits]
[Figure: tree after all splits (until when?)]


9.3 Measure of impurity

Tree after all splits

[Figure: full-grown tree for the riding mower example]
• Decision node: the number inside the circle is the splitting value
• Fork: the left branch holds records less than or equal to the splitting value, the right branch those greater; the numbers are record counts
• Leaf: a final rectangle
• The split variable is written below each circle


9.3 Measure of impurity

Determining the leaf node label

• Each leaf node's label is determined by "voting" of the records within it, together with a cutoff value
• The records within each leaf node come from the training data
• The default cutoff of 0.5 means the leaf node's label is the majority class
• A cutoff of 0.75 requires 75% or more "1" records in the leaf to label it a "1" node (see the sketch after this list)
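A minimal sketch of the cutoff rule, assuming the leaf's training records are stored as a list of 0/1 labels (an illustrative representation):

```python
# Label a leaf node by voting, subject to a cutoff on the class-1 proportion.
def leaf_label(records, cutoff=0.5):
    """Label the leaf '1' if the proportion of 1-records meets the cutoff."""
    p1 = sum(records) / len(records)
    return 1 if p1 >= cutoff else 0

leaf = [1, 1, 1, 0]            # 75% of records are class 1
print(leaf_label(leaf))        # cutoff 0.5 -> 1 (majority vote)
print(leaf_label(leaf, 0.8))   # cutoff 0.8 -> 0 (75% < 80%)
```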


9.3 Measure of impurity

Then, at a node, how do we determine the split value of $x_j$?
① Generate candidate split questions
② Compute the information gain (IG) for each question
③ Select the question with the largest IG

When to stop?
① When impurity = 0
② When the number of records in a leaf falls below some threshold
③ When IG falls below some threshold
④ Pruning


9.3 Measure of impurity

Then, at a node, how do we determine the split value of $x_j$?
① Generate candidate split questions
② Compute the information gain (IG) for each question, then select the one with the largest IG

IG = the reduction in entropy caused by partitioning the data according to this variable.

Ex. Suppose node S is a collection of 14 examples [9 yes's, 5 no's]. Then

$$\mathrm{Entropy}([9y, 5n]) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.94$$


9.3 Measure of impurity

$$IG(S, x_i) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(x_i)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$

• values($A$): the set of all possible values for variable $A$
• $S_v$: the subset of $S$ for which variable $A$ has value $v$
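A transcription of IG for a categorical variable; representing S as a list of (value, label) pairs is an illustrative assumption.

```python
# Information gain IG(S, x) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """pairs: list of (variable value, class label) tuples forming S."""
    labels = [label for _, label in pairs]
    ig = entropy_of(labels)
    for v in set(x for x, _ in pairs):
        subset = [label for x, label in pairs if x == v]
        ig -= len(subset) / len(pairs) * entropy_of(subset)
    return ig
```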


9.3 Measure of impurity

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

For the variable Windy:

values(Windy) = {false, true}
S = [9y, 5n],  S_false = [6y, 2n],  S_true = [3y, 3n]

$$IG(S, \mathrm{Windy}) = \mathrm{Entropy}(S) - \sum_{v \in \{false,\,true\}} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v) = \mathrm{Entropy}(S) - \tfrac{8}{14}\mathrm{Entropy}(S_{false}) - \tfrac{6}{14}\mathrm{Entropy}(S_{true}) = 0.048$$

Similarly, IG(S, Humidity) = 0.151, so Humidity provides greater information gain than Windy (see the sketch below).
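The Windy and Humidity numbers can be reproduced from the table; the helpers are restated from the sketch above so the snippet runs alone. Note that full-precision arithmetic gives about 0.152 for Humidity, while the slide reports 0.151, presumably from rounded intermediate entropies.

```python
# Reproducing the information-gain comparison on the weather table.
import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    labels = [label for _, label in pairs]
    ig = entropy_of(labels)
    for v in set(x for x, _ in pairs):
        subset = [label for x, label in pairs if x == v]
        ig -= len(subset) / len(pairs) * entropy_of(subset)
    return ig

# columns taken row by row from the table above
windy = ["false", "true", "false", "false", "false", "true", "true",
         "false", "false", "false", "true", "true", "false", "true"]
humidity = ["high", "high", "high", "high", "normal", "normal", "normal",
            "high", "normal", "normal", "normal", "high", "normal", "high"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(round(information_gain(list(zip(windy, play))), 3))     # 0.048
print(round(information_gain(list(zip(humidity, play))), 3))  # 0.152 (slide: 0.151)
```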


9.4 Evaluating the performance of a classification tree

Example 2: Acceptance of a personal loan

• Analyze the previous campaign to plan a better one
• What combination of factors makes customers accept a personal loan?
• Predictors: demographic info (age, income, etc.), response to the last campaign, relationship with the bank (mortgage, security account, etc.)
• Last campaign: 480 customers accepted out of 5,000 (9.6%)

Partition: 2,500 records for the training set, 1,500 for the validation set, 1,000 for the test set


9.4 Evaluating the performance of a classification tree

The full-grown tree is 100% accurate on the training set, because every leaf holds a pure class. This is the problem of overfitting.
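A hedged illustration of this pattern with scikit-learn (not the tool used in the course); the data is synthetic, and only the training/validation contrast matters.

```python
# A full-grown tree scores 100% on training data but lower on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, random_state=1)  # stand-in data
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=2500, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)  # no stopping rule
print(full.score(X_tr, y_tr))  # 1.0: every leaf is pure
print(full.score(X_va, y_va))  # noticeably lower on validation data
```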


9.4 Evaluating the performance of a classification tree

• Accuracy is not 100% on the validation and test sets
• We need to stop before the tree is full-grown, or prune the full-grown tree back

9.5 Avoiding overfitting

"Overfitting": the deepest splits are based on too small a number of observations
→ stop tree growth early (CHAID), or prune the tree back (CART, C4.5)


9.5 Avoiding overfitting

Stopping tree growth: CHAID

• Stop when? At a fixed tree depth? A minimum number of observations per leaf? A minimum reduction in impurity? etc. (see the sketch after this list)
• CHAID (chi-squared automatic interaction detection): splitting stops when the purity improvement is not statistically significant
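CHAID itself is not implemented in scikit-learn, but the generic stopping rules listed above map onto its tree parameters; a sketch of the analog:

```python
# Early-stopping analogs of the rules above, as scikit-learn parameters.
from sklearn.tree import DecisionTreeClassifier

stopped = DecisionTreeClassifier(
    max_depth=5,                  # stop at a fixed tree depth
    min_samples_leaf=20,          # stop: minimum number of obs per leaf
    min_impurity_decrease=0.001,  # stop: minimum reduction in impurity
)
```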


9.5 Avoiding overfitting

Pruning the tree

• Let the tree grow to its full extent, then prune it back
• In the riding mower example, the rectangles created by the last few splits contain only a single record → these can be regarded as capturing noise
• Pruning means converting a decision node into a leaf
• The goal is to find the point at which the error rate on unseen data starts to increase


9.5 Avoiding overfitting

Pruning the tree (CART)

Use the cost complexity (CC) of a tree $T$:

$$CC(T) = \mathrm{err}(T) + \alpha\, L(T)$$

where
• $\mathrm{err}(T)$ = the cost due to misclassification by tree $T$
• $L(T)$ = the number of leaves in $T$
• $\alpha$ = a penalty factor (set by the user)

(As the tree grows larger the error shrinks, but the penalty grows.)
(The larger $\alpha$ is, the higher the CC of large trees.)

• Among trees of a given size (number of decision nodes), choose the one with the lowest CC (a scikit-learn sketch follows)
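scikit-learn implements CART-style cost-complexity pruning, exposing the penalty factor as ccp_alpha; a sketch on stand-in data (the data and variable names are illustrative):

```python
# Cost-complexity pruning path: the alpha values at which nodes get pruned.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=2500, random_state=1)  # stand-in
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
# path.ccp_alphas: increasing alpha values; larger alpha -> smaller tree
pruned_trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]
# pick among pruned_trees using validation-set error, as described next
```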


9.5 Avoiding overfitting

• Do this for each tree size
→ for each size, obtain the tree with the minimum CC
• Among these trees, select the one whose misclassification error on the validation set is minimum (the "minimum error tree"), or
• Select the smallest tree within one standard error of the minimum error tree (the "best pruned tree"), where the standard error is

$$\sqrt{\frac{p(1-p)}{n}},$$

with $p$ the validation-set error rate and $n$ the number of records in the validation set
• This is an adjustment for sampling error (the validation data were also used in the pruning process, and a different sample could have given a different result)


9.5 Avoiding overfitting

[Figure: pruning sequence from the full-grown tree down to the minimum error tree and the best pruned tree]

The minimum error tree has a validation error rate of 0.0147, so one standard error above it is

$$0.0147 + \sqrt{\frac{0.0147\,(1 - 0.0147)}{1500}} = 0.0178 = 1.78\%.$$

The best pruned tree is then the smallest tree whose validation error does not exceed 1.78%, as computed in the sketch below.
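The standard-error computation above checks out directly:

```python
# One standard error above the minimum validation error rate.
import math

p, n = 0.0147, 1500            # validation error rate, validation set size
se = math.sqrt(p * (1 - p) / n)
print(p + se)                  # ~0.0178: the best pruned tree is the smallest
                               # tree whose validation error stays below 1.78%
```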


9.6 Classification rules from trees

The tree makes classification easy to express as rules.

Ex. A rule might be: "IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)"

• Can be extended to the case of more than two classes
• Each leaf node is labeled with one of the m classes


9.8 Regression trees

• Used with a continuous outcome variable
• Procedure similar to a classification tree
• Many splits are attempted; choose the one that minimizes impurity

Ex. Predicting Toyota Corolla prices (see Ch 6): 10 predictors, 600 training records. The best tree uses only two predictors: Age and Horsepower.

Prediction: the value of a leaf node is the average of the data in that leaf (cf. classification tree: the class of a leaf is decided by a "vote" of the data in it).

Measuring impurity: typical measure is the sum of squared deviations from the mean of the leaf.

Evaluating performance: the same way as for other predictive methods, e.g., RMSE (root mean squared error), lift charts, etc.; a sketch follows.
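A hedged regression-tree sketch with scikit-learn on synthetic stand-in data (the slide's example uses the Toyota Corolla data with Age and Horsepower):

```python
# Regression tree: leaf value = mean of its training records; evaluated by RMSE.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=2, noise=10, random_state=1)
reg = DecisionTreeRegressor(min_samples_leaf=10).fit(X, y)
rmse = mean_squared_error(y, reg.predict(X)) ** 0.5  # root-mean-squared error
print(rmse)
```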


9.9 Advantages, weaknesses, and extensions

• Good classifier, useful for variable selection
• No need to transform variables
• Variables are selected automatically by means of the splits (pruning may leave only a subset of the variables)
• Robust to outliers (a split depends only on the order of the records' values, not their magnitudes)
• But sensitive to changes in the data (a slight change can cause a very different set of splits)
• No particular relationship between the predictors and the response is assumed in advance (cf. linear regression assumes a linear relationship)


9.9 Advantages, weaknesses, and extensions

• Since each split involves only one predictor, interactions between predictors are not considered
• Performance degrades when horizontal and vertical splitting is not appropriate for the data
• In such cases, discriminant analysis (Ch 12) may be preferable


Ch. 9 Problems

• 9.1 (eBay.com)
• 9.2 (Flight delay)
• 9.3 (Regression tree – Toyota Corolla)