comp307 week 1 - victoria university of wellington · comp307 week 3 (tutorial) 1. announcements...

11
COMP307 Week 3 (Tutorial) 1. Announcements Assignment 1 (15% ) Helpdesk sessions 2. Sets Training and Test sets Validation set 3. Datasets Instances Features and feature vectors Class label 4. 3-K Techniques k-Nearest Neighbour k-fold Cross Validation k-Means Clustering 5. Decision Trees DT learning vs learned DT Impurity measure Conditions Pruning 6. Other Questions

Upload: others

Post on 16-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

COMP307 Week 3 (Tutorial)

1. Announcements• Assignment 1 (15%)• Helpdesk sessions

2. Sets• Training and Test sets• Validation set

3. Datasets• Instances• Features and feature vectors• Class label

4. 3-K Techniques• k-Nearest Neighbour• k-fold Cross Validation• k-Means Clustering

5. Decision Trees• DT learning vs learned DT• Impurity measure Conditions • Pruning

6. Other Questions

Page 2: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Over-fitting

Page 3: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Validation Set

• What?• Why?• How?

Page 4: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Datasets and Instances

12.7 2.6 55.3 . . . 15.0 A

f1 f2 f3 f4 f5 . . . class10 2.2 45 3.7 22.1 . . . A3.7 7.9 12 2.1 17.5 . . . A22.8 27.9 11.4 36 77 . . . B90.4 6.34 2.77 15.8 53.7 . . . A74.6 4.78 84.9 15.9 103 . . . B2.89 14.7 3.11 10 52 . . . B

Page 5: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

K-Nearest Neighbour

10.3 45.7 2.7 A

7.1 80.5 1.1 A

22.3 20.4 9.6 B

30.5 21.2 17.9 B

5.2 67.1 7.7 A

15.6 18.6 11.4 B

11.9 53.4 6.3 A

Training Set

f1 f2 f3 class

2.1 33.5 4.7 ?14.84

47.40

24.57

33.65

33.88

21.19

22.24

2.1 33.5 4.7 A

Page 6: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

k-fold Cross Validation75 %

92 %

43 %

76 %

30 %

82 %

83 %

47 %

50 %

73 %

65.10 %Dataset

Training Test

50 % 50 %

• How do we specify the number of instances in each fold?• How do we select those instances?

Page 7: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

k-Means Clustering

Page 8: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Decision Tress (DT)

• DT learning ≠ learned DT• Impurity measure

1. 0, if all instances belong to one class2. Max, equal number of instances for both classes3. Continuous –smooth

• Large vs. small trees?

Page 9: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Example (Training) Dataset• Approve/Reject a loan application?

6

Applicant Job Deposit Family Class

1 true low single Approve

2 true low couple Approve

3 true low single Approve

4 true high single Approve

5 false high couple Approve

6 false low couple Reject

7 true low children Reject

8 false low single Reject

9 false high children Reject

Page 10: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Numeric Features• Can split on a simple comparison

– Which split point?– Consider class boundaries

23

Deposit < $10K

True False

Applicant Job Deposit Family Class

8 false $3K single Reject

3 true $4K single Approve

6 false $6K couple Reject

2 true $7K couple Approve

7 true $8K children Reject

1 true $10K single Approve

4 true $16K single Approve

5 false $18K couple Approve

9 false $30K children Reject

<$4K

<$6K

<$7K

<$8K

<$10K

<$30K

Page 11: COMP307 Week 1 - Victoria University of Wellington · COMP307 Week 3 (Tutorial) 1. Announcements •Assignment 1 (15%) •Helpdesk sessions 2. Sets •Training and Test sets •Validation

Other Questions

• Whether k-NN can cope with categorical data/features?• Whether k-Means can lead to the same clusters?• How do we select/initialise the seeds in k-Means?