Classification and Decision Trees - ETH Z

Iza Moise, Evangelos Pournaras, Dirk Helbing

Post on 17-Mar-2020

Overview

• Classification

• Decision Trees

Classification

Definition

Classification is a data mining function that assigns items in a collection to target categories or classes.

The goal is to accurately predict the target class for each data point.

• Supervised: the model is trained on data with known class labels

• Outcome → class

Types of Classification

• binary classification → the target attribute has only two values

• multi-class classification → the target attribute has more than two values

• crisp classification → given an input, the classifier returns its label

• probabilistic classification → given an input, the classifier returns its probability of belonging to each class
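
The difference between crisp and probabilistic output can be illustrated with a short Python sketch. The class scores and the helper names `predict_proba` and `predict_label` are hypothetical and chosen only for illustration: a probabilistic classifier returns one probability per class, and a crisp label can be derived from it by picking the most probable class.

```python
def predict_proba(scores):
    """Normalise non-negative class scores into probabilities (probabilistic output)."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def predict_label(scores):
    """Return only the most probable class (crisp output)."""
    probabilities = predict_proba(scores)
    return max(probabilities, key=probabilities.get)


# Hypothetical scores for one input, e.g. vote counts produced by some model.
scores = {"spam": 3, "not spam": 1}
probabilities = predict_proba(scores)   # {'spam': 0.75, 'not spam': 0.25}
label = predict_label(scores)           # 'spam'
```

With only two keys in `scores` this is a binary problem; adding more keys turns the same sketch into a multi-class one.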

Applications

Classification Example: Spam Filtering

Classify e-mail as “Spam” or “Not Spam”

Source: Machine Learning: CS 6375 Introduction, Instructor: Vibhav Gogate, The University of Texas at Dallas

Applications [cont.]

Classification Example: Weather Prediction

Source: Machine Learning: CS 6375 Introduction, Instructor: Vibhav Gogate, The University of Texas at Dallas

Applications [cont.]

• Customer Target Marketing

• Medical Disease Diagnosis

• Supervised Event Detection

• Multimedia Data Analysis

• Document Categorization and Filtering

• Social Network Analysis

A Three-Phase Process

1. Training phase: a model is constructed from the training instances.

→ the classification algorithm finds relationships between predictors and targets

→ relationships are summarised in a model

→ train the model on data with known labels (training data)

2. Testing phase: test the model on a test sample whose class labels are known but not used for training the model (testing data)

3. Usage phase: use the model for classification on new data whose class labels are unknown (new data)
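
The three phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the course: the data set is invented, and the "model" is a deliberately simple lookup table that maps each attribute value to its most frequent class.

```python
from collections import Counter, defaultdict

# Hypothetical labelled data: (outlook, play) pairs.
labelled = [
    ("sunny", "no"), ("sunny", "no"), ("rainy", "yes"), ("rainy", "yes"),
    ("overcast", "yes"), ("sunny", "no"), ("rainy", "yes"), ("overcast", "yes"),
]
# Labels of the testing data are known but never shown to train().
training_data, testing_data = labelled[:6], labelled[6:]


def train(examples):
    """Training phase: summarise predictor/target relationships in a model."""
    counts = defaultdict(Counter)
    for value, label in examples:
        counts[value][label] += 1
    # The model maps each attribute value to its most frequent class.
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(model, examples):
    """Testing phase: fraction of held-out examples classified correctly."""
    correct = sum(model.get(value) == label for value, label in examples)
    return correct / len(examples)


model = train(training_data)
test_accuracy = accuracy(model, testing_data)

# Usage phase: classify a new data point whose label is unknown.
prediction = model.get("sunny")
```

Keeping `testing_data` out of `train` is the point of the second phase: accuracy measured on the training instances themselves would overstate how well the model handles new data.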

Training Phase - Model Construction

Source: Data Warehousing and Data Mining, Instructor: Prof. Hany Saleeb

Testing Phase - Model usage

Source: Data Warehousing and Data Mining, Instructor: Prof. Hany Saleeb

Methods of classification

• Decision Trees

• k-Nearest Neighbours

• Neural Networks

• Logistic Regression

• Linear Discriminant Analysis

Decision Trees

Main principles

A decision tree creates a hierarchical partitioning of the data which relates the different partitions at the leaf level to the different classes.

Data requirements:

• Attribute-value description: each object is expressible in terms of a fixed collection of properties or attributes, with values such as hot, mild, or cold.

• Predefined classes (target values): the target function has discrete output values (boolean or multi-class).

• Sufficient data: enough training cases should be provided to learn the model.
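
As a rough illustration of the first two requirements, the sketch below stores a data set in attribute-value form and checks each example against a fixed attribute set and a predefined set of classes. The attribute names, values, and class labels are hypothetical.

```python
# Hypothetical weather data in attribute-value form.
ATTRIBUTES = {"temperature": {"hot", "mild", "cold"}, "windy": {True, False}}
CLASSES = {"play", "stay home"}

examples = [
    ({"temperature": "hot", "windy": False}, "play"),
    ({"temperature": "cold", "windy": True}, "stay home"),
    ({"temperature": "mild", "windy": False}, "play"),
]


def is_valid(example):
    """Check that an example fits the fixed attribute set and predefined classes."""
    features, label = example
    same_attributes = features.keys() == ATTRIBUTES.keys()
    known_values = same_attributes and all(
        features[name] in ATTRIBUTES[name] for name in ATTRIBUTES
    )
    return known_values and label in CLASSES


all_valid = all(is_valid(example) for example in examples)
```

Whether the data is "sufficient" cannot be checked this way; that depends on the problem and is usually judged from accuracy on held-out data.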

Main principles [cont.]

• decision node = test on an attribute

• branch = an outcome of the test

• leaf node = classification or decision

• root = the best predictor

• path: a conjunction of tests to make the final decision

Classification of a new instance is done by following the matching path from the root to a leaf node.
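
The terminology above maps directly onto a small data structure. The following Python sketch is one possible representation, assumed here for illustration: a decision node is a dictionary holding the tested attribute and one subtree per outcome, and a leaf is just a class label. The weather attributes and labels are hypothetical.

```python
# A decision node is a dict {"attribute": name, "branches": {outcome: subtree}};
# a leaf node is simply the class label (a string).
tree = {
    "attribute": "outlook",               # root: the best predictor
    "branches": {
        "overcast": "play",               # leaf
        "sunny": {
            "attribute": "humidity",      # decision node
            "branches": {"high": "stay home", "normal": "play"},
        },
        "rainy": {
            "attribute": "windy",
            "branches": {"yes": "stay home", "no": "play"},
        },
    },
}


def classify(node, instance):
    """Follow the matching path from the root to a leaf and return its label."""
    while isinstance(node, dict):
        outcome = instance[node["attribute"]]   # evaluate the test
        node = node["branches"][outcome]        # follow the matching branch
    return node


label = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "no"})
```

For this instance the path tests `outlook` first and `humidity` second; `windy` is never consulted because it does not lie on the matching path.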

Source: Dr. Saed Sayad, Adjunct Professor at the University of Toronto

Split criterion

A condition (or predicate) on:

• a single attribute → univariate split

• multiple attributes → multivariate split

• Recursively split the training data

• Goal: maximize the information gain (the discrimination among the classes)

→ how well an attribute separates the examples according to their target classification
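
The slides do not spell out how information gain is computed. A common choice, assumed here, is the entropy-based definition: the entropy of the class labels before the split minus the size-weighted average entropy of the partitions after it. The sketch below applies it to a univariate split on an invented data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute):
    """Entropy reduction obtained by a univariate split on `attribute`."""
    labels = [label for _, label in examples]
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append(label)
    remainder = sum(
        len(part) / len(examples) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical examples: (attribute values, class label).
examples = [
    ({"outlook": "sunny", "windy": "no"}, "stay home"),
    ({"outlook": "sunny", "windy": "yes"}, "stay home"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "play"),
]
gain_outlook = information_gain(examples, "outlook")  # separates the classes perfectly
gain_windy = information_gain(examples, "windy")      # carries no information here
```

On this data `outlook` would be chosen as the split attribute, since every partition it produces contains a single class.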

How to build a decision tree?

Top-down tree construction:

• all training data are the root

• data are partitioned recursively based on selected attributes

• bottom-up tree pruning

→ remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases

• conditions for stopping partitioning:

• all samples for a given node belong to the same class

• there are no remaining attributes for further partitioning

• there are no samples left
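
The top-down procedure and its stopping conditions can be sketched as a short recursive function. This is an ID3-style illustration under assumptions not stated in the slides: attributes are categorical, the split attribute is chosen by entropy-based information gain, and pruning is left out. The function names and the data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(examples, attribute):
    """Partition examples by their value of `attribute`."""
    partitions = {}
    for features, label in examples:
        partitions.setdefault(features[attribute], []).append((features, label))
    return partitions


def gain(examples, attribute):
    labels = [label for _, label in examples]
    remainder = sum(
        len(part) / len(examples) * entropy([label for _, label in part])
        for part in split(examples, attribute).values()
    )
    return entropy(labels) - remainder


def build_tree(examples, attributes, default=None):
    if not examples:                      # stop: there are no samples left
        return default                    # (only reachable with an empty input here)
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:             # stop: all samples belong to the same class
        return labels[0]
    if not attributes:                    # stop: no remaining attributes
        return majority
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return {
        "attribute": best,
        "branches": {
            value: build_tree(part, remaining, majority)
            for value, part in split(examples, best).items()
        },
    }


examples = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay home"),
    ({"outlook": "rainy", "windy": "no"}, "stay home"),
    ({"outlook": "rainy", "windy": "yes"}, "stay home"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "overcast", "windy": "yes"}, "play"),
]
tree = build_tree(examples, ["outlook", "windy"])
```

All training data start at the root, where `outlook` is selected because it has the higher gain; only the `sunny` partition is still mixed, so it is split once more on `windy`.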

Pros and Cons

Pros:

• simple to understand and interpret

• little data preparation and little computation

• indicates which attributes are most important for classification

Cons:

• learning an optimal decision tree is NP-complete

• performs poorly with many classes and little training data

• computationally expensive to train

• over-complex trees do not generalise well from the training data (overfitting)

What’s next?

• k-Nearest Neighbours

• Clustering
