Lecture 8: Decision Trees & k-Nearest Neighbors
DESCRIPTION
Decision trees, best split, entropy, information gain, gain ratio, k-Nearest Neighbors, distance metrics.

TRANSCRIPT
Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials
Supervised Classification
• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$
    • Features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{x^t, y^t\}_{t=1}^{N}$
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
Models for Classification
• Generative probabilistic models: model of P(x, y) – Naive Bayes
• Conditional probabilistic models: model of P(y | x) – logistic regression
• Discriminative models: no explicit probability model – decision trees, nearest neighbor classification, perceptron, support vector machines, MIRA
Repeating…
• Noise – data cleaning is expensive and time consuming
• Margin
• Inductive bias

Types of inductive biases:
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]
DECISION TREES
Decision Trees
• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths
Decision Tree
Divide and Conquer
• Internal decision nodes
  – Univariate: uses a single attribute, x_i
    • Numeric x_i: binary split: x_i > w_m
    • Discrete x_i: n-way split for n possible values
  – Multivariate: uses all attributes, x
• Leaves
  – Classification: class labels (or proportions)
  – Regression: r average (or local fit)
• Learning:
  – Greedy recursive algorithm
  – Find best split X = (X1, ..., Xp), then induce tree for each Xi
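The greedy recursive procedure above can be sketched as follows (a minimal illustration for categorical features only, using entropy as the impurity measure; the function names and the tiny weather-style dataset are invented for this sketch):

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a node: -sum_i p_i log2 p_i over class proportions."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels):
    """Greedy step: pick the feature whose n-way split minimizes
    the weighted impurity of the resulting branches."""
    n = len(labels)
    def split_impurity(j):
        branches = {}
        for row, y in zip(rows, labels):
            branches.setdefault(row[j], []).append(y)
        return sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return min(range(len(rows[0])), key=split_impurity)

def grow_tree(rows, labels):
    """Recursive induction: a pure node becomes a leaf; otherwise
    split on the best feature and induce a subtree per branch."""
    if len(set(labels)) == 1 or len(set(map(tuple, rows))) == 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class
    j = best_feature(rows, labels)
    node = {"feature": j, "branches": {}}
    for v in sorted(set(r[j] for r in rows)):
        sub = [(r, y) for r, y in zip(rows, labels) if r[j] == v]
        node["branches"][v] = grow_tree([r for r, _ in sub],
                                        [y for _, y in sub])
    return node
```

Each root-to-leaf path in the resulting nested dictionary is a conjunction of feature tests, matching the disjunction-of-conjunctions view of decision trees.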
Classification Trees (ID3, CART, C4.5)
• For node m, $N_m$ instances reach m, of which $N_m^i$ belong to class $C_i$:

$$P(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$$

• Node m is pure if $p_m^i$ is 0 or 1
• Measure of impurity is entropy:

$$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$
Example: Entropy
• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1: C1 = {x1, x2, x3, x4}, C2 = { }
  – Im = –(1 log2 1 + 0 log2 0) = 0
• Case 2: C1 = {x1, x2, x3}, C2 = {x4}
  – Im = –(0.75 log2 0.75 + 0.25 log2 0.25) ≈ 0.81
• Case 3: C1 = {x1, x2}, C2 = {x3, x4}
  – Im = –(0.5 log2 0.5 + 0.5 log2 0.5) = 1
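The three cases can be checked numerically (a minimal sketch; `entropy` here takes the per-class instance counts from the slide):

```python
import math

def entropy(counts):
    """Im = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([4, 0]))  # Case 1: 0.0
print(entropy([3, 1]))  # Case 2: ~0.81
print(entropy([2, 2]))  # Case 3: 1.0
```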
Best Split
• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after split with test t: $N_{mj}$ of the $N_m$ instances take branch j, and $N_{mj}^i$ of those belong to class $C_i$:

$$P(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$$

$$I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$$
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain and Gain Ratio
• Choosing the test that minimizes impurity maximizes the information gain (IG):

$$IG_m^t = I_m - I_m^t$$

• Information gain prefers features with many values
• The normalized version is called gain ratio (GR), where $V_m^t$ is the split information:

$$V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m}$$

$$GR_m^t = \frac{IG_m^t}{V_m^t}$$
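As a sketch of how IG and GR come out of these formulas (illustrative only; the class counts below are the classic play-tennis "Outlook" split from the ID3 literature, not data from this lecture):

```python
import math

def entropy(counts):
    """-sum_i p_i log2 p_i over a list of class counts (0 log 0 = 0)."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain_and_ratio(parent_counts, branch_counts):
    """IG_m^t = I_m - I_m^t and GR_m^t = IG_m^t / V_m^t.

    parent_counts: class counts at node m, e.g. [9, 5]
    branch_counts: class counts per branch of test t, e.g. [[2, 3], [4, 0], [3, 2]]
    """
    n = sum(parent_counts)
    impurity_after = sum(sum(b) / n * entropy(b) for b in branch_counts)
    ig = entropy(parent_counts) - impurity_after
    # V_m^t is exactly the entropy of the branch sizes N_mj / N_m
    v = entropy([sum(b) for b in branch_counts])
    return ig, ig / v

ig, gr = info_gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
```

Here IG ≈ 0.247 while GR ≈ 0.156: the three-way split is penalized by its split information, illustrating how GR counteracts IG's preference for many-valued features.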
Pruning Trees
• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: early stopping (e.g., with an entropy threshold)
  – Postpruning: grow the whole tree, then prune subtrees
• Prepruning is faster; postpruning is more accurate (requires a separate validation set)
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Learning Rules
• Rule induction is similar to tree induction, but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
  – Sequential covering: generate rules one at a time until all positive examples are covered
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
Properties of Decision Trees
• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete space)
K-NEAREST NEIGHBORS
Nearest Neighbor Classification
• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances
This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
k-Nearest Neighbors
• Learning:
  – Store training instances in memory
• Classification:
  – Given a new test instance x,
    • compare it to all stored instances
    • compute a distance between x and each stored instance x^t
    • keep track of the k closest instances (nearest neighbors)
  – Assign to x the majority class of the k nearest neighbors
• A geometric view of learning:
  – Proximity in (feature) space → same class
  – The smoothness assumption
Eager and Lazy Learning
• Eager learning (e.g., decision trees)
  – Learning: induce an abstract model from data
  – Classification: apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning: store data in memory
  – Classification: compare new data to data in memory
  – Properties:
    • Retains all the information in the training set – no abstraction
    • Complex hypothesis space – suitable for natural language?
    • Main drawback – classification can be very inefficient
Dimensions of a k-NN Classifier
• Distance metric
  – How do we measure the distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space
Distance Metric 1
• Overlap = count of mismatching features

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

$$\delta(x_i, z_i) = \begin{cases} \dfrac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{if } x_i = z_i \\ 1 & \text{if } x_i \neq z_i \end{cases}$$
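A direct transcription of this metric (a minimal sketch; the `ranges` argument, supplying each numeric feature's (min, max) from the training data, is an assumption of this illustration):

```python
def delta(xi, zi, lo=None, hi=None):
    """Per-feature difference: range-scaled |xi - zi| for numeric values,
    0/1 match/mismatch for categorical ones."""
    if isinstance(xi, (int, float)) and isinstance(zi, (int, float)):
        return abs(xi - zi) / (hi - lo)
    return 0 if xi == zi else 1

def overlap_distance(x, z, ranges=None):
    """Delta(x, z) = sum_i delta(x_i, z_i)."""
    ranges = ranges or [(None, None)] * len(x)
    return sum(delta(xi, zi, lo, hi)
               for (xi, zi), (lo, hi) in zip(zip(x, z), ranges))
```

For purely categorical vectors this reduces to the mismatch count, e.g. `overlap_distance(("a", "b"), ("a", "c"))` is 1.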
Distance Metric 2
• MVDM = Modified Value Difference Metric

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

$$\delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right|$$
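A sketch of the MVDM per-feature distance (the conditional probabilities would be estimated from training data; the `cond_prob` table below is invented purely for illustration):

```python
def mvdm_delta(xi, zi, cond_prob):
    """delta(xi, zi) = sum_j |P(Cj | xi) - P(Cj | zi)|:
    two feature values are close if they predict similar class distributions."""
    classes = set(cond_prob[xi]) | set(cond_prob[zi])
    return sum(abs(cond_prob[xi].get(c, 0.0) - cond_prob[zi].get(c, 0.0))
               for c in classes)

# Invented value-to-class-distribution table (would come from counts in data):
cond_prob = {"red":  {"A": 0.8, "B": 0.2},
             "blue": {"A": 0.5, "B": 0.5},
             "pink": {"A": 0.7, "B": 0.3}}
```

Under this table "pink" is closer to "red" (δ = 0.2) than to "blue" (δ = 0.4), even though all three values simply mismatch under the overlap metric.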
The k parameter
• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, the whole feature space is one neighborhood
• k can be chosen to minimize the error on a validation set V:

$$E(h \mid V) = \sum_{t=1}^{M} 1\left(h(x^t) \neq r^t\right)$$

(Figure: decision boundaries for k = 1 and k = 15)
A Simple Example

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2, Δ(2, 5) = 1, Δ(3, 5) = 4, Δ(4, 5) = 3

k-NN classification:
1-NN(5) = B, 2-NN(5) = A/B, 3-NN(5) = A, 4-NN(5) = A
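The worked example can be reproduced in a few lines (a minimal sketch; ties, as in the 2-NN case, are left to `Counter`'s ordering here rather than resolved explicitly):

```python
from collections import Counter

train = [(("a", "b", "a", "c"), "A"),
         (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"),
         (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

def overlap(x, z):
    """Overlap distance: number of mismatching categorical features."""
    return sum(xi != zi for xi, zi in zip(x, z))

dists = [overlap(x, new) for x, _ in train]  # [2, 1, 4, 3], as on the slide

def knn_label(k):
    """Majority class among the k nearest stored instances."""
    nearest = sorted(zip(dists, (y for _, y in train)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]
```

`knn_label(1)` returns B and `knn_label(3)` returns A, matching the slide.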
Further Variations on k-NN
• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting:
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance
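Inverse-distance weighted voting can be sketched as follows (illustrative; the `eps` guard against division by zero for exact duplicates is an assumption of this sketch):

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """Each (distance, label) neighbor votes for its label with weight
    1 / distance, so closer instances count more than distant ones."""
    score = defaultdict(float)
    for dist, label in neighbors:
        score[label] += 1.0 / (dist + eps)
    return max(score, key=score.get)
```

On the toy example's three nearest neighbors, `weighted_vote([(1, "B"), (2, "A"), (3, "A")])` returns B rather than the unweighted majority A: the single closest neighbor outweighs the two more distant ones (1 > 1/2 + 1/3).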
Properties of k-NN
• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity tuned by the k parameter
End of Lecture 8