Lecture 8: Decision Trees & k-Nearest Neighbors
DESCRIPTION
Decision trees, best split, entropy, information gain, gain ratio, k-Nearest Neighbors, distance metrics.

TRANSCRIPT
Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials
Supervised Classification
• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$
    • Features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{x^t, y^t\}_{t=1}^{N}$
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
Models for Classification
• Generative probabilistic models: model of P(x, y) – Naive Bayes
• Conditional probabilistic models: model of P(y | x) – logistic regression
• Discriminative models: no explicit probability model – decision trees, nearest neighbor classification, perceptron, support vector machines, MIRA
Repeating…
• Noise – data cleaning is expensive and time consuming
• Margin
• Inductive bias

Types of inductive biases:
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]
DECISION TREES
Decision Trees
• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths
Decision Tree
Divide and Conquer
• Internal decision nodes
  – Univariate: uses a single attribute, x_i
    • Numeric x_i: binary split: x_i > w_m
    • Discrete x_i: n-way split for n possible values
  – Multivariate: uses all attributes, x
• Leaves
  – Classification: class labels (or proportions)
  – Regression: r average (or local fit)
• Learning:
  – Greedy recursive algorithm
  – Find best split X = (X1, ..., Xp), then induce tree for each Xi
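The greedy recursive procedure above can be sketched as follows (a minimal illustration for categorical features only, using entropy as the impurity measure; the function names and the tiny weather-style dataset are invented for this sketch):

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a node: -sum_i p_i log2 p_i over class proportions."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels):
    """Greedy step: pick the feature whose n-way split minimizes
    the weighted impurity of the resulting branches."""
    n = len(labels)
    def split_impurity(j):
        branches = {}
        for row, y in zip(rows, labels):
            branches.setdefault(row[j], []).append(y)
        return sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return min(range(len(rows[0])), key=split_impurity)

def grow_tree(rows, labels):
    """Recursive induction: a pure node becomes a leaf; otherwise
    split on the best feature and induce a subtree per branch."""
    if len(set(labels)) == 1 or len(set(map(tuple, rows))) == 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class
    j = best_feature(rows, labels)
    node = {"feature": j, "branches": {}}
    for v in sorted(set(r[j] for r in rows)):
        sub = [(r, y) for r, y in zip(rows, labels) if r[j] == v]
        node["branches"][v] = grow_tree([r for r, _ in sub],
                                        [y for _, y in sub])
    return node
```

Each root-to-leaf path in the resulting nested dictionary is a conjunction of feature tests, matching the disjunction-of-conjunctions view of decision trees.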
Classification Trees (ID3, CART, C4.5)
• For node m, $N_m$ instances reach m, of which $N_m^i$ belong to class $C_i$:

$$P(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$$

• Node m is pure if $p_m^i$ is 0 or 1
• Measure of impurity is entropy:

$$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$
Example: Entropy
• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1: C1 = {x1, x2, x3, x4}, C2 = { }
  – Im = –(1 log2 1 + 0 log2 0) = 0
• Case 2: C1 = {x1, x2, x3}, C2 = {x4}
  – Im = –(0.75 log2 0.75 + 0.25 log2 0.25) ≈ 0.81
• Case 3: C1 = {x1, x2}, C2 = {x3, x4}
  – Im = –(0.5 log2 0.5 + 0.5 log2 0.5) = 1
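The three cases can be checked numerically (a minimal sketch; `entropy` here takes the per-class instance counts from the slide):

```python
import math

def entropy(counts):
    """Im = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([4, 0]))  # Case 1: 0.0
print(entropy([3, 1]))  # Case 2: ~0.81
print(entropy([2, 2]))  # Case 3: 1.0
```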
Best Split
• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after split with test t: $N_{mj}$ of the $N_m$ instances take branch j, and $N_{mj}^i$ of those belong to class $C_i$:

$$P(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$$

$$I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$$
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Information Gain and Gain Ratio
• Choosing the test that minimizes impurity maximizes the information gain (IG):

$$IG_m^t = I_m - I_m^t$$

• Information gain prefers features with many values
• The normalized version is called gain ratio (GR), where $V_m^t$ is the split information:

$$V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m}$$

$$GR_m^t = \frac{IG_m^t}{V_m^t}$$
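As a sketch of how IG and GR come out of these formulas (illustrative only; the class counts below are the classic play-tennis "Outlook" split from the ID3 literature, not data from this lecture):

```python
import math

def entropy(counts):
    """-sum_i p_i log2 p_i over a list of class counts (0 log 0 = 0)."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain_and_ratio(parent_counts, branch_counts):
    """IG_m^t = I_m - I_m^t and GR_m^t = IG_m^t / V_m^t.

    parent_counts: class counts at node m, e.g. [9, 5]
    branch_counts: class counts per branch of test t, e.g. [[2, 3], [4, 0], [3, 2]]
    """
    n = sum(parent_counts)
    impurity_after = sum(sum(b) / n * entropy(b) for b in branch_counts)
    ig = entropy(parent_counts) - impurity_after
    # V_m^t is exactly the entropy of the branch sizes N_mj / N_m
    v = entropy([sum(b) for b in branch_counts])
    return ig, ig / v

ig, gr = info_gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
```

Here IG ≈ 0.247 while GR ≈ 0.156: the three-way split is penalized by its split information, illustrating how GR counteracts IG's preference for many-valued features.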
Pruning Trees
• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: early stopping (e.g., with an entropy threshold)
  – Postpruning: grow the whole tree, then prune subtrees
• Prepruning is faster; postpruning is more accurate (requires a separate validation set)
Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)
Learning Rules
• Rule induction is similar to tree induction, but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
  – Sequential covering: generate rules one at a time until all positive examples are covered
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
Properties of Decision Trees
• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete space)
K-NEAREST NEIGHBORS
Nearest Neighbor Classification
• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances
This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
k-Nearest Neighbors
• Learning:
  – Store training instances in memory
• Classification:
  – Given a new test instance x,
    • compare it to all stored instances
    • compute a distance between x and each stored instance x^t
    • keep track of the k closest instances (nearest neighbors)
  – Assign to x the majority class of the k nearest neighbors
• A geometric view of learning:
  – Proximity in (feature) space → same class
  – The smoothness assumption
Eager and Lazy Learning
• Eager learning (e.g., decision trees)
  – Learning: induce an abstract model from data
  – Classification: apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning: store data in memory
  – Classification: compare new data to data in memory
  – Properties:
    • Retains all the information in the training set – no abstraction
    • Complex hypothesis space – suitable for natural language?
    • Main drawback – classification can be very inefficient
Dimensions of a k-NN Classifier
• Distance metric
  – How do we measure the distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space
Distance Metric 1
• Overlap = count of mismatching features

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

$$\delta(x_i, z_i) = \begin{cases} \dfrac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{if } x_i = z_i \\ 1 & \text{if } x_i \neq z_i \end{cases}$$
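A direct transcription of this metric (a minimal sketch; the `ranges` argument, supplying each numeric feature's (min, max) from the training data, is an assumption of this illustration):

```python
def delta(xi, zi, lo=None, hi=None):
    """Per-feature difference: range-scaled |xi - zi| for numeric values,
    0/1 match/mismatch for categorical ones."""
    if isinstance(xi, (int, float)) and isinstance(zi, (int, float)):
        return abs(xi - zi) / (hi - lo)
    return 0 if xi == zi else 1

def overlap_distance(x, z, ranges=None):
    """Delta(x, z) = sum_i delta(x_i, z_i)."""
    ranges = ranges or [(None, None)] * len(x)
    return sum(delta(xi, zi, lo, hi)
               for (xi, zi), (lo, hi) in zip(zip(x, z), ranges))
```

For purely categorical vectors this reduces to the mismatch count, e.g. `overlap_distance(("a", "b"), ("a", "c"))` is 1.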
Distance Metric 2
• MVDM = Modified Value Difference Metric

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

$$\delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right|$$
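A sketch of the MVDM per-feature distance (the conditional probabilities would be estimated from training data; the `cond_prob` table below is invented purely for illustration):

```python
def mvdm_delta(xi, zi, cond_prob):
    """delta(xi, zi) = sum_j |P(Cj | xi) - P(Cj | zi)|:
    two feature values are close if they predict similar class distributions."""
    classes = set(cond_prob[xi]) | set(cond_prob[zi])
    return sum(abs(cond_prob[xi].get(c, 0.0) - cond_prob[zi].get(c, 0.0))
               for c in classes)

# Invented value-to-class-distribution table (would come from counts in data):
cond_prob = {"red":  {"A": 0.8, "B": 0.2},
             "blue": {"A": 0.5, "B": 0.5},
             "pink": {"A": 0.7, "B": 0.3}}
```

Under this table "pink" is closer to "red" (δ = 0.2) than to "blue" (δ = 0.4), even though all three values simply mismatch under the overlap metric.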
The k parameter
• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, the whole feature space is one neighborhood
• k can be chosen to minimize the error on a validation set V:

$$E(h \mid V) = \sum_{t=1}^{M} 1\left(h(x^t) \neq r^t\right)$$

(Figure: decision boundaries for k = 1 and k = 15)
A Simple Example

$$\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$$

Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2, Δ(2, 5) = 1, Δ(3, 5) = 4, Δ(4, 5) = 3

k-NN classification:
1-NN(5) = B, 2-NN(5) = A/B, 3-NN(5) = A, 4-NN(5) = A
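The worked example can be reproduced in a few lines (a minimal sketch; ties, as in the 2-NN case, are left to `Counter`'s ordering here rather than resolved explicitly):

```python
from collections import Counter

train = [(("a", "b", "a", "c"), "A"),
         (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"),
         (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

def overlap(x, z):
    """Overlap distance: number of mismatching categorical features."""
    return sum(xi != zi for xi, zi in zip(x, z))

dists = [overlap(x, new) for x, _ in train]  # [2, 1, 4, 3], as on the slide

def knn_label(k):
    """Majority class among the k nearest stored instances."""
    nearest = sorted(zip(dists, (y for _, y in train)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]
```

`knn_label(1)` returns B and `knn_label(3)` returns A, matching the slide.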
Further Variations on k-NN
• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting:
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance
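Inverse-distance weighted voting can be sketched as follows (illustrative; the `eps` guard against division by zero for exact duplicates is an assumption of this sketch):

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """Each (distance, label) neighbor votes for its label with weight
    1 / distance, so closer instances count more than distant ones."""
    score = defaultdict(float)
    for dist, label in neighbors:
        score[label] += 1.0 / (dist + eps)
    return max(score, key=score.get)
```

On the toy example's three nearest neighbors, `weighted_vote([(1, "B"), (2, "A"), (3, "A")])` returns B rather than the unweighted majority A: the single closest neighbor outweighs the two more distant ones (1 > 1/2 + 1/3).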
Properties of k-NN
• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity tuned by the k parameter
End of Lecture 8