lecture 02: machine learning for language technology - decision trees and nearest neighbors
DESCRIPTION
In this lecture, we talk about two different discriminative machine learning methods: decision trees and k-nearest neighbors. Decision trees are hierarchical structures.k-nearest neighbors are based on two principles: recollection and resemblance.TRANSCRIPT
![Page 1: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/1.jpg)
1
Lecture 2:
Decision Trees – Nearest NeighborsLast Updated: 09 Sept 2013
Uppsala University
Department of Linguistics and Philology, September 2013
Machine Learning for Language Technology(Schedule)
Lec 2: Decision Trees - Nearest Neighbors
![Page 2: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/2.jpg)
Lec 2: Decision Trees - Nearest Neighbors
2
Most slides from previous courses (i.e. from E. Alpaydin and J. Nivre) with
adaptations
![Page 3: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/3.jpg)
Lec 2: Decision Trees - Nearest Neighbors
3
Supervised Classification
Divide instances into (two or more) classes Instance (feature vector):
Features may be categorical or numerical Class (label): Training data:
Classification in Language Technology Spam filtering (spam vs. non-spam) Spelling error detection (error vs. no error) Text categorization (news, economy, culture, sport, ...) Named entity classification (person, location,
organization, ...)
X {x t,y t }t1N
x x1, , xm
y
![Page 4: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/4.jpg)
Lec 2: Decision Trees - Nearest Neighbors
4
Concept=the thing to be learned (here, how to decide the genre of a new webpage); Instances=examples (here a webpage)Features=attributes (here morphological features, normalized counts)Classes=labels (here webgenres)
![Page 5: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/5.jpg)
Lec 2: Decision Trees - Nearest Neighbors
5
Format for the software you chose (here arff)
![Page 6: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/6.jpg)
Lec 2: Decision Trees - Nearest Neighbors
6
Models for Classification Generative probabilistic models:
Model of P(x, y) Naive Bayes
Conditional probabilistic models: Model of P(y | x) Logistic regression (next time)
Discriminative model: No explicit probability model Decision trees, nearest neighbor classification
(today) Perceptron, support vector machines, MIRA (next
time)
![Page 7: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/7.jpg)
Lec 2: Decision Trees - Nearest Neighbors
7
Repeating… Noise
Data cleaning is expensive and time consuming
Margin Inductive bias
Types of inductive biases• Minimum cross-validation
error• Maximum margin• Minimum description length• [...]
![Page 8: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/8.jpg)
Lec 2: Decision Trees - Nearest Neighbors
8
Decision Trees
![Page 9: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/9.jpg)
Lec 2: Decision Trees - Nearest Neighbors
9
Decision Trees
Hierarchical tree structure for classification Each internal node specifies a test of some
feature Each branch corresponds to a value for the
tested feature Each leaf node provides a classification for the
instance Represents a disjunction of conjunctions of
constraints Each path from root to leaf specifies a
conjunction of tests The tree itself represents the disjunction of all
paths
![Page 10: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/10.jpg)
10
Decision Tree
Lec 2: Decision Trees - Nearest Neighbors
![Page 11: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/11.jpg)
11
Divide and Conquer
Lec 2: Decision Trees - Nearest Neighbors
Internal decision nodes Univariate: Uses a single attribute, xi
Numeric xi : Binary split : xi > wm
Discrete xi : n-way split for n possible values Multivariate: Uses all attributes, x
Leaves Classification: class labels (or proportions) Regression: r average (or local fit)
Learning: Greedy recursive algorithm Find best split X = (X1, ..., Xp), then induce tree for
each Xi
![Page 12: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/12.jpg)
12
Classification Trees (ID3, CART, C4.5)
Lec 2: Decision Trees - Nearest Neighbors
ˆ P Ci | x,m pmi
Nmi
Nm
For node m, Nm instances reach m, Nim belong to Ci
Node m is pure if pim is 0 or 1
Measure of impurity is entropy
I m pmi log2 pm
i
i1
K
![Page 13: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/13.jpg)
Lec 2: Decision Trees - Nearest Neighbors
13
Example: Entropy
Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
Case 1: C1 = {x1, x2, x3, x4}, C2 = { } Im = – (1 log 1 + 0 log 0) = 0
Case 2: C1 = {x1, x2, x3}, C2 = {x4} Im = – (0.75 log 0.75 + 0.25 log 0.25) = 0.81
Case 3: C1 = {x1, x2}, C2 = {x3, x4} Im = – (0.5 log 0.5 + 0.5 log 0.5) = 1
![Page 14: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/14.jpg)
14
Best Split
Lec 2: Decision Trees - Nearest Neighbors
ˆ P Ci | x,m, j pmji
Nmji
Nmj
If node m is pure, generate a leaf and stop, otherwise split with test t and continue recursively
Find the test that minimizes impurity Impurity after split with test t:
Nmj of Nm take branch j Ni
mj belong to Ci
I mt
Nmj
Nmj1
n
pmji log2 pmj
i
i1
K
I m pmi log2 pm
i
i1
K
ˆ P Ci | x,m pmi
Nmi
Nm
![Page 15: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/15.jpg)
15 Lec 2: Decision Trees - Nearest Neighbors
![Page 16: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/16.jpg)
Lec 2: Decision Trees - Nearest Neighbors
16
Information Gain
We want to determine which attribute in a given set of training feature vectors is most usefulfor discriminating between the classes to be learned.
•Information gain tells us how important a given attribute of the feature vectors is.
•We will use it to decide the ordering of attributes in the nodes of a decision tree.
![Page 17: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/17.jpg)
Lec 2: Decision Trees - Nearest Neighbors
17
Information Gain and Gain Ratio
Choosing the test that minimizes impurity maximizes the information gain (IG):
Information gain prefers features with many values
The normalized version is called gain ratio (GR):
I mt
Nmj
Nmj1
n
pmji log2 pmj
i
i1
K
IG mt I m I m
t
Vmt
Nmj
Nm
log2
Nmj
Nmj1
n
GRmt
IG mt
Vmt
I m pmi log2 pm
i
i1
K
![Page 18: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/18.jpg)
18
Pruning Trees
Lec 2: Decision Trees - Nearest Neighbors
Decision trees are susceptible to overfitting Remove subtrees for better generalization:
Prepruning: Early stopping (e.g., with entropy threshold)
Postpruning: Grow whole tree, then prune subtrees
Prepruning is faster, postpruning is more accurate (requires a separate validation set)
![Page 19: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/19.jpg)
19
Rule Extraction from Trees
Lec 2: Decision Trees - Nearest Neighbors
C4.5Rules (Quinlan, 1993)
![Page 20: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/20.jpg)
20
Learning Rules
Lec 2: Decision Trees - Nearest Neighbors
Rule induction is similar to tree induction but tree induction is breadth-first rule induction is depth-first (one rule at a time)
Rule learning: A rule is a conjunction of terms (cf. tree path) A rule covers an example if all terms of the rule
evaluate to true for the example (cf. sequence of tests)
Sequential covering: Generate rules one at a time until all positive examples are covered
IREP (Fürnkrantz and Widmer, 1994), Ripper (Cohen, 1995)
![Page 21: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/21.jpg)
Lec 2: Decision Trees - Nearest Neighbors
21
Properties of Decision Trees Decision trees are appropriate for classification when:
Features can be both categorical and numeric Disjunctive descriptions may be required Training data may be noisy (missing values, incorrect
labels) Interpretation of learned model is important (rules)
Inductive bias of (most) decision tree learners:1. Prefers trees with informative attributes close to
the root2. Prefers smaller trees over bigger ones (with
pruning)3. Preference bias (incomplete search of complete
space)
![Page 22: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/22.jpg)
Lec 2: Decision Trees - Nearest Neighbors
22
k-Nearest Neighbors
![Page 23: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/23.jpg)
Lec 2: Decision Trees - Nearest Neighbors
23
Nearest Neighbor Classification An old idea
Key components: Storage of old instances Similarity-based reasoning to new instances
This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
![Page 24: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/24.jpg)
Lec 2: Decision Trees - Nearest Neighbors
24
k-Nearest Neighbour Learning:
Store training instances in memory Classification:
Given new test instance x, Compare it to all stored instances Compute a distance between x and each stored
instance xt
Keep track of the k closest instances (nearest neighbors)
Assign to x the majority class of the k nearest neighbours
A geometric view of learning Proximity in (feature) space same class The smoothness assumption
![Page 25: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/25.jpg)
Lec 2: Decision Trees - Nearest Neighbors
25
Eager and Lazy Learning Eager learning (e.g., decision trees)
Learning – induce an abstract model from data Classification – apply model to new data
Lazy learning (a.k.a. memory-based learning) Learning – store data in memory Classification – compare new data to data in
memory Properties:
Retains all the information in the training set – no abstraction
Complex hypothesis space – suitable for natural language?
Main drawback – classification can be very inefficient
![Page 26: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/26.jpg)
Lec 2: Decision Trees - Nearest Neighbors
26
Dimensions of a k-NN Classifier Distance metric
How do we measure distance between instances? Determines the layout of the instance space
The k parameter How large neighborhood should we consider? Determines the complexity of the hypothesis
space
![Page 27: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/27.jpg)
Lec 2: Decision Trees - Nearest Neighbors
27
Distance Metric 1 Overlap = count of mismatching features
(x,z) (x i,zi)i1
m
ii
ii
ii
ii
ii
zxif
zxif
elsenumericifzx
zx
1
0
,minmax
),(
![Page 28: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/28.jpg)
Lec 2: Decision Trees - Nearest Neighbors
28
Distance Metric 2 MVDM = Modified Value Difference Metric
(x,z) (x i,zi)i1
m
(x i,zi) P(C j | x i) P(C j | zi)j1
K
![Page 29: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/29.jpg)
29
The k parameter Tunes the complexity of the hypothesis
space If k = 1, every instance has its own neighborhood If k = N, all the feature space is one neighborhood
k = 1 k = 15
Lec 2: Decision Trees - Nearest Neighbors
ˆ E E(h |V ) 1 h x t r t t1
M
![Page 30: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/30.jpg)
30
A Simple Example
Training set:1. (a, b, a, c) A2. (a, b, c, a) B3. (b, a, c, c) C4. (c, a, b, c) A
New instance:5. (a, b, b, a)
Distances (overlap):Δ(1, 5) = 2Δ(2, 5) = 1Δ(3, 5) = 4Δ(4, 5) = 3
k-NN classification:1-NN(5) = B2-NN(5) = A/B3-NN(5) = A4-NN(5) = A
Lec 2: Decision Trees - Nearest Neighbors
(x,z) (x i,zi)i1
m
ii
ii
ii
ii
ii
zxif
zxif
elsenumericifzx
zx
1
0
,minmax
),(
![Page 31: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/31.jpg)
Lec 2: Decision Trees - Nearest Neighbors
31
Further Variations on k-NN Feature weights:
The overlap metric gives all features equal weight Features can be weighted by IG or GR
Weighted voting: The normal decision rule gives all neighbors equal
weight Instances can be weighted by (inverse) distance
![Page 32: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/32.jpg)
Lec 2: Decision Trees - Nearest Neighbors
32
Properties of k-NN Nearest neighbor classification is appropriate
when: Features can be both categorical and numeric Disjunctive descriptions may be required Training data may be noisy (missing values, incorrect
labels) Fast classification is not crucial
Inductive bias of k-NN:1. Nearby instances should have the same label
(smoothness)2. All features are equally important (without
feature weights)3. Complexity tuned by the k parameter
![Page 33: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/33.jpg)
Lec 2: Decision Trees - Nearest Neighbors
33
Reading Alpaydin (2010): Ch. 3, 9 Daumé (2012): Ch. 1-2 Witten and Frank (2005): Ch. 2 Tom Mitchell (1997) Ch.3: Decision Tree
Learning
![Page 34: Lecture 02: Machine Learning for Language Technology - Decision Trees and Nearest Neighbors](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c5dfdd4a7959b8388b4741/html5/thumbnails/34.jpg)
Lec 2: Decision Trees - Nearest Neighbors
34
End of Lecture 2
Thanks for your attention