SIMS 290-2: Applied Natural Language Processing
Barbara Rosario, October 4, 2004
Today
Algorithms for Classification
Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), Kernel methods
Multi-class classification: Decision Trees, Naïve Bayes, k-nearest neighbor
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs. not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class
Binary Classification
Given: data items that belong to a positive (+1) or a negative (-1) class
Task: train the classifier and predict the class of a new data item
Geometrically: find a separator
Linear versus Non Linear algorithms
Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary
Linearly separable data
(figure: Class 1 and Class 2 separated by a linear decision boundary)
Non linearly separable data
(figure: Class 1 and Class 2, not separable by a linear boundary)
Non linearly separable data
(figure: a non-linear classifier separating Class 1 from Class 2)
Linear versus Non Linear algorithms
Linearly or non-linearly separable data? We can find out only empirically.
Linear algorithms (algorithms that find a linear decision boundary):
Use when we think the data is linearly separable
Advantages: simpler, fewer parameters
Disadvantages: high-dimensional data (as in NLP) is usually not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can also use linear algorithms for non-linear problems (see Kernel methods)
Linear versus Non Linear algorithms
Non-linear algorithms:
Use when the data is non-linearly separable
Advantages: more accurate
Disadvantages: more complicated, more parameters
Example: Kernel methods
Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)
Simple linear algorithms
Perceptron and Winnow algorithms:
Linear
Binary classification
Online (process data sequentially, one data point at a time)
Mistake-driven
Simple single-layer neural networks
From Gert Lanckriet, Statistical Learning Theory Tutorial
Linear binary classification
Data: {(xi, yi)}, i = 1, ..., n
x in Rd (x is a vector in d-dimensional space): the feature vector
y in {-1, +1}: the label (class, category)
Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error.
Classification rule: y = sign(wx + b), which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1
Linear binary classification
Find a good hyperplane (w, b) in Rd+1 that correctly classifies as many data points as possible
In online fashion: one data point at a time, updating the weights as necessary
Hyperplane: wx + b = 0
Classification rule: y = sign(wx + b)
Perceptron algorithm
Initialize: w1 = 0
Updating rule, for each data point xi:
if class(xi) != decision(xi, wk) then
    wk+1 ← wk + yi·xi
    k ← k + 1
else
    wk+1 ← wk
Function decision(x, w): if wx + b > 0 return +1, else return -1
(figure: a misclassified point rotates the hyperplane wk·x + b = 0 toward wk+1·x + b = 0)
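The updating rule above can be turned into running code; a minimal, illustrative sketch (the toy data and the epoch count are my additions, not from the lecture):

```python
# Minimal sketch of the perceptron's mistake-driven, additive update.
# Toy data and epoch count are illustrative, not from the lecture.

def decision(x, w, b):
    """Classification rule: y = sign(w.x + b)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1

def train_perceptron(data, epochs=10):
    """data: list of (x, y) with y in {-1, +1}. Returns (w, b)."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            if decision(x, w, b) != y:      # mistake-driven update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y                      # bias treated as an extra weight
    return w, b

# Linearly separable toy data: positives lie above the line x1 + x2 = 2
data = [([0, 0], -1), ([1, 0], -1), ([0, 1], -1),
        ([2, 2], 1), ([1.5, 1.5], 1), ([3, 1], 1)]
w, b = train_perceptron(data)
print(all(decision(x, w, b) == y for x, y in data))  # True
```
On separable data like this, convergence to a perfect separator is guaranteed (the perceptron convergence theorem the slide alludes to).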
Perceptron algorithm
Online: can adjust to a changing target over time
Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem (convergence to a global optimum)
Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features
Winnow algorithm
Another online algorithm for learning perceptron weights: f(x) = sign(wx + b)
Linear, binary classification
Update rule: again error-driven, but multiplicative (instead of additive)
Winnow algorithm
(figure: as for the perceptron, an update moves the hyperplane wk·x + b = 0 to wk+1·x + b = 0)
Initialize: w1 = 0
Updating rule, for each data point xi:
if class(xi) != decision(xi, wk) then
    wk+1 ← wk + yi·xi          (Perceptron: additive)
    wk+1 ← wk · exp(yi·xi)     (Winnow: multiplicative)
    k ← k + 1
else
    wk+1 ← wk
Function decision(x, w): if wx + b > 0 return +1, else return -1
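The multiplicative rule can be sketched the same way; here the learning rate eta, the positive weight initialization, the fixed threshold, and the toy data (one relevant feature among several irrelevant ones) are illustrative assumptions, not from the lecture:

```python
import math

# Sketch of the multiplicative (Winnow-style) update from this slide:
# on a mistake, w <- w * exp(eta * y * x) componentwise, versus the
# perceptron's additive w <- w + y * x.

def decision(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def train_winnow(data, eta=0.5, epochs=20):
    d = len(data[0][0])
    w, b = [1.0] * d, -d / 2.0   # positive init; fixed threshold as bias
    for _ in range(epochs):
        for x, y in data:
            if decision(x, w, b) != y:   # error-driven, multiplicative update
                w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
    return w, b

# Boolean data where only feature 0 is relevant (label +1 iff x[0] == 1);
# features 1..3 are irrelevant, the setting where Winnow shines.
data = [([1, 0, 1, 0], 1), ([1, 1, 0, 0], 1),
        ([0, 1, 1, 0], -1), ([0, 0, 0, 1], -1)]
w, b = train_winnow(data)
print(all(decision(x, w, b) == y for x, y in data))  # True
```
Because updates are multiplicative, the weights of irrelevant features are driven down geometrically, which is the intuition behind the O(K log N) mistake bound on the next slide.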
Perceptron vs. Winnow
Assume N available features, but only K relevant ones, with K << N
Perceptron: number of mistakes O(K·N)
Winnow: number of mistakes O(K·log N)
Winnow is more robust to high-dimensional feature spaces
Perceptron
Online: can adjust to a changing target over time
Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem
Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features
Winnow
Online: can adjust to a changing target over time
Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes
Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features
Used in NLP
Weka
Winnow in Weka
Another family of linear algorithms. Intuition (Vapnik, 1965): if the classes are linearly separable:
Separate the data
Place the hyperplane "far" from the data: large margin
Statistical results guarantee good generalization
Large margin classifier
(figure, labeled BAD: a separating hyperplane that passes close to the data)
(figure, labeled GOOD: a separating hyperplane far from both classes)
Maximal Margin Classifier
Intuition (Vapnik, 1965), if linearly separable:
Separate the data; place the hyperplane "far" from the data: large margin; statistical results guarantee good generalization
Large margin classifier
If not linearly separable:
Allow some errors
Still, try to place the hyperplane "far" from each class
Large margin classifier
Large Margin Classifiers
Advantages: theoretically better (better error bounds)
Limitations: computationally more expensive (a large quadratic programming problem)
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin
wT x + b = 0 (decision hyperplane)
wT xa + b = 1 and wT xb + b = -1 (hyperplanes through the closest points on each side; M is the margin between them)
These closest points are the support vectors
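From the two supporting hyperplanes, the width of the margin follows directly; a standard derivation, not spelled out on the slide:

```latex
% From w^T x_a + b = 1 and w^T x_b + b = -1, subtracting gives
%   w^T (x_a - x_b) = 2.
% Projecting x_a - x_b onto the unit normal w / ||w|| yields the margin:
\[
M \;=\; \frac{w^{T}(x_a - x_b)}{\lVert w \rVert} \;=\; \frac{2}{\lVert w \rVert}
\]
% so maximizing the margin M is equivalent to minimizing ||w||
% (in practice: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1),
% which is the quadratic program mentioned on the previous slide.
```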
Support Vector Machine (SVM)
Applications: text classification, hand-writing recognition, computational biology (e.g., micro-array data), face detection, face expression recognition, time series prediction
Non Linear problem
(figures: data that no linear boundary separates)
Kernel methods: a family of non-linear algorithms
Transform the non-linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space
Main intuition of Kernel methods
(diagram copied from the blackboard; not captured in the transcript)
Basic principle of kernel methods: a mapping Φ : Rd → RD (D >> d)
Example: X = [x z], Φ(X) = [x² z² xz]
f(x) = sign(w1·x² + w2·z² + w3·xz + b)
Decision boundary in the new space: wT Φ(x) + b = 0
Basic principle kernel methods
Linear separability: more likely in high dimensions
Mapping Φ: maps the input into a high-dimensional feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear separability
We can do this efficiently!
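The feature map Φ(x, z) = [x², z², xz] from the earlier slide can be exercised directly; the circle-shaped toy data and the hand-picked weights are illustrative:

```python
# Sketch of the slide's feature map Phi(x, z) = [x^2, z^2, x*z].
# Toy data and the linear rule in feature space are illustrative.

def phi(x, z):
    return [x * x, z * z, x * z]

def linear_rule(features, w, b):
    return 1 if sum(wi * fi for wi, fi in zip(features, w)) + b > 0 else -1

# Points inside the unit circle are -1, outside are +1: not linearly
# separable in (x, z), but x^2 + z^2 is a single linear feature in Phi-space.
data = [((0.1, 0.2), -1), ((0.5, -0.3), -1), ((-0.2, 0.1), -1),
        ((1.5, 0.5), 1), ((-1.2, 1.0), 1), ((0.2, -1.4), 1)]

# In Phi-space, the linear separator w^T Phi(x) + b = 0 with
# w = [1, 1, 0], b = -1 is exactly the circle x^2 + z^2 = 1.
w, b = [1.0, 1.0, 0.0], -1.0
print(all(linear_rule(phi(x, z), w, b) == y for (x, z), y in data))  # True
```
Any of the linear learners above (perceptron, SVM) could be trained on the Φ-transformed points instead of using hand-picked weights; that is the sense in which a linear algorithm solves a non-linear problem.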
Basic principle kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher dimensional space
Multi-class classification
Given: data items that belong to one of M possible classes
Task: train the classifier and predict the class of a new data item
Geometrically: a harder problem, no more simple geometry
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)
(Some) Algorithms for Multi-class classification
Linear:
– Parallel class separators: Decision Trees
– Non-parallel class separators: Naïve Bayes
Non-linear:
– k-nearest neighbors
Linear, parallel class separators (ex: Decision Trees)
Linear, NON parallel class separators (ex: Naïve Bayes)
Non Linear (ex: k Nearest Neighbor)
http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
a leaf node, which indicates the value of the target attribute (class) of examples, or
a decision node, which specifies a test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree classifies an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
Training Examples
Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Goal: learn when we can play tennis and when we cannot
www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Decision Tree for PlayTennis
Outlook (Sunny / Overcast / Rain); under Sunny: Humidity (High → No, Normal → Yes)
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
Query: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak → PlayTennis = ?
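Answering such a query amounts to walking the tree from the root to a leaf; a minimal sketch, where the nested-dict encoding of the tree is an illustrative choice:

```python
# Walk the PlayTennis tree from root to leaf. The nested-dict encoding
# (decision node = {attribute: {value: subtree_or_leaf}}) is illustrative.

tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    node = tree
    while isinstance(node, dict):                   # decision node
        attribute = next(iter(node))                # attribute tested here
        node = node[attribute][example[attribute]]  # follow the matching branch
    return node                                     # leaf: the class label

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # No
```
For the query above (Sunny, High humidity), the walk reaches the "No" leaf regardless of Temperature, which never appears in the tree.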
Foundations of Statistical Natural Language Processing, Manning and Schuetze
Decision Tree for Reuter classification
Building Decision Trees
Given training data, how do we construct a tree? The central focus of the tree-growing algorithm is selecting which attribute to test at each node, with the goal of selecting the attribute that is most useful for classifying examples. This is a top-down, greedy search through the space of possible decision trees: the algorithm picks the best attribute and never looks back to reconsider earlier choices.
Building Decision Trees
Splitting criterion: finding the features and the values to split on
– For example, why test "cts" first and not "vs"? Why test "cts < 2" and not "cts < 5"?
Choose the split that gives the maximum information gain (i.e., the maximum reduction of uncertainty)
Stopping criterion: when all the elements at one node have the same class, there is no need to split further
In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction
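The splitting criterion can be made concrete on the PlayTennis table; a sketch computing information gain per attribute (helper names are mine; Temperature is omitted for brevity):

```python
import math

# Sketch of the splitting criterion: pick the attribute with maximum
# information gain (reduction of entropy). Data is the PlayTennis table
# from the earlier slide; Temperature is omitted for brevity.

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, attr_index):
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# (Outlook, Humidity, Wind) for D1..D14; label = PlayTennis
rows = [("Sunny","High","Weak"), ("Sunny","High","Strong"), ("Overcast","High","Weak"),
        ("Rain","High","Weak"), ("Rain","Normal","Weak"), ("Rain","Normal","Strong"),
        ("Overcast","Normal","Weak"), ("Sunny","High","Weak"), ("Sunny","Normal","Weak"),
        ("Rain","Normal","Strong"), ("Sunny","Normal","Strong"), ("Overcast","High","Strong"),
        ("Overcast","Normal","Weak"), ("Rain","High","Strong")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

gains = {name: information_gain(rows, labels, i)
         for i, name in enumerate(["Outlook", "Humidity", "Wind"])}
print(max(gains, key=gains.get))  # Outlook
```
Outlook gives the highest gain, which is why it is tested at the root of the PlayTennis tree shown earlier.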
Decision Trees: Strengths
Decision trees are able to generate understandable rules. Decision trees perform classification without requiring much computation. Decision trees are able to handle both continuous and categorical variables. Decision trees provide a clear indication of which features are most important for prediction or classification.
Decision Trees: Weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples. Decision trees can be computationally expensive to train:
Need to compare all possible splits
Pruning is also expensive
Most decision-tree algorithms examine only a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
Decision Trees
Decision Trees in Weka
Naïve Bayes
More powerful than Decision Trees
(figure: decision boundaries of Decision Trees vs. Naïve Bayes)
Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
(example graph: A with children B and C, giving P(A), P(B|A), P(C|A))
Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Nodes are variables
Edges are conditional probabilities
Absence of an edge between nodes implies independence between the variables of the nodes
(example graph: A with children B and C gives P(A), P(B|A), and P(C|A), not P(C|A,B), since there is no edge between B and C)
Naïve Bayes for text classification
Naïve Bayes for text classification
(figure: an example Reuters document of the category "earn", beginning "Shr 34 cts vs …")
Naïve Bayes for text classification
The words depend on the topic: P(wi | Topic)
P(cts | earn) > P(tennis | earn)
Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(wi | Topic) for each word and each topic
(graphical model: a Topic node with children w1, w2, w3, ..., wn-1, wn)
Naïve Bayes for text classification
To classify a new example: calculate P(Topic | w1, w2, ..., wn) for each topic
Bayes decision rule: choose the topic T' for which P(T' | w1, w2, ..., wn) > P(T | w1, w2, ..., wn) for each T ≠ T'
(graphical model: a Topic node with children w1, w2, w3, ..., wn-1, wn)
Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w1, w2, ..., wn) = P(Topic) · ∏i P(wi | Topic)
We learn P(Topic) and P(wi | Topic) in training
At test time we need P(Topic | w1, w2, ..., wn):
P(Topic | w1, w2, ..., wn) = P(Topic, w1, w2, ..., wn) / P(w1, w2, ..., wn)
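The training and test computations can be sketched end to end; the tiny corpus and the add-one smoothing are illustrative additions, not from the lecture:

```python
import math
from collections import Counter

# Sketch of the Naive Bayes math: learn P(Topic) and P(w|Topic) from
# counts, then choose the topic maximizing P(Topic) * prod_i P(w_i|Topic).
# Corpus and add-one smoothing are illustrative, not from the lecture.

train = [("earn", "shr cts vs cts net profit".split()),
         ("earn", "shr cts vs loss net".split()),
         ("sports", "tennis match win set".split()),
         ("sports", "tennis serve match".split())]

prior = {t: c / len(train) for t, c in Counter(t for t, _ in train).items()}
vocab = {w for _, doc in train for w in doc}
counts = {t: Counter() for t in prior}
for t, doc in train:
    counts[t].update(doc)

def log_p_word(w, t):
    # add-one smoothing so unseen words do not zero out the product
    return math.log((counts[t][w] + 1) / (sum(counts[t].values()) + len(vocab)))

def classify(doc):
    # log-space to avoid underflow on long documents
    score = lambda t: math.log(prior[t]) + sum(log_p_word(w, t) for w in doc)
    return max(prior, key=score)

print(classify("shr cts vs".split()))    # earn
print(classify("tennis match".split()))  # sports
```
Note that the denominator P(w1, ..., wn) is the same for every topic, so the Bayes decision rule can ignore it, as the code does.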
Naïve Bayes: Strengths
Very simple model: easy to understand, very easy to implement
Very efficient: fast training and classification, modest space storage
Widely used because it works really well for text categorization
Linear, but non-parallel decision boundaries
Naïve Bayes: Weaknesses
The Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag-of-words model)
The words are treated as independent of each other given the class, which is false:
– "president" is more likely to occur in a context that contains "election" than in a context that contains "poet"
The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables.
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates.)
Naïve Bayes
Naïve Bayes in Weka
k Nearest Neighbor Classification
Nearest-neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor
k-nearest neighbor (kNN): consult the k nearest neighbors; the decision is based on the majority category among these neighbors; more robust than k = 1
An example of a similarity measure often used in NLP is cosine similarity
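Cosine-similarity kNN can be sketched in a few lines; the toy term-count vectors are illustrative:

```python
import math
from collections import Counter

# Sketch of k-nearest-neighbor with the cosine similarity mentioned on
# this slide. The toy term-count vectors are illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, examples, k=3):
    # examples: list of (vector, label); vote among the k most similar
    neighbors = sorted(examples, key=lambda e: cosine(query, e[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Term-count vectors over the terms (cts, shr, tennis, serve)
examples = [([3, 2, 0, 0], "earn"), ([2, 1, 0, 0], "earn"),
            ([0, 0, 3, 1], "sports"), ([0, 0, 2, 2], "sports"),
            ([1, 0, 2, 1], "sports")]
print(knn_classify([2, 2, 0, 0], examples))  # earn
```
Replacing the majority vote with a similarity-weighted vote gives the weighted variant mentioned on a later slide.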
1-Nearest Neighbor
3-Nearest Neighbor
Assign the category of the majority of the neighbors
But one neighbor may be much closer than the others: we can weight neighbors according to their similarity
k Nearest Neighbor Classification
Strengths: robust; conceptually simple; often works well; powerful (arbitrary decision boundaries)
Weaknesses: performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k); finding a good similarity measure can be difficult; computationally expensive
Summary
Algorithms for Classification: linear versus non-linear classification
Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), Kernel methods
Multi-class classification: Decision Trees, Naïve Bayes, k-nearest neighbor
On Wednesday: Weka