![Page 1: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/1.jpg)
I256
Applied Natural Language Processing
Fall 2009
Lecture 11
Classification
Barbara Rosario
![Page 2: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/2.jpg)
2
Announcements
• Next Thursday: project ideas
• Assignment 3 was due today
• Solutions and grades for assignment 2 by end of this week
• Assignment 4 out; due in 2 weeks (October 20)
• Project proposals (1-2 pages) due October 15 (more on this on Thursday)
![Page 3: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/3.jpg)
3
Classification
• Define classes/categories
• Label text
• Extract features
• Choose a classifier
– Naive Bayes Classifier
– NN (e.g. perceptron)
– Maximum Entropy
– …
• Train it
• Use it to classify new examples
![Page 4: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/4.jpg)
4
Today
• Algorithms for Classification
• Binary classification
– Linear and non-linear
• Multi-class classification
– Linear and non-linear
• Examples of classification algorithms for each case
![Page 5: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/5.jpg)
5
Classification
• In classification problems, each entity in some domain can be placed in one of a discrete set of categories: yes/no, friend/foe, good/bad/indifferent, blue/red/green, etc.
• Given a training set of labeled entities, develop a “rule” for assigning labels to entities in a test set
• Many variations on this theme:
– binary classification
– multi-category classification
– non-exclusive categories
– ranking
• Many criteria to assess rules and their predictions:
– overall errors
– costs associated with different kinds of errors
![Page 6: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/6.jpg)
6
Algorithms for Classification
• It's possible to treat these learning methods as black boxes.
• But there is a lot to be learned from taking a closer look.
• An understanding of these methods can help guide our choice of the appropriate learning method (binary or multi-class, for example).
![Page 7: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/7.jpg)
7
Representation of Objects
• Each object to be classified is represented as a pair (x, y):
– where x is a description of the object
– where y is a label
• Success or failure of a machine learning classifier often depends on choosing good descriptions of objects
– the choice of description can also be viewed as a learning problem (feature selection)
– but good human intuitions are often needed here
![Page 8: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/8.jpg)
8
Data Types
– text and hypertext
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>Welcome to FairmontNET</title>
</head><STYLE type="text/css">.stdtext {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: #1F3D4E;}.stdtext_wh {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: WHITE;}</STYLE>
<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" bgcolor="BLACK"><TABLE cellpadding="0" cellspacing="0" width="100%" border="0"> <TR> <TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif"> </TD> <TD><img src="/TFN/en/CDA/Images/common/labels/decorative.gif"></td> <TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif"> </TD> </TR></TABLE><tr> <td align="right" valign="middle"><IMG src="/TFN/en/CDA/Images/common/labels/centrino_logo_blk.gif"></td></tr></body></html>
![Page 9: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/9.jpg)
9
Data Types
– network layout: graph
– And many many others….
![Page 10: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/10.jpg)
10
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
– Get a large collection of example images, each labeled with a digit
– Note: someone has to hand label all this data
– Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
– Pixels: (6,8)=ON
– Shape patterns: NumComponents, AspectRatio, NumLoops
– …
• Current state-of-the-art: Human-level performance
(Figure: example digit images labeled 0, 1, 2, 1, and a new unlabeled image to classify)
![Page 11: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/11.jpg)
11
Text Classification tasks
Assign the correct class label for a given input/object
In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.
Examples:
Adapted from: Foundations of Statistical NLP (Manning et al)
| Problem | Object | Label's categories |
|---|---|---|
| Tagging | Word | POS |
| Sense disambiguation | Word | The word's senses |
| Information retrieval | Document | Relevant / not relevant |
| Sentiment classification | Document | Positive / negative |
| Text categorization | Document | Topics / classes |
| Author identification | Document | Authors |
| Language identification | Document | Language |
![Page 12: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/12.jpg)
12
Other Examples of Real-World Classification Tasks
• Fraud detection (input: account activity, classes: fraud / no fraud)
• Web page spam detection (input: HTML/rendered page, classes: spam / ham)
• Speech recognition and speaker recognition (input: waveform, classes: phonemes or words)
• Medical diagnosis (input: symptoms, classes: diseases)
• Automatic essay grader (input: document, classes: grades)
• Customer service email routing and foldering
• Link prediction in social networks
• … many many more
• Classification is an important commercial technology
![Page 13: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/13.jpg)
13
Training and Validation
• Data: labeled instances, e.g. emails marked spam/ham
– Training set
– Validation set
– Test set
• Training
– Estimate parameters on training set
– Tune features on validation set
– Report results on test set
– Anything short of this yields over-optimistic claims
• Evaluation
– Many different metrics
– Ideally, the criteria used to train the classifier should be closely related to those used to evaluate the classifier
• Statistical issues
– Want a classifier which does well on test data
– Overfitting: fitting the training data very closely, but not generalizing well
– Error bars: want realistic (conservative) estimates of accuracy
(Diagram: the data is partitioned into Training Data, Validation Data, and Test Data)
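To make the protocol concrete, here is a minimal sketch (not part of the original slides) using scikit-learn; the toy spam/ham texts, the choice of a Naïve Bayes classifier, and the 60/20/20 split are illustrative assumptions.

```python
# Split labeled data into train / validation / test and report accuracy
# only once on the held-out test set.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"] * 50
labels = ["spam", "ham", "spam", "ham"] * 50

# 60% train, 20% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

vec = CountVectorizer()                                   # features fit on training data only
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

print("validation accuracy:", clf.score(vec.transform(X_val), y_val))   # used for tuning
print("test accuracy:", clf.score(vec.transform(X_test), y_test))       # reported once
```

The point is that the vectorizer and classifier only ever see the training portion during fitting, the validation set is used for tuning, and the test score is reported a single time.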
![Page 14: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/14.jpg)
14
Intuitive Picture of the Problem
(Figure: a cloud of data points belonging to Class 1 and Class 2)
![Page 15: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/15.jpg)
15
Some Issues
• There may be a simple separator (e.g., a straight line in 2D or a hyperplane in general) or there may not
• There may be “noise” of various kinds
• There may be “overlap”
• Some classifiers explicitly represent separators (e.g., straight lines), while for other classifiers the separation is done implicitly
• Some classifiers just make a decision as to which class an object is in; others estimate class probabilities
![Page 16: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/16.jpg)
16
Methods
• Binary vs. multi class classification
• Linear vs. non linear
![Page 17: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/17.jpg)
17
Methods
• Linear models:
– Perceptron & Winnow (neural networks)
– Large margin classifiers
• Support Vector Machine (SVM)
• Probabilistic models:
– Naïve Bayes
– Maximum Entropy models
• Decision models:
– Decision Trees
• Instance-based methods:
– Nearest neighbor
![Page 18: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/18.jpg)
18
Binary Classification: examples
• Spam filtering (spam, not spam)• Customer service message classification (urgent
vs. not urgent)• Information retrieval (relevant, not relevant)• Sentiment classification (positive, negative)• Sometime it can be convenient to treat a multi-
way problem like a binary one: one class versus all the others, for all classes
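A minimal one-vs-rest sketch of that reduction (not from the slides); the Gaussian blob data, the use of scikit-learn's LogisticRegression as the binary learner, and the class labels 0/1/2 are illustrative assumptions.

```python
# One-vs-rest: train one binary classifier per class, then predict the class
# whose classifier gives the highest score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
means = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in means])
y = np.repeat([0, 1, 2], 100)

scores = []
for c in [0, 1, 2]:
    clf = LogisticRegression().fit(X, (y == c).astype(int))   # class c vs. the rest
    scores.append(clf.decision_function(X))                    # higher = more like class c

pred = np.argmax(np.vstack(scores), axis=0)
print("training accuracy:", (pred == y).mean())
```

scikit-learn also packages this reduction as sklearn.multiclass.OneVsRestClassifier.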
![Page 19: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/19.jpg)
19
Binary Classification
• Given: some data items that belong to a positive (+1 ) or a negative (-1 ) class
• Task: Train the classifier and predict the class for a new data item
• Geometrically: find a separator
![Page 20: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/20.jpg)
20
Linear versus Non Linear algorithms
• Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary
![Page 21: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/21.jpg)
21
Linearly separable data
(Figure: Class 1 and Class 2 points separated by a linear decision boundary)
![Page 22: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/22.jpg)
22
Non linearly separable data
(Figure: Class 1 and Class 2 points that cannot be separated by a straight line)
![Page 23: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/23.jpg)
23
Non linearly separable data
(Figure: a non-linear classifier separating Class 1 from Class 2)
![Page 24: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/24.jpg)
24
Linear versus Non Linear algorithms
• Linear or Non linear separable data?– We can find out only empirically
• Linear algorithms (algorithms that find a linear decision boundary)– When we think the data is linearly separable– Advantages
• Simpler, less parameters
– Disadvantages• High dimensional data (like for NLP) is usually not linearly
separable
– Examples: Perceptron, Winnow, large margin – Note: we can use linear algorithms also for non linear
problems (see Kernel methods)
![Page 25: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/25.jpg)
25
Linear versus Non Linear algorithms
• Non Linear algorithms – When the data is non linearly separable– Advantages
• More accurate
– Disadvantages• More complicated, more parameters
– Example: Kernel methods– Note: the distinction between linear and non
linear applies also for multi-class classification (we’ll see this later)
![Page 26: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/26.jpg)
26
Simple linear algorithms
• Perceptron and Winnow algorithm– Linear– Binary classification– Online (process data sequentially, one
data point at the time)– Mistake driven– Simple single layer Neural Networks
![Page 27: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/27.jpg)
27
Linear binary classification
• Data: $\{(x_i, y_i)\}_{i=1\dots n}$
– $x \in \mathbb{R}^d$ (a feature vector in d-dimensional space)
– $y \in \{-1, +1\}$ (the label: class, category)
• Question:
– Design a linear decision boundary $w \cdot x + b = 0$ (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error
– Classification rule:
• $y = \mathrm{sign}(w \cdot x + b)$, which means:
• if $w \cdot x + b > 0$ then $y = +1$
• if $w \cdot x + b < 0$ then $y = -1$
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 28: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/28.jpg)
28
Linear binary classification
• Find a good hyperplane $(w, b)$ in $\mathbb{R}^{d+1}$ that correctly classifies data points as much as possible
• In online fashion: one data point at a time, update the weights as necessary
(Figure: separating hyperplane $w \cdot x + b = 0$; classification rule $y = \mathrm{sign}(w \cdot x + b)$)
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 29: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/29.jpg)
29
Perceptron algorithm
• Initialize: $w_1 = 0$
• Updating rule, for each data point $x_i$:
– If class($x_i$) != decision($x_i$, $w_k$)
– then $w_{k+1} \leftarrow w_k + y_i x_i$ and $k \leftarrow k + 1$
– else $w_{k+1} \leftarrow w_k$
• Function decision(x, w):
– If $w \cdot x + b > 0$ return +1
– Else return -1
(Figure: after a mistake, the hyperplane $w_k \cdot x + b = 0$ is rotated toward the misclassified point, giving $w_{k+1} \cdot x + b = 0$; a runnable sketch follows below)
From Gert Lanckriet, Statistical Learning Theory Tutorial
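A minimal NumPy sketch of this update rule (not part of the slides); folding the bias b into the weight vector and running a fixed number of epochs are implementation assumptions.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """X: (n, d) array of feature vectors; y: labels in {-1, +1}."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 so w[-1] plays the role of b
    w = np.zeros(X.shape[1])                   # initialize w_1 = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:        # mistake: class(x) != decision(x, w)
                w = w + y_i * x_i              # w_{k+1} <- w_k + y_i * x_i
    return w

def perceptron_predict(X, w):
    X = np.hstack([X, np.ones((len(X), 1))])
    return np.where(X @ w > 0, 1, -1)          # y = sign(w . x + b)
```

On linearly separable data this loop stops making updates once every point is classified correctly, which is the convergence guarantee mentioned on the next slide.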
![Page 30: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/30.jpg)
30
Perceptron algorithm
• Online: can adjust to changing target, over time• Advantages
– Simple and computationally efficient– Guaranteed to learn a linearly separable problem
(convergence, global optimum)• Limitations
– Only linear separations– Only converges for linearly separable data– Not really “efficient with many features”
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 31: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/31.jpg)
31
Winnow algorithm
• Another online algorithm for learning perceptron weights:
f(x) = sign(wx + b)
• Linear, binary classification
• Update-rule: again error-driven, but multiplicative (instead of additive)
![Page 32: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/32.jpg)
32
Winnow algorithm
(Figure: as in the perceptron, a mistake rotates the hyperplane $w_k \cdot x + b = 0$ into $w_{k+1} \cdot x + b = 0$)
• Initialize: $w_1 = 0$
• Updating rule, for each data point $x_i$:
– If class($x_i$) != decision($x_i$, $w_k$)
– then
$w_{k+1} \leftarrow w_k + y_i x_i$ (Perceptron)
$w_{k+1} \leftarrow w_k \cdot \exp(y_i x_i)$ (Winnow)
and $k \leftarrow k + 1$
– else $w_{k+1} \leftarrow w_k$
• Function decision(x, w):
– If $w \cdot x + b > 0$ return +1
– Else return -1
From Gert Lanckriet, Statistical Learning Theory Tutorial
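A sketch of the multiplicative update as written on this slide (not from the lecture itself). Starting the weights at 1 rather than 0 and using a fixed threshold are assumptions needed to make a multiplicative update meaningful; classical presentations of Winnow use a fixed promotion/demotion factor instead of exp(y·x).

```python
import numpy as np

def winnow_train(X, y, epochs=10):
    """X: (n, d) non-negative feature vectors; y: labels in {-1, +1}."""
    w = np.ones(X.shape[1])                       # multiplicative updates start from 1, not 0
    b = -X.shape[1] / 2.0                         # simple fixed threshold (an assumption)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(x_i @ w + b) != y_i:       # mistake-driven, exactly as in the perceptron
                w = w * np.exp(y_i * x_i)         # multiplicative instead of additive update
    return w, b
```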
![Page 33: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/33.jpg)
33
Perceptron vs. Winnow
• Assume
– N available features
– only K relevant features, with K << N
• Perceptron: number of mistakes O(K·N)
• Winnow: number of mistakes O(K·log N)
• Winnow is more robust to high-dimensional feature spaces, and is often used in NLP
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 34: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/34.jpg)
34
Perceptron vs. Winnow
• Perceptron
– Online: can adjust to a changing target over time
– Advantages
• Simple and computationally efficient
• Guaranteed to learn a linearly separable problem
– Limitations
• only linear separations
• only converges for linearly separable data
• not really “efficient with many features”
• Winnow
– Online: can adjust to a changing target over time
– Advantages
• Simple and computationally efficient
• Guaranteed to learn a linearly separable problem
• Suitable for problems with many irrelevant attributes
– Limitations
• only linear separations
• only converges for linearly separable data
• not really “efficient with many features”
– Used in NLP
![Page 35: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/35.jpg)
35
Large margin classifier
• Another family of linear algorithms
• Intuition (Vapnik, 1965)
• If the classes are linearly separable:
– Separate the data
– Place the hyperplane “far” from the data: large margin
– Statistical results guarantee good generalization
(Figure, labeled BAD: a separating hyperplane that passes very close to the training points)
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 36: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/36.jpg)
36
Large margin classifier (maximal margin classifier)
• Intuition (Vapnik, 1965), if linearly separable:
– Separate the data
– Place the hyperplane “far” from the data: large margin
– Statistical results guarantee good generalization
(Figure, labeled GOOD: a separating hyperplane with a large margin to both classes)
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 37: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/37.jpg)
37
Large margin classifier
• If not linearly separable:
– Allow some errors
– Still, try to place the hyperplane “far” from each class
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 38: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/38.jpg)
38
Large Margin Classifiers
• Advantages
– Theoretically better (better error bounds)
• Limitations
– Computationally more expensive: requires solving a large quadratic program
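In practice the standard large-margin classifier is the SVM. A minimal scikit-learn sketch (not from the slides, with made-up Gaussian data); the C parameter controls how many margin violations are tolerated.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

clf = LinearSVC(C=1.0).fit(X, y)      # smaller C allows more margin violations ("errors")
print(clf.coef_, clf.intercept_)      # the learned hyperplane (w, b)
```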
![Page 39: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/39.jpg)
39
Non Linear problem
![Page 40: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/40.jpg)
40
Non Linear problem
![Page 41: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/41.jpg)
From Gert Lanckriet, Statistical Learning Theory Tutorial
41
Non Linear problem
• Kernel methods
• A family of non-linear algorithms
• Transform the non-linear problem into a linear one (in a different feature space)
• Use linear algorithms to solve the linear problem in the new space
![Page 42: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/42.jpg)
42
Main intuition of Kernel methods
• (Copy here from black board)
![Page 43: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/43.jpg)
43
Basic principle of kernel methods
• $X = [x\ z]$
• $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ (with $D \gg d$)
• $\Phi(X) = [x^2\ z^2\ xz]$
• $f(x) = \mathrm{sign}(w_1 x^2 + w_2 z^2 + w_3 xz + b)$
• Decision boundary in the new space: $w^T \Phi(x) + b = 0$
From Gert Lanckriet, Statistical Learning Theory Tutorial
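A sketch of this exact map Φ([x, z]) = [x², z², xz] (not part of the slides): concentric-circle data that no line can separate in 2-D becomes linearly separable after the mapping, so an ordinary perceptron works. The generated data is illustrative.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])  # inner disk vs. outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([-1] * 100 + [+1] * 100)

def phi(X):
    x, z = X[:, 0], X[:, 1]
    return np.column_stack([x ** 2, z ** 2, x * z])   # map R^2 -> R^3

clf = Perceptron().fit(phi(X), y)                     # a linear algorithm in the new space
print("accuracy in mapped space:", clf.score(phi(X), y))
```

The kernel trick (next slides) gets the same effect without ever computing Φ explicitly.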
![Page 44: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/44.jpg)
44
Basic principle kernel methods
• Linear separability: more likely in high dimensions
• Mapping $\Phi$: maps the input into a high-dimensional feature space
• Classifier: construct a linear classifier in the high-dimensional feature space
• Motivation: an appropriate choice of $\Phi$ leads to linear separability
• We can do this efficiently!
From Gert Lanckriet, Statistical Learning Theory Tutorial
![Page 45: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/45.jpg)
45
Basic principle kernel methods
• We can use the linear algorithms seen before (for example, perceptron) for classification in the higher dimensional space
![Page 46: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/46.jpg)
46
Multi-class classification
• Given: some data items that belong to one of M possible classes
• Task: Train the classifier and predict the class for a new data item
• Geometrically: harder problem, no more simple geometry
![Page 47: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/47.jpg)
47
Multi-class classification
![Page 48: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/48.jpg)
48
Multi-class classification: Examples
• Author identification
• Language identification
• Text categorization (topics)
![Page 49: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/49.jpg)
49
(Some) Algorithms for multi-class classification
• Linear– Parallel class separators: Decision Trees– Non parallel class separators: Naïve
Bayes and Maximum Entropy
• Non Linear– K-nearest neighbors
![Page 50: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/50.jpg)
50
Linear, parallel class separators (ex: Decision Trees)
![Page 51: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/51.jpg)
51
Linear, NON parallel class separators (ex: Naïve Bayes)
![Page 52: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/52.jpg)
52
Non Linear (ex: k Nearest Neighbor)
![Page 53: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/53.jpg)
53
Decision Trees
• A decision tree is a classifier in the form of a tree structure, where each node is either:
– a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test, or
– a leaf node, which indicates the value of the target attribute (class) of examples
• A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node, which provides the classification of the instance.
![Page 54: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/54.jpg)
54
Training Examples
| Day | Outlook | Temp. | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Weak | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Strong | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
Goal: learn when we can play Tennis and when we cannot
![Page 55: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/55.jpg)
55
Decision Tree for PlayTennis
• Outlook = Sunny → test Humidity: High → No; Normal → Yes
• Outlook = Overcast → Yes
• Outlook = Rain → test Wind: Strong → No; Weak → Yes
![Page 56: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/56.jpg)
www.math.tau.ac.il/~nin/ Courses/ML04/DecisionTreesCLS.pp
56
Decision Tree for PlayTennis
• Root: Outlook, with branches Sunny, Overcast, Rain; under Sunny, Humidity with branches High (→ No) and Normal (→ Yes)
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
![Page 57: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/57.jpg)
www.math.tau.ac.il/~nin/ Courses/ML04/DecisionTreesCLS.pp
57
Decision Tree for PlayTennis
• The tree again: Outlook = Sunny → Humidity (High → No, Normal → Yes); Outlook = Overcast → Yes; Outlook = Rain → Wind (Strong → No, Weak → Yes)
• Classify a new instance:

| Outlook | Temperature | Humidity | Wind | PlayTennis |
|---|---|---|---|---|
| Sunny | Hot | High | Weak | ? |

• Following the tree (Sunny, then Humidity = High), the answer is No
![Page 58: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/58.jpg)
58
Decision Tree for Reuter classification
Stat NLP, Manning and Schuetze
![Page 59: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/59.jpg)
59
Decision Tree for Reuter classification
Stat NLP, Manning and Schuetze
![Page 60: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/60.jpg)
60
Decision Tree for Reuter classification
Stat NLP, Manning and Schuetze
![Page 61: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/61.jpg)
61
Building Decision Trees
• Once we have a decision tree, it is straightforward to use it to assign labels to new input values.
• How can we build a decision tree that models a given training set?
![Page 62: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/62.jpg)
62
Building Decision Trees
• The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples.
• Top-down, greedy search through the space of possible decision trees
– That is, it picks the best attribute and never looks back to reconsider earlier choices
![Page 63: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/63.jpg)
63
Building Decision Trees
• Splitting criterion
– Finding the features and the values to split on
• for example, why test first “cts” and not “vs”?
• why test on “cts < 2” and not “cts < 5”?
– Choose the split that gives us the maximum information gain (or the maximum reduction of uncertainty); a small worked sketch follows below
• Stopping criterion
– When all the elements at one node have the same class, no need to split further
• In practice, one first builds a large tree and then prunes it back (to avoid overfitting)
• See Foundations of Statistical Natural Language Processing, Manning and Schuetze for a good introduction
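A small sketch of the information-gain computation referenced above (not from the slides); the helper names are mine.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """rows: list of attribute tuples; labels: list of class labels."""
    base = entropy(labels)                               # uncertainty before the split
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())      # weighted uncertainty after the split
    return base - remainder
```

On the standard version of the PlayTennis data, Outlook has the largest gain (roughly 0.25 bits, versus about 0.15 for Humidity and less for Wind and Temperature), which is why the tree tests it at the root.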
![Page 64: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/64.jpg)
64
Decision Trees: Strengths
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which features are most important for prediction or classification.
![Page 65: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/65.jpg)
65
Decision Trees: weaknesses
• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
– Since each branch in the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small.
• Decision trees can be computationally expensive to train.
– Need to compare all possible splits
– Pruning is also expensive
![Page 66: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/66.jpg)
66
Decision Trees: weaknesses
• Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
– The fact that decision trees require that features be checked in a specific order limits their ability to exploit features that are relatively independent of one another.
– Naive Bayes overcomes this limitation by allowing all features to act "in parallel."
![Page 67: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/67.jpg)
67
Naïve Bayes: more powerful than Decision Trees
(Figure: a decision tree contrasted with a Naïve Bayes model over the same features)
Every feature gets a say in determining which label should be assigned to a given input value.
![Page 68: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/68.jpg)
Foundations of Statistical Natural Language Processing, Manning and Schuetze
68
Naïve Bayes for text classification
![Page 69: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/69.jpg)
69
Naïve Bayes for text classification
(Figure: the topic node “earn” generating the words of a document, e.g. “Shr 34 cts vs … per shr”)
![Page 70: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/70.jpg)
70
Naïve Bayes for text classification
• The words depend on the topic: $P(w_i \mid \mathrm{Topic})$
– $P(\mathrm{cts} \mid \mathrm{earn}) > P(\mathrm{tennis} \mid \mathrm{earn})$
• Naïve Bayes (aka independence) assumption: all words are independent given the topic
• From the training set we learn the probabilities $P(w_i \mid \mathrm{Topic})$ for each word and for each topic
(Graphical model: a Topic node with child nodes $w_1, w_2, \dots, w_{n-1}, w_n$)
![Page 71: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/71.jpg)
71
Naïve Bayes for text classification
• To classify a new example:
• Calculate $P(\mathrm{Topic} \mid w_1, w_2, \dots, w_n)$ for each topic
• Bayes decision rule:
– Choose the topic T' for which
– $P(T' \mid w_1, w_2, \dots, w_n) > P(T \mid w_1, w_2, \dots, w_n)$ for each $T \neq T'$
(Graphical model: a Topic node with child nodes $w_1, w_2, \dots, w_{n-1}, w_n$)
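A self-contained sketch of this decision rule (not from the slides): by Bayes' rule and the independence assumption, choosing the topic that maximizes P(T | w1…wn) is the same as maximizing P(T)·∏ P(wi | T). The add-one smoothing and the helper names are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, topics):
    """docs: list of token lists; topics: list of topic labels."""
    topic_counts = Counter(topics)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, t in zip(docs, topics):
        word_counts[t].update(tokens)
        vocab.update(tokens)
    return topic_counts, word_counts, vocab

def classify_nb(tokens, topic_counts, word_counts, vocab):
    n_docs = sum(topic_counts.values())
    best_topic, best_score = None, float("-inf")
    for t, tc in topic_counts.items():
        total = sum(word_counts[t].values())
        score = math.log(tc / n_docs)                              # log P(T)
        for w in tokens:                                           # log P(w | T), add-one smoothed
            score += math.log((word_counts[t][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = t, score
    return best_topic
```

Working in log space avoids numerical underflow when multiplying many small probabilities.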
![Page 72: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/72.jpg)
72
Naïve Bayes: Strengths
• Very simple model– Easy to understand– Very easy to implement
• Can scale easily to millions of training examples (just need counts!)
• Very efficient, fast training and classification• Modest space storage• Widely used because it works really well for text
categorization• Linear, but non parallel decision boundaries
![Page 73: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/73.jpg)
73
Naïve Bayes: weaknesses
• The Naïve Bayes independence assumption has two consequences:
– The linear ordering of words is ignored (bag-of-words model)
– The words are assumed independent of each other given the class, which is false: “president” is more likely to occur in a context that contains “election” than in a context that contains “poet”
• Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
• (But even if the model is not “right”, Naïve Bayes models do well in a surprisingly large number of cases because often we are interested in classification accuracy and not in accurate probability estimations)
• Does not optimize prediction accuracy
![Page 74: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/74.jpg)
74
The naivete of independence
• Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
• One problem that arises is that the classifier can end up "double-counting" the effect of highly correlated features, pushing the classifier closer to a given label than is justified.
• Consider a name-gender classifier (a sketch follows below)
• For example, the features ends-with(a) and ends-with(vowel) are dependent on one another, because if an input value has the first feature, then it must also have the second feature. For features like these, the duplicated information may be given more weight than is justified by the training set.
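A tiny sketch of such a feature extractor (not from the slides, in the style of the NLTK book's name-gender example; the feature names are made up).

```python
def gender_features(name):
    """Two deliberately correlated features for a name-gender classifier."""
    name = name.lower()
    return {
        "ends_with_a": name.endswith("a"),
        "ends_with_vowel": name[-1] in "aeiou",
    }

# e.g. gender_features("Barbara") -> {'ends_with_a': True, 'ends_with_vowel': True}
```

Because every name ending in “a” also ends in a vowel, a Naïve Bayes model trained on these features effectively counts the same evidence twice.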
![Page 75: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/75.jpg)
75
Maximum Entropy
(Figure: decision boundaries of Decision Trees versus Naïve Bayes & Maximum Entropy)
Maximum Entropy: multi-class, linear classifier
Difference with Naïve Bayes: Maximum Entropy is a conditional model
![Page 76: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/76.jpg)
76
Generative vs Conditional Classifiers
• Generative (example: Naïve Bayes): builds a model that predicts P(input, label), the joint probability of an (input, label) pair.
• Conditional, aka discriminative (example: Maximum Entropy): builds a model that predicts P(label | input), the probability of a label given the input value.
(Figure: two graphical models relating features v1, v2, v3 to the label T, one labeled Generative and one Discriminative)
![Page 77: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/77.jpg)
77
Generative vs Conditional
• Generative, P(input, label): can be used to answer the following questions:
1. What is the most likely label for a given input?
2. How likely is a given label for a given input?
3. What is the most likely input value?
4. How likely is a given input value?
5. How likely is a given input value with a given label?
6. What is the most likely label for an input that might have one of two values (but we don't know which)?
• Conditional, P(label | input): can be used to answer the following questions:
1. What is the most likely label for a given input?
2. How likely is a given label for a given input?
![Page 78: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/78.jpg)
78
Generative vs Conditional
• Generative P(input, label) versus conditional P(label | input)
• Generative models are strictly more powerful than conditional models, since we can calculate the conditional probability P(label | input) from the joint probability P(input, label), but not vice versa
• This additional power comes at a price. Because the model is more powerful, it has more "free parameters" which need to be learned.
• Thus, when using a more powerful model, we end up with less data that can be used to train each parameter's value, making it harder to find the best parameter values.
• As a result, a generative model may not do as good a job at answering questions 1 and 2 as a conditional model, since the conditional model can focus its efforts on those two questions.
• However, if we do need answers to questions like 3-6, then we have no choice but to use a generative model.
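A conditional classifier in code (not from the slides): with the standard exponential-model parameterization, a Maximum Entropy classifier is equivalent to multinomial logistic regression, so the sketch uses scikit-learn's LogisticRegression; the tiny corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["shr 34 cts vs 22 cts", "rain expected over the weekend",
         "net profit rose 12 pct", "sunny skies and mild temperatures"]
labels = ["earn", "weather", "earn", "weather"]

maxent = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(texts, labels)
print(maxent.predict(["profit of 5 cts per share"]))   # models P(label | input) directly
```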
![Page 79: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/79.jpg)
79
k Nearest Neighbor Classification
• Instance-based method
• Nearest-neighbor classification rule: to classify a new object, find the object in the training set that is most similar (i.e. closest in the vector space), then assign the category of this nearest neighbor
• k Nearest Neighbor (kNN): consult the k nearest neighbors and decide based on their majority category. More robust than k = 1 (nearest neighbor)
• An example of a similarity measure often used in NLP is cosine similarity (see the sketch below)
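A compact sketch of kNN with cosine similarity (not part of the slides); the helper names and the choice k = 3 are illustrative.

```python
import numpy as np
from collections import Counter

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def knn_classify(x, X_train, y_train, k=3):
    """Return the majority label among the k training vectors most similar to x."""
    sims = np.array([cosine(x, x_i) for x_i in X_train])
    nearest = np.argsort(-sims)[:k]                      # indices of the k most similar items
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```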
![Page 80: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/80.jpg)
80
1-Nearest Neighbor
![Page 81: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/81.jpg)
81
1-Nearest Neighbor
![Page 82: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/82.jpg)
82
3-Nearest Neighbor
![Page 83: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/83.jpg)
83
3-Nearest Neighbor
Assign the category of the majority of the neighbors
But this one is closer… We can weight neighbors according to their similarity
![Page 84: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/84.jpg)
84
k Nearest Neighbor Classification
• Strengths
– Robust
– Conceptually simple
– Often works well (used in computer vision a lot)
– Powerful (arbitrary decision boundaries)
– Very fast training (just build data structures)
• Weaknesses
– Performance is very dependent on the similarity measure used (and to a lesser extent on the number of neighbors k used)
– Finding a good similarity measure can be difficult
– Computationally expensive for prediction
– Not the best accuracy among classifiers
![Page 85: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/85.jpg)
85
What do models tell us?
• Descriptive models vs. explanatory models.
• Descriptive models capture patterns and correlations in the data but they don't provide any information about why the data contains those patterns
• Explanatory models attempt to capture properties and relationships that cause the linguistic patterns.
![Page 86: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/86.jpg)
86
What do models tell us?
• For example, we might introduce the abstract concept of "polar adjective", as one that has an extreme meaning, and categorize some adjectives like adore and detest as polar.
• Our explanatory model would contain the constraint that absolutely can only combine with polar adjectives, and definitely can only combine with non-polar adjectives.
![Page 87: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/87.jpg)
87
What do models tell us?
• Most models that are automatically constructed from a corpus are descriptive models
• If our goal is to understand the linguistic patterns, then we can use this information about which features are related as a starting point for further experiments designed to tease apart the relationships between features and patterns.
• On the other hand, if we're just interested in using the model to make predictions (e.g., as part of a language processing system), then we can use the model to make predictions about new data without worrying about the details of underlying causal relationships.
![Page 88: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/88.jpg)
88
Summary
• Algorithms for classification
• Linear versus non-linear classification
• Binary linear
– Perceptron, Winnow, large margin classifiers
• Binary non-linear
– Kernel methods
• Multi-class linear
– Decision Trees, Naïve Bayes, Maximum Entropy
• Multi-class non-linear
– k-nearest neighbors
![Page 89: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/89.jpg)
89
Methods
• Linear models:
– Perceptron & Winnow (neural networks)
– Large margin classifiers
• Support Vector Machine (SVM)
• Probabilistic models:
– Naïve Bayes
– Maximum Entropy models
• Decision models:
– Decision Trees
• Instance-based methods:
– Nearest neighbor
![Page 90: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/90.jpg)
90
Classification
• Define classes/categories
• Label text
• Extract features
• Choose a classifier
– Naive Bayes Classifier
– NN (e.g. perceptron)
– Maximum Entropy
– …
• Train it
• Use it to classify new examples
![Page 91: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/91.jpg)
91
How do you choose the classifier?
• Binary vs. multi-class
• Try different ones (linear, non-linear) and evaluate
• Pros and cons
– For example: do you care about being fast at decision time? (k-nearest neighbor is not…)
– Fast training? (decision tree training is slow)
– Do you care about interpretable rules (decision trees) or only about classification accuracy?
![Page 92: I256 Applied Natural Language Processing Fall 2009 Lecture 11 Classification Barbara Rosario](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d5e5503460f94a3d20b/html5/thumbnails/92.jpg)
92
Next week
• Projects!