Text Classification: Vector Space Based and Linear Classifiers
Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schütze
Recall: Vector Space Representation
• Each document is a vector, one component (term weight) for each term (= word).
• Normally normalize vectors to unit length.
• High-dimensional vector space:
  – Terms are axes
  – 10,000+ dimensions, or even 100,000+
  – Docs are vectors in this space
• How can we do classification in this space?
  – Recall that Naïve Bayes classification does not make use of the term weights.
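The representation above can be sketched in a few lines of Python. This is a minimal illustration with our own helper names, not an implementation from the slides:

```python
import math

def tfidf_vectors(docs):
    """Map each document (a whitespace-tokenized string) to a unit-length
    tf-idf vector over the corpus vocabulary (one axis per term)."""
    vocab = sorted({t for d in docs for t in d.split()})
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(t in d.split() for d in docs) for t in vocab}
    vectors = []
    for d in docs:
        toks = d.split()
        # tf * idf for every vocabulary term (0 when the term is absent)
        vec = [toks.count(t) * math.log(n / df[t]) for t in vocab]
        # normalize to unit length (guard against an all-zero vector)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors
```

Each resulting document vector has one component per vocabulary term and Euclidean length 1, as assumed on this slide.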
Classification Using Vector Spaces
• As before, the training set is a set of documents, each labeled with its class (e.g., topic)
• In vector space based representation, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
• Premise 1: Documents in the same class form a contiguous region of space
• Premise 2: Documents from different classes don’t overlap (much)
• We define surfaces to delineate classes in the space
Documents in a Vector Space
(Figure: document vectors grouped into three class regions: Government, Science, Arts.)
Test Document of what class?
(Figure: an unlabeled test document among the Government, Science, and Arts regions.)
Test Document = Government
(Figure: the test document falls in the Government region.)
The main question is how to find good separators.
Using Rocchio for text classification
• Use standard tf-idf weighted vectors to represent text documents.
• For each category, compute a prototype vector by summing the vectors of the training documents in that category.
  – Prototype = centroid of the members of the class
• Assign each test document to the category with the closest prototype vector, based on cosine similarity.
Definition of centroid
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$.

• Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
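As a sketch of the definition above (documents as plain Python lists; the function names are our own, not the book's), Rocchio training and classification amount to:

```python
import math

def centroid(vectors):
    # mu(c) = (1/|Dc|) * sum of the vectors of the class members
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

def cosine(u, v):
    # cosine similarity: dot product over the product of vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rocchio_classify(doc_vec, centroids):
    # assign to the class whose centroid is most cosine-similar to the document
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))
```

Because the final step uses cosine similarity, the fact that centroids are not unit vectors does not affect the decision.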
Illustration of Rocchio Text Categorization
Note: centroid vectors are illustrated by directions only.
Rocchio Properties
• Forms a simple generalization of the examples in each class (a prototype).
• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
Rocchio classification
Rocchio classification
• Rocchio forms a simple representation for each class: the centroid/prototype
• Classification is based on similarity to / distance from the prototype/centroid
• It does not guarantee that classifications are consistent with the given training data
• It is little used outside text classification, but it has been used quite effectively for text classification.
• Again, it is cheap to train and cheap to apply to test documents.
k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• To classify document d into class c:
  – Define the k-neighborhood N as the k nearest neighbors of d
  – Count the number i of documents in N that belong to c
  – Estimate P(c|d) as i/k
  – Choose as class argmax_c P(c|d) [= majority class]
Example: k=6 (6NN)
(Figure: the 6 nearest neighbors of a test document, drawn from the Government, Science, and Arts classes.)
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the training examples in D.
• Testing instance x (under 1NN):
  – Compute the similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based learning
  – Memory-based learning
  – Lazy learning
• Rationale of kNN: contiguity hypothesis
k Nearest Neighbor
• Using only the closest example (1NN) to determine the class is subject to errors due to:
  – A single atypical example.
  – Noise (i.e., an error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
kNN decision boundaries
(Figure: piecewise-linear kNN decision boundaries between the Government, Science, and Arts regions.)
Boundaries are in principle arbitrary surfaces, but usually polyhedra.
kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).
Similarity Metrics
• Nearest neighbor method depends on a similarity (or distance) metric.
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
Illustration of 3 Nearest Neighbor for Text Vector Space
3 Nearest Neighbor Comparison
• Nearest Neighbor tends to handle polymorphic categories better.
k Nearest Neighbor
• The value of k is typically odd to avoid ties; 3 and 5 are the most common, but larger values between 50 and 100 are also used.
• Alternatively, we can select k that gives best results on a held‐out portion of the training set.
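Held-out selection of k can be sketched as follows. This is a toy version with squared Euclidean distance and our own function names; a real system would use the tf-idf/cosine setup from the earlier slides:

```python
from collections import Counter

def knn_predict(train, k, x):
    # majority label among the k training points closest to x (squared Euclidean)
    nearest = sorted(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def select_k(train, held_out, candidates=(1, 3, 5)):
    # pick the k with the highest accuracy on a labeled held-out split
    def accuracy(k):
        return sum(knn_predict(train, k, x) == y for x, y in held_out) / len(held_out)
    return max(candidates, key=accuracy)
```

With a noisy training point near the class boundary, k = 1 latches onto the noise while k = 3 votes it down, which is exactly the robustness argument from the previous slide.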
kNN Algorithm: Training (Preprocessing) and Testing
TRAIN-KNN(C, D)
1  D′ ← PREPROCESS(D)
2  k ← SELECT-K(C, D′)
3  return D′, k

APPLY-KNN(C, D′, k, d)
1  S_k ← COMPUTENEARESTNEIGHBORS(D′, k, d)
2  for each c_j ∈ C
3  do p_j ← |S_k ∩ c_j| / k
4  return argmax_j p_j

p_j is an estimate for P(c_j | d); c_j also denotes the set of all documents in class c_j.
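The APPLY-KNN pseudocode maps directly onto Python. This sketch uses cosine similarity as the "nearest" criterion and our own function names:

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity of two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def apply_knn(train, k, d):
    """train: list of (vector, label) pairs.  Returns argmax_j p_j, where
    p_j = |S_k ∩ c_j| / k and S_k are the k nearest neighbors of d."""
    s_k = sorted(train, key=lambda ex: cosine(ex[0], d), reverse=True)[:k]
    votes = Counter(label for _, label in s_k)  # k * p_j for each class j
    return votes.most_common(1)[0][0]           # majority class
```

Note that all the work happens at test time: "training" is just storing (and preprocessing) the labeled vectors.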
kNN: Discussion
• No feature selection necessary
• Scales well with a large number of classes
  – Don't need to train n classifiers for n classes
• Classes can influence each other
  – Small changes to one class can have a ripple effect
• Can avoid training if preferred
• May be more expensive at test time
Linear Classifiers
Linear Classifiers
• Consider 2-class problems
  – Deciding between two classes, e.g., government and non-government
• One‐versus‐rest classification
• How do we define (and find) the separating surface?
• How do we decide which region a test doc is in?
Separation by Hyperplanes
• A strong high-bias assumption is linear separability:
  – in 2 dimensions, classes can be separated by a line
  – in higher dimensions, we need hyperplanes
• In 2 dimensions, the separator can be expressed as ax + by = c.
Which Hyperplane?
In general, there are lots of possible solutions for a, b, c.
Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find an optimal separating hyperplane [according to some criterion of expected goodness].
• Which points should influence optimality?
  – All points
    • Linear regression
    • Naïve Bayes
  – Only "difficult" points close to the decision boundary
    • Support vector machines
High‐Dimensional Linear Classifier
• For a general linear classifier, assign the document d with M features d = (d1, …, dM) to one class if:
Otherwise, assign to the other class.
$$\sum_{i=1}^{M} w_i d_i > \theta$$
Applying Linear Classifier
APPLYLINEARCLASSIFIER(w, θ, d)
1  score ← ∑_{i=1}^{M} w_i d_i
2  if score > θ
3  then return 1
4  else return 0
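In Python the routine is a dot product plus a threshold test; a direct transcription, with our own naming:

```python
def apply_linear_classifier(w, theta, d):
    # score = sum_i w_i * d_i; return class 1 iff the score exceeds the threshold
    score = sum(wi * di for wi, di in zip(w, d))
    return 1 if score > theta else 0
```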
Linear classifier: Example
• Class: “interest” (as in interest rate)
• Example features of a linear classifier (weight wi for term ti):

| wi | ti | wi | ti |
|---|---|---|---|
| 0.70 | prime | -0.71 | dlrs |
| 0.67 | rate | -0.35 | world |
| 0.63 | interest | -0.33 | sees |
| 0.60 | rates | -0.25 | year |
| 0.46 | discount | -0.24 | group |
| 0.43 | bundesbank | -0.24 | dlr |

• To classify, compute the dot product of the feature vector and the weights.
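Using the weights above, scoring a document is just a sparse dot product. A minimal sketch (the example document is made up):

```python
# weights for the class "interest", taken from the table above
W = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43,
     "dlrs": -0.71, "world": -0.35, "sees": -0.33,
     "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(tokens, weights):
    # dot product of the document's term-count vector with the weight vector;
    # terms without a weight contribute 0
    return sum(weights.get(t, 0.0) * tokens.count(t) for t in set(tokens))
```

A positive score pushes the document toward the "interest" class, a negative score away from it.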
Linear Classifiers
• Many common text classifiers are linear classifiers– Naïve Bayes– Perceptron– Rocchio– Logistic regression– Support vector machines (with linear kernel)– Linear regression
• Despite this similarity, there are noticeable performance differences.
Two‐class Rocchio as a linear classifier
• To show Rocchio is a linear classifier, consider that a vector is on the decision boundary if it has equal distance to the two class centroids:
$$\sum_{i=1}^{M} w_i d_i = \theta$$

with

$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2), \qquad \theta = 0.5 \left( |\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2 \right)$$
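Given the two centroids, the weight vector and threshold follow directly from the boundary condition; a small sketch (our own naming):

```python
def rocchio_as_linear(mu1, mu2):
    """w = mu(c1) - mu(c2), theta = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2).
    Points with sum_i w_i d_i > theta are closer to mu(c1)."""
    w = [a - b for a, b in zip(mu1, mu2)]
    theta = 0.5 * (sum(a * a for a in mu1) - sum(b * b for b in mu2))
    return w, theta
```

Expanding |d - mu1|^2 < |d - mu2|^2 and cancelling the |d|^2 terms gives exactly d·(mu1 - mu2) > theta, which is why Rocchio is a linear classifier.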
Linear Problem
• A linear problem: the underlying distributions of the two classes are separated by a line.
• This separating line is called the class boundary.
  – It is the “true” boundary, and we distinguish it from the decision boundary that the learning method computes.
Linear Problem
• A linear problem that classifies whether a Web page is a Chinese Web page (solid circles)
• The class boundary is represented by the short dashed line.
• There are three noise documents.
Linear Problem
• If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable.
• If linear separability holds, there are infinitely many linear separators.
  – We need a criterion for selecting among all decision hyperplanes.
Nonlinear Problem
• If a problem is nonlinear and its class boundaries cannot be approximated well with linear hyperplanes, then nonlinear classifiers are better.
• An example of a nonlinear classifier is kNN.
More Than Two Classes
• Any-of (or multivalue) classification
  – Classes are independent of each other.
  – A document can belong to 0, 1, or more than 1 class.
  – Quite common for documents.
Set of Binary Classifiers: Any‐of
• Build a separator between each class and its complementary set (docs from all other classes).
• Decompose into binary classification problems.
• Given test doc, evaluate it for membership in each class.
• Apply the decision criterion of each classifier independently.
• Though you could perhaps do better by considering dependencies between categories.
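The any-of scheme reduces to running one binary classifier per class and collecting every positive decision. A toy sketch in which made-up keyword spotters stand in for real trained classifiers:

```python
def any_of(binary_classifiers, doc):
    # apply each class's classifier independently; a document may receive
    # zero, one, or several labels
    return sorted(c for c, clf in binary_classifiers.items() if clf(doc))

# toy binary classifiers: keyword spotting stands in for real models
clfs = {"sports": lambda d: "goal" in d,
        "finance": lambda d: "rate" in d}
```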
More Than Two Classes
• One-of (or multinomial, or polytomous) classification
  – Classes are mutually exclusive.
  – Each document belongs to exactly one class.
  – E.g., digit recognition is polytomous classification: digits are mutually exclusive.
Set of Binary Classifiers: One‐of
• Build a separator between each class and its complementary set (docs from all other classes).
• Given test doc, evaluate it for membership in each class.
• Assign the document to the class with:
  – maximum score
  – maximum confidence
  – maximum probability
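For one-of classification we still build one scoring classifier per class, but keep only the argmax; a sketch with made-up scorers:

```python
def one_of(scorers, doc):
    # assign the document to the single class with the maximum score
    return max(scorers, key=lambda c: scorers[c](doc))

# toy scorers: in practice these would be per-class linear classifiers
scorers = {"gov": lambda d: d.count("tax"),
           "arts": lambda d: d.count("museum")}
```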