VECTOR SPACE CLASSIFICATION
INTRODUCTION
Each document is a vector, with one component for each term
A training set is a set of documents, each labelled with its class
In vector space classification, this training set corresponds to a labelled set of points
Documents in the same class form a contiguous region of space; documents from different classes don't overlap
We define surfaces to delineate the classes in the space
DOCUMENTS IN VECTOR SPACE
[Figure: documents plotted in vector space, clustered into Government, Science, and Arts regions]
TEST DOCUMENT OF WHAT CLASS?
[Figure: the same space with an unlabelled test document among the Government, Science, and Arts regions]
TEST DOCUMENT = GOVERNMENT
[Figure: the test document lies within the Government region]
ASIDE: 2D/3D GRAPHS CAN BE MISLEADING
VECTOR SPACE CLASSIFICATION METHODS
Rocchio classification
Divides the vector space into regions centered on centroids (centers of mass)
Simple and efficient
Classes should be approximately spherical with similar radii
kNN classification
No explicit training is required
Less efficient at classification time
Can handle non-spherical and otherwise complex classes better than Rocchio
Two-class classifiers
One-of task – a document is assigned to exactly one of several mutually exclusive classes
Any-of task – a document can be assigned to any number of classes
USING ROCCHIO FOR TEXT CLASSIFICATION
The relevance feedback method is adapted; relevance feedback is itself a two-class classification: relevant or non-relevant
Use standard tf-idf weighted vectors to represent text documents
For each category, compute a prototype vector by summing the vectors of all training documents in that category; the prototype is the centroid of the members of the class
Assign test documents to the category with the closest prototype vector, based on cosine similarity
ILLUSTRATION OF ROCCHIO TEXT CATEGORIZATION
DEFINITION OF CENTROID

\mu(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

where D_c is the set of all documents that belong to class c and \vec{v}(d) is the vector space representation of d
Properties:
Forms a simple generalization of the examples in each class
Classification is based on similarity to the class prototype
Does not guarantee that classifications are consistent with the given training data
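The centroid formula and the cosine-based assignment rule above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: documents are represented as sparse dicts of tf-idf weights, and all names and training data are made up.

```python
# Minimal Rocchio classifier sketch (illustrative names and data, not from the slides).
import math
from collections import defaultdict

def centroid(docs):
    """mu(c) = (1/|Dc|) * sum of v(d) over d in Dc, for sparse term->weight dicts."""
    mu = defaultdict(float)
    for d in docs:
        for term, w in d.items():
            mu[term] += w
    return {term: w / len(docs) for term, w in mu.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(prototypes, doc):
    # Assign the document to the class whose centroid is most cosine-similar.
    return max(prototypes, key=lambda c: cosine(prototypes[c], doc))

# Toy training set: two classes, documents as tf-idf weighted term dicts.
training = {
    "Government": [{"tax": 2.0, "law": 1.0}, {"tax": 1.0, "senate": 1.5}],
    "Science":    [{"atom": 2.0, "lab": 1.0}, {"atom": 1.0, "cell": 1.5}],
}
prototypes = {c: centroid(docs) for c, docs in training.items()}
print(rocchio_classify(prototypes, {"tax": 1.0, "law": 0.5}))  # Government
```

Note that training reduces to one centroid computation per class, which is why Rocchio is cheap; at test time only one similarity per class is computed.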
ROCCHIO ANOMALY Prototype models face problems with polymorphic categories
ROCCHIO CLASSIFICATION
Forms a simple representation for each class – the centroid/prototype
Classification is based on the distance from the centroid
Cheap to train and to classify test documents
Rarely preferred outside text classification, but used quite effectively for text classification
Generally worse than Naive Bayes
K NEAREST NEIGHBOR CLASSIFICATION
To classify document d into class c:
Define the k-neighborhood N as the k nearest neighbors of d
Count the number i of documents in N that belong to c
Estimate P(c|d) as i/k
Choose as class argmax_c P(c|d) [= the majority class]
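The steps above can be sketched directly: find the k most similar stored examples, count class votes, and take the majority. This is a toy illustration with made-up data; the cosine similarity over sparse tf-idf dicts is an assumption consistent with the earlier slides.

```python
# Minimal kNN classifier sketch (illustrative data, not from the slides).
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, d, k=3):
    # train is a list of (vector, class) pairs: "training" is just storing them.
    neighbours = sorted(train, key=lambda ex: cosine(ex[0], d), reverse=True)[:k]
    votes = Counter(c for _, c in neighbours)  # i = number of neighbours in class c
    # P(c|d) is estimated as i/k; the argmax over c is the majority class.
    return votes.most_common(1)[0][0]

train = [
    ({"tax": 1.0}, "Government"), ({"law": 1.0, "tax": 0.5}, "Government"),
    ({"atom": 1.0}, "Science"),   ({"cell": 1.0, "atom": 0.5}, "Science"),
]
print(knn_classify(train, {"tax": 1.0, "law": 1.0}, k=3))  # Government
```

The full sort is O(n log n) per query; real systems use inverted indexes or approximate nearest-neighbor structures, which is why kNN is described as less efficient than Rocchio at test time.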
NEAREST-NEIGHBOR LEARNING ALGORITHM
Learning is just storing the representations of the training examples in D
Testing instance x (1NN):
Compute the similarity between x and all examples in D
Assign x the category of the most similar example in D
Does not compute an explicit category representation
Also called: case-based learning, memory-based learning, lazy learning
Rationale of kNN: the contiguity hypothesis (documents in the same class form a contiguous region, and regions of different classes do not overlap)
K NEAREST NEIGHBOR
Using only the closest example (1NN) to determine the class is subject to errors due to:
A single atypical example
Noise in the category label of a single training example
SEPARATION BY HYPERPLANES
A strong high-bias assumption is linear separability:
In 2 dimensions, classes can be separated by a line
In higher dimensions, a hyperplane is needed
The separating hyperplane can be found by linear programming (or by a perceptron)
In 2D it can be expressed as ax + by = c
LINEAR PROGRAMMING / PERCEPTRON
Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points
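The perceptron route to finding such a, b, c can be sketched as follows. The point sets and learning-rate/epoch settings are made up for illustration; the update rule is the standard perceptron rule, which is guaranteed to converge when the data are linearly separable.

```python
# Toy perceptron sketch: find a, b, c with ax + by > c for red points
# and ax + by < c for blue points (points below are illustrative).
def perceptron(red, blue, epochs=100, lr=0.1):
    a = b = c = 0.0
    for _ in range(epochs):
        converged = True
        for (x, y), target in [(p, 1) for p in red] + [(p, -1) for p in blue]:
            if target * (a * x + b * y - c) <= 0:  # misclassified (or on the line)
                a += lr * target * x               # nudge the normal vector
                b += lr * target * y
                c -= lr * target                   # nudge the offset
                converged = False
        if converged:                              # a full pass with no mistakes
            break
    return a, b, c

red = [(2.0, 2.0), (3.0, 1.0)]      # want ax + by > c
blue = [(-1.0, -2.0), (-2.0, 0.0)]  # want ax + by < c
a, b, c = perceptron(red, blue)
assert all(a * x + b * y > c for x, y in red)
assert all(a * x + b * y < c for x, y in blue)
```

The perceptron stops at the first hyperplane that makes no mistakes on the training data; as the next slide notes, this need not be the optimal separator.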
WHICH HYPERPLANE?
In general, there are lots of possible solutions for a, b, c
Some methods find a separating hyperplane, but not the optimal one, e.g. the perceptron
Which points should influence optimality?
All points: linear/logistic regression, Naive Bayes
Only "difficult" points close to the decision boundary: support vector machines
LINEAR CLASSIFIERS
Many common text classifiers are linear classifiers:
Naive Bayes, perceptron, Rocchio, logistic regression, support vector machines
Despite this similarity, there are noticeable performance differences
For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose? What do you do for non-separable problems?
Different training methods pick different hyperplanes
ROCCHIO IS A LINEAR CLASSIFIER
A NON-LINEAR PROBLEM
A linear classifier like Naive Bayes does badly on this task
kNN does well (assuming enough training data)
HIGH-DIMENSIONAL DATA
Pictures like the one on the right are misleading
Documents are zero along almost all axes
Most document pairs are very far apart (i.e. not strictly orthogonal, but sharing only very few common words)
In classification terms: document sets are often separable
This is part of why linear classifiers are quite successful in this domain
MORE THAN TWO CLASSES
Any-of or multivalue classification:
Classes are independent of each other
A document can belong to any number of classes
Decomposes into n binary problems
Quite common for documents
One-of or multinomial or polytomous classification:
Classes are mutually exclusive
Each document belongs to exactly one class
SET OF BINARY CLASSIFIERS
Any-of:
Build a separator between each class and its complementary set
Given a test document, evaluate it for membership in each class
Apply the decision criterion of each classifier independently
One-of:
Build a separator between each class and its complementary set
Given a test document, evaluate it for membership in each class
Assign the document to the class with the
Maximum score
Maximum confidence
Maximum probability
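Given per-class scores from n binary classifiers, the two decision rules differ only in the final step. A minimal sketch, assuming the binary classifiers output real-valued scores with 0 as the decision threshold (the scores below are made up):

```python
# Sketch of any-of vs one-of decisions over per-class scores (illustrative values).
def any_of(scores, threshold=0.0):
    # Apply each binary classifier's decision independently: keep every class
    # whose score clears the threshold; a document may get 0, 1, or many labels.
    return [c for c, s in scores.items() if s > threshold]

def one_of(scores):
    # Classes are mutually exclusive: pick the single class with maximum score.
    return max(scores, key=scores.get)

scores = {"Government": 0.7, "Science": 0.2, "Arts": -0.4}
print(any_of(scores))  # ['Government', 'Science']
print(one_of(scores))  # Government
```

For one-of classification the scores of the n separators must be comparable (e.g. calibrated probabilities), whereas for any-of each threshold can be tuned per class independently.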