Text Categorization
Categorization
• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2,…, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
Learning for Categorization
• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
      ∀ <x, c(x)> ∈ D : h(x) = c(x)
  (Consistency)
Sample Category Learning Problem
• Instance language: <size, color, shape>
  – size ∈ {small, medium, large}
  – color ∈ {red, blue, green}
  – shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:
  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative
General Learning Issues
• Many hypotheses are usually consistent with the training data.
• Bias
  – Any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy (% of instances classified correctly).
  – Measured on independent test data.
• Training time (efficiency of the training algorithm).
• Testing time (efficiency of subsequent classification).
Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples is a consistent hypothesis that does not generalize.
• Occam’s razor:
  – Finding a simple hypothesis helps ensure generalization.
Text Categorization
• Assigning documents to a fixed set of categories.
• Applications:
  – Web pages
    • Recommending
    • Yahoo-like classification
Text Categorization
• Applications:
  – Newsgroup messages
    • Recommending
    • Spam filtering
  – News articles
    • Personalized newspaper
  – Email messages
    • Routing
    • Prioritizing
    • Folderizing
    • Spam filtering
Learning for Text Categorization
• Manual development of text categorization functions is difficult.
• Learning algorithms:
  – Bayesian (naïve)
  – Neural network
  – Relevance feedback (Rocchio)
  – Rule based (Ripper)
  – Nearest neighbor (case based)
  – Support Vector Machines (SVM)
Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector based on cosine similarity.
Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c1, c2,…, cn}
For i from 1 to n let pi = <0, 0,…,0> (init. prototype vectors)
For each training example <x, c(x)> ∈ D
    Let d be the frequency-normalized TF/IDF term vector for doc x
    Let i = j where cj = c(x)
      (sum all the document vectors in ci to get pi)
    Let pi = pi + d
Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = –2 (init. maximum cosSim)
For i from 1 to n: (compute similarity to each prototype vector)
    Let s = cosSim(d, pi)
    if s > m
        let m = s
        let r = ci (update most similar class so far)
Return class r
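A minimal Python sketch of the training and test procedures above, assuming documents have already been converted to sparse TF/IDF vectors stored as {term: weight} dicts; the function names and the toy vectors are illustrative, not from the slides:

```python
import math
from collections import defaultdict

def cos_sim(d1, d2):
    # Cosine similarity of two sparse vectors stored as {term: weight} dicts.
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rocchio_train(examples):
    # examples: list of (tfidf_vector, category) pairs.
    # Sum the vectors of the training documents in each category.
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, cat in examples:
        for term, weight in vec.items():
            prototypes[cat][term] += weight
    return prototypes

def rocchio_classify(prototypes, vec):
    # Assign the document to the category with the closest prototype.
    return max(prototypes, key=lambda cat: cos_sim(vec, prototypes[cat]))

# Toy usage with hand-made "TF/IDF" vectors.
train = [({"ball": 2.0, "goal": 1.5}, "sports"),
         ({"match": 1.0, "goal": 2.0}, "sports"),
         ({"election": 2.0, "vote": 1.5}, "politics")]
protos = rocchio_train(train)
print(rocchio_classify(protos, {"goal": 1.0, "vote": 0.2}))  # -> sports
```

The prototypes are plain sums of each class's document vectors; no length normalization is needed because cosine similarity is insensitive to vector length (see Rocchio Properties below).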
Illustration of Rocchio Text Categorization
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the examples in each class (a prototype).
• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.
Rocchio Time Complexity
• Note: the time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
• Training Time: O(|D|(Ld + |Vd|)) = O(|D| Ld) where Ld is the average length of a document in D and |Vd| is the average vocabulary size for a document in D.
• Test Time: O(Lt + |C||Vt|) where Lt is the average length of a test document and |Vt| is the average vocabulary size for a test document.
  – Assumes the lengths of the pi vectors are computed and stored during training, allowing cosSim(d, pi) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the training examples in D.
• Testing instance x:
  – Compute similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based
  – Memory-based
  – Lazy learning
K Nearest-Neighbor
• Using only the closest example to determine the category is subject to errors due to:
  – A single atypical example.
  – Noise (i.e., error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.
Similarity Metrics
• The nearest-neighbor method depends on a similarity (or distance) metric.
• The simplest for a continuous m-dimensional instance space is Euclidean distance.
• The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of TF-IDF weighted vectors is typically most effective.
3 Nearest Neighbor Illustration (Euclidean Distance)
K Nearest Neighbor for Text
Training:
For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, dx, for document x

Test instance y:
Compute TF-IDF vector d for document y
For each <x, c(x)> ∈ D
    Let sx = cosSim(d, dx)
Sort examples, x, in D by decreasing value of sx
Let N be the first k examples in D (get most similar neighbors)
Return the majority class of examples in N
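A corresponding Python sketch of the test procedure, again over sparse TF-IDF dicts; cos_sim, the toy vectors, and the category names are illustrative:

```python
import math
from collections import Counter

def cos_sim(d1, d2):
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(train, query_vec, k=3):
    # train: list of (tfidf_vector, category); query_vec: TF-IDF dict for the test doc.
    scored = sorted(train, key=lambda ex: cos_sim(query_vec, ex[0]), reverse=True)
    top_k = [cat for _, cat in scored[:k]]
    return Counter(top_k).most_common(1)[0][0]   # majority class of the k neighbors

train = [({"ball": 2.0, "goal": 1.5}, "sports"),
         ({"match": 1.0, "goal": 2.0}, "sports"),
         ({"election": 2.0, "vote": 1.5}, "politics"),
         ({"vote": 1.0, "poll": 2.0}, "politics")]
print(knn_classify(train, {"goal": 1.0, "match": 0.5}, k=3))  # -> sports
```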
Illustration of 3 Nearest Neighbor for Text
Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
3 Nearest Neighbor Comparison
• Nearest Neighbor tends to handle polymorphic categories better.
Nearest Neighbor Time Complexity
• Training Time: O(|D| Ld) to compose TF-IDF vectors.
• Testing Time: O(Lt + |D||Vt|) to compare to all training vectors.
  – Assumes the lengths of the dx vectors are computed and stored during training, allowing cosSim(d, dx) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).
• Testing time can be high for large training sets.
Nearest Neighbor with Inverted Index
• Determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
• Use standard VSR inverted index methods to find the k nearest neighbors.
• Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears.
• Therefore, overall classification is O(Lt + B|Vt|).
  – Typically B << |D|.
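A rough sketch of that idea in Python: only training documents that share at least one term with the test document are ever touched, so the work is proportional to B|Vt| rather than |D||Vt|. All names here are hypothetical, and for brevity the sketch accumulates raw dot products rather than full cosine similarities (dividing each score by the stored document lengths would complete it):

```python
from collections import defaultdict, Counter

def build_inverted_index(train):
    # train: list of (tfidf_vector, category); index maps term -> [(doc_id, weight)].
    index = defaultdict(list)
    for doc_id, (vec, _) in enumerate(train):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def knn_with_index(train, index, query_vec, k=3):
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():          # |Vt| terms in the test document
        for doc_id, d_weight in index.get(term, []):  # B postings per term on average
            scores[doc_id] += q_weight * d_weight
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return Counter(train[doc_id][1] for doc_id in top).most_common(1)[0][0]

train = [({"ball": 2.0, "goal": 1.5}, "sports"),
         ({"election": 2.0, "vote": 1.5}, "politics")]
index = build_inverted_index(train)
print(knn_with_index(train, index, {"goal": 1.0}))  # -> sports
```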
Bayesian Methods
• Learning and classification methods based on probability theory.
• Bayes theorem plays a critical role in probabilistic learning and classification.
• Uses prior probability of each category given no information about an item.
• Categorization produces a posterior probability distribution over the possible categories given a description of an item.
Axioms of Probability Theory
• All probabilities are between 0 and 1:
      0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0:
      P(true) = 1    P(false) = 0
• The probability of a disjunction is:
      P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by:
      P(A | B) = P(A ∧ B) / P(B)
Independence
• A and B are independent iff:
      P(A | B) = P(A)
      P(B | A) = P(B)
  (These two constraints are logically equivalent.)
• Therefore, if A and B are independent:
      P(A | B) = P(A ∧ B) / P(B) = P(A)
      P(A ∧ B) = P(A) P(B)
Bayes Theorem
      P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

      P(H | E) = P(H ∧ E) / P(E)        (def. cond. prob.)

      P(E | H) = P(H ∧ E) / P(H)        (def. cond. prob.)

      Therefore  P(H ∧ E) = P(E | H) P(H)

QED:  P(H | E) = P(E | H) P(H) / P(E)
Bayesian Categorization
• Let the set of categories be {c1, c2,…, cn}.
• Let E be a description of an instance.
• Determine the category of E by determining for each ci:
      P(ci | E) = P(ci) P(E | ci) / P(E)
• P(E) can be determined since the categories are complete and disjoint:
      Σi=1..n P(ci | E) = Σi=1..n P(ci) P(E | ci) / P(E) = 1
      P(E) = Σi=1..n P(ci) P(E | ci)
Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(ci)
  – Conditionals: P(E | ci)
• P(ci) are easily estimated from data.
  – If ni of the examples in D are in ci, then P(ci) = ni / |D|
• Assume an instance is a conjunction of binary features:
      E = e1 ∧ e2 ∧ … ∧ em
• Too many possible instances (exponential in m) to estimate all P(E | ci).
Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category ci (conditionally independent):
      P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Πj=1..m P(ej | ci)
• Therefore, we then only need to know P(ej | ci) for each feature and category.
Naïve Bayes Example
• C = {allergy, cold, well}
• e1 = sneeze; e2 = cough; e3 = fever
• E = {sneeze, cough, ¬fever}

  Prob            Well   Cold   Allergy
  P(ci)           0.9    0.05   0.05
  P(sneeze | ci)  0.1    0.9    0.9
  P(cough | ci)   0.1    0.8    0.7
  P(fever | ci)   0.01   0.7    0.4
Naïve Bayes Example (cont.)
P(well | E)    = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
P(cold | E)    = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)

Most probable category: allergy
P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | E) = 0.23
P(cold | E) = 0.26
P(allergy | E) = 0.50

  Probability     Well   Cold   Allergy
  P(ci)           0.9    0.05   0.05
  P(sneeze | ci)  0.1    0.9    0.9
  P(cough | ci)   0.1    0.8    0.7
  P(fever | ci)   0.01   0.7    0.4

E = {sneeze, cough, ¬fever}
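A few lines of Python that reproduce the numbers above directly from the table, assuming (as the arithmetic implies) that fever is absent from E:

```python
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(feature | class) from the table above
    "sneeze": {"well": 0.1, "cold": 0.9, "allergy": 0.9},
    "cough":  {"well": 0.1, "cold": 0.8, "allergy": 0.7},
    "fever":  {"well": 0.01, "cold": 0.7, "allergy": 0.4},
}
present = {"sneeze": True, "cough": True, "fever": False}  # E = {sneeze, cough, not fever}

unnorm = {}
for c, p in priors.items():
    score = p
    for feat, is_present in present.items():
        score *= cond[feat][c] if is_present else (1 - cond[feat][c])
    unnorm[c] = score

p_e = sum(unnorm.values())              # 0.0089 + 0.0108 + 0.0189 ≈ 0.039
posteriors = {c: s / p_e for c, s in unnorm.items()}
print(posteriors)   # allergy highest, ≈ 0.49 (the slide rounds intermediates to 0.23 / 0.26 / 0.50)
```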
Estimating Probabilities
• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains ni examples in category ci, and nij of these ni examples contain feature ej, then:
      P(ej | ci) = nij / ni
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature, ek, is always false in the training data, then ∀ci: P(ek | ci) = 0.
• If ek then occurs in a test example, E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.
Smoothing
• To account for estimation from small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m.
• For binary features, p is simply assumed to be 0.5.
      P(ej | ci) = (nij + m·p) / (ni + m)
Naïve Bayes for Text
• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2,…wm} based on the probabilities P(wj | ci).
• Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
  – Equivalent to a virtual sample of seeing each word in each category exactly once.
Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
      argmax_{ci ∈ C}  P(ci) Πi=1..n P(ai | ci)
where ai is the word occurring in the ith position in X
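A compact Python sketch of both algorithms, using the Laplace m-estimate above (p = 1/|V|, m = |V|). Whitespace tokenization and the skipping of out-of-vocabulary test words are implementation choices not specified on the slides:

```python
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (text, category) pairs.
    vocab = set()
    word_counts = defaultdict(Counter)   # nij: occurrences of word wj in category ci
    doc_counts = Counter()
    for text, cat in docs:
        words = text.lower().split()
        vocab.update(words)
        word_counts[cat].update(words)
        doc_counts[cat] += 1
    priors = {c: n / len(docs) for c, n in doc_counts.items()}   # P(ci) = |Di| / |D|
    return priors, word_counts, vocab

def classify_nb(priors, word_counts, vocab, text):
    words = [w for w in text.lower().split() if w in vocab]      # skip unknown words
    best, best_score = None, -1.0
    for c, prior in priors.items():
        n_i = sum(word_counts[c].values())        # total word occurrences in Ti
        score = prior
        for w in words:
            # Laplace / m-estimate with p = 1/|V| and m = |V|: (nij + 1) / (ni + |V|)
            score *= (word_counts[c][w] + 1) / (n_i + len(vocab))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("win the match goal", "sports"), ("election vote poll", "politics")]
priors, counts, vocab = train_nb(docs)
print(classify_nb(priors, counts, vocab, "goal and match tonight"))  # -> sports
```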
Naïve Bayes Time Complexity
• Training Time: O(|D|Ld + |C||V|) where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nij are pre-computed in O(|D|Ld) time during one pass through all of the data.
  – Generally just O(|D|Ld) since usually |C||V| < |D|Ld.
• Test Time: O(|C| Lt) where Lt is the average length of a test document.
• Very efficient overall, linearly proportional to the time needed to just read in all the data.
• Similar to Rocchio time complexity.
Underflow Prevention
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log probability score is still the most probable.
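A tiny demonstration of why this matters: multiplying a few hundred small conditional probabilities underflows to zero, while the summed logs remain perfectly usable for the argmax:

```python
import math

probs = [1e-5] * 400                     # many small conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                           # 0.0 -- floating-point underflow

log_score = sum(math.log(p) for p in probs)
print(log_score)                         # ≈ -4605.2, still fine for comparing classes
```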
Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  – Output probabilities are generally very close to 0 or 1.
Evaluating Categorization
• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
• Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.
• Results can vary based on sampling error due to different training and test sets.
• Average results over multiple training and test sets (splits of the overall data) for the best results.
N-Fold Cross-Validation
• Ideally, test and training sets are independent on each trial.
  – But this would require too much labeled data.
• Partition the data into N equal-sized disjoint segments.
• Run N trials, each time using a different segment of the data for testing and training on the remaining N−1 segments.
• This way, at least the test sets are independent.
• Report average classification accuracy over the N trials.
• Typically, N = 10.
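A sketch of the procedure in Python; train_and_test is any function that trains a classifier and returns its accuracy on held-out data (the majority-class baseline below is only a stand-in):

```python
import random
from collections import Counter

def n_fold_cross_validation(data, train_and_test, n=10, seed=0):
    # data: list of labeled instances; train_and_test(train, test) -> accuracy on test.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]           # N roughly equal disjoint segments
    accuracies = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / n                        # average accuracy over the N trials

# Toy usage with a majority-class "classifier" as the stand-in learner.
def majority_baseline(train, test):
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

data = [("doc %d" % i, "pos" if i % 3 else "neg") for i in range(30)]
print(n_fold_cross_validation(data, majority_baseline, n=10))
```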
Learning Curves
• In practice, labeled data is usually rare and expensive.
• Would like to know how performance varies with the number of training instances.
• Learning curves plot classification accuracy on independent test data (Y axis) versus number of training examples (X axis).
N-Fold Learning Curves
• Want learning curves averaged over multiple trials.
• Use N-fold cross validation to generate N full training and test sets.
• For each trial, train on increasing fractions of the training set, measuring accuracy on the test data for each point on the desired learning curve.
Sample Learning Curve (Yahoo Science Data)
Text Clustering
Clustering
• Partition unlabeled examples into disjoint subsets (clusters), such that:
  – Examples within a cluster are very similar
  – Examples in different clusters are very different
• Discover new categories in an unsupervised manner (no sample category labels provided).
Clustering Example
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
  animal
    vertebrate:   fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.
Direct Clustering Method
• Direct clustering methods require a specification of the number of clusters, k, desired.
• A clustering evaluation function assigns a real-value quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.
Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj
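A naive Python sketch of this loop; passing max as cluster_sim gives single link and min gives complete link (both defined on the following slides). For readability it recomputes all pairwise similarities every iteration rather than maintaining them incrementally, so it does not meet the O(n²) bound discussed later:

```python
def hac(instances, sim, cluster_sim):
    # instances: list of items; sim(x, y): pairwise similarity;
    # cluster_sim: combines pairwise sims (max for single link, min for complete link).
    clusters = [[x] for x in instances]              # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                                    # the merge history forms the hierarchy

points = [0.0, 0.1, 0.2, 5.0, 5.1]
sim = lambda a, b: -abs(a - b)                       # higher = more similar
print(hac(points, sim, max))                         # single link merges nearby points first
```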
Cluster Similarity
• Assume a similarity function that determines the similarity of two instances: sim(x, y).
  – Cosine similarity of document vectors.
• How to compute the similarity of two clusters, each possibly containing multiple instances?
  – Single Link: similarity of the two most similar members.
  – Complete Link: similarity of the two least similar members.
  – Group Average: average similarity between members.
Single Link Agglomerative Clustering
• Use maximum similarity of pairs:
      sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)
• Can result in “straggly” (long and thin) clusters due to a chaining effect.
  – Appropriate in some domains, such as clustering islands.
Single Link Example
Complete Link Agglomerative Clustering
• Use minimum similarity of pairs:
      sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)
• Makes “tighter,” spherical clusters that are typically preferable.
Complete Link Example
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
Computing Cluster Similarity
• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
  – Single Link:
        sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
  – Complete Link:
        sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
Group Average Agglomerative Clustering
• Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters.
      sim(ci, cj) = (1 / (|ci ∪ cj| (|ci ∪ cj| − 1))) Σ_{x ∈ ci ∪ cj} Σ_{y ∈ ci ∪ cj, y ≠ x} sim(x, y)
Computing Group Average Similarity
• Assume cosine similarity and normalized vectors with unit length.
• Always maintain sum of vectors in each cluster.
• Compute similarity of clusters in constant time:
      s(cj) = Σ_{x ∈ cj} x

      sim(ci, cj) = ((s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|)) / ((|ci| + |cj|) (|ci| + |cj| − 1))
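A small Python check of that identity, assuming unit-length vectors: the constant-time formula computed from the maintained vector sums matches the direct average over all ordered pairs (helper names are illustrative):

```python
import math

def norm(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def vec_sum(vectors):
    return [sum(xs) for xs in zip(*vectors)]

def group_avg_sim(s_i, n_i, s_j, n_j):
    # s_i, s_j: maintained vector sums of the two clusters; n_i, n_j: their sizes.
    s = [a + b for a, b in zip(s_i, s_j)]
    n = n_i + n_j
    return (sum(x * x for x in s) - n) / (n * (n - 1))

# Toy check against the direct definition (unit-length 2-d vectors).
ci = [norm([1.0, 0.1]), norm([1.0, 0.2])]
cj = [norm([0.9, 0.3])]
print(group_avg_sim(vec_sum(ci), len(ci), vec_sum(cj), len(cj)))

merged = ci + cj
direct = sum(sum(a * b for a, b in zip(x, y))
             for x in merged for y in merged if x is not y) / (len(merged) * (len(merged) - 1))
print(direct)   # same value as above
```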
Non-Hierarchical Clustering
• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed number of iterations.
K-Means
• Assumes instances are real-valued vectors.
• Clusters based on centroids, center of gravity, or mean of points in a cluster, c:
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
      μ(c) = (1/|c|) Σ_{x ∈ c} x
Distance Metrics
• Euclidean distance (L2 norm):
      L2(x, y) = sqrt( Σ_{i=1..m} (xi − yi)² )
• L1 norm:
      L1(x, y) = Σ_{i=1..m} |xi − yi|
• Cosine similarity (transform to a distance by subtracting from 1):
      1 − (x · y) / (|x| |y|)
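The three metrics as small Python functions over dense vectors (names illustrative):

```python
import math

def l2_distance(x, y):
    # Euclidean distance
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (nx * ny)

a, b = [1.0, 2.0, 0.0], [0.0, 2.0, 1.0]
print(l2_distance(a, b), l1_distance(a, b), cosine_distance(a, b))
```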
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2,…, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj
        sj = μ(cj)
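A short Python sketch of the loop above; the random seed and iteration cap are illustrative defaults, and empty clusters simply keep their previous seed (a detail the pseudocode leaves open):

```python
import random

def mean(points):
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(instances, k, distance, iterations=100, seed=0):
    seeds = random.Random(seed).sample(instances, k)           # k random instances as seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in instances:                                    # assign each point to nearest seed
            j = min(range(k), key=lambda i: distance(x, seeds[i]))
            clusters[j].append(x)
        new_seeds = [mean(c) if c else seeds[i] for i, c in enumerate(clusters)]
        if new_seeds == seeds:                                 # converged
            break
        seeds = new_seeds
    return seeds, clusters

l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
centroids, clusters = kmeans(points, k=2, distance=l2)
print(centroids)
```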
K-Means Example (K = 2)
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
Time Complexity
• Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: Each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.
Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic or the results of another method.
Text Clustering
• HAC and K-Means have been applied to text in a straightforward way.
• Typically use normalized, TF/IDF-weighted vectors and cosine similarity.
• Optimize computations for sparse vectors.
Text Clustering
• Applications:
  – During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
  – Clustering of retrieval results to present more organized results to the user.
  – Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo).