INFO 4300 / CS4300: Information Retrieval
Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 21/25: Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
15 Nov 2011
Administrativa
Assignment 4 due 2 Dec (extended until 4 Dec).
Overview
1 Clustering: Introduction
2 Clustering in IR
3 K-means
4 How many clusters?
5 Evaluation
6 Labeling clusters
7 Feature selection
Outline: 1 Clustering: Introduction
Classification vs. Clustering
Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: the number of clusters, the similarity measure, the representation of documents, . . .
What is clustering?
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.
Data set with clear cluster structure
[figure: 2-D scatter plot of points forming three well-separated clusters]
Outline: 2 Clustering in IR
The cluster hypothesis
Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.
Applications of clustering in IR
What is clustered, the benefit, and an example, for each application:

Search result clustering: clusters search results; benefit: more effective information presentation to the user. Example: next slide.
Scatter-Gather: clusters (subsets of) the collection; benefit: an alternative user interface, "search without typing". Example: two slides ahead.
Collection clustering: clusters the collection; benefit: effective information presentation for exploratory browsing. Example: McKeown et al. 2002, news.google.com.
Cluster-based retrieval: clusters the collection; benefit: higher efficiency, faster search. Example: Salton 1971.
Global clustering for navigation: Google News
http://news.google.com
Clustering for improving recall
To improve search recall:
Cluster docs in collection a priori.
When a query matches a doc d, also return other docs in the cluster containing d.
Hope: if we do this, the query "car" will also return docs containing "automobile",
because clustering groups together docs containing "car" with those containing "automobile".
Both types of documents contain words like "parts", "dealer", "mercedes", "road trip".
Data set with clear cluster structure
[figure: the same 2-D scatter plot of points forming three well-separated clusters]
Exercise: Come up with an algorithm for finding the three clusters in this case.
Document representations in clustering
Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .
. . . which is almost equivalent to cosine similarity.
Almost: centroids are not length-normalized.
For centroids, distance and cosine give different results.
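A quick numeric illustration of that last point (toy two-dimensional vectors, made up for this sketch): when a unit-length document is compared against centroids of different lengths, the centroid that is closest by Euclidean distance need not be the one with the highest cosine similarity.

```python
import math

def euclid(u, v):
    # Euclidean distance between two vectors.
    return math.dist(u, v)

def cos(u, v):
    # Cosine similarity: dot product divided by the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

x = (1.0, 0.0)    # a length-normalized document vector
mu1 = (0.1, 0.0)  # short centroid pointing exactly along x
mu2 = (0.9, 0.3)  # longer centroid at an angle to x

# Cosine prefers mu1 (same direction); Euclidean prefers mu2 (closer point).
assert cos(x, mu1) > cos(x, mu2)
assert euclid(x, mu2) < euclid(x, mu1)
```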
Issues in clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
But how do we formalize this?
How many clusters?
Initially, we will assume the number of clusters K is given.
Often: secondary goals in clustering.
Example: avoid very small and very large clusters.
Flat vs. hierarchical clustering
Hard vs. soft clustering
Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one cluster.
More common and easier to do
Soft clustering: A document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies.
You may want to put a pair of sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.
For soft clustering, see the course text, sections 16.5 and 18.
Today: flat, hard clustering. Next time: hierarchical, hard clustering.
Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K.
Find: a partition into K clusters that optimizes the chosen partitioning criterion.
Global optimization: exhaustively enumerate all partitions, pick the optimal one.
Not tractable.
Effective heuristic method: the K-means algorithm.
Outline: 3 K-means
K-means
Perhaps the best-known clustering algorithm.
Simple; works well in many cases.
Used as the default / baseline algorithm for clustering documents.
K-means
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Recall the definition of the centroid:

~µ(ω) = (1/|ω|) ∑_{~x∈ω} ~x

where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
K-means algorithm

K-means({~x1, . . . , ~xN}, K)
 1  (~s1, ~s2, . . . , ~sK) ← SelectRandomSeeds({~x1, . . . , ~xN}, K)
 2  for k ← 1 to K
 3      do ~µk ← ~sk
 4  while stopping criterion has not been met
 5      do for k ← 1 to K
 6             do ωk ← {}
 7         for n ← 1 to N
 8             do j ← arg min_j′ |~µj′ − ~xn|
 9                ωj ← ωj ∪ {~xn}   (reassignment of vectors)
10         for k ← 1 to K
11             do ~µk ← (1/|ωk|) ∑_{~x∈ωk} ~x   (recomputation of centroids)
12  return {~µ1, . . . , ~µK}
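The pseudocode above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the initial seeds are passed in explicitly (standing in for SelectRandomSeeds) so the run is deterministic, and the stopping criterion is "centroids unchanged".

```python
def kmeans(points, k, seeds, iters=100):
    # seeds play the role of SelectRandomSeeds: the initial centroids.
    centroids = list(seeds)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Reassignment: move each vector to its closest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            clusters[j].append(x)
        # Recomputation: each centroid becomes the mean of its assigned vectors.
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:  # stopping criterion: centroids stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups; seed one centroid in each.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cents, cls = kmeans(pts, 2, seeds=[(0.0, 0.0), (5.0, 5.0)])
```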
Set of points to be clustered
[figure: 2-D scatter of the example points]

Random selection of initial cluster centers (k = 2 means)
[figure: two initial seeds marked ×]
Centroids after convergence?

Assign points to closest centroid / Assignment / Recompute cluster centroids
[figures: several iterations of the two K-means steps; in each round, every point is labeled 1 or 2 according to its closest centroid ×, and the two centroids are then recomputed as the means of their assigned points]

Centroids and assignments after convergence
[figure: the final 1/2 assignment of the points and the converged centroids ××]

Set of points clustered
[figure: the same points, now clustered]

Set of points to be clustered
[figure: the original unclustered points, repeated for comparison]
K-means is guaranteed to converge
Proof:
The sum of squared distances (RSS) decreases during reassignment, because each vector is moved to a closer centroid.
(RSS = the sum of all squared distances between document vectors and their closest centroids.)
RSS decreases during recomputation (see next slide).
There is only a finite number of clusterings.
Thus: we must reach a fixed point (assuming ties are broken consistently).
Recomputation decreases average distance

RSS = ∑_{k=1}^{K} RSS_k is the residual sum of squares (the "goodness" measure), where

RSS_k(~v) = ∑_{~x∈ω_k} ‖~v − ~x‖² = ∑_{~x∈ω_k} ∑_{m=1}^{M} (v_m − x_m)²

Setting the derivative to zero:

∂RSS_k(~v)/∂v_m = ∑_{~x∈ω_k} 2(v_m − x_m) = 0

v_m = (1/|ω_k|) ∑_{~x∈ω_k} x_m

The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
K-means is guaranteed to converge
But we don't know how long convergence will take!
If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
However, complete convergence can take many more iterations.
Optimality of K-means
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
Exercise: Suboptimal clustering
[figure: six points, d1 d2 d3 in one row and d4 d5 d6 directly below them]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds d_i1, d_i2?
Exercise: Suboptimal clustering
[figure: the same six points d1 . . . d6]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds d_i1, d_i2?
For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}} (a suboptimal clustering).
For seeds d2 and d3, it instead converges to {{d1, d2, d4, d5}, {d3, d6}} (the global optimum for K = 2).
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: it's easy to get a suboptimal clustering.
Better heuristics:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has "good coverage" of the document space)
Use hierarchical clustering to find good seeds (next class)
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS
Time complexity of K-means
Computing the distance between two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances).
Recomputation step: O(NM) (we need to add each document's < M non-zero values to one of the centroids).
Assume the number of iterations is bounded by I.
Overall complexity: O(IKNM), linear in all important dimensions.
However: this is not a real worst-case analysis.
In pathological cases, the number of iterations can be much higher than linear in the number of documents.
k-means clustering, redux
Goal: cluster similar data points
Approach: given data points and a distance function,
select k centroids ~µ_a
assign each ~x_i to its closest centroid ~µ_a
minimize ∑_{a,i} d(~x_i, ~µ_a)
Algorithm:
1. randomly pick centroids, possibly from the data points
2. assign each point to its closest centroid
3. average the assigned points to obtain new centroids
4. repeat 2, 3 until nothing changes
Issues:
- takes superpolynomial time on some inputs
- not guaranteed to find the optimal solution
+ converges quickly in practice
Outline: 4 How many clusters?
How many clusters?
Either: the number of clusters K is given.
Then partition into K clusters.
K might be given because there is some external constraint. Example: in the case of Scatter-Gather, it was hard to show more than 10-20 clusters on a monitor in the 90s.
Or: finding the "right" number of clusters is part of the problem.
Given docs, find the K for which an optimum is reached.
How to define "optimum"?
We can't use RSS or the average squared distance from the centroid as the criterion: that always chooses K = N clusters.
Exercise
Suppose we want to analyze the set of all articles published by a major newspaper (e.g., the New York Times or the Süddeutsche Zeitung) in 2008.
Goal: write a two-page report about what the major news stories in 2008 were.
We want to use K-means clustering to find the major news stories.
How would you determine K?
Simple objective function for K (1)
Basic idea:
Start with 1 cluster (K = 1).
Keep adding clusters (= keep increasing K).
Add a penalty for each new cluster.
Trade off cluster penalties against the average squared distance from the centroid.
Choose the K with the best tradeoff.
Simple objective function for K (2)
Given a clustering, define the cost for a document as the (squared) distance to its centroid.
Define the total distortion RSS(K) as the sum of all individual document costs (corresponds to average distance).
Then: penalize each cluster with a cost λ.
Thus, for a clustering with K clusters, the total cluster penalty is Kλ.
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ.
Select the K that minimizes RSS(K) + Kλ.
Still need to determine a good value for λ . . .
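As a sketch of the selection rule (the RSS values and λ below are made up for illustration), picking K is just a minimization over the candidates:

```python
def best_k(rss_by_k, lam):
    # Total cost of a clustering with K clusters: RSS(K) + K * lambda.
    return min(rss_by_k, key=lambda K: rss_by_k[K] + lam * K)

# Hypothetical distortion values: RSS drops steeply, then flattens.
rss = {1: 100.0, 2: 40.0, 3: 25.0, 4: 22.0, 5: 21.0}
```

With λ = 5, `best_k(rss, 5.0)` picks K = 3; with λ = 0 the rule degenerates to minimizing RSS alone, which always favors the largest K, matching the earlier remark.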
Finding the "knee" in the curve
[figure: residual sum of squares (y-axis, roughly 1750-1950) plotted against the number of clusters (x-axis, 2-10)]
Pick the number of clusters where the curve "flattens". Here: 4 or 9.
Outline: 5 Evaluation
What is a good clustering?
Internal criteria
Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in the application.
Alternative: external criteria
Evaluate with respect to a human-defined classification
External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification.
Goal: the clustering should reproduce the classes in the gold standard.
(But we only want to reproduce how documents are divided into groups, not the class labels.)
A first measure of how well we were able to reproduce the classes: purity.
External criterion: Purity

purity(Ω, C) = (1/N) ∑_k max_j |ω_k ∩ c_j|

Ω = {ω_1, ω_2, . . . , ω_K} is the set of clusters and C = {c_1, c_2, . . . , c_J} is the set of classes.
For each cluster ω_k: find the class c_j with the most members n_kj in ω_k.
Sum all n_kj and divide by the total number of points.
Example for computing purity
[figure: 17 points in three clusters; cluster 1 has 5 x's and 1 o, cluster 2 has 4 o's, 1 x, and 1 ⋄, cluster 3 has 3 ⋄'s and 2 x's]
To compute purity:
5 = max_j |ω1 ∩ c_j| (class x, cluster 1);
4 = max_j |ω2 ∩ c_j| (class o, cluster 2); and
3 = max_j |ω3 ∩ c_j| (class ⋄, cluster 3).
Purity is (1/17) × (5 + 4 + 3) = 12/17 ≈ 0.71.
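The same computation in code (clusters represented as lists of gold-standard class labels, with "d" standing in for ⋄):

```python
from collections import Counter

def purity(clusters):
    # clusters: one list of gold-class labels per cluster.
    n = sum(len(c) for c in clusters)
    # For each cluster, count the members of its majority class, then sum.
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

# The o/d/x example: majority counts are 5, 4, 3 out of 17 points.
example = [["x"] * 5 + ["o"],
           ["o"] * 4 + ["x", "d"],
           ["d"] * 3 + ["x"] * 2]
```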
Rand index

Definition: RI = (TP + TN) / (TP + FP + FN + TN)

Based on a 2x2 contingency table of all pairs of documents:

                    same cluster           different clusters
same class          true positives (TP)    false negatives (FN)
different classes   false positives (FP)   true negatives (TN)

TP + FN + FP + TN is the total number of pairs.
There are (N choose 2) pairs for N documents.
Example: (17 choose 2) = 136 in the o/⋄/x example.
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) . . .
. . . and either "true" (correct) or "false" (incorrect): the clustering decision is correct or incorrect.
As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of "positives", or pairs of documents that are in the same cluster, is:

TP + FP = (6 choose 2) + (6 choose 2) + (5 choose 2) = 40

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

TP = (5 choose 2) + (4 choose 2) + (3 choose 2) + (2 choose 2) = 20

Thus, FP = 40 − 20 = 20. FN and TN are computed similarly.

(TN = 5(4 + 1 + 3) + 1(1 + 1 + 2 + 3) + 1 · 3 + 4(2 + 3) + 1 · 2 = 40 + 7 + 3 + 20 + 2 = 72)
Rand measure for the o/⋄/x example

                    same cluster   different clusters
same class          TP = 20        FN = 24
different classes   FP = 20        TN = 72

RI is then (20 + 72)/(20 + 20 + 24 + 72) = 92/136 ≈ 0.68.
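The pair counting can be verified with a short script over the same example (again with "d" standing in for ⋄):

```python
from itertools import combinations

def pair_counts(clusters):
    # Enumerate all document pairs, recording cluster id and gold class.
    items = [(k, label) for k, c in enumerate(clusters) for label in c]
    tp = fp = fn = tn = 0
    for (k1, l1), (k2, l2) in combinations(items, 2):
        if k1 == k2 and l1 == l2:
            tp += 1      # same cluster, same class
        elif k1 == k2:
            fp += 1      # same cluster, different classes
        elif l1 == l2:
            fn += 1      # different clusters, same class
        else:
            tn += 1      # different clusters, different classes
    return tp, fp, fn, tn

example = [["x"] * 5 + ["o"],
           ["o"] * 4 + ["x", "d"],
           ["d"] * 3 + ["x"] * 2]
tp, fp, fn, tn = pair_counts(example)
rand_index = (tp + tn) / (tp + fp + fn + tn)  # 92/136, about 0.68
```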
Two other external evaluation measures
Normalized mutual information (NMI)
How much information does the clustering contain about the classification?
Singleton clusters (number of clusters = number of docs) have maximum MI.
Therefore: normalize by the entropy of clusters and classes.
F measure
Like Rand, but "precision" and "recall" can be weighted.
Evaluation results for the o/⋄/x example

                     purity   NMI    RI     F5
lower bound          0.0      0.0    0.0    0.0
maximum              1.0      1.0    1.0    1.0
value for example    0.71     0.36   0.68   0.46

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).
Outline: 6 Labeling clusters
Major issue in clustering: labeling
After a clustering algorithm finds a set of clusters: how can they be useful to the end user?
We need a pithy label for each cluster.
For example, in search result clustering for "jaguar", the labels of the three clusters could be "animal", "car", and "operating system".
Topic of this section: How can we automatically find good labels for clusters?
Exercise
Come up with an algorithm for labeling clusters.
Input: a set of documents, partitioned into K clusters (flat clustering)
Output: a label for each cluster
Part of the exercise: What types of labels should we consider? Words?
Discriminative labeling
To label cluster ω, compare ω with all other clusters.
Find terms or phrases that distinguish ω from the other clusters.
We can use any of the feature selection criteria used in text classification to identify discriminating terms: (i) mutual information, (ii) χ², (iii) frequency (but the latter is actually not discriminative).
Non-discriminative labeling
Select terms or phrases based solely on information from the cluster itself.
Terms with high weights in the centroid (if we are using a vector space model).
Non-discriminative methods sometimes select frequent terms that do not distinguish clusters.
For example, Monday, Tuesday, . . . in newspaper text.
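A minimal sketch of the centroid-based variant (the vocabulary and term weights are invented for illustration): rank the centroid's dimensions by weight. The toy example also shows the failure mode just mentioned, where a frequent but non-discriminative term ends up as the top label.

```python
def centroid_labels(vectors, vocab, k=3):
    # Centroid = componentwise mean of the cluster's document vectors.
    centroid = [sum(v[d] for v in vectors) / len(vectors)
                for d in range(len(vocab))]
    # Label the cluster with its k highest-weighted dimensions.
    top = sorted(range(len(vocab)), key=lambda d: centroid[d], reverse=True)[:k]
    return [vocab[d] for d in top]

# Toy term weights; "monday" is frequent everywhere but says little.
vocab = ["oil", "plant", "monday"]
docs = [[3.0, 2.0, 5.0], [4.0, 1.0, 5.0]]
```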
Using titles for labeling clusters
Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about.
Alternative: titles.
For example, the titles of two or three documents that are closest to the centroid.
Titles are easier to scan than a list of phrases.
Cluster labeling: Example

Three labeling methods for each cluster: the most prominent terms in the centroid, differential labeling using MI, and the title of the doc closest to the centroid.

cluster 4 (622 docs)
centroid: oil plant mexico production crude power 000 refinery gas bpd
mutual information: plant oil production barrels crude bpd mexico dolly capacity petroleum
title: MEXICO: Hurricane Dolly heads for Mexico coast

cluster 9 (1017 docs)
centroid: police security russian people military peace killed told grozny court
mutual information: police killed military security peace told troops forces rebels people
title: RUSSIA: Russia's Lebed meets rebel chief in Chechnya

cluster 10 (1259 docs)
centroid: 00 000 tonnes traders futures wheat prices cents september tonne
mutual information: delivery traders futures tonne tonnes desk wheat prices 000 00
title: USA: Export Business - Grain/oilseeds complex

All three methods do a pretty good job.
Outline: 7 Feature selection
Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
In this lecture: axis = dimension = word = term = feature
Many dimensions correspond to rare words.
Rare words can mislead the classifier.
Rare misleading features are called noise features.
Eliminating noise features from the representation increases the efficiency and effectiveness of text classification.
Eliminating features is called feature selection.
Example for a noise feature
Let's say we're doing text classification for the class China.
Suppose a rare term, say arachnocentric, has no information about China . . .
. . . but all instances of arachnocentric happen to occur in China documents in our training set.
Then we may learn a classifier that incorrectly interprets arachnocentric as evidence for the class China.
Such an incorrect generalization from an accidental property of the training set is called overfitting.
Feature selection reduces overfitting and improves the accuracy of the classifier.
Basic feature selection algorithm

SelectFeatures(D, c, k)
1  V ← ExtractVocabulary(D)
2  L ← []
3  for each t ∈ V
4      do A(t, c) ← ComputeFeatureUtility(D, t, c)
5         Append(L, 〈A(t, c), t〉)
6  return FeaturesWithLargestValues(L, k)

How do we compute A, the feature utility?
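The same skeleton in Python (a sketch, not the book's code; the utility function here is simple within-class document frequency, one of the measures discussed next):

```python
def select_features(docs, c, k, utility):
    # docs: list of (terms, class_label) pairs; utility computes A(t, c).
    vocab = set().union(*(terms for terms, _ in docs))
    # Score every vocabulary term, then keep the k with the largest values.
    scored = sorted(((utility(docs, t, c), t) for t in vocab), reverse=True)
    return [t for _, t in scored[:k]]

def class_frequency(docs, t, c):
    # A simple utility: number of class-c documents containing t.
    return sum(1 for terms, label in docs if label == c and t in terms)

# Hypothetical toy collection for illustration.
docs = [({"a", "b"}, "china"), ({"a"}, "china"), ({"b", "d"}, "uk")]
```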
Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs.
Feature utility measures:
Frequency: select the most frequent terms.
Mutual information: select the terms with the highest mutual information (mutual information is also called information gain in this context).
χ² (Chi-square)
Information

H[p] = ∑_{i=1}^{n} −p_i log2 p_i measures information uncertainty (p. 91 in the book).
It has maximum H = log2 n when all p_i = 1/n.
Consider two probability distributions: p(x) for x ∈ X and p(y) for y ∈ Y.
MI: I[X; Y] = H[p(x)] + H[p(y)] − H[p(x, y)] measures how much information p(x) gives about p(y) (and vice versa).
MI is zero iff p(x, y) = p(x)p(y) for all x ∈ X and y ∈ Y, i.e., x and y are independent.
MI can be as large as H[p(x)] or H[p(y)].

I[X; Y] = ∑_{x∈X, y∈Y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
Mutual information

Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.
MI tells us "how much information" the term contains about the class and vice versa.
For example, if a term's occurrence is independent of the class (the same proportion of docs within/without the class contain the term), then MI is 0.
Definition:

I(U; C) = ∑_{e_t∈{1,0}} ∑_{e_c∈{1,0}} P(U = e_t, C = e_c) log2 [ P(U = e_t, C = e_c) / (P(U = e_t) P(C = e_c)) ]

        = p(t, c) log2 [ p(t, c) / (p(t) p(c)) ] + p(t, c̄) log2 [ p(t, c̄) / (p(t) p(c̄)) ]
        + p(t̄, c) log2 [ p(t̄, c) / (p(t̄) p(c)) ] + p(t̄, c̄) log2 [ p(t̄, c̄) / (p(t̄) p(c̄)) ]

(Here t̄ and c̄ denote "term absent" and "not in the class".)
How to compute MI values

Based on maximum likelihood estimates, the formula we actually use is:

I(U; C) = (N11/N) log2 [N N11 / (N1. N.1)] + (N10/N) log2 [N N10 / (N1. N.0)]
        + (N01/N) log2 [N N01 / (N0. N.1)] + (N00/N) log2 [N N00 / (N0. N.0)]    (1)

N11: # of documents that contain t (e_t = 1) and are in c (e_c = 1)
N10: # of documents that contain t (e_t = 1) and are not in c (e_c = 0)
N01: # of documents that don't contain t (e_t = 0) and are in c (e_c = 1)
N00: # of documents that don't contain t (e_t = 0) and are not in c (e_c = 0)
N = N00 + N01 + N10 + N11
p(t, c) ≈ N11/N,  p(t̄, c) ≈ N01/N,  p(t, c̄) ≈ N10/N,  p(t̄, c̄) ≈ N00/N
N1. = N10 + N11: # of documents that contain t; p(t) ≈ N1./N
N.1 = N01 + N11: # of documents in c; p(c) ≈ N.1/N
N0. = N00 + N01: # of documents that don't contain t; p(t̄) ≈ N0./N
N.0 = N00 + N10: # of documents not in c; p(c̄) ≈ N.0/N
MI example for poultry/export in Reuters

                      e_c = e_poultry = 1    e_c = e_poultry = 0
e_t = e_export = 1    N11 = 49               N10 = 141
e_t = e_export = 0    N01 = 27,652           N00 = 774,106

Plug these values into the formula:

I(U; C) = (49/801,948) log2 [801,948 · 49 / ((49+27,652)(49+141))]
        + (141/801,948) log2 [801,948 · 141 / ((141+774,106)(49+141))]
        + (27,652/801,948) log2 [801,948 · 27,652 / ((49+27,652)(27,652+774,106))]
        + (774,106/801,948) log2 [801,948 · 774,106 / ((141+774,106)(27,652+774,106))]
        ≈ 0.0001105
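Formula (1) can be transcribed directly into code and checked against the slide's numbers (the counts come from the table above):

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    # MLE-based expected mutual information of a term and a class, formula (1).
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00  # docs containing / not containing t
    n_1, n_0 = n11 + n01, n10 + n00  # docs in / not in c
    mi = 0.0
    for count, row, col in ((n11, n1_, n_1), (n10, n1_, n_0),
                            (n01, n0_, n_1), (n00, n0_, n_0)):
        if count:  # a zero count contributes 0 to the sum
            mi += count / n * log2(n * count / (row * col))
    return mi

mi = mutual_information(49, 141, 27652, 774106)  # about 0.0001105
```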
MI feature selection on Reuters

Terms with highest mutual information for three classes:

coffee:  coffee 0.0111, bags 0.0042, growers 0.0025, kg 0.0019, colombia 0.0018, brazil 0.0016, export 0.0014, exporters 0.0013, exports 0.0013, crop 0.0012
sports:  soccer 0.0681, cup 0.0515, match 0.0441, matches 0.0408, played 0.0388, league 0.0386, beat 0.0301, game 0.0299, games 0.0284, team 0.0264
poultry: poultry 0.0013, meat 0.0008, chicken 0.0006, agriculture 0.0005, avian 0.0004, broiler 0.0003, veterinary 0.0003, birds 0.0003, inspection 0.0003, pathogenic 0.0003

I(export, poultry) ≈ 0.0001105 is not among the ten highest for the class poultry, but still potentially significant.