Review

Topics to review for the final exam:
• Evaluation of classification
  – Predicting performance, confidence intervals
  – ROC analysis
  – Precision, recall, F-measure
• Association analysis
  – APRIORI
  – FP-Tree/FP-Growth
  – Maximal and closed frequent itemsets
  – Cross-support, h-measure
  – Confidence vs. interestingness
  – Mining sequences
  – Mining graphs
• Cluster analysis
  – K-means, bisecting K-means
  – SOM
  – DBSCAN
  – Hierarchical clustering
• Web search
  – IR
  – Reputation ranking

A single-sided help sheet is allowed.
Mining Association Rules
Two-step approach:
1. Frequent itemset generation
   – Generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets).
2. Rule generation
   – Generate high-confidence rules from each frequent itemset.
• The computational requirements of frequent itemset generation are more expensive than those of rule generation.
Candidate itemsets are generated and then tested against the database to see whether they are frequent.
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent.
• Apriori principle, stated conversely:
  – If an itemset is infrequent, then all of its supersets must be infrequent too.
[Figure: Illustrating the Apriori principle on the itemset lattice over items A–E. Once an itemset is found to be infrequent, all of its supersets are pruned from the search space.]
Apriori Algorithm
• Method:
  – Let k = 1.
  – Generate frequent itemsets of length 1.
  – Repeat until no new frequent itemsets are identified: k = k + 1, then
    1. Generate length-k candidate itemsets from the length-(k-1) frequent itemsets.
    2. Prune candidate itemsets containing subsets of length k-1 that are infrequent.
    3. Count the support of each candidate by scanning the DB and eliminate infrequent candidates, leaving only the frequent ones.
F(k-1) × F(k-1) Method
• Merge a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.
• E.g., the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form the candidate 3-itemset {Bread, Diapers, Milk}.
Completeness
• We don't merge {Beer, Diapers} with {Diapers, Milk} because their first items differ.
• Do we lose {Beer, Diapers, Milk}? No: if it is a viable candidate, it is generated by merging {Beer, Diapers} with {Beer, Milk}, which do share their first item.
Pruning
• Before checking a candidate against the DB, a candidate pruning step ensures that its remaining (k-1)-subsets are frequent.
Counting
• Finally, the surviving candidates are tested (counted) against the DB. A sketch of the whole procedure follows.
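A minimal Python sketch of the level-wise procedure just described (F(k-1) × F(k-1) merge, subset pruning, and support counting). Representing transactions as sets of items and minsup as an absolute count are illustrative choices, not something prescribed by the slides:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining (illustrative sketch).

    transactions: list of sets of items; minsup: absolute support count.
    Returns a dict mapping frozenset(itemset) -> support count.
    """
    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        k += 1
        # F(k-1) x F(k-1) merge: join two (k-1)-itemsets sharing their first k-2 items
        prev = sorted(tuple(sorted(i)) for i in frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:k - 2] == b[:k - 2]:
                candidates.add(frozenset(a) | frozenset(b))
        # Candidate pruning: every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Support counting: one scan over the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
    return all_frequent

# Example with made-up market-basket data:
db = [{'Bread', 'Milk'}, {'Bread', 'Diapers', 'Beer'},
      {'Milk', 'Diapers', 'Beer'}, {'Bread', 'Milk', 'Diapers'},
      {'Bread', 'Milk', 'Diapers', 'Beer'}]
print(apriori(db, minsup=3))
```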
Rule Generation
• Computing the confidence of an association rule does not require additional scans of the transactions.
• Consider {1, 2} → {3}.
• The rule confidence is σ({1, 2, 3}) / σ({1, 2}).
• Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent too, and we already know the supports of all frequent itemsets.
• Initially, all the high-confidence rules that have only one item in the rule consequent are extracted.
• These rules are then used to generate new candidate rules.
• For example, if
  – {a, c, d} → {b} and {a, b, d} → {c} are high-confidence rules, then the candidate rule {a, d} → {b, c} is generated by merging the consequents of both rules.
  – The candidate rules are then checked for confidence.
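A hedged sketch of this rule-generation step in Python, assuming the frequent-itemset supports are already available in a dictionary (the example support counts are made up for illustration):

```python
from itertools import combinations

def generate_rules(supports, minconf):
    """Generate high-confidence rules from frequent itemsets (illustrative sketch).

    supports: dict mapping frozenset(itemset) -> support count, as produced by
    a frequent-itemset miner; no extra database scans are needed.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        # Level 1: rules with a single item in the consequent
        consequents = [frozenset([i]) for i in itemset
                       if sup / supports[itemset - frozenset([i])] >= minconf]
        rules += [(itemset - c, c, sup / supports[itemset - c]) for c in consequents]
        # Merge consequents of high-confidence rules to form larger consequents
        while consequents:
            next_level = set()
            for c1, c2 in combinations(consequents, 2):
                c = c1 | c2
                if len(c) == len(c1) + 1 and len(c) < len(itemset):
                    if sup / supports[itemset - c] >= minconf:
                        next_level.add(c)
            rules += [(itemset - c, c, sup / supports[itemset - c]) for c in next_level]
            consequents = list(next_level)
    return rules

# Example with hypothetical support counts:
sups = {frozenset('1'): 4, frozenset('2'): 5, frozenset('3'): 4,
        frozenset('12'): 4, frozenset('13'): 3, frozenset('23'): 4,
        frozenset('123'): 3}
for lhs, rhs, conf in generate_rules(sups, minconf=0.7):
    print(set(lhs), '->', set(rhs), round(conf, 2))
```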
Other Concepts and Algorithms
• FP-Tree/FP-Growth
  – See the corresponding slide set and the Assignment 2 solution.
• Maximal frequent itemsets
• Closed itemsets
• Interest factor
• Mining sequences
Maximal Frequent Itemsets
[Figure: itemset lattice over items A–E showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just below the border.]

An itemset is maximal frequent if none of its immediate supersets is frequent.

Maximal frequent itemsets form the smallest set of itemsets from which all frequent itemsets can be derived.
Closed Itemsets
• Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets.
  – An additional pass over the data set is needed to determine the support counts of the non-maximal frequent itemsets.
• It might be desirable to have a minimal representation of frequent itemsets that preserves the support information.
  – Such a representation is the set of closed frequent itemsets.
• An itemset is closed if none of its immediate supersets has the same support as the itemset.
  – An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
Maximal vs. Closed Frequent Itemsets

[Figure: itemset lattice over items A–E, each itemset annotated with the IDs of the transactions that support it (itemsets supported by no transaction are marked as such). With minimum support = 2 there are 9 closed frequent itemsets, of which 4 are both closed and maximal.]

Transactions:

  TID   Items
  1     A, B, C
  2     A, B, C, D
  3     B, C, E
  4     A, C, D, E
  5     D, E
Deriving Frequent Itemsets From Closed Frequent Itemsets
• E.g., consider the frequent itemset {a, d}. Because this itemset is not closed, its support count must be identical to that of one of its immediate supersets.
  – The key is to determine which superset among {a, b, d}, {a, c, d}, or {a, d, e} has exactly the same support count as {a, d}.
  – By the Apriori principle, the support of {a, d} must be equal to the largest support among its supersets.
• So, the support of {a, d} must be identical to the support of {a, c, d}.
[Figure: the same lattice with transaction IDs; {a, d} (supported by transactions 2 and 4) has the same support as its closed superset {a, c, d}.]
Support counting using closed frequent itemsets

Let C denote the set of closed frequent itemsets.
Let kmax denote the maximum length of the closed frequent itemsets.

  F_kmax = { f | f ∈ C, |f| = kmax }        // frequent itemsets of size kmax
  for k = kmax - 1 downto 1 do
      F_k = all sub-itemsets of length k of the frequent itemsets in F_(k+1)
      for each f ∈ F_k do
          if f ∉ C then
              f.support = max { f'.support | f' ∈ F_(k+1), f ⊂ f' }
          end if
      end for
  end for
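A small Python rendering of this procedure. Itemsets are represented as frozensets of single-letter items (an illustrative choice), and the input is the set of nine closed frequent itemsets of the example above (minsup = 2); closed itemsets of size k that are not subsets of any larger frequent itemset (i.e., maximal ones) are added explicitly so they are not missed:

```python
from itertools import combinations

def supports_from_closed(closed):
    """Derive the support of every frequent itemset from the closed frequent
    itemsets (illustrative sketch). closed: dict frozenset -> support count."""
    kmax = max(len(c) for c in closed)
    support = dict(closed)                                   # closed itemsets keep their counts
    level = {c: s for c, s in closed.items() if len(c) == kmax}
    for k in range(kmax - 1, 0, -1):
        # all k-sub-itemsets of the frequent (k+1)-itemsets, plus closed k-itemsets
        candidates = {frozenset(s) for f in level for s in combinations(f, k)}
        candidates |= {c for c in closed if len(c) == k}
        next_level = {}
        for f in candidates:
            if f in closed:
                next_level[f] = closed[f]
            else:
                # support of a non-closed itemset = max support among its supersets
                next_level[f] = max(s for g, s in level.items() if f < g)
        support.update(next_level)
        level = next_level
    return support

# The nine closed frequent itemsets of the example lattice (minsup = 2):
closed = {frozenset('C'): 4, frozenset('D'): 3, frozenset('E'): 3,
          frozenset('AC'): 3, frozenset('BC'): 3, frozenset('CE'): 2,
          frozenset('DE'): 2, frozenset('ABC'): 2, frozenset('ACD'): 2}
print(supports_from_closed(closed)[frozenset('AD')])   # -> 2, same as {A, C, D}
```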
Contingency Table
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

           Y      ¬Y
  X       f11    f10   | f1+
  ¬X      f01    f00   | f0+
          f+1    f+0   | |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y
Pitfall of Confidence

          Coffee   ¬Coffee
  Tea       150        50   | 200
  ¬Tea      750       150   | 900
            900       200   | 1100

Consider the association rule: Tea → Coffee.

Confidence = P(Coffee, Tea) / P(Tea) = P(Coffee | Tea) = 150/200 = 0.75 (seems quite high).

But P(Coffee) = 900/1100 ≈ 0.82.

Thus, knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from about 82% to 75%!

Although the confidence is high, the rule is misleading.

In fact, P(Coffee | ¬Tea) = P(Coffee, ¬Tea) / P(¬Tea) = 750/900 ≈ 0.83.
Interest Factor
• A measure that takes statistical dependence into account:

  Interest = P(X, Y) / (P(X) P(Y)) = (f11/N) / ((f1+/N)(f+1/N)) = N f11 / (f1+ f+1)

• f11/N is an estimate of the joint probability P(X, Y).
• f1+/N and f+1/N are estimates of P(X) and P(Y), respectively.
• If X and Y are statistically independent, then P(X, Y) = P(X) × P(Y), and thus the Interest is 1.
Example: Interest

          Coffee   ¬Coffee
  Tea       150        50   | 200
  ¬Tea      750       150   | 900
            900       200   | 1100

Association rule: Tea → Coffee

Interest = (150 × 1100) / (200 × 900) ≈ 0.92

(< 1, therefore Tea and Coffee are negatively correlated)
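Both measures can be computed directly from the contingency counts; a minimal Python check of the numbers above:

```python
# Contingency counts for the Tea -> Coffee example above.
f11, f10 = 150, 50      # Tea & Coffee, Tea & not-Coffee
f01, f00 = 750, 150     # not-Tea & Coffee, not-Tea & not-Coffee

N = f11 + f10 + f01 + f00          # 1100
f1p = f11 + f10                    # row total: support count of Tea
fp1 = f11 + f01                    # column total: support count of Coffee

confidence = f11 / f1p             # P(Coffee | Tea)
interest = N * f11 / (f1p * fp1)   # P(Tea, Coffee) / (P(Tea) P(Coffee))

print(round(confidence, 2))        # 0.75
print(round(interest, 2))          # 0.92
```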
Cross-Support Patterns
• These are patterns that relate a high-frequency item such as milk to a low-frequency item such as caviar.
• They are likely to be spurious because their correlations tend to be weak.
  – E.g., the confidence of {caviar} → {milk} is likely to be high, but the pattern is still spurious, since there is probably no correlation between caviar and milk.
• Observation: on the other hand, the confidence of {milk} → {caviar} is very low.
• Cross-support patterns can be detected and eliminated by examining the lowest-confidence rule that can be extracted from a given itemset.
  – This confidence should be above a certain level for the pattern not to be a cross-support pattern.
Finding the Lowest Confidence
• Recall the anti-monotone property of confidence:

  conf( {i1, i2} → {i3, i4, …, ik} ) ≤ conf( {i1, i2, i3} → {i4, …, ik} )

• This property says that confidence never increases as we shift items from the left-hand side to the right-hand side of an association rule.
• Hence, the lowest-confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.
Finding the Lowest Confidence
• Given a frequent itemset {i1, i2, i3, …, ik}, the rule

  {ij} → {i1, i2, …, i(j-1), i(j+1), …, ik}

has the lowest confidence if

  s(ij) = max { s(i1), s(i2), …, s(ik) }

• This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.
Finding the Lowest Confidence
• Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, …, ik} is

  h-confidence({i1, i2, …, ik}) = s({i1, i2, …, ik}) / max{ s(i1), s(i2), …, s(ik) }

• This is also known as the h-confidence measure or all-confidence measure.
• Cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed some user-specified threshold hc.
• h-confidence is anti-monotone, i.e.,

  h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1})

and thus can be incorporated directly into the mining algorithm.
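A tiny Python sketch of h-confidence; the support counts are hypothetical and chosen to mimic a milk/caviar cross-support pattern:

```python
def h_confidence(itemset, support):
    """h-confidence (all-confidence) of an itemset, given single-item and
    itemset support counts in the dict `support` (illustrative sketch)."""
    return support[frozenset(itemset)] / max(support[frozenset([i])] for i in itemset)

# Hypothetical supports: milk is frequent, caviar is rare, they co-occur rarely.
support = {frozenset(['milk']): 800, frozenset(['caviar']): 10,
           frozenset(['milk', 'caviar']): 8}
print(h_confidence(['milk', 'caviar'], support))   # 0.01 -> a cross-support pattern
```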
Examples of Sequences
• Web sequence:
  {Homepage} → {Electronics} → {Digital Cameras} → {Canon Digital Camera} → {Shopping Cart} → {Order Confirmation} → {Return to Shopping}
• Purchase history of a given customer:
  {Java in a Nutshell, Intro to Servlets} → {EJB Patterns} → …
• Sequence of classes taken by a computer science major:
  {Algorithms and Data Structures, Introduction to Operating Systems} → {Database Systems, Computer Architecture} → {Computer Networks, Software Engineering} → {Computer Graphics, Parallel Programming} → …
Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions):

  s = < e1 e2 e3 … >

  – Each element contains a collection of events (items):

    ei = {i1, i2, …, ik}

  – Each element is attributed to a specific time or location.
• A k-sequence is a sequence that contains k events (items).

[Figure: a sequence of elements (transactions), each element containing one or more events (items).]
Formal Definition of a Subsequence
• A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.

  Data sequence              Subsequence       Contained?
  < {2,4} {3,5,6} {8} >      < {2} {3,5} >     Yes
  < {1,2} {3,4} >            < {1} {2} >       No
  < {2,4} {2,4} {2,5} >      < {2} {4} >       Yes

• The support of a subsequence w is the fraction of data sequences that contain w.
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
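A small Python check of plain containment (no timing constraints), reproducing the three table rows above; representing sequences as lists of sets of events is an illustrative choice:

```python
def contains(data_seq, sub_seq):
    """Check whether sub_seq is contained in data_seq (no timing constraints).

    Sequences are lists of elements, each element a set of events. A greedy
    left-to-right match suffices for plain containment.
    """
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and set(sub_seq[i]) <= set(element):
            i += 1
    return i == len(sub_seq)

# The three examples from the table above:
print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))    # True
print(contains([{1, 2}, {3, 4}],         [{1}, {2}]))       # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))       # True
```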
APRIORI-like Algorithm
• Make the first pass over the sequence database to find all frequent 1-element sequences.
• Repeat until no new frequent sequences are found:
  – Candidate generation:
    • Merge pairs of frequent subsequences found in the (k-1)-th pass to generate candidate sequences that contain k items.
  – Candidate pruning:
    • Prune candidate k-sequences that contain infrequent (k-1)-subsequences.
  – Support counting:
    • Make a new pass over the sequence database to find the support of these candidate sequences.
    • Eliminate candidate k-sequences whose actual support is less than minsup.
Candidate Generation
• Base case (k = 2):
  – Merging two frequent 1-sequences <{i1}> and <{i2}> produces the candidate 2-sequences <{i1} {i2}>, <{i2} {i1}>, <{i1, i2}>, and <{i2, i1}> (the last two denote the same sequence, since elements are unordered sets).
• General case (k > 2):
  – A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2.
  – The resulting candidate after merging is the sequence w1 extended with the last event of w2:
    • If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1.
    • Otherwise, the last event in w2 becomes a separate element appended to the end of w1.
Candidate Generation Examples
• Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> produces the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.
• Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> produces the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.
• Finally, the sequences <{1} {2} {3}> and <{1} {2 5}> don't have to be merged. Why?
  – Because removing the first event from the first sequence doesn't give the same subsequence as removing the last event from the second sequence.
  – If <{1} {2 5} {3}> is a viable candidate, it will be generated by merging a different pair of sequences, <{1} {2 5}> and <{2 5} {3}>.
Example

  Frequent 3-sequences:
    < {1} {2} {3} >, < {1} {2 5} >, < {1} {5} {3} >, < {2} {3} {4} >, < {2 5} {3} >, < {3} {4} {5} >, < {5} {3 4} >

  After candidate generation:
    < {1} {2} {3} {4} >, < {1} {2 5} {3} >, < {1} {5} {3 4} >, < {2} {3} {4} {5} >, < {2 5} {3 4} >

  After candidate pruning:
    < {1} {2 5} {3} >
Timing Constraints

[Figure: a sequence {A B} {C} {D E} annotated with the max-gap constraint (maximum allowed time between consecutive elements) and the max-span constraint (maximum allowed time between the first and last elements).]

With max-gap = 2 and max-span = 4:

  Data sequence                              Subsequence          Contained?
  < {2,4} {3,5,6} {4,7} {4,5} {8} >          < {6} {5} >          Yes
  < {1} {2} {3} {4} {5} >                    < {1} {4} >          No
  < {1} {2,3} {3,4} {4,5} >                  < {2} {3} {5} >      Yes
  < {1,2} {3} {2,3} {3,4} {2,4} {4,5} >      < {1,2} {5} >        No
Mining Sequential Patterns with Timing Constraints
• Approach 1:
  – Mine sequential patterns without timing constraints.
  – Post-process the discovered patterns.
• Approach 2:
  – Modify the algorithm to directly prune candidates that violate the timing constraints.
  – Question: does the APRIORI principle still hold?
APRIORI Principle for Sequence Data

  Object   Timestamp   Events
  A        1           1, 2, 4
  A        2           2, 3
  A        3           5
  B        1           1, 2
  B        2           2, 3, 4
  C        1           1, 2
  C        2           2, 3, 4
  C        3           2, 4, 5
  D        1           2
  D        2           3, 4
  D        3           4, 5
  E        1           1, 3
  E        2           2, 4, 5

Suppose max-gap = 1 and max-span = 5.

<{2} {5}> has support = 40%, but <{2} {3} {5}> has support = 60%! The APRIORI principle does not hold.

The problem exists because of the max-gap constraint.

This problem can be avoided by using the concept of a contiguous subsequence.
Contiguous Subsequences
• s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions holds:
  1. s is obtained from w by deleting an item from either e1 or ek.
  2. s is obtained from w by deleting an item from any element ei that contains at least 2 items.
  3. s is a contiguous subsequence of s', and s' is a contiguous subsequence of w (recursive definition).
• Examples: s = < {1} {2} >
  – is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >
  – is not a contiguous subsequence of < {1} {3} {2} > and < {2} {1} {3} {2} >
Modified Candidate Pruning Step
• Modified APRIORI principle:
  – If a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent.
• Candidate generation doesn't change; only pruning changes.
• Without the max-gap constraint:
  – A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
• With the max-gap constraint:
  – A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent.
Cluster Analysis
• Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
  – Intra-cluster distances are minimized; inter-cluster distances are maximized.
K-means Clustering
• Partitional clustering approach.
• Each cluster is associated with a centroid (center point), typically the mean of the points in the cluster.
• Each point is assigned to the cluster with the closest centroid.
• The number of clusters, K, must be specified.
• The basic algorithm is very simple.
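A minimal NumPy sketch of the basic algorithm; the random initialization and the toy blob data are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means (illustrative sketch). X: (n, d) array; returns labels, centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: three Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 0), (1.5, 2.5)]])
labels, centroids = kmeans(X, k=3)
print(centroids)
```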
Importance of Choosing Initial Centroids

[Figure: six iterations of K-means (plots of x vs. y) showing how the clusters evolve from one choice of initial centroids.]
Importance of Choosing Initial Centroids

[Figure: five iterations of K-means from a different choice of initial centroids, illustrating that the final clustering depends on the initialization.]
Solutions to the Initial Centroids Problem
• Multiple runs
  – Helps, but probability is not on your side.
• Bisecting K-means
  – Not as susceptible to initialization issues.
Bisecting K-means
A straightforward extension of the basic K-means algorithm. Simple idea: to obtain K clusters, split the set of points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

Algorithm:
  Initialize the list of clusters to contain the cluster consisting of all points.
  repeat
      Remove a cluster from the list of clusters.
      // Perform several "trial" bisections of the chosen cluster.
      for i = 1 to number of trials do
          Bisect the selected cluster using basic K-means (i.e., 2-means).
      end for
      Select the two clusters from the bisection with the lowest total SSE.
      Add these two clusters to the list of clusters.
  until the list of clusters contains K clusters.
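A hedged Python sketch of this procedure. The slide does not specify how the cluster to split is chosen, so this version splits the cluster with the largest SSE, and it uses scikit-learn's KMeans for the basic 2-means step:

```python
import numpy as np
from sklearn.cluster import KMeans   # used only for the basic 2-means step

def sse(points, centroid):
    return float(((points - centroid) ** 2).sum())

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Bisecting K-means (illustrative sketch): repeatedly split one cluster
    with basic 2-means until k clusters are obtained."""
    clusters = [X]                                    # list of point arrays
    while len(clusters) < k:
        # choose the cluster to split (here: the one with the largest SSE)
        i = int(np.argmax([sse(c, c.mean(axis=0)) for c in clusters]))
        chosen = clusters.pop(i)
        best = None
        for trial in range(n_trials):                 # several trial bisections
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(chosen)
            parts = [chosen[km.labels_ == 0], chosen[km.labels_ == 1]]
            total = sum(sse(p, p.mean(axis=0)) for p in parts)
            if best is None or total < best[0]:
                best = (total, parts)
        clusters.extend(best[1])                      # keep the bisection with lowest SSE
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in [(0, 0), (3, 0), (1.5, 2.5)]])
print([len(c) for c in bisecting_kmeans(X, k=3)])
```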
Bisecting K-means Example
Limitations of K-means
• K-means has problems when clusters have differing
  – sizes
  – densities
  – non-globular shapes
• K-means also has problems when the data contains outliers.
Exercise
• For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth?
• Only for (b) and (d).
  – For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included.
  – For (d), K-means would find the nose, eyes, and mouth straightforwardly, as long as the number of clusters was set to 4.
• What limitation does clustering have in detecting all the patterns formed by the points in figure (c)?
  – Clustering techniques can only find patterns of points, not of empty spaces.
Agglomerative Clustering Algorithm
  Compute the proximity matrix.
  Let each data point be a cluster.
  repeat
      Merge the two closest clusters.
      Update the proximity matrix.
  until only a single cluster remains
• The key operation is the computation of the proximity of two clusters.
  – Different approaches to defining the distance between clusters distinguish the different algorithms.
Cluster Similarity: MIN
• The similarity of two clusters is based on the two most similar (closest) points in the different clusters.
  – Determined by one pair of points.
Hierarchical Clustering: MIN

[Figure: nested clusters over six points and the corresponding dendrogram produced by MIN (single-link) clustering.]
Strength of MIN
• Can handle non-globular shapes.

[Figure: original points and the two clusters found by MIN.]

Limitations of MIN
• Sensitive to noise and outliers.

[Figure: original points, four clusters, and three clusters; the yellow points get wrongly merged with the red ones instead of the green ones.]
Cluster Similarity: MAX
• The similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
  – Determined by all pairs of points in the two clusters.
Hierarchical Clustering: MAX

[Figure: nested clusters and dendrogram produced by MAX (complete-link) clustering on the same six points.]
Strengths of MAX
• Robust with respect to noise and outliers.

[Figure: original points, four clusters, and three clusters; the yellow points now get merged with the green ones.]
Cluster Similarity: Group Average
• The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

  proximity(Cluster_i, Cluster_j) = ( Σ over p_i ∈ Cluster_i, p_j ∈ Cluster_j of proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )
Hierarchical Clustering: Group Average

[Figure: nested clusters and dendrogram produced by group-average clustering on the same six points.]
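The three cluster-similarity definitions above correspond to the 'single', 'complete', and 'average' linkage methods in SciPy; a short illustrative run on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in [(0, 0), (2, 0), (1, 2)]])

# 'single' = MIN, 'complete' = MAX, 'average' = group average
for method in ['single', 'complete', 'average']:
    Z = linkage(X, method=method)                      # the merge tree (dendrogram data)
    labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
```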
DBSCAN
DBSCAN is a density-based algorithm. It locates regions of high density that are separated from one another by regions of low density.
• Density = number of points within a specified radius (Eps).
• A point is a core point if it has more than a specified number of points (MinPts) within Eps.
  – These are points in the interior of a cluster.
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.
DBSCAN Algorithm
• Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster.
• Likewise, any border point that is close enough to a core point is put in the same cluster as the core point.
• Ties may need to be resolved if a border point is close to core points from different clusters.
• Noise points are discarded.
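A small NumPy sketch that applies the core/border/noise definitions above to a toy data set; treating "within Eps" as "at least MinPts neighbors, counting the point itself" is one common reading of the definition:

```python
import numpy as np

def dbscan_point_types(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative sketch of
    the definitions above; full DBSCAN would then connect nearby core points)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                       # Eps-neighborhoods (include the point itself)
    core = neighbors.sum(axis=1) >= min_pts
    types = np.full(n, 'noise', dtype=object)
    types[core] = 'core'
    # border: not core, but within Eps of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    types[border] = 'border'
    return types

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.2, size=(30, 2)), [[5.0, 5.0]]])   # one far outlier
print(dbscan_point_types(X, eps=0.5, min_pts=4))
```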
When DBSCAN Works Well

[Figure: original points and the clusters found by DBSCAN.]

• Resistant to noise.
• Can handle clusters of different shapes and sizes.
When DBSCAN Does NOT Work Well
DBSCAN: Determining Eps and MinPts
• Look at the behavior of the distance from a point to its k-th nearest neighbor, called the k-dist.
• For points that belong to some cluster, the value of k-dist will be small (if k is not larger than the cluster size).
• However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large.
• So, if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps.
• If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.
DBSCAN: Determining Eps and MinPts
• The Eps determined in this way depends on k, but does not change dramatically as k changes.
• If k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.
• If k is too large, then small clusters (of size less than k) are likely to be labeled as noise.
• The original DBSCAN used k = 4, which appears to be a reasonable value for most two-dimensional data sets.
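A short sketch of the k-dist heuristic using scikit-learn's nearest-neighbor search and matplotlib; the data set and k = 4 are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in [(0, 0), (3, 3)]]
              + [rng.uniform(-1, 4, size=(20, 2))])        # two clusters plus noise

k = 4                                            # MinPts candidate
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
k_dist = np.sort(dist[:, -1])                    # distance to the k-th nearest neighbor

plt.plot(k_dist)
plt.xlabel('points sorted by k-dist')
plt.ylabel(f'{k}-dist')
plt.show()                                       # choose Eps near the "knee" of this curve
```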
IR: Web Queries
• Keyword queries
• Boolean queries (using AND, OR, NOT)
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions
From: Bing Liu. Web Data Mining. 2007
Vector Space Model
• Documents are also treated as a "bag" of words or terms.
• Each document is represented as a vector.
• Term Frequency (TF) scheme: the weight of a term ti in document dj is the number of times ti appears in dj, denoted by fij. Normalization may also be applied.
• A shortcoming of the TF scheme is that it doesn't consider the situation where a term appears in many documents of the collection.
  – Such a term may not be discriminative.
From: Bing Liu. Web Data Mining. 2007
TF-IDF Term Weighting Scheme
• The most well-known weighting scheme:
  – TF: (normalized) term frequency
  – IDF: inverse document frequency,

    idf_i = log(N / df_i)

    N: total number of documents
    df_i: the number of documents in which term ti appears. The more documents a term appears in, the less discriminative it is, and thus the less weight we should give it.
• The final TF-IDF term weight is:

    w_ij = tf_ij × idf_i
From: Bing Liu. Web Data Mining. 2007
Retrieval in the Vector Space Model
• A query q is represented in the same way as a document.
• The weight wiq of each term ti in q can be computed in the same way as in a normal document.
• Relevance of dj to q: compare the similarity of query q and document dj.
• For this, use the cosine similarity (the cosine of the angle between the two vectors).
From: Bing Liu. Web Data Mining. 2007
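A small self-contained sketch of TF-IDF weighting and cosine-similarity retrieval; it uses tf = raw term count and idf = log(N/df), one common variant whose normalization may differ from the exact scheme in the slides, and the documents and query are made up:

```python
import math
from collections import Counter

docs = ["web mining and data mining",
        "information retrieval on the web",
        "page rank for web search"]
query = "web data mining"

def tfidf_vectors(texts):
    """TF-IDF vectors with tf = raw count and idf = log(N / df) (one common variant)."""
    tokenized = [t.split() for t in texts]
    N = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(N / df[term]) for term in df}
    vecs = [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in tokenized]
    return vecs, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vecs, idf = tfidf_vectors(docs)
q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.split()).items()}
for i, d in enumerate(doc_vecs):
    print(i, round(cosine(q_vec, d), 3))          # rank documents by similarity to the query
```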
PageRank (PR)
Intuitively, we solve the recursive definition of "importance": a page is important if important pages link to it.
• PageRank is the estimated page importance.
• In short, PageRank is a "vote", by all the other pages on the Web, about how important a page is.
  – A link to a page counts as a vote of support.
  – If there is no link, there is no support (but it is an abstention from voting rather than a vote against the page).
From: Jeff Ullman’s lecture
PageRank Formula

  PR(A) = PR(T1)/C(T1) + … + PR(Tn)/C(Tn)

1. PR(Tn): each page has a notion of its own self-importance, which is, say, 1 initially.
2. C(Tn): count of outgoing links from page Tn.
3. PR(Tn)/C(Tn):
   a) Each page spreads its vote out evenly amongst all of its outgoing links.
   b) So if our page (say page A) has a backlink from page Tn, the share of the vote page A gets from Tn is PR(Tn)/C(Tn).
Web Matrix
• Capture the formula by the web matrix WebM:
  – The ij-th entry is 1/n if page i is one of the n successors of page j, and 0 otherwise.
• Then, the importance vector containing the rank of each page is calculated by:

  Rank_new = WebM · Rank_old

• Start with Rank = (1, 1, …).
• Observe that the above matrix-vector product conforms to the PageRank formula on the previous slide.
Example
• In 1839, the Web consisted of only three pages: Netscape, Microsoft, and Amazon. Netscape links to itself and to Amazon, Microsoft links to Amazon, and Amazon links to Netscape and Microsoft, giving (in the order n, m, a):

  WebM =
    [ 1/2   0   1/2 ]
    [  0    0   1/2 ]
    [ 1/2   1    0  ]

The first four iterations of Rank_new = WebM · Rank_old give the following estimates:

  n = 1    1     5/4    9/8    5/4
  m = 1    1/2   3/4    1/2    11/16
  a = 1    3/2   1      11/8   17/16

In the limit, the solution is n = a = 6/5; m = 3/5.

From: Jeff Ullman's lecture
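The same computation as a short NumPy power iteration over the example's web matrix:

```python
import numpy as np

# Web matrix for the Netscape / Microsoft / Amazon example above
# (each column distributes a page's vote evenly over its successors).
WebM = np.array([[0.5, 0.0, 0.5],    # Netscape is linked to by Netscape and Amazon
                 [0.0, 0.0, 0.5],    # Microsoft is linked to by Amazon
                 [0.5, 1.0, 0.0]])   # Amazon is linked to by Netscape and Microsoft

rank = np.array([1.0, 1.0, 1.0])     # start with Rank = (1, 1, 1)
for _ in range(50):
    rank = WebM @ rank               # Rank_new = WebM . Rank_old
print(rank)                          # approaches (6/5, 3/5, 6/5)
```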
Problems With Real Web Graphs
Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance will "leak out of" the Web.

Example: Suppose Microsoft tries to claim that it is a monopoly by removing all links from its site. The Microsoft column of the web matrix becomes all zeros:

  WebM =
    [ 1/2   0   1/2 ]
    [  0    0   1/2 ]
    [ 1/2   0    0  ]

The rank vectors for the first four iterations are:

  n = 1    1     3/4   5/8   1/2
  m = 1    1/2   1/4   1/4   3/16
  a = 1    1/2   1/2   3/8   5/16

Eventually, each of n, m, and a becomes 0; i.e., all the importance leaks out.

From: Jeff Ullman's lecture
Problems With Real Web Graphs
Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.

Example: Angered by the decision, Microsoft decides it will link only to itself from now on. Now Microsoft has become a spider trap, and its column of the web matrix sends all of its vote to itself:

  WebM =
    [ 1/2   0   1/2 ]
    [  0    1   1/2 ]
    [ 1/2   0    0  ]

The rank vectors for the first four iterations are:

  n = 1    1     3/4   5/8   1/2
  m = 1    3/2   7/4   2     35/16
  a = 1    1/2   1/2   3/8   5/16

Now, m converges to 3, and n = a = 0.

From: Jeff Ullman's lecture
Google's Solution to Dead Ends and Spider Traps

Stop the other pages from having too much influence: the total vote is "damped down" by multiplying it by a factor.

Example: If we use a 20% damp-down, the equation of the previous (spider trap) example becomes:

  Rank_new = 0.8 ·
    [ 1/2   0   1/2 ]
    [  0    1   1/2 ]   · Rank_old  +  0.2 · (1, 1, 1)^T
    [ 1/2   0    0  ]

The solution to this equation is n = 7/11, m = 21/11, a = 5/11.

From: Jeff Ullman's lecture
Hubs and Authorities
• Intuitively, we define "hub" and "authority" in a mutually recursive way:
  – a hub links to many authorities, and
  – an authority is linked to by many hubs.
• Authorities turn out to be pages that offer information about a topic, e.g., http://www.bctransit.com
• Hubs are pages that don't provide the information themselves, but tell you where to find it, e.g., http://yahoo.com
Matrix Formulation
• Use a matrix formulation similar to that of PageRank, but without the stochastic restriction.
  – We count each link as 1, regardless of how many successors or predecessors a page has.
• Namely, define a matrix A whose rows and columns correspond to Web pages, with entry Aij = 1 if page i links to page j, and 0 otherwise.
  – Notice that A^T, the transpose of A, looks like the matrix used for computing PageRank, but A^T has 1's where the PageRank matrix has fractions.
Authority and Hubbiness Vectors
• Let a and h be vectors whose i-th components correspond to the degrees of authority and hubbiness of the i-th page.
• Let λ and μ be suitable scaling factors. Then we can state:
  1. h = λ A a
     That is, the hubbiness of each page is the sum of the authorities of all the pages it links to, scaled by λ.
  2. a = μ A^T h
     That is, the authority of each page is the sum of the hubbiness of all the pages that link to it, scaled by μ.
Simple Substitutions
• From (1) and (2), using simple substitution, we can derive two equations that relate the vectors a and h only to themselves:
  1. a = λμ A^T A a
  2. h = λμ A A^T h
• As a result, we can compute h and a by iteration.
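A short NumPy sketch of this iteration; the link matrix below is a made-up four-page example, and normalizing each vector plays the role of the scaling factors λ and μ:

```python
import numpy as np

def hits(A, n_iter=50):
    """HITS iteration (illustrative sketch). A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h                 # authority: sum of hubbiness of pages linking to it
        h = A @ a                   # hubbiness: sum of authority of pages it links to
        a /= a.max()                # normalization stands in for the scaling factors
        h /= h.max()
    return h, a

# Tiny example: pages 0 and 3 link to pages 1 and 2.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)
h, a = hits(A)
print("hubs:", h.round(2))          # pages 0 and 3 act as hubs
print("authorities:", a.round(2))   # pages 1 and 2 act as authorities
```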
Example
If we use λ = μ = 1 and assume the initial vectors h = [hn, hm, ha] = [1, 1, 1] and a = [an, am, aa] = [1, 1, 1], the first three iterations of the equations for a and h are:

From: Jeff Ullman's lecture