
Page 1:

Review

Page 2:

Topics to review for the final exam
• Evaluation of classification

– predicting performance, confidence intervals
– ROC analysis
– Precision, recall, F-Measure

• Association Analysis
– APRIORI
– FP-Tree/FP-Growth
– Maximal, closed frequent itemsets
– Cross-support, h-measure
– Confidence vs. interestingness
– Mining sequences
– Mining graphs

• Cluster Analysis
– K-Means, Bisecting K-means
– SOM
– DBSCAN
– Hierarchical clustering

• Web search
– IR
– Reputation ranking

Single-sided help sheet allowed.

Page 3:

Mining Association Rules
Two-step approach:

1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets)

2. Rule Generation
– Generate high-confidence rules from each frequent itemset

• Frequent itemset generation is computationally more expensive than rule generation.

Candidate itemsets are generated and then tested against the database to see whether they are frequent.

• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

• Apriori principle, stated contrapositively:
– If an itemset is infrequent, then all of its supersets must be infrequent too.

Page 4:

Illustrating the Apriori Principle

[Figure: the itemset lattice over {A, B, C, D, E}, from the null set down to ABCDE. Once an itemset (here AB) is found to be infrequent, all of its supersets are pruned.]

Page 5:

Apriori Algorithm
• Method:

– Let k=1

– Generate frequent itemsets of length 1

– Repeat until no new frequent itemsets are identified: k = k + 1, then

1. Generate length k candidate itemsets from length k-1 frequent itemsets

2. Prune candidate itemsets containing subsets of length k-1 that are infrequent

3. Count the support of each candidate by scanning the DB and eliminate candidates that are infrequent, leaving only those that are frequent
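To make the three steps concrete, here is a minimal Apriori sketch in Python (my own illustration, not code from the course). It assumes transactions are given as a list of Python sets of comparable items and that minsup is an absolute support count; candidate generation uses the Fk-1 × Fk-1 merge described on the next slides.

from collections import defaultdict
from itertools import combinations

def apriori(transactions, minsup):
    # frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {1: {s: c for s, c in counts.items() if c >= minsup}}
    k = 1
    while frequent[k]:
        k += 1
        # 1. generate length-k candidates from length-(k-1) frequent itemsets
        prev = sorted(tuple(sorted(s)) for s in frequent[k - 1])
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:                      # first k-2 items identical
                candidates.add(frozenset(a) | frozenset(b))
        # 2. prune candidates that contain an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[k - 1]
                             for s in combinations(c, k - 1))}
        # 3. count support in one pass over the DB, keep the frequent ones
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent[k] = {c: n for c, n in counts.items() if n >= minsup}
    del frequent[k]                                   # last level is empty
    return frequent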

Page 6:

Fk-1 × Fk-1 Method
• Merge a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.

• E.g., the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form the candidate 3-itemset {Bread, Diapers, Milk}.


Page 7:

Fk-1 × Fk-1

Completeness

• We don’t merge {Beer, Diapers} with {Diapers, Milk} because their first items differ.

• Do we lose {Beer, Diapers, Milk}?

Pruning

• Before checking a candidate against the DB, a candidate pruning step is needed to ensure that the remaining subsets of k-1 elements are frequent.

Counting

• Finally, the surviving candidates are tested (counted) on the DB.

Page 8:

Rule Generation
• Computing the confidence of an association rule does not require additional scans of the transactions.

• Consider {1, 2} → {3}.
• The rule confidence is s({1, 2, 3}) / s({1, 2}).
• Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent too, and we do know the supports of frequent itemsets.

• Initially, all the high confidence rules that have only one item in the rule consequent are extracted.

• These rules are then used to generate new candidate rules.

• For example, if

– {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules.

– Then the candidate rules are checked for confidence.
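As a sketch, the level-wise consequent merging above can be written as follows (my own illustration; `support` is an assumed dictionary mapping every frequent itemset, stored as a frozenset, to its support count).

from itertools import combinations

def rules_for_itemset(itemset, support, minconf):
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]      # 1-item consequents first
    while consequents:
        kept = []
        for cons in consequents:
            antecedent = itemset - cons
            if not antecedent:
                continue
            conf = support[itemset] / support[antecedent]   # no extra DB scan needed
            if conf >= minconf:
                rules.append((antecedent, cons, conf))
                kept.append(cons)
        # merge consequents of surviving rules, e.g. {b} and {c} -> {b, c}
        consequents = list({a | b for a, b in combinations(kept, 2)
                            if len(a | b) == len(a) + 1})
    return rules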

Page 9:

Other concepts and algorithms
• FP-Tree/FP-Growth

See corresponding slide set and Assignment 2 solution.

• Maximal Frequent Itemsets

• Closed Itemset

• Interest factor

• Mining Sequences

Page 10:

Maximal Frequent Itemsets

[Figure: the itemset lattice with the border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just above the border.]

An itemset is maximal frequent if none of its immediate supersets is frequent

Maximal frequent itemsets form the smallest set of itemsets from which all frequent itemsets can be derived.

Page 11:

Closed Itemsets

• Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets.
– An additional pass over the data set is needed to determine the support counts of the non-maximal frequent itemsets.

• It might be desirable to have a minimal representation of frequent itemsets that preserves the support information.
– Such a representation is the set of the closed frequent itemsets.

• An itemset is closed if none of its immediate supersets has the same support as the itemset.
– An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
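A small sketch of both definitions (my own illustration; `frequent_support` is an assumed dictionary mapping every frequent itemset, as a frozenset, to its support count, and `all_items` is the set of all items).

def classify_itemsets(frequent_support, all_items):
    labels = {}
    for itemset, sup in frequent_support.items():
        supersets = [itemset | {i} for i in all_items if i not in itemset]
        # maximal: no immediate superset is frequent
        is_maximal = all(s not in frequent_support for s in supersets)
        # closed: no immediate superset has the same support
        is_closed = all(frequent_support.get(s) != sup for s in supersets)
        labels[itemset] = {"closed": is_closed, "maximal": is_maximal}
    return labels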

Page 12:

Maximal vs Closed Frequent Itemsets

[Figure: the itemset lattice annotated with the transaction IDs supporting each itemset; closed-and-maximal and closed-but-not-maximal itemsets are highlighted, and itemsets not supported by any transaction are marked.]

Minimum support = 2
# Closed = 9
# Maximal = 4

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

Page 13:

Deriving Frequent Itemsets From Closed Frequent Itemsets

• E.g., consider the frequent itemset {a, d}. Because the itemset is not closed, its support count must be identical to that of one of its immediate supersets.
– The key is to determine which superset among {a, b, d}, {a, c, d}, or {a, d, e} has exactly the same support count as {a, d}.
– By the Apriori principle, the support for {a, d} must be equal to the largest support among its supersets.
• So, the support for {a, d} must be identical to the support for {a, c, d}.

[Figure: the same lattice with transaction IDs; among the immediate supersets of {a, d}, the closed itemset {a, c, d} has the same support.]

Page 14:

Support counting using closed frequent itemsets

Let C denote the set of closed frequent itemsets
Let kmax denote the maximum length of closed frequent itemsets
F_kmax = {f | f ∈ C, |f| = kmax}    {Frequent itemsets of size kmax}
for k = kmax - 1 downto 1 do
    Set F_k to be all sub-itemsets of length k from the frequent itemsets in F_k+1
    for each f ∈ F_k do
        if f ∉ C then
            f.support = max{f'.support | f' ∈ F_k+1, f ⊂ f'}
        end if
    end for
end for
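A Python rendering of the pseudocode above (my own sketch; `closed` is an assumed dictionary from each closed frequent itemset, as a frozenset, to its support). To be safe it also adds the closed itemsets of each size into Fk, so that maximal itemsets and their subsets are not missed.

from itertools import combinations

def supports_from_closed(closed):
    support = dict(closed)
    kmax = max(len(f) for f in closed)
    level = {f for f in closed if len(f) == kmax}          # all frequent kmax-itemsets
    for k in range(kmax - 1, 0, -1):
        # every frequent k-itemset is a k-subset of a frequent (k+1)-itemset
        # or is itself closed (e.g. maximal)
        Fk = {frozenset(s) for f in level for s in combinations(f, k)}
        Fk |= {f for f in closed if len(f) == k}
        for f in Fk:
            if f not in support:
                # a non-closed itemset inherits the largest support
                # among its frequent immediate supersets
                support[f] = max(support[g] for g in level if f < g)
        level = Fk
    return support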

Page 15:

Contingency Table
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

        Y      ¬Y
 X     f11    f10    f1+
¬X     f01    f00    f0+
       f+1    f+0    |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Page 16:

Pitfall of Confidence

        Coffee   ¬Coffee
 Tea       150        50     200
¬Tea       750       150     900
           900       200    1100

Consider the association rule: Tea → Coffee

Confidence = P(Coffee, Tea) / P(Tea) = P(Coffee | Tea) = 150/200 = 0.75 (seems quite high)

But P(Coffee) = 900/1100 ≈ 0.82

Thus knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from about 82% to 75%!

Although confidence is high, the rule is misleading.

In fact, P(Coffee | ¬Tea) = P(Coffee, ¬Tea) / P(¬Tea) = 750/900 ≈ 0.83

Page 17:

Interest Factor
• Measure that takes into account statistical dependence:

Interest = P(X, Y) / (P(X) · P(Y)) = (N · f11) / (f1+ · f+1)

• f11/N is an estimate for the joint probability P(X, Y).

• f1+/N and f+1/N are the estimates for P(X) and P(Y), respectively.

• If X and Y are statistically independent, then P(X, Y) = P(X) × P(Y), and thus the Interest is 1.

Page 18:

Example: Interest

Association Rule: Tea → Coffee

Interest = (150 × 1100) / (200 × 900) ≈ 0.92

(< 1, therefore Tea and Coffee are negatively correlated)

        Coffee   ¬Coffee
 Tea       150        50     200
¬Tea       750       150     900
           900       200    1100
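A quick numerical check of the last two slides (my own arithmetic on the tea/coffee counts): confidence looks high, while the interest factor reveals the slight negative correlation.

f11, f10 = 150, 50       # tea & coffee, tea & no coffee
f01, f00 = 750, 150      # no tea & coffee, no tea & no coffee
N = f11 + f10 + f01 + f00                      # 1100 transactions

confidence = f11 / (f11 + f10)                 # P(Coffee | Tea)  = 0.75
p_coffee   = (f11 + f01) / N                   # P(Coffee)       ~= 0.82
interest   = (N * f11) / ((f11 + f10) * (f11 + f01))   # ~= 0.92, i.e. < 1

print(confidence, p_coffee, interest)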

Page 19:

Cross-support patterns
• They are patterns that relate a high-frequency item such as milk to a low-frequency item such as caviar.

• Likely to be spurious because their correlations tend to be weak.

– E.g., the confidence of {caviar} → {milk} is likely to be high, but the pattern is still spurious, since there is probably no correlation between caviar and milk.

• Observation: On the other hand, the confidence of {milk} → {caviar} is very low.

• Cross-support patterns can be detected and eliminated by examining the lowest-confidence rule that can be extracted from a given itemset.

– Such confidence should be above a certain level for the pattern not to be considered a cross-support one.

Page 20:

Finding lowest confidence
• Recall the anti-monotone property of confidence:

conf({i1, i2} → {i3, i4, …, ik}) ≤ conf({i1, i2, i3} → {i4, …, ik})

• This property suggests that confidence never increases as we shift more items from the left to the right hand side of an association rule.

• Hence, the lowest-confidence rule that can be extracted from a frequent itemset contains only one item on its left hand side.

Page 21:

Finding lowest confidence
• Given a frequent itemset {i1, i2, i3, i4, …, ik}, the rule

{ij} → {i1, i2, …, ij-1, ij+1, …, ik}

has the lowest confidence if

s(ij) = max{ s(i1), s(i2), …, s(ik) }

• This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.

Page 22:

Finding lowest confidence
• Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, i3, i4, …, ik} is

s({i1, i2, …, ik}) / max{ s(i1), s(i2), …, s(ik) }

• This is also known as the h-confidence measure or all-confidence measure.

• Cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed some user-specified threshold hc.

• h-confidence is anti-monotone, i.e.,

h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1})

and thus can be incorporated directly into the mining algorithm.
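A minimal sketch of the h-confidence test used to filter cross-support patterns (my own illustration; `item_support` is an assumed dictionary of single-item supports and `support` is the support of the whole itemset).

def h_confidence(itemset, support, item_support):
    return support / max(item_support[i] for i in itemset)

def is_cross_support(itemset, support, item_support, hc):
    # reject the pattern if its h-confidence falls below the
    # user-specified threshold hc
    return h_confidence(itemset, support, item_support) < hc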

Page 23:

Examples of Sequence
• Web sequence:

{Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping}

• Purchase history of a given customer:
{Java in a Nutshell, Intro to Servlets} {EJB Patterns} …

• Sequence of classes taken by a computer science major:

{Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming} …

Page 24:

Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions)

s = < e1 e2 e3 … >

– Each element contains a collection of events (items)

ei = {i1, i2, …, ik}

– Each element is attributed to a specific time or location

• A k-sequence is a sequence that contains k events (items)

[Figure: a sequence drawn as an ordered chain of elements (transactions), each element containing one or more events (items) such as E1, E2, E3, E4.]

Page 25:

Formal Definition of a Subsequence
• A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin

Data sequence                          Subsequence        Contained?
<{2,4} {3,5,6} {8}>                    <{2} {3,5}>        Yes
<{1,2} {3,4}>                          <{1} {2}>          No
<{2,4} {2,4} {2,5}>                    <{2} {4}>          Yes

• Support of a subsequence w is the fraction of data sequences that contain w

• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
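A small containment check, written as a sketch under the assumption that a sequence is represented as a list of Python sets, e.g. [{2, 4}, {3, 5, 6}, {8}].

def is_subsequence(data, sub):
    i = 0                                        # next element of `sub` to match
    for element in data:
        if i < len(sub) and sub[i] <= element:   # sub[i] is a subset of this element
            i += 1
    return i == len(sub)

print(is_subsequence([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
print(is_subsequence([{1, 2}, {3, 4}], [{1}, {2}]))              # False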

Page 26:

APRIORI-like Algorithm
• Make the first pass over the sequence database to yield all the 1-element frequent sequences

• Repeat until no new frequent sequences are found

Candidate Generation:

• Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items

Candidate Pruning:

• Prune candidate k-sequences that contain infrequent (k-1)-subsequences

Support Counting:

• Make a new pass over the sequence database to find the support for these candidate sequences

• Eliminate candidate k-sequences whose actual support is less than minsup

Page 27:

Candidate Generation
• Base case (k=2):

– Merging two frequent 1-sequences <{i1}> and <{i2}> will produce four candidate 2-sequences:

– <{i1}, {i2}>, <{i2}, {i1}>, <{i1, i2}>, <{i2, i1}>

• General case (k>2):
– A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2

– The resulting candidate after merging is given by the sequence w1 extended with the last event of w2.

– If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1

– Otherwise, the last event in w2 becomes a separate element appended to the end of w1

Page 28:

Candidate Generation Examples
• Merging the sequences

w1=<{1} {2 3} {4}> and w2 =<{2 3} {4 5}>

will produce the candidate sequence < {1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element

• Merging the sequences w1=<{1} {2 3} {4}> and w2 =<{2 3} {4} {5}>

will produce the candidate sequence < {1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element

• Finally, the sequences <{1}{2}{3}> and <{1}{2, 5}> don’t have to be merged (Why?)

• Because removing the first event from the first sequence doesn’t give the same subsequence as removing the last event from the second sequence.

• If <{1}{2,5}{3}> is a viable candidate, it will be generated by merging a different pair of sequences, <{1}{2,5}> and <{2,5}{3}>.
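A sketch of this merge test (my own illustration, assuming a k-sequence is represented as a list of tuples of events, e.g. [(1,), (2, 3), (4,)]), checked against the two examples above.

def drop_first_event(seq):
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last_event(seq):
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def merge(w1, w2):
    # returns the candidate obtained by merging w1 and w2, or None
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    last = w2[-1]
    if len(last) > 1:
        # the last two events of w2 share an element: extend w1's last element
        return list(w1[:-1]) + [w1[-1] + (last[-1],)]
    # otherwise the last event of w2 becomes a new element at the end of w1
    return list(w1) + [last]

print(merge([(1,), (2, 3), (4,)], [(2, 3), (4, 5)]))      # [(1,), (2, 3), (4, 5)]
print(merge([(1,), (2, 3), (4,)], [(2, 3), (4,), (5,)]))  # [(1,), (2, 3), (4,), (5,)]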

Page 29:

Example

Frequent 3-sequences:
< {1} {2} {3} >, < {1} {2 5} >, < {1} {5} {3} >, < {2} {3} {4} >, < {2 5} {3} >, < {3} {4} {5} >, < {5} {3 4} >

Candidate Generation:
< {1} {2} {3} {4} >, < {1} {2 5} {3} >, < {1} {5} {3 4} >, < {2} {3} {4} {5} >, < {2 5} {3 4} >

Candidate Pruning:
< {1} {2 5} {3} >

Page 30:

Timing Constraints

[Figure: a sequence < {A B} {C} {D E} > annotated with the constraints: the time between consecutive elements must be <= max-gap and the total duration <= max-span.]

max-gap = 2, max-span = 4

Data sequence                            Subsequence       Contained?
<{2,4} {3,5,6} {4,7} {4,5} {8}>          < {6} {5} >       Yes
<{1} {2} {3} {4} {5}>                    < {1} {4} >       No
<{1} {2,3} {3,4} {4,5}>                  < {2} {3} {5} >   Yes
<{1,2} {3} {2,3} {3,4} {2,4} {4,5}>      < {1,2} {5} >     No

Page 31:

Mining Sequential Patterns with Timing Constraints

• Approach 1:
– Mine sequential patterns without timing constraints
– Postprocess the discovered patterns

• Approach 2:
– Modify the algorithm to directly prune candidates that violate the timing constraints

– Question:
• Does the APRIORI principle still hold?

Page 32:

APRIORI Principle for Sequence Data

Object   Timestamp   Events
A        1           1, 2, 4
A        2           2, 3
A        3           5
B        1           1, 2
B        2           2, 3, 4
C        1           1, 2
C        2           2, 3, 4
C        3           2, 4, 5
D        1           2
D        2           3, 4
D        3           4, 5
E        1           1, 3
E        2           2, 4, 5

Suppose:

max-gap = 1

max-span = 5

<{2} {5}>

support = 40%

but

<{2} {3} {5}>

support = 60% !! (APRIORI doesn’t hold)

Problem exists because of max-gap constraint

This problem can be avoided by using the concept of a contiguous subsequence.

Page 33:

Contiguous Subsequences
• s is a contiguous subsequence of w = < e1 e2 … ek > if any of the following conditions holds:

1. s is obtained from w by deleting an item from either e1 or ek

2. s is obtained from w by deleting an item from any element ei that contains at least 2 items

3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)

• Examples: s = < {1} {2} >
– is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >

– is not a contiguous subsequence of < {1} {3} {2}> and < {2} {1} {3} {2}>

Page 34:

Modified Candidate Pruning Step
• Modified APRIORI Principle

– If a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent

• Candidate generation doesn’t change. Only pruning changes.

• Without the maxgap constraint:
– A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent

• With the maxgap constraint:
– A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

Page 35:

Cluster Analysis
• Find groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized
Intra-cluster distances are minimized

Page 36:

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
– typically the mean of the points in the cluster
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
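A minimal K-means sketch (my own illustration, using NumPy; X is assumed to be an (n, d) array and the initial centroids are sampled from the data points).

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids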

Page 37:

Importance of Choosing Initial Centroids

[Figure: scatter plots of the same 2-D data set showing the K-means cluster assignments at iterations 1 through 6 for one choice of initial centroids.]

Page 38:

Importance of Choosing Initial Centroids

[Figure: scatter plots of the same 2-D data set showing the K-means cluster assignments at iterations 1 through 5 for a different choice of initial centroids.]

Page 39:

Solutions to Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side
• Bisecting K-means
– Not as susceptible to initialization issues

Page 40:

Bisecting K-means
Straightforward extension of the basic K-means algorithm. Simple idea:

To obtain K clusters, split the set of points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

Algorithm

Initialize the list of clusters to contain the cluster consisting of all points.
repeat
    Remove a cluster from the list of clusters.
    // Perform several "trial" bisections of the chosen cluster.
    for i = 1 to number of trials do
        Bisect the selected cluster using basic K-means (i.e. 2-means).
    end for
    Select the two clusters from the bisection with the lowest total SSE.
    Add these two clusters to the list of clusters.
until the list of clusters contains K clusters.
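A compact sketch of the algorithm above (my own illustration, using scikit-learn's KMeans for the trial 2-means bisections). It assumes every selected cluster has at least two points, and it splits the cluster with the largest SSE, which is one common choice.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    clusters = [np.arange(len(X))]               # start with one cluster: all points
    while len(clusters) < k:
        # pick the cluster with the largest SSE to split
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        best = None
        for t in range(n_trials):                # several trial bisections
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(X[idx])
            if best is None or km.inertia_ < best.inertia_:
                best = km                        # keep the bisection with lowest SSE
        clusters += [idx[best.labels_ == 0], idx[best.labels_ == 1]]
    return clusters                              # list of index arrays, one per cluster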

Page 41:

Bisecting K-means Example

Page 42:

Limitations of K-means
• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.

Page 43:

Exercise

• For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth?

• Only for (b) and (d).
– For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included.
– For (d), K-means would find the nose, eyes, and mouth straightforwardly, as long as the number of clusters was set to 4.

• What limitation does clustering have in detecting all the patterns formed by the points in figure (c)?
– Clustering techniques can only find patterns of points, not of empty spaces.

Page 44:

Agglomerative Clustering Algorithm
Compute the proximity matrix

Let each data point be a cluster

Repeat

Merge the two closest clusters

Update the proximity matrix

Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters

– Different approaches to defining the distance between clusters distinguish the different algorithms
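For reference, SciPy's linkage function implements this agglomerative scheme; a short sketch (my own) where the method argument selects the inter-cluster proximity discussed on the following slides: 'single' (MIN), 'complete' (MAX), or 'average' (group average).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D data
Z = linkage(X, method='single')                  # MIN / single link merges
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree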

Page 45:

Cluster Similarity: MIN
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
– Determined by one pair of points

Page 46:

Hierarchical Clustering: MIN

Nested Clusters Dendrogram


Page 47:

Strength of MIN

Original Points Two Clusters

Can handle non-globular shapes

Page 48:

Limitations of MIN

Sensitive to noise and outliers

Original Points Four clusters Three clusters:

The yellow points got wrongly merged with the red ones, as opposed to the green one.

Page 49:

Cluster Similarity: MAX
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
– Determined by all pairs of points in the two clusters

Page 50:

Hierarchical Clustering: MAX

Nested Clusters Dendrogram


Page 51:

Strengths of MAX

Robust with respect to noise and outliers

Original Points Four clusters Three clusters:

The yellow points get now merged with the green one.

Page 52:

Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| · |Cluster_j| )

Page 53:

Hierarchical Clustering: Group Average

Nested Clusters Dendrogram


Page 54:

DBSCAN
DBSCAN is a density-based algorithm.

It locates regions of high density that are separated from one another by regions of low density.

• Density = number of points within a specified radius (Eps)

• A point is a core point if it has more than a specified number of points (MinPts) within Eps
– These are points that are at the interior of a cluster

• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

• A noise point is any point that is not a core point or a border point.

Page 55:

DBSCAN Algorithm
• Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster.

• Likewise, any border point that is close enough to a core point is put in the same cluster as the core point.

• Ties may need to be resolved if a border point is close to core points from different clusters.

• Noise points are discarded.

Page 56:

When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Page 57:

When DBSCAN Does NOT Work Well

Page 58:

DBSCAN: Determining Eps and MinPts
• Look at the behavior of the distance from a point to its k-th nearest neighbor, called the k-dist.

• For points that belong to some cluster, the value of k-dist will be small (if k is not larger than the cluster size).

• However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large.

• So, if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps.

• If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.
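A sketch of this heuristic (my own illustration, using scikit-learn): the knee of the sorted k-dist curve is read off as Eps and k is used as MinPts; the 95th-percentile guess below is only a crude stand-in for eyeballing the plot.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def k_dist_curve(X, k=4):
    # distance from each point to its k-th nearest neighbor, sorted ascending
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, k])

X = np.random.rand(300, 2)
curve = k_dist_curve(X, k=4)          # plot this and read Eps off the sharp bend
eps = curve[int(0.95 * len(curve))]   # crude knee guess for the sketch
labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)   # label -1 marks noise points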

Page 59:

DBSCAN: Determining Eps and MinPts

• Eps determined in this way depends on k, but does not change dramatically as k changes.

• If k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.

• If k is too large, then small clusters (of size less than k) are likely to be labeled as noise.

• Original DBSCAN used k = 4, which appears to be a reasonable value for most two dimensional data sets.

Page 60:

IR: Web queries
• Keyword queries

• Boolean queries (using AND, OR, NOT)

• Phrase queries

• Proximity queries

• Full document queries

• Natural language questions

From: Bing Liu. Web Data Mining. 2007

Page 61:

Vector space model
• Documents are also treated as a “bag” of words or terms.

• Each document is represented as a vector.

• Term Frequency (TF) Scheme: The weight of a term ti in document dj is the number of times that ti appears in dj, denoted by fij. Normalization may also be applied.

• A shortcoming of the TF scheme is that it doesn’t consider the situation where a term appears in many documents of the collection.
– Such a term may not be discriminative.

From: Bing Liu. Web Data Mining. 2007

Page 62:

TF-IDF term weighting scheme
• The most well-known weighting scheme:

– TF: (normalized) term frequency

– IDF: inverse document frequency.

N: total number of docs

dfi: the number of docs in which ti appears. The more documents a term appears in, the less discriminative it is, and thus the less weight we should give it.

• The final TF-IDF term weight is wij = tfij × idfi, where idfi = log(N / dfi).

From: Bing Liu. Web Data Mining. 2007

Page 63:

Retrieval in the vector space model
• Query q is represented in the same way as a document.

• The weight wiq of each term ti in q can also be computed in the same way as in a normal document.

• Relevance of dj to q: Compare the similarity of query q and document dj.

• For this, use cosine similarity (the cosine of the angle between the two vectors)

From: Bing Liu. Web Data Mining. 2007
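A minimal sketch of TF-IDF weighting and cosine-similarity retrieval (my own illustration, assuming documents and the query are given as lists of terms).

import math
from collections import Counter

def idf(docs):
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequency
    return {t: math.log(N / df[t]) for t in df}

def tfidf_vector(terms, idf_weights):
    counts = Counter(terms)
    # normalized TF times IDF; unseen terms get weight 0
    return {t: (c / len(terms)) * idf_weights.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["web", "data", "mining"], ["web", "search", "ranking"], ["data", "mining", "mining"]]
weights = idf(docs)
vectors = [tfidf_vector(d, weights) for d in docs]
query = tfidf_vector(["data", "mining"], weights)        # the query is weighted like a document
ranking = sorted(range(len(docs)), key=lambda j: cosine(query, vectors[j]), reverse=True)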

Page 64:

Page Rank (PR)
Intuitively, we solve the recursive definition of “importance”:

A page is important if important pages link to it.

• Page rank is the estimated page importance.

• In short PageRank is a “vote”, by all the other pages on the Web, about how important a page is.

– A link to a page counts as a vote of support.

– If there’s no link there’s no support (but it’s an abstention from voting rather than a vote against the page).

From: Jeff Ullman’s lecture

Page 65:

Page Rank Formula

PR(A) = PR(T1)/C(T1) + … + PR(Tn)/C(Tn)

1. PR(Tn): Each page has a notion of its own self-importance, which is, say, 1 initially.

2. C(Tn): Count of outgoing links from page Tn.
   Each page spreads its vote out evenly amongst all of its outgoing links.

3. PR(Tn)/C(Tn):
   a) Each page spreads its vote out evenly amongst all of its outgoing links.
   b) So if our page (say page A) has a backlink from page Tn, the share of the vote that page A will get from page Tn is PR(Tn)/C(Tn).

Page 66:

Web Matrix
• Capture the formula by the web matrix WebM, that is:
• the ij-th entry is
– 1/n if page i is one of the n successors of page j, and
– 0 otherwise.

• Then, the importance vector containing the rank of each page is calculated by:

Ranknew = WebM • Rankold

• Start with Rank=(1,1,…)

• Observe that the above matrix vector product conforms to the PageRank formula in the previous slide.

Page 67:

Example
• In 1839, the Web consisted of only three pages: Netscape, Microsoft, and Amazon.

With the pages ordered (a, m, n) = (Amazon, Microsoft, Netscape), the web matrix and the update are:

[ a_new ]   [  0    1   1/2 ] [ a_old ]
[ m_new ] = [ 1/2   0    0  ] [ m_old ]
[ n_new ]   [ 1/2   0   1/2 ] [ n_old ]

(Netscape links to itself and Amazon, Microsoft links to Amazon, and Amazon links to Netscape and Microsoft.)

The first four iterations give the following estimates:

n = 1    1      5/4    9/8    5/4
m = 1    1/2    3/4    1/2    11/16
a = 1    3/2    1      11/8   17/16

In the limit, the solution is n = a = 6/5; m = 3/5.

From: Jeff Ullman’s lecture
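A quick check of this example (my own code): power iteration with the web matrix reconstructed above, in the order (a, m, n), converges to the stated limit.

import numpy as np

WebM = np.array([[0.0, 1.0, 0.5],    # Amazon receives all of Microsoft's rank and half of Netscape's
                 [0.5, 0.0, 0.0],    # Microsoft receives half of Amazon's rank
                 [0.5, 0.0, 0.5]])   # Netscape receives half of Amazon's rank and half of its own

rank = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    rank = WebM @ rank
print(rank)                          # about [1.2, 0.6, 1.2], i.e. a = n = 6/5 and m = 3/5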

Page 68:

Problems With Real Web Graphs
Dead ends: a page that has no successors has nowhere to send its importance.

Eventually, all importance will “leak out of” the Web.

Example: Suppose Microsoft tries to claim that it is a monopoly by removing all links from its site.

The new Web, and the rank vectors for the first 4 iterations are shown.

n = 1    1      3/4    5/8    1/2
m = 1    1/2    1/4    1/4    3/16
a = 1    1/2    1/2    3/8    5/16

Eventually, each of n, m, and a become 0; i.e., all the importance leaked out.

The new web matrix (Microsoft's column is now all zeros, since it has no successors):

[ a_new ]   [  0    0   1/2 ] [ a_old ]
[ m_new ] = [ 1/2   0    0  ] [ m_old ]
[ n_new ]   [ 1/2   0   1/2 ] [ n_old ]

From: Jeff Ullman’s lecture

Page 69:

Problems With Real Web Graphs
Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.

Example: Angered by the decision, Microsoft decides it will link only to itself from now on. Now, Microsoft has become a spider trap.

The new Web, and the rank vectors for the first 4 iterations, are shown.

n = 1    1      3/4    5/8    1/2
m = 1    3/2    7/4    2      35/16
a = 1    1/2    1/2    3/8    5/16

Now, m converges to 3, and n = a = 0.

The new web matrix (Microsoft's column now sends all of its importance back to itself):

[ a_new ]   [  0    0   1/2 ] [ a_old ]
[ m_new ] = [ 1/2   1    0  ] [ m_old ]
[ n_new ]   [ 1/2   0   1/2 ] [ n_old ]

From: Jeff Ullman’s lecture

Page 70:

Google Solution to Dead Ends and Spider Traps

Stop the other pages having too much influence. The total vote is “damped down” by multiplying it by a factor.

Example: If we use a 20% damp-down, the equation of the previous example becomes:

[ a_new ]          [  0    0   1/2 ] [ a_old ]          [ 1 ]
[ m_new ] = 0.80 · [ 1/2   1    0  ] [ m_old ] + 0.20 · [ 1 ]
[ n_new ]          [ 1/2   0   1/2 ] [ n_old ]          [ 1 ]

The solution to this equation is n = 7/11; m = 21/11; a = 5/11.

From: Jeff Ullman’s lecture

Page 71:

Hubs and Authorities
• Intuitively, we define “hub” and “authority” in a mutually recursive way:
– a hub links to many authorities, and
– an authority is linked to by many hubs.

• Authorities turn out to be pages that offer information about a topic, e.g., http://www.bctransit.com

• Hubs are pages that don't provide the information, but tell you where to find the information, e.g., http://yahoo.com

Page 72:

Matrix formulation
• Use a matrix formulation similar to that of PageRank, but without the stochastic restriction.
– We count each link as 1, regardless of how many successors or predecessors a page has.

• Namely, define a matrix A whose rows and columns correspond to Web pages, with entry Aij = 1 if page i links to page j, and 0 if not.
– Notice that AT, the transpose of A, looks like the matrix used for computing PageRank, but AT has 1's where the PageRank matrix has fractions.

Page 73:

Authority and Hubbiness Vectors
• Let a and h be vectors whose i-th components correspond to the degrees of authority and hubbiness of the i-th page.

• Let λ and μ be suitable scaling factors.

Then we can state:

1. h = λ A a
That is, the hubbiness of each page is the sum of the authorities of all the pages it links to, scaled by λ.

2. a = μ AT h
That is, the authority of each page is the sum of the hubbiness of all the pages that link to it, scaled by μ.

Page 74:

Simple substitutions
• We can derive from (1) and (2), using simple substitution, two equations that relate the vectors a and h only to themselves.

1. a = λμ ATA a

2. h = λμ A AT h

• As a result, we can compute h and a by iterations.
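A small sketch of this iteration (my own illustration), normalizing a and h at each step so that the scaling factors λ and μ need not be chosen explicitly.

import numpy as np

def hits(A, n_iter=50):
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(n_iter):
        a = A.T @ h                  # authority: sum of hubbiness of pages linking to the page
        h = A @ a                    # hubbiness: sum of authority of pages the page links to
        a /= np.linalg.norm(a)       # normalization stands in for the scaling factors
        h /= np.linalg.norm(h)
    return h, a

# toy example: page 0 links to pages 1 and 2; page 1 links to page 2
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
hubs, authorities = hits(A)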

Page 75:

Example

If we use λ = μ = 1 and assume that the vectors are initialized as h = [hn, hm, ha] = [1, 1, 1] and a = [an, am, aa] = [1, 1, 1],

the first three iterations of the equations for a and h are:

From: Jeff Ullman’s lecture