TRANSCRIPT
SS-ZG548: ADVANCED DATA MINING
Lecture-04: Incremental Mining and Clustering
Dr. Kamlesh Tiwari, Assistant Professor,
Department of Computer Science and Information Systems, BITS Pilani, Rajasthan-333031 INDIA
Feb 02, 2019 (WILP @ BITS-Pilani Jan-Apr 2019)
Recap
Association Rule Mining involves the discovery of frequent item-sets based on support and confidence parameters
Approaches include Apriori, Hash Based (DHP), and Partition Based algorithms
Incremental association rule mining is needed as real databases are generally dynamic: D′ = D − M− + M+
Fast UPdate (FUP) can handle insertions
1 An original frequent itemset X ∈ L becomes infrequent in D′ iff support(X) in D′ < Smin
2 An itemset X ∉ L can become frequent in D′ only if support(X) in M+ ≥ Smin
3 If a k-itemset X has a (k − 1)-subset that becomes infrequent, i.e., the subset is in Lk−1 but not in L′k−1, then X must be infrequent in D′.
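As a sketch, the insertion rules above can be written out for 1-itemsets in Python. The function and variable names and the toy transactions are illustrative, not from the slides; the point is that rule 2 lets us skip scanning the old database for any itemset that is not already frequent in M+.

```python
# A toy sketch of the FUP insertion logic for 1-itemsets (helper names and
# data are illustrative). Rule 1 re-checks old frequent itemsets on D';
# rule 2 lets us skip any new itemset that is not frequent in M+ alone.
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= set(t))

def fup_update_frequent(old_frequent, D, M_plus, s_min):
    D_prime = D + M_plus
    n = len(D_prime)
    new_frequent = {}
    # Rule 1: an old frequent itemset survives iff it is frequent in D'
    for X in old_frequent:
        c = support_count(X, D_prime)
        if c / n >= s_min:
            new_frequent[X] = c / n
    # Rule 2: an itemset outside L must be frequent in M+ to be a winner;
    # only such survivors need their support checked over D'
    items = sorted({i for t in M_plus for i in t})
    for X in (frozenset(c) for c in combinations(items, 1)):
        if X in old_frequent or X in new_frequent:
            continue
        if support_count(X, M_plus) / len(M_plus) >= s_min:
            c = support_count(X, D_prime)
            if c / n >= s_min:
                new_frequent[X] = c / n
    return new_frequent
```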
ADVANCED DATA MINING (SS-ZG548) WILP @ BITS-Pilani Lecture-04 (Feb 02, 2019) 2 / 28
Recap: FUP at work
Consider the database D and the related frequent set discovered with Apriori

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9
Consider the arrival of M+ more transactions
The first iteration is as below.
Recap: FUP2 at work
FUP2 can handle both insertions and deletions
Ck is divided into two parts: Pk = Ck ∩ Lk and Qk = Ck − Pk
Being frequent in D, the support of every itemset in Pk is already known. It can be updated using M− and M+ only:
Count({A})D′ = Count({A})D − Count({A})M− + Count({A})M+ = 6 − 3 + 1 = 4
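The count update above only needs the deleted and inserted transactions. A minimal sketch, with hypothetical M− and M+ transactions chosen to reproduce the slide's numbers:

```python
# The FUP2 count update from the slide: an itemset in Pk was frequent in the
# old database, so its old count is known, and only the deleted (M-) and
# inserted (M+) transactions need to be scanned. Transactions are illustrative.
def contains(itemset, txn):
    return set(itemset) <= set(txn)

def updated_count(old_count, itemset, M_minus, M_plus):
    return (old_count
            - sum(contains(itemset, t) for t in M_minus)
            + sum(contains(itemset, t) for t in M_plus))

M_minus = [('A', 'B'), ('A',), ('A', 'C')]   # 3 deleted transactions contain A
M_plus = [('A', 'B', 'C')]                   # 1 inserted transaction contains A
count_A = updated_count(6, {'A'}, M_minus, M_plus)   # 6 - 3 + 1 = 4
```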
Frequent itemsets of D:

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9
Variations of FUP
Update With Early Pruning (UWEP): the occurrence of a potentially huge set of candidate itemsets and multiple scans of the database is the issue
- If a k-itemset is frequent in M+ but infrequent in D′, it is not considered when generating Ck+1
- This can significantly reduce the number of candidate itemsets, with the trade-off that an additional set of unchecked itemsets has to be maintained.
Utilizing Negative Borders: the negative border set consists of all itemsets that are closest to being frequent
- The negative border consists of all itemsets that were candidates of the level-wise method but did not have enough support: Bd−(Lk) = Ck − Lk
- Find the negative border set for L = {{A}, {B}, {C}, {E}, {AB}, {AC}, {BC}, {ABC}}
- A full scan of the dataset is only required when itemsets outside the negative border set get added to the frequent itemsets or to the negative border set.
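The exercise above can be checked by brute force: an itemset is in the negative border if it is not in L but all of its (k − 1)-subsets are. A sketch (taking the item universe from L itself, which is an assumption, since the slide does not list the items):

```python
# A brute-force computation of the negative border: all itemsets outside L
# whose every proper (k-1)-subset is in L, i.e. the failed candidates of the
# level-wise method.
from itertools import combinations

def negative_border(L, items):
    Lset = {frozenset(x) for x in L}
    border = set()
    max_k = max(len(x) for x in Lset) + 1
    for k in range(1, max_k + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand in Lset:
                continue
            # all proper (k-1)-subsets must be frequent ('if s' skips the
            # empty subset so that 1-itemsets are handled correctly)
            if all(frozenset(s) in Lset
                   for s in combinations(sorted(cand), k - 1) if s):
                border.add(cand)
    return border

# the example L from the slide
L = [{'A'}, {'B'}, {'C'}, {'E'},
     {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
```

With items {A, B, C, E}, this yields {A E}, {B E} and {C E}: each has all singleton subsets frequent but was itself pruned.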
Law of large numbers

Prob(|(x1 + x2 + x3 + ... + xn)/n − E(x)| ≥ ε) ≤ Var(x)/(nε²)
Markov’s inequality
Let x be a nonnegative random variable. Then for a > 0:
Prob(x ≥ a) ≤ E(x)/a
Chebyshev’s Inequality
Let x be a random variable. Then for c > 0:
Prob(|x − E(x)| ≥ c) ≤ Var(x)/c²
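Both inequalities can be sanity-checked numerically on a small discrete distribution. A toy example (a fair six-sided die, not from the slides), using exact fractions:

```python
# A numeric check of Markov's and Chebyshev's inequalities on a fair
# six-sided die (values 1..6, equal probability): E = 7/2, Var = 35/12.
from fractions import Fraction

values = [Fraction(v) for v in range(1, 7)]
p = Fraction(1, 6)
E = sum(v * p for v in values)                   # 7/2
Var = sum((v - E) ** 2 * p for v in values)      # 35/12

# Markov (x nonnegative): Prob(x >= a) <= E(x)/a
a = 5
markov_lhs = sum(p for v in values if v >= a)    # Prob(x >= 5) = 1/3
assert markov_lhs <= E / a                       # 1/3 <= 7/10

# Chebyshev: Prob(|x - E(x)| >= c) <= Var(x)/c^2
c = 2
cheb_lhs = sum(p for v in values if abs(v - E) >= c)   # 1/3
assert cheb_lhs <= Var / c ** 2                  # 1/3 <= 35/48
```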
Variations of FUP
Difference Estimation for Large Itemsets (DELI): uses a sampling technique
- Estimate the difference between the old and the new frequent itemsets
- Only if the difference is large enough is the update operation using FUP2 performed
- Let S be m transactions drawn from D− with replacement; the estimated support count of an itemset X in D− is

σ̂X = (TX / m) · |D−|

where TX is the occurrence count of X in S. For large m we have a 100(1 − α)% confidence interval [aX, bX] with

aX = σ̂X − zα/2 √(σ̂X (|D−| − σ̂X) / m)
bX = σ̂X + zα/2 √(σ̂X (|D−| − σ̂X) / m)

where zα/2 is the value such that the area beyond it in the standard normal curve is exactly α/2
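The interval above is straightforward to compute. A sketch with hypothetical numbers (TX, m, |D−| and the z value are made up for illustration; 1.96 corresponds to a 95% interval):

```python
# The DELI confidence interval for one itemset: T_x occurrences of X in a
# sample of m transactions drawn with replacement from a database of
# db_size = |D-| transactions; z is z_{alpha/2}.
import math

def deli_interval(T_x, m, db_size, z):
    sigma_hat = (T_x / m) * db_size              # estimated support count
    half = z * math.sqrt(sigma_hat * (db_size - sigma_hat) / m)
    return sigma_hat - half, sigma_hat + half

# X seen 40 times in a sample of m = 100 from |D-| = 1000 transactions
lo_x, hi_x = deli_interval(40, 100, 1000, 1.96)  # interval around 400
```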
Sliding Window Filtering
Partition-Based Algorithm for Incremental Mining: if X is a frequent itemset in a database divided into partitions p1, p2, ..., pn, then X must be a frequent itemset in at least one of the partitions
- Uses a threshold to generate candidate itemsets
- A frequent itemset remains frequent from some Pk through Pn
- A list of 2-itemsets CF is maintained to track possible frequent 2-itemsets. Locally frequent 2-itemsets of each partition are added (with their starting partition and supports)
- A scan reduction technique can make one database scan enough
SWF at work
With Smin = 40%, generate frequent 2-itemsets
No new 2-itemset is added when processing P2 since there are no extra frequent 2-itemsets. Moreover, the counts for itemsets {A,B}, {A,C} and {B,C} are all increased; their counts are no less than 6 × 0.4
SWF at work
A scan reduction technique is used to generate Ck (k = 2, 3, ..., n) using C2
- C2 is used to generate the candidate 3-itemsets, and each successive C′k−1 is utilized to generate C′k
- C′3 generated from C2 ∗ C2 instead of L2 ∗ L2 will have a size greater than but close to |C3|
- A second scan would suffice for pruning
The merit of SWF lies in its incremental procedure. There are three sub-steps:
1 Generating C2 in D− = db1,3 − M−
2 Generating C2 in db2,4 = D− + M+
3 Scanning db2,4 once
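The CF bookkeeping behind SWF can be sketched as follows. This is a simplified illustration (hypothetical names and toy partitions), not the full algorithm: CF stores each candidate 2-itemset with the partition where it entered and its accumulated count, and prunes it once the count falls below the threshold over the partitions scanned since then.

```python
# A simplified sketch of the SWF filtering step over partitions: CF keeps
# candidate frequent 2-itemsets with their starting partition and accumulated
# count; an itemset survives only while its count meets the running threshold.
from itertools import combinations
from math import ceil

def swf_cf(partitions, s_min):
    cf = {}                      # 2-itemset -> [start partition index, count]
    sizes = []                   # number of transactions per partition
    for pi, part in enumerate(partitions):
        counts = {}
        for txn in part:
            for pair in combinations(sorted(set(txn)), 2):
                counts[pair] = counts.get(pair, 0) + 1
        sizes.append(len(part))
        # update carried itemsets; prune those below the running threshold
        for iset in list(cf):
            start, c = cf[iset]
            c += counts.pop(iset, 0)
            window = sum(sizes[start:])         # transactions since start
            if c >= ceil(s_min * window):
                cf[iset] = [start, c]
            else:
                del cf[iset]
        # add 2-itemsets that are locally frequent in this partition
        for iset, c in counts.items():
            if c >= ceil(s_min * len(part)):
                cf[iset] = [pi, c]
    return cf
```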
Clustering
Grouping data based on their homogeneity (similarity or closeness).
Objects within a group are similar (or related) and are different from the objects in other groups. When is it better?
Clustering
Unsupervised in nature (i.e. the right answers are not known)
Clustering is useful for 1) Summarization, 2) Compression, and 3) Efficiently Finding Nearest Neighbors
Types:
- Hierarchical (nested) versus Partitional
- Exclusive versus Overlapping versus Fuzzy
- Complete versus Partial
K-means: This is a prototype-based1, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
1 object is closer (more similar) to a prototype
K-means Algorithm
Number of clusters, i.e. the value of K, is provided by the user

Algorithm 0.1: K-means
1 Randomly select K points as centroids
2 repeat
3   foreach data point di do
4     Assign di to the closest centroid (thereby forming K clusters)
5   Recompute the centroid (mean) of each cluster
6 until the centroids converge
Closeness is measured by Euclidean distance, cosine similarity, correlation, Bregman divergence, etc.
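Algorithm 0.1 can be sketched in a few lines of pure Python (Euclidean distance, random initialization with a fixed seed; all names and the toy data are illustrative):

```python
# A compact K-means following the algorithm above: random initial centroids,
# repeated assignment to the closest centroid, and mean recomputation until
# the centroids stop changing.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # 1: K random points as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):                      # 2: repeat
        clusters = [[] for _ in range(k)]
        for p in points:                        # 3-4: assign to closest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                    # 6: centroids converged
            break
        centroids = new                         # 5: recompute means
    return centroids, clusters
```

On two well-separated groups of three points each, the algorithm recovers the two groups regardless of which points are drawn as initial centroids.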
K-means in Action
Evaluation of K-means
For a given data set {x1, x2, ..., xn}, let K-means partition it into {S1, S2, ..., SK}; then the objective is

argmin over S of Σi=1..K Σx∈Si dist²(x, μi)

where μi corresponds to the i-th centroid, μi = (1/mi) Σx∈Si x
The typical choice for the dist function is Euclidean distance
How to proceed?
Choose a K (How?2)
- Run the K-means algorithm multiple times
- Choose the clustering that minimizes the sum of squared errors (SSE)
If K == n, there is no error. A good clustering has a smaller K.

2 Hamerly, Greg and Elkan, Charles, “Learning the k in k-means”, pp. 281–288, NIPS-2003
Evaluation of K-means
Choosing K: 1) Domain knowledge, 2) Preprocessing with another algorithm, 3) Iteration on K
Initialization of Centers: 1) Random point in space, 2) Random point of the data, 3) Look for dense regions, 4) Space uniformly in feature space
Cluster Quality: 1) Diameter of cluster versus inter-cluster distance, 2) Distance between members of a cluster and the cluster center, 3) Diameter of the smallest sphere, 4) Ability to discover hidden patterns
Limitations of K-means
Has problems when data has
- Different size clusters
- Different densities
- Non-globular shapes
Handling empty clusters
When there are outliers
Updating centroids incrementally
Other Approaches
K-Medoids: chooses data points as centers and minimizes a sum of pairwise dissimilarities. Resistant to noise and/or outliers
Agglomerative Hierarchical Clustering: repeatedly merges the two closest clusters until a single cluster remains (Single Link)
DBSCAN: a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.
Clustering in dynamic databases
Can we use the k-Means clustering algorithm?
- Randomly choose k data points as centroids
- Assign each data point to the closest centroid
- Update the centroids until convergence
- Typically SSE (sum of squared error) is optimized
What about PAM (partitioning around medoids)? NO
Single/Average/Farthest link clustering?
Single Link
1 Start with each point as a cluster
2 Merge the two nearest clusters n − k times
Consider the following data points in 2D space
Finally we get
Can we make it adaptable for dynamic databases?
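The two-step procedure above can be sketched naively in Python (2-D toy points; an O(n³)-ish illustration, not an efficient implementation):

```python
# A naive single-link sketch: start with singletons and perform n - k merges,
# always merging the pair of clusters with the smallest closest-pair distance.
def single_link(points, k):
    clusters = [[p] for p in points]            # 1: each point is a cluster

    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def link(c1, c2):                           # single link = closest pair
        return min(d2(a, b) for a in c1 for b in c2)

    while len(clusters) > k:                    # 2: n - k merges
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```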
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a spatial clustering algorithm from KDD-96
Parameters (Eps, MinPts) and point types (core/border/noise)
Uses DFS
Advantage: clusters of arbitrary shape
Disadvantage: sensitive to parameters
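A compact sketch of the algorithm for 2-D points, showing the Eps/MinPts parameters, the core/border/noise distinction, and DFS-style cluster expansion (names and data are illustrative):

```python
# A compact DBSCAN sketch: a core point (>= min_pts neighbors within eps)
# starts a cluster, which is grown depth-first through density-reachable
# points; non-core points that are reached become border points.
def dbscan(points, eps, min_pts):
    labels = {p: None for p in points}          # None = unvisited, -1 = noise

    def neighbors(p):                           # Eps-neighborhood (incl. p)
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:                 # not a core point: noise
            labels[p] = -1
            continue
        cid += 1                                # p is core: start a cluster
        labels[p] = cid
        stack = list(nbrs)                      # DFS over density-reachable pts
        while stack:
            q = stack.pop()
            if labels[q] == -1:                 # noise becomes a border point
                labels[q] = cid
                continue
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:          # q is core: keep expanding
                stack.extend(q_nbrs)
    return labels
```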
Incremental DBSCAN (Addition)
Noise (the inserted point is not density-reachable and remains noise)
Create (the inserted point turns nearby noise points into a new cluster)
Absorb (the inserted point is absorbed into an existing cluster)
Merge (the inserted point connects several clusters into one)
Incremental DBSCAN (Deletion)
No Change (the deletion does not affect any cluster membership)
Remove (the cluster becomes too sparse and its points turn into noise)
Split (the cluster loses density connectivity and splits into parts)
Shrink (the cluster only loses the deleted point and possibly some border points)
Incremental DBSCAN
Insertion and deletion are treated separately
Based on the change in density in the affected region, clusters are updated
The update cost is proportional to the number of points in the affected region, which is high
You may be doing redundant operations
Defer updates for some time:
- Assume periodic arrival of updates
- Cluster the new data
- Merge it with the previous clusters (it is easy to see the density change)
Incremental DBSCAN
Initial Database | New Data

Region-based merging is applied
Incremental DBSCAN
The overlapping of clusters made from the original and the new data points looks as follows
Point p is in the intersection set I′ if ∃ p′ ∈ D such that p and p′ are neighbors
It is necessary and sufficient to process all p ∈ I′
Efficiently compute I′. How?
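One common answer is a spatial index. A minimal grid-based sketch for 2-D points (an assumption, the slides do not specify the method): old points are hashed into square cells of side Eps, so each new point only has to inspect its own cell and the eight surrounding cells.

```python
# Grid-based computation of the intersection set I': every old point within
# Eps of some new point. Bucketing into Eps-sized cells means each new point
# checks only a 3x3 block of cells instead of all old points.
from collections import defaultdict

def intersection_set(old_points, new_points, eps):
    grid = defaultdict(list)
    for p in old_points:
        grid[(int(p[0] // eps), int(p[1] // eps))].append(p)
    I = set()
    for q in new_points:
        cx, cy = int(q[0] // eps), int(q[1] // eps)
        for dx in (-1, 0, 1):                   # 3x3 neighborhood of cells
            for dy in (-1, 0, 1):
                for p in grid.get((cx + dx, cy + dy), []):
                    if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2:
                        I.add(p)
    return I
```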