
SS-ZG548: ADVANCED DATA MINING

Lecture-04: Incremental Mining and Clustering

Dr. Kamlesh Tiwari, Assistant Professor,

Department of Computer Science and Information Systems, BITS Pilani, Rajasthan-333031 INDIA

Feb 02, 2019 (WILP @ BITS-Pilani Jan-Apr 2019)

Recap

Association Rule Mining involves the discovery of frequent item-sets based on support and confidence parameters.

Approaches include Apriori, Hash-Based (DHP), and Partition-Based algorithms.

Incremental association rule mining is needed because real databases are generally dynamic: $D' = D - M^- + M^+$

Fast UPdate (FUP) can handle insertions:

1. An originally frequent itemset $X \in L$ becomes infrequent in $D'$ iff $\mathrm{support}(X)_{D'} < S_{\min}$
2. An itemset $X \notin L$ can become frequent in $D'$ only if $\mathrm{support}(X)_{M^+} \ge S_{\min}$
3. If a $k$-itemset $X$ has a $(k-1)$-subset that becomes infrequent, i.e., the subset is in $L_{k-1}$ but not in $L'_{k-1}$, then $X$ must be infrequent in $D'$
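
A minimal sketch of rules 1 and 2 for 1-itemsets, under stated assumptions (rule 3's subset pruning applies at higher levels); the names below are illustrative, not from the FUP paper:

```python
# Sketch of the FUP insertion rules for 1-itemsets.
# Assumptions: `count_D` holds counts in the old database D (known only
# for old frequent itemsets), `count_M` holds counts in the increment M+,
# and `rescan_D` stands for the one extra scan of D that FUP batches for
# the new candidates.

def fup_update(count_D, size_D, count_M, size_M, smin, rescan_D):
    """Frequent itemsets of D' = D + M+ with minimum support `smin`."""
    size_new = size_D + size_M
    new_L = {}
    # Rule 1: old frequent itemsets -- exact new support from known counts.
    for x, c in count_D.items():
        c_new = c + count_M.get(x, 0)
        if c_new / size_new >= smin:
            new_L[x] = c_new
    # Rule 2: a previously infrequent itemset can become frequent in D'
    # only if it is frequent within M+ itself, so only those are checked.
    for x, c in count_M.items():
        if x not in count_D and c / size_M >= smin:
            c_new = rescan_D(x) + c        # count in D via the extra scan
            if c_new / size_new >= smin:
                new_L[x] = c_new
    return new_L
```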


Recap: FUP at work

Consider the database D and the related frequent itemsets discovered with Apriori:

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9

Consider the arrival of M+ more transactions

The first iteration is as shown on the slide.


Recap: FUP2 at work

FUP2 can handle both insertions and deletions.

$C_k$ is divided into two parts: $P_k = L_k$ and $Q_k = C_k - P_k$.

Being frequent, the support of every itemset in $P_k$ is known; it can be updated using $M^-$ and $M^+$ only:

$\mathrm{Count}(\{A\})_{D'} = \mathrm{Count}(\{A\})_{D} - \mathrm{Count}(\{A\})_{M^-} + \mathrm{Count}(\{A\})_{M^+} = 6 - 3 + 1 = 4$

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9

Frequent itemsets of D
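
A minimal sketch of this $P_k$ update, assuming (for illustration only) that absolute counts are known for the old frequent itemsets and that $M^-$ removes 3 transactions while $M^+$ adds 1:

```python
# Sketch: FUP2 updates counts of previously frequent itemsets (P_k)
# touching only the deleted transactions M- and the inserted ones M+,
# exactly as in the Count({A}) = 6 - 3 + 1 example above.

def update_Pk(count_D, count_minus, count_plus, size_new, smin):
    still_frequent = {}
    for x, c in count_D.items():
        c_new = c - count_minus.get(x, 0) + count_plus.get(x, 0)
        if c_new / size_new >= smin:
            still_frequent[x] = c_new
    return still_frequent

# With |D| = 9 and the assumed |M-| = 3, |M+| = 1, D' has 7 transactions:
print(update_Pk({frozenset('A'): 6}, {frozenset('A'): 3},
                {frozenset('A'): 1}, size_new=7, smin=0.4))
# {frozenset({'A'}): 4}  (4/7 >= 0.4, so {A} stays frequent)
```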


Variations of FUP

Update With Early Pruning (UWEP): the issue is the occurrence of a potentially huge set of candidate itemsets and multiple scans of the database.

- If a k-itemset is frequent in $M^+$ but infrequent in $D'$, it is not considered when generating $C_{k+1}$

- This can significantly reduce the number of candidate itemsets, with the trade-off that an additional set of unchecked itemsets has to be maintained.

Utilizing Negative Borders: the negative border set consists of all itemsets that are closest to being frequent.

- The negative border consists of all itemsets that were candidates of the level-wise method but did not have enough support: $Bd^-(L) = C_k - L_k$

- Exercise: find the negative border set for $L = \{\{A\}, \{B\}, \{C\}, \{E\}, \{AB\}, \{AC\}, \{BC\}, \{ABC\}\}$ (a sketch for this follows the list)

- A full scan of the dataset is required only when itemsets outside the negative border set get added to the frequent itemsets or to the negative border set.
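
A small sketch for the exercise above, assuming the standard definition of the negative border (all itemsets outside L whose proper subsets are all in L):

```python
from itertools import combinations

# Sketch: the negative border Bd-(L) collects the candidates of the
# level-wise (Apriori) method that are not in L, i.e. every itemset
# outside L whose proper (k-1)-subsets are all in L.

def negative_border(L, items):
    L = {frozenset(x) for x in L}
    border = set()
    max_k = max(len(x) for x in L) + 1
    for k in range(1, max_k + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand in L:
                continue
            if k == 1 or all(frozenset(s) in L for s in combinations(cand, k - 1)):
                border.add(cand)
    return border

L = [{'A'}, {'B'}, {'C'}, {'E'},
     {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
print(negative_border(L, {'A', 'B', 'C', 'E'}))
# -> {frozenset({'A','E'}), frozenset({'B','E'}), frozenset({'C','E'})}
```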


Law of Large Numbers

$\mathrm{Prob}\!\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| \ge \varepsilon\right) \le \frac{\mathrm{Var}(x)}{n\,\varepsilon^2}$

Markov's Inequality

Let x be a nonnegative random variable. Then for a > 0,

$\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}$

Chebyshev's Inequality

Let x be a random variable. Then for c > 0,

$\mathrm{Prob}(|x - E(x)| \ge c) \le \frac{\mathrm{Var}(x)}{c^2}$
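
The law-of-large-numbers bound above is just Chebyshev's inequality applied to the sample mean of n i.i.d. draws; a one-line check:

```latex
\mathrm{Var}\!\left(\tfrac{x_1+\cdots+x_n}{n}\right)
  = \frac{n\,\mathrm{Var}(x)}{n^2}
  = \frac{\mathrm{Var}(x)}{n}
\;\Longrightarrow\;
\mathrm{Prob}\!\left(\left|\tfrac{x_1+\cdots+x_n}{n}-E(x)\right|\ge\varepsilon\right)
  \le \frac{\mathrm{Var}(x)}{n\,\varepsilon^2}.
```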


Variations of FUP

Difference Estimation for Large Itemsets (DELI): uses a sampling technique.

- Estimate the difference between the old and the new frequent itemsets
- Only if the difference is large enough is an update operation using FUP2 performed
- Let S be m transactions drawn from $D^-$ with replacement; the support of itemset X in $D^-$ is then estimated as

$\hat{\sigma}_X = \frac{T_X}{m}\,|D^-|$

where $T_X$ is the occurrence count of X in S. For large m we have a $100(1-\alpha)\%$ confidence interval $[a_X, b_X]$ with

$a_X = \hat{\sigma}_X - z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D^-| - \hat{\sigma}_X)}{m}}, \qquad b_X = \hat{\sigma}_X + z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D^-| - \hat{\sigma}_X)}{m}}$

where $z_{\alpha/2}$ is the value such that the area beyond it under the standard normal curve is exactly $\alpha/2$.
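
A minimal sketch of this interval, assuming the count-scale formulas above; NormalDist from the standard library supplies $z_{\alpha/2}$ (names are illustrative):

```python
import math
from statistics import NormalDist

# Sketch of DELI's interval estimate for the support count of itemset X
# in D-, from a sample S of m transactions drawn with replacement.
# T_x: occurrences of X in S; size_D: |D-|.

def deli_interval(T_x, m, size_D, alpha=0.05):
    sigma_hat = (T_x / m) * size_D                 # estimated support count
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}
    half = z * math.sqrt(sigma_hat * (size_D - sigma_hat) / m)
    return sigma_hat - half, sigma_hat + half

# e.g. X seen 40 times in a 1000-transaction sample from |D-| = 100000:
print(deli_interval(40, 1000, 100_000))
```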


Sliding Window Filtering

Partition-Based Algorithm for Incremental Mining: if X is a frequent itemset in a database divided into partitions $p_1, p_2, \ldots, p_n$, then X must be a frequent itemset in at least one of the partitions.

Uses a threshold to generate candidate itemsets; a frequent itemset remains frequent from some partition $P_k$ to $P_n$.

A list of 2-itemsets, CF, is maintained to track possibly frequent 2-itemsets. The locally frequent 2-itemsets of each partition are added (with their starting partition and supports). A scan-reduction technique can make one database scan enough. A sketch of this filter follows.
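
A minimal sketch of this cumulative filter, assuming each partition is simply a list of transactions (sets of items):

```python
from itertools import combinations

# Sketch of the SWF cumulative filter CF over 2-itemsets. CF keeps,
# per 2-itemset, its starting partition and the support counted since
# that partition.

def swf_filter(partitions, smin):
    CF = {}                               # pair -> [start_partition, count]
    for i, part in enumerate(partitions):
        local = {}
        for t in part:                    # count pairs inside this partition
            for pair in combinations(sorted(t), 2):
                local[pair] = local.get(pair, 0) + 1
        for pair, c in local.items():
            if pair in CF:
                CF[pair][1] += c          # already tracked: accumulate
            elif c >= smin * len(part):   # locally frequent: start tracking
                CF[pair] = [i, c]
        # prune pairs whose cumulative support fell below the threshold
        for pair, (start, count) in list(CF.items()):
            covered = sum(len(p) for p in partitions[start:i + 1])
            if count < smin * covered:
                del CF[pair]
    return CF
```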


SWF at work

With $S_{\min} = 40\%$, generate the frequent 2-itemsets.

No new 2-itemset is added when processing P2, since there are no extra locally frequent 2-itemsets. Moreover, the counts for itemsets {A,B}, {A,C} and {B,C} are all increased; their counts are no less than 6 × 0.4.


SWF at work

The scan-reduction technique is used to generate $C_k$ ($k = 2, 3, \ldots, n$) using $C_2$.

- $C_2$ is used to generate the candidate 3-itemsets, and each sequential $C'_{k-1}$ is utilized to generate $C'_k$
- $C'_3$ generated from $C_2 * C_2$ instead of $L_2 * L_2$ will be larger in size, but close to $|C_3|$
- A second scan would suffice for pruning

The merit of SWF lies in its incremental procedure. There are three sub-steps:

1. Generating $C_2$ in $D^- = db^{1,3} - M^-$
2. Generating $C_2$ in $db^{2,4} = D^- + M^+$
3. Scanning $db^{2,4}$ once


Clustering

Grouping data based on their homogeneity (similarity or closeness).

Objects within a group are similar (or related) to one another and different from the objects in other groups. When is one clustering better than another?


Clustering

Unsupervised in nature (i.e., the right answers are not known). Clustering is useful for 1) summarization, 2) compression, and 3) efficiently finding nearest neighbors.

Types:

- Hierarchical (nested) versus Partitional
- Exclusive versus Overlapping versus Fuzzy
- Complete versus Partial

K-means: This is a prototype-based¹, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.

¹ each object is closer (more similar) to the prototype of its own cluster


K-means Algorithm

The number of clusters, i.e., the value of K, is provided by the user.

Algorithm 0.1: K-means

1. Randomly select K points as centroids
2. repeat
3.   foreach data point $d_i$ do
4.     assign $d_i$ to the closest centroid (thereby forming K clusters)
5.   recompute the centroid (mean) of each cluster
6. until the centroids converge

Closeness is measured by Euclidean distance, cosine similarity, correlation, Bregman divergence, etc.
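
A minimal NumPy sketch of the loop above, assuming Euclidean distance (an illustration, not the lecture's reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means on an (n, d) array X with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # steps 3-4: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 5: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):              # step 6
            break
        centroids = new_centroids
    return centroids, labels
```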


K-means in Action

(Slide figures: K-means iterations on example data.)

Evaluation of K-means

For a given data set $\{x_1, x_2, \ldots, x_n\}$, let K-means partition it into $\{S_1, S_2, \ldots, S_K\}$; the objective is then

$\arg\min_S \sum_{i=1}^{K} \sum_{x \in S_i} \mathrm{dist}^2(x, \mu_i)$

where $\mu_i$ is the $i$-th centroid: $\mu_i = \frac{1}{m_i} \sum_{x \in S_i} x$. The typical choice for the dist function is the Euclidean distance.

How to proceed? Choose a K (how?²), then:

- Run the K-means algorithm multiple times
- Choose the clustering that minimizes the sum of squared errors (SSE)

If K == n there is no error; a good clustering keeps K small.

² Hamerly, Greg and Elkan, Charles, "Learning the k in k-means", NIPS 2003, pp. 281–288.
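
A short sketch of this procedure, reusing the kmeans function sketched under the K-means Algorithm slide (the data here is synthetic and illustrative):

```python
import numpy as np

# Sketch: rerun K-means with different seeds, keep the run with the
# smallest SSE, and inspect how SSE falls as K grows (the usual
# "elbow" inspection).

def sse(X, centroids, labels):
    return float(((X - centroids[labels]) ** 2).sum())

def best_of_restarts(X, k, restarts=10):
    runs = [kmeans(X, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: sse(X, *run))

X = np.random.default_rng(1).normal(size=(300, 2))
for k in (2, 3, 4, 5):
    centroids, labels = best_of_restarts(X, k)
    print(k, round(sse(X, centroids, labels), 1))   # SSE shrinks with K
```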


Evaluation of K-means

Choosing K: 1) domain knowledge, 2) preprocessing with another algorithm, 3) iterating on K

Initialization of centers: 1) a random point in space, 2) a random point of the data, 3) looking for dense regions, 4) spacing uniformly in feature space

Cluster quality: 1) diameter of a cluster versus inter-cluster distance, 2) distance between members of a cluster and the cluster center, 3) diameter of the smallest enclosing sphere, 4) ability to discover hidden patterns


Limitations of K-means

K-means has problems when the data has:

- Different-size clusters
- Different densities
- Non-globular shapes

Other practical issues: handling empty clusters, the presence of outliers, and updating centroids incrementally.


Other Approaches

K-Medoids: chooses data points as centers and minimizes a sum of pairwise dissimilarities; resistant to noise and/or outliers.

Agglomerative Hierarchical Clustering: repeatedly merges the two closest clusters until a single cluster remains (e.g., Single Link).

DBSCAN: a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.


Clustering in dynamic databases

Can we use the k-means clustering algorithm?

- Randomly choose k data points as centroids
- Assign each data point to the closest centroid
- Update the centroids until convergence
- Typically the SSE (sum of squared errors) is optimized

What about PAM (partitioning around medoids)? NO. What about single/average/farthest-link clustering?


Single Link

1. Start with each point as its own cluster
2. Merge the two nearest clusters, n − k times

Consider the following data points in 2D space; merging the two nearest clusters repeatedly yields the final clustering (figures on slide).

Can we adapt it to dynamic databases?
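
As a static baseline before asking about dynamic updates, a minimal single-link sketch using SciPy (the data and k are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# linkage(..., method='single') performs the nearest-cluster merges,
# where cluster distance is the minimum pairwise point distance;
# fcluster cuts the resulting tree back to k clusters.
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]], dtype=float)
Z = linkage(X, method='single')                    # n - 1 merge steps
labels = fcluster(Z, t=2, criterion='maxclust')    # cut into k = 2 clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]
```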


DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a spatial clustering algorithm from KDD-96.

Parameters (Eps, MinPts) and point types (core/border/noise). Uses DFS to expand clusters.

Advantage: finds clusters of arbitrary shape. Disadvantage: sensitive to parameters.
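
A minimal usage sketch with scikit-learn's DBSCAN implementation (data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# eps plays the role of Eps and min_samples the role of MinPts.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),     # dense blob
               rng.normal(4, 0.3, (50, 2)),     # second dense blob
               rng.uniform(-2, 6, (10, 2))])    # scattered points
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))                      # -1 marks noise points
```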


Incremental DBSCAN (Addition)

On inserting a point, one of four cases applies (illustrated on the slides):

- Noise: the new point is noise and no cluster changes
- Create: a new cluster is created
- Absorb: the point is absorbed into an existing cluster
- Merge: several existing clusters merge into one


Incremental DBSCAN (Deletion)

On deleting a point, one of four cases applies (illustrated on the slides):

- No Change: the deletion does not affect any cluster
- Remove: a cluster disappears
- Split: a cluster splits into several clusters
- Shrink: a cluster shrinks but remains connected


Incremental DBSCAN

- Insertions and deletions are treated separately
- Clusters are updated based on the change in density in the affected region
- The update cost is proportional to the number of points in the affected region, which can be high
- You may end up doing redundant operations

Instead, defer updates for some time and assume updates arrive periodically:

- Cluster the new data
- Merge it with the previous clusters (the density change is then easy to see)


Incremental DBSCAN

Initial database vs. new data (figures on slide)

Region-based merging is applied.


Incremental DBSCAN

The overlap between the clusters formed from the original data and from the new data points looks as shown (figure on slide).

A point p is in the intersection set I′ if ∃ p′ ∈ D such that p and p′ are neighbors. It is necessary and sufficient to process all p ∈ I′.

How can I′ be computed efficiently?
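
One plausible answer, sketched with SciPy's KD-tree (our illustration; the lecture leaves the question open): index the old points and collect every new point that has an old neighbor within Eps.

```python
import numpy as np
from scipy.spatial import cKDTree

# Sketch: I' = new points with at least one Eps-neighbor among the old
# points, found with a KD-tree instead of all-pairs distances.

def intersection_set(old_points, new_points, eps):
    tree = cKDTree(old_points)
    neighbors = tree.query_ball_point(new_points, r=eps)   # lists per point
    mask = np.array([len(nb) > 0 for nb in neighbors])
    return new_points[mask]

old = np.random.default_rng(0).uniform(0, 10, (200, 2))
new = np.random.default_rng(1).uniform(0, 10, (50, 2))
print(len(intersection_set(old, new, eps=0.5)))            # size of I'
```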


Thank You!

Thank you very much for your attention!

Queries?
