TRANSCRIPT
SS-ZG548: ADVANCED DATA MINING
Lecture-04: Incremental Mining and Clustering
Dr. Kamlesh Tiwari, Assistant Professor,
Department of Computer Science and Information Systems, BITS Pilani, Rajasthan-333031 INDIA
Feb 02, 2019 (WILP @ BITS-Pilani Jan-Apr 2019)
Recap
Association Rule Mining involves the discovery of frequent item-sets based on support and confidence parameters
Approaches include Apriori, Hash Based (DHP), and Partition Based algorithms
Incremental association rule mining is needed as real databases are generally dynamic: D′ = D − M− + M+
Fast UPdate (FUP) can handle insertions
1 An original frequent itemset X ∈ L becomes infrequent in D′ iff support(X) in D′ < Smin
2 An itemset X ∉ L can become frequent in D′ only if support(X) in M+ ≥ Smin
3 If a k-itemset X has a (k − 1)-subset that becomes infrequent, i.e., the subset is in Lk−1 but not in L′k−1, then X must be infrequent in D′.
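As a sketch, the insertion rules above can be written out for 1-itemsets in Python. The function and variable names and the toy transactions are illustrative, not from the slides; the point is that rule 2 lets us skip scanning the old database for any itemset that is not already frequent in M+.

```python
# A toy sketch of the FUP insertion logic for 1-itemsets (helper names and
# data are illustrative). Rule 1 re-checks old frequent itemsets on D';
# rule 2 lets us skip any new itemset that is not frequent in M+ alone.
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= set(t))

def fup_update_frequent(old_frequent, D, M_plus, s_min):
    D_prime = D + M_plus
    n = len(D_prime)
    new_frequent = {}
    # Rule 1: an old frequent itemset survives iff it is frequent in D'
    for X in old_frequent:
        c = support_count(X, D_prime)
        if c / n >= s_min:
            new_frequent[X] = c / n
    # Rule 2: an itemset outside L must be frequent in M+ to be a winner;
    # only such survivors need their support checked over D'
    items = sorted({i for t in M_plus for i in t})
    for X in (frozenset(c) for c in combinations(items, 1)):
        if X in old_frequent or X in new_frequent:
            continue
        if support_count(X, M_plus) / len(M_plus) >= s_min:
            c = support_count(X, D_prime)
            if c / n >= s_min:
                new_frequent[X] = c / n
    return new_frequent
```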
ADVANCED DATA MINING (SS-ZG548) WILP @ BITS-Pilani Lecture-04 (Feb 02, 2019) 2 / 28
Recap: FUP at work
Consider the database D and the related frequent set discovered with Apriori

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9
Consider the arrival of M+ more transactions
The first iteration is as below.
Recap: FUP2 at work
FUP2 can handle both insertions and deletions
Ck is divided into two parts: Pk = Ck ∩ Lk and Qk = Ck − Pk
Being frequent in D, the support of every itemset in Pk is already known. It can be updated using M− and M+ only:
Count({A})D′ = Count({A})D − Count({A})M− + Count({A})M+ = 6 − 3 + 1 = 4
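The count update above only needs the deleted and inserted transactions. A minimal sketch, with hypothetical M− and M+ transactions chosen to reproduce the slide's numbers:

```python
# The FUP2 count update from the slide: an itemset in Pk was frequent in the
# old database, so its old count is known, and only the deleted (M-) and
# inserted (M+) transactions need to be scanned. Transactions are illustrative.
def contains(itemset, txn):
    return set(itemset) <= set(txn)

def updated_count(old_count, itemset, M_minus, M_plus):
    return (old_count
            - sum(contains(itemset, t) for t in M_minus)
            + sum(contains(itemset, t) for t in M_plus))

M_minus = [('A', 'B'), ('A',), ('A', 'C')]   # 3 deleted transactions contain A
M_plus = [('A', 'B', 'C')]                   # 1 inserted transaction contains A
count_A = updated_count(6, {'A'}, M_minus, M_plus)   # 6 - 3 + 1 = 4
```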
Frequent itemsets of D:

Item set   Support
{A}        6/9
{B}        6/9
{C}        6/9
{E}        4/9
{A B}      5/9
{A C}      4/9
{B C}      4/9
{A B C}    4/9
Variations of FUP
Update With Early Pruning (UWEP): the occurrence of a potentially huge set of candidate itemsets and multiple scans of the database is the issue
- If a k-itemset is frequent in M+ but infrequent in D′, it is not considered when generating Ck+1
- This can significantly reduce the number of candidate itemsets, with the trade-off that an additional set of unchecked itemsets has to be maintained.
Utilizing Negative Borders: the negative border set consists of all itemsets that are closest to being frequent
- The negative border consists of all itemsets that were candidates of the level-wise method but did not have enough support: Bd−(Lk) = Ck − Lk
- Find the negative border set for L = {{A}, {B}, {C}, {E}, {AB}, {AC}, {BC}, {ABC}}
- A full scan of the dataset is only required when itemsets outside the negative border set get added to the frequent itemsets or to the negative border set.
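The exercise above can be checked by brute force: an itemset is in the negative border if it is not in L but all of its (k − 1)-subsets are. A sketch (taking the item universe from L itself, which is an assumption, since the slide does not list the items):

```python
# A brute-force computation of the negative border: all itemsets outside L
# whose every proper (k-1)-subset is in L, i.e. the failed candidates of the
# level-wise method.
from itertools import combinations

def negative_border(L, items):
    Lset = {frozenset(x) for x in L}
    border = set()
    max_k = max(len(x) for x in Lset) + 1
    for k in range(1, max_k + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand in Lset:
                continue
            # all proper (k-1)-subsets must be frequent ('if s' skips the
            # empty subset so that 1-itemsets are handled correctly)
            if all(frozenset(s) in Lset
                   for s in combinations(sorted(cand), k - 1) if s):
                border.add(cand)
    return border

# the example L from the slide
L = [{'A'}, {'B'}, {'C'}, {'E'},
     {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
```

With items {A, B, C, E}, this yields {A E}, {B E} and {C E}: each has all singleton subsets frequent but was itself pruned.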
Law of large numbers

Prob(|(x1 + x2 + x3 + ... + xn)/n − E(x)| ≥ ε) ≤ Var(x)/(nε²)
Markov’s inequality
Let x be a nonnegative random variable. Then for a > 0:
Prob(x ≥ a) ≤ E(x)/a
Chebyshev’s Inequality
Let x be a random variable. Then for c > 0:
Prob(|x − E(x)| ≥ c) ≤ Var(x)/c²
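Both inequalities can be sanity-checked numerically on a small discrete distribution. A toy example (a fair six-sided die, not from the slides), using exact fractions:

```python
# A numeric check of Markov's and Chebyshev's inequalities on a fair
# six-sided die (values 1..6, equal probability): E = 7/2, Var = 35/12.
from fractions import Fraction

values = [Fraction(v) for v in range(1, 7)]
p = Fraction(1, 6)
E = sum(v * p for v in values)                   # 7/2
Var = sum((v - E) ** 2 * p for v in values)      # 35/12

# Markov (x nonnegative): Prob(x >= a) <= E(x)/a
a = 5
markov_lhs = sum(p for v in values if v >= a)    # Prob(x >= 5) = 1/3
assert markov_lhs <= E / a                       # 1/3 <= 7/10

# Chebyshev: Prob(|x - E(x)| >= c) <= Var(x)/c^2
c = 2
cheb_lhs = sum(p for v in values if abs(v - E) >= c)   # 1/3
assert cheb_lhs <= Var / c ** 2                  # 1/3 <= 35/48
```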
Variations of FUP
Difference Estimation for Large Itemsets (DELI): uses a sampling technique
- Estimate the difference between the old and the new frequent itemsets
- Only if the difference is large enough is the update operation using FUP2 performed
- Let S be m transactions drawn from D− with replacement; the estimated support count of an itemset X in D− is

σ̂X = (TX / m) · |D−|

where TX is the occurrence count of X in S. For large m we have a 100(1 − α)% confidence interval [aX, bX] with

aX = σ̂X − zα/2 √(σ̂X (|D−| − σ̂X) / m)
bX = σ̂X + zα/2 √(σ̂X (|D−| − σ̂X) / m)

where zα/2 is the value such that the area beyond it in the standard normal curve is exactly α/2
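The interval above is straightforward to compute. A sketch with hypothetical numbers (TX, m, |D−| and the z value are made up for illustration; 1.96 corresponds to a 95% interval):

```python
# The DELI confidence interval for one itemset: T_x occurrences of X in a
# sample of m transactions drawn with replacement from a database of
# db_size = |D-| transactions; z is z_{alpha/2}.
import math

def deli_interval(T_x, m, db_size, z):
    sigma_hat = (T_x / m) * db_size              # estimated support count
    half = z * math.sqrt(sigma_hat * (db_size - sigma_hat) / m)
    return sigma_hat - half, sigma_hat + half

# X seen 40 times in a sample of m = 100 from |D-| = 1000 transactions
lo_x, hi_x = deli_interval(40, 100, 1000, 1.96)  # interval around 400
```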
Sliding Window Filtering
Partition-Based Algorithm for Incremental Mining: if X is a frequent itemset in a database divided into partitions p1, p2, ..., pn, then X must be a frequent itemset in at least one of the partitions
- Uses a threshold to generate candidate itemsets
- A frequent itemset remains frequent from some Pk through Pn
- A list of 2-itemsets CF is maintained to track possible frequent 2-itemsets. Locally frequent 2-itemsets of each partition are added (with their starting partition and supports)
- A scan reduction technique can make one database scan enough
SWF at work
With Smin = 40%, generate frequent 2-itemsets
No new 2-itemset is added when processing P2 since there are no extra frequent 2-itemsets. Moreover, the counts for itemsets {A,B}, {A,C} and {B,C} are all increased; their counts are no less than 6 × 0.4
SWF at work
A scan reduction technique is used to generate Ck (k = 2, 3, ..., n) using C2
- C2 is used to generate the candidate 3-itemsets, and each successive C′k−1 is utilized to generate C′k
- C′3 generated from C2 ∗ C2 instead of L2 ∗ L2 will have a size greater than but close to |C3|
- A second scan would suffice for pruning
The merit of SWF lies in its incremental procedure. There are three sub-steps:
1 Generating C2 in D− = db1,3 − M−
2 Generating C2 in db2,4 = D− + M+
3 Scanning db2,4 once
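The CF bookkeeping behind SWF can be sketched as follows. This is a simplified illustration (hypothetical names and toy partitions), not the full algorithm: CF stores each candidate 2-itemset with the partition where it entered and its accumulated count, and prunes it once the count falls below the threshold over the partitions scanned since then.

```python
# A simplified sketch of the SWF filtering step over partitions: CF keeps
# candidate frequent 2-itemsets with their starting partition and accumulated
# count; an itemset survives only while its count meets the running threshold.
from itertools import combinations
from math import ceil

def swf_cf(partitions, s_min):
    cf = {}                      # 2-itemset -> [start partition index, count]
    sizes = []                   # number of transactions per partition
    for pi, part in enumerate(partitions):
        counts = {}
        for txn in part:
            for pair in combinations(sorted(set(txn)), 2):
                counts[pair] = counts.get(pair, 0) + 1
        sizes.append(len(part))
        # update carried itemsets; prune those below the running threshold
        for iset in list(cf):
            start, c = cf[iset]
            c += counts.pop(iset, 0)
            window = sum(sizes[start:])         # transactions since start
            if c >= ceil(s_min * window):
                cf[iset] = [start, c]
            else:
                del cf[iset]
        # add 2-itemsets that are locally frequent in this partition
        for iset, c in counts.items():
            if c >= ceil(s_min * len(part)):
                cf[iset] = [pi, c]
    return cf
```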
Clustering
Grouping data based on their homogeneity (similarity or closeness).
Objects within a group are similar (or related) and are different from the objects in other groups. When is it better?
Clustering
Unsupervised in nature (i.e. the right answers are not known)
Clustering is useful for 1) Summarization, 2) Compression, and 3) Efficiently Finding Nearest Neighbors
Types:
- Hierarchical (nested) versus Partitional
- Exclusive versus Overlapping versus Fuzzy
- Complete versus Partial
K-means: This is a prototype-based1, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
1 object is closer (more similar) to a prototype
K-means Algorithm
Number of clusters, i.e. the value of K, is provided by the user

Algorithm 0.1: K-means
1 Randomly select K points as centroids
2 repeat
3   foreach data point di do
4     Assign di to the closest centroid (thereby forming K clusters)
5   Recompute the centroid (mean) of each cluster
6 until the centroids converge
Closeness is measured by Euclidean distance, cosine similarity, correlation, Bregman divergence, etc.
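Algorithm 0.1 can be sketched in a few lines of pure Python (Euclidean distance, random initialization with a fixed seed; all names and the toy data are illustrative):

```python
# A compact K-means following the algorithm above: random initial centroids,
# repeated assignment to the closest centroid, and mean recomputation until
# the centroids stop changing.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # 1: K random points as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):                      # 2: repeat
        clusters = [[] for _ in range(k)]
        for p in points:                        # 3-4: assign to closest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                    # 6: centroids converged
            break
        centroids = new                         # 5: recompute means
    return centroids, clusters
```

On two well-separated groups of three points each, the algorithm recovers the two groups regardless of which points are drawn as initial centroids.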
K-means in Action
Evaluation of K-means
For a given data set {x1, x2, ..., xn}, let K-means partition it into {S1, S2, ..., SK}; then the objective is

argmin over S of Σi=1..K Σx∈Si dist²(x, μi)

where μi corresponds to the i-th centroid, μi = (1/mi) Σx∈Si x
The typical choice for the dist function is Euclidean distance
How to proceed?
Choose a K (How?2)
- Run the K-means algorithm multiple times
- Choose the clustering that minimizes the sum of squared errors (SSE)
If K == n, there is no error. A good clustering has a smaller K.

2 Hamerly, Greg and Elkan, Charles, “Learning the k in k-means”, pp. 281–288, NIPS-2003
Evaluation of K-means
Choosing K: 1) Domain knowledge, 2) Preprocessing with another algorithm, 3) Iteration on K
Initialization of Centers: 1) Random point in space, 2) Random point of the data, 3) Look for dense regions, 4) Space uniformly in feature space
Cluster Quality: 1) Diameter of cluster versus inter-cluster distance, 2) Distance between members of a cluster and the cluster center, 3) Diameter of the smallest sphere, 4) Ability to discover hidden patterns
Limitations of K-means
Has problems when data has
- Different size clusters
- Different densities
- Non-globular shapes
Handling empty clusters
When there are outliers
Updating centroids incrementally
Other Approaches
K-Medoids: chooses data points as centers and minimizes a sum of pairwise dissimilarities. Resistant to noise and/or outliers
Agglomerative Hierarchical Clustering: repeatedly merges the two closest clusters until a single cluster remains (Single Link)
DBSCAN: a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.
Clustering in dynamic databases
Can we use the k-Means clustering algorithm?
- Randomly choose k data points as centroids
- Assign each data point to the closest centroid
- Update the centroids until convergence
- Typically SSE (sum of squared error) is optimized
What about PAM (partitioning around medoids)? NO
Single/Average/Farthest link clustering?
Single Link
1 Start with each point as a cluster
2 Merge the two nearest clusters n − k times
Consider the following data points in 2D space
Finally we get
Can we make it adaptable for dynamic databases?
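The two-step procedure above can be sketched naively in Python (2-D toy points; an O(n³)-ish illustration, not an efficient implementation):

```python
# A naive single-link sketch: start with singletons and perform n - k merges,
# always merging the pair of clusters with the smallest closest-pair distance.
def single_link(points, k):
    clusters = [[p] for p in points]            # 1: each point is a cluster

    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def link(c1, c2):                           # single link = closest pair
        return min(d2(a, b) for a in c1 for b in c2)

    while len(clusters) > k:                    # 2: n - k merges
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```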
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a spatial clustering algorithm from KDD-96
Parameters (Eps, MinPts) and point types (core/border/noise)
Uses DFS
Advantage: clusters of arbitrary shape
Disadvantage: sensitive to parameters
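A compact sketch of the algorithm for 2-D points, showing the Eps/MinPts parameters, the core/border/noise distinction, and DFS-style cluster expansion (names and data are illustrative):

```python
# A compact DBSCAN sketch: a core point (>= min_pts neighbors within eps)
# starts a cluster, which is grown depth-first through density-reachable
# points; non-core points that are reached become border points.
def dbscan(points, eps, min_pts):
    labels = {p: None for p in points}          # None = unvisited, -1 = noise

    def neighbors(p):                           # Eps-neighborhood (incl. p)
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:                 # not a core point: noise
            labels[p] = -1
            continue
        cid += 1                                # p is core: start a cluster
        labels[p] = cid
        stack = list(nbrs)                      # DFS over density-reachable pts
        while stack:
            q = stack.pop()
            if labels[q] == -1:                 # noise becomes a border point
                labels[q] = cid
                continue
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:          # q is core: keep expanding
                stack.extend(q_nbrs)
    return labels
```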
Incremental DBSCAN (Addition)
Noise (the inserted point is not density-reachable and remains noise)
Create (the inserted point turns nearby noise points into a new cluster)
Absorb (the inserted point is absorbed into an existing cluster)
Merge (the inserted point connects several clusters into one)
Incremental DBSCAN (Deletion)
No Change (the deletion does not affect any cluster membership)
Remove (the cluster becomes too sparse and its points turn into noise)
Split (the cluster loses density connectivity and splits into parts)
Shrink (the cluster only loses the deleted point and possibly some border points)
Incremental DBSCAN
Insertion and deletion are treated separately
Based on the change in density in the affected region, clusters are updated
The update cost is proportional to the number of points in the affected region, which is high
You may be doing redundant operations
Defer updates for some time:
- Assume periodic arrival of updates
- Cluster the new data
- Merge it with the previous clusters (it is easy to see the density change)
Incremental DBSCAN
Initial Database | New Data

Region-based merging is applied
Incremental DBSCAN
The overlapping of clusters made from the original and the new data points looks as follows
Point p is in the intersection set I′ if ∃ p′ ∈ D such that p and p′ are neighbors
It is necessary and sufficient to process all p ∈ I′
Efficiently compute I′. How?
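One common answer is a spatial index. A minimal grid-based sketch for 2-D points (an assumption, the slides do not specify the method): old points are hashed into square cells of side Eps, so each new point only has to inspect its own cell and the eight surrounding cells.

```python
# Grid-based computation of the intersection set I': every old point within
# Eps of some new point. Bucketing into Eps-sized cells means each new point
# checks only a 3x3 block of cells instead of all old points.
from collections import defaultdict

def intersection_set(old_points, new_points, eps):
    grid = defaultdict(list)
    for p in old_points:
        grid[(int(p[0] // eps), int(p[1] // eps))].append(p)
    I = set()
    for q in new_points:
        cx, cy = int(q[0] // eps), int(q[1] // eps)
        for dx in (-1, 0, 1):                   # 3x3 neighborhood of cells
            for dy in (-1, 0, 1):
                for p in grid.get((cx + dx, cy + dy), []):
                    if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2:
                        I.add(p)
    return I
```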