data mining - clustering. clustering 4 clustering - unsupervised classification 4 clustering - the...

DATA MINING - CLUSTERING

Clustering

Clustering - unsupervised classification Clustering - the process of grouping

physical or abstract objects into classes of similar objects

Clustering - help in construct meaningful partitioning of a large set of objects

Data clustering in statistics, machine learning, spatial database and data mining

CLARANS Algorithm

CLARANS - Clustering Large Applications Based on Randomized Search - presented by Ng and Han

CLARANS - based on randomized search and 2 statistics algorithm: PAM and CLARA

Method of algorithm - search of local optimum

Example of algorithm usage

Focusing Methods

FM - based on CLARANS algorithm and efficient spatial access method, like R*-tree

The focusing on representative objects technique

The focus on relevant clusters technique The focus on a cluster technique Examples of usage

Pattern-Based Similarity Search

Searching for similar patterns in a temporal or spatial-temporal database

Two types of queries encountered in data mining operations:– object - relative similarity query– all -pair similarity query

Various approaches:– similarity measures chosen– type of comparison chosen– subsequence parameters chosen

Similarity Measures (1)

1st Measure - the Euclidean distance between two sequences:

{xi} - the target sequence of length n

{yi} - a sequence of length N in thedatabase

{ziJ} - J-th subsequence of length n of {yi}

Similarity Measures (2)

3th Measure - the correlation

(Discrete Fourier Transforms)

between two sequences:

2nd Measure - the linear

correlation between two

sequences:

Alternative approaches

Matching all of the data points of a sequence simultaneously

Matching each sequence into a small set of multidimensional rectangles in the featude space – Fourier transformation– SVD - Singular Value Decomposition– The Karhunen-Loeve transformation

Hierarchy Scan - new approach

Mining Path Traversal Patterns

Solution of the problem of mining traversal patterns:first step: devise to convert the original sequence

of log data into a set of traversal subsequences (maximal forward reference)

second step: determine the frequent traversal patterns, term large reference sequences

Problems with finding large reference sequences

Mining Path Traversal Patterns - Example

Traversal path for a user:

{A,B,C,D,C,B,E,G,H,G,W,A,O,U,O,V}

The set of maximal forward references for this user:{ ABCD, ABEGH, ABEGW, AOU, AOV }

Clustering Features and CF-trees

Clustering Features -

triplet summarizing

information about

subclusters of points:

ii XSSXLS

SSLSNCF

CF-tree - a balanced

tree with 2 parameters: branching factor B -

max number of children threshold T - max

diameter of subclusters stored at the leaf nodes

Construct of CF-tree

Usage of this structure in BIRCH algorithm

150120

2 270 2 100

The non leaf nodes are storing sums of their’ s children’s CFs

The CF-tree is build dynamically as data points are inserted

A point is inserted to the closest leaf entry (subcluster)

If the diameter of the subcluster after insertion is larger than T value, the leaf node(s) are split

BIRCH Algorithm

- Balanced

- Interactive

- Reducing and

- Clustering using

- Hierarchies

BIRCH Algorithm (1)

PHASE 1

Scan all data and built an initial in memory CF tree using the given amount of memory and recycling space on disk

Phase 1: Load into memory

Phase 2: Condense

Phase 3: Global Clustering

Phase 4: Cluster Refining

BIRCH Algorithm (2)

PHASE 2 (optional)

Scan the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded clusters into largest one

Phase 2: Condense

BIRCH Algorithm (3)

PHASE 3

Adapt an existing clustering algorithm for a set of data points to work with a set of subclusters, each described by its CF vector.

Phase 2: Condense

BIRCH Algorithm (4)

PHASE 4 (optional)

Pass over the data to correct inaccuracies and refine clusters further. Phase 4 entails the cost of additional pass.

Phase 2: Condense

CURE Algorithm

- Clustering

- Using

- Representatives

CURE Algorithm (1)

Draw random sample Partition sample Partially cluster partitions

Label data in disk Cluster partial clusters Eliminate outliers

CURE begins by drawing random sample from the database.

CURE Algorithm (2)

In order to further speed up clustering, CURE first partition the random sample into p partitions, each of size n/p.

CURE Algorithm (3)

Partially cluster each partition until the final number of clusters in each partition reduce to n/pq for some constant q > 1.

CURE Algorithm (4)

Outliers do not belong to any of the clusters.

In CURE outliers are eliminated at multiply steps.

CURE Algorithm (5)

Cluster in a final pass to generate the final k clusters.

CURE Algorithm (6)

Each data point is assigned to the cluster containing the representative point closest to it.

CURE - cluster procedure

procedure cluster(S,k)begin T := build_kd_tree(S) Q := buid_heap(S) while size(Q) > k do { u := extract_min(Q) v := u.closet delete(Q,v) w := merge(u,v) delete_rep(T,u) delete_rep(T,u) insert_rep(T,w) w.closet := x /*x is an arbitrary cluster in Q */

for each xQ do { if dist(w,x) < dist(w,w.closet) w.closet := x if x.closet is either u or v { if dist(x,x.closet)< dist(x,w) a.closet := closet_cluster(T,x,dist(x,w)) else x.closet :=w relocate(Q,x) }else if dist(x,x.closet)>dist(x,w){ x.closet := w relocate(Q,x) } }insert (Q,w) }end

CURE - merge procedure

procedure merge(u,v)begin w := uUv w.mean := |u|u.mean+|v|v.mean/|u|+|v| tmpSet := for i := 1 to c do { maxDist := 0 foreach point p in cluster w do { if i = 1 maxDist := dist(p,w.mean) else

minDist := min{dist(p,q) : q tmpSetif (minDist maxDist) { maxDist := minDist maxPoint := p } }tmpSet := tmpSet U {maxPoint}}foreach point p in tmpSet do w.rep := w.rep U {p+*(w.mean-p)} return wend

The Intelligent Miner of IBM

Clustering - Demographic provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated.

Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet’s criterion is maximised:

(sum of all record similarities of pairs in the same cluster) - (sum of all record similarities of pairs in different clusters)

The Intelligent Miner - example

Suppose that you have a database of a supermarket that includes customer identification and information abut the date and time of the purchases. The clustering mining function clusters this data to enable the identification of different types of shoppers. For example, this might reveal that customers buy many articles on Friday and usually pay by credit card.

data mining - clustering. clustering 4 clustering - unsupervised classification 4 clustering - the...

data mining clustering

data mining slide

data clustering

clustering features

examples of usage slide

clarans algorithm

data points

data mining operations

Documents

chapter 9 clustering and unsupervised...

unsupervised learning: k-means...

deep clustering for unsupervised learning of visual...

unsupervised single-particle deep clustering via

unsupervised clustering and active learning of

clustering unsupervised learning introduction machine...

tutorial 8 clustering 1. general methods –unsupervised...

unsupervised learning (a.k.a...

unsupervised clustering on dynamic...

a hybrid unsupervised clustering-based anomaly detection

unsupervised clustering methods for image segmentation

clustering time series using unsupervised-shapelets...

unsupervised learning and clustering

uva cs 6316: machine learning lecture 19:unsupervised...

unsupervised clustering - sites.cs.ucsb.edu

unsupervised grouping of moving objects based on...

unsupervised clustering for nontextual web document...

deep multimodal clustering for unsupervised audiovisual...

unsupervised learning learning unsupervised...unsupervised...

transition state clustering: unsupervised surgical...