Algorithms for clustering large datasets in arbitrary metric spaces

Page 1: Algorithms for clustering large datasets in arbitrary metric spaces

Database Management Systems, R. Ramakrishnan 1

Algorithms for clustering large datasets in arbitrary metric spaces

Page 2: Algorithms for clustering large datasets in arbitrary metric spaces

Introduction

A set of 2-dimensional points is shown in the adjacent figure.

They clearly form three distinct groups (called clusters).

The goal of any clustering algorithm is to find such groups in data to better understand its distribution.

Page 3: Algorithms for clustering large datasets in arbitrary metric spaces

Introduction: What is Clustering?

Input:
– Database of objects.
– A distance function that captures the notion of similarity between objects.
– Number of groups.

Goal:
– Partition the database into the specified number of groups such that each group consists of “similar” objects.

Page 4: Algorithms for clustering large datasets in arbitrary metric spaces

Goals of our clustering algorithm

Good clustering quality
Scalability
Only use a bounded amount of main memory

Page 5: Algorithms for clustering large datasets in arbitrary metric spaces

Outline

Introduction
The BIRCH* framework
BIRCH for n-dimensional spaces
BUBBLE for arbitrary metric spaces
BUBBLE-FM: An improvement over BUBBLE
Experimental evaluation
Conclusions

Page 6: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Introduction

BIRCH* is a framework for scalable incremental clustering algorithms.
– Output is a set of sub-clusters which can further be analyzed by a more expensive domain-specific clustering algorithm.

BIRCH* can be instantiated to yield different clustering algorithms.

Page 7: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Incremental Algorithm

Clusters evolve as data is scanned.
A current set of clusters is always maintained in memory.
Each new object either
– is inserted into the cluster to which it is “closest”, or
– forms a cluster of its own.

Requirements:
– a representation for clusters.
– a structure to search for the closest cluster.
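The incremental scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `incremental_cluster` is hypothetical, a cluster is represented by its first member instead of a cluster feature, the closest cluster is found by a linear scan instead of a tree, and "close enough" is a plain distance threshold.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def incremental_cluster(
    objects: List[T],
    dist: Callable[[T, T], float],
    threshold: float,
) -> List[List[T]]:
    """Scan the objects once; each one joins its closest cluster or starts a new one."""
    clusters: List[List[T]] = []  # the current set of clusters, kept in memory
    for obj in objects:
        best, best_d = None, float("inf")
        for cluster in clusters:
            # Assumption: a cluster is represented by its first member.
            d = dist(obj, cluster[0])
            if d < best_d:
                best, best_d = cluster, d
        if best is not None and best_d <= threshold:
            best.append(obj)          # inserted into the closest cluster
        else:
            clusters.append([obj])    # forms a cluster of its own
    return clusters


points = [0.0, 0.1, 5.0, 5.2, 0.2]
result = incremental_cluster(points, lambda a, b: abs(a - b), threshold=1.0)
print(result)
```

The two requirements listed on the slide correspond to the two placeholders in this sketch: the cluster representation (here a raw member list) and the search structure (here a linear scan).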

Page 8: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Important features

Cluster features (CF)
– Condensed representation for a cluster of objects

CF-tree
– A height-balanced index for CFs

Rebuilding algorithm
– When the allocated amount of memory is exhausted, a smaller CF-tree is built from the old tree.

Page 9: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Cluster Feature (CF)

CFs are summarized representations of clusters.

They contain sufficient information to find
– the distance between a cluster and an object.
– the distance between any two clusters.

They are incrementally maintainable
– when new objects are inserted in clusters.
– when two clusters are merged.

Page 10: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: CF-tree

Two parameters
– Branching factor
– Threshold

Each entry contains the CF of the cluster of objects in the sub-tree beneath it.

Starting from the root, the “closest” entry is selected to traverse downwards until a leaf node is reached.

[Figure labels: new object; closest leaf cluster.]
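The root-to-leaf descent described on this slide might look like the following sketch. The `Entry` and `Node` classes and the `descend` function are hypothetical names introduced only for illustration, and each CF is reduced to a single number so the example stays short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Entry:
    summary: float                    # stand-in for a CF (assumption: one number)
    child: Optional["Node"] = None    # sub-tree beneath this entry; None at a leaf


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def descend(
    root: Node,
    obj: float,
    dist: Callable[[float, float], float],
) -> Tuple[Node, List[Entry]]:
    """Follow the closest entry at each level until a leaf node is reached."""
    node, path = root, []
    while not node.is_leaf:
        closest = min(node.entries, key=lambda e: dist(obj, e.summary))
        path.append(closest)          # kept so the CFs on the path can be updated later
        node = closest.child
    return node, path


leaf_a = Node(True, [Entry(1.0), Entry(2.0)])
leaf_b = Node(True, [Entry(10.0), Entry(11.0)])
root = Node(False, [Entry(1.5, leaf_a), Entry(10.5, leaf_b)])
leaf, path = descend(root, 9.0, lambda a, b: abs(a - b))
print(leaf is leaf_b)
```

Returning the path alongside the leaf is a design choice of this sketch: it makes the later root-to-leaf update step a simple loop over the visited entries.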

Page 11: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: CF-Tree insertion (contd)

At the leaf node, the closest cluster is selected to insert the object.

If the threshold criterion is satisfied, the object is absorbed into the cluster. Else, it forms a new cluster on the leaf node.

The path from the root to the leaf is updated to reflect the insertion.

[Figure labels: absorbed into the cluster; forms a cluster of its own.]

Page 12: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: CF-tree Insertion (contd)

If there is no space on the leaf node, it is split and the entries are redistributed based on the “closeness” criterion.

A new entry is created at its parent to reflect the formation of a new leaf node.
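One possible way to redistribute the entries of an overfull node is sketched below. The slide only says the entries are redistributed by closeness; choosing the two farthest-apart entries as seeds is an assumption of this sketch, and `split_entries` is a hypothetical name.

```python
from itertools import combinations
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def split_entries(
    entries: List[T],
    dist: Callable[[T, T], float],
) -> Tuple[List[T], List[T]]:
    """Split a node's entries (at least two) into two groups by closeness."""
    # Assumption: the pair of entries that are farthest apart become the seeds.
    i, j = max(
        combinations(range(len(entries)), 2),
        key=lambda p: dist(entries[p[0]], entries[p[1]]),
    )
    left, right = [entries[i]], [entries[j]]
    for k, entry in enumerate(entries):
        if k in (i, j):
            continue
        # Each remaining entry goes to the node whose seed is closer.
        if dist(entry, entries[i]) <= dist(entry, entries[j]):
            left.append(entry)
        else:
            right.append(entry)
    return left, right


left, right = split_entries([1.0, 2.0, 9.0, 10.0, 1.5], lambda a, b: abs(a - b))
print(left, right)
```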

Page 13: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Rebuilding Algorithm

If the CF-tree grows to occupy more space than it is allocated, the threshold is increased and the CF-tree is rebuilt.

CFs of leaf clusters are inserted into the new tree. The insertion algorithm is the same as for individual objects.
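The effect of rebuilding with a larger threshold can be illustrated on a flat list of leaf-cluster summaries. This is a simplified sketch, not the tree algorithm itself: the `Summary` type (count and sum of 1-dimensional values), the `rebuild` function, and the use of centroid distance as the threshold criterion are all assumptions made for illustration.

```python
from typing import List, Tuple

Summary = Tuple[int, float]  # hypothetical 1-D cluster summary: (count, sum of values)


def centroid(c: Summary) -> float:
    return c[1] / c[0]


def rebuild(leaf_clusters: List[Summary], new_threshold: float) -> List[Summary]:
    """Re-insert leaf-cluster summaries under a larger threshold."""
    rebuilt: List[Summary] = []
    for c in leaf_clusters:
        # Find the closest cluster already present in the new structure.
        best = min(
            range(len(rebuilt)),
            key=lambda i: abs(centroid(rebuilt[i]) - centroid(c)),
            default=None,
        )
        if best is not None and abs(centroid(rebuilt[best]) - centroid(c)) <= new_threshold:
            n, s = rebuilt[best]
            rebuilt[best] = (n + c[0], s + c[1])   # absorbed: the summaries add up
        else:
            rebuilt.append(c)                      # remains a cluster of its own
    return rebuilt


old_leaves = [(2, 2.0), (1, 1.5), (3, 30.0)]
print(rebuild(old_leaves, new_threshold=1.0))
```

Because whole summaries are re-inserted rather than the original objects, the data never has to be rescanned during a rebuild.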

Page 14: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Instantiation Summary

To instantiate BIRCH* we have to define:
– Cluster features at leaf and non-leaf levels.
– Incremental maintenance of leaf-level CFs and updates to non-leaf level CFs when new objects are inserted.
– Distance measures between any two CFs to define the notion of closeness.

Page 15: Algorithms for clustering large datasets in arbitrary metric spaces

BIRCH*: Instantiation of BIRCH

CF of a cluster of n k-dimensional vectors, V1,…,Vn, is defined as (n, LS, SS)
– n is the number of vectors
– LS is the sum of vectors
– SS is the sum of squares of vectors

CF1+CF2 = (n1+n2, LS1+LS2, SS1+SS2)
– This property is used for incrementally maintaining cluster features.

Distance between two clusters C1 and C2 is defined to be the distance between their centroids.
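A minimal sketch of the (n, LS, SS) cluster feature follows, showing the additivity property and the centroid distance. The class and function names are hypothetical, and storing SS as a single scalar (the sum of squared vector norms) is an assumption of this sketch.

```python
from dataclasses import dataclass
from math import dist as euclidean
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                    # number of vectors
    ls: Tuple[float, ...]     # LS: per-dimension sum of the vectors
    ss: float                 # SS: sum of squared norms (assumption: kept as a scalar)

    @staticmethod
    def of(vector: Sequence[float]) -> "CF":
        """CF of a cluster holding a single vector."""
        return CF(1, tuple(vector), sum(x * x for x in vector))

    def __add__(self, other: "CF") -> "CF":
        # CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)


def centroid_distance(c1: CF, c2: CF) -> float:
    """Distance between two clusters, taken as the distance between their centroids."""
    return euclidean(c1.centroid(), c2.centroid())


c1 = CF.of((0, 0)) + CF.of((2, 0))   # cluster of two points, centroid (1, 0)
c2 = CF.of((4, 4))                   # single-point cluster
print(c1, centroid_distance(c1, c2))
```

Inserting a single object and merging two clusters both reduce to the same `+` operation, which is what makes the CF incrementally maintainable.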

Page 16: Algorithms for clustering large datasets in arbitrary metric spaces

Arbitrary metric space (AMS): Issues

The only operation allowed between objects is the distance computation.
– Specifically, the notion of a centroid of a set of objects does not exist.

The distance function can be computationally very expensive, e.g., the edit distance between strings.
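To make the cost concrete, here is a standard dynamic-programming sketch of the edit (Levenshtein) distance with unit costs for insertion, deletion, and substitution. The function name is chosen for illustration; each call takes time proportional to the product of the two string lengths.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + cost,       # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

There is no meaningful "average" of a set of strings under this distance, which is why the centroid-based cluster feature does not carry over.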

Page 17: Algorithms for clustering large datasets in arbitrary metric spaces

Definitions

Given a set O of objects O1,…,On:

Row sum of Oi is defined as

$$\mathrm{RowSum}(O_i) = \sum_{j=1}^{n} d^2(O_i, O_j)$$

Clustroid of O is the object with the least row sum value.
– Clustroid is a concept parallel to that of the centroid in the Euclidean space.
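The two definitions on this slide translate directly into code. In this sketch the function names are illustrative and `dist` stands for any metric; only distance computations between objects are used.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def row_sum(obj: T, objects: List[T], dist: Callable[[T, T], float]) -> float:
    """Sum of squared distances from obj to every object in the set."""
    return sum(dist(obj, other) ** 2 for other in objects)


def clustroid(objects: List[T], dist: Callable[[T, T], float]) -> T:
    """The object with the least row sum."""
    return min(objects, key=lambda o: row_sum(o, objects, dist))


values = [0.0, 1.0, 2.0, 10.0]
metric = lambda a, b: abs(a - b)
print([row_sum(v, values, metric) for v in values], clustroid(values, metric))
```

Computed this way, finding the clustroid of n objects needs on the order of n squared distance calls, which is costly when the distance function itself is expensive.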

Page 18: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: CF

The CF of a set O of objects O1,…,On is defined as (n, O0, SS, R, RS).

n: number of objects.
O0: clustroid.
SS: sum of squared distances of all objects from O0.
R: set of representative objects (explained later).
RS: row sum values of the representative objects.

Page 19: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: Non-leaf CFs

Non-leaf CFs direct a new object to an appropriate child node.
– They capture the distribution of objects in the sub-tree below them.

A set of sample objects randomly collected from the sub-tree at a non-leaf entry forms its CF.

Page 20: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: Incremental Maintenance (Leaf CF)

Types of insertion
– Type I: Insertion of a single object.
– Type II: Insertion of a cluster of objects.

Under Type I insertion, the location of the new clustroid is within a bounded distance of the old clustroid. (The bound depends on the threshold of the cluster.)

Heuristic 1: Maintain a few objects close to the clustroid.

[Figure labels: new object; new clustroid; old clustroid.]

Page 21: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: Incremental Maintenance (Leaf CF)

Under Type II insertions, the location of the new clustroid is between the two old clustroids.

Heuristic 2: Maintain a few objects farthest from the clustroid in the leaf CF.

The objects maintained at each leaf cluster are its representative objects.

Page 22: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: Updates to Non-leaf CFs

The sample objects at a non-leaf entry are updated whenever its child node splits.
– The distribution of clusters changes significantly whenever a node splits.

Page 23: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE: Distance measures

Distance between two leaf level clusters is defined to be the distance between their clustroids.
– If C1, C2 are leaf clusters with clustroids O10, O20 then D(C1,C2) = d(O10,O20)

Distance between two non-leaf level clusters C1, C2 with sample objects S1, S2 is defined to be the average distance between S1 and S2.

$$D(C_1, C_2) = \frac{1}{|S_1|\,|S_2|} \sum_{i=1}^{|S_1|} \sum_{j=1}^{|S_2|} d(O_i, O_j)$$

where O_i ranges over S1 and O_j ranges over S2.
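Both distance measures are short to express in code. The sketch below assumes the non-leaf measure is the plain mean of all pairwise distances between the two sample sets, as the slide text states; the function names are illustrative.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def leaf_distance(clustroid1: T, clustroid2: T, dist: Callable[[T, T], float]) -> float:
    """Leaf level: distance between the two clustroids."""
    return dist(clustroid1, clustroid2)


def nonleaf_distance(
    samples1: List[T],
    samples2: List[T],
    dist: Callable[[T, T], float],
) -> float:
    """Non-leaf level: average distance over all pairs of sample objects."""
    total = sum(dist(a, b) for a in samples1 for b in samples2)
    return total / (len(samples1) * len(samples2))


metric = lambda a, b: abs(a - b)
print(leaf_distance(1.0, 4.0, metric), nonleaf_distance([0.0, 2.0], [10.0], metric))
```

The non-leaf measure calls the distance function once per pair of samples, so its cost grows with the product of the two sample-set sizes.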

Page 24: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE-FM

Distance functions in arbitrary metric spaces can be computationally expensive.

Idea: Use the Euclidean distance function instead.

Page 25: Algorithms for clustering large datasets in arbitrary metric spaces

BUBBLE-FM: Non-leaf CF

Map S using FastMap into a k-d Euclidean image space.

Each non-leaf CF now contains the centroid of the image vectors of its sample objects.

New objects are mapped into the image space and the Euclidean distance function is used.

[Figure: FastMap maps the sample objects S1...Sk to image vectors V1...Vk.]
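For readers unfamiliar with FastMap, the following is a compact sketch of the general technique, not of how BUBBLE-FM invokes it. The function name `fastmap`, the pivot-selection heuristic (start from the first object, take the farthest object, then the object farthest from that one), and the clamping of negative residuals to zero are assumptions of this sketch.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fastmap(objects: List[T], dist: Callable[[T, T], float], k: int) -> List[List[float]]:
    """Map each object to a k-dimensional vector using only distance computations."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i: int, j: int, col: int) -> float:
        # Squared distance left over after the first `col` coordinates are accounted for.
        total = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            total -= (coords[i][c] - coords[j][c]) ** 2
        return max(total, 0.0)            # assumption: clamp rounding noise to zero

    for col in range(k):
        # Assumed pivot heuristic: two objects that are far apart in the residual space.
        a = max(range(n), key=lambda i: d2(0, i, col))
        b = max(range(n), key=lambda i: d2(a, i, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break                         # nothing left to separate
        dab = dab2 ** 0.5
        for i in range(n):
            # Projection of object i onto the line through the two pivots.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords


image = fastmap([0.0, 3.0, 10.0], lambda a, b: abs(a - b), k=1)
print(image)
```

Once objects have image vectors, comparing a new object with a non-leaf entry only needs a cheap Euclidean distance to a centroid, at the price of the distance calls needed to map the new object against the pivots.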

Page 26: Algorithms for clustering large datasets in arbitrary metric spaces

Scalability

[Figure: Time (axis ticks 0 to 160) plotted against No. of 20-dimensional Points in 1000s (50, 100, 200, 300, 400, 500) for BUBBLE and BUBBLE-FM.]

Page 27: Algorithms for clustering large datasets in arbitrary metric spaces

Conclusions

BIRCH* framework for scalable incremental clustering algorithms.

Instantiation for n-d spaces (BIRCH).
Instantiation for AMS (BUBBLE).
FastMap to reduce the number of times the distance function is called.