A new initialization method for Fuzzy C-Means using Fuzzy Subtractive Clustering
Thanh Le, Tom Altman
University of Colorado Denver
July 19, 2011
Overview
- Introduction: data clustering approaches and current challenges
- fzSC: a novel fuzzy subtractive clustering method for FCM parameter initialization
- Datasets: artificial and real datasets for testing fzSC
- Experimental results
- Discussion
Clustering problem
Data points are clustered based on:
- Similarity
- Dissimilarity
Clusters are defined by:
- Number of clusters
- Cluster boundaries & overlaps
- Compactness within clusters
- Separation between clusters
Clustering approaches
- Hierarchical approach
- Partitioning approach
Hard clustering approach:
- Crisp cluster boundaries
- Crisp cluster membership
Soft/fuzzy clustering approach:
- Soft/fuzzy membership
- Overlapping cluster boundaries
- Most appropriate for real-world problems
Fuzzy C-Means algorithm
The model:

  J(X \mid U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} \, \lVert x_i - v_k \rVert^2 \to \min, \quad m > 1

  \text{subject to } \sum_{k=1}^{c} u_{ki} = 1, \quad i = 1..n

Features: fuzzy membership, soft cluster boundaries. Each data point can belong to multiple clusters, so more relationship information is provided.
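As a concrete sketch of the model above, a minimal FCM loop alternates the standard center and membership updates that minimize J. This is an illustrative NumPy version; the function name, convergence tolerance, and iteration cap are my own choices, not from the slides.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means sketch: alternately update the centers V
    and the partition matrix U (c x n, columns sum to 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial partition matrix; each column sums to 1 over clusters.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Center update: membership-weighted mean of the data points.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Membership update from squared distances to each center.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)          # avoid division by zero
        U_new = d2 ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

Note the random initial partition in the first lines: as the talk argues next, this initialization is exactly where FCM can be trapped in local optima.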
Fuzzy C-Means (contd.)
- Possibility-based model
- Fuzzy sets to describe clusters
- Model parameters estimated using an iterative process
- Rapid convergence
Challenges:
- Determining the number of clusters
- Initializing the partition matrix to avoid local optima
Methods for partition matrix initialization
- Based on randomization
  Problem: different randomization methods depend on different data distributions
- Using heuristic algorithms: Particle Swarm
  Problem: slow convergence because of velocity adjustment
- Integrated with optimization algorithms
  Problem: still based on other methods of partition matrix initialization
Methods for partition matrix… (contd.): using Subtractive Clustering
- Mountain function; the data density, with r_a the mountain peak radius:

  M(x_i) = \sum_{j=1}^{n} e^{-\lVert x_i - x_j \rVert^2 / (r_a/2)^2}

- Mountain amendment; density adjustment, with r_b the mountain radius and x^* the current cluster candidate with density M^*:

  M_t(x_j) = M_{t-1}(x_j) - M^* \, e^{-\lVert x_j - x^* \rVert^2 / (r_b/2)^2}

- Cluster candidate: the most dense data point
- \varepsilon: threshold to stop the cluster center selection (stop when M_t^* / M_0^* < \varepsilon)
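The mountain-function steps above can be sketched as follows. This is an illustrative O(n²) NumPy version in the style of Chiu's subtractive clustering; the exact (r/2)² radius scaling and the simple stop rule M* < ε·M₀* are common conventions I am assuming, not details taken from the slides.

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, rb=1.5, eps=0.15):
    """Subtractive-clustering sketch.
    ra: mountain peak radius, rb: mountain (amendment) radius,
    eps: stop threshold on the remaining peak density.
    O(n^2): densities come from all pairwise distances."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Mountain function: density of each point from all its neighbors.
    M = np.exp(-d2 / (ra / 2.0) ** 2).sum(axis=1)
    M0 = M.max()
    centers = []
    while True:
        i = int(M.argmax())
        if M[i] < eps * M0:          # remaining density below threshold
            break
        centers.append(X[i])
        # Amendment: subtract the selected peak's influence everywhere,
        # which also zeroes the peak itself so it is not re-selected.
        M = M - M[i] * np.exp(-d2[i] / (rb / 2.0) ** 2)
    return np.array(centers)
```

The three free parameters ra, rb, eps are exactly the quantities the next slide questions: choosing them poorly merges or splits clusters, which is the motivation for fzSC.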
Subtractive Clustering method: the problems
- Mountain peak radius?
- Remaining density to be selected?
- Mountain radius?
- Computational time: O(n²)
The proposed method: fzSC, for partition matrix initialization
1. Generate a random fuzzy partition
2. Compute cluster density using a histogram
3. Use the strong uniform fuzzy partition concept
4. Estimate the mountain function based on cluster density
5. Amend the mountain function:
   a. Update cluster density (step 2)
   b. Re-estimate the mountain function (step 4)
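Steps 1–2 might look roughly like the sketch below. This is a hedged illustration only: the histogram bin count and the peak-based density score are my own simplifications, and the paper's strong-uniform-fuzzy-partition construction used in steps 3–5 is not reproduced here.

```python
import numpy as np

def random_fuzzy_partition(n, c, seed=0):
    """Step 1: random partition matrix U (c x n), columns sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    return U / U.sum(axis=0, keepdims=True)

def cluster_density_histogram(X, U, bins=10):
    """Step 2 (illustrative): score each cluster's density from
    membership-weighted histograms over each feature. A cluster whose
    membership mass concentrates in few bins scores high; a cluster
    spread thinly over many bins scores low."""
    c, _ = U.shape
    density = np.zeros(c)
    for k in range(c):
        for f in range(X.shape[1]):
            hist, _ = np.histogram(X[:, f], bins=bins, weights=U[k])
            # fraction of the cluster's mass in its fullest bin
            density[k] += hist.max() / max(U[k].sum(), 1e-12)
    return density
```

The point of working with c per-cluster densities instead of n pairwise distances is the complexity claim on the next slide: the mountain function is estimated from cluster densities in O(c·n) rather than O(n²).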
fzSC: optimal number of clusters
1. The most dense data point is a cluster candidate.
   - The data density is not much affected — say, less than 0.05 of the data density is removed — by the mountain function amendment process.
   - The number of such candidate points is less than n.
2. The radii and threshold parameters (r_a, r_b, ε) are not required.
3. Computational time: O(c·n)
Datasets
Artificial datasets:
- Finite mixture model based datasets
- A manually created (MC) dataset: data were generated using a finite mixture model; clusters were moved to have different distances among clusters
Real datasets:
- Iris, Wine, Glass and Breast Cancer Wisconsin datasets from the UC Irvine Machine Learning Repository
Visualization of the fzSC result on the manually created (MC) dataset
Rectangles: cluster centers of the random fuzzy partition; circles: cluster centers found by fzSC
A visualization…
Stars: cluster centers of the random fuzzy partition; circles: cluster centers found by fzSC.
The utility is available online: http://ouray.ucdenver.edu/~tnle/fzsc/
Experimental results on the manually created dataset
The algorithm performance on the MC dataset:

  Algorithm   Correctness ratio by class              Avg. ratio
              1     2     3     4     5     6
  fzSC        1.00  1.00  1.00  1.00  1.00  1.00      1.00
  k-means     0.97  0.87  1.00  1.00  1.00  0.75      0.93
  k-medians   0.95  0.82  1.00  1.00  1.00  0.62      0.90
  FCM         0.97  1.00  0.95  1.00  1.00  0.96      0.98
Experimental results on artificial datasets
Correctness ratio in determining the cluster number:

  # clusters     Dataset dimension
  in dataset     2     3     4     5
  5              0.97  1.00  1.00  1.00
  6              1.00  0.98  0.90  1.00
  7              1.00  1.00  1.00  1.00
  8              1.00  0.99  0.97  1.00
  9              0.87  0.99  1.00  0.96
Experimental results on real datasets
Correctness ratio in determining the cluster number:

  Dataset                    # data points   known #clusters   predicted #clusters   ratio
  Iris                       150             3                 3                     1.00
  Wine                       178             3                 3                     1.00
  Glass                      214             6                 6 / 5                 0.95 / 0.05
  Breast Cancer Wisconsin    699             6                 6 / 5                 0.65 / 0.35
Discussion: the advantages of fzSC
Versus traditional subtractive clustering:
- The radii and threshold parameters (r_a, r_b, ε) are not required
- Computational time O(c·n) vs. O(n²)
Versus heuristic-based approaches:
- Rapid convergence
- Escapes local optima
Versus probability-model-based approaches:
- Rapid convergence
- No assumption about the data distribution
Discussion: future work
Combine fzSC with biological cluster-validation methods and optimization algorithms to develop novel clustering algorithms for the gene expression data analysis problem.
Thank you!
Questions?
We acknowledge the support from the Vietnamese Ministry of Education and Training, the 322 scholarship program.