
A tool for Interactive Data Visualization and Clustering

R. Tagliaferri

Dipartimento di Matematica e Informatica – Università degli Studi di Salerno

“Modelling & Simulation in Science” – Data Analysis in Astronomy <Livio Scarsi>

Erice, April 18 2007

Overview

• The aim of this work is to propose an interactive, multi-step approach to data exploration.

• By data exploration we refer here, specifically, to the clustering and visualization of complex scientific data.

• The data analysis process makes it possible to explore and study raw datasets following, roughly, four steps:

• Pre-processing
• Clustering
• Data visualization
• Interactive data exploration

• The project started under the Astroneural collaboration between the University “Federico II” of Naples and the University of Salerno.

[Flowchart: the data analysis software project.
• Data input and normalization → data in workspace
• Preprocessing: input normalization, linear PCA, non-linear PCA, ICA
• Feature selection: univariate statistical tests, Gram-Schmidt, PCA, backward feature elimination
• Supervised algorithms: LDA, SVM-RFE, SVM-Mangasarian, LRC, BLR-PC, NN-ARD, fuzzy relation NN
• Clustering (first visualization): K-means, SOM, EM, PPS
• Clustering (data exploration & parameter estimation): K-means, SOM, EM, PPS
• Interactive and hierarchical visualization: dendrograms, NEC
The clustering blocks are grouped under the ITACA module; the interactive and hierarchical visualization block forms the IHCT module.]


Preprocessing

• Before applying clustering or any other data mining technique, it is often useful, and sometimes necessary, to preprocess the data.

• Most real-world datasets, in fact, suffer from two major problems: noise and high dimensionality.

• These may derive from the method (or device) used to acquire the data or from the nature of the data itself.

• In both cases a dataset can become intractable: on one hand because of the accuracy of the results (high noise), on the other because of the computational cost (which grows exponentially with dimensionality).

• The proposed tool includes different pre-processing methods, from data normalization to non-linear robust PCA feature extraction.

Preprocessing

• The preprocessing consists of the following steps:

• Data normalization (0-mean, 1-variance, etc.)
• Statistical and feature analysis (single feature, mean, average; 1-D or 2-D visualizations, etc.)
• PCA (input dimension reduction)
• Non-linear PCA (feature extraction for time series analysis)
• ICA (feature extraction for time series analysis).

• The proposed tool has been developed in the Matlab® computing environment; the result is a user-friendly software tool that offers several clustering, visualization and data interaction techniques.
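As an illustration of the first two preprocessing steps, here is a minimal Python sketch of 0-mean/1-variance normalization followed by a linear PCA reduction (the tool itself is implemented in Matlab; the function names and the toy data are ours):

```python
# Minimal sketch of the normalization + PCA preprocessing step
# (illustrative Python; the original tool is implemented in MATLAB).
import numpy as np

def zscore(X):
    """0-mean, 1-variance normalization of each feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_reduce(X, n_components):
    """Project the data onto the first n_components principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)              # covariance matrix of the features
    eigval, eigvec = np.linalg.eigh(cov)        # eigen-decomposition
    order = np.argsort(eigval)[::-1][:n_components]
    return Xc @ eigvec[:, order]

X = np.random.randn(500, 10)                    # toy dataset: 500 samples, 10 features
X_pre = pca_reduce(zscore(X), n_components=3)
```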


Supervised Learning & Assessment

In supervised classification most of the available assessment tools are embedded in the training process: optimization techniques are used to minimize fixed objective functions and, finally, validation techniques can be used to test the classifier.

We know the number of classes to produce and we know the “approximate shape” of a typical class prototype (we learn it in the training step).

Finally, the dataset is certainly partitionable.

Unsupervised Learning and Clustering

Our aim is to provide partitions of a fixed dataset, but we have no already-classified samples, so we cannot have a training step before classifying.

All we can do is investigate the underlying structure of the dataset using only the similarity between its samples.

So, from a very general point of view, given a dataset $D = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, a clusterization over it is the partition $C = \{C_1, \dots, C_k\}$ that minimizes

$$\sum_{h=1}^{k} \sum_{\mathbf{x}_i, \mathbf{x}_j \in C_h} f(\mathbf{x}_i, \mathbf{x}_j)$$

where $f : D \times D \to \mathbb{R}$ is a fixed dissimilarity function (usually a Minkowski metric, e.g. the Euclidean distance).
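A minimal sketch of this objective, assuming the Euclidean distance as the dissimilarity function f (illustrative Python, not the tool's code):

```python
import numpy as np

def clusterization_cost(X, labels):
    """Sum of pairwise Euclidean dissimilarities between points that share a cluster."""
    cost = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        diffs = members[:, None, :] - members[None, :, :]
        cost += np.sqrt((diffs ** 2).sum(axis=-1)).sum() / 2   # each unordered pair counted once
    return cost
```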

Assessment

In unsupervised contexts, assessment is not so simple…

There are questions to answer before trying blind clustering:

- Is the dataset really clusterizable? In other words, does it contain localized, uniform groups? Are these groups well separated?

- How many clusters should we produce?


Assessment

And there are questions to answer once the clustering is done:

- How reliable is each cluster?

- How much can we trust the membership of a data point in a cluster?

- If we manually reassign some points to a different cluster, how much does the total reliability change? How much does each cluster's reliability change?

- Can we extract, from the original dataset, a subset of points for which the clusterization has a reliability greater than a threshold? In which clusters do the points of this subset lie? And how strongly do the points outside this subset tend to belong to a given cluster?

ITACA: principal functionalities

- n-dimensional dataset import (supervised labeling is supported)

- synthetic n-dimensional dataset generation using a mixture-model pdf with visually configurable Gaussian distributions

- dataset dimensionality reduction using Multi-Dimensional Scaling and Principal Component Analysis

- clustering with K-means, Expectation Maximization and Self-Organizing Maps

- cluster visualization with a tool providing an appealing three-dimensional representation (jewel representation)

- ability to export the whole clusterization or a single cluster

- topological property measurement using several scores

- Estimation of cluster reliability for labeled datasets

- Estimation of similarity between different clusterizations on the same dataset

ITACA: principal functionalities

- Clusterization stability estimation using several methods

- Finding the number of clusters providing maximal stability (more generally, estimation of the algorithmic parameters), using the Model Explorer algorithm

- Single cluster reliability estimation using a random-projection-based method

- Analysis of the random-projection distortion introduced in the dataset

- Single-point membership function estimation and fuzzy-membership analysis

- Clusterization reduction to obtain a level of reliability greater than a given threshold

- Interactive clustering

Topological evaluation scores: examples

• Dunn Index:

$$I_D = \frac{\min_{i \neq j}\{\delta(C_i, C_j)\}}{\max_{1 \le l \le K}\{\Delta(C_l)\}}$$

where $\delta(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\Delta(C_l)$ is the diameter of cluster $C_l$.

• Davies-Bouldin Index:

$$DB = \frac{1}{K}\sum_{i=1}^{K} R_i \qquad \text{with} \qquad R_i = \max_{\substack{j=1,\dots,K \\ j \neq i}} R_{i,j} \qquad \text{and} \qquad R_{i,j} = \frac{S(C_i) + S(C_j)}{\delta(C_i, C_j)}$$

where $S(C_i)$ is the average scatter within cluster $C_i$.

• Intra-Cluster Variance:

$$V = \frac{1}{N}\sum_{k}\sum_{\mathbf{x}\in C_k} d(\mathbf{x}, c_k)$$

where $c_k$ is the centroid of cluster $C_k$.

And many others…
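A possible Python sketch of the first two scores, assuming Euclidean distances, single-linkage inter-cluster distances δ and maximum pairwise diameters Δ (other choices are equally legitimate):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    diam = max(cdist(c, c).max() for c in clusters)                  # max_l Δ(C_l)
    sep = min(cdist(a, b).min() for i, a in enumerate(clusters)
              for b in clusters[i + 1:])                             # min_{i≠j} δ(C_i, C_j)
    return sep / diam

def davies_bouldin(X, labels):
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    scatter = np.array([cdist(X[labels == c], centroids[[k]]).mean()   # S(C_i)
                        for k, c in enumerate(ids)])
    K = len(ids)
    R = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                R[i, j] = (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
    return R.max(axis=1).mean()                                      # (1/K) Σ_i max_j R_ij
```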

Classical similarity measures between partitions

Based on the relations between the single-clusterization similarity matrices and their element-wise product:

$$M_{i,j} = \begin{cases} 1 & \text{if } \exists\, C:\ \mathbf{x}_i, \mathbf{x}_j \in C \\ 0 & \text{otherwise} \end{cases} \qquad\qquad M^{AB}_{i,j} = M^{A}_{i,j}\, M^{B}_{i,j}$$

With

$\chi_1^{A}$ = number of non-zero entries in the first similarity matrix

$\chi_1^{B}$ = number of non-zero entries in the second similarity matrix

$\chi_1^{AB}$ = number of non-zero entries in the element-wise product

some of the measures implemented in ITACA are:

Minkowski index: $I_M = \sqrt{\dfrac{\chi_1^{A} + \chi_1^{B} - 2\chi_1^{AB}}{\chi_1^{B}}}$

Jaccard index: $I_J = \dfrac{\chi_1^{AB}}{\chi_1^{A} + \chi_1^{B} - \chi_1^{AB}}$

Correlation: $I_C = \dfrac{\chi_1^{AB}}{\sqrt{\chi_1^{A}\, \chi_1^{B}}}$

And many others…
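The three indices can be computed directly from two label vectors, as in this sketch (the co-association matrices are built as in the slide; excluding the diagonal is our choice):

```python
import numpy as np

def coassociation(labels):
    """M[i, j] = 1 iff points i and j fall in the same cluster (diagonal excluded here)."""
    labels = np.asarray(labels)
    M = (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(M, 0)
    return M

def partition_similarities(labels_a, labels_b):
    Ma, Mb = coassociation(labels_a), coassociation(labels_b)
    chi_a, chi_b = Ma.sum(), Mb.sum()
    chi_ab = (Ma * Mb).sum()          # non-zero entries of the element-wise product
    return {
        "minkowski":   float(np.sqrt((chi_a + chi_b - 2 * chi_ab) / chi_b)),
        "jaccard":     float(chi_ab / (chi_a + chi_b - chi_ab)),
        "correlation": float(chi_ab / np.sqrt(chi_a * chi_b)),
    }
```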

A novel similarity measure based on the entropy

Confusion matrix between two clusterizations:

$$M^{AB}_{i,j} = \left| C_i^{(A)} \cap C_j^{(B)} \right|$$

Properties:

1. For every $i = 1, \dots, K$, the sum of the $i$-th row entries is equal to the cardinality of the $i$-th cluster of the first clusterization ($K$ = number of clusters of the 1st clusterization):

$$\forall\, i = 1, \dots, K: \quad \sum_{j=1}^{H} M^{AB}_{i,j} = \left| C_i^{(A)} \right|$$

2. For every $j = 1, \dots, H$, the sum of the $j$-th column entries is equal to the cardinality of the $j$-th cluster of the second clusterization ($H$ = number of clusters of the 2nd):

$$\forall\, j = 1, \dots, H: \quad \sum_{i=1}^{K} M^{AB}_{i,j} = \left| C_j^{(B)} \right|$$

3. The sum of all the entries is equal to the cardinality of the whole dataset:

$$\sum_{i=1}^{K}\sum_{j=1}^{H} M^{AB}_{i,j} = |D|$$
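A small sketch of the confusion matrix and of the three properties above, for two label vectors indexing clusters from 0 (the toy labels are ours):

```python
import numpy as np

def confusion_matrix(labels_a, labels_b):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    M = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
    for a, b in zip(labels_a, labels_b):
        M[a, b] += 1                                        # |C_i^(A) ∩ C_j^(B)|
    return M

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([0, 0, 0, 1, 1, 1])
M = confusion_matrix(labels_a, labels_b)
assert (M.sum(axis=1) == np.bincount(labels_a)).all()       # property 1: row sums
assert (M.sum(axis=0) == np.bincount(labels_b)).all()       # property 2: column sums
assert M.sum() == len(labels_a)                             # property 3: total = |D|
```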

A novel similarity measure based on the entropy

Self-information of an event $A$: $\log\dfrac{1}{p(A)}$

Entropy of a discrete random variable (the average self-information of its realizations):

$$H(X) = \sum_{i=1}^{n} P(x_i)\, \log\frac{1}{P(x_i)}$$

We can look at each entry of the normalized confusion matrix

$$M^{P}_{i,j} = \frac{M^{AB}_{i,j}}{\sum_{i,j} M^{AB}_{i,j}} = \frac{M^{AB}_{i,j}}{|D|}$$

as the empirical probability of the event “a randomly picked pattern from the dataset $D$ belongs to the $i$-th cluster of the first clusterization and to the $j$-th cluster of the second” (by the frequentist definition of probability).

$M^{P}$ therefore contains the values of the probability function of a random variable.

A novel similarity measure based on the entropy

We can apply the entropy measure to the confusion matrix, keeping its original meaning:

$$H(M^{P}) = \sum_{i=1}^{K}\sum_{j=1}^{H} M^{P}_{i,j}\, \log\frac{1}{M^{P}_{i,j}}$$

Looking at the properties shown in (1) and (2), we can define the entropy of a single clusterization in the same way, from its normalized cluster cardinalities.
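A sketch of these entropies; the single-clusterization entropies are computed here from the row and column sums of the normalized confusion matrix, which is our reading of properties (1) and (2):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector or matrix, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([0, 0, 0, 1, 1, 1])
M = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
np.add.at(M, (labels_a, labels_b), 1)          # confusion matrix M^AB
MP = M / M.sum()                               # normalized confusion matrix M^P
H_joint = entropy(MP)                          # H(M^P)
H_a = entropy(MP.sum(axis=1))                  # entropy of the first clusterization
H_b = entropy(MP.sum(axis=0))                  # entropy of the second clusterization
```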

A novel similarity measure based on the entropy

If two clusterizations are completely independent, it can be shown that the entropy of the confusion matrix always equals the sum of the two single-clusterization entropies.

If the two clusterizations perfectly match each other, it can be shown that the entropy of the confusion matrix always equals each of the single-clusterization entropies.

So it can be shown that the entropy of the confusion matrix always lies between these two extremes, and a similarity measure between partitions can be defined from the entropy of the confusion matrix.

Resampling methods

• Informally, the stability of a clusterization is a measure that quantifies how much a perturbation of the dataset affects it.

• Cluster analysis is based on the following (almost philosophical) assumption: “the phenomenon that generated the data we want to analyze can be modeled from a statistical point of view”.

• Can we say how close an assumption on the artificial model (the clusterization) is to the real model?

• Can we say how much of the “statistic of the Creator” (the statistical model underlying the data) has been discovered by a clusterization?

Resampling methods

If we had the “statistic of the Creator”…

[Diagram: datasets D_1, D_2, D_3, …, D_n drawn from the underlying statistic are each clustered with the same parameterized clustering procedure, and the resulting clusterizations are approximately equal to one another.]

In this case the prior assumption (in our case, the values of the algorithmic parameters: the number of clusters in K-means, the number of centers in EM, the map dimension in the SOM approach) is very close to the real model, and the structure that we can observe in the clusterization is stable.

Stability can then be seen as an indicator of the correctness of our assumptions.

Estimation of the algorithmic parameters - 1: the Model Explorer algorithm

1st problem: we do not know the statistical model underlying the data.

Sub-sampling: we simulate the generating statistic by picking random sub-samples from D.

2nd problem: how do we compare clusterizations obtained on different datasets?

Comparisons are made on the intersection between the different sub-samples.

Input:
• dataset D
• testing range [Kstart, Kend]
• parameterized clustering procedure P
• similarity measure between clusterizations s
• number of trials Tmax
• sampling ratio f

Output:
• a list of values [Zstart, …, Zend] giving the average stability for each K in the testing range
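A minimal sketch of the procedure, with K-means standing in for the parameterized clustering procedure P and the adjusted Rand index standing in for the similarity measure s (both are our choices, not necessarily the tool's defaults):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def model_explorer(X, k_range, s=adjusted_rand_score, t_max=20, f=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    stability = {}
    for k in k_range:
        scores = []
        for _ in range(t_max):
            idx1 = rng.choice(n, size=int(f * n), replace=False)   # two random sub-samples
            idx2 = rng.choice(n, size=int(f * n), replace=False)
            lab1 = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx1])
            lab2 = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx2])
            # compare the two clusterizations on the points the sub-samples share
            _, pos1, pos2 = np.intersect1d(idx1, idx2, return_indices=True)
            scores.append(s(lab1[pos1], lab2[pos2]))
        stability[k] = float(np.mean(scores))                      # average stability Z_k
    return stability
```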

Single cluster reliability

[Diagram: the dataset D is mapped to perturbed datasets D_1, D_2, D_3, …, D_n through random projections obeying the Johnson-Lindenstrauss lemma; each projected dataset is clustered with the same clustering procedure, and the resulting similarity matrices are averaged into the fuzzy similarity matrix

$$M = \frac{M_1 + M_2 + \dots + M_n}{n}.$$]

Single cluster reliability

Considering, in the FSM, only the entries whose indices correspond to the points lying in a cluster, we obtain a reliability measure for that cluster.
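A sketch of this construction, with Gaussian random projections and K-means standing in for the projection and clustering procedures, and the single-cluster reliability taken as the mean FSM entry over the points of the cluster (one plausible reading of the slide, not the tool's exact definition):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

def fuzzy_similarity_matrix(X, k, n_projections=20, target_dim=5, seed=0):
    n = len(X)
    fsm = np.zeros((n, n))
    for t in range(n_projections):
        proj = GaussianRandomProjection(n_components=target_dim, random_state=seed + t)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj.fit_transform(X))
        fsm += (labels[:, None] == labels[None, :])   # co-association matrix M_t
    return fsm / n_projections                        # FSM = (M_1 + ... + M_n) / n

def cluster_reliability(fsm, labels, cluster_id):
    """Mean FSM entry over the pairs of points assigned to the given cluster."""
    idx = np.where(labels == cluster_id)[0]
    return float(fsm[np.ix_(idx, idx)].mean())
```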

Fuzzy membership analysis

• In this model clusters become fuzzy sets: starting from the FSM, we can compute the value of each membership function for each pattern.

• Model: given a clusterization on the dataset, for each (generic) cluster we define the value of its cluster membership function for each pattern.

Fuzzy membership analysis

• Starting from the FSM, these functions quantify “how much” single points belong to a fixed group of other points over different trials of clustering on the perturbed versions of the dataset.

• A simple property of these membership values can easily be shown to always hold.
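Since the slide formula is not reproduced in this transcript, the following is only a plausible membership function derived from the FSM, not the tool's definition: the membership of a pattern in a cluster is taken as its average FSM similarity to the points of that cluster, normalized over the clusters.

```python
import numpy as np

def fuzzy_memberships(fsm, labels):
    """Return an (n_points, n_clusters) matrix of assumed membership values."""
    ids = np.unique(labels)
    u = np.column_stack([fsm[:, labels == c].mean(axis=1) for c in ids])
    return u / u.sum(axis=1, keepdims=True)   # normalize so each row sums to 1
```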


Interactive data exploration

• We stress that our interactive scientific data exploration tool makes it possible to accomplish both clustering/labelling and visualization tasks.

• The approach is mainly based on the interaction of the following modules:

• Clustering algorithms (ITACA, NEC)

• Visualization algorithms (PPS, MDS, PCA, etc.)

• Interactive data exploration algorithms

Clustering Maps

Clustering techniques offer a wide variety of different solutions to be considered before choosing the final clusterization. In general:

• Different clustering techniques give different clustering solutions

• Different parameters of the same clustering technique give different clustering solutions.

• Different initializations of the same clustering technique with the same parameters give different solutions.

Clustering Maps

To deal with such complexity, we build the Clustering Maps. A clustering map is built in five steps:

• Choice of a set of clustering techniques / parameters / initializations to obtain a corresponding set of clusterizations.

• Choice of a distance measure between clusterizations.

• 2D Multidimensional Scaling based on the chosen distance matrix between all the clusterizations of the set.

• Choice of a function to evaluate the fitness of each clusterization.

• Final 3D visualization based on the MDS 2D mapping with the addition of the fitness values as third coordinate.
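The five steps can be sketched as follows, with K-means random restarts as the set of clusterizations, 1 minus the normalized mutual information as the distance between clusterizations, and the K-means inertia (sum of squared point-to-center distances) as the fitness function; all three choices are ours, for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
from sklearn.metrics import normalized_mutual_info_score

def clustering_map(X, k=6, n_runs=100):
    # step 1: a set of clusterizations (random restarts of the same technique)
    runs = [KMeans(n_clusters=k, init="random", n_init=1, random_state=r).fit(X)
            for r in range(n_runs)]
    labels = [km.labels_ for km in runs]
    fitness = np.array([km.inertia_ for km in runs])   # step 4: fitness of each clusterization
    # steps 2-3: distance matrix between clusterizations and its 2-D MDS embedding
    dist = np.zeros((n_runs, n_runs))
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            d = 1.0 - normalized_mutual_info_score(labels[i], labels[j])
            dist[i, j] = dist[j, i] = d
    xy = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
    # step 5: 3-D visualization coordinates (x, y, fitness) for each clusterization
    return np.column_stack([xy, fitness])
```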

Clustering Maps – Example on the Iris dataset

[Figure: 2-D clustering map in the X–Y plane, together with histograms of the point-to-point distances and of the distortion.] Clustering Map for 100 clusterizations on a 2x3 SOM with random initialization and the sum of point-to-center distances as fitness values (best fitness in bright regions).

Interactive agglomerative clustering

• After the clustering phase, an interactive agglomerative methodology is applied to obtain the final partitioning of the data.

• Since the number of clusters is rarely known in advance, we adopt an interactive agglomerative technique.

• The technique is performed on the clusters obtained in the clustering phase, where the user can choose different partitionings of the data on the basis of a threshold value.

Interactive agglomerative clustering

• The result is a dendrogram in which each leaf is associated with a cluster from the pre-clustering phase.

• The clustering technique is based on different models which use Euclidean or information distances.

• These distances can be computed directly between the clusters computed in the previous step.
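A minimal sketch of this step with SciPy's hierarchical clustering applied to the pre-cluster centers, using Euclidean distance and average linkage (the information-based distances of the tool are not reproduced here); the threshold cut yields the final partitioning:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerate_centers(centers, threshold):
    Z = linkage(centers, method="average", metric="euclidean")   # dendrogram over the pre-clusters
    merged = fcluster(Z, t=threshold, criterion="distance")      # cut at the user-chosen threshold
    return Z, merged   # Z can be drawn with scipy.cluster.hierarchy.dendrogram
```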

Interactive agglomerative clustering

• The dendrogram plotted by our tool allows an interactive analysis.

On the left, the MDS visualization is shown. On the right, the threshold is set to obtain a coloring that is reported in real time in the MDS plot. The points belonging to the clusters selected on the dendrogram (dashed line) are highlighted in the MDS plot (bigger points).

Data Visualization and Exploration

Different visualizations updated simultaneously. From top to bottom and from left to right: controller window, dendrogram visualization, spherical PPS visualization, MDS visualization and 2D “Robinson projection” PPS visualization.

Visualization of prior knowledge

• It often happens that some or all the objects of a dataset are known to belong to some classes.

• This information can be used in at least two ways:
• To validate a clustering result.
• To infer new knowledge.

• The validation of a clustering can be obtained by comparing the prior knowledge with the knowledge obtained from the clustering itself.

• The prior knowledge can also be used to produce new knowledge inferred by the presence or absence of objects of a certain class in a specific cluster.

Visualization of prior knowledge

• Prior knowledge can be shown visually by using different marker shapes for the points belonging to different classes (e.g. circles, squares, triangles).

• This makes it possible to visualize, with all the visualization methods described, distance and cluster information together with the prior knowledge.

• Our tool allows the use of different sets of prior-knowledge information and makes it easy to switch from one set to another.
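A small sketch of this kind of overlay, with marker shapes encoding the known classes and colors encoding the clusters (matplotlib, illustrative only; the tool itself is in Matlab):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_with_prior(points_2d, cluster_labels, class_labels,
                    markers=("o", "s", "^", "v", "D")):
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    for m, cls in zip(markers, np.unique(class_labels)):
        sel = class_labels == cls                     # points of a known class share a marker
        plt.scatter(points_2d[sel, 0], points_2d[sel, 1],
                    c=cluster_labels[sel], marker=m, label=f"class {cls}")
    plt.legend()
    plt.show()
```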

Data Visualization and Exploration

MDS visualization with free-hand point selection. Selected points can be exploredseparately and saved apart for further studies.

Results Analysis

• Once the final partitioning of the data is obtained, analytic measures can be employed to assess its meaning.

• Typical measures are the mean and variance of the features of all objects in a cluster.

• In this process a well-characterized cluster is, certainly, one with low variance, while clearly separated clusters must also have clearly different mean behaviors.

• Our tool is able to summarize all this information, including a mean and variance plot, for the objects falling in any subset of the dendrogram, or for the clusters defined by the threshold.
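A sketch of the per-cluster summary behind such a plot (one mean and one variance per feature for the objects of each cluster; the function name is ours):

```python
import numpy as np

def cluster_summary(X, labels):
    """Return {cluster_id: (feature means, feature variances)} for the objects in each cluster."""
    return {c: (X[labels == c].mean(axis=0), X[labels == c].var(axis=0))
            for c in np.unique(labels)}
```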

Results Analysis: an example

Example of visual inspection of the green cluster from the previous Iris clusterization. On the left of the window: a list of all the points in the cluster. On the right: a mean-variance plot of the cluster in blue and a simple plot of the selected point (92) in red.

Back to numbers: the confusion matrix

Conclusions

• In this work we presented a multi-step approach to data exploration allowing the reduction of information contained in very large datasets to a human-tractable size.

• We introduced the two main subtools: ITACA and IHCT.

• We showed how to assess and validate clusterizations by using ITACA and by answering many of the initial questions.

• We showed how a user is supported during the interactive part of the process with different kinds of data clustering and visualizations, and how they can work with the data projected into a reduced space.

• We also showed how the results of the process are checked to ensure the significance of the clustering.

• Our software implementation is realized under the Matlab computing environment which provides the interactive environment for the exploration of data and results.