A Two-Way Visualization Method for Clustered Data

Post on 04-Jan-2016

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Advisor : Dr. Hsu

Presenter : Keng-Wei Chang

Authors: Yehuda Koren and David Harel

A Two-Way Visualization Method for Clustered Data

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

N.Y.U.S.T. I. M.

Outline

Motivation
Objective
Introduction
Basic Notions
Computing The x-Coordinates
Computing The y-Coordinates
Result
Related Work
Conclusions
Personal Opinion

Motivation

A number of technological developments have led to an explosion of raw data that has to be analyzed.

We are especially interested in two families of tools in this domain: clustering algorithms and data visualization methods.

Objective

In this paper, we integrate the two approaches:

hierarchical clustering, depicted as a dendrogram

low-dimensional embedding

Introduction

Clustering methods can be broadly classified as hierarchical or partitional.

Introduction

Our main interest here is hierarchical clustering.

The clustering hierarchy is often visualized as a dendrogram, a full binary tree.

The dendrogram has a significant disadvantage: it does not provide exploratory visual representations of the data itself.

Another issue is that of cluster validity.

Introduction

We are particularly interested in methods for achieving a low-dimensional embedding of data:

principal component analysis (PCA)

multidimensional scaling (MDS)

force-directed placement

These methods address some limitations of the dendrogram, but they cannot utilize external clustering information.

Introduction

A demonstration of the relative merits of the two approaches: a dendrogram vs. a low-dimensional embedding.


Basic Notions

We are given data about $n$ elements $\{1, \ldots, n\}$.

Relationships between pairs of elements are given by distances $d_{ij} \ge 0$ or similarities $w_{ij} \ge 0$.

A 2-dimensional embedding of the data is defined by two vectors $x, y \in \mathbb{R}^n$.

The coordinates of element $i$ are $(x_i, y_i)$.

Computing The x-Coordinates

The embedding must place each element exactly below its corresponding leaf in the dendrogram.

This means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram.

We therefore face the problem of computing x-coordinates for the dendrogram leaves in a way that preserves the relationships among the data as much as possible.

Computing The x-Coordinates

Having examined the existing methods, we opt for a twofold process:

First, find the best orientation of the dendrogram. This step determines the ordering of the leaves.

Then, decide on the exact gaps between consecutive leaves in the ordering.

Dendrogram orientation

A dendrogram with $n$ leaves has $2^{n-1}$ different orientations, since each of its $n-1$ internal nodes can be flipped independently.
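The count of orientations can be checked on a small case. The sketch below is illustrative only: it assumes a dendrogram encoded as nested Python tuples (an encoding chosen here, not taken from the paper) and enumerates every leaf order reachable by flipping internal nodes.

```python
from itertools import product

def orientations(tree):
    """Yield every leaf ordering obtainable by flipping internal nodes."""
    if not isinstance(tree, tuple):          # a leaf
        yield [tree]
        return
    left, right = tree
    for a, b in product(list(orientations(left)), list(orientations(right))):
        yield a + b                          # keep this node as drawn
        yield b + a                          # flip this node

# Hypothetical 4-leaf dendrogram with 3 internal nodes.
tree = ((0, 1), (2, 3))
orders = list(orientations(tree))
print(len(orders))                           # number of orientations found
```

For this 4-leaf example the enumeration yields 8 distinct orderings, which matches 2 to the power of 3.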

Dendrogram orientation

One way of formally defining what should be considered a “good” ordering is to associate a cost function with the dendrogram, such that finding the best ordering is equivalent to optimizing this function.

This can be the classical minimum linear arrangement problem, which seeks an ordering $x$ that minimizes

$$LA^{sim}(x) \stackrel{def}{=} \sum_{i,j} w_{ij} \cdot |x_i - x_j|$$
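As a rough illustration of the minimum linear arrangement cost, the following sketch evaluates the weighted sum of position differences for a given ordering. It assumes that `w` is a symmetric similarity matrix, that `x[i]` is the position of element `i`, and that each unordered pair is counted once; the function name `la_sim` and the sample weights are made up for this example.

```python
def la_sim(w, x):
    """Sum of w[i][j] * |x[i] - x[j]| over unordered pairs i < j."""
    n = len(x)
    return sum(w[i][j] * abs(x[i] - x[j])
               for i in range(n) for j in range(i + 1, n))

# Hypothetical similarities among three elements.
w = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]

cost_a = la_sim(w, [1, 2, 3])   # elements placed in order 0, 1, 2
cost_b = la_sim(w, [1, 3, 2])   # elements 1 and 2 swapped
print(cost_a, cost_b)
```

The first ordering keeps the most similar pair (0 and 1) adjacent and therefore has the lower cost.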

Dendrogram orientation

In our particular problem we are also faced with an ordering task: finding a permutation of $\{1, \ldots, n\}$.

However, here we should not consider all $n!$ possible permutations, but only the $2^{n-1}$ that agree with the dendrogram's structure.

Using dynamic programming, the running time is exponential in the dendrogram's height, not in its size.
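To make the restricted search space concrete, here is a minimal sketch that picks the cheapest ordering among those consistent with the dendrogram. It uses plain exhaustive enumeration, not the dynamic programming mentioned on the slide, so it is only suitable for tiny inputs. The nested-tuple tree encoding, the function names, and the weights are assumptions for illustration.

```python
from itertools import product

def orientations(tree):
    """Yield every leaf ordering that agrees with the dendrogram."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    for a, b in product(list(orientations(tree[0])), list(orientations(tree[1]))):
        yield a + b
        yield b + a

def la_sim(w, order):
    """Linear arrangement cost of an ordering (each pair counted once)."""
    pos = {leaf: p for p, leaf in enumerate(order)}
    n = len(order)
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n))

def best_orientation(tree, w):
    """Exhaustive search over the 2^(n-1) orientations (not the paper's DP)."""
    return min(orientations(tree), key=lambda order: la_sim(w, order))

# Hypothetical similarities: 0 and 3 are strongly related across clusters.
w = [[0, 4, 0, 5],
     [4, 0, 0, 0],
     [0, 0, 0, 4],
     [5, 0, 4, 0]]
tree = ((0, 1), (2, 3))
best = best_orientation(tree, w)
best_cost = la_sim(w, best)
identity_cost = la_sim(w, [0, 1, 2, 3])
print(best, best_cost, identity_cost)
```

The best orientation places leaves 0 and 3 next to each other, which the ordering as drawn does not.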

Dendrogram orientation

We introduce an additional form of the cost function, for data given as distances. Here the goal is an ordering $x$ that maximizes

$$LA^{dist}(x) \stackrel{def}{=} \sum_{i,j} d_{ij} \cdot |x_i - x_j|$$

Dendrogram orientation

Given an ordered dendrogram $T$ and a node $v$:

$Leaves(v)$ is the set of leaves in the subtree rooted at $v$

$x$ is the ordering on the leaves

Let $S$ be $Leaves(v)$, $L$ the set of leaves to the left of $S$, and $R$ the set of leaves to the right of $S$.

If $|L| = l$ and $|S| = s$, we have $x(L) = \{1, \ldots, l\}$, $x(S) = \{l+1, \ldots, l+s\}$, $x(R) = \{l+s+1, \ldots, n\}$.

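Dendrogram orientation

A key concept of the algorithm is the local arrangement cost, defined as:

$$LocalLA^{T}(v) \stackrel{def}{=} \sum_{i,j \in S} w_{ij} \cdot |x_i - x_j| + \sum_{i \in S,\, j \in L} w_{ij} \cdot (x_i - l) + \sum_{i \in S,\, j \in R} w_{ij} \cdot (l + s - x_i) + \sum_{i \in L,\, j \in R} w_{ij} \cdot s$$

where $|L| = l$, $|S| = s$, $x(L) = \{1, \ldots, l\}$, $x(S) = \{l+1, \ldots, l+s\}$, $x(R) = \{l+s+1, \ldots, n\}$.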
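The local arrangement cost can be sketched in code under one reading of the definition: pairs inside S contribute their full position difference, pairs with one end in S contribute only the part of the edge that lies between positions l and l+s, and pairs that jump from L to R contribute the full width s. That reading, the function name `local_la`, and the sample data are assumptions for illustration.

```python
def local_la(w, x, S):
    """Local arrangement cost of the leaf set S under ordering x.

    x[i] is the 1-based position of element i; S must occupy
    consecutive positions l+1 .. l+s.
    """
    n = len(x)
    l = min(x[i] for i in S) - 1
    s = len(S)
    L = [i for i in range(n) if x[i] <= l]
    R = [i for i in range(n) if x[i] > l + s]
    inner = sum(w[i][j] * abs(x[i] - x[j]) for i in S for j in S if i < j)
    to_left = sum(w[i][j] * (x[i] - l) for i in S for j in L)
    to_right = sum(w[i][j] * (l + s - x[i]) for i in S for j in R)
    across = sum(w[i][j] * s for i in L for j in R)
    return inner + to_left + to_right + across

# Hypothetical data: four elements, all pairwise similarities equal to 1.
w = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
x = [1, 2, 3, 4]
cost_mid = local_la(w, x, {1, 2})          # S in the middle, l = 1, s = 2
cost_root = local_la(w, x, {0, 1, 2, 3})   # S is every leaf, L and R empty
print(cost_mid, cost_root)
```

When S contains every leaf, only the first term remains and the value equals the full linear arrangement cost of the ordering.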

Dendrogram orientation

Two additional related terms will be used:

$$LeftCut^{T}(v) \stackrel{def}{=} \sum_{i \in S,\, j \in L} w_{ij}, \qquad RightCut^{T}(v) \stackrel{def}{=} \sum_{i \in S,\, j \in R} w_{ij}$$

Another term that will be used in the algorithm:

$$InnerCut(v) = \sum_{i \in Leaves(v.left),\, j \in Leaves(v.right)} w_{ij}$$
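The three cut terms are plain sums of similarities across a boundary, which a short sketch can make concrete. It assumes a dendrogram stored as nested tuples whose left-to-right leaf order is the current ordering; the helper names and the weights are hypothetical.

```python
def leaves(tree):
    """Leaves of a nested-tuple dendrogram, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def cuts(w, root, v):
    """Return (LeftCut, RightCut, InnerCut) for internal node v of root."""
    order = leaves(root)
    S = leaves(v)
    start = order.index(S[0])
    L, R = order[:start], order[start + len(S):]
    left_cut = sum(w[i][j] for i in S for j in L)
    right_cut = sum(w[i][j] for i in S for j in R)
    inner_cut = sum(w[i][j] for i in leaves(v[0]) for j in leaves(v[1]))
    return left_cut, right_cut, inner_cut

# Hypothetical similarities and tree; v is the node joining leaves 1 and 2.
w = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]
root = (0, ((1, 2), 3))
v = root[1][0]
result = cuts(w, root, v)
print(result)
```

Here L is {0} and R is {3}, so the left cut sums the weights from leaves 1 and 2 to leaf 0, and the right cut sums their weights to leaf 3.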

Determining coordinates of the leaves

This step computes the exact gap between each two consecutive leaves.

Determining coordinates of the leaves

A better approach is to take a weighted average over all influenced leaf pairs: the gap $gap_i$ between consecutive leaves $i$ and $i+1$ is computed from the distances $d_{jk}$ of all leaf pairs $j, k$ separated by that gap ($j \le i < k$).
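One possible weighted-average scheme is sketched below. The exact weights used on the slide are not recoverable from this transcript, so the choice here is an assumption: each pair (j, k) that spans the gap proposes a per-gap length of d[j][k] / (k - j), and closer pairs get a larger weight of 1 / (k - j). Leaves are assumed to be numbered 0..n-1 in their final order.

```python
def leaf_gaps(d):
    """Gap between leaves i and i+1 as a weighted average over spanning pairs.

    ASSUMPTION: weight 1/(k-j) and per-gap estimate d[j][k]/(k-j);
    this weighting is illustrative, not taken from the source.
    """
    n = len(d)
    gaps = []
    for i in range(n - 1):
        num = den = 0.0
        for j in range(i + 1):              # leaves at or left of the gap
            for k in range(i + 1, n):       # leaves right of the gap
                span = k - j
                weight = 1.0 / span
                num += weight * d[j][k] / span
                den += weight
        gaps.append(num / den)
    return gaps

def leaf_coordinates(d):
    """x-coordinates obtained by accumulating the gaps from zero."""
    xs = [0.0]
    for g in leaf_gaps(d):
        xs.append(xs[-1] + g)
    return xs

# Sanity check on evenly spaced points 0, 1, 2 on a line.
points = [0.0, 1.0, 2.0]
d = [[abs(p - q) for q in points] for p in points]
gaps = leaf_gaps(d)
xs = leaf_coordinates(d)
print(gaps, xs)
```

For evenly spaced points every spanning pair proposes the same per-gap length, so any weighting of this kind returns the original spacing.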

Computing The y-Coordinates

Principal component analysis

Classical multidimensional scaling

Eigen-projection

Stress minimization
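Of the listed options, principal component analysis is the simplest to sketch. The code below is a generic PCA projection onto the first principal axis, computed by power iteration in plain Python. It does not model how the paper ties the y-coordinates to the already fixed x-coordinates, which the slide does not detail. The function name and sample data are hypothetical.

```python
def first_pc_scores(data, iters=200):
    """Project rows of `data` onto their first principal axis.

    Power iteration on X^T X; assumes the start vector is not
    orthogonal to the leading direction.
    """
    n, m = len(data), len(data[0])
    mean = [sum(row[c] for row in data) / n for c in range(m)]
    X = [[row[c] - mean[c] for c in range(m)] for row in data]
    v = [1.0] * m
    for _ in range(iters):
        proj = [sum(r[c] * v[c] for c in range(m)) for r in X]          # X v
        v = [sum(X[r][c] * proj[r] for r in range(n)) for c in range(m)]  # X^T (X v)
        norm = sum(a * a for a in v) ** 0.5
        v = [a / norm for a in v]
    return [sum(r[c] * v[c] for c in range(m)) for r in X]

# Hypothetical 2-D points lying close to a diagonal line.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.1)]
ys = first_pc_scores(data)
print(ys)
```

The scores are centered around zero and preserve the order of the points along the diagonal.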

Result

Odors dataset: consists of 30 volatile odorous pure chemicals

Contains 262 elements; natural clusters: 30

The dendrogram is constructed using UPGMA agglomerative clustering
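For readers unfamiliar with UPGMA, the following is a minimal sketch of average-linkage agglomerative clustering on a distance matrix. It is a naive version for small inputs, not the implementation used in the experiments; the nested-tuple output format and the sample points are assumptions.

```python
def upgma(d):
    """Build a dendrogram (nested tuples) by average-linkage merging."""
    # cluster id -> (subtree, list of member leaves)
    clusters = {i: (i, [i]) for i in range(len(d))}
    next_id = len(d)

    def avg_dist(a, b):
        # UPGMA distance: mean of all pairwise leaf distances
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(d[i][j] for i in ma for j in mb) / (len(ma) * len(mb))

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(p, q) for k, p in enumerate(ids) for q in ids[k + 1:]]
        a, b = min(pairs, key=lambda pq: avg_dist(*pq))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    (tree, _members), = clusters.values()
    return tree

# Hypothetical 1-D points forming two obvious groups.
points = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in points] for p in points]
tree = upgma(d)
print(tree)
```

The two tight pairs are merged first and joined at the root, giving a full binary tree over the four leaves.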

Result

Iris dataset: an example of discriminant analysis

Contains 150 elements; natural clusters: 3

Result

Gene expression data: CDC15-synchronized cell cycle

A much larger dataset of gene-expression data

Contains 6113 elements

Related Work

TreeView: a dendrogram over a color-coded matrix

Discussion

The method succeeds in integrating two key methods in exploratory data analysis: cluster analysis and low-dimensional embedding.

It has two unique properties:

guaranteed separation between any kind of given clusters

the ability to deal with a predefined hierarchical clustering

Personal Opinion

Advantages
- Successfully integrates two methods, cluster analysis and low-dimensional embedding.
- Makes the analysis more intuitive.

Application
- Real data for clustering and analysis.
- May solve the problem of lacking clustering information.

Limitation
- Cannot show the real shape of clusters.
