hierarchical and pyramidal clustering for symbolic...

1

1

HIERARCHICAL AND PYRAMIDAL CLUSTERING

FOR SYMBOLIC DATA

Paula Brito

Univ. Porto, Portugal

2

OUTLINE

•The hierachical model

• The pyramidal model

• Numerical hierarchical / pyramidal clustering

•Symbolic Clustering

•The property of completeness

• The generality degree

• The clustering algorithm

•Examples

•The HIPYR Module

2

3

Hierarchical Model :set of nested partitions

• E ∈∈∈∈ H

• ∀∀∀∀ a ∈∈∈∈ ΕΕΕΕ , { a } ∈∈∈∈ H

• ∀∀∀∀h, h' ∈∈∈∈ H, h ∩∩∩∩ h' = Ø or h ⊆⊆⊆⊆ h' or h' ⊆⊆⊆⊆ h

Let ΕΕΕΕ be the observations set (the set being clustered)

Hierarchy on E :

Family H on non-empty subsets of E such that

4

x1 x2 x3 x4 x5

h

h'

x3

x4 x5

x1 x2

h

h'

3

5

PYRAMIDAL REPRESENTATION

PYRAMID : set of nested overlappings

P = { {x1} , {x2} , {x3} , {x4} , {x5} , {x1 , x2} , {x3 , x4} ,

{x4 , x5} , {x1 , x2 , x3} , {x1 , x2 , x3 , x4} , {x3 , x4 , x5} ,

{x1 , x2 , x3 , x4 , x5 } }

x1 x2 x3 x4 x5

x3

x5

x1 x2

x4

Diday (1984, 1986), Bertrand, Diday (1985), Bertrand,(1986)

6

PYRAMID P on ΕΕΕΕ

Family on non-empty subsets of E such that :

• E ∈∈∈∈ P

• ∀∀∀∀ a ∈∈∈∈ ΕΕΕΕ , { a } ∈∈∈∈ P

• ∀∀∀∀ p, p' ∈∈∈∈ P, p ∩∩∩∩ p' = Ø or p ∩∩∩∩ p' ∈∈∈∈ P

• there exists a linear order θθθθ / every element of P is an

interval of θθθθ

Pyramidal model :

x1 x2 x3 x4 x5

Hierarchy : nested partitions

Pyramid : nested overlappings

Clustering

Seriation

4

7

SUCCESSOR AND PREDECESSOR

p SUCCESSOR of p’ if

- p ⊆⊆⊆⊆ p’

- ¬¬¬¬∃∃∃∃ p’’ / p ⊆⊆⊆⊆ p” ⊆⊆⊆⊆ p’

p’ is a PREDECESSOR of p

Pyramid : Each cluster has at most TWO predecessors

x1 x2 x3 x4 x5

Hierarchy : Each cluster has at most ONE predecessor

x1 x2 x3 x4 x5

h

h'

8

ASCENDING CLUSTERING ALGORITHM

Starting with the one cluster elements,

merge at each step the MERGEABLE clusters for which

the dissimilarity (aggregation index) is MINIMUM

MERGEABLE CLUSTERS :

→ if the structure is a hierarchy : none of them has

been aggregated before ;

→ if the structure is a pyramid : none of them has been

aggregated twice, and p is an interval of a total order θθθθ

on E

5

9

Given a set of symbolic objects ,

build an hierarchical / pyramidal clustering

such that each cluster is a pair

SYMBOLIC CLUSTERING

EACH CLUSTER HAS AN AUTOMATIC SYMBOLIC

REPRESENTATION

BY A SYMBOLIC OBJECT

ndescriptio its - INTENSION

members its - EXTENSION

OBJECTIVE :

10

Symbolic clustering methods need:

• Generalization Operator

C ⊆⊆⊆⊆ C’

s’ (representing C’) is more general than

s (representing C)

• Generality degree measure

6

11

GENERALISATION

s is more general than s’ if its extent contains

the extent of s’

s’ is more specific than s

Generalisation of two symbolic objects s and s’ :

determining s’’ : s’’ is more general than both s and s’.

This procedure differs according to the variable type :

s ≤≤≤≤ s ∪∪∪∪ s’ and s’≤≤≤≤ s ∪∪∪∪ s’

ext (s ∪∪∪∪ s’) ⊇⊇⊇⊇ ext (s) and ext (s ∪∪∪∪s’) ⊇⊇⊇⊇ ext (s’)

12

1) Interval variables

s1 = [ y ∈∈∈∈ [a1, b1] ]

s2 = [ y ∈∈∈∈ [a2, b2] ]

s1 ∪∪∪∪ s2 = [ y ∈∈∈∈ [min {a1,a2}, max {b1,b2}]]

s1 = [ time ∈∈∈∈ [5, 15] ]

s2 = [ time ∈∈∈∈ [10, 20] ]

s1 ∪∪∪∪ s2 = [ time ∈∈∈∈ [ 5, 20 ]]

Example :

7

13

2) Multi-valued categorical variables

s1 = [ y ∈∈∈∈ V1]

s2 = [ y ∈∈∈∈ V2 ]

s1 ∪∪∪∪ s2 = [ y1 ∈∈∈∈ V1 ∪∪∪∪ V2 ]

s1 = [job ∈∈∈∈{secretary, teacher}]

s2 = [job ∈∈∈∈ {employee}]

s1 ∪∪∪∪ s2 = [job ∈∈∈∈{secretary, teacher, employee}]

Example :

14

3) Modal variables

Two possibilities proposed :

take for each category the Minimum of its frequencies

take for each category the Maximum of its frequencies

8

15

a) Generalisation by the Maximum

= ] } )(pm , ,)(pm [y ] } )(pm , ,)(pm [y 2

kk

2

11

1

kk

1

11 …………{{{{====∪∪∪∪…………{{{{====

] } )(pm , ,)(pm [y kk11 …………{{{{====

}}}}2

j

1

jj p , p {Max p ====with

[Type of Job ∈∈∈∈{ (0.3) administration, (0.7) teaching}] ∪∪∪∪

[Type of Job ∈∈∈∈{ (0.6) admin., (0.2) teaching , (0.2) secretary}]

= [Type of Job ∈∈∈∈{ (0.6) admin., (0.7) teaching , (0.2) secret.}]

Example :

Extension : k} , 1,=j , p p : a { j

a

j …………≤≤≤≤

“at most” principle

16

b) Generalisation by the Minimum

= ] } )(pm , ,)(pm [y ] } )(pm , ,)(pm [y 2

kk

2

11

1

kk

1

11 …………{{{{====∪∪∪∪…………{{{{====

] } )(pm , ,)(pm [y kk11 …………{{{{====

with }}}}2

j

1

jj p , p { Min p ====

[[Type of Job ∈∈∈∈{ (0.3) administration, (0.7) teaching}] ∪∪∪∪

[Type of Job ∈∈∈∈{ (0.6) admin., (0.2) teaching , (0.2) secretary}]

= [Type of Job ∈∈∈∈{ (0.3) admin., (0.2) teaching}]

Example :

Extension :

“at least” principle

k} , 1,=j , p p : a { j

a

j …………≥≥≥≥

9

17

COMPLETENESS AND COMPLETE OBJECTS

COMPLETE symbolic object

• defined by all the properties that characterize its

extension

• the most specific to fulfill this condition

Galois connections, lattice theory

Union preservs completeness

Barbut & Monjardet (1970); Wille (1982); Ganter (1984)

18

y1

y2

y3

Age Weight Sex

w120 45 F

w2

50 55 F

w3

30 50 F

w4

60 60 M

a = [ Weight = [ 40, 50 ] ] ∧∧∧∧ [ Age = [ 20, 50 ] ]

ext a = {w1 , w3}

a IS NOT COMPLETE

a = [Weight = [45, 50]] ∧∧∧∧ [Age = [20, 30]] ∧∧∧∧ [Sex ={F}]

IS COMPLETE

Example:

10

19

THE BASIC ALGORITHM

• p1, p2 can be merged together

• s is complete

• s is more general than s1, s2 ⇒⇒⇒⇒ union s1 ∪∪∪∪ s2

• extE s = p

Non - uniqueness ⇒⇒⇒⇒ numerical criterion

Clusters with more specific descriptions

are formed first

STARTING WITH THE ONE-OBJECT CLUSTERS

{ai}, i = 1,…,n

At each step, FORM A CLUSTER p union of p1 , p2 ,

REPRESENTED BY s such that

20

GENERALITY DEGREE

a = ∧∧∧∧ [ yi ∈∈∈∈ Vi ], Vi ⊆⊆⊆⊆ Oi Oi bounded

i

e = a if

)i

(eG = )

i(O m

)i

(V m

= (a)G

p

1=i

p

1=i

∧∧∧∧

∏∏∏∏∏∏∏∏

PROPORTION of the description space covered by a

the more possible members of the extension of a ,

the greater the generality degree of a

11

21

Interval variables

m(Vi) = max Vi – min Vi (range)

55,045

25

1560

2045)11e(G ==

−

−=

2,010000

2000

010000

10003000)12e(G ==

−

−=

11,02,0*55,0)1s(G ==

Example :

Describing groups of people

defined on variables age and salary.

Age ranges from 15 to 60 , salary ranges from 0 to 10000.

Consider a group described by a symbolic object

s1 = [age ∈∈∈∈ [ 20 , 45]] ∧∧∧∧ [salary ∈∈∈∈ [1000 , 3000]] = e11 ∧∧∧∧ e12

22

Multi-valued categorical variables

m(Vi) = card Vi

Example:

Describing grous of people from the UE,

defined on variables sex and nationality.

5,02

1)11e(G == 08,0

25

2)12e(G ==

04,008,05,0)1s(G =×=

Consider one group described by :

s1 = [sex ∈∈∈∈ { M }] ∧∧∧∧ [nationality ∈∈∈∈ {French, English}]

12

23

∑∑∑∑∏∏∏∏========

====jk

i

ij

p

j j

pk

aG11

1

1)(

which is the affinity coefficient (Matusita, 1951) between

(p1,…,pk) and the uniform distribution

This means that we consider an object

the more general the more similar it is

to the uniform distribution

G1(a) is maximum (=1) when pi = 1/k, i=1,…k : uniform

Modal variables

a) Generalising by the Maximum

24

∑∑∑∑∏∏∏∏========

−−−−−−−−

====jk

i

ij

p

j jj

pkk

aG11

2 )1()1(

1)(

Again, G2(a) is maximum (=1)

when pi = 1/k, i=1,…k : uniform

b) Generalising by the minimum

13

25

ALGORITHM

• p1, p2 can be merged together

• s = s1 ∪∪∪∪ s2 (complete)

• extE s = p

• G(s) is minimum

Starting with the one-object clusters {ai}, i = 1,…,n

At each step, FORM A CLUSTER p union of p1 , p2 ,

REPRESENTED BY s such that

26

each cluster is represented by a

"complete" symbolic object

whose extension is the cluster itself

The algorithm builds a hierarchy / pyramid on E, such that

AUTOMATIC SYMBOLIC REPRESENTATION

OF THE CLUSTERS

CLUSTER = (p, s) p = ext s

14

27

y1 y2 y3 y4

w1 1 1 1 2

w2 1 2 1 3

w3 1 2 2 2

w4 2 1 1 2

w5 3 3 2 1

28

w 4 w 1 w2 w3 w5

2/54

4 /54

8/54

16/54

24 /54

32/54

1

P 1P2

P3

P4P5 P 6

P7

P8

P9

P10

P6 : : : : s6 = [ y1 = {1} ] ∧∧∧∧ [ y2 = {1,2} ] ∧∧∧∧ [ y4 = {2,3} ]

P7 : : : : s7 = [ y1 = {1,2} ] ∧∧∧∧ [ y2 = {1,2}] ∧∧∧∧ [y4 = {2,3} ]

15

29

Each of the 12 groups is described

by interval variables, representing the variation

of the underlying variables in the class.

Application

INE Labour Force Survey data :

12 sex * age groups

30

16

31

32

P36 [Looking_for_job=[2.010,3.595]U[4.579,6.707]

]^

[Financial Act.=[3.639,5.851]]^

[Public_ Admin=[1.336,2.620]U[3.004,4.871]]^

[Hotels_Rest_=[1.874,3.497]]^

[Commerce=[9.582,13.161]]^

[Construction=[11.244,14.526]]^

[Education=[4.589,8.829]]^

[Elect__gas_water=[0.347,1.757]]^

[Industry=[32.700,37.277]U[38.671,43.205]]^

[Other_Serv_=[4.316,7.551]]^

[Primary=[5.921,9.799]]^

[Health=[1.857,3.622]]^

[Transp_Comunic_=[2.352,4.675]]^

[None=[2.603,4.285]U[4.609,6.839]]^...

17

33

its members

Man 15 to 54

Women 15 to 24

are the only elements

fulfilling these conditions.

34

Application

Cultural activities of 11 socio-professional groups

1509 individual observations grouped

18

35

36

19

37

The HIPYR Module

Objective :

Perform Hierarchical or Pyramidal clustering

on a set of SO’s

� from a dissimilarity matrix

→→→→ numerical clustering

� directly based on the data set

→→→→ symbolic clustering: clusters are "concepts "

38

The HIPYR module

20

39

Main Parameters

Structure :

Data Source :

Aggregation Index :

Hierarchy or Pyramid

� Dissimilarity Matrix (Numerical Clustering)

� Symbolic objects (Symbolic Clustering)

� Numerical Clustering : Maximum, Minimum, Average,

Diameter

� Symbolic Clustering : Minimum Generality

Minimum Increase in Generality

40

Main Parameters

�Use Taxonomies for generalization

(nominal or categorical multi-valued variables) : Y, N

� Select “best” classes : Y, N

� Write induced dissimilarity/generality matrix : Y, N

� Order Variable (optional) : quantitative single variable;

to impose an order compatible with the pyramid

� Modal variables generalization :

� Maximum

� Minimum

21

41

Main Parameters

42

Main Parameters

22

43

Induced dissimilarity/generality matrix

For each pair of SO, si, sj, i, j, =1,..., n,

d*(si, sj,) = index (height) of the “smallest” class

that contains si and sj

d*(si, sj,) = Min {f(C), si ∈∈∈∈ C, sj, ∈∈∈∈ C}

Evaluation of the obtained indexed hierarchy /

pyramid:

Comparision between the initial and the induced

dissimilarity/generality matrices.

44

Evaluation value

For si, sj, i, j, =1,..., n, d(si, sj) :

�the given dissimilarity matrix (numerical clustering)

�generality degree of si ∪∪∪∪ sj (symbolic clustering)

∑∑∑∑ ∑∑∑∑

∑∑∑∑ ∑∑∑∑

−−−−

==== ++++====

−−−−

==== ++++====

−−−−

====1n

1i

n

1ijji

1n

1i

n

1ij

2jiji

)s,s(d

))s,s(*d)s,s(d(

EV

23

45

Cluster selectionIdentify the most interesting clusters :

A cluster is “interesting” if its variability is

small as compared to its predecessors.

Variability indicated by index values f(h).

Compute mean value and standard deviation

of height increase values.

A class is selected if the corresponding increase value

is more than 2 stand. dev. over the mean value.

46

Cluster selection

C

24

47

HIPYROUTPUT

� Text file

� Sodas file

� Interactive Graphical Representation (VPYR)

48

� The labels of the individuals

� The labels of the variables

� The description of each node :

� the symbolic object associated to each node

� its extension

� Evaluation value

�Selected clusters, if asked for

�The induced matrix, if asked for

The output listing contains:

25

49

Graphical Representation

50

Options of the Graphical

Representation

A cluster is selected by clicking on it.

�description of the cluster in terms of

list of chosen variables

�representation by a Zoom Star

26

51


52


27

53

HIPYR HIPYR -- VPYRVPYR

54

Pruning the hierarchy or pyramid using the

aggregation heights as a criterion.

Suppressing cluster p if :

f(p’) - f(p) < αααα f(E) ∧∧∧∧ p has a single predecessor


Representation

Rate of simplification αααα

chosen by the user,

new graphic window with

the simplified structure.

28

55

Pruning

56

Rule Generation

If the hierarchy/pyramid are built from a symbolic

data table, rules may be generated and saved in a

specified file.

(p1,s1) (p2,s2)

(p,s)

(p1,s1) (p2,s2)

(p,s)

Fission method :

s ⇒⇒⇒⇒ s1 ∨∨∨∨ s2

Fussion method

(pyramids only) :

s1 ∧∧∧∧ s2⇒⇒⇒⇒ s

29

57

Rule generation

58

Reduction

Should the user be interested in a particular cluster,

he may obtain a window with the structure

restricted to this cluster and its successors.


Representation

30

59

Reduction

hierarchical and pyramidal clustering for symbolic...

Documents