hierarchical and pyramidal clustering for symbolic...
TRANSCRIPT
1
1
HIERARCHICAL AND PYRAMIDAL CLUSTERING
FOR SYMBOLIC DATA
Paula Brito
Univ. Porto, Portugal
2
OUTLINE
•The hierachical model
• The pyramidal model
• Numerical hierarchical / pyramidal clustering
•Symbolic Clustering
•The property of completeness
• The generality degree
• The clustering algorithm
•Examples
•The HIPYR Module
2
3
Hierarchical Model :set of nested partitions
• E ∈∈∈∈ H
• ∀∀∀∀ a ∈∈∈∈ ΕΕΕΕ , { a } ∈∈∈∈ H
• ∀∀∀∀h, h' ∈∈∈∈ H, h ∩∩∩∩ h' = Ø or h ⊆⊆⊆⊆ h' or h' ⊆⊆⊆⊆ h
Let ΕΕΕΕ be the observations set (the set being clustered)
Hierarchy on E :
Family H on non-empty subsets of E such that
4
x1 x2 x3 x4 x5
h
h'
x3
x4 x5
x1 x2
h
h'
3
5
PYRAMIDAL REPRESENTATION
PYRAMID : set of nested overlappings
P = { {x1} , {x2} , {x3} , {x4} , {x5} , {x1 , x2} , {x3 , x4} ,
{x4 , x5} , {x1 , x2 , x3} , {x1 , x2 , x3 , x4} , {x3 , x4 , x5} ,
{x1 , x2 , x3 , x4 , x5 } }
x1 x2 x3 x4 x5
x3
x5
x1 x2
x4
Diday (1984, 1986), Bertrand, Diday (1985), Bertrand,(1986)
6
PYRAMID P on ΕΕΕΕ
Family on non-empty subsets of E such that :
• E ∈∈∈∈ P
• ∀∀∀∀ a ∈∈∈∈ ΕΕΕΕ , { a } ∈∈∈∈ P
• ∀∀∀∀ p, p' ∈∈∈∈ P, p ∩∩∩∩ p' = Ø or p ∩∩∩∩ p' ∈∈∈∈ P
• there exists a linear order θθθθ / every element of P is an
interval of θθθθ
Pyramidal model :
x1 x2 x3 x4 x5
Hierarchy : nested partitions
Pyramid : nested overlappings
Clustering
Seriation
4
7
SUCCESSOR AND PREDECESSOR
p SUCCESSOR of p’ if
- p ⊆⊆⊆⊆ p’
- ¬¬¬¬∃∃∃∃ p’’ / p ⊆⊆⊆⊆ p” ⊆⊆⊆⊆ p’
p’ is a PREDECESSOR of p
Pyramid : Each cluster has at most TWO predecessors
x1 x2 x3 x4 x5
Hierarchy : Each cluster has at most ONE predecessor
x1 x2 x3 x4 x5
h
h'
8
ASCENDING CLUSTERING ALGORITHM
Starting with the one cluster elements,
merge at each step the MERGEABLE clusters for which
the dissimilarity (aggregation index) is MINIMUM
MERGEABLE CLUSTERS :
→ if the structure is a hierarchy : none of them has
been aggregated before ;
→ if the structure is a pyramid : none of them has been
aggregated twice, and p is an interval of a total order θθθθ
on E
5
9
Given a set of symbolic objects ,
build an hierarchical / pyramidal clustering
such that each cluster is a pair
SYMBOLIC CLUSTERING
EACH CLUSTER HAS AN AUTOMATIC SYMBOLIC
REPRESENTATION
BY A SYMBOLIC OBJECT
ndescriptio its - INTENSION
members its - EXTENSION
OBJECTIVE :
10
Symbolic clustering methods need:
• Generalization Operator
C ⊆⊆⊆⊆ C’
s’ (representing C’) is more general than
s (representing C)
• Generality degree measure
6
11
GENERALISATION
s is more general than s’ if its extent contains
the extent of s’
s’ is more specific than s
Generalisation of two symbolic objects s and s’ :
determining s’’ : s’’ is more general than both s and s’.
This procedure differs according to the variable type :
s ≤≤≤≤ s ∪∪∪∪ s’ and s’≤≤≤≤ s ∪∪∪∪ s’
ext (s ∪∪∪∪ s’) ⊇⊇⊇⊇ ext (s) and ext (s ∪∪∪∪s’) ⊇⊇⊇⊇ ext (s’)
12
1) Interval variables
s1 = [ y ∈∈∈∈ [a1, b1] ]
s2 = [ y ∈∈∈∈ [a2, b2] ]
s1 ∪∪∪∪ s2 = [ y ∈∈∈∈ [min {a1,a2}, max {b1,b2}]]
s1 = [ time ∈∈∈∈ [5, 15] ]
s2 = [ time ∈∈∈∈ [10, 20] ]
s1 ∪∪∪∪ s2 = [ time ∈∈∈∈ [ 5, 20 ]]
Example :
7
13
2) Multi-valued categorical variables
s1 = [ y ∈∈∈∈ V1]
s2 = [ y ∈∈∈∈ V2 ]
s1 ∪∪∪∪ s2 = [ y1 ∈∈∈∈ V1 ∪∪∪∪ V2 ]
s1 = [job ∈∈∈∈{secretary, teacher}]
s2 = [job ∈∈∈∈ {employee}]
s1 ∪∪∪∪ s2 = [job ∈∈∈∈{secretary, teacher, employee}]
Example :
14
3) Modal variables
Two possibilities proposed :
take for each category the Minimum of its frequencies
take for each category the Maximum of its frequencies
8
15
a) Generalisation by the Maximum
= ] } )(pm , ,)(pm [y ] } )(pm , ,)(pm [y 2
kk
2
11
1
kk
1
11 …………{{{{====∪∪∪∪…………{{{{====
] } )(pm , ,)(pm [y kk11 …………{{{{====
}}}}2
j
1
jj p , p {Max p ====with
[Type of Job ∈∈∈∈{ (0.3) administration, (0.7) teaching}] ∪∪∪∪
[Type of Job ∈∈∈∈{ (0.6) admin., (0.2) teaching , (0.2) secretary}]
= [Type of Job ∈∈∈∈{ (0.6) admin., (0.7) teaching , (0.2) secret.}]
Example :
Extension : k} , 1,=j , p p : a { j
a
j …………≤≤≤≤
“at most” principle
16
b) Generalisation by the Minimum
= ] } )(pm , ,)(pm [y ] } )(pm , ,)(pm [y 2
kk
2
11
1
kk
1
11 …………{{{{====∪∪∪∪…………{{{{====
] } )(pm , ,)(pm [y kk11 …………{{{{====
with }}}}2
j
1
jj p , p { Min p ====
[[Type of Job ∈∈∈∈{ (0.3) administration, (0.7) teaching}] ∪∪∪∪
[Type of Job ∈∈∈∈{ (0.6) admin., (0.2) teaching , (0.2) secretary}]
= [Type of Job ∈∈∈∈{ (0.3) admin., (0.2) teaching}]
Example :
Extension :
“at least” principle
k} , 1,=j , p p : a { j
a
j …………≥≥≥≥
9
17
COMPLETENESS AND COMPLETE OBJECTS
COMPLETE symbolic object
• defined by all the properties that characterize its
extension
• the most specific to fulfill this condition
Galois connections, lattice theory
Union preservs completeness
Barbut & Monjardet (1970); Wille (1982); Ganter (1984)
18
y1
y2
y3
Age Weight Sex
w120 45 F
w2
50 55 F
w3
30 50 F
w4
60 60 M
a = [ Weight = [ 40, 50 ] ] ∧∧∧∧ [ Age = [ 20, 50 ] ]
ext a = {w1 , w3}
a IS NOT COMPLETE
a = [Weight = [45, 50]] ∧∧∧∧ [Age = [20, 30]] ∧∧∧∧ [Sex ={F}]
IS COMPLETE
Example:
10
19
THE BASIC ALGORITHM
• p1, p2 can be merged together
• s is complete
• s is more general than s1, s2 ⇒⇒⇒⇒ union s1 ∪∪∪∪ s2
• extE s = p
Non - uniqueness ⇒⇒⇒⇒ numerical criterion
Clusters with more specific descriptions
are formed first
STARTING WITH THE ONE-OBJECT CLUSTERS
{ai}, i = 1,…,n
At each step, FORM A CLUSTER p union of p1 , p2 ,
REPRESENTED BY s such that
20
GENERALITY DEGREE
a = ∧∧∧∧ [ yi ∈∈∈∈ Vi ], Vi ⊆⊆⊆⊆ Oi Oi bounded
i
e = a if
)i
(eG = )
i(O m
)i
(V m
= (a)G
p
1=i
p
1=i
∧∧∧∧
∏∏∏∏∏∏∏∏
PROPORTION of the description space covered by a
the more possible members of the extension of a ,
the greater the generality degree of a
11
21
Interval variables
m(Vi) = max Vi – min Vi (range)
55,045
25
1560
2045)11e(G ==
−
−=
2,010000
2000
010000
10003000)12e(G ==
−
−=
11,02,0*55,0)1s(G ==
Example :
Describing groups of people
defined on variables age and salary.
Age ranges from 15 to 60 , salary ranges from 0 to 10000.
Consider a group described by a symbolic object
s1 = [age ∈∈∈∈ [ 20 , 45]] ∧∧∧∧ [salary ∈∈∈∈ [1000 , 3000]] = e11 ∧∧∧∧ e12
22
Multi-valued categorical variables
m(Vi) = card Vi
Example:
Describing grous of people from the UE,
defined on variables sex and nationality.
5,02
1)11e(G == 08,0
25
2)12e(G ==
04,008,05,0)1s(G =×=
Consider one group described by :
s1 = [sex ∈∈∈∈ { M }] ∧∧∧∧ [nationality ∈∈∈∈ {French, English}]
12
23
∑∑∑∑∏∏∏∏========
====jk
i
ij
p
j j
pk
aG11
1
1)(
which is the affinity coefficient (Matusita, 1951) between
(p1,…,pk) and the uniform distribution
This means that we consider an object
the more general the more similar it is
to the uniform distribution
G1(a) is maximum (=1) when pi = 1/k, i=1,…k : uniform
Modal variables
a) Generalising by the Maximum
24
∑∑∑∑∏∏∏∏========
−−−−−−−−
====jk
i
ij
p
j jj
pkk
aG11
2 )1()1(
1)(
Again, G2(a) is maximum (=1)
when pi = 1/k, i=1,…k : uniform
b) Generalising by the minimum
13
25
ALGORITHM
• p1, p2 can be merged together
• s = s1 ∪∪∪∪ s2 (complete)
• extE s = p
• G(s) is minimum
Starting with the one-object clusters {ai}, i = 1,…,n
At each step, FORM A CLUSTER p union of p1 , p2 ,
REPRESENTED BY s such that
26
each cluster is represented by a
"complete" symbolic object
whose extension is the cluster itself
The algorithm builds a hierarchy / pyramid on E, such that
AUTOMATIC SYMBOLIC REPRESENTATION
OF THE CLUSTERS
CLUSTER = (p, s) p = ext s
14
27
y1 y2 y3 y4
w1 1 1 1 2
w2 1 2 1 3
w3 1 2 2 2
w4 2 1 1 2
w5 3 3 2 1
28
w 4 w 1 w2 w3 w5
2/54
4 /54
8/54
16/54
24 /54
32/54
1
P 1P2
P3
P4P5 P 6
P7
P8
P9
P10
P6 : : : : s6 = [ y1 = {1} ] ∧∧∧∧ [ y2 = {1,2} ] ∧∧∧∧ [ y4 = {2,3} ]
P7 : : : : s7 = [ y1 = {1,2} ] ∧∧∧∧ [ y2 = {1,2}] ∧∧∧∧ [y4 = {2,3} ]
15
29
Each of the 12 groups is described
by interval variables, representing the variation
of the underlying variables in the class.
Application
INE Labour Force Survey data :
12 sex * age groups
30
16
31
32
P36 [Looking_for_job=[2.010,3.595]U[4.579,6.707]
]^
[Financial Act.=[3.639,5.851]]^
[Public_ Admin=[1.336,2.620]U[3.004,4.871]]^
[Hotels_Rest_=[1.874,3.497]]^
[Commerce=[9.582,13.161]]^
[Construction=[11.244,14.526]]^
[Education=[4.589,8.829]]^
[Elect__gas_water=[0.347,1.757]]^
[Industry=[32.700,37.277]U[38.671,43.205]]^
[Other_Serv_=[4.316,7.551]]^
[Primary=[5.921,9.799]]^
[Health=[1.857,3.622]]^
[Transp_Comunic_=[2.352,4.675]]^
[None=[2.603,4.285]U[4.609,6.839]]^...
17
33
its members
Man 15 to 54
Women 15 to 24
are the only elements
fulfilling these conditions.
34
Application
Cultural activities of 11 socio-professional groups
1509 individual observations grouped
18
35
36
19
37
The HIPYR Module
Objective :
Perform Hierarchical or Pyramidal clustering
on a set of SO’s
� from a dissimilarity matrix
→→→→ numerical clustering
� directly based on the data set
→→→→ symbolic clustering: clusters are "concepts "
38
The HIPYR module
20
39
Main Parameters
Structure :
Data Source :
Aggregation Index :
Hierarchy or Pyramid
� Dissimilarity Matrix (Numerical Clustering)
� Symbolic objects (Symbolic Clustering)
� Numerical Clustering : Maximum, Minimum, Average,
Diameter
� Symbolic Clustering : Minimum Generality
Minimum Increase in Generality
40
Main Parameters
�Use Taxonomies for generalization
(nominal or categorical multi-valued variables) : Y, N
� Select “best” classes : Y, N
� Write induced dissimilarity/generality matrix : Y, N
� Order Variable (optional) : quantitative single variable;
to impose an order compatible with the pyramid
� Modal variables generalization :
� Maximum
� Minimum
21
41
Main Parameters
42
Main Parameters
22
43
Induced dissimilarity/generality matrix
For each pair of SO, si, sj, i, j, =1,..., n,
d*(si, sj,) = index (height) of the “smallest” class
that contains si and sj
d*(si, sj,) = Min {f(C), si ∈∈∈∈ C, sj, ∈∈∈∈ C}
Evaluation of the obtained indexed hierarchy /
pyramid:
Comparision between the initial and the induced
dissimilarity/generality matrices.
44
Evaluation value
For si, sj, i, j, =1,..., n, d(si, sj) :
�the given dissimilarity matrix (numerical clustering)
�generality degree of si ∪∪∪∪ sj (symbolic clustering)
∑∑∑∑ ∑∑∑∑
∑∑∑∑ ∑∑∑∑
−−−−
==== ++++====
−−−−
==== ++++====
−−−−
====1n
1i
n
1ijji
1n
1i
n
1ij
2jiji
)s,s(d
))s,s(*d)s,s(d(
EV
23
45
Cluster selectionIdentify the most interesting clusters :
A cluster is “interesting” if its variability is
small as compared to its predecessors.
Variability indicated by index values f(h).
Compute mean value and standard deviation
of height increase values.
A class is selected if the corresponding increase value
is more than 2 stand. dev. over the mean value.
46
Cluster selection
C
24
47
HIPYROUTPUT
� Text file
� Sodas file
� Interactive Graphical Representation (VPYR)
48
� The labels of the individuals
� The labels of the variables
� The description of each node :
� the symbolic object associated to each node
� its extension
� Evaluation value
�Selected clusters, if asked for
�The induced matrix, if asked for
The output listing contains:
25
49
Graphical Representation
50
Options of the Graphical
Representation
A cluster is selected by clicking on it.
�description of the cluster in terms of
list of chosen variables
�representation by a Zoom Star
26
51
Graphical Representation
52
Graphical Representation
27
53
HIPYR HIPYR -- VPYRVPYR
54
Pruning the hierarchy or pyramid using the
aggregation heights as a criterion.
Suppressing cluster p if :
f(p’) - f(p) < αααα f(E) ∧∧∧∧ p has a single predecessor
Options of the Graphical
Representation
Rate of simplification αααα
chosen by the user,
new graphic window with
the simplified structure.
28
55
Pruning
56
Rule Generation
If the hierarchy/pyramid are built from a symbolic
data table, rules may be generated and saved in a
specified file.
(p1,s1) (p2,s2)
(p,s)
(p1,s1) (p2,s2)
(p,s)
Fission method :
s ⇒⇒⇒⇒ s1 ∨∨∨∨ s2
Fussion method
(pyramids only) :
s1 ∧∧∧∧ s2⇒⇒⇒⇒ s
29
57
Rule generation
58
Reduction
Should the user be interested in a particular cluster,
he may obtain a window with the structure
restricted to this cluster and its successors.
Options of the Graphical
Representation
30
59
Reduction