
Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Advisor : Dr. Hsu

Presenter : Jing-Wei Lin


N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
SOM and AOI
GSOM and EAOI
Exploratory clustering and pattern extraction
Experimental results
Conclusions


Motivation

A successful integration relies on appropriate individual techniques. However, the traditional self-organizing map (SOM) and attribute-oriented induction (AOI) have some drawbacks.

The traditional SOM is incapable of directly handling categorical data.

The AOI may fail to preserve the major values of an attribute, leading to over-generalization. For example, suppose the salary data contain 30 records from Taipei and one record each from Taoyuan and Hsinchu; after AOI processing, the value representing income in North Taiwan may be misleading.


SOM

The SOM is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid, usually two-dimensional, and preserves the topological relationships of the original data.


AOI

Attribute-oriented induction extracts data patterns in a large amount of data and produces a set of concise rules, which represent the general patterns hidden in the data.

Note: AOI is a technique for extracting data characteristics from relational databases.


Objective

Propose a generalized self-organizing map (GSOM) and an extended attribute-oriented induction (EAOI), which not only overcome the drawbacks of their original algorithms but also provide additional analysis capabilities.


Introduction

Among unsupervised clustering techniques, much attention has been paid to the self-organizing map (SOM), which projects high-dimensional data onto low-dimensional grids without losing their topological order.

Regarding pattern extraction techniques, attribute-oriented induction (AOI) is a popular and effective approach.


Introduction (cont.)

The integrated analysis framework works as follows: train the GSOM using preprocessed data, perform visual and exploratory data clustering on the trained map, and then extract the characteristics of individual clusters using the EAOI.


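The GSOM is able to directly handle categorical data: because binary transformation can cause information loss or incomplete data, a concept hierarchy tree with a weight assigned to each link is used to compute the exact distance between categorical values.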



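The EAOI offers the additional capability of preserving major values in the data: on top of the traditional AOI, it also considers the number of occurrences to capture how attribute values are distributed, and for categorical data it introduces a "major characteristic" indicator to address over-generalization.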



SOM

Training the SOM essentially involves two steps:

(1) Identifying: each training pattern is compared with all the units of the map to identify the best matching unit (BMU), the unit most similar to the training pattern.

(2) Adjusting: the BMU and its neighbors are updated to resemble the training pattern.
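The two training steps can be sketched as follows. This is a minimal illustration for numeric data only, not the presenters' implementation: the Euclidean distance, the fixed learning rate `rate`, the hard ("bubble") neighborhood of radius `radius`, and the 2x2 map are all assumptions chosen for brevity.

```python
import math

def best_matching_unit(units, pattern):
    """Identifying step: index of the unit closest (Euclidean) to the pattern."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], pattern))

def train_step(units, grid, pattern, rate, radius):
    """Adjusting step: pull the BMU and its grid neighbors toward the pattern."""
    bmu = best_matching_unit(units, pattern)
    for i, unit in enumerate(units):
        # A unit is a neighbor if it lies within `radius` of the BMU on the map grid.
        if math.dist(grid[i], grid[bmu]) <= radius:
            units[i] = [w + rate * (x - w) for w, x in zip(unit, pattern)]
    return bmu

# Hypothetical 2x2 map with 2-dimensional weight vectors.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
units = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
bmu = train_step(units, grid, pattern=[0.9, 0.8], rate=0.5, radius=1.0)
```

In practice the learning rate and the radius shrink over time; they are held constant here only to keep the sketch short.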


Problem with the SOM

The conventional SOM cannot directly handle categorical attributes.

The binary transformation approach has at least four disadvantages.

(1) Similarity information among categorical values is not conveyed.

(2) When the domain of a categorical attribute is large, the transformation increases the dimensionality of the transformed relation.

(3) Maintenance is difficult.

(4) The names of binary attributes fail to preserve the semantics of the original categorical attribute.
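Disadvantages (1) and (2) can be seen in a small sketch. The three-value domain below is hypothetical; the point is only that after binary (one-hot) transformation every pair of distinct values is equally far apart, and each value costs one extra dimension.

```python
import math

def one_hot(value, domain):
    """Binary transformation: one 0/1 attribute per value in the domain."""
    return [1.0 if value == v else 0.0 for v in domain]

domain = ["Coke", "Pepsi", "Coffee"]  # hypothetical categorical domain
coke, pepsi, coffee = (one_hot(v, domain) for v in domain)

# Both distances equal sqrt(2): the encoding cannot express that two values
# might be semantically closer to each other than to a third.
d_coke_pepsi = math.dist(coke, pepsi)
d_coke_coffee = math.dist(coke, coffee)
```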


AOI

The induction method mainly includes two steps, attribute removal and attribute generalization.

Attribute removal: attributes with too many distinct values and attributes whose meaning duplicates that of another attribute are removed.

Attribute generalization: for each remaining attribute, the original attribute values, which are more specific, are replaced by values closer to the root of its concept hierarchy, which are more general.
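Attribute generalization can be sketched as climbing a concept hierarchy until few enough distinct values remain. The hierarchy `PARENT`, the node names `Central_Taiwan` and `Taiwan`, and the stopping rule based on a single threshold are assumptions made for illustration; attribute removal is not shown.

```python
from collections import Counter

# Hypothetical concept hierarchy: child -> parent.
PARENT = {
    "Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan",
    "Taichung": "Central_Taiwan",
    "North_Taiwan": "Taiwan", "Central_Taiwan": "Taiwan",
}

def generalize(values, parent, threshold):
    """Lift all values one level at a time until at most `threshold` distinct values remain."""
    values = list(values)
    while len(set(values)) > threshold:
        lifted = [parent.get(v, v) for v in values]
        if lifted == values:  # every value is already at the root
            break
        values = lifted
    return Counter(values)

cities = ["Taipei", "Taoyuan", "Hsinchu", "Taichung"]
two_groups = generalize(cities, PARENT, threshold=2)
one_group = generalize(cities, PARENT, threshold=1)
```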


Problems with handling major values and numeric attributes

The traditional AOI is incapable of revealing major values and suffers from having to discretize numeric attributes.

Regarding the construction of concept hierarchies for numeric attributes, there are two problems:

(1) Subjectivity of the construction: the criteria for building the concept hierarchy are set subjectively by people, so similar values can be assigned to different categories.

(2) The generalization of boundary values: for example, when 50 to 100 is defined as the middle level, 49.9 and 50 differ only slightly, yet 49.9 is assigned to the low level.


GSOM-Distance hierarchy

To alleviate the drawbacks resulting from binary transformation, we propose the distance hierarchy: a concept hierarchy extended with link weights, which serves as the mechanism for representing and measuring the distance between categorical values.


GSOM-Distance hierarchy (cont.)

The least common ancestor of two points X and Y, denoted as LCA(X, Y), is the deepest node in the hierarchy that is an ancestor of both X and Y; e.g., LCA(X, Z)=Drink.



The least common point of two points X and Y, denoted as LCP(X, Y), is defined as one of the three cases:

(1) either X or Y if they are at the same position (i.e., equivalent);

(2) Y if Y is an ancestor of X; otherwise

(3) LCP(X, Y)=LCA(X, Y).



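The distance between two points in a distance hierarchy is the total weight between them.

Let X=(NX, dX) and Y=(NY, dY) be the two points; the distance between X and Y is defined as

$$ |X - Y| = d_X + d_Y - 2\,d_{LCP(X, Y)} $$

where d_{LCP(X, Y)} is the distance from the root of the hierarchy to LCP(X, Y).

Note: the offset dX represents the distance from the root of the hierarchy to X.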
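A minimal sketch of this distance, assuming it takes the form |X - Y| = dX + dY - 2 d_LCP, where d_LCP is the offset of the least common point. The hierarchy `LINKS`, its unit link weights, and the nodes `Snack`, `Pepsi`, and `Chips` are hypothetical. The sketch also assumes that a point is given as (anchor leaf, offset from the root) and that the offset of the LCP equals the smallest of dX, dY, and the depth of the anchors' least common ancestor.

```python
# Hypothetical distance hierarchy: child -> (parent, weight of the link to the parent).
LINKS = {
    "Drink": ("Any", 1.0), "Snack": ("Any", 1.0),
    "Coke": ("Drink", 1.0), "Pepsi": ("Drink", 1.0), "Chips": ("Snack", 1.0),
}

def depths_on_path(leaf):
    """Map every node on the root-to-leaf path to its distance from the root."""
    chain = [leaf]
    while chain[-1] in LINKS:
        chain.append(LINKS[chain[-1]][0])
    depths, depth = {}, 0.0
    for node in reversed(chain):  # walk from the root down to the leaf
        if node in LINKS:
            depth += LINKS[node][1]
        depths[node] = depth
    return depths

def lca_depth(leaf_a, leaf_b):
    """Distance from the root to the least common ancestor of two leaves."""
    on_a, on_b = depths_on_path(leaf_a), depths_on_path(leaf_b)
    return max(d for node, d in on_a.items() if node in on_b)

def distance(x, y):
    """|X - Y| = dX + dY - 2 * d_LCP for points given as (anchor leaf, offset)."""
    (anchor_x, d_x), (anchor_y, d_y) = x, y
    d_lcp = min(d_x, d_y, lca_depth(anchor_x, anchor_y))
    return d_x + d_y - 2 * d_lcp
```

Under these assumptions, two points on the same root-to-leaf path are separated by the difference of their offsets, while two leaves under different parents are separated by the full path through their common ancestor.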


GSOM-Distance between a pattern and a map unit (cont.)

For example, assume that a two-dimensional pattern x=(x1, x2)=(Coke, 9), Dom(x2)=[5, 20], and distance hierarchies dh1 (categorical) and dh2 (numeric) are given as shown in Fig.

x1=Coke is mapped to X=(Coke, 2) in dh1. x2=9 is mapped to X=(MAX, 4) in dh2, that is, 9-5=4 from the lower bound of Dom(x2).

Note: dhi=Xi-Leaf distance


GSOM-Distance between a pattern and a map unit (cont.)

Assume a unit m consists of n components,

m=[m1, m2, …, mn]

Each mi, which can be categorical or numeric, is composed

of two parts: (N, d). For the categorical

That is, mi =(N, d) is mapped to a point M with the value

(N, d), denoted as dhi(mi)=M=(N, d), indicating the anchor of

the mapping point M is N and the offset from the root is d.


GSOM-Distance between a pattern and a map unit (cont.)

Suppose x, m, and dh represent a training pattern, a map unit, and a set of distance hierarchies, respectively. Then the distance between x and m is defined as

$$ d(x, m) = \left( \sum_{i=1}^{n} \left| dh_i(x_i) - dh_i(m_i) \right|^2 \right)^{1/2} $$

For example, the differences between the paired mapping points of x and m are |(Coke, 2)-(Coke, 0.3)|=1.7 and |(MAX, 4)-(MAX, 6)|=2, respectively, making the distance between x and m (1.7^2+2^2)^{1/2}=2.62.

(Note: categorical data can thus be handled without binary transformation.)
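The worked example can be checked with a short sketch. It assumes the pattern-to-unit distance is the square root of the summed squared component distances, and it uses a simplified helper, `same_anchor_distance`, that only handles paired points sharing an anchor (as both pairs in the example do); the helper name and this restriction are illustrative choices.

```python
import math

def same_anchor_distance(p, q):
    """Distance between two points with the same anchor: they lie on one root-to-leaf path."""
    (anchor_p, d_p), (anchor_q, d_q) = p, q
    if anchor_p != anchor_q:
        raise ValueError("this simplified helper only handles points with the same anchor")
    return abs(d_p - d_q)

def pattern_unit_distance(x_points, m_points, point_distance=same_anchor_distance):
    """Combine the per-component distances of a pattern and a unit in the Euclidean way."""
    return math.sqrt(sum(point_distance(p, q) ** 2 for p, q in zip(x_points, m_points)))

x = [("Coke", 2.0), ("MAX", 4.0)]  # mapped pattern from the example
m = [("Coke", 0.3), ("MAX", 6.0)]  # mapped unit from the example
d = pattern_unit_distance(x, m)    # about 2.62
```

A full implementation would pass in a point distance that also handles different anchors through the least common point.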


GSOM-Adaptation of a unit component

Let X=(P, dX), M=(Q, dM), δ (delta) be the adjusting amount, and NLCA be the least common ancestor of the anchors P and Q.

Case 1: new M is (Q, dM+δ)
Case 2: new M is (P, dM+δ)
Case 3: new M is (Q, dM-δ)
Case 4: new M is (P, 2dNLCA-dM+δ)
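The four cases can be sketched as one function. The conditions that select each case are not stated in the text, so the ones used here are an assumption: M moves toward X along the tree path by an amount `delta` that does not overshoot X, going down when M is an ancestor of X (Cases 1 and 2) and up otherwise (Cases 3 and 4), and switching to anchor P once it passes the least common ancestor. The parameter `d_lca` stands for the distance from the root to NLCA.

```python
def adapt(x, m, delta, d_lca):
    """Move unit point m = (Q, dM) toward pattern point x = (P, dX) by delta.

    Assumes 0 <= delta <= |X - M|, so the move never overshoots X.
    """
    (p, d_x), (q, d_m) = x, m
    d_lcp = min(d_x, d_m, d_lca)  # offset of the least common point of X and M
    if d_m == d_lcp:
        # M is an ancestor of X (or at the same position): move down toward X.
        new_d = d_m + delta
        if new_d <= d_lca:
            return (q, new_d)     # Case 1: still on the path shared with Q
        return (p, new_d)         # Case 2: passed the LCA, now on P's branch
    if d_m - delta >= d_lcp:
        return (q, d_m - delta)   # Case 3: move up without passing the LCP
    # Case 4: climb to the LCA, then descend P's branch with the remaining amount.
    return (p, 2 * d_lca - d_m + delta)
```

For example, with the LCA at depth 1, `adapt(("Pepsi", 2.0), ("Coke", 1.75), 1.0, 1.0)` climbs 0.75 to the common ancestor and descends 0.25 toward Pepsi. A numeric component, where both anchors are MAX, only ever falls into Case 1 or Case 3.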


GSOM-Adaptation of a unit component (cont.)

For a numeric component, the adjusting process is simpler due to its degenerated hierarchy. Let X=(MAX, dX), M=(MAX, dM), and δ be the adjusting amount. If dM > dX, the new M is (MAX, dM-δ); otherwise it is (MAX, dM+δ).


EAOI

For the exploration of major values, we introduce a parameter, the majority threshold β. If some values (i.e., major values) take up a major portion (exceeding β) of an attribute, the EAOI preserves those major values and generalizes the other, non-major values. If β is set to 1, the EAOI degenerates to the AOI. Note: 0<β<=1.

Besides the cluster and class dimensions, the EAOI also reports the mean and standard deviation of numeric data, which addresses the biased data characteristics produced by the traditional AOI.


EAOI (cont.)

Algorithm: An EAOI algorithm for major values and alternative processing of numeric attributes

Input: A relation W with an attribute set A; a set of concept hierarchies; generalization threshold θ and majority threshold β.

Output: A generalized relation P.


EAOI (cont.)

Method:

1. Determine whether to generalize numeric attributes.

2. For each attribute Ai to be generalized in W,

2.1 Determine whether Ai should be removed, and if not, determine its minimum desired generalization level Li in its concept hierarchy.

2.2 Construct its major-value set Mi according to θ and β.

2.3 For v ∈ Dom(Ai), if v ∉ Mi, construct the mapping pair as (v, vLi-MLi); otherwise, as (v, v).

3. Derive the generalized relation P by replacing each value v by its mapping value and computing other aggregate values.
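Steps 2.2, 2.3, and 3 can be sketched for a single categorical attribute. How Mi is built from θ and β is not spelled out in the text, so the rule in `major_values` is an assumption: take the most frequent values, at most θ-1 of them, and keep them only if together they cover more than β of the tuples. The one-level hierarchy and the function names are also hypothetical; attribute removal, the choice of level Li, and numeric aggregates are not shown.

```python
from collections import Counter

def major_values(values, theta, beta):
    """Assumed rule: the fewest top values (at most theta-1) whose share exceeds beta."""
    counts, total = Counter(values), len(values)
    major, covered = [], 0
    for value, count in counts.most_common(theta - 1):
        major.append(value)
        covered += count
        if covered / total > beta:
            return set(major)
    return set()  # no small set of values dominates, so plain AOI behavior applies

def eaoi_generalize(values, parent, theta, beta):
    """Keep major values as they are; lift the others one level, minus the major values."""
    major = major_values(values, theta, beta)
    mapping = {}
    for v in set(values):
        if v in major:
            mapping[v] = v
        else:
            general = parent[v]
            excluded = sorted(m for m in major if parent.get(m) == general)
            mapping[v] = general + "-{" + ",".join(excluded) + "}" if excluded else general
    return Counter(mapping[v] for v in values)

# Hypothetical one-level hierarchy and data, following the Taipei example.
PARENT = {"Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan"}
cities = ["Taipei"] * 30 + ["Taoyuan", "Hsinchu"]
with_major = eaoi_generalize(cities, PARENT, theta=3, beta=0.75)
plain_aoi = eaoi_generalize(cities, PARENT, theta=3, beta=1.0)
```

With β=0.75 Taipei is preserved and the two remaining tuples become `North_Taiwan-{Taipei}`; with β=1 no share can exceed the threshold, so every value is generalized as in the AOI.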


Exploratory clustering and pattern extraction

The GSOM alone is incapable of extracting clusters' characteristics, whereas the EAOI alone will result in over-generalization if the data are diversified and not clustered before generalization.

Three kinds of patterns can be analyzed: cluster characteristics, discriminant rules, and characteristic rules.


Exploratory clustering and pattern extraction (cont.)

Cluster characteristics: extracted by the EAOI from each cluster Ci, cluster characteristics can be expressed as:

For example, C1: {[(City=Taipei, Salary=(51000, 0));0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000));0.03]} represents two patterns extracted from C1, with supports of 97% and 3%, respectively.


Exploratory clustering and pattern extraction (cont.)

Discriminant rules:

For instance, IF C1:{[(City=Taipei, Salary=(51000,0));0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000,1000));0.03]} → {A(0.7), B(0.3)} indicates that C1 has two patterns taking up 97% and 3%, respectively, and these patterns imply Class A with 70% confidence or Class B with 30% confidence.


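Exploratory clustering and pattern extraction (cont.)

Characteristic rules:

IF Drink ((birthPlace=Taichung, company=Business Administration, amt=(200,3.4), (C2, 0.8)) or (birthPlace=Taipei, company=College of Management - Business Administration, amt=(150,2.1), (C1, 0.2))) indicates that the "Drink" class contains two rules. In the first, 80% of the class belongs to Cluster 2, characterized by Taichung, Business Administration, and a purchase amount with mean 200 and standard deviation 3.4. In the second, 20% belongs to Cluster 1, characterized by Taipei, College of Management - Business Administration, and a purchase amount with mean 150 and standard deviation 2.1. The major characteristic is "Business Administration".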


Experimental results- Synthetic data

This experiment aims to compare the results by using the conventional SOM and AOI with those of the GSOM and EAOI on a synthetic, mixed dataset.

We designed a dataset of 400 tuples, which has four attributes plus one class attribute, as shown in Table 1.



The hierarchies for the attributes are shown in Fig. The hierarchies of Age and Amount are for the traditional AOI.



The map size is 64 units. The learning rate is a linear function with the initial value and a neighborhood radius function set to the side length of the map. The training time T is at least 10 times the map size.



Figure: training results of the GSOM and the SOM with a training time of 12,000.



We further use the EAOI and the AOI to extract discriminant rules for the four groups formed on the GSOM. The parameters are set as follows: the attribute generalization threshold θ=3 and the majority threshold β=0.75.


Experimental results- UCI adult dataset

The dataset has 15 attributes: eight categorical, six numeric, and one class attribute, Salary, indicating whether the salary is over 50K (>50K) or not (<=50K).


Experimental results- UCI adult dataset (cont.)

We use three criteria to cluster the training results.


Experimental results- UCI adult dataset (cont.)

For instance, the second criterion (d <=2.828) merges Clusters 4 and 7 of the GSOM in Fig. 10(a) and merges Clusters 1, 2, 5, 6, 10, 12, and 13 of the SOM in Fig. 10(b).


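Experimental results- UCI adult dataset (cont.)

The average categorical utility (ACU) of a set of clusters is calculated as follows.

$$ ACU = \frac{1}{K} \sum_{k=1}^{K} \frac{|C_k|}{|D|} \sum_{i} \sum_{j} \left[ P(A_i=V_{ij} \mid C_k)^2 - P(A_i=V_{ij})^2 \right] $$

where K is the number of clusters, |Ck| is the size of Cluster k, |D| is the dataset size, P(Ai=Vij|Ck) is the conditional probability that the attribute Ai has the value Vij given the cluster Ck, and P(Ai=Vij) is the overall probability of Ai having Vij in the entire data set.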
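A sketch of the computation, assuming the ACU has the standard category-utility form: for each cluster, sum over attributes and values the gain P(Ai=Vij|Ck)^2 - P(Ai=Vij)^2, weight it by the cluster's share of the data, and average over the number of clusters. The function name and the representation of a cluster as a list of tuples are illustrative choices.

```python
from collections import Counter

def average_category_utility(clusters):
    """clusters: list of clusters, each a list of tuples of categorical values."""
    data = [row for cluster in clusters for row in cluster]
    n, n_attrs = len(data), len(data[0])
    total = 0.0
    for cluster in clusters:
        gain = 0.0
        for i in range(n_attrs):
            overall = Counter(row[i] for row in data)
            within = Counter(row[i] for row in cluster)
            # Values absent from the cluster contribute only the negative overall term.
            gain += sum((c / len(cluster)) ** 2 for c in within.values())
            gain -= sum((c / n) ** 2 for c in overall.values())
        total += len(cluster) / n * gain
    return total / len(clusters)

pure = average_category_utility([[("a",), ("a",)], [("b",), ("b",)]])
mixed = average_category_utility([[("a",), ("b",)], [("a",), ("b",)]])
```

Clusters that separate the values perfectly score higher than clusters that merely mirror the overall distribution, which score zero.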


Experimental results- UCI adult dataset (cont.)

We compute the ACU of categorical values of clusters formed by the three clustering criteria at the leaf level and Level 1 of the distance hierarchies, and the increased rate, as shown in


Experimental results- UCI adult dataset (cont.)

The expected entropy of a class attribute C in a set of clusters can be used to measure how the class values are distributed in the clusters. The formula is as follows:

$$ E(C) = -\sum_{k} \frac{|C_k|}{|D|} \sum_{j} P(C=V_j \mid C_k) \log P(C=V_j \mid C_k) $$

where Vj denotes one of the possible values that C can take, |Ck| is the size of Cluster k, and |D| is the dataset size.

The chaining effect results in a reduced cluster number and an increased expected entropy.
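The expected entropy can be sketched as the size-weighted average of each cluster's class entropy. The base-2 logarithm and the function name are assumptions; a different base only rescales the result.

```python
import math
from collections import Counter

def expected_entropy(cluster_labels):
    """cluster_labels: list of clusters, each a list of class labels."""
    total = sum(len(labels) for labels in cluster_labels)
    result = 0.0
    for labels in cluster_labels:
        counts = Counter(labels)
        # Entropy of the class distribution inside this cluster (base 2, assumed).
        h = -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())
        result += len(labels) / total * h
    return result

pure = expected_entropy([[">50K", ">50K"], ["<=50K", "<=50K"]])
merged = expected_entropy([[">50K", ">50K", "<=50K", "<=50K"]])
```

Merging the two pure clusters into one raises the expected entropy from 0 to 1, which is the effect described for chaining: fewer clusters with more mixed class values.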


Experimental results- UCI adult dataset (cont.)

The Salary class distributions in the clusters are shown in Table 6, where Clusters 4 and 1 have the largest ratios of >50K. Clusters 5, 3, and 7 have much lower ratios of >50K compared to the dataset.

We use the EAOI and the AOI to extract cluster patterns. The parameters were set as follows: the attribute generalization threshold θ=4 and the majority threshold β=0.75.


Experimental results- UCI adult dataset (cont.)

Tables 7 and 8 show a portion of the patterns extracted from Clusters 4, 2, and 7 by both methods.


Experimental results- Sales data

In another experiment, we used a subset of the sales records of a store at a university from 4/12/1999 to 7/17/2000.


Conclusions

The attributes participating in the training of the GSOM have a significant impact on the results due to the distance metric used in the training algorithm.

If a class attribute is involved in the data, relevance analysis between the class attribute and the others (or feature selection) should be performed before training to ensure the quality of cluster analysis.
