Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data
Advisor: Dr. Hsu
Presenter: Jing-Wei Lin
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
SOM and AOI
GSOM and EAOI
Exploratory clustering and pattern extraction
Experimental results
Conclusions
Motivation
A successful integration relies on appropriate individual techniques. However, the traditional self-organizing map (SOM) and attribute-oriented induction (AOI) each have drawbacks.
The traditional SOM cannot directly handle categorical data.
AOI may fail to preserve the major values of an attribute, leading to overgeneralization. For example, if the salary data contain 30 records for Taipei and one record each for Taoyuan and Hsinchu, the generalized income value that AOI reports for North Taiwan may be misleading.
SOM
The SOM is an unsupervised neural network that projects high-dimensional data onto a low-dimensional grid, usually two-dimensional, and preserves the topological relationships of the original data.
AOI
Attribute-oriented induction extracts patterns from a large amount of data and produces a set of concise rules that represent the general patterns hidden in the data.
Note: AOI is a technique for extracting data characteristics from relational databases.
Objective
To propose a generalized self-organizing map (GSOM) and an extended attribute-oriented induction (EAOI) that not only overcome the drawbacks of the original algorithms but also provide additional analysis capabilities.
Introduction
Among unsupervised clustering techniques, much attention has been paid to the self-organizing map (SOM), which projects high-dimensional data onto low-dimensional grids without losing their topological order.
Among pattern extraction techniques, attribute-oriented induction (AOI) is a popular and effective approach.
Introduction (cont.)
The integrated analysis framework works as follows: train the GSOM using preprocessed data, cluster the data visually and exploratorily on the trained map, and then extract the characteristics of individual clusters using the EAOI.
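Introduction (cont.)
The GSOM is able to handle categorical data directly. Because binary transformation can lose information or leave the data incomplete, the GSOM uses a concept hierarchy tree in which every link is given a weight, so that the exact distance between categorical values can be computed.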
Introduction (cont.)
The EAOI offers the additional capability of preserving major values in the data. On top of the traditional AOI, it also considers occurrence counts to capture how the characteristic values are distributed, and for categorical data it introduces a "major value" indicator to address overgeneralization.
SOM
Training an SOM essentially involves two steps:
Identifying: each training pattern is compared with all the units of the map to identify the best matching unit (BMU), the unit most similar to the training pattern.
Adjusting: the BMU and its neighbors are updated to resemble the training pattern.
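The two steps can be sketched in a few lines of Python. This is a minimal illustration for purely numeric patterns, not the presenter's implementation: the 2x2 map, the Gaussian neighborhood function, and the values of `alpha` and `radius` are all assumptions chosen for the example.

```python
import math

def find_bmu(weights, pattern):
    """Identifying step: return the (row, col) of the unit closest to the pattern."""
    best, best_dist = None, math.inf
    for r, row in enumerate(weights):
        for c, unit in enumerate(row):
            dist = math.dist(unit, pattern)
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def adjust(weights, pattern, bmu, alpha, radius):
    """Adjusting step: move the BMU and its grid neighbors toward the pattern."""
    for r, row in enumerate(weights):
        for c, unit in enumerate(row):
            grid_dist = math.dist((r, c), bmu)
            if grid_dist <= radius:
                # Assumed Gaussian neighborhood: units nearer to the BMU move more.
                h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(len(unit)):
                    unit[i] += alpha * h * (pattern[i] - unit[i])

# Tiny 2x2 map of 2-dimensional units and one training pattern (illustrative values).
weights = [[[0.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [1.0, 1.0]]]
pattern = [0.9, 0.8]
bmu = find_bmu(weights, pattern)
adjust(weights, pattern, bmu, alpha=0.5, radius=1.0)
print(bmu, weights[bmu[0]][bmu[1]])
```

In a full training run these two calls would be repeated for every pattern while `alpha` and `radius` shrink over time.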
Problem with the SOM
The conventional SOM cannot directly handle categorical attributes.
The binary transformation approach has at least four disadvantages:
(1) Similarity information among categorical values is not conveyed.
(2) When the domain of a categorical attribute is large, the transformation increases the dimensionality of the transformed relation.
(3) Maintenance is difficult.
(4) The names of the binary attributes fail to preserve the semantics of the original categorical attribute.
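Disadvantages (1) and (2) are easy to see in a small sketch. The three item names below are illustrative assumptions, not data from the experiments.

```python
import math

def one_hot(value, domain):
    """Binary transformation: one 0/1 attribute per categorical value."""
    return [1 if value == v else 0 for v in domain]

domain = ["Coke", "Pepsi", "Chips"]  # illustrative values
coke, pepsi, chips = (one_hot(v, domain) for v in domain)

# (1) Two drinks end up exactly as far apart as a drink and a snack.
d_coke_pepsi = math.dist(coke, pepsi)
d_coke_chips = math.dist(coke, chips)
print(d_coke_pepsi, d_coke_chips)

# (2) One categorical attribute has become len(domain) binary attributes.
print(len(coke))
```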
AOI
The induction method mainly includes two steps: attribute removal and attribute generalization.
Attribute removal: attributes with too many distinct values and attributes whose meaning duplicates another attribute are removed.
Attribute generalization: for each remaining attribute, the original attribute values, which are more specific, are replaced by values closer to the root of its concept hierarchy, which are more general.
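Attribute generalization can be sketched as climbing child-to-parent links. The hierarchy below is a hypothetical one built around the Taipei example from the Motivation slide; the function name `generalize` is likewise an assumption.

```python
from collections import Counter

# Hypothetical concept hierarchy given as child -> parent links.
PARENT = {
    "Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan",
    "Taichung": "Central_Taiwan",
    "North_Taiwan": "Taiwan", "Central_Taiwan": "Taiwan",
}

def generalize(value, levels=1):
    """Climb `levels` links toward the root; the root maps to itself."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

cities = ["Taipei"] * 30 + ["Taoyuan", "Hsinchu"]
generalized = Counter(generalize(c) for c in cities)
print(generalized)
```

After one level of generalization all 32 tuples carry the same value, so the fact that Taipei dominated the data is no longer visible.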
Problems with handling major values and numeric attributes
The traditional AOI is incapable of revealing major values and suffers from having to discretize numeric attributes.
Regarding the construction of concept hierarchies for numeric attributes, there are two problems:
(1) Subjectivity of the construction: the criteria for building the concept hierarchy are set subjectively by a person, so similar data can be assigned to different categories.
(2) Generalization of boundary values: for example, if 50 to 100 is defined as the middle level, 49.9 and 50 differ only slightly, yet 49.9 is assigned to the low level.
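The boundary problem can be reproduced with a hypothetical discretization. Only the 50 to 100 middle range comes from the slide; the "low" and "high" labels and the function name are assumptions.

```python
def discretize(x):
    """Hypothetical numeric hierarchy: below 50 is low, 50 to 100 is middle."""
    if x < 50:
        return "low"
    if x <= 100:
        return "middle"
    return "high"

# Nearly identical values fall into different concepts...
print(discretize(49.9), discretize(50))
# ...while values 50 apart share one concept.
print(discretize(50), discretize(100))
```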
GSOM - Distance hierarchy
To alleviate the drawbacks resulting from binary transformation, we propose the distance hierarchy: a concept hierarchy extended with link weights, which serves as the mechanism for representing and measuring the distance between categorical values.
GSOM - Distance hierarchy (cont.)
The least common ancestor of two points X and Y, denoted LCA(X, Y), is the deepest node in the hierarchy that is an ancestor of both X and Y; e.g., LCA(X, Z) = Drink.
GSOM - Distance hierarchy (cont.)
The least common point of two points X and Y, denoted LCP(X, Y), is defined by one of three cases:
(1) either X or Y, if they are at the same position (i.e., equivalent);
(2) Y, if Y is an ancestor of X;
(3) otherwise, LCA(X, Y).
GSOM - Distance hierarchy (cont.)
The distance between two points in a distance hierarchy is the total weight between them.
Let X = (NX, dX) and Y = (NY, dY) be two points. The distance between X and Y is defined as
$$|X - Y| = d_X + d_Y - 2\,d_{LCP(X,Y)}$$
where dLCP(X,Y) is the distance from the root of the hierarchy to LCP(X, Y).
Note: the offset d of a point is its distance from the root of the hierarchy, e.g., dX for X.
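A minimal sketch of this distance, assuming a small hypothetical hierarchy with unit link weights (the node names other than Coke and Drink are invented). It uses the observation that the offset of LCP(X, Y) is the smallest of dX, dY, and the depth of the least common ancestor of the two anchors; that shortcut is a simplification made for this sketch rather than something stated on the slides.

```python
# Hypothetical distance hierarchy: child -> (parent, link weight).
LINKS = {
    "Drink": ("Any", 1.0), "Snack": ("Any", 1.0),
    "Coke": ("Drink", 1.0), "Pepsi": ("Drink", 1.0),
    "Chips": ("Snack", 1.0),
}

def path_from_root(node):
    """Return [(node, depth)] from the root down to `node`."""
    chain = [node]
    while chain[-1] in LINKS:
        chain.append(LINKS[chain[-1]][0])
    chain.reverse()
    depth, out = 0.0, []
    for n in chain:
        if n in LINKS:
            depth += LINKS[n][1]
        out.append((n, depth))
    return out

def lca_depth(a, b):
    """Depth of the least common ancestor of anchors a and b."""
    depth = 0.0
    for (na, da), (nb, _) in zip(path_from_root(a), path_from_root(b)):
        if na != nb:
            break
        depth = da
    return depth

def distance(x, y):
    """|X - Y| = dX + dY - 2 * dLCP for points given as (anchor, offset)."""
    (nx, dx), (ny, dy) = x, y
    # The LCP is the shallower point when it lies on the shared part of the
    # two root-to-anchor paths; otherwise it is the LCA of the anchors.
    d_lcp = min(dx, dy, lca_depth(nx, ny))
    return dx + dy - 2 * d_lcp

print(distance(("Coke", 2.0), ("Pepsi", 2.0)))   # siblings under Drink
print(distance(("Coke", 2.0), ("Chips", 2.0)))   # different subtrees
print(distance(("Coke", 2.0), ("Coke", 0.3)))    # same root-to-leaf path
```

Unlike the binary transformation, Coke is now closer to Pepsi than to Chips.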
GSOM - Distance between a pattern and a map unit (cont.)
For example, assume a two-dimensional pattern x = (x1, x2) = (Coke, 9) with Dom(x2) = [5, 20], and distance hierarchies dh1 and dh2 as shown in the figure.
x1 = Coke is mapped to X1 = (Coke, 2) in dh1, and x2 = 9 is mapped to X2 = (MAX, 4) in dh2, since 9 - 5 = 4.
[Figure: distance hierarchies dh1 (categorical) and dh2 (numeric)]
Note: dhi = Xi-Leaf distance
GSOM - Distance between a pattern and a map unit (cont.)
Assume a unit m consists of n components, m = [m1, m2, …, mn].
Each mi, which can be categorical or numeric, is composed of two parts: (N, d).
That is, mi = (N, d) is mapped to a point M with the value (N, d), denoted dhi(mi) = M = (N, d), indicating that the anchor of the mapping point M is N and its offset from the root is d.
GSOM - Distance between a pattern and a map unit (cont.)
Suppose x, m, and dh represent a training pattern, a map unit, and a set of distance hierarchies, respectively. Then the distance between x and m is defined as
$$d(x, m) = \left( \sum_{i=1}^{n} \left| dh_i(x_i) - dh_i(m_i) \right|^2 \right)^{1/2}$$
For example, the differences between the paired mapping points of x and m are |(Coke, 2) - (Coke, 0.3)| = 1.7 and |(MAX, 4) - (MAX, 6)| = 2, respectively, making the distance between x and m (1.7^2 + 2^2)^(1/2) ≈ 2.62.
(Note: in this way, categorical data can be handled without binary transformation.)
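The worked example can be checked with a short sketch. It covers only the situation in the example, where each pair of points shares an anchor and therefore lies on one root-to-anchor path; the helper names are assumptions.

```python
import math

def to_numeric_point(value, low):
    """Map a numeric value to (MAX, offset) in a degenerate hierarchy rooted at `low`."""
    return ("MAX", value - low)

def same_path_distance(p, q):
    """Distance between two points that share an anchor (one root-to-anchor path)."""
    (anchor_p, dp), (anchor_q, dq) = p, q
    assert anchor_p == anchor_q, "this sketch only covers points with the same anchor"
    return abs(dp - dq)

x = [("Coke", 2.0), to_numeric_point(9, low=5)]   # pattern (Coke, 9), Dom(x2) = [5, 20]
m = [("Coke", 0.3), ("MAX", 6.0)]                 # map unit from the example

diffs = [same_path_distance(xi, mi) for xi, mi in zip(x, m)]
dist = math.sqrt(sum(d * d for d in diffs))
print([round(d, 2) for d in diffs], round(dist, 2))
```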
GSOM - Adaptation of a unit component
Let X = (P, dX), M = (Q, dM), δ (delta) be the adjusting amount, and NLCA be the least common ancestor of the anchors P and Q.
Case 1: the new M is (Q, dM + δ)
Case 2: the new M is (P, dM + δ)
Case 3: the new M is (Q, dM - δ)
Case 4: the new M is (P, 2dNLCA - dM + δ)
GSOM - Adaptation of a unit component
For a numeric component, the adjusting process is simpler because of its degenerate hierarchy. Let X = (MAX, dX), M = (MAX, dM), and δ be the adjusting amount. If dM > dX, the new M is (MAX, dM - δ); otherwise it is (MAX, dM + δ).
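The numeric rule translates directly into code. In this sketch δ is simply passed in by the caller, because the slides do not say how it is derived from the learning rate and the neighborhood function; the function name is an assumption.

```python
def adapt_numeric(x, m, delta):
    """Move the unit component M = (MAX, dM) toward X = (MAX, dX) by delta."""
    (_, dx), (anchor, dm) = x, m
    if dm > dx:
        return (anchor, dm - delta)
    return (anchor, dm + delta)

x = ("MAX", 4.0)
print(adapt_numeric(x, ("MAX", 6.0), delta=0.5))  # M above X moves down
print(adapt_numeric(x, ("MAX", 3.0), delta=0.5))  # M below X moves up
```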
EAOI
For the exploration of major values, we introduce a parameter, the majority threshold β. If some values (i.e., major values) take up a major portion (exceeding β) of an attribute, the EAOI preserves those major values and generalizes the other, non-major values. When β is set to 1, the EAOI degenerates to the AOI. Note: 0 < β <= 1.
Besides the two characteristic dimensions of cluster and class, the EAOI also reports the mean and standard deviation for numeric data, to address the biased data characteristics that the traditional AOI can produce.
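One simplified way to read the majority threshold is sketched below. The greedy selection and the `max_major` cap are assumptions made for illustration; the slides only say that the major-value set depends on the thresholds, so the EAOI's actual construction may differ.

```python
from collections import Counter

def major_values(values, beta, max_major=1):
    """Return the most frequent values if together they exceed a share of beta.

    `max_major` is a hypothetical cap on how many values may be kept as major.
    """
    counts = Counter(values)
    total = len(values)
    chosen, share = [], 0.0
    for value, count in counts.most_common(max_major):
        chosen.append(value)
        share += count / total
        if share > beta:
            return set(chosen)
    return set()  # no small group of values dominates the attribute

cities = ["Taipei"] * 30 + ["Taoyuan", "Hsinchu"]
print(major_values(cities, beta=0.75))  # Taipei holds 30/32 of the tuples
print(major_values(cities, beta=1.0))   # a share can never exceed 1
```

With β = 1 no value can qualify as major, which matches the statement that the EAOI then degenerates to the AOI.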
EAOI (cont.)
Algorithm: an EAOI algorithm for major values and alternative processing of numeric attributes.
Input: a relation W with an attribute set A; a set of concept hierarchies; a generalization threshold θ and a majority threshold β.
Output: a generalized relation P.
EAOI (cont.)
Method:
1. Determine whether to generalize numeric attributes.
2. For each attribute Ai to be generalized in W:
2.1 Determine whether Ai should be removed and, if not, determine its minimum desired generalization level Li in its concept hierarchy.
2.2 Construct its major-value set Mi according to θ and β.
2.3 For each v ∈ Dom(Ai), if v ∉ Mi, construct the mapping pair as (v, vLi - MLi); otherwise, as (v, v).
3. Derive the generalized relation P by replacing each value v by its mapping value and computing other aggregate values.
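Steps 2.3 and 3 can be sketched for a two-attribute relation (City, Salary). Everything concrete here is an assumption for illustration: the one-level hierarchy, the salary figures, a major-value set supplied by the caller, and the use of the population standard deviation as the aggregate.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical one-level concept hierarchy: child -> parent.
PARENT = {"Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan"}

def mapping_value(v, major):
    """Step 2.3: keep major values; generalize the rest, minus the major values."""
    if v in major:
        return v
    parent = PARENT[v]
    excluded = sorted(m for m in major if PARENT.get(m) == parent)
    if excluded:
        return parent + "-{" + ",".join(excluded) + "}"
    return parent

def generalize_relation(rows, major):
    """Step 3: replace values, merge identical tuples, and aggregate."""
    groups = defaultdict(list)
    for city, salary in rows:
        groups[mapping_value(city, major)].append(salary)
    total = len(rows)
    return {
        city: (round(len(s) / total, 2), mean(s), pstdev(s))
        for city, s in groups.items()
    }

rows = [("Taipei", 51000)] * 30 + [("Taoyuan", 50000), ("Hsinchu", 52000)]
result = generalize_relation(rows, major={"Taipei"})
for city, (support, avg, std) in result.items():
    print(city, support, avg, std)
```

Each output row carries a support plus a (mean, standard deviation) pair for the numeric attribute, so the dominant Taipei pattern stays visible instead of being absorbed into North_Taiwan.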
Exploratory clustering and pattern extraction
The GSOM alone is incapable of extracting the characteristics of clusters, whereas the EAOI alone will overgeneralize if the data are diverse and are not clustered before generalization.
Three kinds of patterns can be analyzed: cluster characteristics, discriminant rules, and characteristic rules.
Exploratory clustering and pattern extraction (cont.)
Cluster characteristics: extracted by the EAOI from each cluster Ci, cluster characteristics are expressed as a set of patterns, each paired with its support.
For example, C1: {[(City=Taipei, Salary=(51000, 0)); 0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000)); 0.03]} represents two patterns extracted from C1, with supports of 97% and 3%.
Exploratory clustering and pattern extraction (cont.)
Discriminant rules:
For instance, IF C1: {[(City=Taipei, Salary=(51000, 0)); 0.97], [(City=North_Taiwan-{Taipei}, Salary=(51000, 1000)); 0.03]} → {A(0.7), B(0.3)} indicates that C1 has two patterns taking up 97% and 3%, respectively, and that these patterns imply Class A with 70% confidence or Class B with 30% confidence.
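Exploratory clustering and pattern extraction (cont.)
Characteristic rules:
IF Drink ((birthPlace=Taichung, company=Business Administration, amt=(200, 3.4), (C2, 0.8)) or (birthPlace=Taipei, company=College of Management - Business Administration, amt=(150, 2.1), (C1, 0.2))) indicates that the class "Drink" contains two rules. The first covers the 80% that belong to Cluster 2, characterized by Taichung, Business Administration, and a purchase amount with mean 200 and standard deviation 3.4. The second covers the 20% that belong to Cluster 1, characterized by Taipei, College of Management - Business Administration, and a purchase amount with mean 150 and standard deviation 2.1. The major value is "Business Administration".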
Experimental results - Synthetic data
This experiment compares the results of the conventional SOM and AOI with those of the GSOM and EAOI on a synthetic, mixed dataset.
We designed a dataset of 400 tuples with four attributes plus one class attribute, as shown in Table 1.
Experimental results - Synthetic data
The concept hierarchies for the attributes are shown in the figure; the hierarchies of Age and Amount are for the traditional AOI.
Experimental results - Synthetic data
The map size is 64 units. The learning rate is a linear function of training time starting from its initial value, the neighborhood radius function is set to the side length of the map, and the training time T is at least 10 times the map size.
Experimental results - Synthetic data
The figure shows the training results of the GSOM and the SOM after a training time of 12,000.
Experimental results - Synthetic data
We further use the EAOI and the AOI to extract discriminant rules for the four groups formed on the GSOM. The parameters are set as follows: the attribute generalization threshold θ = 3 and the majority threshold β = 0.75.
Experimental results - UCI adult dataset
The dataset has 15 attributes: eight categorical, six numeric, and one class attribute, Salary, which indicates whether the salary is over 50K (>50K) or not (<=50K).
Experimental results - UCI adult dataset
We use three criteria to cluster the training results.
Experimental results - UCI adult dataset
For instance, the second criterion (d <= 2.828) merges Clusters 4 and 7 of the GSOM in Fig. 10(a) and merges Clusters 1, 2, 5, 6, 10, 12, and 13 of the SOM in Fig. 10(b).
Experimental results - UCI adult dataset
The average categorical utility (ACU) of a set of clusters is calculated from two probabilities: P(Ai=Vij|Ck), the conditional probability that attribute Ai has the value Vij given cluster Ck, and P(Ai=Vij), the overall probability of Ai having the value Vij in the entire dataset.
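For orientation, the sketch below computes the standard category utility, which combines exactly these two probabilities: for each cluster it sums P(Ai=Vij|Ck)^2 - P(Ai=Vij)^2 over all attributes and values, weights the sum by the cluster's share of the data, and averages over the K clusters. Whether the ACU on the slide uses the same weighting and normalization is an assumption; the toy clusters are invented.

```python
from collections import Counter

def squared_prob_sum(rows):
    """Sum over attributes i and values j of P(Ai = Vij)^2 within `rows`."""
    total = 0.0
    for column in zip(*rows):
        counts = Counter(column)
        total += sum((c / len(column)) ** 2 for c in counts.values())
    return total

def category_utility(clusters):
    """Standard category utility, averaged over the K clusters."""
    data = [row for cluster in clusters for row in cluster]
    baseline = squared_prob_sum(data)
    weighted = sum(
        len(cluster) / len(data) * (squared_prob_sum(cluster) - baseline)
        for cluster in clusters
    )
    return weighted / len(clusters)

# Toy single-attribute data: pure clusters versus clusters that mix the values.
pure = [[("Coke",), ("Coke",)], [("Chips",), ("Chips",)]]
mixed = [[("Coke",), ("Chips",)], [("Coke",), ("Chips",)]]
print(category_utility(pure), category_utility(mixed))
```

Higher values mean that knowing the cluster makes the attribute values more predictable than they are in the dataset as a whole.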
Experimental results - UCI adult dataset
We compute the ACU of the categorical values of the clusters formed by the three clustering criteria at the leaf level and at Level 1 of the distance hierarchies, together with the increase rate.
Experimental results - UCI adult dataset
The expected entropy of an attribute C in a set of clusters measures how the class values are distributed across the clusters:
$$E(C) = \sum_{k} \frac{|C_k|}{|D|} \left( -\sum_{j} P(C=V_j \mid C_k) \log P(C=V_j \mid C_k) \right)$$
where Vj denotes one of the possible values that C can take, |Ck| is the size of Cluster k, and |D| is the dataset size.
The chaining effect results in fewer clusters and a higher expected entropy.
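A small sketch of the expected entropy, assuming base-2 logarithms (the slides do not state the base) and toy class labels. It also shows why merging clusters, as the chaining effect does, can only keep or raise the value.

```python
import math
from collections import Counter

def expected_entropy(clusters):
    """Size-weighted average of the class entropy inside each cluster (base 2)."""
    total = sum(len(labels) for labels in clusters)
    result = 0.0
    for labels in clusters:
        counts = Counter(labels)
        entropy = -sum(
            (c / len(labels)) * math.log2(c / len(labels)) for c in counts.values()
        )
        result += len(labels) / total * entropy
    return result

# Two pure clusters versus the same tuples chained into a single cluster.
separate = [[">50K", ">50K"], ["<=50K", "<=50K"]]
merged = [[">50K", ">50K", "<=50K", "<=50K"]]
print(expected_entropy(separate), expected_entropy(merged))
```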
Experimental results - UCI adult dataset
The Salary class distributions in the clusters are shown in Table 6, where Clusters 4 and 1 have the largest ratios of >50K. Clusters 5, 3, and 7 have much lower ratios of >50K than the dataset as a whole.
We use the EAOI and the AOI to extract cluster patterns. The parameters are set as follows: the attribute generalization threshold θ = 4 and the majority threshold β = 0.75.
Experimental results - UCI adult dataset
Tables 7 and 8 show a portion of the patterns extracted from Clusters 4, 2, and 7 by both methods.
Experimental results - Sales data
In another experiment, we used a subset of the sales records of a store at a university from 4/12/1999 to 7/17/2000.
Conclusions
The attributes participating in the training of the GSOM have a significant impact on the results because of the distance metric used in the training algorithm.
If a class attribute is involved in the data, relevance analysis between the class attribute and the other attributes (or feature selection) should be performed before training to ensure the quality of the cluster analysis.