Relevant characteristics extraction from semantically unstructured data
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006
Contents
- Prerequisites
- Correlation of the SVM kernel's parameters: polynomial kernel, Gaussian kernel
- Feature selection using Genetic Algorithms: chromosome encoding, genetic operators
- Meta-classifier with SVM: non-adaptive method (Majority Vote); adaptive methods (selection based on Euclidean distance, selection based on cosine)
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work
Prerequisites: Reuters Database Processing
- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection: "system software" – 7,083 documents (4,722 training / 2,361 testing), 19,038 attributes (features), 24 classes (topics)
- Data representations: Binary, Nominal, Cornell SMART
Classifier using Support Vector Machine techniques with different kernels
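A minimal sketch of the three term-weighting schemes named above. The Binary and Nominal forms follow the usual definitions; the exact Cornell SMART variant is not spelled out in the slides, so the damped double-log form below is an assumption.

```python
import math

def binary_weight(tf):
    # Binary: 1 if the term occurs in the document, else 0
    return 1 if tf > 0 else 0

def nominal_weight(tf, max_tf):
    # Nominal: term frequency normalized by the largest
    # term frequency in the same document
    return tf / max_tf if max_tf > 0 else 0.0

def cornell_smart_weight(tf):
    # Cornell SMART (assumed variant): damped double-log term frequency
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```

Each document is then a 19,038-dimensional sparse vector of such weights.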
Correlation of the SVM kernel's parameters

Polynomial kernel: k(x, x′) = (x·x′ + 2d)^d

Gaussian kernel: k(x, x′) = exp(−‖x − x′‖² / (n·C))
Polynomial kernel
The commonly used form: k(x, x′) = (x·x′ + b)^d
- d – degree of the kernel
- b – the offset (bias)
Our suggestion for correlating the parameters: b = 2·d, which gives k(x, x′) = (x·x′ + 2d)^d
Bias – Polynomial kernel
[Figure: influence of the bias on classification accuracy, nominal representation of the input data, kernel k(x, x′) = (x·x′ + b)^d; x-axis: values of the bias b (0–10, 50, 100, 500, 1000, 1309); y-axis: accuracy (%), 65–90; series: d = 1, d = 2, d = 3, d = 4, and our choice b = 2d]
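The suggested correlation can be sketched as a kernel function whose bias defaults to b = 2d when no explicit offset is given:

```python
def polynomial_kernel(x, y, d, b=None):
    """Polynomial kernel k(x, y) = (x.y + b)^d.
    With the suggested parameter correlation, b defaults to 2*d."""
    if b is None:
        b = 2 * d
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + b) ** d
```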
Gaussian kernel parameter's correlation
The commonly used form: k(x, x′) = exp(−‖x − x′‖² / C)
- C – usually chosen to reflect the dimension of the input set
Our suggestion: scale C by n, the number of distinct features greater than 0, giving k(x, x′) = exp(−‖x − x′‖² / (n·C))
n – Gaussian kernel
[Figure: influence of n on accuracy, Cornell SMART data representation, kernel k(x, x′) = exp(−‖x − x′‖² / (n·C)); x-axis: values of parameter n (1, 10, 50, 100, 500, 654, 1000, 1309, auto); y-axis: accuracy (%), 50–90; series: C = 1.0, C = 1.3, C = 1.8, C = 2.1]
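A sketch of the modified Gaussian kernel. Counting n per vector pair (as the "auto" setting suggests) is an assumption about how the distinct non-zero features are determined.

```python
import math

def gaussian_kernel(x, y, C, n=None):
    """Modified Gaussian kernel k(x, y) = exp(-||x - y||^2 / (n * C)).
    When n is None ('auto'), it is set to the number of positions where
    at least one of the two vectors is non-zero (assumed interpretation)."""
    if n is None:
        n = sum(1 for xi, yi in zip(x, y) if xi != 0 or yi != 0)
        n = max(n, 1)  # guard against two all-zero vectors
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (n * C))
```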
Feature selection using Genetic Algorithms
Chromosome: c = (w_0, w_1, ..., w_19038, b) – one weight per feature plus the bias b
Fitness: Fitness(c_i) = SVM(c_i), the accuracy reached by the SVM whose decision function f(x) = w·x + b is built from the chromosome's weights
Methods of selecting parents: Roulette Wheel, Gaussian selection
Genetic operators: selection, mutation, crossover
Methods of selecting the parents
- Roulette Wheel: each individual occupies a slice of the wheel proportional to its fitness
- Gaussian selection: with maximum value m = 1 and dispersion σ = 0.4,
  P(c_i) = exp(−(Fitness(c_i) − m)² / (2σ²))
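Both parent-selection rules can be sketched as follows; the exact shape of the Gaussian selection probability is reconstructed from m = 1 and σ = 0.4, so treat it as an assumption.

```python
import math
import random

def gaussian_selection_prob(fitness, m=1.0, sigma=0.4):
    # Selection probability peaked at fitness = m (assumed reconstruction)
    return math.exp(-((fitness - m) ** 2) / (2 * sigma ** 2))

def roulette_wheel_select(population, fitnesses, rng):
    # Each individual gets a wheel slice proportional to its fitness
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if pick <= acc:
            return individual
    return population[-1]
```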
The process of obtaining the next generation
Starting from the current generation:
1. The best chromosome is copied from the old population into the new population.
2. Selection: two parents are selected.
3. Crossover: two children are created from the selected parents by splitting the parents at a crossover point; one of the parents is randomly eliminated.
4. Need more chromosomes in the set? If yes, go back to step 2.
5. Mutation: randomly change the sign of a random number of elements.
The result is the new generation.
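The steps above can be sketched as one generation step. The crossover split point, the amount of mutation, and the use of weighted sampling for the roulette wheel are assumptions; fitness values are assumed positive.

```python
import random

def next_generation(population, fitness_fn, rng):
    """One GA generation step (sketch): elitism, roulette-wheel selection,
    one-point crossover, sign-flip mutation."""
    pop = sorted(population, key=fitness_fn, reverse=True)
    fits = [fitness_fn(c) for c in pop]
    new_pop = [pop[0][:]]                      # copy the best chromosome as-is
    while len(new_pop) < len(population):      # need more chromosomes in the set?
        p1 = rng.choices(pop, weights=fits, k=1)[0]  # select two parents
        p2 = rng.choices(pop, weights=fits, k=1)[0]
        cut = rng.randrange(1, len(p1))        # crossover: split the parents
        for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
            if len(new_pop) < len(population): # drop the extra child when full
                new_pop.append(child)
    for chrom in new_pop[1:]:                  # mutation (elite kept intact):
        k = rng.randrange(1, len(chrom) + 1)   # a random number of elements
        for i in rng.sample(range(len(chrom)), k):
            chrom[i] = -chrom[i]               # get their sign flipped
    return new_pop
```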
GA_FS versus SVM_FS for 1309 features
[Figure: accuracy (%, 0–90) by polynomial kernel degree (D1.0–D5.0); series: GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM, SVM-CS]
Training time, polynomial kernel, d = 2, NOM
[Figure: training time (minutes, 0–80) versus number of features (475, 1309, 2488, 8000); series: GA_FS, SVM_FS, IG_FS]
GA_FS versus SVM_FS for 1309 features
[Figure: accuracy (%, 81.5–84) versus Gaussian kernel parameter C (C1.0, C1.3, C1.8, C2.1, C2.8, C3.1); series: GA-BIN, GA-CS, SVM-BIN, SVM-CS]
Training time, Gaussian kernel, C = 1.3, BIN
[Figure: training time (minutes, 0–120) versus number of features (475, 1309, 2488, 8000); series: GA_FS, SVM_FS, IG_FS]
Meta-classifier with SVM
Set of eight SVMs:
- Polynomial, degree 1, Nominal
- Polynomial, degree 2, Binary
- Polynomial, degree 2, Cornell SMART
- Polynomial, degree 3, Cornell SMART
- Gaussian, C = 1.3, Binary
- Gaussian, C = 1.8, Cornell SMART
- Gaussian, C = 2.1, Cornell SMART
- Gaussian, C = 2.8, Cornell SMART
Upper limit: 94.21%

Meta-classifier methods
- Non-adaptive method: Majority Vote – each classifier votes for a specific class for the current document
- Adaptive methods: compute the similarity between the current sample and the error samples stored in the classifier's own queue
  - Selection based on Euclidean distance: first good classifier; the best classifier
  - Selection based on cosine: first good classifier; the best classifier; using the average
cos(x, x′) = (x·x′) / (‖x‖·‖x′‖) = Σ_{i=1..n} x[i]·x′[i] / (√(Σ_{i=1..n} x[i]²) · √(Σ_{i=1..n} x′[i]²))

Eucl(x, x′) = √(Σ_{i=1..n} (x[i] − x′[i])²)
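The two similarity measures translate directly to code:

```python
import math

def euclidean_distance(x, y):
    # Eucl(x, y) = sqrt(sum_i (x[i] - y[i])^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    # cos(x, y) = (x.y) / (||x|| * ||y||)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0
```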
Selection based on Euclidean distance
[Figure: classification accuracy (%, 78–96) over steps 1–13; series: Upper Limit, FC-SBED (first good classifier), BC-SBED (best classifier)]
Selection based on cosine
[Figure: classification accuracy (%, 80–96) over steps 1–13; series: Upper Limit, FC-SBCOS, BC-SBCOS, BC-SBCOS with average]
Comparison between SBED and SBCOS
[Figure: classification accuracy (%, 80–96) over steps 1–13; series: Majority Vote, SBED, SBCOS, Upper Limit]
Comparison between SBED and SBCOS
[Figure: processing time (minutes, 0–80) over steps 1–13; series: Majority Vote, SBED, SBCOS]
Initial data set scalability
1. Normalize each sample (7,053 samples).
2. Group the initial set based on distance (4,474 groups).
3. Take the relevant vector of each group (4,474 vectors) and use the relevant vectors in the classification process.
4. Select only the support vectors (847).
5. Take the samples grouped in the selected support vectors (4,256) and run the classification with those 4,256 samples.
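A rough sketch of the first reduction steps, assuming greedy distance-based grouping and the group mean as the relevant vector; the SVM steps (support-vector selection, final classification) are left abstract.

```python
import math

def normalize(v):
    # Step 1: scale each sample to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def group_by_distance(samples, threshold):
    """Step 2 (assumed greedy scheme): a sample joins the first group
    whose first member lies within `threshold`."""
    groups = []
    for s in samples:
        for g in groups:
            if math.sqrt(sum((a - b) ** 2 for a, b in zip(g[0], s))) <= threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

def relevant_vectors(groups):
    # Step 3: one representative per group, here the component-wise mean
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]
```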
Polynomial kernel – 1309 features, NOM
[Figure: kernel degree's influence on accuracy (%, 74–88); x-axis: degree D1.0–D5.0; series: SVM-7053, SVM-4256]
Gaussian kernel – 1309 features, CS
[Figure: accuracy (%, 70–90) versus parameter C (1, 1.3, 1.8, 2.1, 2.8); series: SVM-7053, SVM-4256]
Training time
[Figure: training time (minutes, 0–50) versus parameter C (C1.0, C1.3, C1.8, C2.1, C2.8); series: 7053-Bin, 7053-CS, 4256-Bin, 4256-CS]
Choosing training and testing data sets
[Figure: 1309 features, polynomial kernel; accuracy (%, 74–88) by kernel degree (D1.0–D5.0) plus the average; series: average over the old set, average over the new set]
Choosing training and testing data sets
[Figure: 1309 features, Gaussian kernel; accuracy (%, 70–90) by parameter C (C1.0, C1.3, C1.8, C2.1, C2.8) plus the average; series: average over the old set, average over the new set]
Conclusions – other results
- Using our parameter correlation: 3% better for the polynomial kernel, 15% better for the Gaussian kernel
- The number of features was reduced to between 2.5% (475) and 6% (1309) of the original set
- GA_FS is faster than SVM_FS
- Best results: polynomial kernel with nominal representation and small degree; Gaussian kernel with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by only 1.2% when the data set is reduced
Further work
- Feature extraction and selection: association rules between words (Mutual Information)
- Synonymy and polysemy problems: using families of words (WordNet)
- Web mining applications: classifying larger text data sets
- A better method of grouping the data
- Using classification and clustering together