TRANSCRIPT
Fast Kernel-Density-Based Classification and Clustering
Using P-Trees
Anne Denton
Major Advisor: William Perrizo
Outline
- Introduction
- P-Trees: concepts, implementation
- Kernel methods
- Paper 1: Rule-Based Classification
- Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
- Paper 3: Hierarchical Clustering
- Outlook
Introduction
- Data mining: extracting information from data; considers storage issues
- P-tree approach: bit-column-based storage
  - Compression
  - Hardware optimization
  - Simple index construction
  - Flexibility in storage organization
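As an illustration of bit-column-based storage, the following sketch (hypothetical Python, not the thesis's Java/C++ implementation) decomposes an integer attribute into one bit vector per bit position, so that a count query over high-order bits becomes bitwise AND plus a population count:

```python
def bit_columns(values, bits=8):
    """Decompose integer values into one column (list of 0/1) per bit
    position, most significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in reversed(range(bits))]

def count_matching(columns, pattern):
    """Count rows whose high-order bits equal `pattern` (a list of 0/1),
    by ANDing the corresponding bit columns."""
    mask = [1] * len(columns[0])
    for col, want in zip(columns, pattern):
        mask = [m & (c if want else 1 - c) for m, c in zip(mask, col)]
    return sum(mask)

values = [5, 7, 4, 13]              # 4-bit values: 0101, 0111, 0100, 1101
cols = bit_columns(values, bits=4)
print(count_matching(cols, [0, 1]))  # values whose high bits are '01' -> 3
```

In the real P-tree representation these bit columns are additionally compressed into quadrant trees; the sketch only shows why vertical bit decomposition turns counting into cheap bitwise operations.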
P-Tree Concepts
- Ordering (details); new: generalized Peano order sorting
- Compression
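Peano order sorting is related to bit interleaving (Morton/Z-order keys), which keeps records that agree in their high-order bits adjacent and thereby improves compression. The sketch below shows a plain interleaved sort key as a simplified illustration, not the generalized variant developed in the thesis:

```python
def peano_key(values, bits=8):
    """Morton-style key: interleave the bits of all attribute values,
    taking the most significant bit of each attribute first."""
    key = 0
    for b in reversed(range(bits)):   # high-order bit positions first
        for v in values:
            key = (key << 1) | ((v >> b) & 1)
    return key

rows = [(3, 0), (0, 3), (2, 2), (1, 1)]
print(sorted(rows, key=lambda r: peano_key(r, bits=2)))
# [(1, 1), (0, 3), (3, 0), (2, 2)]
```

Sorting by such a key groups rows whose attributes share bit prefixes, which is what makes the pure quadrants in a P-tree larger.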
[Figure: Impact of Peano order sorting. Storage requirements in P-tree nodes (0 to 60,000) for the adult, spam, mushroom, function, and crop data sets; unsorted vs. simple sorting vs. generalized Peano sorting.]
[Figure: Impact of sorting on execution speed. Time in seconds (0 to 120) for the adult, spam, mushroom, function, and crop data sets; unsorted vs. simple sorting vs. generalized Peano sorting.]
P-Tree Implementation
- Implementation in Java
- Ported to C/C++ (Amal Perera, Masum Serazi): fastest compressing P-tree implementation so far
- Array indices as pointers (details)
- Grandchild purity
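"Array indices as pointers" can be illustrated with a toy binary P-tree whose nodes live in one flat list and whose child links are integer indices rather than object references (a hypothetical sketch, not the actual implementation):

```python
def build(bits, nodes):
    """Append a node describing the bit string `bits`; return its index.
    Pure nodes store only their length; mixed nodes store child indices."""
    if all(b == 1 for b in bits):
        nodes.append(("pure1", len(bits)))
        return len(nodes) - 1
    if not any(bits):
        nodes.append(("pure0", len(bits)))
        return len(nodes) - 1
    idx = len(nodes)
    nodes.append(None)                   # reserve this node's slot
    half = len(bits) // 2
    left = build(bits[:half], nodes)
    right = build(bits[half:], nodes)
    nodes[idx] = ("mixed", left, right)  # integer indices, not references
    return idx

def root_count(nodes, idx=0):
    """Count of 1-bits under node `idx`, traversing by index."""
    node = nodes[idx]
    if node[0] == "pure1":
        return node[1]
    if node[0] == "pure0":
        return 0
    return root_count(nodes, node[1]) + root_count(nodes, node[2])

nodes = []
build([1, 1, 0, 1], nodes)
print(root_count(nodes))  # 3
```

Keeping nodes in a flat array avoids per-node object overhead and pointer chasing, which matters for a compressed structure with many small nodes.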
Kernel-Density-Based Classification
- Probability of an attribute vector x, conditional on class label value c_i
- Depends on the N training points x_t; [condition] is 1 if the condition is true, 0 otherwise
- Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function

P(\mathbf{x} \mid C = c_i) = \frac{1}{N_i} \sum_{t=1}^{N} K(\mathbf{x}, \mathbf{x}_t) \, [c_t = c_i]
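A minimal sketch of kernel-density-based classification with a Gaussian product kernel, written over raw points for clarity (the thesis evaluates the corresponding counts efficiently with P-trees rather than by looping over training data):

```python
import math

def kernel(x, xt, sigma=1.0):
    """Gaussian product kernel over all attributes of x and x_t."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xt)) / (2 * sigma ** 2))

def class_density(x, training, label):
    """Estimate P(x | C = label) as the kernel average over that class."""
    same = [xt for xt, c in training if c == label]
    return sum(kernel(x, xt) for xt in same) / len(same)

def classify(x, training):
    """Pick the class whose kernel density estimate at x is largest."""
    labels = {c for _, c in training}
    return max(labels, key=lambda c: class_density(x, training, c))

train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
         ((3.0, 3.0), "b"), ((3.1, 2.9), "b")]
print(classify((0.1, 0.0), train))  # a
```

Replacing the Gaussian with a step function turns the sum into a neighborhood count, which is the form the interval-based P-tree evaluation exploits.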
Higher Order Basic Bit (HOBbit) Distance
- P-trees make count evaluation efficient for intervals defined by the HOBbit distance around the test sample
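The HOBbit distance between two integers is determined by the most significant bit position in which they differ, so values sharing a long high-order bit prefix are close. The sketch below follows one common formulation (the number of low-order bits that must be masked before the values agree); the thesis's exact variant, including the exponential form d_EH, may scale this differently:

```python
def hobbit_distance(x, y):
    """Number of low-order bits that must be shifted away before the
    non-negative integers x and y become equal."""
    d = 0
    while x != y:
        x, y = x >> 1, y >> 1
        d += 1
    return d

print(hobbit_distance(12, 13))  # 1: 1100 vs 1101 differ only in the lowest bit
print(hobbit_distance(12, 4))   # 4: 1100 vs 0100 differ in bit position 3
```

Because an interval of points within a given HOBbit distance of a test sample is exactly a bit-prefix match, its count can be read off the P-tree directly.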
Paper 1: Rule-Based Classification
- Goal: high accuracy on large data sets, including standard ones (UCI ML Repository)
- Neighborhood evaluated through:
  - Equality of categorical attributes
  - HOBbit interval for continuous attributes
- Curse of dimensionality: volume empty with high likelihood
- Information gain to select attributes
  - Attributes considered binary, based on the test sample ("Lazy decision trees", Friedman '96 [4])
  - Continuous data: interval around the test sample
  - Exact information gain (details)
- Pursuing multiple paths
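The attribute-selection step can be sketched as the information gain of a binary, test-sample-centered split (a generic illustration of the criterion, not the exact P-tree computation):

```python
import math

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(rows, labels, test):
    """Gain from splitting on whether a row satisfies `test` (a predicate),
    mirroring attributes binarized around the test sample."""
    inside = [c for r, c in zip(rows, labels) if test(r)]
    outside = [c for r, c in zip(rows, labels) if not test(r)]
    n = len(labels)
    gain = entropy(labels)
    if inside:
        gain -= (len(inside) / n) * entropy(inside)
    if outside:
        gain -= (len(outside) / n) * entropy(outside)
    return gain

rows = [1, 2, 8, 9]
labels = ["a", "a", "b", "b"]
print(info_gain(rows, labels, lambda v: v < 5))  # 1.0 (a perfect split)
```

Here the predicate plays the role of "value falls inside the HOBbit interval around the test sample"; pursuing multiple paths corresponds to keeping several high-gain predicates instead of only the best one.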
Results: Accuracy
- Comparable to C4.5, after much less development time
- 5 data sets from the UCI Machine Learning Repository (details)
- 2 additional data sets: crop, gene-function
- Improvement through multiple paths (20)

              C4.5    20 paths
adult         15.54   14.93
kr-vs-kp      0.8     0.84
mushroom      0       0
[Figure: 20 paths vs. 1 path. Relative accuracy (-2 to 6) for the adult, spam, sick-euthyroid, kr-vs-kp, gene-function, and crop data sets.]
Results: Speed
- Used on the largest UCI data sets
- Scaling of execution time as a function of training set size
[Figure: Time per test sample in milliseconds (0 to 80) as a function of the number of training points (0 to 30,000); measured execution time vs. linear interpolation.]
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
- Goal: handling many attributes
- Naïve Bayes, where x^{(k)} is the value of the kth attribute:

  P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)

- Semi-naïve Bayes: correlated attributes are joined
  - Has been done for categorical data (Kononenko '91 [5], Pazzani '96 [6])
  - Previously: continuous data discretized
Kernel-Based Naïve Bayes
- Alternatives for continuous data:
  - Discretization
  - Distribution function: Gaussian with mean and standard deviation from the data; no alternative for the semi-naïve approach
  - Kernel density estimate (Hastie [7])
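The difference between the two continuous alternatives can be sketched in one dimension: a single Gaussian fitted by mean and standard deviation, versus a kernel density estimate with one small Gaussian per training point (illustrative toy data, bandwidth h = 0.3 chosen arbitrarily):

```python
import math

data = [1.0, 1.2, 1.1, 3.0]   # a continuous attribute within one class

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and stddev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Alternative 1: distribution function -- one Gaussian fit to the data
mu = sum(data) / len(data)
sigma = math.sqrt(sum((v - mu) ** 2 for v in data) / len(data))

# Alternative 2: kernel density estimate -- one kernel per training point
def kde(x, points, h=0.3):
    return sum(gaussian_pdf(x, p, h) for p in points) / len(points)

print(gaussian_pdf(1.1, mu, sigma))  # parametric estimate at x = 1.1
print(kde(1.1, data))                # KDE better captures the cluster near 1.1
```

On data like this, where most points cluster near 1.1 with one outlier, the kernel estimate assigns a much higher density to the cluster than the single fitted Gaussian does.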
[Figure: Kernel density estimate vs. distribution function (density 0 to 0.1) over the data points.]
Correlations
- Correlation between attributes a and b; N: number of training points t
- Kernel function for continuous data; d_EH: exponential HOBbit distance

Corr(a, b) = \frac{\frac{1}{N} \sum_{t=1}^{N} \prod_{k \in \{a, b\}} K_k(x^{(k)}, x_t^{(k)})}{\prod_{k \in \{a, b\}} \frac{1}{N} \sum_{t=1}^{N} K_k(x^{(k)}, x_t^{(k)})} - 1

K_{\mathrm{Gauss},k}(x^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{d_{EH}(x^{(k)}, x_t^{(k)})^2}{2\sigma_k^2}\right)
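The correlation measure compares the joint kernel density of two attributes with the product of their marginal densities; independence gives a value near zero. A sketch (using ordinary Euclidean distance inside the Gaussian kernel in place of the exponential HOBbit distance, for simplicity):

```python
import math

def gauss_kernel(x, xt, sigma=1.0):
    """1-D Gaussian kernel between a test value and a training value."""
    return math.exp(-(x - xt) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def corr(points_a, points_b, xa, xb):
    """Joint kernel density at (xa, xb) divided by the product of the
    marginal kernel densities, minus 1.  Zero means independence."""
    n = len(points_a)
    joint = sum(gauss_kernel(xa, ta) * gauss_kernel(xb, tb)
                for ta, tb in zip(points_a, points_b)) / n
    marg_a = sum(gauss_kernel(xa, ta) for ta in points_a) / n
    marg_b = sum(gauss_kernel(xb, tb) for tb in points_b) / n
    return joint / (marg_a * marg_b) - 1

# Two identical attributes are perfectly correlated, so the measure is
# positive at a point on their diagonal.
a = [0.0, 1.0, 2.0, 3.0]
print(corr(a, a, 1.0, 1.0) > 0)  # True
```

When the measure exceeds a threshold t, the two attributes are joined into one compound attribute before applying the naïve Bayes product.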
Results
[Figure: Decrease in error rate (-10 to 30) for the spam, crop, adult, sick-euthyroid, mushroom, gene-function, splice, waveform, and kr-vs-kp data sets.]
- P-tree Naïve Bayes: difference only for continuous data
- Semi-Naïve Bayes: 3 parameter combinations (t: threshold)
  - Blue: t = 1, 3 iterations
  - Red: t = 0.3, incl. anti-correlations
  - White: t = 0.05
[Figure: Decrease in error rate (-4 to 12) for the spam, crop, adult, sick-euthyroid, and waveform data sets.]
Paper 3: Hierarchical Clustering [10]
- Goal: understand the relationship between standard algorithms and combine the "best" aspects of three major ones
- Partition-based: relationship to k-medoids [8] demonstrated; same cluster boundary definition
- Density-based (kernel-based, DENCLUE [9]): similar cluster center definition
- Hierarchical: follows naturally from the above definitions
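The combination of the first two aspects can be illustrated in one dimension: cluster boundaries come from nearest-center assignment (as in k-medoids/k-means), while centers are updated density-style as kernel-weighted means of their members. This toy sketch only illustrates that combination, not the thesis's actual algorithm or its hierarchical step:

```python
import math

def kernel_weight(p, c, sigma=1.0):
    """Gaussian kernel weight of point p relative to center c."""
    return math.exp(-(p - c) ** 2 / (2 * sigma ** 2))

def cluster(points, centers, rounds=10):
    for _ in range(rounds):
        # Partition-based boundary: each point joins its nearest center.
        members = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            members[nearest].append(p)
        # Density-based center: kernel-weighted mean of the members.
        new_centers = []
        for c, ms in zip(centers, members):
            if ms:
                weights = [kernel_weight(p, c) for p in ms]
                new_centers.append(sum(w * p for w, p in zip(weights, ms)) / sum(weights))
        centers = new_centers
    return sorted(centers)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(cluster(pts, [1.0, 4.0]))  # centers settle near 0.1 and 5.1
```

The kernel weighting pulls each center toward the local density maximum of its cluster, which is the DENCLUE-style center definition, while the assignment step keeps the k-medoids-style boundaries.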
Results: Speed
- Comparison with k-means
[Figure: Execution time (1 to 10,000, log scale) as a function of data set size (1,000 to 10,000,000) for k-means, a leaf node in the hierarchy, and an internal node.]
Results: Clustering Effectiveness
[Figure: Histogram over attribute 1 and attribute 2 (count bins 0-5 through 20-25), showing the cluster centers found by k-means and by our algorithm.]
Summary
- P-tree representation for non-spatial data; fast implementation
- Paper 1: rule-based algorithm with test-sample-centered intervals and multiple paths; competitive on "standard" (UCI) data
- Paper 2: kernel-based semi-naïve Bayes; new algorithm to handle large attribute numbers; attribute joining shown to be beneficial
- Paper 3: hierarchical clustering [10]; competitive for speed and effectiveness; hierarchical structure
Outlook
- Software engineering aspects: column-oriented design; relationship with the P-tree API
- "Non-standard" data: data with graph structure; hierarchical data, concept slices [11]
- Visualization: visualization of data on a graph
Software Engineering
- Business problems: row-based; match between database tables and objects
- Scientific / engineering problems: column-based; collective properties of interest
- Standard OO unsuitable; instead: Fortran, array-based languages (ZPL)
- Solution? Design pattern? Library?
- P-tree API
"Non-standard" Data
- Types of data:
  - Biological (KDD-Cup '02: our team received an honorable mention!)
  - Sensor networks
- Types of problems:
  - Small probability of the minority class label: A-ROC evaluation
  - Multi-valued attributes: bit-vector representation ideal for P-trees
  - Graphs: rich supply of new problems / techniques (work with Chris Besemann)
  - Hierarchical categorical attributes [11]
Visualization
- Idea: use a graph visualization tool, e.g. http://www.touchgraph.com/
- Visualize node data through glyphs
- Visualize edge data